1. Djeghout B, Le-Viet T, Martins LDO, Savva GM, Evans R, Baker D, Page A, Elumogo N, Wain J, Janecko N. Capturing clinically relevant Campylobacter attributes through direct whole genome sequencing of stool. Microb Genom 2024; 10. [PMID: 39213166 DOI: 10.1099/mgen.0.001284] [Indexed: 09/04/2024] Open
Abstract
Campylobacter is the leading bacterial cause of infectious intestinal disease, but the pathogen typically accounts for a very small proportion of the overall stool microbiome in each patient. Diagnosis is even more difficult due to the fastidious nature of Campylobacter in the laboratory setting. This has, in part, driven a change in recent years, from culture-based to rapid PCR-based diagnostic assays which have improved diagnostic detection, whilst creating a knowledge gap in our clinical and epidemiological understanding of Campylobacter genotypes - no isolates to sequence. In this study, direct metagenomic sequencing approaches were used to assess the possibility of replacing genome sequences with metagenome sequences; metagenomic sequencing outputs were used to describe clinically relevant attributes of Campylobacter genotypes. A total of 37 diarrhoeal stool samples with Campylobacter and five samples with an unknown pathogen result were collected and processed with and without filtration, DNA was extracted, and metagenomes were sequenced by short-read sequencing. Culture-based methods were used to validate Campylobacter metagenome-derived genome (MDG) results. Sequence output metrics were assessed for Campylobacter genome quality and accuracy of characterization. Of the 42 samples passing quality checks for analysis, identification of Campylobacter to the genus and species level was dependent on Campylobacter genome read count, coverage and genome completeness. A total of 65% (24/37) of samples were reliably identified to the genus level through Campylobacter MDG, 73% (27/37) by culture and 97% (36/37) by qPCR. The Campylobacter genomes with a genome completeness of over 60% (n=21) were all accurately identified at the species level (100%). Of those, 72% (15/21) were identified to sequence types (STs), and 95% (20/21) accurately identified antimicrobial resistance (AMR) gene determinants. 
Filtration of stool samples enhanced Campylobacter MDG recovery and genome quality metrics compared to the corresponding unfiltered samples, which improved the identification of STs and AMR profiles. The phylogenetic analysis in this study demonstrated the clustering of the metagenome-derived with culture-derived genomes and revealed the reliability of genomes from direct stool sequencing. Furthermore, Campylobacter genome spiking percentages ranging from 0 to 2% total metagenome abundance in the ONT MinION sequencer, configured to adaptive sequencing, exhibited better assembly quality and accurate identification of STs, particularly in the analysis of metagenomes containing 2 and 1% of Campylobacter jejuni genomes. Direct sequencing of Campylobacter from stool samples provides clinically relevant and epidemiologically important genomic information without the reliance on cultured genomes.
Affiliation(s)
- Bilal Djeghout
  - Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- Thanh Le-Viet
  - Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- George M Savva
  - Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- Rhiannon Evans
  - Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- David Baker
  - Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- Andrew Page
  - Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- Ngozi Elumogo
  - Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
  - Eastern Pathology Alliance, Norfolk and Norwich University Hospital, Norwich NR4 7UY, UK
- John Wain
  - Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
  - Norwich Medical School, University of East Anglia, Norwich NR4 7TJ, UK
- Nicol Janecko
  - Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
2. Vieira Mourato B, Tsers I, Denker S, Klötzl F, Haubold B. Marker discovery in the large. Bioinformatics Advances 2024; 4:vbae113. [PMID: 39132289 PMCID: PMC11310107 DOI: 10.1093/bioadv/vbae113] [Received: 05/16/2024] [Revised: 07/06/2024] [Accepted: 07/26/2024] [Indexed: 08/13/2024]
Abstract
Motivation Markers for diagnostic polymerase chain reactions are routinely constructed by taking regions common to the genomes of a target organism and subtracting the regions found in the targets' closest relatives, their neighbors. This approach is implemented in the published package Fur, which originally required memory proportional to the number of nucleotides in the neighborhood. This does not scale well. Results Here, we describe a new version of Fur that only requires memory proportional to the longest neighbor. In spite of its greater memory efficiency, the new Fur remains fast and is accurate. We demonstrate this by applying it to simulated sequences and comparing it to an efficient alternative. Then we use the new Fur to extract markers from 120 reference bacteria. To make this feasible, we also introduce software for automatically finding target and neighbor genomes and for assessing markers. We pick the best primers from the 10 most sequenced reference bacteria and show their excellent in silico sensitivity and specificity. Availability and implementation Fur is available from github.com/evolbioinf/fur, in the Docker image hub.docker.com/r/beatrizvm/mapro, and in the Code Ocean capsule 10.24433/CO.7955947.v1.
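The subtraction idea described in this abstract (regions common to all targets, minus anything found in the neighbors) can be illustrated with a small k-mer sketch. This is not Fur's actual algorithm or interface: the function names, the use of fixed-length k-mers, and the toy sequences are all assumptions made for illustration. Neighbors are processed one at a time only to echo the abstract's point that memory need not grow with the whole neighborhood.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_markers(targets, neighbors, k):
    """Hypothetical helper: k-mers shared by every target and absent from every neighbor."""
    # Step 1: keep only k-mers common to all target sequences.
    common = set.intersection(*(kmers(t, k) for t in targets))
    # Step 2: subtract each neighbor's k-mers, one neighbor at a time.
    for neighbor in neighbors:
        common -= kmers(neighbor, k)
    return common


# Toy data (made up for this example).
targets = ["ACGTTGCA", "TTACGTTG"]
neighbors = ["ACGTAAAA"]
markers = candidate_markers(targets, neighbors, k=4)
print(sorted(markers))
```

Real marker discovery works on whole genomes and on contiguous regions rather than isolated k-mers, so this sketch only conveys the set logic of "common to targets, absent from neighbors".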
Affiliation(s)
- Beatriz Vieira Mourato
  - Research Group Bioinformatics, Max-Planck-Institute for Evolutionary Biology, 24306 Plön, Schleswig-Holstein, Germany
- Ivan Tsers
  - Research Group Bioinformatics, Max-Planck-Institute for Evolutionary Biology, 24306 Plön, Schleswig-Holstein, Germany
- Svenja Denker
  - Research Group Bioinformatics, Max-Planck-Institute for Evolutionary Biology, 24306 Plön, Schleswig-Holstein, Germany
  - Universität zu Lübeck, Lübeck, Schleswig-Holstein, Germany
- Bernhard Haubold
  - Research Group Bioinformatics, Max-Planck-Institute for Evolutionary Biology, 24306 Plön, Schleswig-Holstein, Germany
3. Prusokiene A, Boonham N, Fox A, Howard TP. Mottle: Accurate pairwise substitution distance at high divergence through the exploitation of short-read mappers and gradient descent. PLoS One 2024; 19:e0298834. [PMID: 38512939 PMCID: PMC10956839 DOI: 10.1371/journal.pone.0298834] [Received: 09/07/2023] [Accepted: 01/30/2024] [Indexed: 03/23/2024] Open
Abstract
Current tools for estimating the substitution distance between two related sequences struggle to remain accurate at a high divergence. Difficulties at distant homologies, such as false seeding and over-alignment, create a high barrier for the development of a stable estimator. This is especially true for viral genomes, which carry a high rate of mutation, small size, and sparse taxonomy. Developing an accurate substitution distance measure would help to elucidate the relationship between highly divergent sequences, interrogate their evolutionary history, and better facilitate the discovery of new viral genomes. To tackle these problems, we propose an approach that uses short-read mappers to create whole-genome maps, and gradient descent to isolate the homologous fraction and calculate the final distance value. We implement this approach as Mottle. With the use of simulated and biological sequences, Mottle was able to remain stable to 0.66-0.96 substitutions per base pair and identify viral outgroup genomes with 95% accuracy at the family-order level. Our results indicate that Mottle performs as well as existing programs in identifying taxonomic relationships, with more accurate numerical estimation of genomic distance over greater divergences. By contrast, one limitation is a reduced numerical accuracy at low divergences, and on genomes where insertions and deletions are uncommon, when compared to alternative approaches. We propose that Mottle may therefore be of particular interest in the study of viruses, viral relationships, and notably for viral discovery platforms, helping in benchmarking of homology search tools and defining the limits of taxonomic classification methods. The code for Mottle is available at https://github.com/tphoward/Mottle_Repo.
Affiliation(s)
- Alisa Prusokiene
  - Faculty of Science, Agriculture and Engineering, School of Natural and Environmental Sciences, Newcastle University, United Kingdom
- Neil Boonham
  - Faculty of Science, Agriculture and Engineering, School of Natural and Environmental Sciences, Newcastle University, United Kingdom
- Adrian Fox
  - Faculty of Science, Agriculture and Engineering, School of Natural and Environmental Sciences, Newcastle University, United Kingdom
  - Fera Ltd., Biotech Campus, York, United Kingdom
- Thomas P. Howard
  - Faculty of Science, Agriculture and Engineering, School of Natural and Environmental Sciences, Newcastle University, United Kingdom
4. Santin M, Molokin A, Orozco-Mosqueda GE, Almeria S, Maloney J. The first Cyclospora cayetanensis lineage A genome from an isolate from Mexico. BMC Genomics 2024; 25:246. [PMID: 38443790 PMCID: PMC10913667 DOI: 10.1186/s12864-024-10163-y] [Received: 11/17/2023] [Accepted: 02/26/2024] [Indexed: 03/07/2024] Open
Abstract
BACKGROUND Cyclospora cayetanensis is a protozoan parasite that causes intestinal illness in humans worldwide. Despite its global distribution, most genomic data for C. cayetanensis has been obtained from isolates collected in the United States, leaving genetic variability among globally distributed isolates underexplored. RESULTS In the present study, the genome of an isolate of C. cayetanensis obtained from a child with diarrhea living in Mexico was sequenced and assembled. Evaluation of the assembly using a lineage typing system recently developed by the Centers for Disease Control and Prevention revealed that this isolate is lineage A. CONCLUSIONS Given that the only other whole genome assembly available from Mexico was classified as lineage B, the data presented here represent an important step in expanding our knowledge of the diversity of C. cayetanensis isolates from Mexico at the genomic level.
Affiliation(s)
- Monica Santin
  - Environmental Microbial and Food Safety Laboratory, Agricultural Research Service, Department of Agriculture, 20705, Beltsville, MD, USA
- Aleksey Molokin
  - Environmental Microbial and Food Safety Laboratory, Agricultural Research Service, Department of Agriculture, 20705, Beltsville, MD, USA
- Guadalupe E Orozco-Mosqueda
  - Hospital Infantil de Morelia Eva Sámano de López Mateos, Servicio de Salud de Michoacán, 58020, Morelia, Michoacán, México
- Sonia Almeria
  - U.S. Food and Drug Administration, Center for Food Safety and Applied Nutrition, Office of Applied Research and Safety Assessment, Division of Virulence Assessment, 20708, Laurel, MD, USA
- Jenny Maloney
  - Environmental Microbial and Food Safety Laboratory, Agricultural Research Service, Department of Agriculture, 20705, Beltsville, MD, USA
5. Colautti A, Comi G, Peterlunger E, Iacumin L. Ancient Roman bacterium against current issues: strain Aquil_B6, Paenisporosarcina quisquiliarum, or Psychrobacillus psychrodurans? Microbiol Spectr 2023; 11:e0068623. [PMID: 37975675 PMCID: PMC10714998 DOI: 10.1128/spectrum.00686-23] [Received: 02/14/2023] [Accepted: 10/08/2023] [Indexed: 11/19/2023] Open
Abstract
IMPORTANCE Since its establishment by the United States government in 1988, the National Center for Biotechnology Information (NCBI) has provided an invaluable service to scientific advancement. Its universality and total freedom of use allow researchers worldwide to rely on the database for their work, but they also make it difficult to check the correctness of all the material it holds. It is therefore of fundamental importance, for the correctness and ethics of research, to improve the databases at our disposal by identifying and amending critical issues. This work aims to provide the scientific community with a new sequence for the type strain Paenisporosarcina quisquiliarum SK 55 and to broaden knowledge of the species Psychrobacillus psychrodurans, in particular by considering the strain Aquil_B6, found in an ancient Roman amphora.
Affiliation(s)
- Andrea Colautti
  - Department of Agricultural, Food, Environmental and Animal Science, University of Udine, Udine, Italy
- Giuseppe Comi
  - Department of Agricultural, Food, Environmental and Animal Science, University of Udine, Udine, Italy
- Enrico Peterlunger
  - Department of Agricultural, Food, Environmental and Animal Science, University of Udine, Udine, Italy
- Lucilla Iacumin
  - Department of Agricultural, Food, Environmental and Animal Science, University of Udine, Udine, Italy
6. Cai X, Peng Y, Yang G, Feng L, Tian X, Huang P, Mao Y, Xu L. Populational genomic insights of Paraclostridium bifermentans as an emerging human pathogen. Front Microbiol 2023; 14:1293206. [PMID: 38029151 PMCID: PMC10665999 DOI: 10.3389/fmicb.2023.1293206] [Received: 09/12/2023] [Accepted: 10/26/2023] [Indexed: 12/01/2023] Open
Abstract
Paraclostridium bifermentans (P.b) is an emerging human pathogen that is phylogenomically close to Paeniclostridium sordellii (P.s), but its population-level genomic features and virulence capacity remain understudied. Here, we performed comparative genomic analyses of P.b and compared its pan-genomic features and virulence coding profiles to those of P.s. Our results revealed that P.b has a more plastic pangenome, a larger genome size, and a higher GC content than P.s. Interestingly, P.b and P.s share similar core-genomic functions, but P.b encodes more functions in nutrient metabolism and energy conversion and fewer functions in host defense in its accessory genome. P.b may initiate extracellular infection processes similar to those of P.s and Clostridium perfringens by encoding three toxin homologs in its core genome (i.e., microbial collagenase, thiol-activated cytolysin, and phospholipase C, which are involved in extracellular matrix degradation and membrane damage). However, P.b is less toxic than P.s, encoding fewer secretion toxins in the core genome and fewer lethal toxins in the accessory genome. Notably, P.b carries more toxin genes in its accessory genome, particularly those of plasmid origin. Moreover, three within-species and highly conserved plasmid groups, encoding virulence, gene acquisition, and adaptation, were carried by 25-33% of P.b strains and clustered by isolation source rather than geography. This study characterized the pan-genomic virulence features of P.b for the first time and revealed that P. bifermentans is an emerging pathogen that can threaten human health in many respects, emphasizing the importance of phenotypic and genomic characterization of in situ clinical isolates.
Affiliation(s)
- Xunchao Cai
  - Department of Gastroenterology and Hepatology, Shenzhen University General Hospital, Shenzhen University, Shenzhen, Guangdong, China
  - Marshall Laboratory of Biomedical Engineering, Shenzhen University, Shenzhen, Guangdong, China
- Yao Peng
  - Department of Gastroenterology and Hepatology, Shenzhen University General Hospital, Shenzhen University, Shenzhen, Guangdong, China
- Gongli Yang
  - Department of Gastroenterology and Hepatology, Shenzhen University General Hospital, Shenzhen University, Shenzhen, Guangdong, China
- Lijuan Feng
  - Department of Gastroenterology and Hepatology, Shenzhen University General Hospital, Shenzhen University, Shenzhen, Guangdong, China
- Xiaojuan Tian
  - Department of Gastroenterology and Hepatology, Shenzhen University General Hospital, Shenzhen University, Shenzhen, Guangdong, China
- Ping Huang
  - Department of Gastroenterology and Hepatology, Shenzhen University General Hospital, Shenzhen University, Shenzhen, Guangdong, China
- Yanping Mao
  - College of Chemistry and Environmental Engineering, Shenzhen University, Shenzhen, Guangdong, China
- Long Xu
  - Department of Gastroenterology and Hepatology, Shenzhen University General Hospital, Shenzhen University, Shenzhen, Guangdong, China
  - Marshall Laboratory of Biomedical Engineering, Shenzhen University, Shenzhen, Guangdong, China
7. Fruzangohar M, Moolhuijzen P, Bakaj N, Taylor J. CoreDetector: a flexible and efficient program for core-genome alignment of evolutionary diverse genomes. Bioinformatics 2023; 39:btad628. [PMID: 37878789 PMCID: PMC10663985 DOI: 10.1093/bioinformatics/btad628] [Received: 05/29/2023] [Revised: 09/20/2023] [Accepted: 10/23/2023] [Indexed: 10/27/2023] Open
Abstract
MOTIVATION Whole genome alignment of eukaryote species remains an important method for the determination of sequence and structural variations and can also be used to ascertain the representative non-redundant core-genome sequence of a population. Many whole genome alignment tools were first developed for the more mature analysis of prokaryote species with few current tools containing the functionality to process larger genomes of eukaryotes as well as genomes of more divergent species. In addition, the functionality of these tools becomes computationally prohibitive due to the significant compute resources needed to handle larger genomes. RESULTS In this research, we present CoreDetector, an easy-to-use general-purpose program that can align the core-genome sequences for a range of genome sizes and divergence levels. To illustrate the flexibility of CoreDetector, we conducted alignments of a large set of closely related fungal pathogen and hexaploid wheat cultivar genomes as well as more divergent fly and rodent species genomes. In all cases, compared to existing multiple genome alignment tools, CoreDetector exhibited improved flexibility, efficiency, and competitive accuracy in tested cases. AVAILABILITY AND IMPLEMENTATION CoreDetector was developed in the cross platform, and easily deployable, Java language. A packaged pipeline is readily executable in a bash terminal without any external need for Perl or Python environments. Installation, example data, and usage instructions for CoreDetector are freely available from https://github.com/mfruzan/CoreDetector.
Affiliation(s)
- Mario Fruzangohar
  - The Biometry Hub, School of Agriculture, Food and Wine, University of Adelaide, Urrbrae, South Australia 5064, Australia
- Paula Moolhuijzen
  - Centre for Crop Disease Management, School of Molecular and Life Sciences, Curtin University, Bentley, Western Australia 6102, Australia
- Nicolette Bakaj
  - The Biometry Hub, School of Agriculture, Food and Wine, University of Adelaide, Urrbrae, South Australia 5064, Australia
- Julian Taylor
  - The Biometry Hub, School of Agriculture, Food and Wine, University of Adelaide, Urrbrae, South Australia 5064, Australia
8. Landemaine L, Da Costa G, Fissier E, Francis C, Morand S, Verbeke J, Michel ML, Briandet R, Sokol H, Gueniche A, Bernard D, Chatel JM, Aguilar L, Langella P, Clavaud C, Richard ML. Staphylococcus epidermidis isolates from atopic or healthy skin have opposite effect on skin cells: potential implication of the AHR pathway modulation. Front Immunol 2023; 14:1098160. [PMID: 37304256 PMCID: PMC10250813 DOI: 10.3389/fimmu.2023.1098160] [Received: 11/14/2022] [Accepted: 05/04/2023] [Indexed: 06/13/2023] Open
Abstract
Introduction Staphylococcus epidermidis is a commensal bacterium ubiquitously present on human skin. This species is considered a key member of the healthy skin microbiota, involved in defense against pathogens, modulation of the immune system, and wound repair. At the same time, S. epidermidis is the second leading cause of nosocomial infections, and an overgrowth of S. epidermidis has been described in skin disorders such as atopic dermatitis. Diverse isolates of S. epidermidis co-exist on the skin. Elucidating the genetic and phenotypic specificities of these isolates in skin health and disease is key to better understanding their role in various skin conditions. Additionally, the exact mechanisms by which commensals interact with host cells are only partially understood. We hypothesized that S. epidermidis isolates identified from different skin origins could play distinct roles in skin differentiation and that these effects could be mediated by the aryl hydrocarbon receptor (AhR) pathway. Methods For this purpose, a library of 12 strains originating from healthy skin (non-hyperseborrheic (NH) and hyperseborrheic (H) skin types) and diseased skin (atopic (AD) skin type) was characterized at the genomic and phenotypic levels. Results and discussion Here we showed that strains from atopic lesional skin alter the epidermis structure of a 3D reconstructed skin model whereas strains from NH healthy skin do not. All strains from NH healthy skin induced the AhR/OVOL1 pathway and produced high quantities of indole metabolites in co-culture with NHEK, especially indole-3-aldehyde (IAld) and indole-3-lactic acid (ILA), whereas AD strains did not induce the AhR/OVOL1 pathway but instead induced its inhibitor STAT6, and produced the lowest levels of indoles compared to the other strains. As a consequence, strains from AD skin altered the differentiation markers FLG and DSG1. The results presented here, on a library of 12 strains, showed that S. epidermidis strains originating from NH healthy skin and from atopic skin have opposite effects on epidermal cohesion and structure, and that these differences could be linked to their capacity to produce metabolites, which in turn could activate the AhR pathway. Our results on a specific library of strains provide new insights into how S. epidermidis may interact with the skin to promote health or disease.
Affiliation(s)
- Leslie Landemaine
  - Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, Jouy-en-Josas, France
  - L’Oréal Research and Innovation, Aulnay-sous-Bois, France
- Gregory Da Costa
  - Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, Jouy-en-Josas, France
  - Paris Center for Microbiome Medicine (PaCeMM), Fédération Hospitalo-Universitaire, Paris, France
- Elsa Fissier
  - Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, Jouy-en-Josas, France
  - Paris Center for Microbiome Medicine (PaCeMM), Fédération Hospitalo-Universitaire, Paris, France
- Carine Francis
  - L’Oréal Research and Innovation, Aulnay-sous-Bois, France
- Marie-Laure Michel
  - Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, Jouy-en-Josas, France
  - Paris Center for Microbiome Medicine (PaCeMM), Fédération Hospitalo-Universitaire, Paris, France
- Romain Briandet
  - Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, Jouy-en-Josas, France
- Harry Sokol
  - Paris Center for Microbiome Medicine (PaCeMM), Fédération Hospitalo-Universitaire, Paris, France
  - Sorbonne Université, INSERM UMRS-938, Centre de Recherche Saint-Antoine, Assistance Publique - Hôpitaux de Paris (AP-HP), Paris, France
- Jean-Marc Chatel
  - Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, Jouy-en-Josas, France
  - Paris Center for Microbiome Medicine (PaCeMM), Fédération Hospitalo-Universitaire, Paris, France
- Luc Aguilar
  - L’Oréal Research and Innovation, Aulnay-sous-Bois, France
- Philippe Langella
  - Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, Jouy-en-Josas, France
  - Paris Center for Microbiome Medicine (PaCeMM), Fédération Hospitalo-Universitaire, Paris, France
- Cecile Clavaud
  - L’Oréal Research and Innovation, Aulnay-sous-Bois, France
- Mathias L. Richard
  - Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, Jouy-en-Josas, France
  - Paris Center for Microbiome Medicine (PaCeMM), Fédération Hospitalo-Universitaire, Paris, France
9. Tang R, Yu Z, Li J. KINN: An alignment-free accurate phylogeny reconstruction method based on inner distance distributions of k-mer pairs in biological sequences. Mol Phylogenet Evol 2023; 179:107662. [PMID: 36375789 DOI: 10.1016/j.ympev.2022.107662] [Received: 08/30/2022] [Revised: 10/10/2022] [Accepted: 11/02/2022] [Indexed: 11/13/2022]
Abstract
Alignment-based methods have faced disadvantages in sequence comparison and phylogeny reconstruction due to their high computational complexity. Alignment-free methods for sequence comparison and phylogeny inference have attracted a great deal of attention in recent years. Here, we explore an alignment-free approach that uses inner distance distributions of k-mer pairs in biological sequences for phylogeny inference. For every sequence in a dataset, our method transforms the sequence into a numeric feature vector consisting of features each representing a specific k-mer pair's contribution to the characterization of the sequentiality uniqueness of the sequence. This newly defined k-mer pair's contribution is an integration of the reverse Kullback-Leibler divergence, pseudo mode and the classic entropy of an inner distance distribution of the k-mer pair in the sequence. Our method has been tested on datasets of complete genome sequences, complete protein sequences, and gene sequences of rRNA of various lengths. Our method achieves the best performance in comparison with state-of-the-art alignment-free methods as measured by the Robinson-Foulds distance between the reference and the constructed phylogeny trees.
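To make the notion of an "inner distance distribution of a k-mer pair" more concrete, here is a minimal sketch. The abstract does not define the inner distance precisely, so the definition used below (the offset from each occurrence of the first k-mer to the nearest later occurrence of the second) is an assumption, as are all function names. Only the classic Shannon entropy term is shown; the reverse Kullback-Leibler divergence and pseudo mode mentioned in the abstract are omitted.

```python
import math
from collections import Counter


def positions(seq, kmer):
    """Start indices of every (possibly overlapping) occurrence of kmer in seq."""
    return [i for i in range(len(seq) - len(kmer) + 1) if seq.startswith(kmer, i)]


def inner_distances(seq, first, second):
    """Assumed definition: distance from each `first` to the nearest later `second`."""
    second_positions = positions(seq, second)
    distances = []
    for i in positions(seq, first):
        later = [j for j in second_positions if j > i]
        if later:  # skip occurrences with no later partner
            distances.append(later[0] - i)
    return distances


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


seq = "ACGTACTTGT"  # toy sequence, made up for this example
dists = inner_distances(seq, "AC", "GT")
print(dists, shannon_entropy(dists))
```

In a full method, one such feature would be computed for every k-mer pair and the resulting vectors compared across sequences; the sketch covers a single pair only.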
Affiliation(s)
- Runbin Tang
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, China; School of Mathematical Sciences, Chongqing Normal University, Chongqing 401331, China
- Zuguo Yu
  - Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, China
- Jinyan Li
  - Data Science Institute, University of Technology Sydney, Ultimo, NSW 2007, Australia
10. Anjum N, Nabil RL, Rafi RI, Bayzid MS, Rahman MS. CD-MAWS: An Alignment-Free Phylogeny Estimation Method Using Cosine Distance on Minimal Absent Word Sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:196-205. [PMID: 34928803 DOI: 10.1109/tcbb.2021.3136792] [Indexed: 05/04/2023]
Abstract
Multiple sequence alignment has been the traditional and well-established approach to sequence analysis and comparison, though it is time and memory consuming. As the scale of sequencing data increases day by day, the importance of faster yet accurate alignment-free methods is on the rise. Several alignment-free sequence analysis methods have been established in the literature in recent years, which extract numerical features from genomic data to analyze sequences and also to estimate phylogenetic relationships among genes and species. Minimal Absent Word (MAW) is an effective concept for representing characteristics of a sequence in an alignment-free manner. In this study, we present CD-MAWS, a distance measure based on the cosine of the angle between composition vectors constructed using minimal absent words, for sequence analysis in a computationally inexpensive manner. We have benchmarked CD-MAWS using several AFProject datasets, such as the Fish mtDNA, E. coli, Plants, Shigella and Yersinia datasets, and found it to perform quite well. Applied to several other biological datasets such as mammal mtDNA, bacterial genomes and viral genomes, CD-MAWS resolved phylogenetic relationships similar to or better than state-of-the-art alignment-free methods such as Mash, Skmer, Co-phylog and kSNP3.
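A minimal absent word (MAW) of a sequence is a word that does not occur in it although every proper substring of the word does. The sketch below enumerates MAWs by brute force up to a small length and compares two MAW sets with a cosine distance over binary presence vectors. This is an illustration under stated assumptions, not the CD-MAWS implementation: the function names are hypothetical, the binary vector is a simplification of the composition vectors the abstract mentions, and brute-force enumeration is only practical for toy inputs.

```python
import math
from itertools import product


def substrings(seq, max_len):
    """All substrings of seq with length 1..max_len."""
    return {seq[i:i + n] for n in range(1, max_len + 1)
            for i in range(len(seq) - n + 1)}


def minimal_absent_words(seq, max_len, alphabet="ACGT"):
    """Brute-force MAWs of seq up to max_len letters."""
    present = substrings(seq, max_len)
    present.add("")  # the empty word occurs in every sequence
    maws = set()
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            # Absent, while its longest proper prefix and suffix both occur.
            if (word not in present and word[1:] in present
                    and word[:-1] in present):
                maws.add(word)
    return maws


def cosine_distance(a, b):
    """1 - cosine similarity of the binary indicator vectors of sets a and b."""
    if not a or not b:
        return 1.0
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))


m1 = minimal_absent_words("ACGT", 2)
m2 = minimal_absent_words("AAAA", 2)
print(len(m1), sorted(m2), cosine_distance(m1, m2))
```

A pairwise matrix of such distances could then be passed to a standard tree-building method such as neighbor joining.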
11. Bohnsack KS, Kaden M, Abel J, Villmann T. Alignment-Free Sequence Comparison: A Systematic Survey From a Machine Learning Perspective. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:119-135. [PMID: 34990369 DOI: 10.1109/tcbb.2022.3140873] [Indexed: 05/04/2023]
Abstract
The encounter of large amounts of biological sequence data generated during the last decades and the algorithmic and hardware improvements have offered the possibility to apply machine learning techniques in bioinformatics. While the machine learning community is aware of the necessity to rigorously distinguish data transformation from data comparison and adopt reasonable combinations thereof, this awareness is often lacking in the field of comparative sequence analysis. With realization of the disadvantages of alignments for sequence comparison, some typical applications use more and more so-called alignment-free approaches. In light of this development, we present a conceptual framework for alignment-free sequence comparison, which highlights the delineation of: 1) the sequence data transformation comprising of adequate mathematical sequence coding and feature generation, from 2) the subsequent (dis-)similarity evaluation of the transformed data by means of problem-specific but mathematically consistent proximity measures. We consider coding to be an information-loss free data transformation in order to get an appropriate representation, whereas feature generation is inevitably information-lossy with the intention to extract just the task-relevant information. This distinction sheds light on the plethora of methods available and assists in identifying suitable methods in machine learning and data analysis to compare the sequences under these premises.
12
|
Moolhuijzen PM, See PT, Shi G, Powell HR, Cockram J, Jørgensen LN, Benslimane H, Strelkov SE, Turner J, Liu Z, Moffat CS. A global pangenome for the wheat fungal pathogen Pyrenophora tritici-repentis and prediction of effector protein structural homology. Microb Genom 2022; 8:mgen000872. [PMID: 36214662 PMCID: PMC9676058 DOI: 10.1099/mgen.0.000872] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/30/2022] Open
Abstract
The adaptive potential of plant fungal pathogens is largely governed by the gene content of a species, consisting of core and accessory genes across the pathogen isolate repertoire. To approximate the complete gene repertoire of a globally significant crop fungal pathogen, a pangenomic analysis was undertaken for Pyrenophora tritici-repentis (Ptr), the causal agent of tan (or yellow) spot disease in wheat. In this study, 15 new Ptr genomes were sequenced, assembled and annotated, including isolates from three races not previously sequenced. Together with 11 previously published Ptr genomes, a pangenome for 26 Ptr isolates from Australia, Europe, North Africa and America, representing nearly all known races, revealed a conserved core-gene content of 57% and presents a new Ptr resource for searching natural homologues (orthologues not acquired by horizontal transfer from another species) using remote protein structural homology. Here, we identify for the first time a non-synonymous mutation in the Ptr necrotrophic effector gene ToxB, multiple copies of the inactive toxb within an isolate, a distant natural Pyrenophora homologue of a known Parastagonospora nodorum necrotrophic effector (SnTox3), and clear genomic break points for the ToxA effector horizontal transfer region. This comprehensive genomic analysis of Ptr races includes nine isolates sequenced via long-read technologies. Accordingly, these resources provide a more complete representation of the species, and serve as a resource to monitor variations potentially involved in pathogenicity.
Affiliation(s)
- Paula M. Moolhuijzen
- Centre for Crop Disease and Management, School of Molecular and Life Sciences, Curtin University, Bentley, Western Australia, Australia
- *Correspondence: Paula M. Moolhuijzen
- Pao Theen See
- Centre for Crop Disease and Management, School of Molecular and Life Sciences, Curtin University, Bentley, Western Australia, Australia
- Gongjun Shi
- Department of Plant Pathology, North Dakota State University, Fargo, North Dakota, USA
- Harold R. Powell
- Department of Life Sciences, Centre for Integrative Systems Biology and Bioinformatics, Imperial College London, London, England, UK
- James Cockram
- NIAB, 93 Lawrence Weaver Road, Cambridge, CB3 0LE, UK
- Hamida Benslimane
- Département de Botanique, Ecole Nationale Supérieure Agronomique (ENSA), Hassan Badi, El-Harrach, Algiers, Algeria
- Stephen E. Strelkov
- Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton, AB, Canada
- Zhaohui Liu
- Department of Plant Pathology, North Dakota State University, Fargo, North Dakota, USA
- *Correspondence: Zhaohui Liu
- Caroline S. Moffat
- Centre for Crop Disease and Management, School of Molecular and Life Sciences, Curtin University, Bentley, Western Australia, Australia
13
Uddin M, Islam MK, Hassan MR, Jahan F, Baek JH. A fast and efficient algorithm for DNA sequence similarity identification. Complex Intell Syst 2022; 9:1265-1280. [PMID: 36035628 PMCID: PMC9395857 DOI: 10.1007/s40747-022-00846-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2021] [Accepted: 08/05/2022] [Indexed: 11/22/2022]
Abstract
DNA sequence similarity analysis is necessary for many purposes, including genome analysis, extracting biological information, and finding the evolutionary relationships of species. There are two types of sequence analysis: alignment-based (AB) and alignment-free (AF). AB is effective for small homologous sequences but becomes an NP-hard problem for long sequences. AF algorithms can overcome the major limitations of AB, but most existing AF methods show high time complexity and memory consumption, lower precision, and lower performance on benchmark datasets. To minimize these limitations, we develop an AF algorithm using a 2D k-mer count matrix inspired by the CGR approach. We then shrink the matrix by analyzing the neighbors and measure similarities using the best combinations of pairwise distance (PD) and phylogenetic tree methods. We also dynamically choose the value of k for the k-mers. We develop an efficient system for finding the positions of k-mers in the count matrix. We apply our system to six different datasets. We achieve the top rank for two benchmark datasets from AFproject, 100% accuracy for two datasets (16S Ribosomal, 18 Eutherian), and achieve a milestone for time complexity and memory consumption in comparison to the existing study datasets (HEV, HIV-1). Therefore, the comparative results of the benchmark datasets and existing studies demonstrate that our method is highly effective, efficient, and accurate. Thus, our method can be used with the top level of authenticity for DNA sequence similarity measurement.
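A 2D k-mer count matrix in the spirit of a chaos game representation (CGR) can be sketched as follows. This is an illustration rather than the authors' algorithm: the corner assignment for A, C, G and T is a hypothetical choice, the function name `kmer_count_matrix` is invented here, and the matrix-shrinking and dynamic choice of k described in the abstract are not reproduced.

```python
def kmer_count_matrix(seq, k):
    """Count k-mers of seq in a 2^k x 2^k grid using a CGR-style layout.

    Each nucleotide selects one quadrant, so a k-mer is located by reading
    one row bit and one column bit per base.
    """
    # Hypothetical corner assignment: base -> (row bit, column bit).
    corner = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in corner for base in kmer):
            continue  # skip k-mers containing ambiguous bases such as N
        row = col = 0
        for base in kmer:
            row_bit, col_bit = corner[base]
            row = (row << 1) | row_bit
            col = (col << 1) | col_bit
        matrix[row][col] += 1
    return matrix
```

Because the cell of a k-mer follows directly from its bases, locating it takes O(k) bit operations and no search over the matrix.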
14
Sequence Comparison Without Alignment: The SpaM Approaches. Methods in Molecular Biology (Clifton, N.J.) 2021; 2231:121-134. [PMID: 33289890 DOI: 10.1007/978-1-0716-1036-7_8] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 10/22/2022]
Abstract
Sequence alignment is at the heart of DNA and protein sequence analysis. For the data volumes that are nowadays produced by massively parallel sequencing technologies, however, pairwise and multiple alignment methods are often too slow. Therefore, fast alignment-free approaches to sequence comparison have become popular in recent years. Most of these approaches are based on word frequencies, for words of a fixed length, or on word-matching statistics. Other approaches use the length of maximal word matches. While these methods are very fast, most of them rely on ad hoc measures of sequence similarity or dissimilarity that are hard to interpret. In this chapter, I describe a number of alignment-free methods that we developed in recent years. Our approaches are based on spaced-word matches ("SpaM"), i.e. on inexact word matches that are allowed to contain mismatches at certain pre-defined positions. Unlike most previous alignment-free approaches, our approaches are able to accurately estimate phylogenetic distances between DNA or protein sequences using a stochastic model of molecular evolution.
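The notion of a spaced-word match can be made concrete with a short sketch. The pattern notation ("1" for a position that must match, "0" for a position where a mismatch is tolerated) and the function names are assumptions for illustration; the distance estimation built on top of such matches in the SpaM programs is not shown.

```python
def spaced_words(seq, pattern):
    """Return (position, spaced word) pairs for a binary pattern.

    Only the characters under a '1' in the pattern are kept, so two windows
    that differ solely under a '0' yield the same spaced word.
    """
    match_positions = [i for i, flag in enumerate(pattern) if flag == "1"]
    width = len(pattern)
    return [(start, "".join(seq[start + p] for p in match_positions))
            for start in range(len(seq) - width + 1)]


def spaced_word_matches(seq_a, seq_b, pattern):
    """List all (i, j) where the spaced words at seq_a[i] and seq_b[j] agree."""
    index = {}
    for i, word in spaced_words(seq_a, pattern):
        index.setdefault(word, []).append(i)
    return [(i, j)
            for j, word in spaced_words(seq_b, pattern)
            for i in index.get(word, [])]
```

With the pattern `1101`, the windows `ACGT` and `ACCT` match because they differ only at the third position, which the pattern ignores.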
15
Girgis HZ, James BT, Luczak BB. Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models. NAR Genom Bioinform 2021; 3:lqab001. [PMID: 33554117 PMCID: PMC7850047 DOI: 10.1093/nargab/lqab001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/14/2019] [Revised: 12/07/2020] [Accepted: 01/08/2021] [Indexed: 11/12/2022] Open
Abstract
Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms take quadratic time and are slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is calculating the identity scores (the percentage of identical nucleotides in an optimal alignment, including gaps, of two sequences); there is no need for visualizing how every two sequences are aligned. For these applications, we propose Identity, which produces global identity scores for a large number of pairs of DNA sequences using alignment-free methods and self-supervised general linear models. For the first time, the new tool can predict pairwise identity scores in linear time and space. On two large-scale sequence databases, Identity provided the best compromise between sensitivity and precision while being faster than BLAST, Mash, MUMmer4 and USEARCH by 2-80 times. Identity was the best performing tool when searching for low-identity matches. While constructing phylogenetic trees from about 6000 transcripts, the tree due to the scores reported by Identity was the closest to the reference tree (in contrast to andi, FSWM and Mash). Identity is capable of producing pairwise identity scores of millions-of-nucleotides-long bacterial genomes; this task cannot be accomplished by any global-alignment-based tool. Availability: https://github.com/BioinformaticsToolsmith/Identity.
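The identity score that the tool predicts is defined on an alignment, so a small sketch of that definition may help. It computes the score from two already-aligned strings; the function name and the `-` gap symbol are assumptions, and the sketch is the alignment-based quantity being approximated, not the alignment-free predictor itself.

```python
def identity_score(aligned_a, aligned_b, gap="-"):
    """Percentage of identical nucleotides over all alignment columns.

    Gap columns count towards the alignment length, matching the
    'including gaps' definition; columns that are gaps in both rows
    are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == gap and b == gap)]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)
```

For the alignment `ACGT-A` over `ACCTTA`, four of six columns are identical, giving a score of about 66.7.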
Affiliation(s)
- Hani Z Girgis
- Bioinformatics Toolsmith Laboratory, Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, 700 University Boulevard, Kingsville, TX 78363, USA
- Benjamin T James
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar Street, Cambridge, MA 02139, USA
- Brian B Luczak
- Department of Mathematics, Vanderbilt University, 1326 Stevenson Center Lane, Nashville, TN 3721, USA
16
Haubold B, Klötzl F, Hellberg L, Thompson D, Cavalar M. Fur: Find Unique Genomic Regions for Diagnostic PCR. Bioinformatics 2021; 37:2081-2087. [PMID: 33515232 PMCID: PMC8352509 DOI: 10.1093/bioinformatics/btab059] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/09/2020] [Revised: 12/04/2020] [Accepted: 01/26/2021] [Indexed: 11/12/2022] Open
Abstract
Motivation: Unique marker sequences are highly sought after in molecular diagnostics. Nevertheless, there are only a few programs available to search for marker sequences, compared to the many programs for similarity search. We therefore wrote the program Fur for Finding Unique genomic Regions. Results: Fur takes as input a sample of target sequences and a sample of closely related neighbors. It returns the regions present in all targets and absent from all neighbors. The recently published program genmap can also be used for this purpose and we compared it to fur. When analyzing a sample of 33 genomes representing the major phylogroups of E. coli, fur was 40 times faster than genmap but used three times more memory. On the other hand, genmap yielded three times more markers, but they were less accurate when tested in silico on a sample of 237 E. coli genomes. We also designed phylogroup-specific PCR primers based on the markers proposed by genmap and fur, and tested them by analyzing their virtual amplicons in GenBank. Finally, we used fur to design primers specific to a Lactobacillus species, and found excellent sensitivity and specificity in vitro. Availability and implementation: Fur sources and documentation are available from https://github.com/evolbioinf/fur. The compiled software is posted as a docker container at https://hub.docker.com/r/haubold/fox. Supplementary information: Supplementary data are available at Bioinformatics online.
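The selection criterion, present in all targets and absent from all neighbors, can be illustrated at the level of k-mers. This sketch is not how Fur works internally (the abstract does not describe its algorithm); the k-mer simplification and the function names are assumptions, and reverse complements are deliberately ignored.

```python
def kmers(seq, k):
    """Return the set of all k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_target_kmers(targets, neighbors, k):
    """k-mers found in every target sequence and in no neighbor sequence."""
    if not targets:
        return set()
    # Intersect across targets: a marker must be present in all of them.
    shared = set.intersection(*(kmers(t, k) for t in targets))
    # Subtract anything seen in a neighbor: a marker must be absent from all.
    for neighbor in neighbors:
        shared -= kmers(neighbor, k)
    return shared
```

Runs of adjacent surviving k-mers would then have to be merged into candidate regions, a step omitted here.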
Affiliation(s)
- Bernhard Haubold
- Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, Plön, Germany
- Fabian Klötzl
- Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, Plön, Germany
- Lars Hellberg
- Molecular Infection Diagnostics, Euroimmun Medizinische Labordiagnostika, Lübeck, Germany
- Daniel Thompson
- Molecular Infection Diagnostics, Euroimmun Medizinische Labordiagnostika, Lübeck, Germany
- Markus Cavalar
- Molecular Infection Diagnostics, Euroimmun Medizinische Labordiagnostika, Lübeck, Germany
17
Criscuolo A. On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference. F1000Res 2020; 9:1309. [PMID: 33335719 PMCID: PMC7713896 DOI: 10.12688/f1000research.26930.1] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Accepted: 10/12/2020] [Indexed: 12/29/2022] Open
Abstract
Recently developed MinHash-based techniques have proven successful in quickly estimating the level of similarity between large nucleotide sequences. This article discusses their usage and limitations in practice for approximating uncorrected distances between genomes, and for transforming these pairwise dissimilarities into proper evolutionary distances. It is notably shown that complex distance measures can be easily approximated using simple transformation formulae based on few parameters. MinHash-based techniques can therefore be very useful for implementing fast yet accurate alignment-free phylogenetic reconstruction procedures from large sets of genomes. This last point of view is assessed with a simulation study using a dedicated bioinformatics tool.
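How a MinHash sketch leads to a distance can be shown in a few lines. The sketch below uses a bottom-s sketch and the distance formula popularised by Mash, D = -(1/k) ln(2j / (1 + j)); it is not taken from this article, whose own transformation formulae are not given in the abstract. The hash function, sketch size and function names are assumptions chosen for illustration.

```python
import hashlib
import math


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest distinct k-mer hashes of seq."""
    hashes = {hash64(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate the Jaccard index from the s smallest hashes of the union."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(jaccard, k):
    """Mash-style distance from a Jaccard estimate and the k-mer length."""
    if jaccard <= 0.0:
        return 1.0  # convention for sketches with nothing in common
    return max(0.0, -math.log(2.0 * jaccard / (1.0 + jaccard)) / k)
```

The resulting value approximates an uncorrected per-site mutation rate; a further correction would be needed to obtain an evolutionary distance under a substitution model.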
Affiliation(s)
- Alexis Criscuolo
- Hub de Bioinformatique et Biostatistique - Département Biologie Computationnelle, Institut Pasteur, USR 3756, CNRS, 75015 Paris, France
18
Röhling S, Linne A, Schellhorn J, Hosseini M, Dencker T, Morgenstern B. The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances. PLoS One 2020; 15:e0228070. [PMID: 32040534 PMCID: PMC7010260 DOI: 10.1371/journal.pone.0228070] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 01/07/2020] [Accepted: 01/08/2020] [Indexed: 12/14/2022] Open
Abstract
We study the number N_k of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences (i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor) can be estimated from the slope of a function F that depends on N_k and that is affine-linear within a certain range of k. Integers k_min and k_max can be calculated depending on the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(k_min) and F(k_max). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distances up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results, even if sequences share only local homologies.
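The quantity N_k is simple to compute, as the following sketch shows. It only counts word matches for a given k; the function F and the choice of k_min and k_max are not defined in the abstract, so they are not reproduced, and the function name `kmer_match_count` is an assumption.

```python
from collections import Counter


def kmer_match_count(seq_a, seq_b, k):
    """Number of position pairs (i, j) with seq_a[i:i+k] == seq_b[j:j+k]."""
    counts_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    counts_b = Counter(seq_b[j:j + k] for j in range(len(seq_b) - k + 1))
    # Each word contributes (occurrences in a) x (occurrences in b) matches.
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

Evaluating this for a range of k values, for example `[kmer_match_count(a, b, k) for k in range(8, 20)]`, gives the curve whose dependence on k the abstract describes.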
Affiliation(s)
- Sophie Röhling
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Alexander Linne
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Jendrik Schellhorn
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Thomas Dencker
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Burkhard Morgenstern
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Göttingen, Germany