1
|
Genomic surveillance of SARS-CoV-2 using long-range PCR primers. Front Microbiol 2024; 15:1272972. [PMID: 38440140 PMCID: PMC10910555 DOI: 10.3389/fmicb.2024.1272972] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2023] [Accepted: 01/03/2024] [Indexed: 03/06/2024]
Abstract
Introduction Whole Genome Sequencing (WGS) of the SARS-CoV-2 virus is crucial in the surveillance of the COVID-19 pandemic. Several primer schemes have been developed to sequence nearly all of the ~30,000 nucleotide SARS-CoV-2 genome, using a multiplex PCR approach to amplify cDNA copies of the viral genomic RNA. ARTIC V4.1 primers and Midnight primers are the most popular primer schemes that can amplify segments of SARS-CoV-2 (400 bp and 1200 bp, respectively) tiled across the viral RNA genome. Mutations within primer binding sites and primer-primer interactions can result in amplicon dropouts and coverage bias, yielding low-quality genomes with 'Ns' inserted in the missing amplicon regions, causing inaccurate lineage assignments, and making it challenging to monitor lineage-specific mutations in Variants of Concern (VoCs). Methods In this study we used a set of seven long-range PCR primer pairs to sequence clinical isolates of SARS-CoV-2 on an Oxford Nanopore sequencer. These long-range primers generate seven amplicons of approximately 4500 bp that cover the whole genome of SARS-CoV-2. One of these regions includes the full-length S-gene by using a set of flanking primers. We also compared the performance of these long-range primers with Midnight primers by sequencing 94 clinical isolates in a Nanopore flow cell. Results and discussion Using a small set of long-range primers to sequence SARS-CoV-2 genomes reduces the possibility of amplicon dropout and coverage bias. The key finding of this study is that long-range primers can be used in single-molecule sequencing of RNA viruses in surveillance of emerging variants. We also show that by designing primers flanking the S-gene, we can obtain reliable identification of SARS-CoV-2 variants.
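The tiling idea in the abstract above can be illustrated with a small calculation. This is only a sketch: the genome length of 30,000 nt is the rounded figure from the abstract, the 200-base overlap between neighbouring amplicons is an assumed value, and the function names are hypothetical. The actual primer coordinates used in the study are not given here.

```python
def tile_amplicons(genome_len, n_amplicons, overlap):
    """Return 0-based half-open (start, end) intervals of equal-length
    amplicons that tile the genome, with neighbours sharing `overlap` bases."""
    # Solve n * L - (n - 1) * overlap >= genome_len for L (ceiling division).
    amplicon_len = -(-(genome_len + (n_amplicons - 1) * overlap) // n_amplicons)
    step = amplicon_len - overlap
    tiles = []
    for i in range(n_amplicons):
        start = i * step
        end = min(start + amplicon_len, genome_len)
        tiles.append((start, end))
    return tiles


def uncovered_after_dropout(genome_len, tiles, dropped):
    """Count positions left uncovered (the 'Ns') when the amplicons whose
    indices are in `dropped` fail to amplify."""
    covered = [False] * genome_len
    for index, (start, end) in enumerate(tiles):
        if index in dropped:
            continue
        for pos in range(start, end):
            covered[pos] = True
    return covered.count(False)


# Assumed values: ~30,000 nt genome, seven amplicons, 200-base overlap.
tiles = tile_amplicons(30000, 7, 200)
print(tiles)
print(uncovered_after_dropout(30000, tiles, {3}))
```

With these assumed numbers each amplicon comes out at 4458 bases, close to the approximately 4500 bp reported. The second function shows the trade-off: a scheme with few amplicons has fewer primer sites that can fail, but losing a single long amplicon leaves a larger gap.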
|
2
|
Genomic Surveillance of SARS-CoV-2 Using Long-Range PCR Primers. bioRxiv 2023:2023.07.10.548464. [PMID: 37502853 PMCID: PMC10369864 DOI: 10.1101/2023.07.10.548464] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/29/2023]
Abstract
Whole Genome Sequencing (WGS) of the SARS-CoV-2 virus is crucial in the surveillance of the COVID-19 pandemic. Several primer schemes have been developed to sequence the ~30,000 nucleotide SARS-CoV-2 genome that use a multiplex PCR approach to amplify cDNA copies of the viral genomic RNA. ARTIC V4.1 primers and Midnight primers are the most popular primer schemes that can amplify segments of SARS-CoV-2 (400 bp and 1200 bp, respectively) tiled across the viral RNA genome. Mutations within primer binding sites and primer-primer interactions can result in amplicon dropouts and coverage bias, yielding low-quality genomes with 'Ns' inserted in the missing amplicon regions, causing inaccurate lineage assignments, and making it challenging to monitor lineage-specific mutations in Variants of Concern (VoCs). This study uses seven long-range PCR primers with an amplicon size of ~4500 bp to tile across the complete SARS-CoV-2 genome. One of these regions includes the full-length S-gene by using a set of flanking primers. Using a small set of long-range primers to sequence SARS-CoV-2 genomes reduces the possibility of amplicon dropout and coverage bias.
|
3
|
Big data in genomic research for big questions with examples from covid-19 and other zoonoses. J Appl Microbiol 2023; 134:6917140. [PMID: 36626787 DOI: 10.1093/jambio/lxac055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2022] [Revised: 07/20/2022] [Accepted: 11/09/2022] [Indexed: 01/12/2023]
Abstract
Omics research inevitably involves the collection and analysis of big data, which can only be handled by automated approaches. Here we point out that the analysis of big data in the field of genomics dictates certain requirements, such as specialized software, quality control of input data, and simplification for visualization of the results. The latter results in a loss of information, as is exemplified for phylogenetic trees. Clear communication of big data analyses can be enhanced by novel visualization strategies. The interpretation of findings is sometimes hampered when dedicated analytical tools are not fully understood by microbiologists, while the researchers performing these analyses may not have a full overview of the biology of the microbes under study. These issues are illustrated here, using SARS-CoV-2 and Salmonella enterica as zoonotic examples. Whereas in scientific communications jargon should be avoided or explained, nomenclature to group similar organisms and distinguish these from more distant relatives is not only essential, but also influences the interpretation of results. Unfortunately, changes in taxonomically accepted names are now so frequent that they hamper rather than assist research, as is illustrated with difficulties of microbiome studies. Nomenclature to group viral isolates, as is done for SARS-CoV-2, is also not without difficulties. Some weaknesses in current omics research stem from poor quality of data or biased databases, and problems can be magnified by machine learning approaches. Moreover, the overall opus of scientific publications can now be considered "big data", as is illustrated by the avalanche of COVID-19-related publications. The peer-review model of scientific publishing is only barely coping with this novel situation, resulting in retractions and the publication of bogus works. The avalanche of scientific publications that originated from the current pandemic can obstruct literature searches, and this will unfortunately continue over time.
|
4
|
Comparison of Monkeypox virus genomes from the 2017 Nigeria outbreak and the 2022 outbreak. J Appl Microbiol 2022; 133:3690-3698. [PMID: 36074056 PMCID: PMC9828465 DOI: 10.1111/jam.15806] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 07/21/2022] [Revised: 09/01/2022] [Accepted: 09/04/2022] [Indexed: 01/12/2023]
Abstract
AIMS The current Monkeypox virus (MPX) outbreak is not only the largest known outbreak to date caused by a strain belonging to the West-African clade, but also results in remarkably different clinical and epidemiological features compared to previous outbreaks of this virus. Here, we consider the possibility that mutations in the viral genome may be responsible for its changed characteristics. METHODS AND RESULTS Six genome sequences of isolates from the current outbreak were compared to five genomes of isolates from the 2017 outbreak in Nigeria and to two historic genomes, all belonging to the West-African clade. We report differences that are consistently present in the 2022 isolates but not in the others. Although some variation in repeat units was observed, only two were consistently found in the 2022 genomes only, and these were located in intergenic regions. A total of 55 single nucleotide polymorphisms were consistently present in the 2022 isolates compared to the 2017 isolates. Of these, 25 caused an amino acid substitution in a predicted protein. CONCLUSIONS The nature of the substitution and the annotation of the affected protein identified potential candidates that might affect the virulence of the virus. These included the viral DNA helicase and transcription factors. SIGNIFICANCE This bioinformatic analysis provides guidance for wet-lab research to identify changed properties of the MPX.
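The comparison described above, finding differences present in every 2022 isolate but in none of the other genomes, can be sketched as follows. This is a minimal illustration and not the authors' pipeline: it assumes the sequences are already aligned to equal length, the function name is hypothetical, and the short sequences are toy data.

```python
def consistent_differences(focus_group, other_group):
    """Return (position, focus_base, other_bases) for every aligned column
    where all focus sequences share one base that no other sequence has."""
    length = len(focus_group[0])
    if any(len(seq) != length for seq in focus_group + other_group):
        raise ValueError("sequences must be aligned to the same length")
    hits = []
    for pos in range(length):
        focus_bases = {seq[pos] for seq in focus_group}
        other_bases = {seq[pos] for seq in other_group}
        # Keep the column only if the focus group is uniform and its base
        # never occurs in the comparison group.
        if len(focus_bases) == 1 and focus_bases.isdisjoint(other_bases):
            hits.append((pos, next(iter(focus_bases)), sorted(other_bases)))
    return hits


# Toy aligned sequences (hypothetical, for illustration only).
isolates_2022 = ["ACGTAC", "ACGTAC", "ACGTAC"]
older_isolates = ["ACTTAC", "ACTTAA", "ACTTAC"]
print(consistent_differences(isolates_2022, older_isolates))
```

In the toy data only position 2 qualifies. Position 5 varies among the older isolates but the 2022 base also occurs there, so it is not reported.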
|
5
|
The first three waves of the Covid-19 pandemic hint at a limited genetic repertoire for SARS-CoV-2. FEMS Microbiol Rev 2022; 46:fuac003. [PMID: 35076068 PMCID: PMC9075578 DOI: 10.1093/femsre/fuac003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/12/2021] [Revised: 12/17/2021] [Accepted: 01/13/2022] [Indexed: 11/22/2022]
Abstract
The genomic diversity of SARS-CoV-2 is the result of a relatively low level of spontaneous mutations introduced during viral replication. With millions of SARS-CoV-2 genome sequences now available, we can begin to assess the overall genetic repertoire of this virus. We find that during 2020, there was a global wave of one variant that went largely unnoticed, possibly because its members were divided over several sublineages (B.1.177 and sublineages B.1.177.XX). We collectively call this Janus, and it was eventually replaced by the Alpha (B.1.1.7) variant of concern (VoC), next replaced by Delta (B.1.617.2), which itself might soon be replaced by a fourth pandemic wave consisting of Omicron (B.1.1.529). We observe that splitting up and redefining variant lineages over time, as was the case with Janus and is now happening with Alpha, Delta and Omicron, is not helpful to describe the epidemic waves spreading globally. Only ∼5% of the 30 000 nucleotides of the SARS-CoV-2 genome are found to be variable. We conclude that a fourth wave of the pandemic with the Omicron variant might not be that different from other VoCs, and that we may already have the tools in hand to effectively deal with this new VoC.
|
6
|
Decoding the epitranscriptional landscape from native RNA sequences. Nucleic Acids Res 2021; 49:e7. [PMID: 32710622 PMCID: PMC7826254 DOI: 10.1093/nar/gkaa620] [Citation(s) in RCA: 118] [Impact Index Per Article: 39.3] [Received: 03/31/2020] [Revised: 06/13/2020] [Accepted: 07/13/2020] [Indexed: 11/14/2022]
Abstract
Traditional epitranscriptomics relies on capturing a single RNA modification by antibody or chemical treatment, combined with short-read sequencing to identify its transcriptomic location. This approach is labor-intensive and may introduce experimental artifacts. Direct sequencing of native RNA using Oxford Nanopore Technologies (ONT) can allow for directly detecting the RNA base modifications, although these modifications might appear as sequencing errors. The percent Error of Specific Bases (%ESB) was higher for native RNA than unmodified RNA, which enabled the detection of ribonucleotide modification sites. Based on the %ESB differences, we developed a bioinformatic tool, epitranscriptional landscape inferring from glitches of ONT signals (ELIGOS), that is based on various types of synthetic modified RNA and applied to rRNA and mRNA. ELIGOS is able to accurately predict known classes of RNA methylation sites (AUC > 0.93) in rRNAs from Escherichia coli, yeast, and human cells, using either unmodified in vitro transcription RNA or a background error model, which mimics the systematic error of direct RNA sequencing as the reference. The well-known DRACH/RRACH motif was localized and identified, consistent with previous studies, using differential analysis of ELIGOS to study the impact of RNA m6A methyltransferase by comparing wild type and knockouts in yeast and mouse cells. Lastly, the DRACH motif could also be identified in the mRNA of three human cell lines. The mRNA modification identified by ELIGOS is at the level of individual base resolution. In summary, we have developed a bioinformatic software package to uncover native RNA modifications.
|
7
|
Mash-based analyses of Escherichia coli genomes reveal 14 distinct phylogroups. Commun Biol 2021; 4:117. [PMID: 33500552 PMCID: PMC7838162 DOI: 10.1038/s42003-020-01626-5] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 02/12/2020] [Accepted: 12/21/2020] [Indexed: 01/30/2023]
Abstract
In this study, more than one hundred thousand Escherichia coli and Shigella genomes were examined and classified. This is, to our knowledge, the largest E. coli genome dataset analyzed to date. A Mash-based analysis of a cleaned set of 10,667 E. coli genomes from GenBank revealed 14 distinct phylogroups. A representative genome or medoid identified for each phylogroup was used as a proxy to classify 95,525 unassembled genomes from the Sequence Read Archive (SRA). We find that most of the sequenced E. coli genomes belong to four phylogroups (A, C, B1 and E2(O157)). Authenticity of the 14 phylogroups is supported by several different lines of evidence: phylogroup-specific core genes, a phylogenetic tree constructed with 2613 single copy core genes, and differences in the rates of gene gain/loss/duplication. The methodology used in this work is able to reproduce known phylogroups, as well as to identify previously uncharacterized phylogroups in E. coli species.
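The medoid-as-proxy strategy in the abstract above can be sketched in a few lines. This is a simplified illustration under stated assumptions: Mash estimates the Jaccard index from MinHash sketches, whereas the code below computes an exact Jaccard distance on the k-mer sets of toy strings. The function names and sequences are hypothetical and are not taken from the study.

```python
def kmers(seq, k=3):
    """Set of all k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=3):
    """1 minus the Jaccard index of the two k-mer sets (exact, not sketched)."""
    ka, kb = kmers(a, k), kmers(b, k)
    return 1 - len(ka & kb) / len(ka | kb)


def medoid(members, dist):
    """The member with the smallest summed distance to all group members."""
    return min(members, key=lambda g: sum(dist(g, other) for other in members))


def classify(query, medoids, dist):
    """Assign a query to the group whose medoid is closest."""
    return min(medoids, key=lambda group: dist(query, medoids[group]))


# Toy "genomes" in two groups (hypothetical data).
groups = {
    "A": ["AAAAAAAA", "AAAAAAAC", "AAAAAACC"],
    "B": ["GGGGGGGG", "GGGGGGGT", "GGGGGGTT"],
}
medoids = {name: medoid(members, jaccard_distance) for name, members in groups.items()}
print(medoids)
print(classify("AAAAACCC", medoids, jaccard_distance))
```

Comparing a query against one representative per group, rather than against every member, is what makes it feasible to classify a large collection with a small number of distance calculations.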
|
8
|
Decoding the epitranscriptional landscape from native RNA sequences. Nucleic Acids Res 2021; 49:e7. [PMID: 32710622 DOI: 10.1101/487819] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/31/2020] [Revised: 06/13/2020] [Accepted: 07/13/2020] [Indexed: 05/25/2023]
Abstract
Traditional epitranscriptomics relies on capturing a single RNA modification by antibody or chemical treatment, combined with short-read sequencing to identify its transcriptomic location. This approach is labor-intensive and may introduce experimental artifacts. Direct sequencing of native RNA using Oxford Nanopore Technologies (ONT) can allow for directly detecting the RNA base modifications, although these modifications might appear as sequencing errors. The percent Error of Specific Bases (%ESB) was higher for native RNA than unmodified RNA, which enabled the detection of ribonucleotide modification sites. Based on the %ESB differences, we developed a bioinformatic tool, epitranscriptional landscape inferring from glitches of ONT signals (ELIGOS), that is based on various types of synthetic modified RNA and applied to rRNA and mRNA. ELIGOS is able to accurately predict known classes of RNA methylation sites (AUC > 0.93) in rRNAs from Escherichia coli, yeast, and human cells, using either unmodified in vitro transcription RNA or a background error model, which mimics the systematic error of direct RNA sequencing as the reference. The well-known DRACH/RRACH motif was localized and identified, consistent with previous studies, using differential analysis of ELIGOS to study the impact of RNA m6A methyltransferase by comparing wild type and knockouts in yeast and mouse cells. Lastly, the DRACH motif could also be identified in the mRNA of three human cell lines. The mRNA modification identified by ELIGOS is at the level of individual base resolution. In summary, we have developed a bioinformatic software package to uncover native RNA modifications.
|
9
|
ProdMX: Rapid query and analysis of protein functional domain based on compressed sparse matrices. Comput Struct Biotechnol J 2020; 18:3890-3896. [PMID: 33335686 PMCID: PMC7719867 DOI: 10.1016/j.csbj.2020.10.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2020] [Revised: 10/20/2020] [Accepted: 10/23/2020] [Indexed: 11/26/2022]
Abstract
Large-scale protein analysis has been used to characterize large numbers of proteins across numerous species. One application is as a high-throughput screening method for the pathogenicity of genomes. Unlike sequence homology methods, protein comparison at a functional level provides us with a unique opportunity to classify proteins, based on their functional structures without dealing with sequence complexity of distantly related species. Protein functions can be abstractly described by a set of protein functional domains, such as PfamA domains; a set of genomes can then be mapped to a matrix, with each row representing a genome, and the columns representing the presence or absence of a given functional domain. However, a powerful tool is needed to analyze the large sparse matrices generated by millions of genomes that will become available in the near future. ProdMX is a tool with user-friendly utilities developed to facilitate high-throughput analysis of proteins with an ability to be included as an effective module in the high-throughput pipeline. ProdMX employs a compressed sparse matrix algorithm to reduce computational resources and time used to perform the matrix manipulation during functional domain analysis. ProdMX is a free and publicly available Python package which can be installed with popular package managers such as PyPI and Conda, or with a standard installer from source code available on the ProdMX GitHub repository at https://github.com/visanuwan/prodmx.
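The genome-by-domain matrix described above can be illustrated with a hand-rolled compressed sparse row (CSR) layout. This is a generic sketch and does not use or reproduce the ProdMX API: the function names and the domain labels (`dom_A` to `dom_D`) are hypothetical. Because the matrix only records presence or absence, the usual CSR value array is omitted and only the row pointers and column indices are stored.

```python
def build_csr(genome_domains, all_domains):
    """Encode a binary genome-by-domain matrix as CSR row pointers and
    column indices. Row i holds indices[indptr[i]:indptr[i + 1]]."""
    col_index = {domain: j for j, domain in enumerate(all_domains)}
    indptr = [0]
    indices = []
    for domains in genome_domains:
        # One sorted, de-duplicated list of column numbers per genome.
        indices.extend(sorted(col_index[d] for d in set(domains)))
        indptr.append(len(indices))
    return indptr, indices


def has_domain(indptr, indices, row, col):
    """True if the genome in `row` contains the domain in column `col`."""
    return col in indices[indptr[row]:indptr[row + 1]]


def genomes_with(indptr, indices, col):
    """Row numbers of all genomes that contain the domain in column `col`."""
    return [row for row in range(len(indptr) - 1)
            if has_domain(indptr, indices, row, col)]


# Hypothetical domain labels and genome contents, for illustration only.
all_domains = ["dom_A", "dom_B", "dom_C", "dom_D"]
genome_domains = [["dom_A", "dom_C"], ["dom_B"], ["dom_A", "dom_B", "dom_D"]]
indptr, indices = build_csr(genome_domains, all_domains)
print(indptr, indices)
print(genomes_with(indptr, indices, 0))
```

Only the six present entries are stored instead of all twelve cells. The saving grows with the matrix, since each genome contains a small fraction of all known domains.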
|
10
|
Abstract
The advancements of information technology and related processing techniques have created a fertile base for progress in many scientific fields and industries. In the fields of drug discovery and development, machine learning techniques have been used for the development of novel drug candidates. The methods for designing drug targets and novel drug discovery now routinely combine machine learning and deep learning algorithms to enhance the efficiency, efficacy, and quality of developed outputs. The generation and incorporation of big data, through technologies such as high-throughput screening and high-throughput computational analysis of databases used for both lead and target discovery, has increased the reliability of the machine learning and deep learning incorporated techniques. The use of virtual screening and encompassing online information has also been highlighted in developing lead synthesis pathways. In this review, machine learning and deep learning algorithms utilized in drug discovery and associated techniques will be discussed. The applications that produce promising results and methods will be reviewed.
|
11
|
A novel Cas9-targeted long-read assay for simultaneous detection of IDH1/2 mutations and clinically relevant MGMT methylation in fresh biopsies of diffuse glioma. Acta Neuropathol Commun 2020; 8:87. [PMID: 32563269 PMCID: PMC7305623 DOI: 10.1186/s40478-020-00963-0] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 05/23/2020] [Accepted: 06/11/2020] [Indexed: 12/20/2022]
Abstract
Molecular biomarkers provide both diagnostic and prognostic results for patients with diffuse glioma, the most common primary brain tumor in adults. Here, we used a long-read nanopore-based sequencing technique to simultaneously assess IDH mutation status and MGMT methylation level in 4 human cell lines and 8 fresh human brain tumor biopsies. Currently, these biomarkers are assayed separately, and results can take days to weeks. We demonstrated the use of nanopore Cas9-targeted sequencing (nCATS) to identify IDH1 and IDH2 mutations within 36 h and compared this approach against currently used clinical methods. nCATS was also able to simultaneously provide high-resolution evaluation of MGMT methylation levels not only at the promoter region, as with currently used methods, but also at CpGs across the proximal promoter region, the entirety of exon 1, and a portion of intron 1. We compared the methylation levels of all CpGs to MGMT expression in all cell lines and tumors and observed a positive correlation between intron 1 methylation and MGMT expression. Finally, we identified single nucleotide variants in 3 target loci. This pilot study demonstrates the feasibility of using nCATS as a clinical tool for cancer precision medicine.
|
12
|
Comparative genomics of hepatitis A virus, hepatitis C virus, and hepatitis E virus provides insights into the evolutionary history of Hepatovirus species. Microbiologyopen 2020; 9:e973. [PMID: 31742930 PMCID: PMC7002107 DOI: 10.1002/mbo3.973] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/01/2019] [Revised: 10/31/2019] [Accepted: 10/31/2019] [Indexed: 12/17/2022]
Abstract
The intraspecies genomic diversity of the single-strand RNA (+) virus species hepatitis A virus (Hepatovirus), hepatitis C virus (Hepacivirus), and hepatitis E virus (Orthohepevirus) was compared. These viral species all can cause liver inflammation (hepatitis), but share no gene similarity. The codon usage of human hepatitis A virus (HAV) is suboptimal for replication in its host, a characteristic it shares with taxonomically related rodent, simian, and bat hepatitis A virus species. We found this codon usage to be strikingly similar to that of Triatoma virus that infects blood-sucking kissing bugs. The codon usage of that virus is well adapted to its insect host. The codon usage of HAV is also similar to other invertebrate viruses of various taxonomic families. An evolutionary ancestor of HAV and related virus species is hypothesized to be an insect virus that underwent a host jump to infect mammals. The similarity between HAV and invertebrate viruses goes beyond codon usage, as they also share amino acid composition characteristics, while not sharing direct sequence homology. In contrast, hepatitis C virus and hepatitis E virus are highly similar in codon usage preference, nucleotide composition, and amino acid composition, and share these characteristics with Human pegivirus A, West Nile virus, and Zika virus. We present evidence that these observations are only partly explained by differences in nucleotide composition of the complete viral codon regions. We consider the combination of nucleotide composition, amino acid composition, and codon usage preference suitable to provide information on possible evolutionary similarities between distant virus species that cannot be investigated by phylogeny.
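Codon usage comparisons of the kind described above start from per-codon frequencies. The sketch below shows one simple way to compute them and to compare two coding sequences. It is an assumption-laden illustration, not the authors' method: the distance used here (half the summed absolute frequency difference) is chosen for simplicity, the function names are hypothetical, and the sequences are toy data.

```python
from collections import Counter


def codon_frequencies(cds):
    """Relative frequency of each codon in an in-frame coding sequence.
    Trailing bases that do not fill a codon are ignored."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def usage_distance(freq_a, freq_b):
    """Half the summed absolute difference between two codon frequency
    profiles: 0 for identical usage, 1 for no codons in common."""
    codons = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(c, 0) - freq_b.get(c, 0)) for c in codons) / 2


# Toy sequences that all encode four alanines (hypothetical data).
seq_a = codon_frequencies("GCTGCTGCAGCA")
seq_b = codon_frequencies("GCTGCTGCTGCA")
seq_c = codon_frequencies("GCCGCCGCGGCG")
print(usage_distance(seq_a, seq_b))
print(usage_distance(seq_a, seq_c))
```

The three toy sequences encode the same peptide, so their amino acid composition is identical while their codon usage differs. That separation is the reason codon usage can carry information that protein sequence comparison does not.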
|
13
|
An assessment of Oxford Nanopore sequencing for human gut metagenome profiling: A pilot study of head and neck cancer patients. J Microbiol Methods 2019; 166:105739. [PMID: 31626891 DOI: 10.1016/j.mimet.2019.105739] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 05/10/2019] [Revised: 10/03/2019] [Accepted: 10/08/2019] [Indexed: 12/21/2022]
Abstract
Gut metagenome profiling using the Oxford Nanopore Technologies (ONT) sequencer was assessed in a pilot-sized study of 10 subjects. The taxonomic abundance of gut microbiota derived from ONT was comparable with Illumina Technology (IT) for the high-abundance species. IT better detected low-abundance species through amplification, when material was limited.
|
14
|
SMARC-B1 deficient sinonasal carcinoma metastasis to the brain with next generation sequencing data: a case report of perineural invasion progressing to leptomeningeal invasion. BMC Cancer 2019; 19:827. [PMID: 31438887 PMCID: PMC6704572 DOI: 10.1186/s12885-019-6043-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 04/18/2019] [Accepted: 08/15/2019] [Indexed: 12/15/2022]
Abstract
BACKGROUND SMARCB1-deficient sinonasal carcinoma (SDSC) is an aggressive subtype of head and neck cancers that has a poor prognosis despite multimodal therapy. We present a unique case with next generation sequencing data of a patient who had SDSC with perineural invasion to the trigeminal nerve that progressed to a brain metastasis and eventually leptomeningeal spread. CASE PRESENTATION A 42-year-old female presented with facial pain and had resection of a tumor along the V2 division of the trigeminal nerve on the right. She underwent adjuvant stereotactic radiation. She developed further neurological symptoms and imaging demonstrated the tumor had infiltrated into the cavernous sinus as well as intradurally. She had surgical resection for removal of her brain metastasis and decompression of the cavernous sinus. Following her second surgery, she had adjuvant radiation and chemotherapy. Several months later she had quadriparesis and imaging was consistent with leptomeningeal spread. She underwent palliative radiation and ultimately transitioned quickly to comfort care and expired. Overall survival from time of diagnosis was 13 months. Next generation sequencing was carried out on her primary tumor and brain metastasis. The brain metastatic tissue had an increased tumor mutational burden in comparison to the primary. CONCLUSIONS This is the first report of SDSC with perineural invasion progressing to leptomeningeal carcinomatosis. Continued next generation sequencing of the primary and metastatic tissue by clinicians is encouraged to provide further insights into metastatic progression of rare solid tumors.
|
15
|
|
16
|
In silico Selection of Amplification Targets for Rapid Polymorphism Screening in Ebola Virus Outbreaks. Front Microbiol 2019; 10:857. [PMID: 31080442 PMCID: PMC6497787 DOI: 10.3389/fmicb.2019.00857] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2019] [Accepted: 04/03/2019] [Indexed: 11/13/2022]
Abstract
To achieve maximum transmission chain tracking in the current Ebola outbreak, whole genome sequencing (WGS) has been proposed to provide optimal information. However, WGS remains a costly and time-intensive procedure that is poorly suited for the large numbers of samples being generated, especially under severe time and work-environment constraints as in the present DRC outbreak. To better prepare for future outbreaks, where an apparent single outbreak may actually represent overlapping outbreaks caused by independent variants, and where rapid identification of emerging new transmission chains will be essential, a more practical method would be to amplify and sequence genomic areas that reveal the highest information to differentiate EBOV variants. We have identified four highly informative polymorphism PCR sequencing targets, suitable for rapid tracing of transmission chains and identification of new sources of Ebola outbreaks, an approach which will be far more practical in the field than WGS.
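One way to picture the target selection described above is a sliding-window scan over aligned genomes that scores each candidate region by how many distinct sequence variants it separates. The sketch below is illustrative only and is not the authors' procedure: the scoring rule, the window size, the function name and the toy alignment are all assumptions.

```python
def most_informative_window(alignment, window):
    """Scan an alignment of equal-length sequences and return
    (start, end, n_haplotypes) for the window that separates the most
    distinct haplotypes. Ties go to the leftmost window."""
    length = len(alignment[0])
    best = None
    for start in range(length - window + 1):
        # Distinct sub-sequences in this window = variants it can tell apart.
        haplotypes = {seq[start:start + window] for seq in alignment}
        if best is None or len(haplotypes) > best[2]:
            best = (start, start + window, len(haplotypes))
    return best


# Toy alignment of four sequences (hypothetical data).
alignment = [
    "AAAAAAAAAA",
    "AAACAAAAAA",
    "AAACGAAAAA",
    "AAAAAAAAAT",
]
print(most_informative_window(alignment, 3))
```

In the toy alignment no 3-base window separates all four sequences, which is why combining several targets, as the study does with four, can resolve more variants than any single amplicon.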
|
17
|
Mechanisms linking preterm birth to onset of cardiovascular disease later in adulthood. Eur Heart J 2019; 40:1107-1112. [PMID: 30753448 PMCID: PMC6451766 DOI: 10.1093/eurheartj/ehz025] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Received: 10/04/2018] [Revised: 12/03/2018] [Accepted: 01/18/2019] [Indexed: 12/23/2022]
Abstract
Cardiovascular disease (CVD) rates in adulthood are high in premature infants; unfortunately, the underlying mechanisms are not well defined. In this review, we discuss potential pathways that could lead to CVD in premature babies. Studies show intense oxidant stress and inflammation at tissue levels in these neonates. Alterations in lipid profile, foetal epigenomics, and gut microbiota in these infants may also underlie the development of CVD. Recently, probiotic bacteria, such as the mucin-degrading bacterium Akkermansia muciniphila have been shown to reduce inflammation and prevent heart disease in animal models. All this information might enable scientists and clinicians to target pathways to act early to curtail the adverse effects of prematurity on the cardiovascular system. This could lead to primary and secondary prevention of CVD and improve survival among preterm neonates later in adult life.
|
18
|
Rapid Sequencing of Multiple RNA Viruses in Their Native Form. Front Microbiol 2019; 10:260. [PMID: 30858830 PMCID: PMC6398364 DOI: 10.3389/fmicb.2019.00260] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 11/18/2018] [Accepted: 01/31/2019] [Indexed: 12/14/2022]
Abstract
Long-read nanopore sequencing by a MinION device offers the unique possibility to directly sequence native RNA. We combined an enzymatic poly-A tailing reaction with the native RNA sequencing to (i) sequence a complex population of single-stranded (ss)RNA viruses in parallel, (ii) detect genome and subgenomic mRNA/mRNA simultaneously, (iii) detect a complex transcriptomic architecture without the need for assembly, and (iv) enable real-time detection. Using this protocol, positive-ssRNA, negative-ssRNA, with/without a poly(A)-tail, segmented/non-segmented genomes were mixed and sequenced in parallel. Mapping of the generated sequences on the reference genomes showed 100% length recovery with up to 97% identity. This work provides a proof of principle and the validity of this strategy, opening up a wide range of applications to study RNA viruses.
|
19
|
Abstract
Publication interests should not limit access to public data
|
20
|
Genome-Based Comparison of Clostridioides difficile: Average Amino Acid Identity Analysis of Core Genomes. Microb Ecol 2018; 76:801-813. [PMID: 29445826 PMCID: PMC6132499 DOI: 10.1007/s00248-018-1155-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 08/30/2017] [Accepted: 02/02/2018] [Indexed: 06/08/2023]
Abstract
Infections due to Clostridioides difficile (previously known as Clostridium difficile) are a major problem in hospitals, where cases can be caused by community-acquired strains as well as by nosocomial spread. Whole genome sequences from clinical samples contain a lot of information, but that information needs to be analyzed and compared in such a way that the outcome is useful for clinicians or epidemiologists. Here, we compare 663 publicly available complete genome sequences of C. difficile using average amino acid identity (AAI) scores. This analysis revealed that most of these genomes (640, 96.5%) clearly belong to the same species, while the remaining 23 genomes produce four distinct clusters within the Clostridioides genus. The main C. difficile cluster can be further divided into sub-clusters, depending on the chosen cutoff. We demonstrate that MLST, either based on partial or full gene-length, results in biased estimates of genetic differences and does not capture the true degree of similarity or differences of complete genomes. Presence of genes coding for C. difficile toxins A and B (ToxA/B), as well as the binary C. difficile toxin (CDT), was deduced from their unique PfamA domain architectures. Out of the 663 C. difficile genomes, 535 (80.7%) contained at least one copy of ToxA or ToxB, while these genes were missing from 128 genomes. Although some clusters were enriched for toxin presence, these genes are variably present in a given genetic background. The CDT genes were found in 191 genomes, which were restricted to a few clusters only, and only one cluster lacked the toxin A/B genes consistently. A total of 310 genomes contained ToxA/B without CDT (47%). Further, published metagenomic data from stools were used to assess the presence of C. difficile sequences in blinded cases of C. difficile infection (CDI) and controls, to test if metagenomic analysis is sensitive enough to detect the pathogen, and to establish strain relationships between cases from the same hospital. We conclude that metagenomics can contribute to the identification of CDI and can assist in characterization of the most probable causative strain in CDI patients.
|
21
|
Abstract
We sequenced the virus genomes from 3 pregnant women in Thailand with Zika virus diagnoses. All had infections with the Asian lineage. The woman infected at gestational week 9, and not those infected at weeks 20 and 24, had a fetus with microcephaly. Asian lineage Zika viruses can cause microcephaly.
|
22
|
Complete genomic and transcriptional landscape analysis using third-generation sequencing: a case study of Saccharomyces cerevisiae CEN.PK113-7D. Nucleic Acids Res 2018; 46:e38. [PMID: 29346625 PMCID: PMC5909453 DOI: 10.1093/nar/gky014] [Citation(s) in RCA: 95] [Impact Index Per Article: 15.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2017] [Revised: 01/03/2018] [Accepted: 01/05/2018] [Indexed: 12/13/2022] Open
Abstract
Completion of eukaryal genomes can be a difficult task given the highly repetitive sequences along the chromosomes and the short read lengths of second-generation sequencing. Saccharomyces cerevisiae strain CEN.PK113-7D, widely used as a model organism and a cell factory, was selected for this study to demonstrate the superior capability of very long sequence reads for de novo genome assembly. We generated long reads using two common third-generation sequencing technologies (Oxford Nanopore Technology (ONT) and Pacific Biosciences (PacBio)) and used short reads obtained using Illumina sequencing for error correction. Assembly of the reads derived from all three technologies resulted in complete sequences for all 16 yeast chromosomes, as well as the mitochondrial chromosome, in one step. Further, we identified three types of DNA methylation (5mC, 4mC and 6mA). Comparison between the reference strain S288C and strain CEN.PK113-7D identified chromosomal rearrangements against a background of similar gene content between the two strains. We identified full-length transcripts through ONT direct RNA sequencing technology. This allows for the identification of transcriptional landscapes, including untranslated regions (UTRs) (5' UTR and 3' UTR) as well as differential gene expression quantification. About 91% of the predicted transcripts could be consistently detected across biological replicates grown either on glucose or ethanol. Direct RNA sequencing identified many polyadenylated non-coding RNAs, rRNAs, telomere-RNA, long non-coding RNA and antisense RNA. This work demonstrates a strategy to obtain complete genome sequences and transcriptional landscapes that can be applied to other eukaryal organisms.
|
23
|
Abstract
Background Zika virus (ZIKV) is an emerging human pathogen. Since its arrival in the Western hemisphere, from Africa via Asia, it has become a serious threat to pregnant women, causing microcephaly and other neuropathies in developing fetuses. The mechanisms behind these teratogenic effects are unknown, although epidemiological evidence suggests that microcephaly is not associated with the original, African lineage of ZIKV. The sequences of 196 published ZIKV genomes were used to assess whether recently proposed mechanistic explanations for microcephaly are supported by molecular level changes that may have increased its virulence since the virus left Africa. For this we performed phylogenetic, recombination, adaptive evolution and tetramer frequency analyses, and compared protein sequences for the presence of protease cleavage sites, Pfam domains, glycosylation sites, signal peptides, trans-membrane protein domains, and phosphorylation sites. Results Recombination events within or between Asian and Brazilian lineages were not observed, and likewise there were no differences in protease cleavage, glycosylation sites, signal peptides or trans-membrane domains between African and Brazilian strains. The frequency of Retinoic Acid Response Element (RARE) sequences was increased in Brazilian strains. Genetic adaptation was also apparent by tetramer signatures that had undergone major changes in the past but have stabilized in the Brazilian lineage despite subsequent geographic spread, suggesting the viral population presently propagates in the same host species in various regions. Evidence for selection pressure was recognized for several amino acid sites in the Brazilian lineage compared to the African lineage, mainly in nonstructural proteins, especially protein NS4B. A number of these positively selected mutations resulted in an increased potential to be phosphorylated in the Brazilian lineage compared to the African lineage, which may have increased their potential to interfere with neural fetal development. Conclusions ZIKV seems to have adapted to a limited number of hosts, including humans, during which its virulence increased. Its protein NS4B, together with NS4A, has recently been shown to inhibit Akt-mTOR signaling in human fetal neural stem cells, a key pathway for brain development. We hypothesize that positive selection of novel phosphorylation sites in the protein NS4B of the Brazilian lineage could interfere with phosphorylation of Akt and mTOR, impairing Akt-mTOR signaling and this may result in an increased risk for developmental neuropathies. Electronic supplementary material The online version of this article (10.1186/s12859-017-1894-3) contains supplementary material, which is available to authorized users.
|
24
|
The qacC Gene Has Recently Spread between Rolling Circle Plasmids of Staphylococcus, Indicative of a Novel Gene Transfer Mechanism. Front Microbiol 2016; 7:1528. [PMID: 27729906 PMCID: PMC5037232 DOI: 10.3389/fmicb.2016.01528] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2016] [Accepted: 09/12/2016] [Indexed: 11/13/2022] Open
Abstract
Resistance of Staphylococcus species to quaternary ammonium compounds, frequently used as disinfectants and biocides, can be attributed to qac genes. Most qac gene products belong to the Small Multidrug Resistant (SMR) protein family, and are often encoded by rolling-circle (RC) replicating plasmids. Four classes of SMR-type qac gene families have been described in Staphylococcus species: qacC, qacG, qacJ, and qacH. Within their class, these genes are highly conserved, but qacC genes are extremely conserved, although they are found in variable plasmid backgrounds. The lower degree of sequence identity of these plasmids compared to the strict nucleotide conservation of their qacC means that this gene has recently spread. In the absence of insertion sequences or other genetic elements explaining the mobility, we sought an explanation of mobilization by sequence comparison. Publicly available sequences of qac genes, their flanking genes and the replication gene that is invariably present in RC-plasmids were compared to reconstruct the evolutionary history of these plasmids and to explain the recent spread of qacC. Here we propose a new model that explains how qacC is mobilized and transferred to acceptor RC-plasmids without assistance of other genes, by means of its location in between the Double Strand replication Origin (DSO) and the Single-Strand replication Origin (SSO). The proposed mobilization model of this DSO-qacC-SSO element represents a novel mechanism of gene mobilization in RC-plasmids, which has also been employed by other genes, such as lnuA (conferring lincomycin resistance). The proposed gene mobility has contributed to the wide spread of clinically relevant resistance genes in Staphylococcus populations.
|
25
|
Metabolic functions of Pseudomonas fluorescens strains from Populus deltoides depend on rhizosphere or endosphere isolation compartment. Front Microbiol 2015; 6:1118. [PMID: 26528266 PMCID: PMC4604316 DOI: 10.3389/fmicb.2015.01118] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2015] [Accepted: 09/28/2015] [Indexed: 12/13/2022] Open
Abstract
The bacterial microbiota of plants is diverse, with 1000s of operational taxonomic units (OTUs) associated with any individual plant. In this work, we used phenotypic analysis, comparative genomics, and metabolic models to investigate the differences between 19 sequenced Pseudomonas fluorescens strains. These isolates represent a single OTU and were collected from the rhizosphere and endosphere of Populus deltoides. While no traits were exclusive to either endosphere or rhizosphere P. fluorescens isolates, multiple pathways relevant for plant-bacterial interactions are enriched in endosphere isolate genomes. Further, growth phenotypes such as phosphate solubilization, protease activity, denitrification and root growth promotion are biased toward endosphere isolates. Endosphere isolates have significantly more metabolic pathways for plant signaling compounds and an increased metabolic range that includes utilization of energy rich nucleotides and sugars, consistent with endosphere colonization. Rhizosphere P. fluorescens have fewer pathways representative of plant-bacterial interactions but show metabolic bias toward chemical substrates often found in root exudates. This work reveals the diverse functions that may contribute to colonization of the endosphere by bacteria and are enriched among closely related isolates.
|
26
|
Abstract
The 2014 Ebola outbreak in West Africa is the largest documented for this virus. To examine the dynamics of this genome, we compare more than 100 currently available ebolavirus genomes to each other and to other viral genomes. Based on oligomer frequency analysis, the family Filoviridae forms a distinct group from all other sequenced viral genomes. All filovirus genomes sequenced to date encode proteins with similar functions and gene order, although there is considerable divergence in sequences between the three genera Ebolavirus, Cuevavirus and Marburgvirus within the family Filoviridae. Whereas all ebolavirus genomes are quite similar (multiple sequences of the same strain are often identical), variation is most common in the intergenic regions and within specific areas of the genes encoding the glycoprotein (GP), nucleoprotein (NP) and polymerase (L). We predict regions that could contain epitope-binding sites, which might be good vaccine targets. This information, combined with glycosylation sites and experimentally determined epitopes, can identify the most promising regions for the development of therapeutic strategies. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
|
27
|
Insights from 20 years of bacterial genome sequencing. Funct Integr Genomics 2015; 15:141-61. [PMID: 25722247 PMCID: PMC4361730 DOI: 10.1007/s10142-015-0433-4] [Citation(s) in RCA: 391] [Impact Index Per Article: 43.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2015] [Revised: 02/11/2015] [Accepted: 02/12/2015] [Indexed: 12/18/2022]
Abstract
Since the first two complete bacterial genome sequences were published in 1995, the science of bacteria has dramatically changed. Using third-generation DNA sequencing, it is possible to completely sequence a bacterial genome in a few hours and identify some types of methylation sites along the genome as well. Sequencing of bacterial genome sequences is now a standard procedure, and the information from tens of thousands of bacterial genomes has had a major impact on our views of the bacterial world. In this review, we explore a series of questions to highlight some insights that comparative genomics has produced. To date, there are genome sequences available from 50 different bacterial phyla and 11 different archaeal phyla. However, the distribution is quite skewed towards a few phyla that contain model organisms. But the breadth is continuing to improve, with projects dedicated to filling in less characterized taxonomic groups. The clustered regularly interspaced short palindromic repeats (CRISPR)-Cas system provides bacteria with immunity against viruses, which outnumber bacteria by tenfold. How fast can we go? Second-generation sequencing has produced a large number of draft genomes (close to 90 % of bacterial genomes in GenBank are currently not complete); third-generation sequencing can potentially produce a finished genome in a few hours, and at the same time provide methylation sites along the entire chromosome. The diversity of bacterial communities is extensive as is evident from the genome sequences available from 50 different bacterial phyla and 11 different archaeal phyla. Genome sequencing can help in classifying an organism, and in the case where multiple genomes of the same species are available, it is possible to calculate the pan- and core genomes; comparison of more than 2000 Escherichia coli genomes finds an E. coli core genome of about 3100 gene families and a total of about 89,000 different gene families. Why do we care about bacterial genome sequencing? There are many practical applications, such as genome-scale metabolic modeling, biosurveillance, bioforensics, and infectious disease epidemiology. In the near future, high-throughput sequencing of patient metagenomic samples could revolutionize medicine in terms of speed and accuracy of finding pathogens and knowing how to treat them.
|
28
|
Abstract
BACKGROUND More than 80% of the microbial genomes in GenBank are of 'draft' quality (12,553 draft vs. 2,679 finished, as of October, 2013). We have examined all the microbial DNA sequences available for complete, draft, and Sequence Read Archive genomes in GenBank as well as three other major public databases, and assigned quality scores for more than 30,000 prokaryotic genome sequences. RESULTS Scores were assigned using four categories: the completeness of the assembly, the presence of full-length rRNA genes, tRNA composition and the presence of a set of 102 conserved genes in prokaryotes. Most (~88%) of the genomes had quality scores of 0.8 or better and can be safely used for standard comparative genomics analysis. We compared genomes across factors that may influence the score. We found that although sequencing depth coverage of over 100x did not ensure a better score, sequencing read length was a better indicator of sequencing quality. With few exceptions, most of the 30,000 genomes have nearly all the 102 essential genes. CONCLUSIONS The score can be used to set thresholds for screening data when analyzing "all published genomes" and reference data is either not available or not applicable. The scores highlighted organisms for which commonly used tools do not perform well. This information can be used to improve tools and to serve a broad group of users as more diverse organisms are sequenced. Unexpectedly, the comparison of predicted tRNAs across 15,000 high quality genomes showed that anticodons beginning with an 'A' (codons ending with a 'U') are almost non-existent, with the exception of one arginine codon (CGU); this has been noted previously in the literature for a few genomes, but not with the depth found here.
|
29
|
Comparative genomics to delineate pathogenic potential in non-O157 Shiga toxin-producing Escherichia coli (STEC) from patients with and without haemolytic uremic syndrome (HUS) in Norway. PLoS One 2014; 9:e111788. [PMID: 25360710 PMCID: PMC4216125 DOI: 10.1371/journal.pone.0111788] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2014] [Accepted: 09/30/2014] [Indexed: 11/19/2022] Open
Abstract
Shiga toxin-producing Escherichia coli (STEC) cause infections in humans ranging from asymptomatic carriage to bloody diarrhoea and haemolytic uremic syndrome (HUS). Here we present a whole genome comparison of Norwegian non-O157 STEC strains with the aim of distinguishing between strains with the potential to cause HUS and less virulent strains. Whole genome sequencing and comparisons were performed across 95 non-O157 STEC strains. Twenty-three of these were classified as HUS-associated, including strains from patients with HUS (n = 19) and persons with an epidemiological link to a HUS-case (n = 4). Genomic comparison revealed considerable heterogeneity in gene content across the 95 STEC strains. A clear difference in gene profile was observed between strains with and without the Locus of Enterocyte Effacement (LEE) pathogenicity island. Phylogenetic analysis of the core genome showed a high degree of diversity among the STEC strains, but all HUS-associated STEC strains were distributed in two distinct clusters within phylogroup B1. However, non-HUS strains were also found in these clusters. A number of accessory genes were found to be significantly overrepresented among HUS-associated STEC, but none of them were unique to this group of strains, suggesting that different sets of genes may contribute to the pathogenic potential in different phylogenetic STEC lineages. In this study we were not able to clearly distinguish between HUS-associated and non-HUS non-O157 STEC by extensive genome comparisons. Our results indicate that STECs from different phylogenetic backgrounds have independently acquired virulence genes that determine pathogenic potential, and that the content of such genes is overlapping between HUS-associated and non-HUS strains.
|
30
|
Abstract
We have compared chromosome-specific genes in a set of 18 finished Vibrio genomes and, in addition, calculated the pan- and core-genomes from a data set of more than 250 draft Vibrio genome sequences. These genomes come from 9 known species and 2 unknown species. Within the finished chromosomes, we find a core set of 1269 encoded protein families for chromosome 1, and a core of 252 encoded protein families for chromosome 2. Many of these core proteins are also found in the draft genomes (although which chromosome they are located on is unknown). Of the chromosome specific core protein families, 1169 and 153 are uniquely found in chromosomes 1 and 2, respectively. Gene ontology (GO) terms for each of the protein families were determined, and the different sets for each chromosome were compared. A total of 363 different "Molecular Function" GO categories were found for chromosome 1 specific protein families, and these include several broad activities: pyridoxine 5' phosphate synthetase, glucosylceramidase, heme transport, DNA ligase, amino acid binding, and ribosomal components; in contrast, chromosome 2 specific protein families have only 66 Molecular Function GO terms and include many membrane-associated activities, such as ion channels, transmembrane transporters, and electron transport chain proteins. Thus, it appears that whilst there are many "housekeeping systems" encoded in chromosome 1, there are far fewer core functions found in chromosome 2. However, the presence of many membrane-associated encoded proteins in chromosome 2 is surprising.
|
31
|
Abstract
The Firmicutes represent a major component of the intestinal microflora. The intestinal Firmicutes are a large, diverse group of organisms, many of which are poorly characterized due to their anaerobic growth requirements. Although most Firmicutes are Gram positive, members of the class Negativicutes, including the genus Veillonella, stain Gram negative. Veillonella are among the most abundant organisms of the oral and intestinal microflora of animals and humans, in spite of being strict anaerobes. In this work, the genomes of 24 Negativicutes, including eight Veillonella spp., are compared to 20 other Firmicutes genomes; a further 101 prokaryotic genomes were included, covering 26 phyla. Thus a total of 145 prokaryotic genomes were analyzed by various methods to investigate the apparent conflict of the Veillonella Gram stain and their taxonomic position within the Firmicutes. Comparison of the genome sequences confirms that the Negativicutes are distantly related to Clostridium spp., based on 16S rRNA, complete genomic DNA sequences, and a consensus tree based on conserved proteins. The genus Veillonella is relatively homogeneous: inter-genus pair-wise comparison identifies at least 1,350 shared proteins, although less than half of these are found in any given Clostridium genome. Only 27 proteins are found conserved in all analyzed prokaryote genomes. Veillonella has distinct metabolic properties, and significant similarities to genomes of Proteobacteria are not detected, with the exception of a shared LPS biosynthesis pathway. The clade within the class Negativicutes to which the genus Veillonella belongs exhibits unique properties, most of which are in common with Gram-positives and some with Gram negatives. They are only distantly related to Clostridia, but are even less closely related to Gram-negative species. 
Though the Negativicutes stain Gram-negative and possess two membranes, the genome and proteome analysis presented here confirm their place within the (mainly) Gram positive phylum of the Firmicutes. Further studies are required to unveil the evolutionary history of the Veillonella and other Negativicutes.
|
32
|
Abstract
Background: Prediction of the optimal habitat conditions for a given bacterium, based on genome sequence alone, would be of value for scientific as well as industrial purposes. One example of such a habitat adaptation is the requirement for oxygen. In spite of good genome data availability, there have been only a few attempts to predict bacterial oxygen requirements using genome sequences. Here, we describe a method for distinguishing aerobic, anaerobic and facultative anaerobic bacteria, based on genome sequence-derived input, using naive Bayesian inference. In contrast, other studies found in the literature only demonstrate the ability to distinguish two classes at a time. Results: The results shown in the present study are as good as or better than comparable methods previously described in the scientific literature, with an arguably simpler method, when results are directly compared. This study further compares the performance of a single-step naive Bayesian prediction of the three included classifications with that of a simple Bayesian network with two steps. A two-step network, distinguishing first respiring from non-respiring organisms, followed by the distinction of aerobe and facultative anaerobe organisms within the respiring group, is found to perform best. Conclusions: A simple naive Bayesian network based on the presence or absence of specific protein domains within a genome is an effective and easy way to predict bacterial habitat preferences, such as oxygen requirement.
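As an illustration of the general idea only, a naive Bayesian classifier on domain presence/absence might be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names `train` and `predict`, the Laplace smoothing parameter `alpha`, the domain labels (`domA` to `domD`) and the toy training genomes are all assumptions made for the example.

```python
import math
from collections import defaultdict

def train(genomes, labels, alpha=1.0):
    """Fit a Bernoulli naive Bayes model.

    genomes: list of sets of domain IDs present in each genome (hypothetical IDs).
    labels:  list of class labels, one per genome.
    alpha:   Laplace smoothing constant (an assumption of this sketch).
    """
    vocab = set().union(*genomes)
    n_class = defaultdict(int)
    n_feat = defaultdict(lambda: defaultdict(int))
    for genome, label in zip(genomes, labels):
        n_class[label] += 1
        for domain in genome:
            n_feat[label][domain] += 1
    model = {}
    for label, n in n_class.items():
        log_prior = math.log(n / len(labels))
        # Smoothed probability that a domain is present in a genome of this class.
        p_present = {d: (n_feat[label][d] + alpha) / (n + 2 * alpha) for d in vocab}
        model[label] = (log_prior, p_present)
    return vocab, model

def predict(vocab, model, genome):
    """Return the class with the highest log posterior for one genome."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, p_present) in model.items():
        score = log_prior
        for d in vocab:
            p = p_present[d]
            score += math.log(p if d in genome else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data: invented domain content for two classes.
train_genomes = [{"domA", "domB"}, {"domA"}, {"domC"}, {"domC", "domD"}]
train_labels = ["aerobe", "aerobe", "anaerobe", "anaerobe"]
vocab, model = train(train_genomes, train_labels)
print(predict(vocab, model, {"domA"}))
print(predict(vocab, model, {"domC", "domD"}))
```

A two-step scheme like the one described in the abstract could, under the same assumptions, be built by training one such model for respiring versus non-respiring genomes and a second one on the respiring subset only.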
|
33
|
|
34
|
Amino acid usage is asymmetrically biased in AT- and GC-rich microbial genomes. PLoS One 2013; 8:e69878. [PMID: 23922837 PMCID: PMC3724673 DOI: 10.1371/journal.pone.0069878] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2013] [Accepted: 06/14/2013] [Indexed: 11/18/2022] Open
Abstract
INTRODUCTION Genomic base composition ranges from less than 25% AT to more than 85% AT in prokaryotes. Since only a small fraction of prokaryotic genomes is not protein coding even a minor change in genomic base composition will induce profound protein changes. We examined how amino acid and codon frequencies were distributed in over 2000 microbial genomes and how these distributions were affected by base compositional changes. In addition, we wanted to know how genome-wide amino acid usage was biased in the different genomes and how changes to base composition and mutations affected this bias. To carry this out, we used a Generalized Additive Mixed-effects Model (GAMM) to explore non-linear associations and strong data dependences in closely related microbes; principal component analysis (PCA) was used to examine genomic amino acid- and codon frequencies, while the concept of relative entropy was used to analyze genomic mutation rates. RESULTS We found that genomic amino acid frequencies carried a stronger phylogenetic signal than codon frequencies, but that this signal was weak compared to that of genomic %AT. Further, in contrast to codon usage bias (CUB), amino acid usage bias (AAUB) was differently distributed in AT- and GC-rich genomes in the sense that AT-rich genomes did not prefer specific amino acids over others to the same extent as GC-rich genomes. AAUB was also associated with relative entropy; genomes with low AAUB contained more random mutations as a consequence of relaxed purifying selection than genomes with higher AAUB. CONCLUSION Genomic base composition has a substantial effect on both amino acid- and codon frequencies in bacterial genomes. While phylogeny influenced amino acid usage more in GC-rich genomes, AT-content was driving amino acid usage in AT-rich genomes. We found the GAMM model to be an excellent tool to analyze the genomic data used in this study.
|
35
|
Abstract
Background The preferred habitat of a given bacterium can provide a hint of which types of enzymes of potential industrial interest it might produce. These might include enzymes that are stable and active at very high or very low temperatures. Being able to accurately predict this based on a genomic sequence would thus allow for an efficient and targeted search for production organisms, reducing the need for culturing experiments. Results This study found a total of 40 protein families useful for distinction between three thermophilicity classes (thermophiles, mesophiles and psychrophiles). The predictive performance of these protein families was compared to that of 87 basic sequence features (relative use of amino acids and codons, genomic and 16S rDNA AT content and genome size). When using naïve Bayesian inference, it was possible to correctly predict the optimal temperature range with a Matthews correlation coefficient of up to 0.68. The best predictive performance was always achieved by including protein families as well as structural features, compared to either of these alone. A dedicated computer program was created to perform these predictions. Conclusions This study shows that protein families associated with specific thermophilicity classes can provide effective input data for thermophilicity prediction, and that the naïve Bayesian approach is effective for such a task. The program created for this study is able to efficiently distinguish between thermophilic, mesophilic and psychrophilic adapted bacterial genomes.
|
36
|
Estimating variation within the genes and inferring the phylogeny of 186 sequenced diverse Escherichia coli genomes. BMC Genomics 2012; 13:577. [PMID: 23114024 PMCID: PMC3575317 DOI: 10.1186/1471-2164-13-577] [Citation(s) in RCA: 155] [Impact Index Per Article: 12.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2012] [Accepted: 10/22/2012] [Indexed: 01/18/2023] Open
Abstract
Background Escherichia coli exists in commensal and pathogenic forms. By measuring the variation of individual genes across more than a hundred sequenced genomes, gene variation can be studied in detail, including the number of mutations found for any given gene. This knowledge will be useful for creating better phylogenies, for determination of molecular clocks and for improved typing techniques. Results We find 3,051 gene clusters/families present in at least 95% of the genomes and 1,702 gene clusters present in 100% of the genomes. The former 'soft core' of about 3,000 gene families is perhaps more biologically relevant, especially considering that many of these genome sequences are draft quality. The E. coli pan-genome for this set of isolates contains 16,373 gene clusters. A core-gene tree, based on alignment, and a pan-genome tree, based on gene presence/absence, map the relatedness of the 186 sequenced E. coli genomes. The core-gene tree displays high confidence and divides the E. coli strains into the observed MLST type clades and also separates defined phylotypes. Conclusion The results of comparing a large and diverse E. coli dataset support the theory that reliable and good resolution phylogenies can be inferred from the core-genome. The results further suggest that the resolution at the isolate level may subsequently be improved by targeting more variable genes. The use of whole genome sequencing will make it possible to eliminate, or at least reduce, the need for several typing steps used in traditional epidemiology.
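The distinction between a strict core (100% of genomes) and a 'soft core' (at least 95%) can be illustrated with a small sketch. The function name `core_gene_families`, the input layout (gene family mapped to the set of genomes carrying it) and the toy data are assumptions for illustration; they are not taken from the study.

```python
def core_gene_families(presence, n_genomes, min_percent):
    """Return gene families found in at least min_percent of the genomes.

    presence:    dict mapping gene family ID -> set of genome IDs carrying it.
    n_genomes:   total number of genomes in the comparison.
    min_percent: integer threshold, e.g. 100 for a strict core, 95 for a soft core.
    """
    # Integer arithmetic avoids floating-point edge cases at the threshold.
    return {
        family
        for family, genomes in presence.items()
        if len(genomes) * 100 >= min_percent * n_genomes
    }

# Toy data: 20 genomes, three hypothetical gene families.
presence = {
    "famA": set(range(20)),  # present in all 20 genomes
    "famB": set(range(19)),  # present in 19 of 20 genomes (95%)
    "famC": set(range(10)),  # present in half of the genomes
}
strict_core = core_gene_families(presence, 20, 100)
soft_core = core_gene_families(presence, 20, 95)
pan_genome = set(presence)  # every family seen in at least one genome

print(sorted(strict_core))  # ['famA']
print(sorted(soft_core))    # ['famA', 'famB']
print(len(pan_genome))      # 3
```

With draft-quality genomes, a family missing from a single assembly drops out of the strict core but stays in the soft core, which is the motivation the abstract gives for preferring the 95% threshold.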
|
37
|
Abstract
The study of microbial pangenomes relies on the computation of gene families, i.e. the clustering of coding sequences into groups of essentially similar genes. There is no standard approach to obtain such gene families. Ideally, the gene family computations should be robust against errors in the annotation of genes in various genomes. In an attempt to achieve this robustness, we propose to cluster sequences by their domain sequence, i.e. the ordered sequence of domains in their protein sequence. In a study of 347 genomes from Escherichia coli we find on average around 4500 proteins having hits in Pfam-A in every genome, clustering into around 2500 distinct domain sequence families in each genome. Across all genomes we find a total of 5724 such families. A binomial mixture model approach indicates this is around 95% of all domain sequences we would expect to see in E. coli in the future. A Heaps law analysis indicates the population of domain sequences is larger, but this analysis is also very sensitive to smaller changes in the computation procedure. The resolution between strains is good despite the coarse grouping obtained by domain sequence families. Clustering sequences by their ordered domain content gives us domain sequence families, which are robust to errors in the gene prediction step. The computational load of the procedure scales linearly with the number of genomes, which is needed for the future explosion in the number of re-sequenced strains. The use of domain sequence families for a functional classification of strains clearly has some potential to be explored.
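The grouping of proteins by their ordered domain content can be sketched in a few lines. This is an illustrative sketch only: the function name `domain_sequence_families`, the input layout (protein ID mapped to a list of `(start, domain)` hits) and the domain labels are assumptions, and a real pipeline would first obtain the hits from a Pfam-A search.

```python
from collections import defaultdict

def domain_sequence_families(proteins):
    """Group proteins by their ordered sequence of domains.

    proteins: dict mapping protein ID -> list of (start_position, domain_id) hits.
    Returns a dict mapping a tuple of domain IDs (in order along the protein)
    to the list of protein IDs sharing that domain sequence.
    """
    families = defaultdict(list)
    for protein_id, hits in proteins.items():
        if not hits:
            continue  # proteins without domain hits cannot be assigned
        # Sorting by start position gives the order of domains along the protein.
        key = tuple(domain for _, domain in sorted(hits))
        families[key].append(protein_id)
    return dict(families)

# Toy data with hypothetical domain labels.
proteins = {
    "p1": [(10, "domA"), (200, "domB")],
    "p2": [(150, "domB"), (5, "domA")],   # same order as p1 once sorted
    "p3": [(1, "domB"), (90, "domA")],    # same domains, different order
    "p4": [],                             # no domain hits
}
families = domain_sequence_families(proteins)
for key, members in families.items():
    print(key, members)
```

Because each protein is processed once and independently of the others, the work grows linearly with the number of proteins, in line with the linear scaling noted in the abstract.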
|
38
|
Abstract
The comparative genomics of prokaryotes has shown the presence of conserved regions containing highly similar genes (the 'core genome') and other regions that vary in gene content (the 'flexible' regions). A significant part of the latter is involved in surface structures that are phage recognition targets. Another sizeable part provides for differences in niche exploitation. Metagenomic data indicates that natural populations of prokaryotes are composed of assemblages of clonal lineages or "meta-clones" that share a core of genes but contain a high diversity by varying the flexible component. This meta-clonal diversity is maintained by a collection of phages that equalize the populations by preventing any individual clonal lineage from hoarding common resources. Thus, this polyclonal assemblage and the phages preying upon them constitute natural selection units.
|
39
|
Analysis of evolutionary patterns of genes in Campylobacter jejuni and C. coli. MICROBIAL INFORMATICS AND EXPERIMENTATION 2012; 2:8. [PMID: 22929701 PMCID: PMC3502170 DOI: 10.1186/2042-5783-2-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 03/22/2012] [Accepted: 07/20/2012] [Indexed: 01/06/2023]
Abstract
Background The thermophilic Campylobacter jejuni and Campylobacter coli are considered weakly clonal populations where incongruences between genetic markers are assumed to be due to random horizontal transfer of genomic DNA. In order to investigate the population genetic structure we extracted a set of 1180 core gene families (CGF) from 27 sequenced genomes of C. jejuni and C. coli. We applied a principal component analysis (PCA) to the normalized evolutionary distances in order to reveal any patterns in the evolutionary signals contained within the various CGFs. Results The analysis indicates that the conserved genes in Campylobacter show at least two, possibly five, distinct patterns of evolutionary signals, seen as clusters in the score-space of our PCA. The dominant underlying factor separating the core genes is the ability to distinguish C. jejuni from C. coli. The genes in the clusters outside the main gene group have a strong tendency to be chromosomal neighbors, which is natural if they share a common evolutionary history. Also, the most distinct cluster outside the main group is enriched with genes under positive selection and displays larger than average recombination rates. Conclusions The Campylobacter genomes investigated here show that subsets of conserved genes differ from each other in a more systematic way than expected by random horizontal transfer, which is consistent with differences in selection pressure acting on different genes. These findings are indications of a population of bacteria characterized by genomes with a mixture of evolutionary patterns.
|
40
|
Campylobacter fetus subspecies: comparative genomics and prediction of potential virulence targets. Gene 2012; 508:145-56. [PMID: 22890137 DOI: 10.1016/j.gene.2012.07.070] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Received: 07/24/2012] [Accepted: 07/30/2012] [Indexed: 01/10/2023]
Abstract
The genus Campylobacter contains pathogens causing a wide range of diseases, targeting both humans and animals. Among them, the Campylobacter fetus subspecies fetus and venerealis deserve special attention, as they are the etiological agents of human bacterial gastroenteritis and bovine genital campylobacteriosis, respectively. We compare the whole genomes of both subspecies to gain insight into genomic architecture, phylogenetic relationships, genome conservation and core virulence factors. A pan-genomic approach was applied to identify the core- and pan-genome for both C. fetus subspecies and members of the genus. The conserved proteome of the C. fetus subspecies (76%) was then analyzed for subcellular localization and protein functions in biological processes. Furthermore, with pathogenomic strategies, unique candidate regions in the genomes and several potential core-virulence factors were identified. The potential candidate factors identified for attenuation and/or subunit vaccine development against C. fetus subspecies include: nucleoside diphosphate kinase (Ndk), type IV secretion systems (T4SS), outer membrane proteins (OMP), substrate binding proteins CjaA and CjaC, surface array proteins, sap gene, and cytolethal distending toxin (CDT). Significantly, many of those genes were found in genomic regions with signals of horizontal gene transfer and were therefore predicted to be putative pathogenicity islands. We found CRISPR loci and dam genes in an island specific for C. fetus subsp. fetus, and T4SS and sap genes in an island specific for C. fetus subsp. venerealis. The genomic variations and potential core and unique virulence factors characterized in this study should lead to better insight into the species' virulence and to more efficient use of the candidates for antibiotic, drug and vaccine development.
|
41
|
LeuO is a global regulator of gene expression in Salmonella enterica serovar Typhimurium. Mol Microbiol 2012; 85:1072-89. [DOI: 10.1111/j.1365-2958.2012.08162.x] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.0] [Indexed: 11/28/2022]
|
42
|
|
43
|
Abstract
More than 50 years of research have provided great insight into the physiology, metabolism, and molecular biology of Salmonella enterica serovar Typhimurium (S. Typhimurium), but important gaps in our knowledge remain. It is clear that a precise choreography of gene expression is required for Salmonella infection, but basic genetic information such as the global locations of transcription start sites (TSSs) has been lacking. We combined three RNA-sequencing techniques and two sequencing platforms to generate a robust picture of transcription in S. Typhimurium. Differential RNA sequencing identified 1,873 TSSs on the chromosome of S. Typhimurium SL1344 and 13% of these TSSs initiated antisense transcripts. Unique findings include the TSSs of the virulence regulators phoP, slyA, and invF. Chromatin immunoprecipitation revealed that RNA polymerase was bound to 70% of the TSSs, and two-thirds of these TSSs were associated with σ(70) (including phoP, slyA, and invF) from which we identified the -10 and -35 motifs of σ(70)-dependent S. Typhimurium gene promoters. Overall, we corrected the location of important genes and discovered 18 times more promoters than identified previously. S. Typhimurium expresses 140 small regulatory RNAs (sRNAs) at early stationary phase, including 60 newly identified sRNAs. Almost half of the experimentally verified sRNAs were found to be unique to the Salmonella genus, and <20% were found throughout the Enterobacteriaceae. This description of the transcriptional map of SL1344 advances our understanding of S. Typhimurium, arguably the most important bacterial infection model.
|
44
|
Comparative genomics of Bifidobacterium, Lactobacillus and related probiotic genera. MICROBIAL ECOLOGY 2012; 63:651-673. [PMID: 22031452 PMCID: PMC3324989 DOI: 10.1007/s00248-011-9948-y] [Citation(s) in RCA: 72] [Impact Index Per Article: 6.0] [Received: 05/10/2011] [Accepted: 08/01/2011] [Indexed: 05/31/2023]
Abstract
Six bacterial genera containing species commonly used as probiotics for human consumption or starter cultures for food fermentation were compared and contrasted, based on publicly available complete genome sequences. The analysis included 19 Bifidobacterium genomes, 21 Lactobacillus genomes, 4 Lactococcus and 3 Leuconostoc genomes, as well as a selection of Enterococcus (11) and Streptococcus (23) genomes. The latter two genera included genomes from probiotic or commensal as well as pathogenic organisms to investigate if their non-pathogenic members shared more genes with the other probiotic genomes than their pathogenic members. The pan- and core genome of each genus was defined. Pairwise BLASTP genome comparison was performed within and between genera. It turned out that pathogenic Streptococcus and Enterococcus shared more gene families than did the non-pathogenic genomes. In silico multilocus sequence typing was carried out for all genomes per genus, and the variable gene content of genomes was compared within the genera. Informative BLAST Atlases were constructed to visualize genomic variation within genera. The clusters of orthologous groups (COG) classes of all genes in the pan- and core genome of each genus were compared. In addition, it was investigated whether pathogenic genomes contain different COG classes compared to the probiotic or fermentative organisms, again comparing their pan- and core genomes. The obtained results were compared with published data from the literature. This study illustrates how over 80 genomes can be broadly compared using simple bioinformatic tools, leading to both confirmation of known information as well as novel observations.
|
45
|
Genomic variation in Salmonella enterica core genes for epidemiological typing. BMC Genomics 2012; 13:88. [PMID: 22409488 PMCID: PMC3359268 DOI: 10.1186/1471-2164-13-88] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.4] [Received: 09/30/2011] [Accepted: 03/12/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Technological advances in high throughput genome sequencing are making whole genome sequencing (WGS) available as a routine tool for bacterial typing. Standardized procedures for identification of relevant genes and of variation are needed to enable comparison between studies and over time. The core genes--the genes that are conserved in all (or most) members of a genus or species--are potentially good candidates for investigating genomic variation in phylogeny and epidemiology. RESULTS We identify a set of 2,882 core gene clusters based on 73 publicly available Salmonella enterica genomes and evaluate their value as typing targets, comparing whole genome typing and traditional methods such as 16S and MLST. A consensus tree based on variation of core genes gives much better resolution than 16S and MLST; the pan-genome family tree is similar to the consensus tree, but with higher confidence. The core genes can be divided into two categories: a few highly variable genes and a larger set of conserved core genes with low variance. For the most variable core genes, the variance in amino acid sequences is higher than for the corresponding nucleotide sequences, suggesting that there is positive selection towards mutations leading to amino acid changes. CONCLUSIONS Genomic variation within the core genome is useful for investigating molecular evolution and providing candidate genes for bacterial genome typing. Identification of genes with different degrees of variation is important, especially in trend analysis.
|
46
|
Defining the Pseudomonas genus: where do we draw the line with Azotobacter? MICROBIAL ECOLOGY 2012; 63:239-48. [PMID: 21811795 PMCID: PMC3275731 DOI: 10.1007/s00248-011-9914-8] [Citation(s) in RCA: 75] [Impact Index Per Article: 6.3] [Received: 11/07/2010] [Accepted: 07/13/2011] [Indexed: 05/07/2023]
Abstract
The genus Pseudomonas has gone through many taxonomic revisions over the past 100 years, going from a very large and diverse group of bacteria to a smaller, more refined and ordered list having specific properties. The relationship of the Pseudomonas genus to Azotobacter vinelandii is examined using three genomic sequence-based methods. First, using 16S rRNA trees, it is shown that A. vinelandii groups within the Pseudomonas close to Pseudomonas aeruginosa. Genomes from other related organisms (Acinetobacter, Psychrobacter, and Cellvibrio) are outside the Pseudomonas cluster. Second, pan genome family trees based on conserved gene families also show A. vinelandii to be more closely related to Pseudomonas than other related organisms. Third, exhaustive BLAST comparisons demonstrate that the fraction of shared genes between A. vinelandii and Pseudomonas genomes is similar to that of Pseudomonas species with each other. The results of these different methods point to a high similarity between A. vinelandii and the Pseudomonas genus, suggesting that Azotobacter might actually be a Pseudomonas.
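The third comparison in this abstract, the fraction of genes shared between pairs of genomes, can be sketched as follows. The sketch assumes each genome has already been reduced to a set of gene family identifiers (the paper derives these from exhaustive BLAST comparisons, which are not reproduced here). The normalization by the union of both sets is an assumption made for illustration, and all genome and family names are hypothetical.

```python
from itertools import combinations


def shared_fraction(families_a, families_b):
    """Fraction of gene families shared by two genomes.

    Assumption: the shared count is normalized by the union of both
    sets; other normalizations (e.g. by the smaller genome) are possible.
    """
    union = families_a | families_b
    if not union:
        return 0.0
    return len(families_a & families_b) / len(union)


def pairwise_shared_fractions(genomes):
    """Compute the shared fraction for every pair of genomes."""
    return {
        (a, b): shared_fraction(genomes[a], genomes[b])
        for a, b in combinations(sorted(genomes), 2)
    }


# Hypothetical toy input: genome name -> set of gene family IDs.
genomes = {
    "genome_1": {"fam1", "fam2", "fam3"},
    "genome_2": {"fam2", "fam3", "fam4"},
    "genome_3": {"fam9"},
}

for pair, fraction in pairwise_shared_fractions(genomes).items():
    print(pair, round(fraction, 2))
```

Comparing the values for between-genus pairs against the values for within-genus pairs is the kind of comparison the abstract describes.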
|
47
|
Natural genetic engineering: intelligence & design in evolution? MICROBIAL INFORMATICS AND EXPERIMENTATION 2011. [PMCID: PMC3372291 DOI: 10.1186/2042-5783-1-11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
Abstract
There are many things that I like about James Shapiro's new book "Evolution: A View from the 21st Century" (FT Press Science, 2011). He begins the book by saying that it is the creation of novelty, and not selection, that is important in the history of life. In the presence of heritable traits that vary, selection results in the evolution of a population towards an optimal composition of those traits. But selection can only act on changes - and where does this variation come from? Historically, the creation of novelty has been assumed to be the result of random chance or accident. And yet, organisms seem 'designed'. When one examines the data from sequenced genomes, the changes appear NOT to be random or accidental, but one observes that whole chunks of the genome come and go. These 'chunks' often contain functional units, encoding sets of genes that together can perform some specific function. Shapiro argues that what we see in genomes is 'Natural Genetic Engineering', or designed evolution: "Thinking about genomes from an informatics perspective, it is apparent that systems engineering is a better metaphor for the evolutionary process than the conventional view of evolution as a select-biased random walk through limitless space of possible DNA configurations" (page 6). In this review, I will have a look at four topics: 1.) why I think genomics is not the whole story; 2.) my own perspective of E. coli genomics, and how I think it relates to this book; 3.) a brief discussion on "Intelligence, Design, and Evolution"; and finally, 4.) a section "in defense of the central dogma".
|
48
|
The Salmonella enterica pan-genome. MICROBIAL ECOLOGY 2011; 62:487-504. [PMID: 21643699 PMCID: PMC3175032 DOI: 10.1007/s00248-011-9880-1] [Citation(s) in RCA: 130] [Impact Index Per Article: 10.0] [Received: 01/01/2011] [Accepted: 05/08/2011] [Indexed: 05/25/2023]
Abstract
Salmonella enterica is divided into four subspecies containing a large number of different serovars, several of which are important zoonotic pathogens and some of which show a high degree of host specificity or host preference. We compare 45 sequenced S. enterica genomes that are publicly available (22 complete and 23 draft genome sequences). Of these, 35 were found to be of sufficiently good quality to allow a detailed analysis, along with two Escherichia coli strains (K-12 substr. DH10B and the avian pathogenic E. coli (APEC O1) strain). All genomes were subjected to standardized gene finding, and the core and pan-genome of Salmonella were estimated to be around 2,800 and 10,000 gene families, respectively. The constructed pan-genomic dendrograms suggest that gene content is often, but not uniformly, correlated to serotype. Any given Salmonella strain has a large stable core, whilst there is an abundance of accessory genes, including the Salmonella pathogenicity islands (SPIs), transposable elements, phages, and plasmid DNA. We visualize conservation in the genomes in relation to chromosomal location and DNA structural features and find that variation in gene content is localized in a selection of variable genomic regions or islands. These include the SPIs but also encompass phage insertion sites and transposable elements. The islands were typically well conserved in several, but not all, isolates--a difference which may have implications in, e.g., host specificity.
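The core and pan-genome sizes quoted in this abstract are counts of gene families. A minimal sketch of how such counts could be derived is given below, assuming each genome has already been mapped to a set of gene family identifiers. The strict definition of "core" (present in every genome) is an assumption, since studies often relax it to "most genomes", and the strain and family names are hypothetical.

```python
def core_and_pan(genome_families):
    """Return (core, pan) sets of gene families.

    core: families present in every genome (strict definition, assumed).
    pan:  families present in at least one genome.
    """
    family_sets = [set(families) for families in genome_families.values()]
    if not family_sets:
        return set(), set()
    pan = set().union(*family_sets)
    core = set.intersection(*family_sets)
    return core, pan


# Hypothetical toy input: strain name -> gene family IDs found in it.
genome_families = {
    "strain_1": {"famA", "famB", "famC"},
    "strain_2": {"famA", "famB", "famD"},
    "strain_3": {"famA", "famB", "famE", "famF"},
}

core, pan = core_and_pan(genome_families)
accessory = pan - core  # families missing from at least one strain
print(len(core), len(pan), sorted(accessory))
```

In this toy example the accessory set plays the role of the variable gene content that the abstract locates in genomic islands.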
|
49
|
Abstract
A vast and rich body of information has grown up as a result of the world's enthusiasm for 'omics technologies. Finding ways to describe and make available this information that maximise its usefulness has become a major effort across the 'omics world. At the heart of this effort is the Genomic Standards Consortium (GSC), an open-membership organization that drives community-based standardization activities. Here we provide a short history of the GSC, provide an overview of its range of current activities, and make a call for the scientific community to join forces to improve the quality and quantity of contextual information about our public collections of genomes, metagenomes, and marker gene sequences.
|
50
|
A closer look at bacteroides: phylogenetic relationship and genomic implications of a life in the human gut. MICROBIAL ECOLOGY 2011; 61:473-85. [PMID: 21222211 DOI: 10.1007/s00248-010-9796-1] [Citation(s) in RCA: 67] [Impact Index Per Article: 5.2] [Received: 10/01/2010] [Accepted: 12/14/2010] [Indexed: 05/20/2023]
Abstract
The human gut is extremely densely inhabited by bacteria mainly from two phyla, Bacteroidetes and Firmicutes, and there is great interest in analyzing whole-genome sequences for these species because of their relation to human health and disease. Here, we perform a whole-genome comparison of 105 Bacteroidetes/Chlorobi genomes to elucidate their phylogenetic relationship and to gain insight into what separates the gut-living Bacteroides and Parabacteroides genera from other Bacteroidetes/Chlorobi species. A comprehensive analysis shows that Bacteroides species have a higher number of extracytoplasmic function σ factors (ECF σ factors) and two-component systems for extracellular signal transduction compared to other Bacteroidetes/Chlorobi species. A whole-genome phylogenetic analysis shows very little difference between the Parabacteroides and Bacteroides genera. Further analysis shows that Bacteroides and Parabacteroides species share a large common core of 1,085 protein families. Genome atlases illustrate that there are few and only small unique areas on the chromosomes of four Bacteroides/Parabacteroides genomes. Functional classification into clusters of orthologous groups shows that Bacteroides species are enriched in carbohydrate transport and metabolism proteins. Classification of proteins in KEGG metabolic pathways gives a detailed view of the genome's metabolic capabilities that can be linked to its habitat. Bacteroides pectinophilus and Bacteroides capillosus do not cluster together with other Bacteroides species, based on analysis of 16S rRNA sequence, whole-genome protein families and functional content; 16S rRNA sequences of the two species suggest that they belong to the Firmicutes phylum. We have presented a more detailed and precise description of the phylogenetic relationships of members of the Bacteroidetes/Chlorobi phylum by whole genome comparison. Gut-living Bacteroides have an enriched set of glycan, vitamin, and cofactor enzymes important for diet digestion.
|