1
Sarumi OA, Hahn M, Heider D. NeuralBeds: Neural embeddings for efficient DNA data compression and optimized similarity search. Comput Struct Biotechnol J 2024; 23:732-741. [PMID: 38298179 PMCID: PMC10828564 DOI: 10.1016/j.csbj.2023.12.046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Revised: 12/28/2023] [Accepted: 12/28/2023] [Indexed: 02/02/2024] Open
Abstract
The availability of high-throughput sequencing tools, coupled with the declining cost of producing DNA sequences, has led to the generation of enormous amounts of omics data curated in several databases such as NCBI and EMBL. Identification of similar DNA sequences from these databases is one of the fundamental tasks in bioinformatics. It is essential for discovering homologous sequences in organisms, phylogenetic studies of evolutionary relationships among several biological entities, or detection of pathogens. Improving DNA similarity search is of utmost importance because of the increased complexity of the ever-growing repositories of sequences. Therefore, instead of using the conventional approach of comparing raw sequences, e.g., in FASTA format, a numerical representation of the sequences can be used to calculate their similarities and optimize the search process. In this study, we analyzed different approaches for numerical embeddings, including Chaos Game Representation, hashing, and neural networks, and compared them with classical approaches such as principal component analysis. Neural networks turned out to generate embeddings that capture the similarity between DNA sequences as a distance measure and significantly outperform the other approaches on DNA similarity search.
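As a rough illustration of what a numerical representation of a DNA sequence can look like, the sketch below builds a frequency Chaos Game Representation (one of the embedding families named in the abstract) and compares two sequences by cosine similarity. It is not the NeuralBeds implementation: the function names, the corner layout, the default `k`, and the choice of cosine similarity are all assumptions made for illustration.

```python
from math import sqrt

# Assumed corner layout (one common convention); values are (x bit, y bit).
CORNER_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr_embedding(seq, k=3):
    """Return a normalized frequency-CGR vector of length 4**k (hypothetical helper)."""
    size = 2 ** k
    counts = [0] * (size * size)
    xi = yi = 0  # top-k bits of the CGR point, kept as integers so binning is exact
    run = 0      # consecutive unambiguous bases seen so far
    for base in seq.upper():
        if base not in CORNER_BITS:
            run = 0  # restart after N or other ambiguity codes
            continue
        bx, by = CORNER_BITS[base]
        # Moving halfway toward a corner shifts the old bits right and
        # writes the corner bit into the most significant position.
        xi = (bx << (k - 1)) | (xi >> 1)
        yi = (by << (k - 1)) | (yi >> 1)
        run += 1
        if run >= k:  # the grid cell is fully determined by the last k bases
            counts[yi * size + xi] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Usage: two sequences differing by one substitution.
a = cgr_embedding("ACGTACGTACGT")
b = cgr_embedding("ACGTACGAACGT")
print(cosine_similarity(a, b))
```

Each sequence becomes a fixed-length vector regardless of its length, so similarity search reduces to vector comparison instead of pairwise comparison of raw sequences.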
Affiliation(s)
- Oluwafemi A. Sarumi
  - Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, Marburg, D-35043, Germany
  - Institute of Computer Science, Heinrich-Heine-University Duesseldorf, Graf-Adolf-Str. 63, Duesseldorf, D-40215, Germany
- Maximilian Hahn
  - Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, Marburg, D-35043, Germany
- Dominik Heider
  - Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, Marburg, D-35043, Germany
  - Institute of Computer Science, Heinrich-Heine-University Duesseldorf, Graf-Adolf-Str. 63, Duesseldorf, D-40215, Germany
2
Sweet T, Sindi S, Sistrom M. Going through phages: a computational approach to revealing the role of prophage in Staphylococcus aureus. Access Microbiol 2023; 5:acmi000424. [PMID: 37424556 PMCID: PMC10323782 DOI: 10.1099/acmi.0.000424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2022] [Accepted: 03/28/2023] [Indexed: 07/11/2023] Open
Abstract
Prophages have important roles in virulence, antibiotic resistance, and genome evolution in Staphylococcus aureus. Rapid growth in the number of sequenced S. aureus genomes allows for an investigation of prophage sequences at an unprecedented scale. We developed a novel computational pipeline for phage discovery and annotation. We combined PhiSpy, a phage discovery tool, with VGAS and PROKKA, genome annotation tools, to detect and analyse prophage sequences in nearly 10 011 S. aureus genomes, discovering thousands of putative prophage sequences with genes encoding virulence factors and antibiotic resistance. To our knowledge, this is the first application of PhiSpy to a large-scale set of genomes (10 011 S. aureus). Determining the presence of virulence- and resistance-encoding genes in prophages has implications for the potential transfer of these genes/functions to other bacteria via transduction and thus can provide insight into the evolution and spread of these genes/functions between bacterial strains. While the phages we have identified may be known, they were not necessarily known or characterized in S. aureus, and our clustering and comparison of phages based on their gene content is novel. Moreover, the reporting of these genes with the S. aureus genomes is novel.
Affiliation(s)
- Tyrome Sweet
  - Department of Life and Environmental Sciences, University of California, Merced, California, USA
- Suzanne Sindi
  - Department of Applied Mathematics, University of California, Merced, California, USA
- Mark Sistrom
  - Department of Life and Environmental Sciences, University of California, Merced, California, USA
3
Brůna T, Li H, Guhlin J, Honsel D, Herbold S, Stanke M, Nenasheva N, Ebel M, Gabriel L, Hoff KJ. GALBA: Genome Annotation with Miniprot and AUGUSTUS. bioRxiv 2023:2023.04.10.536199. [PMID: 37090650 PMCID: PMC10120627 DOI: 10.1101/2023.04.10.536199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/25/2023]
Abstract
The Earth BioGenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. Various gene annotation tools have been developed, but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a previously unannotated land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments. Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.
Affiliation(s)
- Tomáš Brůna
  - US Department of Energy Joint Genome Institute, Berkeley, CA 94720, USA
- Heng Li
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA & Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
- Joseph Guhlin
  - Genomics Aotearoa and Laboratory for Evolution and Development, Department of Biochemistry, University of Otago, PO Box 56, Dunedin 9016, New Zealand
- Daniel Honsel
  - Institute of Computer Science, University of Göttingen, 37077 Göttingen, Germany
- Steffen Herbold
  - Faculty for Computer Science and Mathematics, University of Passau, 94032 Passau, Germany
- Mario Stanke
  - Institute of Mathematics and Computer Science & Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Natalia Nenasheva
  - Institute of Mathematics and Computer Science & Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Matthis Ebel
  - Institute of Mathematics and Computer Science & Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Lars Gabriel
  - Institute of Mathematics and Computer Science & Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
- Katharina J. Hoff
  - Institute of Mathematics and Computer Science & Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
4
Usha T, Middha SK, Babu D, Goyal AK, Das AJ, Saini D, Sarangi A, Krishnamurthy V, Prasannakumar MK, Saini DK, Sidhalinghamurthy KR. Hybrid Assembly and Annotation of the Genome of the Indian Punica granatum, a Superfood. Front Genet 2022; 13:786825. [PMID: 35646087 PMCID: PMC9130716 DOI: 10.3389/fgene.2022.786825] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/05/2021] [Accepted: 03/15/2022] [Indexed: 12/13/2022] Open
Abstract
The wonder fruit pomegranate (Punica granatum, family Lythraceae) is one of India’s economically important fruit crops and can grow in different agro-climatic conditions ranging from tropical to temperate regions. This study reports a high-quality de novo draft hybrid genome assembly of the diploid Punica cultivar “Bhagwa” and identifies its genomic features. This cultivar is the most common among farmers due to its high sustainability, glossy red color, soft seed, and nutraceutical properties with high market value. The draft genome assembly is about 361.76 Mb (N50 = 40 Mb), ∼9.0 Mb more than the genome size estimated by flow cytometry. The genome is 90.9% complete; transposable elements occupy only 26.68% of it, and SSRs occur at a relative abundance of 369.93 SSRs/Mb. A total of 30,803 proteins and their putative functions were predicted. Comparative whole-genome analysis revealed Eucalyptus grandis as the nearest neighbor. KEGG-KASS annotations indicated an abundance of genes involved in the biosynthesis of flavonoids, phenylpropanoids, and secondary metabolites, which are responsible for various medicinal properties of pomegranate, including anticancer, antihyperglycemic, antioxidant, and anti-inflammatory activities. The genome and gene annotations provide new insights into the pharmacological properties of the secondary metabolites synthesized in pomegranate. They will also serve as a valuable resource in mining biosynthetic pathways for key metabolites, novel genes, and variations associated with disease resistance, which can facilitate the breeding of new varieties with high yield and superior quality.
Affiliation(s)
- Talambedu Usha
  - Department of Biochemistry, Bangalore University, Bengaluru, India
- Sushil Kumar Middha
  - DBT-BIF Facility, Department of Biotechnology, Maharani Lakshmi Ammanni College for Women, Bengaluru, India
- Dinesh Babu
  - Faculty of Pharmacy and Pharmaceutical Sciences, University of Alberta, Edmonton, AB, Canada
- Arvind Kumar Goyal
  - Centre for Bamboo Studies, Department of Biotechnology, Bodoland University, Kokrajhar, India
- Deepti Saini
  - Protein Design Private Limited, Bengaluru, India
- Deepak Kumar Saini
  - Department of Molecular Reproduction Development and Genetics, Indian Institute of Science, Bengaluru, India
5
MOSGA 2: Comparative genomics and validation tools. Comput Struct Biotechnol J 2021; 19:5504-5509. [PMID: 34712396 PMCID: PMC8517542 DOI: 10.1016/j.csbj.2021.09.024] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/10/2021] [Revised: 09/23/2021] [Accepted: 09/24/2021] [Indexed: 01/06/2023] Open
Abstract
Due to the rapidly growing amount of available genomic information, the need for accessible and easy-to-use analysis tools is increasing. To facilitate eukaryotic genome annotations, we created MOSGA. In this work, we show how MOSGA 2 extends it with several advanced analyses for genomic data. Since the quality of the genomic data greatly impacts the annotation quality, we included multiple tools to validate user-submitted genome assemblies and ensure that they are of high quality. Moreover, thanks to the integration of comparative genomics methods, users can benefit from a broader genomic view by analyzing multiple genomic data sets simultaneously. Further, we demonstrate the new functionalities of MOSGA 2 through different use cases and practical examples. MOSGA 2 extends the already established application to the quality control of genomic data, and it integrates and analyzes multiple genomes in a larger context, e.g., by phylogenetics.