1
|
Krause GR, Shands W, Wheeler TJ. Sensitive and error-tolerant annotation of protein-coding DNA with BATH. bioRxiv 2024:2023.12.31.573773. [PMID: 38260252 PMCID: PMC10802276 DOI: 10.1101/2023.12.31.573773] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/24/2024]
Abstract
We present BATH, a tool for highly sensitive annotation of protein-coding DNA based on direct alignment of that DNA to a database of protein sequences or profile hidden Markov models (pHMMs). BATH is built on top of the HMMER3 code base, and simplifies the annotation workflow for pHMM-based annotation by providing a straightforward input interface and easy-to-interpret output. BATH also introduces novel frameshift-aware algorithms to detect frameshift-inducing nucleotide insertions and deletions (indels). BATH matches the accuracy of HMMER3 for annotation of sequences containing no errors, and produces superior accuracy to all tested tools for annotation of sequences containing nucleotide indels. These results suggest that BATH should be used when high annotation sensitivity is required, particularly when frameshift errors are expected to interrupt protein-coding regions, as is true with long read sequencing data and in the context of pseudogenes.
Collapse
Affiliation(s)
- Genevieve R Krause
- R. Ken Coit College of Pharmacy, University of Arizona, Tucson, Arizona, USA
- Department of Computer Science, University of Montana, Missoula, Montana, USA
| | - Walt Shands
- Department of Computer Science, University of Montana, Missoula, Montana, USA
- UC Santa Cruz Genomics Institute, Santa Cruz, California, USA
| | - Travis J Wheeler
- R. Ken Coit College of Pharmacy, University of Arizona, Tucson, Arizona, USA
- Department of Computer Science, University of Montana, Missoula, Montana, USA
| |
Collapse
|
2
|
Mat Razali N, Hisham SN, Kumar IS, Shukla RN, Lee M, Abu Bakar MF, Nadarajah K. Comparative Genomics: Insights on the Pathogenicity and Lifestyle of Rhizoctonia solani. Int J Mol Sci 2021; 22:ijms22042183. [PMID: 33671736 PMCID: PMC7926851 DOI: 10.3390/ijms22042183] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2020] [Revised: 02/06/2021] [Accepted: 02/15/2021] [Indexed: 12/17/2022] Open
Abstract
Proper management of agricultural disease is important to ensure sustainable food security. Staple food crops like rice, wheat, cereals, and other cash crops hold great export value for countries. Ensuring proper supply is critical; hence any biotic or abiotic factors contributing to the shortfall in yield of these crops should be alleviated. Rhizoctonia solani is a major biotic factor that results in yield losses in many agriculturally important crops. This paper focuses on genome informatics of our Malaysian Draft R. solani AG1-IA, and the comparative genomics (inter- and intra- AG) with four AGs including China AG1-IA (AG1-IA_KB317705.1), AG1-IB, AG3, and AG8. The genomic content of repeat elements, transposable elements (TEs), syntenic genomic blocks, functions of protein-coding genes as well as core orthologous genic information that underlies R. solani’s pathogenicity strategy were investigated. Our analyses show that all studied AGs have low content and varying profiles of TEs. All AGs were dominant for Class I TE, much like other basidiomycete pathogens. All AGs demonstrate dominance in Glycoside Hydrolase protein-coding gene assignments suggesting its importance in infiltration and infection of host. Our profiling also provides a basis for further investigation on lack of correlation observed between number of pathogenicity and enzyme-related genes with host range. Despite being grouped within the same AG with China AG1-IA, our Draft AG1-IA exhibits differences in terms of protein-coding gene proportions and classifications. This implies that strains from similar AG do not necessarily have to retain similar proportions and classification of TE but must have the necessary arsenal to enable successful infiltration and colonization of host. In a larger perspective, all the studied AGs essentially share core genes that are generally involved in adhesion, penetration, and host colonization. However, the different infiltration strategies will depend on the level of host resilience where this is clearly exhibited by the gene sets encoded for the process of infiltration, infection, and protection from host.
Collapse
Affiliation(s)
- Nurhani Mat Razali
- Department of Biological Sciences and Biotechnology, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, Bangi 43600, Selangor, Malaysia; (N.M.R.); (S.N.H.); (I.S.K.)
| | - Siti Norvahida Hisham
- Department of Biological Sciences and Biotechnology, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, Bangi 43600, Selangor, Malaysia; (N.M.R.); (S.N.H.); (I.S.K.)
| | - Ilakiya Sharanee Kumar
- Department of Biological Sciences and Biotechnology, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, Bangi 43600, Selangor, Malaysia; (N.M.R.); (S.N.H.); (I.S.K.)
| | - Rohit Nandan Shukla
- Bionivid Technology Pte Ltd., 209, 4th Cross Rd, B Channasandra, East of NGEF Layout, Kasturi Nagar, Bengaluru 560043, Karnataka, India;
| | - Melvin Lee
- Codon Genomics Sdn. Bhd., No 26, Jalan Dutamas 7 Taman Dutamas Balakong, Seri Kembangan 43200, Selangor, Malaysia;
| | | | - Kalaivani Nadarajah
- Department of Biological Sciences and Biotechnology, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, Bangi 43600, Selangor, Malaysia; (N.M.R.); (S.N.H.); (I.S.K.)
- Correspondence:
| |
Collapse
|
3
|
Abstract
BACKGROUND Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting and analyzing relevant data. Shared data science infrastructures like Boag is needed to efficiently process and parse data contained in large data repositories. The main features of Boag are inspired from existing languages for data intensive computing and can easily integrate data from biological data repositories. RESULTS As a proof of concept, Boa for genomics, Boag, has been implemented to analyze RefSeq's 153,848 annotation (GFF) and assembly (FASTA) file metadata. Boag provides a massive improvement from existing solutions like Python and MongoDB, by utilizing a domain-specific language that uses Hadoop infrastructure for a smaller storage footprint that scales well and requires fewer lines of code. We execute scripts through Boag to answer questions about the genomes in RefSeq. We identify the largest and smallest genomes deposited, explore exon frequencies for assemblies after 2016, identify the most commonly used bacterial genome assembly program, and address how animal genome assemblies have improved since 2016. Boag databases provide a significant reduction in required storage of the raw data and a significant speed up in its ability to query large datasets due to automated parallelization and distribution of Hadoop infrastructure during computations. CONCLUSIONS In order to keep pace with our ability to produce biological data, innovative methods are required. The Shared Data Science Infrastructure, Boag, provides researchers a greater access to researchers to efficiently explore data in new ways. We demonstrate the potential of a the domain specific language Boag using the RefSeq database to explore how deposited genome assemblies and annotations are changing over time. This is a small example of how Boag could be used with large biological datasets.
Collapse
Affiliation(s)
- Hamid Bagheri
- Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, 50011 USA
| | - Usha Muppirala
- Genome Informatics Facility, Iowa State University, 206 Science I, Ames, 50011 USA
| | - Rick E. Masonbrink
- Genome Informatics Facility, Iowa State University, 206 Science I, Ames, 50011 USA
| | - Andrew J. Severin
- Genome Informatics Facility, Iowa State University, 206 Science I, Ames, 50011 USA
| | - Hridesh Rajan
- Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, 50011 USA
| |
Collapse
|
4
|
Abstract
dbVar houses over 3 million submitted structural variants (SSV) from 120 human studies including copy number variations (CNV), insertions, deletions, inversions, translocations, and complex chromosomal rearrangements. Users can submit multiple SSVs to dbVAR that are presumably identical, but were ascertained by different platforms and samples, to calculate whether the variant is rare or common in the population and allow for cross validation. However, because SSV genomic location reporting can vary – including fuzzy locations where the start and/or end points are not precisely known – analysis, comparison, annotation, and reporting of SSVs across studies can be difficult. This project was initiated by the Structural Variant Comparison Group for the purpose of generating a non-redundant set of genomic regions defined by counts of concordance for all human SSVs placed on RefSeq assembly GRCh38 (RefSeq accession GCF_000001405.26). We intend that the availability of these regions, called structural variant clusters (SVCs), will facilitate the analysis, annotation, and exchange of SV data and allow for simplified display in genomic sequence viewers for improved variant interpretation. Sets of SVCs were generated by variant type for each of the 120 studies as well as for a combined set across all studies. Starting from 3.64 million SSVs, 2.5 million and 3.4 million non-redundant SVCs with count >=1 were generated by variant type for each study and across all studies, respectively. In addition, we have developed utilities for annotating, searching, and filtering SVC data in GVF format for computing summary statistics, exporting data for genomic viewers, and annotating the SVC using external data sources.
Collapse
Affiliation(s)
- Lon Phan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Jeffrey Hsu
- Cleveland Clinic Lerner Research Institute, Cleveland, OH, USA
| | - Le Quang Minh Tri
- Department of Biotechnology, Ho Chi Minh City International University, Ho Chi Minh, Vietnam
| | - Michaela Willi
- Laboratory of Genetics and Physiology, National Institute of Diabetes, Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MA, USA; Division of Bioinformatics, Biocenter, Medical University Innsbruck, Innsbruck, Austria
| | - Tamer Mansour
- Lab for Data Intensive Biology, Department of Population Health and Reproduction, University of California, Davis, CA, USA; Department of Clinical Pathology, University of Mansoura, Mansoura, Egypt
| | - Yan Kai
- Cancer Epigenetics Laboratory, Department of Anatomy and Regenerative Biology, The George Washington University, Washington, DC, USA; Department of Physics, The George Washington University, Washington, DC, USA
| | - John Garner
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - John Lopez
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Ben Busby
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| |
Collapse
|
5
|
Abstract
In genomics, bioinformatics and other areas of data science, gaps exist between extant public datasets and the open-source software tools built by the community to analyze similar data types. The purpose of biological data science hackathons is to assemble groups of genomics or bioinformatics professionals and software developers to rapidly prototype software to address these gaps. The only two rules for the NCBI-assisted hackathons run so far are that 1) data either must be housed in public data repositories or be deposited to such repositories shortly after the hackathon’s conclusion, and 2) all software comprising the final pipeline must be open-source or open-use. Proposed topics, as well as suggested tools and approaches, are distributed to participants at the beginning of each hackathon and refined during the event. Software, scripts, and pipelines are developed and published on GitHub, a web service providing publicly available, free-usage tiers for collaborative software development. The code resulting from each hackathon is published at
https://github.com/NCBI-Hackathons/ with separate directories or repositories for each team.
Collapse
Affiliation(s)
- Ben Busby
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Matthew Lesko
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | | | - Lisa Federer
- NIH Library, Division of Library Services, Office of Research Services, National Institutes of Health, Bethesda, MD, USA
| |
Collapse
|