1
Patsakis M, Provatas K, Baltoumas FA, Chantzi N, Mouratidis I, Pavlopoulos GA, Georgakopoulos-Soares I. MAFin: motif detection in multiple alignment files. Bioinformatics 2025; 41:btaf125. [PMID: 40106711 PMCID: PMC11978385 DOI: 10.1093/bioinformatics/btaf125] [Received: 10/11/2024] [Revised: 03/05/2025] [Accepted: 03/17/2025] [Indexed: 03/22/2025]
Abstract
MOTIVATION: Whole-genome and whole-proteome alignments, represented in the Multiple Alignment Format (MAF), have become a standard approach in comparative genomics and proteomics. These analyses often require identifying conserved motifs, which is crucial for understanding functional and evolutionary relationships. However, current approaches lack a direct method for motif detection within MAF files. To address this gap, we present MAFin, a novel tool that enables efficient motif detection and conservation analysis in MAF files, streamlining genomic and proteomic research. RESULTS: We developed MAFin, the first motif detection tool for Multiple Alignment Format files. MAFin enables multithreaded search of conserved motifs using three approaches: (i) user-specified k-mers, (ii) regular expressions, in which case one or more patterns are searched, and (iii) predefined Position Weight Matrices. Once a motif has been found, MAFin detects the motif instances and calculates their conservation across the aligned sequences. MAFin also calculates a conservation percentage, which describes the conservation level of each motif across the aligned sequences, based on the number of matches relative to the length of the motif. A set of statistics enables the interpretation of each motif's conservation level, and the detected motifs are exported in JSON and CSV files for downstream analyses. AVAILABILITY AND IMPLEMENTATION: MAFin is offered as a Python package under the GPL license as a multi-platform application and is available at: https://github.com/Georgakopoulos-Soares-lab/MAFin.
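The conservation percentage mentioned in the abstract (matches relative to motif length) can be illustrated with a minimal sketch. This is not MAFin's code: the function names `conservation_percentage` and `conservation_by_species` are hypothetical, and treating gaps as mismatches and comparing case-insensitively are assumptions made here for illustration.

```python
def conservation_percentage(motif_hit: str, aligned_segment: str) -> float:
    """Percent of motif positions at which the aligned segment matches the hit.

    Assumptions (not taken from MAFin): both strings cover the same alignment
    columns, comparison ignores case, and a gap ('-') counts as a mismatch.
    """
    if len(motif_hit) != len(aligned_segment):
        raise ValueError("motif hit and aligned segment must have equal length")
    if not motif_hit:
        return 0.0
    matches = sum(
        a != "-" and a.upper() == b.upper()
        for a, b in zip(motif_hit, aligned_segment)
    )
    return 100.0 * matches / len(motif_hit)


def conservation_by_species(reference_hit: str, aligned_rows: dict) -> dict:
    """Apply conservation_percentage to every aligned row of one block."""
    return {
        name: conservation_percentage(reference_hit, segment)
        for name, segment in aligned_rows.items()
    }


# Hypothetical alignment block: a 4-mer hit in the reference and two aligned rows.
scores = conservation_by_species("ACGT", {"speciesA": "ACGA", "speciesB": "AC-T"})
```

In this toy block, each aligned row matches the reference hit at three of four positions, so both rows score 75%.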
Affiliation(s)
- Michail Patsakis
- Institute for Personalized Medicine, Department of Molecular and Precision Medicine, The Pennsylvania State University College of Medicine, Hershey, PA 17033, United States
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, United States
- Kimonas Provatas
- Institute for Personalized Medicine, Department of Molecular and Precision Medicine, The Pennsylvania State University College of Medicine, Hershey, PA 17033, United States
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, United States
- Division of Basic Sciences, University of Crete Medical School, Heraklion 71110, Greece
- Fotis A Baltoumas
- Institute for Fundamental Biomedical Research, BSRC “Alexander Fleming”, Vari 16672, Greece
- Nikol Chantzi
- Institute for Personalized Medicine, Department of Molecular and Precision Medicine, The Pennsylvania State University College of Medicine, Hershey, PA 17033, United States
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, United States
- Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Molecular and Precision Medicine, The Pennsylvania State University College of Medicine, Hershey, PA 17033, United States
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, United States
- Georgios A Pavlopoulos
- Institute for Fundamental Biomedical Research, BSRC “Alexander Fleming”, Vari 16672, Greece
- Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Molecular and Precision Medicine, The Pennsylvania State University College of Medicine, Hershey, PA 17033, United States
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, United States
2
Wang J, Turney A, Murray L, Craven AM, Bragger-Wilkinson P, dos Santos B, Martasek J, Desaphy J. BioRels' data infrastructure: a scientific schema and exchange standard to transform and enhance biological data sciences. Nucleic Acids Res 2025; 53:gkaf254. [PMID: 40183635 PMCID: PMC11969666 DOI: 10.1093/nar/gkaf254] [Received: 10/11/2024] [Revised: 01/31/2025] [Accepted: 03/20/2025] [Indexed: 04/05/2025]
Abstract
Our understanding of biology and medicinal sciences, augmented by advances in data structures and algorithms, has resulted in the proliferation of thousands of open-source resources, tools, and websites made by the scientific community to access, process, store, and visualize biological data. However, such data have become increasingly complex and heterogeneous, leading to an entangled web of relationships and external identifiers. Despite the emergence of infrastructure such as data lakes, scientists are still responsible for the time-consuming and costly exercise of finding, extracting, cleaning, preparing, and maintaining such data sources while following the FAIR principles. To better understand the complexity, we lay down a representation of the mainstream data ecosystem, describing the natural relationships and concepts found in biology. Building upon it and the fundamental principles of data unicity and atomicity, we introduce BioRels, an automated and standardized data preparation workstream that aims to improve reproducibility and speed for all scientists and handles up to 145 billion data points. BioRels allows complex querying across several data sources seamlessly and provides an exchange format, BIORJ, to export and import data with all its dependencies and metadata. Finally, we describe the advantages, limitations, applications, and perspectives of a future approach, BioRels-KB, to expand data preparation capabilities.
Affiliation(s)
- Jibo Wang
- Lilly Genetic Medicines, Eli Lilly and Company, Indianapolis, IN 46285, United States
- Amanda Turney
- Research-IDS, Eli Lilly and Company, Indianapolis, IN 46285, United States
- Lauren Murray
- Research-IDS, Eli Lilly and Company, Indianapolis, IN 46285, United States
- Andrew M Craven
- Tech@Lilly, Eli Lilly and Company, Indianapolis, IN 46285, United States
- Bruno dos Santos
- Lilly Genetic Medicines, Eli Lilly and Company, Indianapolis, IN 46285, United States
- Jaroslav Martasek
- Tech@Lilly, Eli Lilly and Company, Indianapolis, IN 46285, United States
- Jeremy Desaphy
- Lilly Genetic Medicines, Eli Lilly and Company, Indianapolis, IN 46285, United States
3
Nawrocki EP, Petrov AI, Williams KP. Expansion of the tmRNA sequence database and new tools for search and visualization. NAR Genom Bioinform 2025; 7:lqaf019. [PMID: 40104674 PMCID: PMC11915505 DOI: 10.1093/nargab/lqaf019] [Received: 01/21/2025] [Accepted: 02/19/2025] [Indexed: 03/20/2025]
Abstract
Transfer-messenger RNA (tmRNA) contributes essential tRNA-like and mRNA-like functions during the process of trans-translation, a mechanism of quality control for the translating bacterial ribosome. Proper tmRNA identification benefits the study of trans-translation and also the study of genomic islands, which frequently use the tmRNA gene as an integration site. Automated tmRNA gene identification tools are available, but manual inspection is still important for eliminating false positives. We have increased our database of precisely mapped tmRNA sequences over 50-fold to 97 179 unique sequences. Group I introns had previously been found integrated within a single subsite within the TψC-loop; they have now been identified at four distinct subsites, suggesting multiple founding events of invasion of tmRNA genes by group I introns, all in the same vicinity. tmRNA genes were found in metagenomic archaeal genomes, perhaps a result of misbinning of bacterial sequences during genome assembly. With the expanded database, we have produced new covariance models for improved tmRNA sequence search and new secondary structure visualization tools.
Affiliation(s)
- Eric P Nawrocki
- Division of Intramural Research, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, United States
- Anton I Petrov
- Riboscope Ltd, 23 King Street, Cambridge CB1 1AH, United Kingdom
- Kelly P Williams
- Sandia National Laboratories, Livermore, CA 94550, United States
4
Leitão MDC, Cabral LS, Piva LC, Queiroz PFDS, Gomes TG, de Andrade RV, Perez ALA, de Paiva KLR, Báo SN, Reis VCB, Moraes LMP, Togawa RC, Barros LMG, Torres FAG, Pappas Júnior GJ, Coelho CM. SHIP identifies genomic safe harbors in eukaryotic organisms using genomic general feature annotation. Sci Rep 2025; 15:7193. [PMID: 40021804 PMCID: PMC11871141 DOI: 10.1038/s41598-025-91249-9] [Received: 11/06/2024] [Accepted: 02/19/2025] [Indexed: 03/03/2025]
Abstract
Integrating foreign genes into loci, allowing their transcription without affecting endogenous gene expression, is the desirable strategy in genomic engineering. However, these loci, known as genomic safe harbors (GSHs), have been mainly identified by empirical methods. Furthermore, the most prominent available GSHs are localized within regions of high gene density, raising concerns about unstable expression. As synthetic biology is moving towards investigating polygenic modules rather than single genes, there is an increasing demand for tools to identify GSHs systematically. To expand the GSH repertoire, we present SHIP, an algorithm designed to detect potential GSHs in eukaryotes. Using the chassis organism Saccharomyces cerevisiae, five GSHs were experimentally curated based on data from DNA sequencing, stability, flow cytometry, qPCR, electron microscopy, RT-qPCR, and RNA-Seq assays. Our study places SHIP as a valuable tool for providing a list of promising candidates to assist in the experimental assessment of GSHs in eukaryotic organisms with available annotated genomes.
Affiliation(s)
- Matheus de Castro Leitão
- Department of Genetics and Morphology, University of Brasilia, Brasilia, Brazil
- Department of Cell Biology, University of Brasilia, Brasilia, Brazil
- Luiza Cesca Piva
- Department of Cell Biology, University of Brasilia, Brasilia, Brazil
- Taísa Godoy Gomes
- Department of Microbiology, University of Brasilia, Brasilia, Brazil
- Sônia Nair Báo
- Department of Cell Biology, University of Brasilia, Brasilia, Brazil
5
Sayers EW, Cavanaugh M, Frisse L, Pruitt KD, Schneider VA, Underwood B, Yankie L, Karsch-Mizrachi I. GenBank 2025 update. Nucleic Acids Res 2025; 53:D56-D61. [PMID: 39558184 PMCID: PMC11701615 DOI: 10.1093/nar/gkae1114] [Received: 09/15/2024] [Revised: 10/23/2024] [Accepted: 10/28/2024] [Indexed: 11/20/2024]
Abstract
GenBank® (https://www.ncbi.nlm.nih.gov/genbank/) is a comprehensive, public data repository that contains 34 trillion base pairs from over 4.7 billion nucleotide sequences for 581 000 formally described species. Daily data exchange with the European Nucleotide Archive and the DNA Data Bank of Japan ensures worldwide coverage. We summarize the content of the database in 2025 and recent updates such as accelerated processing of influenza sequences and the ability to upload feature tables to Submission Portal for messenger RNA sequences. We provide an overview of the web, application programming and command-line interfaces that allow users to access GenBank data. We also discuss the importance of creating BioProject and BioSample records during submissions, particularly for viruses and metagenomes. Finally, we summarize educational materials and recent community outreach efforts.
Affiliation(s)
- Eric W Sayers
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Mark Cavanaugh
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Linda Frisse
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Kim D Pruitt
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Valerie A Schneider
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Beverly A Underwood
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Linda Yankie
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Ilene Karsch-Mizrachi
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
6
Rigden DJ, Fernández XM. The 2025 Nucleic Acids Research database issue and the online molecular biology database collection. Nucleic Acids Res 2025; 53:D1-D9. [PMID: 39658041 PMCID: PMC11701706 DOI: 10.1093/nar/gkae1220] [Received: 11/26/2024] [Accepted: 11/26/2024] [Indexed: 12/12/2024]
Abstract
The 2025 Nucleic Acids Research database issue contains 185 papers spanning biology and related areas. Seventy-three new databases are covered, while resources previously described in the issue account for 101 update articles. Databases most recently published elsewhere account for a further 11 papers. Nucleic acid databases include EXPRESSO for multi-omics of 3D genome structure (this issue's chosen Breakthrough Resource and Article) and NAIRDB for Fourier transform infrared data. New protein databases include structure predictions for human isoforms at ASpdb and for viral proteins at BFVD. UniProt, Pfam and InterPro have all provided updates; metabolism and signalling are covered by new descriptions of STRING, KEGG and CAZy, while updated microbe-oriented databases include Enterobase, VFDB and PHI-base. Biomedical research is supported by, among others, ClinVar, PubChem and DrugMAP. Genomics-related resources include Ensembl, UCSC Genome Browser and dbSNP. New plant databases cover the Solanaceae (SolR) and Asteraceae (AMIR) families, while an update from NCBI Taxonomy also features. The Database Issue is freely available on the Nucleic Acids Research website (https://academic.oup.com/nar). At the NAR online Molecular Biology Database Collection (http://www.oxfordjournals.org/nar/database/c/), 932 entries have been reviewed in the last year, 74 new resources added and 226 discontinued URLs eliminated, bringing the current total to 2236 databases.
Affiliation(s)
- Daniel J Rigden
- Department of Biochemistry, Cell and Systems Biology, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Crown Street, Liverpool L69 7ZB, UK
7
Sayers E, Beck J, Bolton E, Brister J, Chan J, Connor R, Feldgarden M, Fine A, Funk K, Hoffman J, Kannan S, Kelly C, Klimke W, Kim S, Lathrop S, Marchler-Bauer A, Murphy T, O’Sullivan C, Schmieder E, Skripchenko Y, Stine A, Thibaud-Nissen F, Wang J, Ye J, Zellers E, Schneider V, Pruitt K. Database resources of the National Center for Biotechnology Information in 2025. Nucleic Acids Res 2025; 53:D20-D29. [PMID: 39526373 PMCID: PMC11701734 DOI: 10.1093/nar/gkae979] [Received: 09/16/2024] [Revised: 10/09/2024] [Accepted: 10/17/2024] [Indexed: 11/16/2024]
Abstract
The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence repository and the PubMed® repository of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 31 distinct repositories and knowledgebases. The E-utilities serve as the programming interface for most of these. Resources receiving significant updates in the past year include PubMed, PubMed Central, Bookshelf, the NIH Comparative Genomics Resource, BLAST, Sequence Read Archive, Taxonomy, iCn3D, Conserved Domain Database, Pathogen Detection, antimicrobial resistance resources and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
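As an illustration of the E-utilities programming interface mentioned above, the following minimal sketch builds an ESummary request URL for PubMed records. It only constructs the URL and performs no network call; the helper name `esummary_url` is hypothetical, and the two PMIDs are taken from entries in this list.

```python
from typing import Sequence
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esummary_url(db: str, ids: Sequence[str], retmode: str = "json") -> str:
    """Build an ESummary URL for the given database and record identifiers.

    The helper name is hypothetical; db, id, and retmode are standard
    E-utilities query parameters.
    """
    query = urlencode({"db": db, "id": ",".join(ids), "retmode": retmode})
    return f"{EUTILS_BASE}/esummary.fcgi?{query}"


# PMIDs of two articles listed in this document.
url = esummary_url("pubmed", ["39526373", "39558184"])
```

Fetching `url` with any HTTP client should return document summaries for the listed PMIDs; for heavier use, NCBI's usage guidelines on request rates and API keys apply.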
Affiliation(s)
- Eric W Sayers
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Jeffrey Beck
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Evan E Bolton
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- J Rodney Brister
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Jessica Chan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Ryan Connor
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Michael Feldgarden
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Anna M Fine
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Kathryn Funk
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Jinna Hoffman
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Sivakumar Kannan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Christopher Kelly
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- William Klimke
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Sunghwan Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Stacy Lathrop
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Aron Marchler-Bauer
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Terence D Murphy
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Chris O’Sullivan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Erin Schmieder
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Yuriy Skripchenko
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Adam Stine
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Francoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Jiyao Wang
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Jian Ye
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Erin Zellers
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Valerie A Schneider
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Kim D Pruitt
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA