1
Mouratidis I, Konnaris MA, Chantzi N, Chan CSY, Patsakis M, Provatas K, Montgomery A, Baltoumas FA, Sha CM, Mareboina M, Pavlopoulos GA, Chartoumpekis DV, Georgakopoulos-Soares I. Identification of the shortest species-specific oligonucleotide sequences. Genome Res 2025; 35:279-295. [PMID: 39746719 PMCID: PMC11874967 DOI: 10.1101/gr.280070.124] [Received: 10/07/2024] [Accepted: 11/27/2024] [Indexed: 01/04/2025]
Abstract
Despite the exponential increase in sequencing information driven by massively parallel DNA sequencing technologies, universal and succinct genomic fingerprints for each organism are still missing. Identifying the shortest species-specific nucleotide sequences offers insights into species evolution and holds potential practical applications in agriculture, wildlife conservation, and healthcare. We propose a new method for sequence analysis termed nucleic "quasi-primes," the shortest occurring sequences in each of 45,076 organismal reference genomes, present in one genome and absent from every other examined genome. In the human genome, we find that the genomic loci of nucleic quasi-primes are most enriched for genes associated with brain development and cognitive function. In a single-cell case study focusing on the human primary motor cortex, nucleic quasi-prime genes account for a significantly larger proportion of the variation based on average gene expression. Nonneuronal cell types, including astrocytes, endothelial cells, microglia perivascular-macrophages, oligodendrocytes, and vascular and leptomeningeal cells, exhibit significant activation of quasi-prime-containing gene associations related to cancer, while simultaneously suppressing quasi-prime-containing genes associated with cognitive, mental, and developmental disorders. We also show that human disease-causing variants, eQTLs, mQTLs, and sQTLs are 4.43-fold, 4.34-fold, 4.29-fold, and 4.21-fold enriched at human quasi-prime loci, respectively. These findings indicate that nucleic quasi-primes are genomic loci linked to the evolution of species-specific traits, and in humans, they provide insights into the development of cognitive traits and human diseases, including neurodevelopmental disorders.
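The quasi-prime definition in this abstract (the shortest sequences present in one genome and absent from every other examined genome) can be illustrated with a minimal sketch. The function names, the toy sequences, and the `max_k` cutoff are hypothetical and chosen only for illustration; the sketch ignores reverse complements and is a brute-force approach that would not scale to tens of thousands of reference genomes, so it should not be read as the authors' implementation.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def quasi_primes(genomes, max_k=16):
    """For each genome, find the smallest k that yields k-mers unique to it.

    genomes: dict mapping a name to a DNA string (toy input, hypothetical).
    Returns a dict mapping name -> (k, sorted list of unique k-mers).
    Genomes with no unique k-mer up to max_k are left out of the result.
    """
    result = {}
    for k in range(1, max_k + 1):
        sets = {name: kmers(seq, k) for name, seq in genomes.items()}
        # Count in how many genomes each k-mer is present.
        presence = Counter(km for s in sets.values() for km in s)
        for name, s in sets.items():
            if name in result:
                continue  # a shorter unique k-mer was already found
            unique = sorted(km for km in s if presence[km] == 1)
            if unique:
                result[name] = (k, unique)
        if len(result) == len(genomes):
            break
    return result


toy = {"a": "ACGTAC", "b": "ACGGAC", "c": "TTACGA"}
print(quasi_primes(toy))
```

In the toy input every single nucleotide is shared by at least two sequences, so no sequence has a unique 1-mer; the first unique k-mers appear at k = 2.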
Affiliation(s)
- Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Maxwell A Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Candace S Y Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California 94143, USA
- Michail Patsakis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- National Technical University of Athens, School of Electrical and Computer Engineering, Athens 15772, Greece
- Kimonas Provatas
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- National Technical University of Athens, School of Electrical and Computer Engineering, Athens 15772, Greece
- Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Fotis A Baltoumas
- Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming," Vari 16672, Greece
- Congzhou M Sha
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
- Georgios A Pavlopoulos
- Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming," Vari 16672, Greece
- Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, Athens 11527, Greece
- Dionysios V Chartoumpekis
- Service of Endocrinology, Diabetology and Metabolism, Lausanne University Hospital, 1005 Lausanne, Switzerland
- Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, Pennsylvania 17033, USA
2
Biane C, Hampikian G, Kirgizov S, Nurligareev K. Endhered Patterns in Matchings and RNA. J Comput Biol 2025; 32:28-46. [PMID: 39714916 DOI: 10.1089/cmb.2024.0658] [Indexed: 12/24/2024]
Abstract
An endhered (end-adhered) pattern is a subset of arcs in matchings, such that the corresponding starting points are consecutive, and the same holds for the ending points. Such patterns are in one-to-one correspondence with the permutations. We focus on the occurrence frequency of such patterns in matchings and native (real-world) RNA structures with pseudoknots. We present combinatorial results related to the distribution and asymptotic behavior of the pattern 21, which corresponds to two consecutive base pairs frequently encountered in RNA, and the pattern 12, representing the archetypal minimal pseudoknot. We show that in matchings these two patterns are equidistributed, which is quite different from what we can find in native RNAs. We also examine the distribution of endhered patterns of size 3, showing how the patterns change under the transformation called endhered twist. Finally, we compute the distributions of endhered patterns of size 2 and 3 in native secondary RNA structures with pseudoknots and discuss possible outcomes of our study.
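The size-2 patterns described in this abstract can be checked by brute force on small cases. The sketch below assumes that "matchings" means perfect matchings on the points 1..2n, and it reads the definition as follows: pattern 12 is a pair of arcs (i, j), (i+1, j+1), which cross, and pattern 21 is a pair of arcs (i, j), (i+1, j-1), which are nested. All function names are hypothetical, and exhaustive enumeration is only practical for small n.

```python
from collections import Counter


def perfect_matchings(points):
    """Yield every perfect matching of a sorted list of points.

    Each matching is a list of arcs (i, j) with i < j.
    """
    if not points:
        yield []
        return
    first, rest = points[0], points[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail


def count_size2_patterns(arcs):
    """Return (occurrences of pattern 12, occurrences of pattern 21)."""
    end_of = dict(arcs)  # maps each starting point to its ending point
    c12 = c21 = 0
    for start, end in arcs:
        next_end = end_of.get(start + 1)  # None if start + 1 is not a start
        if next_end is None:
            continue
        if next_end == end + 1:
            c12 += 1  # consecutive starts, consecutive ends, arcs cross
        elif next_end == end - 1:
            c21 += 1  # consecutive starts, consecutive ends, arcs nest
    return c12, c21


def pattern_distributions(n):
    """Tally, over all perfect matchings on 2n points, how many matchings
    contain each number of occurrences of patterns 12 and 21."""
    dist12, dist21 = Counter(), Counter()
    for matching in perfect_matchings(list(range(1, 2 * n + 1))):
        c12, c21 = count_size2_patterns(matching)
        dist12[c12] += 1
        dist21[c21] += 1
    return dist12, dist21


dist12, dist21 = pattern_distributions(3)
print(dist12 == dist21)
```

For n = 3 the two tallies coincide over the 15 perfect matchings, which is consistent with the equidistribution stated in the abstract; this is a small-case check, not a proof.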
Affiliation(s)
- Célia Biane
- Laboratoire d'Informatique de Bourgogne, Université de Bourgogne, Dijon Cedex, France
- Sergey Kirgizov
- Laboratoire d'Informatique de Bourgogne, Université de Bourgogne, Dijon Cedex, France
- Khaydar Nurligareev
- Laboratoire d'Informatique de Bourgogne, Université de Bourgogne, Dijon Cedex, France
3
Chantzi N, Mareboina M, Konnaris MA, Montgomery A, Patsakis M, Mouratidis I, Georgakopoulos-Soares I. The determinants of the rarity of nucleic and peptide short sequences in nature. NAR Genom Bioinform 2024; 6:lqae029. [PMID: 38584871 PMCID: PMC10993293 DOI: 10.1093/nargab/lqae029] [Received: 09/12/2023] [Revised: 02/21/2024] [Accepted: 03/18/2024] [Indexed: 04/09/2024]
Abstract
The prevalence of nucleic and peptide short sequences across organismal genomes and proteomes has not been thoroughly investigated. We examined 45 785 reference genomes and 21 871 reference proteomes, spanning archaea, bacteria, eukaryotes and viruses to calculate the rarity of short sequences in them. To capture this, we developed a metric of the rarity of each sequence in nature, the rarity index. We find that the frequency of certain dipeptides in rare oligopeptide sequences is hundreds of times lower than expected, which is not the case for any dinucleotides. We also generate predictive regression models that infer the rarity of nucleic and proteomic sequences across nature or within each domain of life and viruses separately. When examining each of the three domains of life and viruses separately, the R² performance of the model predicting rarity for 5-mer peptides from mono- and dipeptides ranged between 0.814 and 0.932. A separate model predicting rarity for 10-mer oligonucleotides from mono- and dinucleotides achieved R² performance between 0.408 and 0.606. Our results indicate that the mono- and dinucleotide composition of nucleic sequences and the mono- and dipeptide composition of peptide sequences can explain a significant proportion of the variance in their frequencies in nature.
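The abstract describes regression models that predict rarity from mono- and dinucleotide composition. The sketch below covers only the input side of such a model: turning an oligonucleotide into a vector of 4 mononucleotide counts followed by 16 dinucleotide counts. The feature layout and the function name are assumptions made for illustration; the abstract does not specify the encoding or define the rarity index, so no target value or model is computed here.

```python
from itertools import product

BASES = "ACGT"
# The 16 dinucleotides in lexicographic order: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(pair) for pair in product(BASES, repeat=2)]


def composition_features(kmer):
    """Return 4 mononucleotide counts followed by 16 dinucleotide counts."""
    mono = [kmer.count(base) for base in BASES]
    # Count dinucleotides over every overlapping position; str.count would
    # skip overlapping matches such as the second "AA" in "AAA".
    di = [
        sum(kmer[i:i + 2] == dinuc for i in range(len(kmer) - 1))
        for dinuc in DINUCLEOTIDES
    ]
    return mono + di


features = composition_features("AAACG")
print(len(features))
```

A vector of this kind could then be passed to any standard regression routine, with the measured rarity of each sequence as the response.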
Affiliation(s)
- Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Maxwell A Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Department of Statistics, Penn State University, University Park, PA, 16802, USA
- Huck Institutes of the Life Sciences, Penn State University, University Park, PA, 16802, USA
- Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Michail Patsakis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Huck Institutes of the Life Sciences, Penn State University, University Park, PA, 16802, USA
- Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA