1
|
Bukhnikashvili L. Overlaps Between CDS Regions of Protein-Coding Genes in the Human Genome: A Case Study on the NR1D1-THRA Gene Pair. J Mol Evol 2023; 91:963-975. [PMID: 38006429 DOI: 10.1007/s00239-023-10147-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2023] [Accepted: 11/12/2023] [Indexed: 11/27/2023]
Abstract
For several decades, it has been known that a substantial number of genes within human DNA exhibit overlap; however, the biological and evolutionary significance of these overlaps remain poorly understood. This study focused on investigating specific instances of overlap where the overlapping DNA region encompasses the coding DNA sequences (CDSs) of protein-coding genes. The results revealed that proteins encoded by overlapping CDSs exhibit greater disorder than those from nonoverlapping CDSs. Additionally, these DNA regions were identified as GC-rich. This could be partially attributed to the absence of stop codons from two distinct reading frames rather than one. Furthermore, these regions were found to harbour fewer single-nucleotide polymorphism (SNP) sites, possibly due to constraints arising from the overlapping state where mutations could affect two genes simultaneously.While elucidating these properties, the NR1D1-THRA gene pair emerged as an exceptional case with highly structured proteins and a distinctly conserved sequence across eutherian mammals. Both NR1D1 and THRA are nuclear receptors lacking a ligand-binding domain at their C-terminus, which is the region where these gene pairs overlap. The NR1D1 gene is involved in the regulation of circadian rhythm, while the THRA gene encodes a thyroid hormone receptor, and both play crucial roles in various physiological processes. This study suggests that, in addition to their well-established functions, the specifically overlapping CDS regions of these genes may encode protein segments with additional, yet undiscovered, biological roles.
Collapse
|
2
|
New Genomic Signals Underlying the Emergence of Human Proto-Genes. Genes (Basel) 2022; 13:genes13020284. [PMID: 35205330 PMCID: PMC8871994 DOI: 10.3390/genes13020284] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2021] [Revised: 01/20/2022] [Accepted: 01/24/2022] [Indexed: 12/04/2022] Open
Abstract
De novo genes are novel genes which emerge from non-coding DNA. Until now, little is known about de novo genes’ properties, correlated to their age and mechanisms of emergence. In this study, we investigate four related properties: introns, upstream regulatory motifs, 5′ Untranslated regions (UTRs) and protein domains, in 23,135 human proto-genes. We found that proto-genes contain introns, whose number and position correlates with the genomic position of proto-gene emergence. The origin of these introns is debated, as our results suggest that 41% of proto-genes might have captured existing introns, and 13.7% of them do not splice the ORF. We show that proto-genes which emerged via overprinting tend to be more enriched in core promotor motifs, while intergenic and intronic genes are more enriched in enhancers, even if the TATA motif is most commonly found upstream in these genes. Intergenic and intronic 5′ UTRs of proto-genes have a lower potential to stabilise mRNA structures than exonic proto-genes and established human genes. Finally, we confirm that proteins expressed by proto-genes gain new putative domains with age. Overall, we find that regulatory motifs inducing transcription and translation of previously non-coding sequences may facilitate proto-gene emergence. Our study demonstrates that introns, 5′ UTRs, and domains have specific properties in proto-genes. We also emphasize that the genomic positions of de novo genes strongly impacts these properties.
Collapse
|
3
|
Abstract
Modern genome-scale methods that identify new genes, such as proteogenomics and ribosome profiling, have revealed, to the surprise of many, that overlap in genes, open reading frames and even coding sequences is widespread and functionally integrated into prokaryotic, eukaryotic and viral genomes. In parallel, the constraints that overlapping regions place on genome sequences and their evolution can be harnessed in bioengineering to build more robust synthetic strains and constructs. With a focus on overlapping protein-coding and RNA-coding genes, this Review examines their discovery, topology and biogenesis in the context of their genome biology. We highlight exciting new uses for sequence overlap to control translation, compress synthetic genetic constructs, and protect against mutation.
Collapse
|
4
|
Rosikiewicz W, Sikora J, Skrzypczak T, Kubiak MR, Makałowska I. Promoter switching in response to changing environment and elevated expression of protein-coding genes overlapping at their 5' ends. Sci Rep 2021; 11:8984. [PMID: 33903630 PMCID: PMC8076222 DOI: 10.1038/s41598-021-87970-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2020] [Accepted: 04/07/2021] [Indexed: 11/09/2022] Open
Abstract
Despite the number of studies focused on sense-antisense transcription, the key question of whether such organization evolved as a regulator of gene expression or if this is only a byproduct of other regulatory processes has not been elucidated to date. In this study, protein-coding sense-antisense gene pairs were analyzed with a particular focus on pairs overlapping at their 5' ends. Analyses were performed in 73 human transcription start site libraries. The results of our studies showed that the overlap between genes is not a stable feature and depends on which TSSs are utilized in a given cell type. An analysis of gene expression did not confirm that overlap between genes causes downregulation of their expression. This observation contradicts earlier findings. In addition, we showed that the switch from one promoter to another, leading to genes overlap, may occur in response to changing environment of a cell or tissue. We also demonstrated that in transfected and cancerous cells genes overlap is observed more often in comparison with normal tissues. Moreover, utilization of overlapping promoters depends on particular state of a cell and, at least in some groups of genes, is not merely coincidental.
Collapse
Affiliation(s)
- Wojciech Rosikiewicz
- Center for Applied Bioinformatics, St. Jude Children's Research Hospital, Memphis, TN, USA
| | - Jarosław Sikora
- Institute of Human Biology and Evolution, Faculty of Biology, Adam Mickiewicz University, Poznań, Poland
| | - Tomasz Skrzypczak
- Institute of Human Biology and Evolution, Faculty of Biology, Adam Mickiewicz University, Poznań, Poland
- Center for Advanced Technology, Adam Mickiewicz University, Poznań, Poland
| | - Magdalena R Kubiak
- Institute of Human Biology and Evolution, Faculty of Biology, Adam Mickiewicz University, Poznań, Poland
| | - Izabela Makałowska
- Institute of Human Biology and Evolution, Faculty of Biology, Adam Mickiewicz University, Poznań, Poland.
| |
Collapse
|
5
|
Nelson CW, Ardern Z, Wei X. OLGenie: Estimating Natural Selection to Predict Functional Overlapping Genes. Mol Biol Evol 2021; 37:2440-2449. [PMID: 32243542 PMCID: PMC7531306 DOI: 10.1093/molbev/msaa087] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open
Abstract
Purifying (negative) natural selection is a hallmark of functional biological sequences, and can be detected in protein-coding genes using the ratio of nonsynonymous to synonymous substitutions per site (dN/dS). However, when two genes overlap the same nucleotide sites in different frames, synonymous changes in one gene may be nonsynonymous in the other, perturbing dN/dS. Thus, scalable methods are needed to estimate functional constraint specifically for overlapping genes (OLGs). We propose OLGenie, which implements a modification of the Wei–Zhang method. Assessment with simulations and controls from viral genomes (58 OLGs and 176 non-OLGs) demonstrates low false-positive rates and good discriminatory ability in differentiating true OLGs from non-OLGs. We also apply OLGenie to the unresolved case of HIV-1’s putative antisense protein gene, showing significant purifying selection. OLGenie can be used to study known OLGs and to predict new OLGs in genome annotation. Software and example data are freely available at https://github.com/chasewnelson/OLGenie (last accessed April 10, 2020).
Collapse
Affiliation(s)
- Chase W Nelson
- Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY.,Biodiversity Research Center, Academia Sinica, Taipei, Taiwan
| | - Zachary Ardern
- Microbial Ecology, ZIEL-Institute for Food & Health, Technische Universität München, Freising, Germany
| | - Xinzhu Wei
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI.,Department of Integrative Biology and Statistics, University of California, Berkeley, CA
| |
Collapse
|
6
|
Yin C, Zhu B, Li X, Goding CR, Cui R. A Reply to ''Evidence that STK19 Is Not an NRAS-Dependent Melanoma Driver". Cell 2020; 181:1406-1409.e2. [PMID: 32531246 DOI: 10.1016/j.cell.2020.04.029] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2020] [Revised: 04/02/2020] [Accepted: 04/16/2020] [Indexed: 01/01/2023]
Affiliation(s)
- Chengqian Yin
- Department of Pharmacology and Experimental Therapeutics, Boston University School of Medicine, Boston, MA 02118, USA.
| | - Bo Zhu
- Department of Pharmacology and Experimental Therapeutics, Boston University School of Medicine, Boston, MA 02118, USA
| | - Xin Li
- Department of Pharmacology and Experimental Therapeutics, Boston University School of Medicine, Boston, MA 02118, USA
| | - Colin R Goding
- Ludwig Institute for Cancer Research, University of Oxford, Headington, Oxford OX3 7DQ, UK
| | - Rutao Cui
- Department of Pharmacology and Experimental Therapeutics, Boston University School of Medicine, Boston, MA 02118, USA.
| |
Collapse
|
7
|
Li Y, Chen X, Wu K, Pan J, Long H, Yan Y. Characterization of Simple Sequence Repeats (SSRs) in Ciliated Protists Inferred by Comparative Genomics. Microorganisms 2020; 8:microorganisms8050662. [PMID: 32370063 PMCID: PMC7285179 DOI: 10.3390/microorganisms8050662] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2020] [Revised: 04/24/2020] [Accepted: 04/26/2020] [Indexed: 01/02/2023] Open
Abstract
Simple sequence repeats (SSRs) are prevalent in the genomes of all organisms. They are widely used as genetic markers, and are insertion/deletion mutation hotspots, which directly influence genome evolution. However, little is known about such important genomic components in ciliated protists, a large group of unicellular eukaryotes with extremely long evolutionary history and genome diversity. With recent publications of multiple ciliate genomes, we start to get a chance to explore perfect SSRs with motif size 1-100 bp and at least three motif repeats in nine species of two ciliate classes, Oligohymenophorea and Spirotrichea. We found that homopolymers are the most prevalent SSRs in these A/T-rich species, with AAA (lysine, charged amino acid; also seen as an SSR with one-adenine motif repeated three times) being the codons repeated at the highest frequencies in coding SSR regions, consistent with the widespread alveolin proteins rich in lysine repeats as found in Tetrahymena. Micronuclear SSRs are universally more abundant than the macronuclear ones of the same motif-size, except for the 8-bp-motif SSRs in extensively fragmented chromosomes. Both the abundance and A/T content of SSRs decrease as motif-size increases, while the abundance is positively correlated with the A/T content of the genome. Also, smaller genomes have lower proportions of coding SSRs out of all SSRs in Paramecium species. This genome-wide and cross-species analysis reveals the high diversity of SSRs and reflects the rapid evolution of these simple repetitive elements in ciliate genomes.
Collapse
|
8
|
Affiliation(s)
- Stephen Branden Van Oss
- Department of Computational and Systems Biology, Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
| | - Anne-Ruxandra Carvunis
- Department of Computational and Systems Biology, Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
| |
Collapse
|
9
|
The Evolution and Expression Pattern of Human Overlapping lncRNA and Protein-coding Gene Pairs. Sci Rep 2017; 7:42775. [PMID: 28344339 PMCID: PMC5366806 DOI: 10.1038/srep42775] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2016] [Accepted: 01/13/2017] [Indexed: 12/27/2022] Open
Abstract
Long non-coding RNA overlapping with protein-coding gene (lncRNA-coding pair) is a special type of overlapping genes. Protein-coding overlapping genes have been well studied and increasing attention has been paid to lncRNAs. By studying lncRNA-coding pairs in human genome, we showed that lncRNA-coding pairs were more likely to be generated by overprinting and retaining genes in lncRNA-coding pairs were given higher priority than non-overlapping genes. Besides, the preference of overlapping configurations preserved during evolution was based on the origin of lncRNA-coding pairs. Further investigations showed that lncRNAs promoting the splicing of their embedded protein-coding partners was a unilateral interaction, but the existence of overlapping partners improving the gene expression was bidirectional and the effect was decreased with the increased evolutionary age of genes. Additionally, the expression of lncRNA-coding pairs showed an overall positive correlation and the expression correlation was associated with their overlapping configurations, local genomic environment and evolutionary age of genes. Comparison of the expression correlation of lncRNA-coding pairs between normal and cancer samples found that the lineage-specific pairs including old protein-coding genes may play an important role in tumorigenesis. This work presents a systematically comprehensive understanding of the evolution and the expression pattern of human lncRNA-coding pairs.
Collapse
|
10
|
Saha D, Podder S, Ghosh TC. Overlapping Regions in HIV-1 Genome Act as Potential Sites for Host-Virus Interaction. Front Microbiol 2016; 7:1735. [PMID: 27867372 PMCID: PMC5095123 DOI: 10.3389/fmicb.2016.01735] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2016] [Accepted: 10/17/2016] [Indexed: 01/05/2023] Open
Abstract
More than a decade, overlapping genes in RNA viruses became a subject of research which has explored various effect of gene overlapping on the evolution and function of viral genomes like genome size compaction. Additionally, overlapping regions (OVRs) are also reported to encode elevated degree of protein intrinsic disorder (PID) in unspliced RNA viruses. With the aim to explore the roles of OVRs in HIV-1 pathogenesis, we have carried out an in-depth analysis on the association of gene overlapping with PID in 35 HIV1- M subtypes. Our study reveals an over representation of PID in OVR of HIV-1 genomes. These disordered residues endure several vital, structural features like short linear motifs (SLiMs) and protein phosphorylation (PP) sites which are previously shown to be involved in massive host–virus interaction. Moreover, SLiMs in OVRs are noticed to be more functionally potential as compared to that of non-overlapping region. Although, density of experimentally verified SLiMs, resided in 9 HIV-1 genes, involved in host–virus interaction do not show any bias toward clustering into OVR, tat and rev two important proteins mediates host–pathogen interaction by their experimentally verified SLiMs, which are mostly localized in OVR. Finally, our analysis suggests that the acquisition of SLiMs in OVR is mutually exclusive of the occurrence of disordered residues, while the enrichment of PPs in OVR is solely dependent on PID and not on overlapping coding frames. Thus, OVRs of HIV-1 genomes could be demarcated as potential molecular recognition sites during host–virus interaction.
Collapse
Affiliation(s)
- Deeya Saha
- Bioinformatics Centre, Bose Institute Kolkata, India
| | - Soumita Podder
- Department of Microbiology, Raiganj University Raiganj, India
| | | |
Collapse
|
11
|
The combinatorics of overlapping genes. J Theor Biol 2016; 415:90-101. [PMID: 27737786 DOI: 10.1016/j.jtbi.2016.09.018] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2016] [Revised: 08/31/2016] [Accepted: 09/22/2016] [Indexed: 11/23/2022]
Abstract
Overlapping genes exist in all domains of life and are much more abundant than expected upon their first discovery in the late 1970s. Assuming that the reference gene is read in frame +0, an overlapping gene can be encoded in two reading frames in the sense strand, denoted by +1 and +2, and in three reading frames in the opposite strand, denoted by -0, -1, and -2. This motivated numerous researchers to study the constraints induced by the genetic code on the various overlapping frames, mostly based on information theory. Our focus in this paper is on the constraints induced on two overlapping genes in terms of amino acids, as well as polypeptides. We show that simple linear constraints bind the amino-acid composition of two proteins encoded by overlapping genes. Novel constraints are revealed when polypeptides are considered, and not just single amino acids. For example, in double-coding sequences with an overlapping reading frame -2, each Tyrosine (denoted as Tyr or Y) in the overlapping frame overlaps a Tyrosine in the reference frame +0 (and reciprocally), whereas specific words (e.g. YY) never occur. We thus distinguish between null constraints (YY = 0 in frame -2) and non-null constraints (Y in frame +0 ⇔ Y in frame -2). Our equivalence-based constraints are symmetrical and thus enable the characterization of the joint composition of overlapping proteins. We describe several formal frameworks and a graph algorithm to characterize and compute these constraints. As expected, the degrees of freedom left by these constraints vary drastically among the different overlapping frames. Interestingly, the biological meaning of constraints induced on two overlapping proteins (hydropathy, forbidden di-peptides, expected overlap length …) is also specific to the reading frame. We study the combinatorics of these constraints for overlapping polypeptides of length n, pointing out that, (i) except for frame -2, non-null constraints are deduced from the amino-acid (length = 1) constraints and (ii) null constraints are deduced from the di-peptide (length = 2) constraints. These results yield support for understanding the mechanisms and evolution of overlapping genes, and for developing novel overlapping gene detection methods.
Collapse
|
12
|
Guerzoni D, McLysaght A. De Novo Genes Arise at a Slow but Steady Rate along the Primate Lineage and Have Been Subject to Incomplete Lineage Sorting. Genome Biol Evol 2016; 8:1222-32. [PMID: 27056411 PMCID: PMC4860702 DOI: 10.1093/gbe/evw074] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022] Open
Abstract
De novo protein-coding gene origination is increasingly recognized as an important evolutionary mechanism. However, there remains a large amount of uncertainty regarding the frequency of these events and the mechanisms and speed of gene establishment. Here, we describe a rigorous search for cases of de novo gene origination in the great apes. We analyzed annotated proteomes as well as full genomic DNA and transcriptional and translational evidence. It is notable that results vary between database updates due to the fluctuating annotation of these genes. Nonetheless we identified 35 de novo genes: 16 human-specific; 5 human and chimpanzee specific; and 14 that originated prior to the divergence of human, chimpanzee, and gorilla and are found in all three genomes. The taxonomically restricted distribution of these genes cannot be explained by loss in other lineages. Each gene is supported by an open reading frame-creating mutation that occurred within the primate lineage, and which is not polymorphic in any species. Similarly to previous studies we find that the de novo genes identified are short and frequently located near pre-existing genes. Also, they may be associated with Alu elements and prior transcription and RNA-splicing at the locus. Additionally, we report the first case of apparent independent lineage sorting of a de novo gene. The gene is present in human and gorilla, whereas chimpanzee has the ancestral noncoding sequence. This indicates a long period of polymorphism prior to fixation and thus supports a model where de novo genes may, at least initially, have a neutral effect on fitness.
Collapse
Affiliation(s)
- Daniele Guerzoni
- Smurfit Institute of Genetics, Department of Genetics, Trinity College Dublin, University of Dublin, Ireland
| | - Aoife McLysaght
- Smurfit Institute of Genetics, Department of Genetics, Trinity College Dublin, University of Dublin, Ireland
| |
Collapse
|
13
|
Overlapping genes: a new strategy of thermophilic stress tolerance in prokaryotes. Extremophiles 2014; 19:345-53. [PMID: 25503326 DOI: 10.1007/s00792-014-0720-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2014] [Accepted: 12/01/2014] [Indexed: 12/29/2022]
Abstract
Overlapping genes (OGs) draw the focus of recent day's research. However, the significance of OGs in prokaryotic genomes remained unexplored. As an adaptation to high temperature, thermophiles were shown to eliminate their intergenic regions. Therefore, it could be possible that prokaryotes would increase their OG content to adapt to high temperature. To test this hypothesis, we carried out a comparative study on OG frequency of 256 prokaryotic genomes comprising both thermophiles and non-thermophiles. It was found that thermophiles exhibit higher frequency of overlapping genes than non-thermophiles. Moreover, overlap frequency was found to correlate with optimal growth temperature (OGT) in prokaryotes. Long overlap frequency was found to hold a positive correlation with OGT resulting in an abundance of long overlaps in thermophiles compared to non-thermophiles. On the other hand, short overlap (1-4 nucleotides) frequency (SOF) did not yield any direct correlation with OGT. However, the correlation of SOF with CAIavg (extent of variation of codon usage bias measured as the mean of codon adaptation index of all genes in a given genome) and IG% (proportion of intergenic regions) indicate that they might upregulate the aforementioned factors (CAIavg and IG%) which are already known to be vital forces for thermophilic adaptation. From these evidences, we propose that the OG content bears a strong link to thermophily. Long overlaps are important for their genome compaction and short overlaps are important to uphold high CAIavg. Our findings will surely help in better understanding of the significance of overlapping gene content in prokaryotic genomes.
Collapse
|
14
|
|
15
|
Origin and length distribution of unidirectional prokaryotic overlapping genes. G3-GENES GENOMES GENETICS 2014; 4:19-27. [PMID: 24192837 PMCID: PMC3887535 DOI: 10.1534/g3.113.005652] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Prokaryotic unidirectional overlapping genes can be originated by disrupting and replacing of the start or stop codon of one protein-coding gene with another start or stop codon within the adjacent gene. However, the probability of disruption and replacement of a start or stop codon may differ significantly depending on the number and redundancy of the start and stop codons sets. Here, we performed a simulation study of the formation of unidirectional overlapping genes using a simple model of nucleotide change and contrasted it with empirical data. Our results suggest that overlaps originated by an elongation of the 3′-end of the upstream gene are significantly more frequent than those originated by an elongation of the 5′-end of the downstream gene. According to this, we propose a model for the creation of unidirectional overlaps that is based on the disruption probabilities of start codon and stop codon sets and on the different probabilities of phase 1 and phase 2 overlaps. Additionally, our results suggest that phase 2 overlaps are formed at higher rates than phase 1 overlaps, given the same evolutionary time. Finally, we propose that there is no need to invoke selection to explain the prevalence of long phase 1 unidirectional overlaps. Rather, the overrepresentation of long phase 1 relative to long phase 2 overlaps might occur because it is highly probable that phase 2 overlaps are retained as short overlaps by chance. Such a pattern is stronger if selection against very long overlaps is included in the model. Our model as a whole is able to explain to a large extent the empirical length distribution of unidirectional overlaps in prokaryotic genomes.
Collapse
|
16
|
Abstract
Background New genes in eukaryotes are created through a variety of different mechanisms. De novo origin from non-coding DNA is a mechanism that has recently gained attention. So far, de novo genes have been described in a handful of organisms, with Drosophila being the most extensively studied. We searched for genes that have appeared de novo in the mouse and rat lineages. Methodology Using a rigorous and conservative approach we identify 75 murine genes (69 mouse genes and 6 rat genes) for which there is good evidence of de novo origin since the divergence of mouse and rat. Each of these genes is only found in either the mouse or rat lineages, with no candidate orthologs nor evidence for potentially-unannotated orthologs in the other lineage. The veracity of each of these genes is supported by expression evidence. Additionally, their presence in one lineage and absence in the other cannot be explained by sequencing gaps. For 11 of the 75 candidate novel genes we could identify a mouse-specific mutation that led to the creation of the open reading frame (ORF) specifically in mouse. None of the six rat-specific genes had an unequivocal rat-specific mutation creating the ORF, which may at least be partly due to lower data quality for that genome. Conclusions All 75 candidate genes presented in this study are relatively small and encode short peptides. A large number of them (51 out of 69 mouse genes and 3 out of 6 rat genes) also overlap with other genes, either within introns, or on the opposite strand. These characteristics have previously been documented for de novo genes. The description of these genes opens up the opportunity to integrate this evolutionary analysis with the rich experimental data available for these two model organisms.
Collapse
Affiliation(s)
- Daniel N. Murphy
- Smurfit Institute of Genetics, University of Dublin, Trinity College, Dublin, Ireland
| | - Aoife McLysaght
- Smurfit Institute of Genetics, University of Dublin, Trinity College, Dublin, Ireland
- * E-mail:
| |
Collapse
|
17
|
Ho MR, Tsai KW, Lin WC. A unified framework of overlapping genes: towards the origination and endogenic regulation. Genomics 2012; 100:231-9. [PMID: 22766524 DOI: 10.1016/j.ygeno.2012.06.011] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2012] [Revised: 06/21/2012] [Accepted: 06/25/2012] [Indexed: 11/27/2022]
Abstract
Overlapping genes are pairs of adjacent genes whose genomic regions partially overlap. They are notable by their potential intricate regulation, such as cis-regulation of nested gene-promoter configurations, and post-transcriptional regulation of natural antisense transcripts. The originations and consequent detailed regulation remain obscure. Herein, we propose a unified framework comprising biological classification rules followed by extensive analyses, namely, exon-sharing analysis, a human-mouse conservation study, and transcriptome analysis of hundreds of microarrays and transcriptome sequencing data (mRNA-Seq). We demonstrate that the tail-to-tail architecture would result from sharing functional elements in 3'-untranslated regions (3'-UTRs) of pre-existing genes. Dissimilarly, we illustrate that the other gene overlaps would originate from a new gene arising in a pre-existing gene locus. Interestingly, these types of coupled overlapping genes may influence each other synergistically or competitively during transcription, depending on the promoter configurations. This framework discloses distinctive characteristics of overlapping genes to be a foundation for a further comprehensive understanding of them.
Collapse
Affiliation(s)
- Meng-Ru Ho
- Biodiversity Research Center, Academia Sinica, Taipei 115, Taiwan
| | | | | |
Collapse
|
18
|
Successful COG8 and PDF overlap is mediated by alterations in splicing and polyadenylation signals. Hum Genet 2011; 131:265-74. [PMID: 21805148 DOI: 10.1007/s00439-011-1075-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2011] [Accepted: 07/19/2011] [Indexed: 01/21/2023]
Abstract
Although gene-free areas compose the great majority of eukaryotic genomes, a significant fraction of genes overlaps, i.e., unique nucleotide sequences are part of more than one transcription unit. In this work, the evolutionary history and origin of a same-strand gene overlap is dissected through the analysis of COG8 (component of oligomeric Golgi complex 8) and PDF (peptide deformylase). Comparative genomic surveys reveal that the relative locations of these two genes have been changing over the last 445 million years from distinct chromosomal locations in fish to overlapping in rodents and primates, indicating that the overlap between these genes precedes their divergence. The overlap between the two genes was initiated by the gain of a novel splice donor site between the COG8 stop codon and PDF initiation codon. Splicing is accomplished by the use of the PDF acceptor, leading COG8 to share the 3'end with PDF. In primates, loss of the ancestral polyadenylation signal for COG8 makes the overlap between COG8 and PDF mandatory, while in mouse and rat concurrent overlapping and non-overlapping Cog8 transcripts exist. Altogether, we demonstrate that the origin, evolution and preservation of the COG8/PDF same-strand overlap follow similar mechanistic steps as those documented for antisense overlaps where gain and/or loss of splice sites and polyadenylation signals seems to drive the process.
Collapse
|
19
|
Donoghue MT, Keshavaiah C, Swamidatta SH, Spillane C. Evolutionary origins of Brassicaceae specific genes in Arabidopsis thaliana. BMC Evol Biol 2011; 11:47. [PMID: 21332978 PMCID: PMC3049755 DOI: 10.1186/1471-2148-11-47] [Citation(s) in RCA: 126] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2010] [Accepted: 02/18/2011] [Indexed: 11/21/2022] Open
Abstract
Background All sequenced genomes contain a proportion of lineage-specific genes, which exhibit no sequence similarity to any genes outside the lineage. Despite their prevalence, the origins and functions of most lineage-specific genes remain largely unknown. As more genomes are sequenced opportunities for understanding evolutionary origins and functions of lineage-specific genes are increasing. Results This study provides a comprehensive analysis of the origins of lineage-specific genes (LSGs) in Arabidopsis thaliana that are restricted to the Brassicaceae family. In this study, lineage-specific genes within the nuclear (1761 genes) and mitochondrial (28 genes) genomes are identified. The evolutionary origins of two thirds of the lineage-specific genes within the Arabidopsis thaliana genome are also identified. Almost a quarter of lineage-specific genes originate from non-lineage-specific paralogs, while the origins of ~10% of lineage-specific genes are partly derived from DNA exapted from transposable elements (twice the proportion observed for non-lineage-specific genes). Lineage-specific genes are also enriched in genes that have overlapping CDS, which is consistent with such novel genes arising from overprinting. Over half of the subset of the 958 lineage-specific genes found only in Arabidopsis thaliana have alignments to intergenic regions in Arabidopsis lyrata, consistent with either de novo origination or differential gene loss and retention, with both evolutionary scenarios explaining the lineage-specific status of these genes. A smaller number of lineage-specific genes with an incomplete open reading frame across different Arabidopsis thaliana accessions are further identified as accession-specific genes, most likely of recent origin in Arabidopsis thaliana. Putative de novo origination for two of the Arabidopsis thaliana-only genes is identified via additional sequencing across accessions of Arabidopsis thaliana and closely related sister species lineages. We demonstrate that lineage-specific genes have high tissue specificity and low expression levels across multiple tissues and developmental stages. Finally, stress responsiveness is identified as a distinct feature of Brassicaceae-specific genes; where these LSGs are enriched for genes responsive to a wide range of abiotic stresses. Conclusion Improving our understanding of the origins of lineage-specific genes is key to gaining insights regarding how novel genes can arise and acquire functionality in different lineages. This study comprehensively identifies all of the Brassicaceae-specific genes in Arabidopsis thaliana and identifies how the majority of such lineage-specific genes have arisen. The analysis allows the relative importance (and prevalence) of different evolutionary routes to the genesis of novel ORFs within lineages to be assessed. Insights regarding the functional roles of lineage-specific genes are further advanced through identification of enrichment for stress responsiveness in lineage-specific genes, highlighting their likely importance for environmental adaptation strategies.
Collapse
Affiliation(s)
- Mark Ta Donoghue
- Department of Biochemistry, University College Cork, Cork, Ireland
| | | | | | | |
Collapse
|
20
|
Rindfleisch BC, Brown MS, VandeBerg JL, Munroe SH. Structure and expression of two nuclear receptor genes in marsupials: insights into the evolution of the antisense overlap between the α-thyroid hormone receptor and Rev-erbα. BMC Mol Biol 2010; 11:97. [PMID: 21143985 PMCID: PMC3047299 DOI: 10.1186/1471-2199-11-97] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2010] [Accepted: 12/10/2010] [Indexed: 11/30/2022] Open
Abstract
Background Alternative processing of α-thyroid hormone receptor (TRα, NR1A1) mRNAs gives rise to two functionally antagonistic nuclear receptors: TRα1, the α-type receptor, and TRα2, a non-hormone binding variant that is found only in mammals. TRα2 shares an unusual antisense coding overlap with mRNA for Rev-erbα (NR1D1), another nuclear receptor protein. In this study we examine the structure and expression of these genes in the gray short-tailed opossum, Monodelphis domestica, in comparison with that of eutherian mammals and three other marsupial species, Didelphis virginiana, Potorous tridactylus and Macropus eugenii, in order to understand the evolution and regulatory role of this antisense overlap. Results The sequence, expression and genomic organization of mRNAs encoding TRα1 and Rev-erbα are very similar in the opossum and eutherian mammals. However, the sequence corresponding to the TRα2 coding region appears truncated by almost 100 amino acids. While expression of TRα1 and Rev-erbα was readily detected in all tissues of M. domestica ages 0 days to 18 weeks, TRα2 mRNA was not detected in any tissue or stage examined. These results contrast with the widespread and abundant expression of TRα2 in rodents and other eutherian mammals. To examine requirements for alternative splicing of TRα mRNAs, a series of chimeric minigenes was constructed. Results show that the opossum TRα2-specific 5' splice site sequence is fully competent for splicing but the sequence homologous to the TRα2 3' splice site is not, even though the marsupial sequences are remarkably similar to core splice site elements in rat. Conclusions Our results strongly suggest that the variant nuclear receptor isoform, TRα2, is not expressed in marsupials and that the antisense overlap between TRα and Rev-erbα thus is unique to eutherian mammals. Further investigation of the TRα and Rev-erbα genes in marsupial and eutherian species promises to yield additional insight into the physiological function of TRα2 and the role of the associated antisense overlap with Rev-erbα in regulating expression of these genes.
Collapse
|
21
|
Abstract
The origin of new genes is extremely important to evolutionary innovation. Most new genes arise from existing genes through duplication or recombination. The origin of new genes from noncoding DNA is extremely rare, and very few eukaryotic examples are known. We present evidence for the de novo origin of at least three human protein-coding genes since the divergence with chimp. Each of these genes has no protein-coding homologs in any other genome, but is supported by evidence from expression and, importantly, proteomics data. The absence of these genes in chimp and macaque cannot be explained by sequencing gaps or annotation error. High-quality sequence data indicate that these loci are noncoding DNA in other primates. Furthermore, chimp, gorilla, gibbon, and macaque share the same disabling sequence difference, supporting the inference that the ancestral sequence was noncoding over the alternative possibility of parallel gene inactivation in multiple primate lineages. The genes are not well characterized, but interestingly, one of them was first identified as an up-regulated gene in chronic lymphocytic leukemia. This is the first evidence for entirely novel human-specific protein-coding genes originating from ancestrally noncoding sequences. We estimate that 0.075% of human genes may have originated through this mechanism leading to a total expectation of 18 such cases in a genome of 24,000 protein-coding genes.
Collapse
|
22
|
Sabath N, Landan G, Graur D. A method for the simultaneous estimation of selection intensities in overlapping genes. PLoS One 2008; 3:e3996. [PMID: 19098983 PMCID: PMC2601044 DOI: 10.1371/journal.pone.0003996] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2008] [Accepted: 11/21/2008] [Indexed: 11/18/2022] Open
Abstract
Inferring the intensity of positive selection in protein-coding genes is important since it is used to shed light on the process of adaptation. Recently, it has been reported that overlapping genes, which are ubiquitous in all domains of life, seem to exhibit inordinate degrees of positive selection. Here, we present a new method for the simultaneous estimation of selection intensities in overlapping genes. We show that the appearance of positive selection is caused by assuming that selection operates independently on each gene in an overlapping pair, thereby ignoring the unique evolutionary constraints on overlapping coding regions. Our method uses an exact evolutionary model, thereby voiding the need for approximation or intensive computation. We test the method by simulating the evolution of overlapping genes of different types as well as under diverse evolutionary scenarios. Our results indicate that the independent estimation approach leads to the false appearance of positive selection even though the gene is in reality subject to negative selection. Finally, we use our method to estimate selection in two influenza A genes for which positive selection was previously inferred. We find no evidence for positive selection in both cases.
Collapse
Affiliation(s)
- Niv Sabath
- Department of Biology and Biochemistry, University of Houston, Houston, Texas, United States of America.
| | | | | |
Collapse
|
23
|
Makałowska I. Comparative analysis of an unusual gene arrangement in the human chromosome 1. Gene 2008; 423:172-9. [DOI: 10.1016/j.gene.2008.06.030] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2008] [Revised: 06/14/2008] [Accepted: 06/18/2008] [Indexed: 11/15/2022]
|
24
|
Kavanaugh SI, Nozaki M, Sower SA. Origins of gonadotropin-releasing hormone (GnRH) in vertebrates: identification of a novel GnRH in a basal vertebrate, the sea lamprey. Endocrinology 2008; 149:3860-9. [PMID: 18436713 PMCID: PMC2488216 DOI: 10.1210/en.2008-0184] [Citation(s) in RCA: 88] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
Abstract
We cloned a cDNA encoding a novel (GnRH), named lamprey GnRH-II, from the sea lamprey, a basal vertebrate. The deduced amino acid sequence of the newly identified lamprey GnRH-II is QHWSHGWFPG. The architecture of the precursor is similar to that reported for other GnRH precursors consisting of a signal peptide, decapeptide, a downstream processing site, and a GnRH-associated peptide; however, the gene for lamprey GnRH-II does not have introns in comparison with the gene organization for all other vertebrate GnRHs. Lamprey GnRH-II precursor transcript was widely expressed in a variety of tissues. In situ hybridization of the brain showed expression and localization of the transcript in the hypothalamus, medulla, and olfactory regions, whereas immunohistochemistry using a specific antiserum showed only GnRH-II cell bodies and processes in the preoptic nucleus/hypothalamus areas. Lamprey GnRH-II was shown to stimulate the hypothalamic-pituitary axis using in vivo and in vitro studies. Lamprey GnRH-II was also shown to activate the inositol phosphate signaling system in COS-7 cells transiently transfected with the lamprey GnRH receptor. These studies provide evidence for a novel lamprey GnRH that has a role as a third hypothalamic GnRH. In summary, the newly discovered lamprey GnRH-II offers a new paradigm of the origin of the vertebrate GnRH family. We hypothesize that due to a genome/gene duplication event, an ancestral gene gave rise to two lineages of GnRHs: the gnathostome GnRH and lamprey GnRH-II.
Collapse
Affiliation(s)
- Scott I Kavanaugh
- Department of Biochemistry and Molecular Biology, University of New Hampshire, Durham, New Hampshire 03824, USA
| | | | | |
Collapse
|