1
|
Savinkova LK, Sharypova EB, Kolchanov NA. On the Role of TATA Boxes and TATA-Binding Protein in Arabidopsis thaliana. PLANTS (BASEL, SWITZERLAND) 2023; 12:1000. [PMID: 36903861 PMCID: PMC10005294 DOI: 10.3390/plants12051000] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2022] [Revised: 01/13/2023] [Accepted: 02/20/2023] [Indexed: 06/18/2023]
Abstract
For transcription initiation by RNA polymerase II (Pol II), all eukaryotes require assembly of basal transcription machinery on the core promoter, a region spanning the transcription start site (approximately -50 to +50 bp). Although Pol II is a complex multi-subunit enzyme conserved among all eukaryotes, it cannot initiate transcription without the participation of many other proteins. Transcription initiation on TATA-containing promoters requires the assembly of the preinitiation complex; this process is triggered by an interaction of TATA-binding protein (TBP, a component of the general transcription factor TFIID (transcription factor II D)) with a TATA box. The interaction of TBP with various TATA boxes in plants, in particular Arabidopsis thaliana, has hardly been investigated, except for a few early studies that addressed the role of a TATA box and of substitutions in it in plant transcription systems. This is despite the fact that the interaction of TBP with TATA boxes and their variants can be used to regulate transcription. In this review, we examine the roles of some general transcription factors in the assembly of the basal transcription complex, as well as functions of TATA boxes of the model plant A. thaliana. We review examples showing not only the involvement of TATA boxes in the initiation of transcription machinery assembly but also their indirect participation in plant adaptation to environmental conditions in responses to light and other phenomena. Examples of the influence of the expression levels of A. thaliana TBP1 and TBP2 on morphological traits of the plants are also examined. We summarize available functional data on these two early players that trigger the assembly of transcription machinery. This information will deepen the understanding of the mechanisms underlying transcription by Pol II in plants and will help to put the interaction of TBP with TATA boxes to practical use.
|
2
|
Deviatiiarov RM, Gams A, Kulakovskiy IV, Buyan A, Meshcheryakov G, Syunyaev R, Singh R, Shah P, Tatarinova TV, Gusev O, Efimov IR. An atlas of transcribed human cardiac promoters and enhancers reveals an important role of regulatory elements in heart failure. NATURE CARDIOVASCULAR RESEARCH 2023; 2:58-75. [PMID: 39196209 DOI: 10.1038/s44161-022-00182-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 10/23/2021] [Accepted: 11/02/2022] [Indexed: 08/29/2024]
Abstract
A deeper knowledge of the dynamic transcriptional activity of promoters and enhancers is needed to improve mechanistic understanding of the pathogenesis of heart failure and heart diseases. In this study, we used cap analysis of gene expression (CAGE) to identify and quantify the activity of transcribed regulatory elements (TREs) in the four cardiac chambers of 21 healthy and ten failing adult human hearts. We identified 17,668 promoters and 14,920 enhancers associated with the expression of 14,519 genes. We showed how these regulatory elements are alternatively transcribed in different heart regions, in healthy versus failing hearts and in ischemic versus non-ischemic heart failure samples. Cardiac-disease-related single-nucleotide polymorphisms (SNPs) appeared to be enriched in TREs, potentially affecting the allele-specific transcription factor binding. To conclude, our open-source heart CAGE atlas will serve the cardiovascular community in improving the understanding of the role of the cardiac gene regulatory networks in cardiovascular disease and therapy.
Affiliation(s)
- Ruslan M Deviatiiarov
- Laboratory of Regulatory Genomics, Institute of Fundamental Medicine and Biology, Kazan Federal University, Kazan, Russia
| | - Anna Gams
- Department of Biomedical Engineering, The George Washington University, Washington, DC, USA
| | - Ivan V Kulakovskiy
- Laboratory of Regulatory Genomics, Institute of Fundamental Medicine and Biology, Kazan Federal University, Kazan, Russia
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
| | - Andrey Buyan
- Laboratory of Regulatory Genomics, Institute of Fundamental Medicine and Biology, Kazan Federal University, Kazan, Russia
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
| | | | - Roman Syunyaev
- Department of Biomedical Engineering, The George Washington University, Washington, DC, USA
- I.M. Sechenov First Moscow State Medical University, Moscow, Russia
| | - Ramesh Singh
- Inova Heart and Vascular Institute, Falls Church, VA, USA
| | - Palak Shah
- Department of Biomedical Engineering, The George Washington University, Washington, DC, USA
- Inova Heart and Vascular Institute, Falls Church, VA, USA
| | - Tatiana V Tatarinova
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia.
- Department of Biology, University of La Verne, La Verne, CA, USA.
| | - Oleg Gusev
- Laboratory of Regulatory Genomics, Institute of Fundamental Medicine and Biology, Kazan Federal University, Kazan, Russia.
- Graduate School of Medicine, Juntendo University, Tokyo, Japan.
- RIKEN Center for Integrative Medical Sciences, RIKEN, Yokohama, Japan.
- Endocrinology Research Center, Moscow, Russia.
| | - Igor R Efimov
- Department of Biomedical Engineering, The George Washington University, Washington, DC, USA.
- Department of Biomedical Engineering, Northwestern University, Chicago, IL, USA.
- Department of Medicine, Northwestern University, Chicago, IL, USA.
|
3
|
Genome-Wide Prediction of Transcription Start Sites in Conifers. Int J Mol Sci 2022; 23:ijms23031735. [PMID: 35163661 PMCID: PMC8836283 DOI: 10.3390/ijms23031735] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/30/2021] [Revised: 01/30/2022] [Accepted: 02/01/2022] [Indexed: 02/04/2023]
Abstract
The identification of promoters is an essential step in the genome annotation process, providing a framework for gene regulatory networks and their role in transcription regulation. Despite considerable advances in the high-throughput determination of transcription start sites (TSSs) and transcription factor binding sites (TFBSs), experimental methods are still time-consuming and expensive. Instead, several computational approaches have been developed to provide fast and reliable means for predicting the location of TSSs and regulatory motifs on a genome-wide scale. Numerous studies have been carried out on the regulatory elements of mammalian genomes, but plant promoters, especially in gymnosperms, have been left out of the limelight and, therefore, have been poorly investigated. The aim of this study was to enhance and expand the existing genome annotations using computational approaches for genome-wide prediction of TSSs in the four conifer species: loblolly pine, white spruce, Norway spruce, and Siberian larch. Our pipeline will be useful for TSS predictions in other genomes, especially for draft assemblies, where reliable TSS predictions are not usually available. We also explored some of the features of the nucleotide composition of the predicted promoters and compared the GC properties of conifer genes with model monocot and dicot plants. Here, we demonstrate that even incomplete genome assemblies and partial annotations can be a reliable starting point for TSS annotation. The results of the TSS prediction in four conifer species have been deposited in the Persephone genome browser, which allows smooth visualization and is optimized for large data sets. This work provides the initial basis for future experimental validation and the study of the regulatory regions to understand gene regulation in gymnosperms.
|
4
|
Zhang M, Jia C, Li F, Li C, Zhu Y, Akutsu T, Webb GI, Zou Q, Coin LJM, Song J. Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction. Brief Bioinform 2022; 23:6502561. [PMID: 35021193 PMCID: PMC8921625 DOI: 10.1093/bib/bbab551] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 08/09/2021] [Revised: 11/12/2021] [Accepted: 11/30/2021] [Indexed: 01/13/2023]
Abstract
Promoters are crucial regulatory DNA regions for gene transcriptional activation. Rapid advances in next-generation sequencing technologies have accelerated the accumulation of genome sequences, providing increased training data to inform computational approaches for both prokaryotic and eukaryotic promoter prediction. However, it remains a significant challenge to accurately identify species-specific promoter sequences using computational approaches. To advance computational support for promoter prediction, in this study, we curated 58 comprehensive, up-to-date, benchmark datasets for 7 different species (i.e. Escherichia coli, Bacillus subtilis, Homo sapiens, Mus musculus, Arabidopsis thaliana, Zea mays and Drosophila melanogaster) to assist the research community to assess the relative functionality of alternative approaches and support future research on both prokaryotic and eukaryotic promoters. We revisited 106 predictors published since 2000 for promoter identification (40 for prokaryotic promoter, 61 for eukaryotic promoter, and 5 for both). We systematically evaluated their training datasets, computational methodologies, calculated features, performance and software usability. On the basis of these benchmark datasets, we benchmarked 19 predictors with functioning webservers/local tools and assessed their prediction performance. We found that deep learning and traditional machine learning-based approaches generally outperformed scoring function-based approaches. Taken together, the curated benchmark dataset repository and the benchmarking analysis in this study serve to inform the design and implementation of computational approaches for promoter prediction and facilitate more rigorous comparison of new techniques in the future.
Affiliation(s)
| | - Cangzhi Jia
- School of Science, Dalian Maritime University, Dalian 116026, China (corresponding author)
| | | | | | | | | | - Geoffrey I Webb
- Department of Data Science and Artificial Intelligence, Monash University, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China (corresponding author)
| | - Lachlan J M Coin
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia (corresponding author)
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia (corresponding author)
|
5
|
To JPC, Davis IW, Marengo MS, Shariff A, Baublite C, Decker K, Galvão RM, Gao Z, Haragutchi O, Jung JW, Li H, O'Brien B, Sant A, Elich TD. Expression Elements Derived From Plant Sequences Provide Effective Gene Expression Regulation and New Opportunities for Plant Biotechnology Traits. FRONTIERS IN PLANT SCIENCE 2021; 12:712179. [PMID: 34745155 PMCID: PMC8569612 DOI: 10.3389/fpls.2021.712179] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2021] [Accepted: 09/15/2021] [Indexed: 06/13/2023]
Abstract
Plant biotechnology traits provide a means to increase crop yields, manage weeds and pests, and sustainably contribute to addressing the needs of a growing population. One of the key challenges in developing new traits for plant biotechnology is the availability of expression elements for efficacious and predictable transgene regulation. Recent advances in genomics, transcriptomics, and computational tools have enabled the generation of new expression elements in a variety of model organisms. In this study, new expression element sequences were computationally generated for use in crops, starting from native Arabidopsis and maize sequences. These elements include promoters, 5' untranslated regions (5' UTRs), introns, and 3' UTRs. The expression elements were demonstrated to drive effective transgene expression in stably transformed soybean plants across multiple tissue types and developmental stages. The expressed transcripts were characterized to demonstrate the molecular function of these expression elements. The data show that the promoters precisely initiate transcripts, the introns are effectively spliced, and the 3' UTRs enable predictable processing of transcript 3' ends. Overall, our results indicate that these new expression elements can recapitulate key functional properties of natural sequences and provide opportunities for optimizing the expression of genes in future plant biotechnology traits.
Affiliation(s)
- Jennifer P. C. To
- Bayer Crop Science, Chesterfield, MO, United States
- GrassRoots Biotechnology, Durham, NC, United States
- Monsanto Company, Research Triangle Park, Durham, NC, United States
| | - Ian W. Davis
- Bayer Crop Science, Chesterfield, MO, United States
- GrassRoots Biotechnology, Durham, NC, United States
- Monsanto Company, Research Triangle Park, Durham, NC, United States
| | - Matthew S. Marengo
- Bayer Crop Science, Chesterfield, MO, United States
- GrassRoots Biotechnology, Durham, NC, United States
- Monsanto Company, Research Triangle Park, Durham, NC, United States
| | - Aabid Shariff
- GrassRoots Biotechnology, Durham, NC, United States
- Monsanto Company, Research Triangle Park, Durham, NC, United States
- Pairwise Plants, Durham, NC, United States
| | | | - Keith Decker
- Bayer Crop Science, Chesterfield, MO, United States
| | - Rafaelo M. Galvão
- Bayer Crop Science, Chesterfield, MO, United States
- GrassRoots Biotechnology, Durham, NC, United States
- Monsanto Company, Research Triangle Park, Durham, NC, United States
| | - Zhihuan Gao
- Bayer Crop Science, Chesterfield, MO, United States
- GrassRoots Biotechnology, Durham, NC, United States
- Monsanto Company, Research Triangle Park, Durham, NC, United States
| | - Olivia Haragutchi
- Bayer Crop Science, Chesterfield, MO, United States
- GrassRoots Biotechnology, Durham, NC, United States
- Monsanto Company, Research Triangle Park, Durham, NC, United States
| | - Jee W. Jung
- Bayer Crop Science, Chesterfield, MO, United States
- GrassRoots Biotechnology, Durham, NC, United States
- Monsanto Company, Research Triangle Park, Durham, NC, United States
- Duke University, Office for Translation and Commercialization, Durham, NC, United States
| | - Hong Li
- Bayer Crop Science, Chesterfield, MO, United States
| | - Brent O'Brien
- Bayer Crop Science, Chesterfield, MO, United States
- GrassRoots Biotechnology, Durham, NC, United States
- Monsanto Company, Research Triangle Park, Durham, NC, United States
| | - Anagha Sant
- Bayer Crop Science, Chesterfield, MO, United States
| | - Tedd D. Elich
- GrassRoots Biotechnology, Durham, NC, United States
- Monsanto Company, Research Triangle Park, Durham, NC, United States
- LifeEDIT Therapeutics, Durham, NC, United States
|
6
|
Flavell RB. Perspective: 50 years of plant chromosome biology. PLANT PHYSIOLOGY 2021; 185:731-753. [PMID: 33604616 PMCID: PMC8133586 DOI: 10.1093/plphys/kiaa108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2020] [Accepted: 12/04/2020] [Indexed: 06/12/2023]
Abstract
The past 50 years have been the greatest era of plant science discovery, and most of the discoveries have emerged from or been facilitated by our knowledge of plant chromosomes. At last we have descriptive and mechanistic outlines of the information in chromosomes that programs plant life. We had almost no such information 50 years ago, when few had isolated DNA from any plant species. The important features of genes have been revealed through whole-genome comparative genomics and testing of variants using transgenesis. Progress has been enabled by the development of technologies that had to be invented and then become widely available. Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa) have played extraordinary roles as model species. Unexpected evolutionary dramas were uncovered when learning that chromosomes have to constantly manage the vast numbers of potentially mutagenic families of transposons and other repeated sequences. The chromatin-based transcriptional and epigenetic mechanisms that co-evolved to manage the evolutionary drama as well as gene expression and 3-D nuclear architecture have been elucidated these past 20 years. This perspective traces some of the major developments with which I have become particularly familiar while seeking ways to improve crop plants. I draw some conclusions from this look-back over 50 years during which the scientific community has (i) exposed how chromosomes guard, read out, control, recombine, and transmit information that programs plant species, large and small, weed and crop, and (ii) modified the information in chromosomes for the purposes of genetic, physiological, and developmental analyses and plant improvement.
Affiliation(s)
- Richard B Flavell
- International Wheat Yield Partnership, 1500 Research Parkway, College Station, TX 77843, USA
|
7
|
Pachganov S, Murtazalieva K, Zarubin A, Taran T, Chartier D, Tatarinova TV. Prediction of Rice Transcription Start Sites Using TransPrise: A Novel Machine Learning Approach. Methods Mol Biol 2021; 2238:261-274. [PMID: 33471337 DOI: 10.1007/978-1-0716-1068-8_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
As the interest in genetic resequencing increases, so does the need for effective mathematical, computational, and statistical approaches. One of the difficult problems in genome annotation is determining the precise positions of transcription start sites. In this paper, we present TransPrise, an efficient deep learning tool for predicting positions of eukaryotic transcription start sites. TransPrise offers significant improvement over existing promoter-prediction methods. To illustrate this, we compared predictions of TransPrise with the TSSPlant approach for the well-annotated genome of Oryza sativa. Using a computer with a graphics processing unit, the run time of TransPrise is 250 min on a 374 Mb genome. We provide the full basis for the comparison and encourage users to freely access a set of our computational tools to facilitate and streamline their own analyses. The ready-to-use Docker image with all the necessary packages, models, and code, as well as the source code of the TransPrise algorithm, are available at http://compubioverne.group/. The source code is ready to use and can be customized to predict TSSs in any eukaryotic organism.
Affiliation(s)
- Stepan Pachganov
- Ugra Research Institute of Information Technologies, Khanty-Mansiysk, Russia
| | | | - Alexei Zarubin
- Tomsk National Research Medical Center of the Russian Academy of Sciences, Research Institute of Medical Genetics, Tomsk, Russia
| | | | - Duane Chartier
- International Center for Art Intelligence, Inc, Los Angeles, CA, USA
| | - Tatiana V Tatarinova
- Vavilov Institute of General Genetics, Moscow, Russia.
- Department of Biology, University of La Verne, La Verne, CA, USA.
- A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia.
- Siberian Federal University, Krasnoyarsk, Russia.
|
8
|
Sarpan N, Taranenko E, Ooi SE, Low ETL, Espinoza A, Tatarinova TV, Ong-Abdullah M. DNA methylation changes in clonally propagated oil palm. PLANT CELL REPORTS 2020; 39:1219-1233. [PMID: 32591850 DOI: 10.1007/s00299-020-02561-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/27/2020] [Accepted: 06/17/2020] [Indexed: 06/11/2023]
Abstract
Several hypomethylated sites within the Karma region of EgDEF1 and hotspot regions in chromosomes 1, 2, 3, and 5 may be associated with mantling. One of the main challenges faced by the oil palm industry is fruit abnormalities, such as the "mantled" phenotype that can lead to reduced yields. This clonal abnormality is an epigenetic phenomenon and has been linked to the hypomethylation of a transposable element within the EgDEF1 gene. To understand the epigenome changes in clones, methylomes of clonal oil palms were compared to methylomes of seedling-derived oil palms. Whole-genome bisulfite sequencing data from seedlings, normal clones, and mantled clones were analyzed to determine and compare the context-specific DNA methylomes. In seedlings, coding and regulatory regions are generally hypomethylated while introns and repeats are extensively methylated. Genes with a low number of guanines and cytosines in the third position of codons (GC3-poor genes) were increasingly methylated towards their 3' region, while GC3-rich genes remained demethylated, similar to patterns in other eukaryotic species. Predicted promoter regions were generally hypomethylated in seedlings. In clones, CG, CHG, and CHH methylation levels generally decreased in functionally important regions, such as promoters, 5' UTRs, and coding regions. Although random regions were found to be hypomethylated in clonal genomes, hypomethylation of certain hotspot regions may be associated with the clonal mantling phenotype. Our findings, therefore, suggest that other hypomethylated CHG sites within the Karma region of EgDEF1 and hypomethylated hotspot regions in chromosomes 1, 2, 3, and 5 are associated with mantling.
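The GC3 measure used in this abstract (the share of G or C bases at the third position of codons) can be illustrated with a short Python sketch. The function name `gc3_fraction` and the example sequence are illustrative assumptions, and the thresholds the authors used to call a gene GC3-poor or GC3-rich are not stated here, so none is applied.

```python
def gc3_fraction(cds: str) -> float:
    """Fraction of codons whose third base is G or C.

    Assumes `cds` is an in-frame coding sequence starting at codon
    position 1; a trailing incomplete codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3      # drop any trailing partial codon
    third_bases = cds[2:usable:3]         # every third base: indices 2, 5, 8, ...
    if not third_bases:
        raise ValueError("sequence contains no complete codon")
    gc_count = sum(base in "GC" for base in third_bases)
    return gc_count / len(third_bases)


if __name__ == "__main__":
    # Hypothetical toy CDS with codons ATG GCC AAA TAA
    print(gc3_fraction("ATGGCCAAATAA"))
```

Ranking genes by this value is one way to separate GC3-poor from GC3-rich sets before comparing their methylation profiles.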
Affiliation(s)
- Norashikin Sarpan
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, 6 Persiaran Institusi, Bandar Baru Bangi, 43000, Kajang, Selangor, Malaysia
| | - Elizaveta Taranenko
- Department of Biology, University of La Verne, La Verne, CA, USA
- Department of Fundamental Biology and Biotechnology, Siberian Federal University, 660074, Krasnoyarsk, Russia
| | - Siew-Eng Ooi
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, 6 Persiaran Institusi, Bandar Baru Bangi, 43000, Kajang, Selangor, Malaysia
| | - Eng-Ti Leslie Low
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, 6 Persiaran Institusi, Bandar Baru Bangi, 43000, Kajang, Selangor, Malaysia
| | | | - Tatiana V Tatarinova
- Department of Biology, University of La Verne, La Verne, CA, USA.
- Department of Fundamental Biology and Biotechnology, Siberian Federal University, 660074, Krasnoyarsk, Russia.
- Vavilov Institute for General Genetics, Moscow, Russia.
- A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia.
| | - Meilina Ong-Abdullah
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, 6 Persiaran Institusi, Bandar Baru Bangi, 43000, Kajang, Selangor, Malaysia.
|
9
|
Pachganov S, Murtazalieva K, Zarubin A, Sokolov D, Chartier DR, Tatarinova TV. TransPrise: a novel machine learning approach for eukaryotic promoter prediction. PeerJ 2019; 7:e7990. [PMID: 31695967 PMCID: PMC6827441 DOI: 10.7717/peerj.7990] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 07/09/2019] [Accepted: 10/04/2019] [Indexed: 02/01/2023]
Abstract
As interest in genetic resequencing increases, so does the need for effective mathematical, computational, and statistical approaches. One of the difficult problems in genome annotation is determining the precise positions of transcription start sites. In this paper we present TransPrise, an efficient deep learning tool for predicting the positions of eukaryotic transcription start sites. Our pipeline consists of two parts: a binary classifier runs first, and if a sequence is classified as TSS-containing, a regression step follows in which the precise location of the TSS is identified. TransPrise offers significant improvement over existing promoter-prediction methods. To illustrate this, we compared predictions of the TransPrise classification and regression models with the TSSPlant approach for the well-annotated genome of Oryza sativa. Using a computer equipped with a graphics processing unit, the run time of TransPrise is 250 minutes on a 374 Mb genome. The Matthews correlation coefficient value for TransPrise is 0.79, more than two times larger than the 0.31 for TSSPlant classification models. This represents a high level of prediction accuracy. Additionally, the mean absolute error for the regression model is 29.19 nt, allowing for accurate prediction of TSS location. TransPrise was also tested in Homo sapiens, where the mean absolute error of the regression model was 47.986 nt. We provide the full basis for the comparison and encourage users to freely access a set of our computational tools to facilitate and streamline their own analyses. The ready-to-use Docker image with all necessary packages, models, and code, as well as the source code of the TransPrise algorithm, are available at http://compubioverne.group/. The source code is ready to use and customizable to predict TSSs in any eukaryotic organism.
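The two metrics quoted in this abstract have standard definitions: the Matthews correlation coefficient (MCC) is computed from the four cells of a binary confusion matrix, and the mean absolute error (MAE) is the average absolute distance between predicted and annotated positions. A minimal Python sketch follows; the function names and the example numbers are illustrative assumptions and are not taken from the TransPrise code or its results.

```python
import math
from typing import Sequence


def mcc_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (the usual convention
    for the otherwise undefined case).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def mean_absolute_error(true_pos: Sequence[float], pred_pos: Sequence[float]) -> float:
    """Average absolute difference between annotated and predicted positions."""
    if len(true_pos) != len(pred_pos) or not true_pos:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(true_pos, pred_pos)) / len(true_pos)


if __name__ == "__main__":
    # Hypothetical counts for a TSS / non-TSS classifier
    print(mcc_from_counts(tp=90, tn=85, fp=15, fn=10))
    # Hypothetical annotated vs. predicted TSS coordinates (nt)
    print(mean_absolute_error([100, 200], [110, 180]))
```

MCC ranges from -1 to 1 and stays informative when the two classes are unbalanced, which is common when TSS-containing windows are rare relative to background sequence.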
Affiliation(s)
- Stepan Pachganov
- Ugra Research Institute of Information Technologies, Khanty-Mansiysk, Russia
| | - Khalimat Murtazalieva
- Vavilov Institute for General Genetics, Moscow, Russia
- Institute of Bioinformatics, Moscow, Russia
| | - Aleksei Zarubin
- Tomsk National Research Medical Center of the Russian Academy of Sciences, Research Institute of Medical Genetics, Tomsk, Russia
| | | | - Duane R Chartier
- International Center for Art Intelligence, Inc., Los Angeles, CA, United States of America
| | - Tatiana V Tatarinova
- Vavilov Institute for General Genetics, Moscow, Russia
- Department of Biology, University of La Verne, La Verne, CA, United States of America
- A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia
- Siberian Federal University, Krasnoyarsk, Russia
|
10
|
Tonnessen BW, Bossa-Castro AM, Mauleon R, Alexandrov N, Leach JE. Shared cis-regulatory architecture identified across defense response genes is associated with broad-spectrum quantitative resistance in rice. Sci Rep 2019; 9:1536. [PMID: 30733489 PMCID: PMC6367480 DOI: 10.1038/s41598-018-38195-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 09/05/2018] [Accepted: 12/18/2018] [Indexed: 12/30/2022]
Abstract
Plant disease resistance that is durable and effective against diverse pathogens (broad-spectrum) is essential to stabilize crop production. Such resistance is frequently controlled by Quantitative Trait Loci (QTL), and often involves differential regulation of Defense Response (DR) genes. In this study, we sought to understand how expression of DR genes is orchestrated, with the long-term goal of enabling genome-wide breeding for more effective and durable resistance. We identified short sequence motifs in rice promoters that are shared across Broad-Spectrum DR (BS-DR) genes co-expressed after challenge with three major rice pathogens (Magnaporthe oryzae, Rhizoctonia solani, and Xanthomonas oryzae pv. oryzae) and several chemical elicitors. Specific groupings of these BS-DR-associated motifs, called cis-Regulatory Modules (CRMs), are enriched in DR gene promoters, and the CRMs include cis-elements known to be involved in disease resistance. Polymorphisms in CRMs occur in promoters of genes in resistant relative to susceptible BS-DR haplotypes, providing evidence that these CRMs have a predictive role in the contribution of other BS-DR genes to resistance. Therefore, we predict that a CRM signature within BS-DR gene promoters can be used as a marker for future breeding practices to enrich for the most responsive and effective BS-DR genes across the genome.
Affiliation(s)
| | | | - Ramil Mauleon
- International Rice Research Institute, Manila, Philippines
| | | | - Jan E Leach
- Colorado State University, Fort Collins, CO, USA.
|
11
|
Vishnevsky OV, Bocharnikov AV, Kolchanov NA. Argo_CUDA: Exhaustive GPU based approach for motif discovery in large DNA datasets. J Bioinform Comput Biol 2017; 16:1740012. [PMID: 29281953 DOI: 10.1142/s0219720017400121] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 11/18/2022]
Abstract
The development of chromatin immunoprecipitation sequencing (ChIP-seq) technology has revolutionized the genetic analysis of the basic mechanisms underlying transcription regulation and led to the accumulation of information about a huge number of DNA sequences. Many web services are currently available for de novo motif discovery in datasets containing information about DNA/protein binding. The enormous diversity of motifs makes finding them challenging. To avoid these difficulties, researchers use various stochastic approaches. Unfortunately, the efficiency of motif discovery programs declines dramatically as the query set size increases. As a result, only a fraction of the top "peak" ChIP-Seq segments can be analyzed, or the area of analysis must be narrowed. Thus, motif discovery in massive datasets remains a challenging issue. The Argo_Compute Unified Device Architecture (CUDA) web service is designed to process massive DNA data. It is a program for the detection of degenerate oligonucleotide motifs of fixed length written in the 15-letter IUPAC code. Argo_CUDA is a fully exhaustive approach based on high-performance GPU technologies. Compared with the existing motif discovery web services, Argo_CUDA shows good prediction quality on simulated sets. The analysis of ChIP-Seq sequences revealed motifs that correspond to known transcription factor binding sites.
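To make the 15-letter IUPAC notation mentioned in this abstract concrete, the following is a minimal Python sketch of matching a fixed-length degenerate motif against a DNA sequence. It is only an illustration of the notation and is not the Argo_CUDA algorithm; the function names `motif_matches` and `find_motif` and the example motif are assumptions made for this sketch.

```python
# The 15 IUPAC nucleotide letters: each one stands for a set of allowed bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def motif_matches(motif, window):
    """Return True if every base of `window` is allowed by the motif letter at that position."""
    return len(motif) == len(window) and all(
        base in IUPAC[letter] for letter, base in zip(motif, window)
    )


def find_motif(motif, sequence):
    """Return the 0-based start positions where the degenerate motif matches (hypothetical helper)."""
    sequence = sequence.upper()
    k = len(motif)
    return [
        i for i in range(len(sequence) - k + 1)
        if motif_matches(motif, sequence[i:i + k])
    ]


if __name__ == "__main__":
    # Illustrative TATA-like degenerate motif; not taken from the paper.
    print(find_motif("TATAWAWR", "GGCTATAAAAGGC"))
```

An exhaustive search in the sense of the abstract would enumerate candidate motifs over this alphabet and score each one; the sketch covers only the matching step for a single motif.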
Affiliation(s)
- Oleg V Vishnevsky
- * Institute of Cytology and Genetics SB RAS, Lavrentieva Ave., 10, Novosibirsk 630090, Russia; † Novosibirsk State University, Pirogova, 10, Novosibirsk 630090, Russia
- Nikolay A Kolchanov
- * Institute of Cytology and Genetics SB RAS, Lavrentieva Ave., 10, Novosibirsk 630090, Russia; † Novosibirsk State University, Pirogova, 10, Novosibirsk 630090, Russia
|
12
|
Triska M, Solovyev V, Baranova A, Kel A, Tatarinova TV. Nucleotide patterns aiding in prediction of eukaryotic promoters. PLoS One 2017; 12:e0187243. [PMID: 29141011 PMCID: PMC5687710 DOI: 10.1371/journal.pone.0187243] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2017] [Accepted: 09/05/2017] [Indexed: 01/09/2023] Open
Abstract
Computational analysis of promoters is hindered by the complexity of their architecture. In less studied genomes with complex organization, false positive promoter predictions are common. Accurate identification of transcription start sites and core promoter regions remains an unsolved problem. In this paper, we present a comprehensive analysis of genomic features associated with promoters and show that models driven by probabilistic integrative algorithms allow accurate classification of DNA sequences into “promoters” and “non-promoters” even in the absence of full-length cDNA sequences. These models may be built upon the maps of the distributions of sequence polymorphisms, RNA sequencing reads on genomic DNA, methylated nucleotides, transcription factor binding sites, as well as relative frequencies of nucleotides and their combinations. Positional clustering of binding sites shows that the cells of Oryza sativa utilize three distinct classes of transcription factors (TFs): those that bind preferentially to the [-500,0] region (188 “promoter-specific” TFs), those that bind preferentially to the [0,500] region (282 “5′ UTR-specific” TFs), and those with little or no location preference with respect to the TSS (207 “promiscuous” TFs). For the most informative motifs, their positional preferences are conserved between dicots and monocots.
Affiliation(s)
- Martin Triska
- Children’s Hospital Los Angeles, University of Southern California, Los Angeles, CA, United States of America
- Faculty of Advanced Technology, University of South Wales, Pontypridd, Wales, United Kingdom
- Ancha Baranova
- School of Systems Biology, George Mason University, Fairfax, VA, United States of America
- Research Centre for Medical Genetics, Moscow, Russia
- Alexander Kel
- geneXplain GmbH, Wolfenbuettel, Germany
- Institute of Chemical Biology and Fundamental Medicine, Novosibirsk, Russia
- Tatiana V. Tatarinova
- School of Systems Biology, George Mason University, Fairfax, VA, United States of America
- Department of Biology, Division of Natural Sciences, University of La Verne, La Verne, CA, United States of America
- Bioinformatics Center, AA Kharkevich Institute for Information Transmission Problems RAS, Moscow, Russia
- Vavilov’s Institute for General Genetics, Moscow, Russia
|
13
|
Chan KL, Tatarinova TV, Rosli R, Amiruddin N, Azizi N, Halim MAA, Sanusi NSNM, Jayanthi N, Ponomarenko P, Triska M, Solovyev V, Firdaus-Raih M, Sambanthamurthi R, Murphy D, Low ETL. Evidence-based gene models for structural and functional annotations of the oil palm genome. Biol Direct 2017; 12:21. [PMID: 28886750 PMCID: PMC5591544 DOI: 10.1186/s13062-017-0191-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2017] [Accepted: 08/07/2017] [Indexed: 11/13/2022] Open
Abstract
Background Oil palm is an important source of edible oil. The importance of the crop, as well as its long breeding cycle (10-12 years) has led to the sequencing of its genome in 2013 to pave the way for genomics-guided breeding. Nevertheless, the first set of gene predictions, although useful, had many fragmented genes. Classification and characterization of genes associated with traits of interest, such as those for fatty acid biosynthesis and disease resistance, were also limited. Lipid-, especially fatty acid (FA)-related genes are of particular interest for the oil palm as they specify oil yields and quality. This paper presents the characterization of the oil palm genome using different gene prediction methods and comparative genomics analysis, identification of FA biosynthesis and disease resistance genes, and the development of an annotation database and bioinformatics tools. Results Using two independent gene-prediction pipelines, Fgenesh++ and Seqping, 26,059 oil palm genes with transcriptome and RefSeq support were identified from the oil palm genome. These coding regions of the genome have a characteristic broad distribution of GC3 (fraction of cytosine and guanine in the third position of a codon) with over half the GC3-rich genes (GC3 ≥ 0.75286) being intronless. In comparison, only one-seventh of the oil palm genes identified are intronless. Using comparative genomics analysis, characterization of conserved domains and active sites, and expression analysis, 42 key genes involved in FA biosynthesis in oil palm were identified. For three of them, namely EgFABF, EgFABH and EgFAD3, segmental duplication events were detected. Our analysis also identified 210 candidate resistance genes in six classes, grouped by their protein domain structures. 
Conclusions We present an accurate and comprehensive annotation of the oil palm genome, focusing on analysis of important categories of genes (GC3-rich and intronless), as well as those associated with important functions, such as FA biosynthesis and disease resistance. The study demonstrated the advantages of having an integrated approach to gene prediction and developed a computational framework for combining multiple genome annotations. These results, available in the oil palm annotation database (http://palmxplore.mpob.gov.my), will provide important resources for studies on the genomes of oil palm and related crops. Reviewers This article was reviewed by Alexander Kel, Igor Rogozin, and Vladimir A. Kuznetsov. Electronic supplementary material The online version of this article (doi:10.1186/s13062-017-0191-4) contains supplementary material, which is available to authorized users.
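The abstract defines GC3 as the fraction of cytosine and guanine in the third position of a codon. The following is a minimal Python sketch of that definition for a single in-frame coding sequence. The function names `gc3` and `is_gc3_rich` are assumptions made for illustration, as is the choice to count every complete codon including the stop codon; only the threshold 0.75286 comes from the abstract.

```python
GC3_RICH_THRESHOLD = 0.75286  # cut-off for "GC3-rich" genes quoted in the abstract


def gc3(cds):
    """Fraction of G or C at the third position of each complete codon.

    Assumes `cds` is an in-frame coding sequence; trailing bases that do
    not form a full codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    third_positions = cds[2:usable:3]
    if not third_positions:
        raise ValueError("sequence contains no complete codon")
    return sum(base in "GC" for base in third_positions) / len(third_positions)


def is_gc3_rich(cds):
    """Apply the abstract's GC3 >= 0.75286 criterion (hypothetical helper)."""
    return gc3(cds) >= GC3_RICH_THRESHOLD


if __name__ == "__main__":
    example = "ATGGCCAAGTAA"  # codons ATG GCC AAG TAA, third bases G C G A
    print(gc3(example), is_gc3_rich(example))
```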
Affiliation(s)
- Kuang-Lim Chan
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia; Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia
- Tatiana V Tatarinova
- Department of Biology, University of La Verne, La Verne, California, 91750, USA; Spatial Sciences Institute, University of Southern California, Los Angeles, CA, 90089, USA
- Rozana Rosli
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia; Genomics and Computational Biology Research Group, University of South Wales, Pontypridd, CF371DL, UK
- Nadzirah Amiruddin
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia
- Norazah Azizi
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia
- Mohd Amin Ab Halim
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia
- Nik Shazana Nik Mohd Sanusi
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia
- Nagappan Jayanthi
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia
- Petr Ponomarenko
- Spatial Sciences Institute, University of Southern California, Los Angeles, CA, 90089, USA
- Martin Triska
- Children's Hospital Los Angeles, University of Southern California, Los Angeles, CA, 90089, USA
- Victor Solovyev
- Softberry Inc., 116 Radio Circle, Suite 400, Mount Kisco, NY, 10549, USA
- Mohd Firdaus-Raih
- Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia
- Ravigadevi Sambanthamurthi
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia
- Denis Murphy
- Genomics and Computational Biology Research Group, University of South Wales, Pontypridd, CF371DL, UK
- Eng-Ti Leslie Low
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia.
|
14
|
Evolution of Brain Active Gene Promoters in Human Lineage Towards the Increased Plasticity of Gene Regulation. Mol Neurobiol 2017; 55:1871-1904. [PMID: 28233272 DOI: 10.1007/s12035-017-0427-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2016] [Accepted: 01/26/2017] [Indexed: 01/31/2023]
Abstract
Adaptability to a variety of environmental conditions is a prominent feature of Homo sapiens. We hypothesize that this feature can be explained by evolutionary changes in gene promoters active in the brain prefrontal cortex leading to a more flexible gene regulation network. The genotype-dependent range of gene expression can be broader in humans than in other higher primates. Thus, we searched for specific signatures of evolutionary changes in promoter architectures of multiple hominid genes, including the genes active in human cortical neurons, that may indicate an increase in the variability of gene expression rather than just changes in the level of expression, such as downregulation or upregulation of the genes. We performed a whole-genome search for genetic alterations that may impact gene regulation "flexibility" in the course of hominid evolution, such as (i) CpG dinucleotide content, (ii) predicted nucleosome-DNA dissociation constant, and (iii) predicted affinities for TATA-binding protein (TBP) in gene promoters. We tested all putative promoter regions across the human genome and especially gene promoters in an active chromatin state in neurons of the prefrontal cortex, the brain region critical for abstract thinking and social and behavioral adaptation. Our data imply that the origin of modern man has been associated with an increase in the flexibility of promoter-driven gene regulation in the brain. In contrast, after splitting from the ancestral lineages of H. sapiens, the evolution of ape species is characterized by reduced flexibility of gene promoter functioning, underlying reduced variability of gene expression.
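Of the three promoter features listed in this abstract, CpG dinucleotide content is the simplest to compute directly. The following is a minimal Python sketch that counts CpG dinucleotides in a promoter sequence and reports the widely used observed/expected ratio, (number of CpG × length) / (number of C × number of G). The abstract does not state which CpG measure the authors used, so the ratio and the function name `cpg_stats` are assumptions made for illustration.

```python
def cpg_stats(seq):
    """Return (CpG count, observed/expected CpG ratio) for one sequence.

    The observed/expected ratio is a common normalization, used here as an
    assumption; it is not confirmed to be the measure used in the study.
    """
    seq = seq.upper()
    n = len(seq)
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() finds every occurrence
    c = seq.count("C")
    g = seq.count("G")
    ratio = cpg * n / (c * g) if c and g else 0.0
    return cpg, ratio


if __name__ == "__main__":
    print(cpg_stats("ACGCGT"))
```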
|
15
|
Chan KL, Rosli R, Tatarinova TV, Hogan M, Firdaus-Raih M, Low ETL. Seqping: gene prediction pipeline for plant genomes using self-training gene models and transcriptomic data. BMC Bioinformatics 2017; 18:1426. [PMID: 28466793 PMCID: PMC5333190 DOI: 10.1186/s12859-016-1426-6] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Gene prediction is one of the most important steps in the genome annotation process. A large number of software tools and pipelines developed by various computing techniques are available for gene prediction. However, these systems have yet to accurately predict all or even most of the protein-coding regions. Furthermore, none of the currently available gene-finders has a universal Hidden Markov Model (HMM) that can perform gene prediction for all organisms equally well in an automatic fashion. RESULTS We present an automated gene prediction pipeline, Seqping, that uses self-training HMM models and transcriptomic data. The pipeline processes the genome and transcriptome sequences of the target species using the GlimmerHMM, SNAP, and AUGUSTUS pipelines, followed by the MAKER2 program to combine predictions from the three tools in association with the transcriptomic evidence. Seqping generates species-specific HMMs that are able to offer unbiased gene predictions. The pipeline was evaluated using the Oryza sativa and Arabidopsis thaliana genomes. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis showed that the pipeline was able to identify at least 95% of BUSCO's plantae dataset. Our evaluation shows that Seqping was able to generate better gene predictions compared to three HMM-based programs (MAKER2, GlimmerHMM and AUGUSTUS) using their respective available HMMs. Seqping had the highest accuracy in rice (0.5648 for CDS, 0.4468 for exon, and 0.6695 for nucleotide structure) and A. thaliana (0.5808 for CDS, 0.5955 for exon, and 0.8839 for nucleotide structure). CONCLUSIONS Seqping provides researchers with a seamless pipeline to train species-specific HMMs and predict genes in newly sequenced or less-studied genomes. We conclude that the Seqping pipeline predictions are more accurate than gene predictions using the other three approaches with the default or available HMMs.
Affiliation(s)
- Kuang-Lim Chan
- Advanced Biotechnology and Breeding Center, Malaysian Palm Oil Board, 6 Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia
- Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor, Malaysia
- Rozana Rosli
- Advanced Biotechnology and Breeding Center, Malaysian Palm Oil Board, 6 Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia
- Tatiana V. Tatarinova
- Center for Personalized Medicine and Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA
- Michael Hogan
- Orion Genomics, 4041 Forest Park Avenue, St. Louis, MO 63108, USA
- Mohd Firdaus-Raih
- Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor, Malaysia
- Eng-Ti Leslie Low
- Advanced Biotechnology and Breeding Center, Malaysian Palm Oil Board, 6 Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia
|
16
|
Zolotarenko A, Chekalin E, Mehta R, Baranova A, Tatarinova TV, Bruskin S. Identification of Transcriptional Regulators of Psoriasis from RNA-Seq Experiments. Methods Mol Biol 2017; 1613:355-370. [PMID: 28849568 DOI: 10.1007/978-1-4939-7027-8_14] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Psoriasis is a common inflammatory skin disease with complex etiology and chronic progression. To provide novel insights into the molecular mechanisms of regulation of the disease, we performed RNA sequencing (RNA-Seq) analysis of 14 pairs of skin samples collected from psoriatic patients. Subsequent pathway analysis and extraction of the transcriptional regulators governing psoriasis-associated pathways were executed using a combination of the MetaCore Interactome enrichment tool and the cisExpress algorithm, followed by comparison to a set of previously described psoriasis response elements. A comparative approach allowed us to identify 42 core transcriptional regulators of the disease associated with inflammation (NFkB, IRF9, JUN, FOS, SRF), activity of T-cells in the psoriatic lesions (STAT6, FOXP3, NFATC2, GATA3, TCF7, RUNX1, etc.), hyperproliferation and migration of keratinocytes (JUN, FOS, NFIB, TFAP2A, TFAP2C), and lipid metabolism (TFAP2, RARA, VDR). After merging the ChIP-seq and RNA-seq data, we conclude that the atypical expression of the FOXA1 transcription factor is an important player in psoriasis, as it inhibits maturation of naive T cells into the Treg subpopulation (CD4+FOXA1+CD47+CD69+PD-L1(hi)FOXP3-), therefore contributing to the development of psoriatic skin lesions.
Affiliation(s)
- Alena Zolotarenko
- Laboratory of Functional Genomics, Vavilov Institute of General Genetics RAS, Gubkina Street, 3, 119991, Moscow, Russia
- Evgeny Chekalin
- Laboratory of Functional Genomics, Vavilov Institute of General Genetics RAS, Gubkina Street, 3, 119991, Moscow, Russia
- Rohini Mehta
- The Center of the Study of Chronic Metabolic and Rare Diseases, School of Systems Biology, George Mason University, Fairfax, VA, USA
- Ancha Baranova
- The Center of the Study of Chronic Metabolic and Rare Diseases, School of Systems Biology, George Mason University, Fairfax, VA, USA
- Research Centre for Medical Genetics RAMS, Moscow, Russia
- Moscow Institute of Physics and Technology, Dolgoprudny, Moscow, Russia
- Atlas Biomed Group, Moscow, Russia
- Tatiana V Tatarinova
- Atlas Biomed Group, Moscow, Russia
- Center for Personalized Medicine, Children's Hospital Los Angeles and Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA
- A.A. Kharkevich Institute for Information Transmission Problems RAS, Moscow, Russia
- Sergey Bruskin
- Laboratory of Functional Genomics, Vavilov Institute of General Genetics RAS, Gubkina Street, 3, 119991, Moscow, Russia.
- Moscow Institute of Physics and Technology, Dolgoprudny, Moscow, Russia.
|
17
|
Triska M, Ivliev A, Nikolsky Y, Tatarinova TV. Analysis of cis-Regulatory Elements in Gene Co-expression Networks in Cancer. Methods Mol Biol 2017; 1613:291-310. [PMID: 28849565 DOI: 10.1007/978-1-4939-7027-8_11] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023]
Abstract
Analysis of gene co-expression networks is a powerful "data-driven" tool, invaluable for understanding cancer biology and mechanisms of tumor development. Yet, despite the completion of thousands of studies on cancer gene expression, there have been few attempts to normalize and integrate co-expression data from scattered sources in a concise "meta-analysis" framework. Here we describe an integrated approach to cancer expression meta-analysis, which combines generation of "data-driven" co-expression networks with detailed statistical detection of promoter sequence motifs within the co-expression clusters. First, we applied the Weighted Gene Co-Expression Network Analysis (WGCNA) workflow and Pearson's correlation to generate a comprehensive set of over 3000 co-expression clusters in 82 normalized microarray datasets from nine cancers of different origin. Next, we designed a genome-wide statistical approach to the detection of specific DNA sequence motifs based on similarities between the promoters of similarly expressed genes. The approach, realized as the cisExpress software module, was specifically designed for analysis of very large data sets such as those generated by publicly accessible whole genome and transcriptome projects. cisExpress uses a task farming algorithm to exploit all available computational cores within a shared memory node. We discovered that although co-expression modules are populated with different sets of genes, they share distinct stable patterns of co-regulation based on promoter sequence analysis. The number of motifs per co-expression cluster varies widely in accordance with cancer tissue of origin, with the largest number in colon (68 motifs) and the lowest in ovary (18 motifs). The top scored motifs are typically shared between several tissues; they define sets of target genes responsible for certain functionality of carcinogenesis.
Both the co-expression modules and a database of precalculated motifs are publicly available and accessible for further studies.
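The co-expression step described in this abstract rests on Pearson's correlation between gene expression profiles. The following is a minimal Python sketch of that coefficient for two profiles measured across the same samples. It does not reproduce the WGCNA workflow or cisExpress; the function name `pearson` and the example values are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    gene_a = [1.0, 2.0, 3.0]  # made-up expression values across three samples
    gene_b = [2.0, 4.0, 6.0]
    print(pearson(gene_a, gene_b))
```

Genes whose profiles correlate strongly across samples are candidates for the same co-expression cluster; how the correlations are weighted and clustered is specific to the workflow used and is not shown here.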
Affiliation(s)
- Martin Triska
- Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA
- Yuri Nikolsky
- Prosapia Genetics, Solana Beach, CA, USA; School of Systems Biology, George Mason University, Fairfax, VA, USA
- Tatiana V Tatarinova
- Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA; Center for Personalized Medicine, Children's Hospital Los Angeles, 4640 Hollywood Blvd, Los Angeles, CA, 90027, USA; A.A. Kharkevich Institute for Information Transmission Problems RAS, Moscow, Russia.
|
18
|
Integrated computational approach to the analysis of RNA-seq data reveals new transcriptional regulators of psoriasis. Exp Mol Med 2016; 48:e268. [PMID: 27811935 PMCID: PMC5133374 DOI: 10.1038/emm.2016.97] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2016] [Revised: 05/06/2016] [Accepted: 05/24/2016] [Indexed: 02/07/2023] Open
Abstract
Psoriasis is a common inflammatory skin disease with complex etiology and chronic progression. To provide novel insights into the regulatory molecular mechanisms of the disease, we performed RNA sequencing analysis of 14 pairs of skin samples collected from patients with psoriasis. Subsequent pathway analysis and extraction of the transcriptional regulators governing psoriasis-associated pathways were executed using a combination of the MetaCore Interactome enrichment tool and the cisExpress algorithm, followed by comparison to a set of previously described psoriasis response elements. A comparative approach allowed us to identify 42 core transcriptional regulators of the disease associated with inflammation (NFκB, IRF9, JUN, FOS, SRF), the activity of T cells in psoriatic lesions (STAT6, FOXP3, NFATC2, GATA3, TCF7, RUNX1), the hyperproliferation and migration of keratinocytes (JUN, FOS, NFIB, TFAP2A, TFAP2C) and lipid metabolism (TFAP2, RARA, VDR). In addition to the core regulators, we identified 38 transcription factors previously not associated with the disease that can clarify the pathogenesis of psoriasis. To illustrate these findings, we analyzed the regulatory role of one of the identified transcription factors (TFs), FOXA1. Using ChIP-seq and RNA-seq data, we concluded that the atypical expression of the FOXA1 TF is an important player in the disease as it inhibits the maturation of naive T cells into the (CD4+FOXA1+CD47+CD69+PD-L1(hi)FOXP3-) regulatory T cell subpopulation, therefore contributing to the development of psoriatic skin lesions.
|
19
|
Tatarinova TV, Chekalin E, Nikolsky Y, Bruskin S, Chebotarov D, McNally KL, Alexandrov N. Nucleotide diversity analysis highlights functionally important genomic regions. Sci Rep 2016; 6:35730. [PMID: 27774999 PMCID: PMC5075931 DOI: 10.1038/srep35730] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2016] [Accepted: 09/30/2016] [Indexed: 12/15/2022] Open
Abstract
We analyzed functionality and relative distribution of genetic variants across the complete Oryza sativa genome, using the 40 million single nucleotide polymorphisms (SNPs) dataset from the 3,000 Rice Genomes Project (http://snp-seek.irri.org), the largest and highest density SNP collection for any higher plant. We have shown that the DNA-binding transcription factors (TFs) are the most conserved group of genes, whereas kinases and membrane-localized transporters are the most variable ones. TFs may be conserved because they belong to some of the most connected regulatory hubs that modulate transcription of vast downstream gene networks, whereas signaling kinases and transporters need to adapt rapidly to changing environmental conditions. In general, the observed profound patterns of nucleotide variability reveal functionally important genomic regions. As expected, nucleotide diversity is much higher in intergenic regions than within gene bodies (regions spanning gene models), and protein-coding sequences are more conserved than untranslated gene regions. We have observed a sharp decline in nucleotide diversity that begins at about 250 nucleotides upstream of the transcription start and reaches minimal diversity exactly at the transcription start. We found the transcription termination sites to have remarkably symmetrical patterns of SNP density, implying presence of functional sites near transcription termination. Also, nucleotide diversity was significantly lower near 3′ UTRs, the area rich with regulatory regions.
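Nucleotide diversity, the quantity tracked throughout this abstract, is commonly defined as the average number of differences per site between all pairs of sequences in a sample. The following is a minimal Python sketch of that textbook definition for a small set of aligned sequences of equal length. The abstract does not describe how the authors computed diversity from the SNP dataset, so this formulation and the function name `nucleotide_diversity` are assumptions made for illustration.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average pairwise differences per site across aligned, equal-length sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(seqs, 2))
    # Count mismatching positions for every pair of sequences.
    differences = sum(
        sum(a != b for a, b in zip(s, t)) for s, t in pairs
    )
    return differences / (len(pairs) * length)


if __name__ == "__main__":
    sample = ["ACGT", "ACGA", "ACCT"]  # made-up aligned haplotypes
    print(nucleotide_diversity(sample))
```

Computing this value in sliding windows along a gene model would give a profile of the kind the abstract describes around transcription start and termination sites.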
Affiliation(s)
- Tatiana V Tatarinova
- Center for Personalized Medicine and Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA; Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russian Federation
- Yuri Nikolsky
- Vavilov Institute of General Genetics, Moscow, Russia; F1 Genomics, San Diego, CA, USA; School of Systems Biology, George Mason University, VA, USA
- Dmitry Chebotarov
- International Rice Research Institute, Los Baños, Laguna 4031, Philippines
- Kenneth L McNally
- International Rice Research Institute, Los Baños, Laguna 4031, Philippines
|
20
|
Morozova I, Flegontov P, Mikheyev AS, Bruskin S, Asgharian H, Ponomarenko P, Klyuchnikov V, ArunKumar G, Prokhortchouk E, Gankin Y, Rogaev E, Nikolsky Y, Baranova A, Elhaik E, Tatarinova TV. Toward high-resolution population genomics using archaeological samples. DNA Res 2016; 23:295-310. [PMID: 27436340 PMCID: PMC4991838 DOI: 10.1093/dnares/dsw029] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2015] [Accepted: 05/22/2016] [Indexed: 12/30/2022] Open
Abstract
The term ‘ancient DNA’ (aDNA) is coming of age, with over 1,200 hits in the PubMed database, beginning in the early 1980s with the studies of ‘molecular paleontology’. Rooted in cloning and limited sequencing of DNA from ancient remains during the pre-PCR era, the field has made incredible progress since the introduction of PCR and next-generation sequencing. Over the last decade, aDNA analysis ushered in a new era in genomics and became the method of choice for reconstructing the history of organisms, their biogeography, and migration routes, with applications in evolutionary biology, population genetics, archaeogenetics, paleo-epidemiology, and many other areas. This change was brought by development of new strategies for coping with the challenges in studying aDNA due to damage and fragmentation, scarce samples, significant historical gaps, and limited applicability of population genetics methods. In this review, we describe the state-of-the-art achievements in aDNA studies, with particular focus on human evolution and demographic history. We present the current experimental and theoretical procedures for handling and analysing highly degraded aDNA. We also review the challenges in the rapidly growing field of ancient epigenomics. Advancement of aDNA tools and methods signifies a new era in population genetics and evolutionary medicine research.
Affiliation(s)
- Irina Morozova
- Institute of Evolutionary Medicine, University of Zurich, Zurich, Switzerland
- Pavel Flegontov
- Department of Biology and Ecology, Faculty of Science, University of Ostrava, Ostrava, Czech Republic; Bioinformatics Center, A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russian Federation
- Alexander S Mikheyev
- Ecology and Evolution Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
- Sergey Bruskin
- Vavilov Institute of General Genetics RAS, Moscow, Russia
- Hosseinali Asgharian
- Department of Computational and Molecular Biology, University of Southern California, Los Angeles, CA, USA
- Petr Ponomarenko
- Center for Personalized Medicine, Children's Hospital Los Angeles, Los Angeles, CA, USA; Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA
- Egor Prokhortchouk
- Research Center of Biotechnology RAS, Moscow, Russia; Department of Biology, Lomonosov Moscow State University, Russia
- Evgeny Rogaev
- Vavilov Institute of General Genetics RAS, Moscow, Russia; University of Massachusetts Medical School, Worcester, MA, USA
- Yuri Nikolsky
- Vavilov Institute of General Genetics RAS, Moscow, Russia; F1 Genomics, San Diego, CA, USA; School of Systems Biology, George Mason University, VA, USA
- Ancha Baranova
- School of Systems Biology, George Mason University, VA, USA; Research Centre for Medical Genetics, Moscow, Russia; Atlas Biomed Group, Moscow, Russia
- Eran Elhaik
- Department of Animal & Plant Sciences, University of Sheffield, Sheffield, South Yorkshire, UK
- Tatiana V Tatarinova
- Bioinformatics Center, A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russian Federation; Center for Personalized Medicine, Children's Hospital Los Angeles, Los Angeles, CA, USA; Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA
|
21
|
Li WL, Buckley J, Sanchez-Lara PA, Maglinte DT, Viduetsky L, Tatarinova TV, Aparicio JG, Kim JW, Au M, Ostrow D, Lee TC, O'Gorman M, Judkins A, Cobrinik D, Triche TJ. A Rapid and Sensitive Next-Generation Sequencing Method to Detect RB1 Mutations Improves Care for Retinoblastoma Patients and Their Families. J Mol Diagn 2016; 18:480-93. [PMID: 27155049 DOI: 10.1016/j.jmoldx.2016.02.006] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2015] [Revised: 01/14/2016] [Accepted: 02/01/2016] [Indexed: 01/26/2023] Open
Abstract
Retinoblastoma is a childhood eye malignancy that can lead to the loss of vision, eye(s), and sometimes life. The tumors are initiated by inactivating mutations in both alleles of the tumor-suppressor gene, RB1, or, rarely, by MYCN amplification. Timely identification of a germline RB1 mutation in blood samples or either somatic RB1 mutation or MYCN amplification in tumors is important for effective care and management of retinoblastoma patients and their families. However, current procedures to thoroughly test RB1 mutations are complicated and lengthy. Herein, we report a next-generation sequencing-based method capable of detecting point mutations, small indels, and large deletions or duplications across the entire RB1 gene and amplification of MYCN gene on a single platform. From DNA extraction to clinical interpretation requires only 3 days, enabling early molecular diagnosis of retinoblastoma and optimal treatment outcomes. This method can also detect low-level mosaic mutations in blood samples that can be missed by routine Sanger sequencing. In addition, it can differentiate between RB1 mutation- and MYCN amplification-driven retinoblastomas. This rapid, comprehensive, and sensitive method for detecting RB1 mutations and MYCN amplification can readily identify RB1 mutation carriers and thus improve the management and genetic counseling for retinoblastoma patients and their families.
Affiliation(s)
- Wenhui L Li
- Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
- Jonathan Buckley
- Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
- Pedro A Sanchez-Lara
- Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California; Department of Pediatrics, USC Roski Eye Institute, University of Southern California, Los Angeles, California
- Dennis T Maglinte
- Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California
- Lucy Viduetsky
- Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California
- Tatiana V Tatarinova
- Department of Pediatrics, USC Roski Eye Institute, University of Southern California, Los Angeles, California; Spatial Sciences Institute, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California
- Jonathan W Kim
- Vision Center, Children's Hospital Los Angeles, Los Angeles, California; Department of Ophthalmology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
- Margaret Au
- Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California
- Dejerianne Ostrow
- Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California
- Thomas C Lee
- Vision Center, Children's Hospital Los Angeles, Los Angeles, California; Department of Ophthalmology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
- Maurice O'Gorman
- Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
- Alexander Judkins
- Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
- David Cobrinik
- Vision Center, Children's Hospital Los Angeles, Los Angeles, California; Department of Ophthalmology, USC Roski Eye Institute, University of Southern California, Los Angeles, California; Division of Ophthalmology and Department of Surgery, and Saban Research Institute, Children's Hospital Los Angeles, Los Angeles, California; Department of Biochemistry & Molecular Biology, USC Roski Eye Institute, University of Southern California, Los Angeles, California; Norris Comprehensive Cancer Center, USC Keck School of Medicine, University of Southern California, Los Angeles, California
- Timothy J Triche
- Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
|
22
|
Jiang N, Wang L, Chen J, Wang L, Leach L, Luo Z. Conserved and divergent patterns of DNA methylation in higher vertebrates. Genome Biol Evol 2014; 6:2998-3014. [PMID: 25355807 PMCID: PMC4255770 DOI: 10.1093/gbe/evu238] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Accepted: 10/20/2014] [Indexed: 02/07/2023]
Abstract
DNA methylation in the genome plays a fundamental role in the regulation of gene expression and is widespread in the genome of eukaryotic species. For example, in higher vertebrates, there is a "global" methylation pattern involving complete methylation of CpG sites genome-wide, except in promoter regions that are typically enriched for CpG dinucleotides, or so called "CpG islands." Here, we comprehensively examined and compared the distribution of CpG sites within ten model eukaryotic species and linked the observed patterns to the role of DNA methylation in controlling gene transcription. The analysis revealed two distinct but conserved methylation patterns for gene promoters in human and mouse genomes, involving genes with distinct distributions of promoter CpGs and gene expression patterns. Comparative analysis with four other higher vertebrates revealed that the primary regulatory role of the DNA methylation system is highly conserved in higher vertebrates.
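As an illustration of how CpG enrichment in a promoter sequence is commonly quantified, the sketch below computes the observed/expected CpG ratio, a widely used sequence statistic. It is not taken from the paper above; the function name `cpg_obs_exp` and the example sequences are hypothetical.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G).

    Illustrative helper (hypothetical name), not the method of the cited study.
    Returns 0.0 when the sequence contains no C or no G.
    """
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count gives the exact dinucleotide count.
    cpg_count = seq.count("CG")
    return cpg_count * len(seq) / (c_count * g_count)


if __name__ == "__main__":
    # A CpG-dense toy sequence versus one with C and G but fewer CpG dinucleotides.
    print(cpg_obs_exp("CGCG"))
    print(cpg_obs_exp("CCGG"))
```

A ratio well above that of the surrounding sequence is one conventional indicator of a CpG island; thresholds vary between studies and are not specified here.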
Affiliation(s)
- Ning Jiang
- Department of Biostatistics & Computational Biology, SKLG, School of Life Sciences, Fudan University, Shanghai, China; School of Biosciences, The University of Birmingham, Birmingham B15 2TT, United Kingdom
- Lin Wang
- Department of Biostatistics & Computational Biology, SKLG, School of Life Sciences, Fudan University, Shanghai, China
- Jing Chen
- School of Biosciences, The University of Birmingham, Birmingham B15 2TT, United Kingdom
- Luwen Wang
- Department of Biostatistics & Computational Biology, SKLG, School of Life Sciences, Fudan University, Shanghai, China
- Lindsey Leach
- School of Biosciences, The University of Birmingham, Birmingham B15 2TT, United Kingdom
- Zewei Luo
- Department of Biostatistics & Computational Biology, SKLG, School of Life Sciences, Fudan University, Shanghai, China; School of Biosciences, The University of Birmingham, Birmingham B15 2TT, United Kingdom
|
23
|
iRegulon: from a gene list to a gene regulatory network using large motif and track collections. PLoS Comput Biol 2014; 10:e1003731. [PMID: 25058159 PMCID: PMC4109854 DOI: 10.1371/journal.pcbi.1003731] [Citation(s) in RCA: 613] [Impact Index Per Article: 61.3] [Received: 02/13/2014] [Accepted: 05/27/2014] [Indexed: 01/17/2023]
Abstract
Identifying master regulators of biological processes and mapping their downstream gene networks are key challenges in systems biology. We developed a computational method, called iRegulon, to reverse-engineer the transcriptional regulatory network underlying a co-expressed gene set using cis-regulatory sequence analysis. iRegulon implements a genome-wide ranking-and-recovery approach to detect enriched transcription factor motifs and their optimal sets of direct targets. We increase the accuracy of network inference by using very large motif collections of up to ten thousand position weight matrices collected from various species, and linking these to candidate human TFs via a motif2TF procedure. We validate iRegulon on gene sets derived from ENCODE ChIP-seq data with increasing levels of noise, and we compare iRegulon with existing motif discovery methods. Next, we use iRegulon on more challenging types of gene lists, including microRNA target sets, protein-protein interaction networks, and genetic perturbation data. In particular, we over-activate p53 in breast cancer cells, followed by RNA-seq and ChIP-seq, and could identify an extensive up-regulated network controlled directly by p53. Similarly we map a repressive network with no indication of direct p53 regulation but rather an indirect effect via E2F and NFY. Finally, we generalize our computational framework to include regulatory tracks such as ChIP-seq data and show how motif and track discovery can be combined to map functional regulatory interactions among co-expressed genes. iRegulon is available as a Cytoscape plugin from http://iregulon.aertslab.org.
Gene regulatory networks control developmental, homeostatic, and disease processes by governing precise levels and spatio-temporal patterns of gene expression. Determining their topology can provide mechanistic insight into these processes. Gene regulatory networks consist of interactions between transcription factors and their direct target genes. Each regulatory interaction represents the binding of the transcription factor to a specific DNA binding site near its target gene. Here we present a computational method, called iRegulon, to identify master regulators and direct target genes in a human gene signature, i.e. a set of co-expressed genes. iRegulon relies on the analysis of the regulatory sequences around each gene in the gene set to detect enriched TF motifs or ChIP-seq peaks, using databases of nearly 10,000 TF motifs and 1000 ChIP-seq data sets or “tracks”. Next, it associates enriched motifs and tracks with candidate transcription factors and determines the optimal subset of direct target genes. We validate iRegulon on ENCODE data, and use it in combination with RNA-seq and ChIP-seq data to map a p53 downstream network with new predicted co-factors and targets. iRegulon is available as a Cytoscape plugin, supporting human, mouse, and Drosophila genes, and provides access to hundreds of cancer-related TF-target subnetworks or “regulons”.
|
24
|
Abstract
In this paper we present NPEST, a novel tool for the analysis of expressed sequence tags (EST) distributions and transcription start site (TSS) prediction. This method estimates an unknown probability distribution of ESTs using a maximum likelihood (ML) approach, which is then used to predict positions of TSS. Accurate identification of TSS is an important genomics task, since the position of regulatory elements with respect to the TSS can have large effects on gene regulation, and performance of promoter motif-finding methods depends on correct identification of TSSs. Our probabilistic approach expands recognition capabilities to multiple TSS per locus that may be a useful tool to enhance the understanding of alternative splicing mechanisms. This paper presents analysis of simulated data as well as statistical analysis of promoter regions of a model dicot plant Arabidopsis thaliana. Using our statistical tool we analyzed 16520 loci and developed a database of TSS, which is now publicly available at www.glacombio.net/NPEST.
|
25
|
Triska M, Grocutt D, Southern J, Murphy DJ, Tatarinova T. cisExpress: motif detection in DNA sequences. Bioinformatics 2013; 29:2203-5. [PMID: 23793750 DOI: 10.1093/bioinformatics/btt366] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Indexed: 11/14/2022]
Abstract
MOTIVATION One of the major challenges for contemporary bioinformatics is the analysis and accurate annotation of genomic datasets to enable extraction of useful information about the functional role of DNA sequences. This article describes a novel genome-wide statistical approach to the detection of specific DNA sequence motifs based on similarities between the promoters of similarly expressed genes. This new tool, cisExpress, is especially designed for use with large datasets, such as those generated by publicly accessible whole genome and transcriptome projects. cisExpress uses a task farming algorithm to exploit all available computational cores within a shared memory node. We demonstrate the robust nature and validity of the proposed method. It is applicable for use with a wide range of genomic databases for any species of interest. AVAILABILITY cisExpress is available at www.cisexpress.org.
Affiliation(s)
- Martin Triska
- Genomics and Computational Biology Research Group, Faculty of Computing, Engineering and Science, University of South Wales, Pontypridd, UK
|
26
|
Abstract
Transcription factors and the short, often degenerate DNA sequences they recognize are central regulators of gene expression, but their regulatory code is challenging to dissect experimentally. Thus, computational approaches have long been used to identify putative regulatory elements from the patterns in promoter sequences. Here we present a new algorithm “POWRS” (POsition-sensitive WoRd Set) for identifying regulatory sequence motifs, specifically developed to address two common shortcomings of existing algorithms. First, POWRS uses the position-specific enrichment of regulatory elements near transcription start sites to significantly increase sensitivity, while providing new information about the preferred localization of those elements. Second, POWRS forgoes position weight matrices for a discrete motif representation that appears more resistant to over-generalization. We apply this algorithm to discover sequences related to constitutive, high-level gene expression in the model plant Arabidopsis thaliana, and then experimentally validate the importance of those elements by systematically mutating two endogenous promoters and measuring the effect on gene expression levels. This provides a foundation for future efforts to rationally engineer gene expression in plants, a problem of great importance in developing biotech crop varieties. Availability: BSD-licensed Python code at http://grassrootsbio.com/papers/powrs/.
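The idea of position-specific enrichment near transcription start sites can be illustrated with a small sketch that tallies where a given word occurs relative to the TSS across a set of promoters. This is not the POWRS algorithm itself, only a minimal illustration under stated assumptions: the function name `positional_word_counts`, the toy promoter sequences, and the convention that every sequence has its TSS at the same index are all hypothetical.

```python
from collections import Counter
from typing import Iterable


def positional_word_counts(promoters: Iterable[str], word: str, tss_index: int) -> Counter:
    """Count occurrences of `word` at each offset relative to the TSS.

    Assumes (hypothetically) that all promoters are aligned so that the TSS
    sits at `tss_index` in every sequence. Overlapping matches are counted.
    """
    word = word.upper()
    counts: Counter = Counter()
    for seq in promoters:
        seq = seq.upper()
        start = seq.find(word)
        while start != -1:
            counts[start - tss_index] += 1
            start = seq.find(word, start + 1)
    return counts


if __name__ == "__main__":
    toy_promoters = ["GGTATAAAGGC", "CCTATAAATTC", "TATAAACCCCC"]
    # Offsets are negative for positions upstream of the assumed TSS index.
    print(positional_word_counts(toy_promoters, "TATAAA", tss_index=10))
```

A word whose counts pile up at a narrow range of offsets, rather than spreading evenly along the promoter, is the kind of signal a position-sensitive method can exploit; a real analysis would also need a background model to judge significance.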
|
27
|
Xie T, Zhang C, Zhang B, Molony C, Oudes A, Roberts C, Dai H, Schadt E, Lamb J. A survey of cancer cell lines reveals highly structured and hierarchical relationships within and between DNA and mRNA that may be the result of selection. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2010; 14:91-7. [PMID: 20141331 DOI: 10.1089/omi.2009.0114] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 01/23/2023]
Abstract
Copy number variation (CNV) is one of the most profound forms of somatic DNA changes that underlie most human cancers. However, the degree of complexity within and between DNA and mRNA variations in cancer cohorts has yet to be fully characterized. Here we characterized the connectivity of CNV/CNV and its contribution to transcriptome in human cancer cell lines. Strikingly, we found there is a significant nonrandom correlation of many unlinked DNA loci and also a significant association between CNV and mRNA expression in cis and in trans (called eCNV). Both distributions of DNA/DNA and DNA/mRNA associations exhibit a scale-free structure showing that, for DNA/DNA, a few loci correlate to many other loci, whereas most loci correlate to only a few loci; and for DNA/mRNA, certain chromosomal loci associate with many mRNAs and that many mRNAs are controlled by more than one locus. This suggests that a small number of DNA loci act as hubs in a hierarchical structure that is highly nonrandom in nature, and genes linking to these hot spots tend to be involved in similar biological functions. Derivation of highly connected structures suggests a process of undirected copy number changes followed by selection of those advantageous to tumor cells during tumorigenesis. Given that the cohort includes many tissue types, our observations may identify a common and important underlying structure present in human tumors.
Affiliation(s)
- Tao Xie
- Rosetta Inpharmatics LLC, Seattle, Washington, USA
|
28
|
Tatarinova TV, Alexandrov NN, Bouck JB, Feldmann KA. GC3 biology in corn, rice, sorghum and other grasses. BMC Genomics 2010; 11:308. [PMID: 20470436 PMCID: PMC2895627 DOI: 10.1186/1471-2164-11-308] [Citation(s) in RCA: 105] [Impact Index Per Article: 7.5] [Received: 09/16/2009] [Accepted: 05/16/2010] [Indexed: 11/10/2022]
Abstract
BACKGROUND The third, or wobble, position in a codon provides a high degree of possible degeneracy and is an elegant fault-tolerance mechanism. Nucleotide biases between organisms at the wobble position have been documented and correlated with the abundances of the complementary tRNAs. We and others have noticed a bias for cytosine and guanine at the third position in a subset of transcripts within a single organism. The bias is present in some plant species and warm-blooded vertebrates but not in all plants, or in invertebrates or cold-blooded vertebrates. RESULTS Here we demonstrate that in certain organisms the amount of GC at the wobble position (GC3) can be used to distinguish two classes of genes. We highlight the following features of genes with high GC3 content: they (1) provide more targets for methylation, (2) exhibit more variable expression, (3) more frequently possess upstream TATA boxes, (4) are predominant in certain classes of genes (e.g., stress responsive genes) and (5) have a GC3 content that increases from 5' to 3'. These observations led us to formulate a hypothesis to explain GC3 bimodality in grasses. CONCLUSIONS Our findings suggest that high levels of GC3 typify a class of genes whose expression is regulated through DNA methylation or are a legacy of accelerated evolution through gene conversion. We discuss the three most probable explanations for GC3 bimodality: biased gene conversion, transcriptional and translational advantage and gene methylation.
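GC3, as defined in the abstract above, is the amount of G and C at the third (wobble) codon position. A minimal sketch of that calculation for a single in-frame coding sequence follows; the function name `gc3_content` and the example sequence are hypothetical, and the sketch assumes the sequence starts at the first base of a codon.

```python
def gc3_content(cds: str) -> float:
    """Fraction of codons whose third base is G or C.

    Assumes (hypothetically) that `cds` is in frame, starting at codon
    position 1. Trailing bases that do not complete a codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    third_bases = cds[2:usable:3]
    if not third_bases:
        raise ValueError("sequence is shorter than one codon")
    return sum(base in "GC" for base in third_bases) / len(third_bases)


if __name__ == "__main__":
    # Codons ATG, GCC, AAA have third bases G, C, A.
    print(gc3_content("ATGGCCAAA"))
```

Computing this value for every gene in a genome and plotting the histogram is one way to look for the bimodality the abstract describes.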
Affiliation(s)
- Tatiana V Tatarinova
- Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, USA.
|