1
Barbero-Aparicio JA, Olivares-Gil A, Díez-Pastor JF, García-Osorio C. Deep learning and support vector machines for transcription start site identification. PeerJ Comput Sci 2023; 9:e1340. [PMID: 37346545 PMCID: PMC10280436 DOI: 10.7717/peerj-cs.1340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2022] [Accepted: 03/21/2023] [Indexed: 06/23/2023]
Abstract
Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems such as detecting translation initiation sites or promoters, many of the most recent ones based on machine learning. Deep learning methods have proven to be exceptionally effective for this task, but their use in transcription start site identification has not yet been explored in depth. Also, the very few existing works neither compare their methods to support vector machines (SVMs), the most established technique in this area of study, nor provide the curated dataset used in the study. The small number of published papers on this specific problem could be explained by this lack of datasets. Given that both support vector machines and deep neural networks have been applied in related problems with remarkable results, we compared their performance in transcription start site prediction, concluding that SVMs are computationally much slower and that deep learning methods, especially long short-term memory neural networks (LSTMs), are better suited than SVMs to work with sequences. For this purpose, we used the reference human genome GRCh38. Additionally, we studied two different aspects related to data processing: the proper way to generate training examples and the imbalanced nature of the data. Furthermore, the generalization performance of the models studied was also tested using the mouse genome, where the LSTM neural network stood out from the rest of the algorithms. To sum up, this article provides an analysis of the best architecture choices in transcription start site identification, as well as a method to generate transcription start site datasets including negative instances on any species available in Ensembl. We found that deep learning methods are better suited than SVMs to solve this problem, being more efficient and better adapted to long sequences and large amounts of data. We also create a transcription start site (TSS) dataset large enough to be used in deep learning experiments.
Affiliation(s)
- Alicia Olivares-Gil
- Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
- José F. Díez-Pastor
- Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
- César García-Osorio
- Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
2
Genome-Wide Prediction of Transcription Start Sites in Conifers. Int J Mol Sci 2022; 23:ijms23031735. [PMID: 35163661 PMCID: PMC8836283 DOI: 10.3390/ijms23031735] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/30/2021] [Revised: 01/30/2022] [Accepted: 02/01/2022] [Indexed: 02/04/2023]
Abstract
The identification of promoters is an essential step in the genome annotation process, providing a framework for gene regulatory networks and their role in transcription regulation. Despite considerable advances in the high-throughput determination of transcription start sites (TSSs) and transcription factor binding sites (TFBSs), experimental methods are still time-consuming and expensive. Instead, several computational approaches have been developed to provide fast and reliable means for predicting the location of TSSs and regulatory motifs on a genome-wide scale. Numerous studies have been carried out on the regulatory elements of mammalian genomes, but plant promoters, especially in gymnosperms, have been left out of the limelight and, therefore, have been poorly investigated. The aim of this study was to enhance and expand the existing genome annotations using computational approaches for genome-wide prediction of TSSs in the four conifer species: loblolly pine, white spruce, Norway spruce, and Siberian larch. Our pipeline will be useful for TSS predictions in other genomes, especially for draft assemblies, where reliable TSS predictions are not usually available. We also explored some of the features of the nucleotide composition of the predicted promoters and compared the GC properties of conifer genes with model monocot and dicot plants. Here, we demonstrate that even incomplete genome assemblies and partial annotations can be a reliable starting point for TSS annotation. The results of the TSS prediction in four conifer species have been deposited in the Persephone genome browser, which allows smooth visualization and is optimized for large data sets. This work provides the initial basis for future experimental validation and the study of the regulatory regions to understand gene regulation in gymnosperms.
3
Umarov R, Li Y, Arakawa T, Takizawa S, Gao X, Arner E. ReFeaFi: Genome-wide prediction of regulatory elements driving transcription initiation. PLoS Comput Biol 2021; 17:e1009376. [PMID: 34491989 PMCID: PMC8448322 DOI: 10.1371/journal.pcbi.1009376] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 06/22/2021] [Revised: 09/17/2021] [Accepted: 08/23/2021] [Indexed: 11/19/2022]
Abstract
Regulatory elements control gene expression through transcription initiation (promoters) and by enhancing transcription at distant regions (enhancers). Accurate identification of regulatory elements is fundamental for annotating genomes and understanding gene expression patterns. While there are many attempts to develop computational promoter and enhancer identification methods, reliable tools to analyze long genomic sequences are still lacking. Prediction methods often perform poorly on the genome-wide scale because the number of negatives is much higher than that in the training sets. To address this issue, we propose a dynamic negative set updating scheme with a two-model approach, using one model for scanning the genome and the other one for testing candidate positions. The developed method achieves good genome-level performance and maintains robust performance when applied to other vertebrate species, without re-training. Moreover, the unannotated predicted regulatory regions made on the human genome are enriched for disease-associated variants, suggesting them to be potentially true regulatory elements rather than false positives. We validated high scoring "false positive" predictions using reporter assay and all tested candidates were successfully validated, demonstrating the ability of our method to discover novel human regulatory regions.
Affiliation(s)
- Ramzan Umarov
- Graduate School of Integrated Sciences for Life, Hiroshima University, Higashi-Hiroshima, Japan
- Yu Li
- Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong, People’s Republic of China
- Takahiro Arakawa
- Laboratory for Applied Regulatory Genomics Network Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
- Satoshi Takizawa
- Laboratory for Applied Regulatory Genomics Network Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
- Xin Gao
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, Thuwal, Saudi Arabia
- Erik Arner
- Graduate School of Integrated Sciences for Life, Hiroshima University, Higashi-Hiroshima, Japan
- Laboratory for Applied Regulatory Genomics Network Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
4
Pachganov S, Murtazalieva K, Zarubin A, Taran T, Chartier D, Tatarinova TV. Prediction of Rice Transcription Start Sites Using TransPrise: A Novel Machine Learning Approach. Methods Mol Biol 2021; 2238:261-274. [PMID: 33471337 DOI: 10.1007/978-1-0716-1068-8_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
As the interest in genetic resequencing increases, so does the need for effective mathematical, computational, and statistical approaches. One of the difficult problems in genome annotation is the determination of the precise positions of transcription start sites. In this paper, we present TransPrise, an efficient deep learning tool for predicting the positions of eukaryotic transcription start sites. TransPrise offers significant improvement over existing promoter-prediction methods. To illustrate this, we compared predictions of TransPrise with the TSSPlant approach for the well-annotated genome of Oryza sativa. Using a computer with a graphics processing unit, the run time of TransPrise is 250 min on a 374 Mb genome. We provide the full basis for the comparison and encourage users to freely access a set of our computational tools to facilitate and streamline their own analyses. The ready-to-use Docker image with all the necessary packages, models, and code as well as the source code of the TransPrise algorithm are available at http://compubioverne.group/. The source code is ready to use and to be customized to predict TSS in any eukaryotic organism.
Affiliation(s)
- Stepan Pachganov
- Ugra Research Institute of Information Technologies, Khanty-Mansiysk, Russia
- Alexei Zarubin
- Tomsk National Research Medical Center of the Russian Academy of Sciences, Research Institute of Medical Genetics, Tomsk, Russia
- Duane Chartier
- International Center for Art Intelligence, Inc, Los Angeles, CA, USA
- Tatiana V Tatarinova
- Vavilov Institute of General Genetics, Moscow, Russia
- Department of Biology, University of La Verne, La Verne, CA, USA
- A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia
- Siberian Federal University, Krasnoyarsk, Russia
5
Pachganov S, Murtazalieva K, Zarubin A, Sokolov D, Chartier DR, Tatarinova TV. TransPrise: a novel machine learning approach for eukaryotic promoter prediction. PeerJ 2019; 7:e7990. [PMID: 31695967 PMCID: PMC6827441 DOI: 10.7717/peerj.7990] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 07/09/2019] [Accepted: 10/04/2019] [Indexed: 02/01/2023]
Abstract
As interest in genetic resequencing increases, so does the need for effective mathematical, computational, and statistical approaches. One of the difficult problems in genome annotation is the determination of the precise positions of transcription start sites. In this paper we present TransPrise, an efficient deep learning tool for predicting the positions of eukaryotic transcription start sites. Our pipeline consists of two parts: a binary classifier runs first, and if a sequence is classified as TSS-containing, a regression step follows in which the precise location of the TSS is identified. TransPrise offers significant improvement over existing promoter-prediction methods. To illustrate this, we compared predictions of the TransPrise classification and regression models with the TSSPlant approach for the well-annotated genome of Oryza sativa. Using a computer equipped with a graphics processing unit, the run time of TransPrise is 250 minutes on a 374 Mb genome. The Matthews correlation coefficient value for TransPrise is 0.79, more than two times larger than the 0.31 for TSSPlant classification models. This represents a high level of prediction accuracy. Additionally, the mean absolute error for the regression model is 29.19 nt, allowing for accurate prediction of TSS location. TransPrise was also tested in Homo sapiens, where the mean absolute error of the regression model was 47.986 nt. We provide the full basis for the comparison and encourage users to freely access a set of our computational tools to facilitate and streamline their own analyses. The ready-to-use Docker image with all necessary packages, models, and code as well as the source code of the TransPrise algorithm are available at http://compubioverne.group/. The source code is ready to use and customizable to predict TSS in any eukaryotic organism.
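The two-part design described in this abstract (classify a window first, then regress the TSS position only for accepted windows) can be sketched roughly as follows. This is a minimal illustration, not the TransPrise code: `predict_tss`, `mean_absolute_error`, and the two toy stand-in models are hypothetical names introduced here, and the toy models simply look for a "TATA" substring rather than running a neural network.

```python
from typing import Callable, Optional, Sequence


def predict_tss(window: str,
                contains_tss: Callable[[str], bool],
                locate_tss: Callable[[str], int]) -> Optional[int]:
    """Two-stage prediction: run the regressor only if the classifier accepts."""
    if not contains_tss(window):
        return None  # stage 1 rejected the window, so no position is reported
    return locate_tss(window)  # stage 2: position within the window


def mean_absolute_error(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Average absolute distance (in nt) between predicted and annotated positions."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Toy stand-ins for the trained models (assumptions for illustration only).
def toy_contains_tss(window: str) -> bool:
    return "TATA" in window


def toy_locate_tss(window: str) -> int:
    return window.find("TATA")


if __name__ == "__main__":
    print(predict_tss("GGGTATAAAGGG", toy_contains_tss, toy_locate_tss))
    print(predict_tss("GGGCCC", toy_contains_tss, toy_locate_tss))
    print(mean_absolute_error([100, 210], [110, 200]))
```

Passing the two models in as callables keeps the control flow separate from whatever classifier and regressor are actually used, which is the only part of the pipeline the abstract specifies.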
Affiliation(s)
- Stepan Pachganov
- Ugra Research Institute of Information Technologies, Khanty-Mansiysk, Russia
- Khalimat Murtazalieva
- Vavilov Institute for General Genetics, Moscow, Russia
- Institute of Bioinformatics, Moscow, Russia
- Aleksei Zarubin
- Tomsk National Research Medical Center of the Russian Academy of Sciences, Research Institute of Medical Genetics, Tomsk, Russia
- Duane R Chartier
- International Center for Art Intelligence, Inc., Los Angeles, CA, United States of America
- Tatiana V Tatarinova
- Vavilov Institute for General Genetics, Moscow, Russia
- Department of Biology, University of La Verne, La Verne, CA, United States of America
- A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia
- Siberian Federal University, Krasnoyarsk, Russia
6
Triska M, Solovyev V, Baranova A, Kel A, Tatarinova TV. Nucleotide patterns aiding in prediction of eukaryotic promoters. PLoS One 2017; 12:e0187243. [PMID: 29141011 PMCID: PMC5687710 DOI: 10.1371/journal.pone.0187243] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 07/02/2017] [Accepted: 09/05/2017] [Indexed: 01/09/2023]
Abstract
Computational analysis of promoters is hindered by the complexity of their architecture. In less studied genomes with complex organization, false positive promoter predictions are common. Accurate identification of transcription start sites and core promoter regions remains an unsolved problem. In this paper, we present a comprehensive analysis of genomic features associated with promoters and show that probabilistic integrative algorithms-driven models allow accurate classification of DNA sequence into “promoters” and “non-promoters” even in absence of the full-length cDNA sequences. These models may be built upon the maps of the distributions of sequence polymorphisms, RNA sequencing reads on genomic DNA, methylated nucleotides, transcription factor binding sites, as well as relative frequencies of nucleotides and their combinations. Positional clustering of binding sites shows that the cells of Oryza sativa utilize three distinct classes of transcription factors: those that bind preferentially to the [-500,0] region (188 “promoter-specific” transcription factors), those that bind preferentially to the [0,500] region (282 “5′ UTR-specific” TFs), and 207 of the “promiscuous” transcription factors with little or no location preference with respect to TSS. For the most informative motifs, their positional preferences are conserved between dicots and monocots.
Affiliation(s)
- Martin Triska
- Children’s Hospital Los Angeles, University of Southern California, Los Angeles, CA, United States of America
- Faculty of Advanced Technology, University of South Wales, Pontypridd, Wales, United Kingdom
- Ancha Baranova
- School of Systems Biology, George Mason University, Fairfax, VA, United States of America
- Research Centre for Medical Genetics, Moscow, Russia
- Alexander Kel
- geneXplain GmbH, Wolfenbuettel, Germany
- Institute of Chemical Biology and Fundamental Medicine, Novosibirsk, Russia
- Tatiana V. Tatarinova
- School of Systems Biology, George Mason University, Fairfax, VA, United States of America
- Department of Biology, Division of Natural Sciences, University of La Verne, La Verne, CA, United States of America
- Bioinformatics Center, AA Kharkevich Institute for Information Transmission Problems RAS, Moscow, Russia
- Vavilov’s Institute for General Genetics, Moscow, Russia
7
Umarov RK, Solovyev VV. Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. PLoS One 2017; 12:e0171410. [PMID: 28158264 PMCID: PMC5291440 DOI: 10.1371/journal.pone.0171410] [Citation(s) in RCA: 130] [Impact Index Per Article: 18.6] [Received: 10/16/2016] [Accepted: 01/20/2017] [Indexed: 11/18/2022]
Abstract
Accurate computational identification of promoters remains a challenge as these key DNA regulatory regions have variable structures composed of functional motifs that provide gene-specific initiation of transcription. In this paper we utilize Convolutional Neural Networks (CNN) to analyze sequence characteristics of prokaryotic and eukaryotic promoters and build their predictive models. We trained a similar CNN architecture on promoters of five distant organisms: human, mouse, plant (Arabidopsis), and two bacteria (Escherichia coli and Bacillus subtilis). We found that a CNN trained on the sigma70 subclass of Escherichia coli promoters gives an excellent classification of promoter and non-promoter sequences (Sn = 0.90, Sp = 0.96, CC = 0.84). The Bacillus subtilis promoter identification CNN model achieves Sn = 0.91, Sp = 0.95, and CC = 0.86. For human, mouse and Arabidopsis promoters we employed CNNs for identification of two well-known promoter classes (TATA and non-TATA promoters). CNN models nicely recognize these complex functional regions. For human promoters, the Sn/Sp/CC accuracy of prediction reached 0.95/0.98/0.90 on TATA and 0.90/0.98/0.89 on non-TATA promoter sequences, respectively. For Arabidopsis we observed Sn/Sp/CC of 0.95/0.97/0.91 (TATA) and 0.94/0.94/0.86 (non-TATA). Thus, the developed CNN models, implemented in the CNNProm program, demonstrated the ability of the deep learning approach to grasp complex promoter sequence characteristics and achieve significantly higher accuracy compared to the previously developed promoter prediction programs. We also propose a random substitution procedure to discover positionally conserved promoter functional elements. As the suggested approach does not require knowledge of any specific promoter features, it can be easily extended to identify promoters and other complex functional regions in sequences of many other and especially newly sequenced genomes. The CNNProm program is available to run at the web server http://www.softberry.com.
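For readers unfamiliar with the Sn/Sp/CC triple reported in this abstract, the following sketch shows how such values are commonly derived from a confusion matrix. It assumes the usual definitions (sensitivity, specificity, and the Matthews-style correlation coefficient); the abstract itself does not spell out the formulas, and the function name `sn_sp_cc` and the example counts are illustrative only.

```python
import math
from typing import Tuple


def sn_sp_cc(tp: int, tn: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, correlation coefficient).

    Assumed definitions:
      Sn = TP / (TP + FN)
      Sp = TN / (TN + FP)
      CC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


if __name__ == "__main__":
    # Hypothetical counts: 100 promoters and 100 non-promoters.
    sn, sp, cc = sn_sp_cc(tp=90, tn=96, fp=4, fn=10)
    print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.2f}")
```

Unlike Sn or Sp alone, CC uses all four cells of the confusion matrix, so it stays informative when positives and negatives are unevenly represented.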
Affiliation(s)
- Ramzan Kh. Umarov
- King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
8
Zolotarenko A, Chekalin E, Mehta R, Baranova A, Tatarinova TV, Bruskin S. Identification of Transcriptional Regulators of Psoriasis from RNA-Seq Experiments. Methods Mol Biol 2017; 1613:355-370. [PMID: 28849568 DOI: 10.1007/978-1-4939-7027-8_14] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 06/07/2023]
Abstract
Psoriasis is a common inflammatory skin disease with complex etiology and chronic progression. To provide novel insights into the molecular mechanisms of regulation of the disease we performed RNA sequencing (RNA-Seq) analysis of 14 pairs of skin samples collected from psoriatic patients. Subsequent pathway analysis and an extraction of transcriptional regulators governing psoriasis-associated pathways was executed using a combination of MetaCore Interactome enrichment tool and cisExpress algorithm, and followed by comparison to a set of previously described psoriasis response elements. A comparative approach has allowed us to identify 42 core transcriptional regulators of the disease associated with inflammation (NFkB, IRF9, JUN, FOS, SRF), activity of T-cells in the psoriatic lesions (STAT6, FOXP3, NFATC2, GATA3, TCF7, RUNX1, etc.), hyperproliferation and migration of keratinocytes (JUN, FOS, NFIB, TFAP2A, TFAP2C), and lipid metabolism (TFAP2, RARA, VDR). After merging the ChIP-seq and RNA-seq data, we conclude that the atypical expression of FOXA1 transcriptional factor is an important player in psoriasis, as it inhibits maturation of naive T cells into this Treg subpopulation (CD4+FOXA1+CD47+CD69+PD-L1(hi)FOXP3-), therefore contributing to the development of psoriatic skin lesions.
Affiliation(s)
- Alena Zolotarenko
- Laboratory of Functional Genomics, Vavilov Institute of General Genetics RAS, Gubkina Street 3, 119991 Moscow, Russia
- Evgeny Chekalin
- Laboratory of Functional Genomics, Vavilov Institute of General Genetics RAS, Gubkina Street 3, 119991 Moscow, Russia
- Rohini Mehta
- The Center of the Study of Chronic Metabolic and Rare Diseases, School of Systems Biology, George Mason University, Fairfax, VA, USA
- Ancha Baranova
- The Center of the Study of Chronic Metabolic and Rare Diseases, School of Systems Biology, George Mason University, Fairfax, VA, USA
- Research Centre for Medical Genetics RAMS, Moscow, Russia
- Moscow Institute of Physics and Technology, Dolgoprudny, Moscow, Russia
- Atlas Biomed Group, Moscow, Russia
- Tatiana V Tatarinova
- Atlas Biomed Group, Moscow, Russia
- Center for Personalized Medicine, Children's Hospital Los Angeles and Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA
- A.A. Kharkevich Institute for Information Transmission Problems RAS, Moscow, Russia
- Sergey Bruskin
- Laboratory of Functional Genomics, Vavilov Institute of General Genetics RAS, Gubkina Street 3, 119991 Moscow, Russia
- Moscow Institute of Physics and Technology, Dolgoprudny, Moscow, Russia
9
Triska M, Ivliev A, Nikolsky Y, Tatarinova TV. Analysis of cis-Regulatory Elements in Gene Co-expression Networks in Cancer. Methods Mol Biol 2017; 1613:291-310. [PMID: 28849565 DOI: 10.1007/978-1-4939-7027-8_11] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 01/22/2023]
Abstract
Analysis of gene co-expression networks is a powerful "data-driven" tool, invaluable for understanding cancer biology and mechanisms of tumor development. Yet, despite the completion of thousands of studies on cancer gene expression, there have been few attempts to normalize and integrate co-expression data from scattered sources in a concise "meta-analysis" framework. Here we describe an integrated approach to cancer expression meta-analysis, which combines generation of "data-driven" co-expression networks with detailed statistical detection of promoter sequence motifs within the co-expression clusters. First, we applied the Weighted Gene Co-Expression Network Analysis (WGCNA) workflow and Pearson's correlation to generate a comprehensive set of over 3000 co-expression clusters in 82 normalized microarray datasets from nine cancers of different origin. Next, we designed a genome-wide statistical approach to the detection of specific DNA sequence motifs based on similarities between the promoters of similarly expressed genes. The approach, realized as the cisExpress software module, was specifically designed for analysis of very large data sets such as those generated by publicly accessible whole genome and transcriptome projects. cisExpress uses a task farming algorithm to exploit all available computational cores within a shared memory node. We discovered that although co-expression modules are populated with different sets of genes, they share distinct stable patterns of co-regulation based on promoter sequence analysis. The number of motifs per co-expression cluster varies widely in accordance with cancer tissue of origin, with the largest number in colon (68 motifs) and the lowest in ovary (18 motifs). The top scored motifs are typically shared between several tissues; they define sets of target genes responsible for certain functionality of cancerogenesis. Both the co-expression modules and a database of precalculated motifs are publicly available and accessible for further studies.
Affiliation(s)
- Martin Triska
- Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA
- Yuri Nikolsky
- Prosapia Genetics, Solana Beach, CA, USA
- School of Systems Biology, George Mason University, Fairfax, VA, USA
- Tatiana V Tatarinova
- Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA
- Center for Personalized Medicine, Children's Hospital Los Angeles, 4640 Hollywood Blvd, Los Angeles, CA, 90027, USA
- A.A. Kharkevich Institute for Information Transmission Problems RAS, Moscow, Russia
10
Integrated computational approach to the analysis of RNA-seq data reveals new transcriptional regulators of psoriasis. Exp Mol Med 2016; 48:e268. [PMID: 27811935 PMCID: PMC5133374 DOI: 10.1038/emm.2016.97] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 03/10/2016] [Revised: 05/06/2016] [Accepted: 05/24/2016] [Indexed: 02/07/2023]
Abstract
Psoriasis is a common inflammatory skin disease with complex etiology and chronic progression. To provide novel insights into the regulatory molecular mechanisms of the disease, we performed RNA sequencing analysis of 14 pairs of skin samples collected from patients with psoriasis. Subsequent pathway analysis and extraction of the transcriptional regulators governing psoriasis-associated pathways was executed using a combination of the MetaCore Interactome enrichment tool and the cisExpress algorithm, followed by comparison to a set of previously described psoriasis response elements. A comparative approach allowed us to identify 42 core transcriptional regulators of the disease associated with inflammation (NFκB, IRF9, JUN, FOS, SRF), the activity of T cells in psoriatic lesions (STAT6, FOXP3, NFATC2, GATA3, TCF7, RUNX1), the hyperproliferation and migration of keratinocytes (JUN, FOS, NFIB, TFAP2A, TFAP2C) and lipid metabolism (TFAP2, RARA, VDR). In addition to the core regulators, we identified 38 transcription factors previously not associated with the disease that can clarify the pathogenesis of psoriasis. To illustrate these findings, we analyzed the regulatory role of one of the identified transcription factors (TFs), FOXA1. Using ChIP-seq and RNA-seq data, we concluded that the atypical expression of the FOXA1 TF is an important player in the disease as it inhibits the maturation of naive T cells into the (CD4+FOXA1+CD47+CD69+PD-L1(hi)FOXP3-) regulatory T cell subpopulation, therefore contributing to the development of psoriatic skin lesions.
11
Tatarinova TV, Chekalin E, Nikolsky Y, Bruskin S, Chebotarov D, McNally KL, Alexandrov N. Nucleotide diversity analysis highlights functionally important genomic regions. Sci Rep 2016; 6:35730. [PMID: 27774999 PMCID: PMC5075931 DOI: 10.1038/srep35730] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Received: 05/12/2016] [Accepted: 09/30/2016] [Indexed: 12/15/2022]
Abstract
We analyzed functionality and relative distribution of genetic variants across the complete Oryza sativa genome, using the 40 million single nucleotide polymorphisms (SNPs) dataset from the 3,000 Rice Genomes Project (http://snp-seek.irri.org), the largest and highest density SNP collection for any higher plant. We have shown that the DNA-binding transcription factors (TFs) are the most conserved group of genes, whereas kinases and membrane-localized transporters are the most variable ones. TFs may be conserved because they belong to some of the most connected regulatory hubs that modulate transcription of vast downstream gene networks, whereas signaling kinases and transporters need to adapt rapidly to changing environmental conditions. In general, the observed profound patterns of nucleotide variability reveal functionally important genomic regions. As expected, nucleotide diversity is much higher in intergenic regions than within gene bodies (regions spanning gene models), and protein-coding sequences are more conserved than untranslated gene regions. We have observed a sharp decline in nucleotide diversity that begins at about 250 nucleotides upstream of the transcription start and reaches minimal diversity exactly at the transcription start. We found the transcription termination sites to have remarkably symmetrical patterns of SNP density, implying presence of functional sites near transcription termination. Also, nucleotide diversity was significantly lower near 3′ UTRs, the area rich with regulatory regions.
Affiliation(s)
- Tatiana V Tatarinova
- Center for Personalized Medicine and Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA
- Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russian Federation
- Yuri Nikolsky
- Vavilov Institute of General Genetics, Moscow, Russia
- F1 Genomics, San Diego, CA, USA
- School of Systems Biology, George Mason University, VA, USA
- Dmitry Chebotarov
- International Rice Research Institute, Los Baños, Laguna 4031, Philippines
- Kenneth L McNally
- International Rice Research Institute, Los Baños, Laguna 4031, Philippines
12
Morozova I, Flegontov P, Mikheyev AS, Bruskin S, Asgharian H, Ponomarenko P, Klyuchnikov V, ArunKumar G, Prokhortchouk E, Gankin Y, Rogaev E, Nikolsky Y, Baranova A, Elhaik E, Tatarinova TV. Toward high-resolution population genomics using archaeological samples. DNA Res 2016; 23:295-310. [PMID: 27436340 PMCID: PMC4991838 DOI: 10.1093/dnares/dsw029] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 11/22/2015] [Accepted: 05/22/2016] [Indexed: 12/30/2022]
Abstract
The term ‘ancient DNA’ (aDNA) is coming of age, with over 1,200 hits in the PubMed database, beginning in the early 1980s with the studies of ‘molecular paleontology’. Rooted in cloning and limited sequencing of DNA from ancient remains during the pre-PCR era, the field has made incredible progress since the introduction of PCR and next-generation sequencing. Over the last decade, aDNA analysis ushered in a new era in genomics and became the method of choice for reconstructing the history of organisms, their biogeography, and migration routes, with applications in evolutionary biology, population genetics, archaeogenetics, paleo-epidemiology, and many other areas. This change was brought by development of new strategies for coping with the challenges in studying aDNA due to damage and fragmentation, scarce samples, significant historical gaps, and limited applicability of population genetics methods. In this review, we describe the state-of-the-art achievements in aDNA studies, with particular focus on human evolution and demographic history. We present the current experimental and theoretical procedures for handling and analysing highly degraded aDNA. We also review the challenges in the rapidly growing field of ancient epigenomics. Advancement of aDNA tools and methods signifies a new era in population genetics and evolutionary medicine research.
Affiliation(s)
- Irina Morozova
  - Institute of Evolutionary Medicine, University of Zurich, Zurich, Switzerland
- Pavel Flegontov
  - Department of Biology and Ecology, Faculty of Science, University of Ostrava, Ostrava, Czech Republic; Bioinformatics Center, A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russian Federation
- Alexander S Mikheyev
  - Ecology and Evolution Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
- Sergey Bruskin
  - Vavilov Institute of General Genetics RAS, Moscow, Russia
- Hosseinali Asgharian
  - Department of Computational and Molecular Biology, University of Southern California, Los Angeles, CA, USA
- Petr Ponomarenko
  - Center for Personalized Medicine, Children's Hospital Los Angeles, Los Angeles, CA, USA; Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA
- Egor Prokhortchouk
  - Research Center of Biotechnology RAS, Moscow, Russia; Department of Biology, Lomonosov Moscow State University, Russia
- Evgeny Rogaev
  - Vavilov Institute of General Genetics RAS, Moscow, Russia; University of Massachusetts Medical School, Worcester, MA, USA
- Yuri Nikolsky
  - Vavilov Institute of General Genetics RAS, Moscow, Russia; F1 Genomics, San Diego, CA, USA; School of Systems Biology, George Mason University, VA, USA
- Ancha Baranova
  - School of Systems Biology, George Mason University, VA, USA; Research Centre for Medical Genetics, Moscow, Russia; Atlas Biomed Group, Moscow, Russia
- Eran Elhaik
  - Department of Animal & Plant Sciences, University of Sheffield, Sheffield, South Yorkshire, UK
- Tatiana V Tatarinova
  - Bioinformatics Center, A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russian Federation; Center for Personalized Medicine, Children's Hospital Los Angeles, Los Angeles, CA, USA; Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA
13
Li WL, Buckley J, Sanchez-Lara PA, Maglinte DT, Viduetsky L, Tatarinova TV, Aparicio JG, Kim JW, Au M, Ostrow D, Lee TC, O'Gorman M, Judkins A, Cobrinik D, Triche TJ. A Rapid and Sensitive Next-Generation Sequencing Method to Detect RB1 Mutations Improves Care for Retinoblastoma Patients and Their Families. J Mol Diagn 2016; 18:480-93. [PMID: 27155049 DOI: 10.1016/j.jmoldx.2016.02.006] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 03/17/2015] [Revised: 01/14/2016] [Accepted: 02/01/2016] [Indexed: 01/26/2023]
Abstract
Retinoblastoma is a childhood eye malignancy that can lead to the loss of vision, eye(s), and sometimes life. The tumors are initiated by inactivating mutations in both alleles of the tumor-suppressor gene, RB1, or, rarely, by MYCN amplification. Timely identification of a germline RB1 mutation in blood samples or either somatic RB1 mutation or MYCN amplification in tumors is important for effective care and management of retinoblastoma patients and their families. However, current procedures to thoroughly test RB1 mutations are complicated and lengthy. Herein, we report a next-generation sequencing-based method capable of detecting point mutations, small indels, and large deletions or duplications across the entire RB1 gene and amplification of the MYCN gene on a single platform. The process from DNA extraction to clinical interpretation requires only 3 days, enabling early molecular diagnosis of retinoblastoma and optimal treatment outcomes. This method can also detect low-level mosaic mutations in blood samples that can be missed by routine Sanger sequencing. In addition, it can differentiate between RB1 mutation- and MYCN amplification-driven retinoblastomas. This rapid, comprehensive, and sensitive method for detecting RB1 mutations and MYCN amplification can readily identify RB1 mutation carriers and thus improve the management and genetic counseling for retinoblastoma patients and their families.
Affiliation(s)
- Wenhui L Li
  - Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
- Jonathan Buckley
  - Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
- Pedro A Sanchez-Lara
  - Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California; Department of Pediatrics, USC Roski Eye Institute, University of Southern California, Los Angeles, California
- Dennis T Maglinte
  - Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California
- Lucy Viduetsky
  - Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California
- Tatiana V Tatarinova
  - Department of Pediatrics, USC Roski Eye Institute, University of Southern California, Los Angeles, California; Spatial Sciences Institute, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California
- Jonathan W Kim
  - Vision Center, Children's Hospital Los Angeles, Los Angeles, California; Department of Ophthalmology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
- Margaret Au
  - Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California
- Dejerianne Ostrow
  - Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California
- Thomas C Lee
  - Vision Center, Children's Hospital Los Angeles, Los Angeles, California; Department of Ophthalmology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
- Maurice O'Gorman
  - Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
- Alexander Judkins
  - Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
- David Cobrinik
  - Vision Center, Children's Hospital Los Angeles, Los Angeles, California; Department of Ophthalmology, USC Roski Eye Institute, University of Southern California, Los Angeles, California; Division of Ophthalmology and Department of Surgery, and Saban Research Institute, Children's Hospital Los Angeles, Los Angeles, California; Department of Biochemistry & Molecular Biology, USC Roski Eye Institute, University of Southern California, Los Angeles, California; Norris Comprehensive Cancer Center, USC Keck School of Medicine, University of Southern California, Los Angeles, California
- Timothy J Triche
  - Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
14
Kozlov K, Chebotarev D, Hassan M, Triska M, Triska P, Flegontov P, Tatarinova TV. Differential Evolution approach to detect recent admixture. BMC Genomics 2015; 16 Suppl 8:S9. [PMID: 26111206 PMCID: PMC4480842 DOI: 10.1186/1471-2164-16-s8-s9] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Indexed: 12/17/2022]
Abstract
The genetic structure of human populations is extraordinarily complex and of fundamental importance to studies of anthropology, evolution, and medicine. As increasingly many individuals are of mixed origin, there is an unmet need for tools that can infer multiple origins. Misclassification of such individuals can lead to incorrect and costly misinterpretations of genomic data, primarily in disease studies and drug trials. We present an advanced tool, reAdmix, to infer ancestry that can identify the biogeographic origins of highly mixed individuals. reAdmix can incorporate an individual's knowledge of their ancestors (e.g., having some ancestors from Turkey or a Scottish grandmother). reAdmix is an online tool available at http://chcb.saban-chla.usc.edu/reAdmix/.