Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Tatarinova T, Kryshchenko A, Triska M, Hassan M, Murphy D, Neely M, Schumitzky A. NPEST: a nonparametric method and a database for transcription start site prediction. Quant Biol 2013;1:261-71. [PMID: 25197613 DOI: 10.1007/s40484-013-0022-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]

For:	Tatarinova T, Kryshchenko A, Triska M, Hassan M, Murphy D, Neely M, Schumitzky A. NPEST: a nonparametric method and a database for transcription start site prediction. Quant Biol 2013;1:261-71. [PMID: 25197613 DOI: 10.1007/s40484-013-0022-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]

Number

Cited by Other Article(s)

Barbero-Aparicio JA, Olivares-Gil A, Díez-Pastor JF, García-Osorio C. Deep learning and support vector machines for transcription start site identification. PeerJ Comput Sci 2023;9:e1340. [PMID: 37346545 PMCID: PMC10280436 DOI: 10.7717/peerj-cs.1340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2022] [Accepted: 03/21/2023] [Indexed: 06/23/2023]

Abstract

Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems such as detecting translation initiation sites or promoters, many of the most recent ones based on machine learning. Deep learning methods have been proven to be exceptionally effective for this task, but their use in transcription start site identification has not yet been explored in depth. Also, the very few existing works do not compare their methods to support vector machines (SVMs), the most established technique in this area of study, nor provide the curated dataset used in the study. The reduced amount of published papers in this specific problem could be explained by this lack of datasets. Given that both support vector machines and deep neural networks have been applied in related problems with remarkable results, we compared their performance in transcription start site predictions, concluding that SVMs are computationally much slower, and deep learning methods, specially long short-term memory neural networks (LSTMs), are best suited to work with sequences than SVMs. For such a purpose, we used the reference human genome GRCh38. Additionally, we studied two different aspects related to data processing: the proper way to generate training examples and the imbalanced nature of the data. Furthermore, the generalization performance of the models studied was also tested using the mouse genome, where the LSTM neural network stood out from the rest of the algorithms. To sum up, this article provides an analysis of the best architecture choices in transcription start site identification, as well as a method to generate transcription start site datasets including negative instances on any species available in Ensembl. We found that deep learning methods are better suited than SVMs to solve this problem, being more efficient and better adapted to long sequences and large amounts of data. We also create a transcription start site (TSS) dataset large enough to be used in deep learning experiments.

Collapse

Genome-Wide Prediction of Transcription Start Sites in Conifers. Int J Mol Sci 2022;23:ijms23031735. [PMID: 35163661 PMCID: PMC8836283 DOI: 10.3390/ijms23031735] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2021] [Revised: 01/30/2022] [Accepted: 02/01/2022] [Indexed: 02/04/2023] Open

Abstract

The identification of promoters is an essential step in the genome annotation process, providing a framework for gene regulatory networks and their role in transcription regulation. Despite considerable advances in the high-throughput determination of transcription start sites (TSSs) and transcription factor binding sites (TFBSs), experimental methods are still time-consuming and expensive. Instead, several computational approaches have been developed to provide fast and reliable means for predicting the location of TSSs and regulatory motifs on a genome-wide scale. Numerous studies have been carried out on the regulatory elements of mammalian genomes, but plant promoters, especially in gymnosperms, have been left out of the limelight and, therefore, have been poorly investigated. The aim of this study was to enhance and expand the existing genome annotations using computational approaches for genome-wide prediction of TSSs in the four conifer species: loblolly pine, white spruce, Norway spruce, and Siberian larch. Our pipeline will be useful for TSS predictions in other genomes, especially for draft assemblies, where reliable TSS predictions are not usually available. We also explored some of the features of the nucleotide composition of the predicted promoters and compared the GC properties of conifer genes with model monocot and dicot plants. Here, we demonstrate that even incomplete genome assemblies and partial annotations can be a reliable starting point for TSS annotation. The results of the TSS prediction in four conifer species have been deposited in the Persephone genome browser, which allows smooth visualization and is optimized for large data sets. This work provides the initial basis for future experimental validation and the study of the regulatory regions to understand gene regulation in gymnosperms.

Collapse

Umarov R, Li Y, Arakawa T, Takizawa S, Gao X, Arner E. ReFeaFi: Genome-wide prediction of regulatory elements driving transcription initiation. PLoS Comput Biol 2021;17:e1009376. [PMID: 34491989 PMCID: PMC8448322 DOI: 10.1371/journal.pcbi.1009376] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2021] [Revised: 09/17/2021] [Accepted: 08/23/2021] [Indexed: 11/19/2022] Open

Pachganov S, Murtazalieva K, Zarubin A, Taran T, Chartier D, Tatarinova TV. Prediction of Rice Transcription Start Sites Using TransPrise: A Novel Machine Learning Approach. Methods Mol Biol 2021;2238:261-274. [PMID: 33471337 DOI: 10.1007/978-1-0716-1068-8_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]

Pachganov S, Murtazalieva K, Zarubin A, Sokolov D, Chartier DR, Tatarinova TV. TransPrise: a novel machine learning approach for eukaryotic promoter prediction. PeerJ 2019;7:e7990. [PMID: 31695967 PMCID: PMC6827441 DOI: 10.7717/peerj.7990] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2019] [Accepted: 10/04/2019] [Indexed: 02/01/2023] Open

Abstract

As interest in genetic resequencing increases, so does the need for effective mathematical, computational, and statistical approaches. One of the difficult problems in genome annotation is determination of precise positions of transcription start sites. In this paper we present TransPrise-an efficient deep learning tool for prediction of positions of eukaryotic transcription start sites. Our pipeline consists of two parts: the binary classifier operates the first, and if a sequence is classified as TSS-containing the regression step follows, where the precise location of TSS is being identified. TransPrise offers significant improvement over existing promoter-prediction methods. To illustrate this, we compared predictions of TransPrise classification and regression models with the TSSPlant approach for the well annotated genome of Oryza sativa. Using a computer equipped with a graphics processing unit, the run time of TransPrise is 250 minutes on a genome of 374 Mb long. The Matthews correlation coefficient value for TransPrise is 0.79, more than two times larger than the 0.31 for TSSPlant classification models. This represents a high level of prediction accuracy. Additionally, the mean absolute error for the regression model is 29.19 nt, allowing for accurate prediction of TSS location. TransPrise was also tested in Homo sapiens, where mean absolute error of the regression model was 47.986 nt. We provide the full basis for the comparison and encourage users to freely access a set of our computational tools to facilitate and streamline their own analyses. The ready-to-use Docker image with all necessary packages, models, code as well as the source code of the TransPrise algorithm are available at (http://compubioverne.group/). The source code is ready to use and customizable to predict TSS in any eukaryotic organism.

Collapse

Triska M, Solovyev V, Baranova A, Kel A, Tatarinova TV. Nucleotide patterns aiding in prediction of eukaryotic promoters. PLoS One 2017;12:e0187243. [PMID: 29141011 PMCID: PMC5687710 DOI: 10.1371/journal.pone.0187243] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2017] [Accepted: 09/05/2017] [Indexed: 01/09/2023] Open

Umarov RK, Solovyev VV. Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. PLoS One 2017;12:e0171410. [PMID: 28158264 PMCID: PMC5291440 DOI: 10.1371/journal.pone.0171410] [Citation(s) in RCA: 130] [Impact Index Per Article: 18.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2016] [Accepted: 01/20/2017] [Indexed: 11/18/2022] Open

Abstract

Accurate computational identification of promoters remains a challenge as these key DNA regulatory regions have variable structures composed of functional motifs that provide gene-specific initiation of transcription. In this paper we utilize Convolutional Neural Networks (CNN) to analyze sequence characteristics of prokaryotic and eukaryotic promoters and build their predictive models. We trained a similar CNN architecture on promoters of five distant organisms: human, mouse, plant (Arabidopsis), and two bacteria (Escherichia coli and Bacillus subtilis). We found that CNN trained on sigma70 subclass of Escherichia coli promoter gives an excellent classification of promoters and non-promoter sequences (Sn = 0.90, Sp = 0.96, CC = 0.84). The Bacillus subtilis promoters identification CNN model achieves Sn = 0.91, Sp = 0.95, and CC = 0.86. For human, mouse and Arabidopsis promoters we employed CNNs for identification of two well-known promoter classes (TATA and non-TATA promoters). CNN models nicely recognize these complex functional regions. For human promoters Sn/Sp/CC accuracy of prediction reached 0.95/0.98/0,90 on TATA and 0.90/0.98/0.89 for non-TATA promoter sequences, respectively. For Arabidopsis we observed Sn/Sp/CC 0.95/0.97/0.91 (TATA) and 0.94/0.94/0.86 (non-TATA) promoters. Thus, the developed CNN models, implemented in CNNProm program, demonstrated the ability of deep learning approach to grasp complex promoter sequence characteristics and achieve significantly higher accuracy compared to the previously developed promoter prediction programs. We also propose random substitution procedure to discover positionally conserved promoter functional elements. As the suggested approach does not require knowledge of any specific promoter features, it can be easily extended to identify promoters and other complex functional regions in sequences of many other and especially newly sequenced genomes. The CNNProm program is available to run at web server http://www.softberry.com.

Collapse

Zolotarenko A, Chekalin E, Mehta R, Baranova A, Tatarinova TV, Bruskin S. Identification of Transcriptional Regulators of Psoriasis from RNA-Seq Experiments. Methods Mol Biol 2017;1613:355-370. [PMID: 28849568 DOI: 10.1007/978-1-4939-7027-8_14] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]

Triska M, Ivliev A, Nikolsky Y, Tatarinova TV. Analysis of cis-Regulatory Elements in Gene Co-expression Networks in Cancer. Methods Mol Biol 2017;1613:291-310. [PMID: 28849565 DOI: 10.1007/978-1-4939-7027-8_11] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023]

Abstract

Analysis of gene co-expression networks is a powerful "data-driven" tool, invaluable for understanding cancer biology and mechanisms of tumor development. Yet, despite of completion of thousands of studies on cancer gene expression, there were few attempts to normalize and integrate co-expression data from scattered sources in a concise "meta-analysis" framework. Here we describe an integrated approach to cancer expression meta-analysis, which combines generation of "data-driven" co-expression networks with detailed statistical detection of promoter sequence motifs within the co-expression clusters. First, we applied Weighted Gene Co-Expression Network Analysis (WGCNA) workflow and Pearson's correlation to generate a comprehensive set of over 3000 co-expression clusters in 82 normalized microarray datasets from nine cancers of different origin. Next, we designed a genome-wide statistical approach to the detection of specific DNA sequence motifs based on similarities between the promoters of similarly expressed genes. The approach, realized as cisExpress software module, was specifically designed for analysis of very large data sets such as those generated by publicly accessible whole genome and transcriptome projects. cisExpress uses a task farming algorithm to exploit all available computational cores within a shared memory node.We discovered that although co-expression modules are populated with different sets of genes, they share distinct stable patterns of co-regulation based on promoter sequence analysis. The number of motifs per co-expression cluster varies widely in accordance with cancer tissue of origin, with the largest number in colon (68 motifs) and the lowest in ovary (18 motifs). The top scored motifs are typically shared between several tissues; they define sets of target genes responsible for certain functionality of cancerogenesis. Both the co-expression modules and a database of precalculated motifs are publically available and accessible for further studies.

Collapse

Integrated computational approach to the analysis of RNA-seq data reveals new transcriptional regulators of psoriasis. Exp Mol Med 2016;48:e268. [PMID: 27811935 PMCID: PMC5133374 DOI: 10.1038/emm.2016.97] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2016] [Revised: 05/06/2016] [Accepted: 05/24/2016] [Indexed: 02/07/2023] Open

Tatarinova TV, Chekalin E, Nikolsky Y, Bruskin S, Chebotarov D, McNally KL, Alexandrov N. Nucleotide diversity analysis highlights functionally important genomic regions. Sci Rep 2016;6:35730. [PMID: 27774999 PMCID: PMC5075931 DOI: 10.1038/srep35730] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2016] [Accepted: 09/30/2016] [Indexed: 12/15/2022] Open

Morozova I, Flegontov P, Mikheyev AS, Bruskin S, Asgharian H, Ponomarenko P, Klyuchnikov V, ArunKumar G, Prokhortchouk E, Gankin Y, Rogaev E, Nikolsky Y, Baranova A, Elhaik E, Tatarinova TV. Toward high-resolution population genomics using archaeological samples. DNA Res 2016;23:295-310. [PMID: 27436340 PMCID: PMC4991838 DOI: 10.1093/dnares/dsw029] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2015] [Accepted: 05/22/2016] [Indexed: 12/30/2022] Open

Affiliation(s)

Irina Morozova Institute of Evolutionary Medicine, University of Zurich, Zurich, Switzerland
Pavel Flegontov Department of Biology and Ecology, Faculty of Science, University of Ostrava, Ostrava, Czech Republic Bioinformatics Center, A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russian Federation
Alexander S Mikheyev Ecology and Evolution Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
Sergey Bruskin Vavilov Institute of General Genetics RAS, Moscow, Russia
Hosseinali Asgharian Department of Computational and Molecular Biology, University of Southern California, Los Angeles, CA, USA
Petr Ponomarenko Center for Personalized Medicine, Children's Hospital Los Angeles, Los Angeles, CA, USA Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA
Vladimir Klyuchnikov Donskaya Archeologia, Rostov, Russia
GaneshPrasad ArunKumar School of Chemical and Biotechnology, SASTRA University, Tanjore, India
Egor Prokhortchouk Research Center of Biotechnology RAS, Moscow, Russia Department of Biology, Lomonosov Moscow State University, Russia
Yuriy Gankin EPAM Systems, Newtown, PA, USA
Evgeny Rogaev Vavilov Institute of General Genetics RAS, Moscow, Russia University of Massachusetts Medical School, Worcester, MA, USA
Yuri Nikolsky Vavilov Institute of General Genetics RAS, Moscow, Russia F1 Genomics, San Diego, CA, USA School of Systems Biology, George Mason University, VA, USA
Ancha Baranova School of Systems Biology, George Mason University, VA, USA Research Centre for Medical Genetics, Moscow, Russia Atlas Biomed Group, Moscow, Russia
Eran Elhaik Department of Animal & Plant Sciences, University of Sheffield, Sheffield, South Yorkshire, UK
Tatiana V Tatarinova Bioinformatics Center, A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russian Federation Center for Personalized Medicine, Children's Hospital Los Angeles, Los Angeles, CA, USA Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA

Collapse

Li WL, Buckley J, Sanchez-Lara PA, Maglinte DT, Viduetsky L, Tatarinova TV, Aparicio JG, Kim JW, Au M, Ostrow D, Lee TC, O'Gorman M, Judkins A, Cobrinik D, Triche TJ. A Rapid and Sensitive Next-Generation Sequencing Method to Detect RB1 Mutations Improves Care for Retinoblastoma Patients and Their Families. J Mol Diagn 2016;18:480-93. [PMID: 27155049 DOI: 10.1016/j.jmoldx.2016.02.006] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2015] [Revised: 01/14/2016] [Accepted: 02/01/2016] [Indexed: 01/26/2023] Open

Affiliation(s)

Wenhui L Li Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California.
Jonathan Buckley Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
Pedro A Sanchez-Lara Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California; Department of Pediatrics, USC Roski Eye Institute, University of Southern California, Los Angeles, California
Dennis T Maglinte Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California
Lucy Viduetsky Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California
Tatiana V Tatarinova Department of Pediatrics, USC Roski Eye Institute, University of Southern California, Los Angeles, California; Spatial Sciences Institute, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California
Jennifer G Aparicio Vision Center, Children's Hospital Los Angeles, Los Angeles, California
Jonathan W Kim Vision Center, Children's Hospital Los Angeles, Los Angeles, California; Department of Opthalmology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
Margaret Au Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California
Dejerianne Ostrow Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California
Thomas C Lee Vision Center, Children's Hospital Los Angeles, Los Angeles, California; Department of Opthalmology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
Maurice O'Gorman Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
Alexander Judkins Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California
David Cobrinik Vision Center, Children's Hospital Los Angeles, Los Angeles, California; Department of Opthalmology, USC Roski Eye Institute, University of Southern California, Los Angeles, California; Division of Ophthalmology and Department of Surgery, and Saban Research Institute, Children's Hospital Los Angeles, Los Angeles, California; Department of Biochemistry & Molecular Biology, USC Roski Eye Institute, University of Southern California, Los Angeles, California; Norris Comprehensive Cancer Center, USC Keck School of Medicine, University of Southern California, Los Angeles, California
Timothy J Triche Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California; Department of Pathology, USC Roski Eye Institute, University of Southern California, Los Angeles, California.

Collapse

Kozlov K, Chebotarev D, Hassan M, Triska M, Triska P, Flegontov P, Tatarinova TV. Differential Evolution approach to detect recent admixture. BMC Genomics 2015;16 Suppl 8:S9. [PMID: 26111206 PMCID: PMC4480842 DOI: 10.1186/1471-2164-16-s8-s9] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022] Open