501
Li Y, Shi W, Wasserman WW. Genome-wide prediction of cis-regulatory regions using supervised deep learning methods. BMC Bioinformatics 2018; 19:202. [PMID: 29855387 PMCID: PMC5984344 DOI: 10.1186/s12859-018-2187-1] [Citation(s) in RCA: 57] [Impact Index Per Article: 8.1] [Received: 04/13/2017] [Accepted: 05/04/2018] [Indexed: 01/07/2023] Open
Abstract
Background In the human genome, 98% of DNA sequences are non-protein-coding regions that were previously disregarded as junk DNA. In fact, non-coding regions host a variety of cis-regulatory regions which precisely control the expression of genes. Thus, identifying active cis-regulatory regions in the human genome is critical for understanding gene regulation and assessing the impact of genetic variation on phenotype. Developments in high-throughput sequencing and machine learning technologies make it possible to predict cis-regulatory regions genome wide. Results Based on rich data resources such as the Encyclopedia of DNA Elements (ENCODE) and the Functional Annotation of the Mammalian Genome (FANTOM) projects, we introduce DECRES, which is based on supervised deep learning approaches, for the identification of enhancer and promoter regions in the human genome. Because deep learning methods can discover patterns in large and complex data, their introduction enables a significant advance in our knowledge of the genomic locations of cis-regulatory regions. Using models for well-characterized cell lines, we identify key experimental features that contribute to the predictive performance. Applying DECRES, we delineate locations of 300,000 candidate enhancers genome wide (6.8% of the genome, of which 40,000 are supported by bidirectional transcription data), and 26,000 candidate promoters (0.6% of the genome). Conclusion The predicted annotations of cis-regulatory regions will provide broad utility for genome interpretation from functional genomics to clinical applications. The DECRES model demonstrates the potential of deep learning technologies when combined with high-throughput sequencing data, and inspires the development of other advanced neural network models for further improvement of genome annotations.
Electronic supplementary material The online version of this article (10.1186/s12859-018-2187-1) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Yifeng Li
  - Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, Department of Medical Genetics, University of British Columbia, Rm 3109, 950 West 28th Avenue, Vancouver, V5Z 4H4, Canada
  - Digital Technologies Research Centre, National Research Council Canada, Building M-50, 1200 Montreal Road, Ottawa, K1A 0R6, Canada
- Wenqiang Shi
  - Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, Department of Medical Genetics, University of British Columbia, Rm 3109, 950 West 28th Avenue, Vancouver, V5Z 4H4, Canada
- Wyeth W Wasserman
  - Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, Department of Medical Genetics, University of British Columbia, Rm 3109, 950 West 28th Avenue, Vancouver, V5Z 4H4, Canada
502
Zou LS, Erdos MR, Taylor DL, Chines PS, Varshney A, Parker SCJ, Collins FS, Didion JP. BoostMe accurately predicts DNA methylation values in whole-genome bisulfite sequencing of multiple human tissues. BMC Genomics 2018; 19:390. [PMID: 29792182 PMCID: PMC5966887 DOI: 10.1186/s12864-018-4766-y] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 02/06/2018] [Accepted: 05/08/2018] [Indexed: 01/14/2023] Open
Abstract
Background Bisulfite sequencing is widely employed to study the role of DNA methylation in disease; however, the data suffer from biases due to coverage depth variability. Imputation of methylation values at low-coverage sites may mitigate these biases while also identifying important genomic features associated with predictive power. Results Here we describe BoostMe, a method for imputing low-quality DNA methylation estimates within whole-genome bisulfite sequencing (WGBS) data. BoostMe uses a gradient boosting algorithm, XGBoost, and leverages information from multiple samples for prediction. We find that BoostMe outperforms existing algorithms in speed and accuracy when applied to WGBS of human tissues. Furthermore, we show that imputation improves concordance between WGBS and the MethylationEPIC array at low WGBS depth, suggesting improved WGBS accuracy after imputation. Conclusions Our findings support the use of BoostMe as a preprocessing step for WGBS analysis. Electronic supplementary material The online version of this article (10.1186/s12864-018-4766-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Luli S Zou
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- Michael R Erdos
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- D Leland Taylor
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, UK
- Peter S Chines
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- Arushi Varshney
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI, 48109, USA
- Stephen C J Parker
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI, 48109, USA
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA
- Francis S Collins
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- John P Didion
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
503
Min X, Zeng W, Chen N, Chen T, Jiang R. Chromatin accessibility prediction via convolutional long short-term memory networks with k-mer embedding. Bioinformatics 2018; 33:i92-i101. [PMID: 28881969 PMCID: PMC5870572 DOI: 10.1093/bioinformatics/btx234] [Citation(s) in RCA: 80] [Impact Index Per Article: 11.4] [Indexed: 01/30/2023] Open
Abstract
Motivation Experimental techniques for measuring chromatin accessibility are expensive and time consuming, which motivates the development of computational approaches to predict open chromatin regions from DNA sequences. Along this direction, existing methods fall into two classes: one based on handcrafted k-mer features and the other based on convolutional neural networks. Although both categories have shown good performance in specific applications thus far, a comprehensive framework that integrates useful k-mer co-occurrence information with recent advances in deep learning is still lacking. Results We fill this gap by addressing the problem of chromatin accessibility prediction with a convolutional Long Short-Term Memory (LSTM) network with k-mer embedding. We first split DNA sequences into k-mers and pre-train k-mer embedding vectors based on the co-occurrence matrix of k-mers by using an unsupervised representation learning approach. We then construct a supervised deep learning architecture comprising an embedding layer, three convolutional layers and a Bidirectional LSTM (BLSTM) layer for feature learning and classification. We demonstrate that our method gains high-quality fixed-length features from variable-length sequences and consistently outperforms baseline methods. We show that k-mer embedding can effectively enhance model performance by exploring different embedding strategies. We also demonstrate the efficacy of both the convolution and the BLSTM layers by comparing two variations of the network architecture. We confirm the robustness of our model to hyper-parameters by performing sensitivity analysis. We hope our method can eventually reinforce our understanding of employing deep learning in genomic studies and shed light on research regarding mechanisms of chromatin accessibility. Availability and implementation The source code can be downloaded from https://github.com/minxueric/ismb2017_lstm.
Supplementary information Supplementary materials are available at Bioinformatics online.
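The first step described in the abstract above (splitting a sequence into k-mers and tabulating their co-occurrences) can be sketched as follows. This is only an illustration, not the authors' code: the function names, the values of `k`, `stride` and `window`, and the choice of unordered pairs are assumptions, and the unsupervised embedding step that would consume these counts is not shown.

```python
from collections import Counter


def split_kmers(seq, k=6, stride=1):
    """Return the overlapping k-mers of `seq`, advancing `stride` bases each step."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def cooccurrence(kmers, window=2):
    """Count unordered k-mer pairs that lie at most `window` positions apart."""
    counts = Counter()
    for i, left in enumerate(kmers):
        for right in kmers[i + 1:i + 1 + window]:
            counts[tuple(sorted((left, right)))] += 1
    return counts


# Hypothetical toy input; real inputs would be much longer genomic sequences.
kmers = split_kmers("ACGTAC", k=3)
pairs = cooccurrence(kmers, window=1)
print(kmers)        # ['ACG', 'CGT', 'GTA', 'TAC']
print(dict(pairs))  # each adjacent pair counted once
```

The resulting counts form a sparse co-occurrence matrix keyed by k-mer pair, which is the kind of input an embedding method based on co-occurrence statistics would be trained on.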
Affiliation(s)
- Xu Min
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Tsinghua University, Beijing, China
  - Department of Computer Science and Technology, State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing, China
- Wanwen Zeng
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Tsinghua University, Beijing, China
  - Department of Automation, Tsinghua University, Beijing, China
- Ning Chen
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Tsinghua University, Beijing, China
  - Department of Computer Science and Technology, State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing, China
- Ting Chen
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Tsinghua University, Beijing, China
  - Department of Computer Science and Technology, State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing, China
  - Program in Computational Biology and Bioinformatics, University of Southern California, CA, USA
- Rui Jiang
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Tsinghua University, Beijing, China
  - Department of Automation, Tsinghua University, Beijing, China
504
Fraser K, Bruckner DM, Dordick JS. Advancing Predictive Hepatotoxicity at the Intersection of Experimental, in Silico, and Artificial Intelligence Technologies. Chem Res Toxicol 2018; 31:412-430. [PMID: 29722533 DOI: 10.1021/acs.chemrestox.8b00054] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Indexed: 12/30/2022]
Abstract
Adverse drug reactions, particularly those that result in drug-induced liver injury (DILI), are a major cause of drug failure in clinical trials and drug withdrawals. Hepatotoxicity-mediated drug attrition occurs despite substantial investments of time and money in developing cellular assays, animal models, and computational models to predict its occurrence in humans. Underperformance in predicting hepatotoxicity associated with drugs and drug candidates has been attributed to existing gaps in our understanding of the mechanisms involved in driving hepatic injury after these compounds perfuse and are metabolized by the liver. Herein we assess in vitro, in vivo (animal), and in silico strategies used to develop predictive DILI models. We address the effectiveness of several two- and three-dimensional in vitro cellular methods that are frequently employed in hepatotoxicity screens and how they can be used to predict DILI in humans. We also explore how humanized animal models can recapitulate human drug metabolic profiles and associated liver injury. Finally, we highlight the maturation of computational methods for predicting hepatotoxicity, the untapped potential of artificial intelligence for improving in silico DILI screens, and how knowledge acquired from these predictions can shape the refinement of experimental methods.
Affiliation(s)
- Keith Fraser
  - Department of Chemical and Biological Engineering and Department of Biological Sciences Center for Biotechnology and Interdisciplinary Studies, Rensselaer Polytechnic Institute, Troy, New York 12180, United States
- Dylan M Bruckner
  - Department of Chemical and Biological Engineering and Department of Biological Sciences Center for Biotechnology and Interdisciplinary Studies, Rensselaer Polytechnic Institute, Troy, New York 12180, United States
- Jonathan S Dordick
  - Department of Chemical and Biological Engineering and Department of Biological Sciences Center for Biotechnology and Interdisciplinary Studies, Rensselaer Polytechnic Institute, Troy, New York 12180, United States
505
Amidi A, Amidi S, Vlachakis D, Megalooikonomou V, Paragios N, Zacharaki EI. EnzyNet: enzyme classification using 3D convolutional neural networks on spatial representation. PeerJ 2018; 6:e4750. [PMID: 29740518 PMCID: PMC5937476 DOI: 10.7717/peerj.4750] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Received: 07/25/2017] [Accepted: 04/21/2018] [Indexed: 11/20/2022] Open
Abstract
During the past decade, with the significant progress of computational power as well as ever-rising data availability, deep learning techniques became increasingly popular due to their excellent performance on computer vision problems. The size of the Protein Data Bank (PDB) has increased more than 15-fold since 1999, which enabled the expansion of models that aim at predicting enzymatic function via their amino acid composition. Amino acid sequence, however, is less conserved in nature than protein structure and is therefore considered a less reliable predictor of protein function. This paper presents EnzyNet, a novel 3D convolutional neural network classifier that predicts the Enzyme Commission number of enzymes based only on their voxel-based spatial structure. The spatial distribution of biochemical properties was also examined as complementary information. The two-layer architecture was investigated on a large dataset of 63,558 enzymes from the PDB and achieved an accuracy of 78.4% by exploiting only the binary representation of the protein shape. Code and datasets are available at https://github.com/shervinea/enzynet.
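To make "binary representation of the protein shape" concrete, the sketch below turns a list of 3D coordinates into a cubic occupancy grid. It is a minimal illustration under stated assumptions, not EnzyNet's preprocessing: the function name, the grid size, the isotropic scaling by the largest extent, and the toy coordinates are all hypothetical.

```python
def voxelize(coords, grid=4):
    """Map 3D points to a grid x grid x grid binary occupancy volume.

    Points are shifted to the minimum corner and scaled by the largest
    extent so the shape is not distorted; occupied cells are set to 1.
    """
    mins = [min(c[axis] for c in coords) for axis in range(3)]
    maxs = [max(c[axis] for c in coords) for axis in range(3)]
    span = max(hi - lo for lo, hi in zip(mins, maxs)) or 1.0
    volume = [[[0] * grid for _ in range(grid)] for _ in range(grid)]
    for c in coords:
        # Clamp so points on the far boundary fall into the last cell.
        i, j, k = (min(int((c[axis] - mins[axis]) / span * grid), grid - 1)
                   for axis in range(3))
        volume[i][j][k] = 1
    return volume


# Hypothetical atom coordinates, for illustration only.
points = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (3.0, 0.0, 0.0)]
volume = voxelize(points, grid=4)
occupied = sum(cell for plane in volume for row in plane for cell in row)
print(occupied)  # 3
```

A volume like this is the kind of input a 3D convolutional network operates on; per-voxel biochemical properties could be stored as extra channels instead of a single 0/1 value.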
Affiliation(s)
- Afshine Amidi
  - Massachusetts Institute of Technology, Cambridge, MA, USA
  - Center for Visual Computing, Department of Applied Mathematics, Ecole Centrale de Paris (CentraleSupélec), Châtenay-Malabry, France
- Shervine Amidi
  - Center for Visual Computing, Department of Applied Mathematics, Ecole Centrale de Paris (CentraleSupélec), Châtenay-Malabry, France
- Dimitrios Vlachakis
  - MDAKM Group, Department of Computer Engineering and Informatics, University of Patras, Patras, Greece
- Vasileios Megalooikonomou
  - MDAKM Group, Department of Computer Engineering and Informatics, University of Patras, Patras, Greece
- Nikos Paragios
  - Center for Visual Computing, Department of Applied Mathematics, Ecole Centrale de Paris (CentraleSupélec), Châtenay-Malabry, France
- Evangelia I Zacharaki
  - Center for Visual Computing, Department of Applied Mathematics, Ecole Centrale de Paris (CentraleSupélec), Châtenay-Malabry, France
  - MDAKM Group, Department of Computer Engineering and Informatics, University of Patras, Patras, Greece
506
Abstract
Motivation The discovery of transcription factor binding site (TFBS) motifs is essential for untangling the complex mechanism of genetic variation under different developmental and environmental conditions. Among the large number of computational approaches for de novo identification of TFBS motifs, discriminative motif learning (DML) methods have proven promising for harnessing the discovery power of the large accumulated volume of high-throughput binding data. However, they have to sacrifice accuracy for speed and could fail to fully utilize the information of the input sequences. Results We propose a novel algorithm called CDAUC for optimizing DML-learned motifs based on the area under the receiver-operating characteristic curve (AUC) criterion, which has been widely used in the literature to evaluate the significance of extracted motifs. We show that when the considered AUC loss function is optimized in a coordinate-wise manner, the cost function of each resultant sub-problem is a piece-wise constant function, whose optimal value can be found exactly and efficiently. Further, a key step of each iteration of CDAUC can be efficiently solved as a computational geometry problem. Experimental results on real world high-throughput datasets illustrate that CDAUC outperforms competing methods for refining DML motifs, while being one order of magnitude faster. Meanwhile, preliminary results show that CDAUC may also be useful for improving the interpretability of convolutional kernels generated by the emerging deep learning approaches for predicting TF sequence specificities. Availability and Implementation CDAUC is available at: https://drive.google.com/drive/folders/0BxOW5MtIZbJjNFpCeHlBVWJHeW8. Supplementary information Supplementary data are available at Bioinformatics online.
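The piece-wise constant behaviour mentioned in the abstract above follows from the fact that AUC depends only on how positive and negative scores are ordered. The sketch below is not CDAUC; it computes AUC by pairwise comparison and perturbs a single score directly (an assumption made for illustration, standing in for a change to one motif parameter) to show that the value only changes when an ordering flips.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5, which matches the usual rank-based definition.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores for two bound (positive) and two unbound (negative) sequences.
base = auc([0.9, 0.40], [0.5, 0.1])     # 0.75
nudged = auc([0.9, 0.45], [0.5, 0.1])   # 0.75: no ordering changed
crossed = auc([0.9, 0.60], [0.5, 0.1])  # 1.0: the score passed the 0.5 negative
print(base, nudged, crossed)
```

Because the objective is flat between such crossing points, a coordinate-wise search only needs to evaluate it once per interval, which is consistent with the abstract's statement that each sub-problem can be solved exactly.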
Affiliation(s)
- Lin Zhu
  - Institute of Machine Learning and Systems Biology, Department of College of Electronics and Information Engineering, Tongji University, Shanghai, China
- Hong-Bo Zhang
  - Institute of Machine Learning and Systems Biology, Department of College of Electronics and Information Engineering, Tongji University, Shanghai, China
- De-Shuang Huang
  - Institute of Machine Learning and Systems Biology, Department of College of Electronics and Information Engineering, Tongji University, Shanghai, China
507
Koh PW, Pierson E, Kundaje A. Denoising genome-wide histone ChIP-seq with convolutional neural networks. Bioinformatics 2018; 33:i225-i233. [PMID: 28881977 PMCID: PMC5870713 DOI: 10.1093/bioinformatics/btx243] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Indexed: 01/09/2023] Open
Abstract
Motivation Chromatin immunoprecipitation sequencing (ChIP-seq) experiments are commonly used to obtain genome-wide profiles of histone modifications associated with different types of functional genomic elements. However, the quality of histone ChIP-seq data is affected by many experimental parameters such as the amount of input DNA, antibody specificity, ChIP enrichment and sequencing depth. Making accurate inferences from chromatin profiling experiments that involve diverse experimental parameters is challenging. Results We introduce a convolutional denoising algorithm, Coda, that uses convolutional neural networks to learn a mapping from suboptimal to high-quality histone ChIP-seq data. This overcomes various sources of noise and variability, substantially enhancing and recovering signal when applied to low-quality chromatin profiling datasets across individuals, cell types and species. Our method has the potential to improve data quality at reduced costs. More broadly, this approach (using a high-dimensional discriminative model to encode a generative noise process) is generally applicable to other biological domains where it is easy to generate noisy data but difficult to analytically characterize the noise or underlying data distribution. Availability and implementation https://github.com/kundajelab/coda. Contact akundaje@stanford.edu.
Affiliation(s)
- Pang Wei Koh
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Department of Genetics, Stanford University, Stanford, CA, USA
- Emma Pierson
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Anshul Kundaje
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Department of Genetics, Stanford University, Stanford, CA, USA
508
Kelley DR, Reshef YA, Bileschi M, Belanger D, McLean CY, Snoek J. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res 2018; 28:739-750. [PMID: 29588361 PMCID: PMC5932613 DOI: 10.1101/gr.227819.117] [Citation(s) in RCA: 280] [Impact Index Per Article: 40.0] [Received: 07/17/2017] [Accepted: 03/23/2018] [Indexed: 01/10/2023]
Abstract
Models for predicting phenotypic outcomes from genotypes have important applications to understanding genomic function and improving human health. Here, we develop a machine-learning system to predict cell-type-specific epigenetic and transcriptional profiles in large mammalian genomes from DNA sequence alone. By use of convolutional neural networks, this system identifies promoters and distal regulatory elements and synthesizes their content to make effective gene expression predictions. We show that model predictions for the influence of genomic variants on gene expression align well to causal variants underlying eQTLs in human populations and can be useful for generating mechanistic hypotheses to enable fine mapping of disease loci.
Affiliation(s)
- Yakir A Reshef
  - Department of Computer Science, Harvard University, Cambridge, Massachusetts 02138, USA
- Jasper Snoek
  - Google Brain, Cambridge, Massachusetts 02142, USA
509
Diao JA, Kohane IS, Manrai AK. Biomedical informatics and machine learning for clinical genomics. Hum Mol Genet 2018; 27:R29-R34. [PMID: 29566172 PMCID: PMC5946905 DOI: 10.1093/hmg/ddy088] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 02/21/2018] [Revised: 03/08/2018] [Accepted: 03/08/2018] [Indexed: 12/22/2022] Open
Abstract
While tens of thousands of pathogenic variants are used to inform the many clinical applications of genomics, there remains limited information on quantitative disease risk for the majority of variants used in clinical practice. At the same time, rising demand for genetic counselling has prompted a growing need for computational approaches that can help interpret genetic variation. Such tasks include predicting variant pathogenicity and identifying variants that are too common to be penetrant. To address these challenges, researchers are increasingly turning to integrative informatics approaches. These approaches often leverage vast sources of data, including electronic health records and population-level allele frequency databases (e.g. gnomAD), as well as machine learning techniques such as support vector machines and deep learning. In this review, we highlight recent informatics and machine learning approaches that are improving our understanding of pathogenic variation and discuss obstacles that may limit their emerging role in clinical genomics.
Affiliation(s)
- James A Diao
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
  - Department of Statistics and Data Science, Yale University, New Haven, CT 06520, USA
- Isaac S Kohane
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
- Arjun K Manrai
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
510
Pei L, Zheng Y, Zou S, Li Z. Dynamics of four-neuron recurrent inhibitory loop with state-dependent time delays. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2018.02.062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
511
Kalinin AA, Higgins GA, Reamaroon N, Soroushmehr S, Allyn-Feuer A, Dinov ID, Najarian K, Athey BD. Deep learning in pharmacogenomics: from gene regulation to patient stratification. Pharmacogenomics 2018; 19:629-650. [PMID: 29697304 PMCID: PMC6022084 DOI: 10.2217/pgs-2018-0008] [Citation(s) in RCA: 74] [Impact Index Per Article: 10.6] [Received: 01/24/2018] [Accepted: 03/09/2018] [Indexed: 01/02/2023] Open
Abstract
This Perspective provides examples of current and future applications of deep learning in pharmacogenomics, including: identification of novel regulatory variants located in noncoding domains of the genome and their function as applied to pharmacoepigenomics; patient stratification from medical records; and the mechanistic prediction of drug response, targets and their interactions. Deep learning encapsulates a family of machine learning algorithms that has transformed many important subfields of artificial intelligence over the last decade, and has demonstrated breakthrough performance improvements on a wide range of tasks in biomedicine. We anticipate that in the future, deep learning will be widely used to predict personalized drug response and optimize medication selection and dosing, using knowledge extracted from large and complex molecular, epidemiological, clinical and demographic datasets.
Affiliation(s)
- Alexandr A Kalinin
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
  - Statistics Online Computational Resource (SOCR), University of Michigan School of Nursing, Ann Arbor, MI 48109, USA
- Gerald A Higgins
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Narathip Reamaroon
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Sayedmohammadreza Soroushmehr
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Ari Allyn-Feuer
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Ivo D Dinov
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
  - Statistics Online Computational Resource (SOCR), University of Michigan School of Nursing, Ann Arbor, MI 48109, USA
  - Michigan Institute for Data Science (MIDAS), University of Michigan, Ann Arbor, MI 48109, USA
- Kayvan Najarian
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
  - Department of Emergency Medicine, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Brian D Athey
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
  - Michigan Institute for Data Science (MIDAS), University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Internal Medicine, University of Michigan Health System, Ann Arbor, MI 48109, USA
  - Department of Psychiatry, University of Michigan Medical School, Ann Arbor, MI 48109, USA
512
Abstract
Next Generation Sequencing (NGS) or deep sequencing technology enables parallel reading of multiple individual DNA fragments, thereby enabling the identification of millions of base pairs in several hours. Recent research has clearly shown that machine learning technologies can efficiently analyse large sets of genomic data and help to identify novel gene functions and regulatory regions. A deep artificial neural network consists of a group of artificial neurons that mimic the properties of living neurons. These mathematical models, termed Artificial Neural Networks (ANNs), can be used to solve artificial intelligence engineering problems in several different technological fields (e.g., biology, genomics, proteomics, and metabolomics). In practical terms, neural networks are non-linear statistical structures that are organized as modelling tools and are used to simulate complex genomic relationships between inputs and outputs. To date, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been demonstrated to be the best tools for improving performance in problem solving tasks within the genomic field.
513
Avsec Ž, Barekatain M, Cheng J, Gagneur J. Modeling positional effects of regulatory sequences with spline transformations increases prediction accuracy of deep neural networks. Bioinformatics 2018; 34:1261-1269. [PMID: 29155928 PMCID: PMC5905632 DOI: 10.1093/bioinformatics/btx727] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 07/18/2017] [Revised: 10/16/2017] [Accepted: 11/15/2017] [Indexed: 12/01/2022] Open
Abstract
Motivation Regulatory sequences are not solely defined by their nucleic acid sequence but also by their relative distances to genomic landmarks such as transcription start site, exon boundaries or polyadenylation site. Deep learning has become the approach of choice for modeling regulatory sequences because of its ability to learn complex sequence features. However, modeling relative distances to genomic landmarks in deep neural networks has not been addressed. Results Here we developed spline transformation, a neural network module based on splines to flexibly and robustly model distances. Modeling distances to various genomic landmarks with spline transformations significantly increased state-of-the-art prediction accuracy of in vivo RNA-binding protein binding sites for 120 out of 123 proteins. We also developed a deep neural network for human splice branchpoint prediction based on spline transformations that outperformed the current best, already distance-based, machine learning model. Compared to piecewise linear transformation, as obtained by composition of rectified linear units, spline transformation yields higher prediction accuracy as well as faster and more robust training. As spline transformation can be applied to further quantities beyond distances, such as methylation or conservation, we foresee it as a versatile component in the genomics deep learning toolbox. Availability and implementation Spline transformation is implemented as a Keras layer in the CONCISE python package: https://github.com/gagneurlab/concise. Analysis code is available at https://github.com/gagneurlab/Manuscript_Avsec_Bioinformatics_2017. Contact avsec@in.tum.de or gagneur@in.tum.de. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Žiga Avsec
  - Department of Informatics, Technical University of Munich, Garching, Germany
  - Graduate School of Quantitative Biosciences (QBM), Gene Center, Ludwig-Maximilians-Universität München, Munich, Germany
- Jun Cheng
  - Department of Informatics, Technical University of Munich, Garching, Germany
  - Graduate School of Quantitative Biosciences (QBM), Gene Center, Ludwig-Maximilians-Universität München, Munich, Germany
- Julien Gagneur
  - Department of Informatics, Technical University of Munich, Garching, Germany
514
Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow PM, Zietz M, Hoffman MM, Xie W, Rosen GL, Lengerich BJ, Israeli J, Lanchantin J, Woloszynek S, Carpenter AE, Shrikumar A, Xu J, Cofer EM, Lavender CA, Turaga SC, Alexandari AM, Lu Z, Harris DJ, DeCaprio D, Qi Y, Kundaje A, Peng Y, Wiley LK, Segler MHS, Boca SM, Swamidass SJ, Huang A, Gitter A, Greene CS. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 2018; 15:20170387. [PMID: 29618526 PMCID: PMC5938574 DOI: 10.1098/rsif.2017.0387] [Citation(s) in RCA: 877] [Impact Index Per Article: 125.3] [Received: 05/26/2017] [Accepted: 03/07/2018] [Indexed: 11/12/2022] Open
Abstract
Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes and treatment of patients) and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.
Affiliation(s)
- Travers Ching
- Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, HI, USA
| | - Daniel S Himmelstein
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Brett K Beaulieu-Jones
- Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Alexandr A Kalinin
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, USA
| | - Gregory P Way
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Enrico Ferrero
- Computational Biology and Stats, Target Sciences, GlaxoSmithKline, Stevenage, UK
| | - Michael Zietz
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Michael M Hoffman
- Princess Margaret Cancer Centre, Toronto, Ontario, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
| | - Wei Xie
- Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN, USA
| | - Gail L Rosen
- Ecological and Evolutionary Signal-processing and Informatics Laboratory, Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
| | - Benjamin J Lengerich
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Johnny Israeli
- Biophysics Program, Stanford University, Stanford, CA, USA
| | - Jack Lanchantin
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
| | - Stephen Woloszynek
- Ecological and Evolutionary Signal-processing and Informatics Laboratory, Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
| | - Anne E Carpenter
- Imaging Platform, Broad Institute of Harvard and MIT, Cambridge, MA, USA
| | - Avanti Shrikumar
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Jinbo Xu
- Toyota Technological Institute at Chicago, Chicago, IL, USA
| | - Evan M Cofer
- Department of Computer Science, Trinity University, San Antonio, TX, USA
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
| | - Christopher A Lavender
- Integrative Bioinformatics, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC, USA
| | - Srinivas C Turaga
- Howard Hughes Medical Institute, Janelia Research Campus, Ashburn, VA, USA
| | - Amr M Alexandari
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information and National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - David J Harris
- Department of Wildlife Ecology and Conservation, University of Florida, Gainesville, FL, USA
| | - Yanjun Qi
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
| | - Anshul Kundaje
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Yifan Peng
- National Center for Biotechnology Information and National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Laura K Wiley
- Division of Biomedical Informatics and Personalized Medicine, University of Colorado School of Medicine, Aurora, CO, USA
| | - Marwin H S Segler
- Institute of Organic Chemistry, Westfälische Wilhelms-Universität Münster, Münster, Germany
| | - Simina M Boca
- Innovation Center for Biomedical Informatics, Georgetown University Medical Center, Washington, DC, USA
| | - S Joshua Swamidass
- Department of Pathology and Immunology, Washington University in Saint Louis, St Louis, MO, USA
| | - Austin Huang
- Department of Medicine, Brown University, Providence, RI, USA
| | - Anthony Gitter
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Morgridge Institute for Research, Madison, WI, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| |
|
515
|
Kim HK, Min S, Song M, Jung S, Choi JW, Kim Y, Lee S, Yoon S, Kim HH. Deep learning improves prediction of CRISPR-Cpf1 guide RNA activity. Nat Biotechnol 2018; 36:239-241. [PMID: 29431740 DOI: 10.1038/nbt.4061] [Citation(s) in RCA: 210] [Impact Index Per Article: 30.0] [Received: 05/24/2017] [Accepted: 12/08/2017] [Indexed: 12/26/2022]
Abstract
We present two algorithms to predict the activity of AsCpf1 guide RNAs. Indel frequencies for 15,000 target sequences were used in a deep-learning framework based on a convolutional neural network to train Seq-deepCpf1. We then incorporated chromatin accessibility information to create the better-performing DeepCpf1 algorithm for cell lines for which such information is available and show that both algorithms outperform previous machine learning algorithms on our own and published data sets.
Affiliation(s)
- Hui Kwon Kim
- Department of Pharmacology, Yonsei University College of Medicine, Seoul, Republic of Korea
- Brain Korea 21 Plus Project for Medical Sciences, Yonsei University College of Medicine, Seoul, Republic of Korea
| | - Seonwoo Min
- Electrical and Computer Engineering, Seoul National University, Seoul, Republic of Korea
| | - Myungjae Song
- Department of Pharmacology, Yonsei University College of Medicine, Seoul, Republic of Korea
- Graduate School of Biomedical Science and Engineering, Hanyang University, Seoul, Republic of Korea
| | - Soobin Jung
- Department of Pharmacology, Yonsei University College of Medicine, Seoul, Republic of Korea
- Brain Korea 21 Plus Project for Medical Sciences, Yonsei University College of Medicine, Seoul, Republic of Korea
| | - Jae Woo Choi
- Department of Pharmacology, Yonsei University College of Medicine, Seoul, Republic of Korea
- Severance Biomedical Science Institute, Yonsei University College of Medicine, Seoul, Republic of Korea
| | - Younggwang Kim
- Department of Pharmacology, Yonsei University College of Medicine, Seoul, Republic of Korea
- Brain Korea 21 Plus Project for Medical Sciences, Yonsei University College of Medicine, Seoul, Republic of Korea
| | - Sangeun Lee
- Department of Pharmacology, Yonsei University College of Medicine, Seoul, Republic of Korea
- Brain Korea 21 Plus Project for Medical Sciences, Yonsei University College of Medicine, Seoul, Republic of Korea
| | - Sungroh Yoon
- Electrical and Computer Engineering, Seoul National University, Seoul, Republic of Korea
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
| | - Hyongbum Henry Kim
- Department of Pharmacology, Yonsei University College of Medicine, Seoul, Republic of Korea
- Brain Korea 21 Plus Project for Medical Sciences, Yonsei University College of Medicine, Seoul, Republic of Korea
- Severance Biomedical Science Institute, Yonsei University College of Medicine, Seoul, Republic of Korea
- Center for Nanomedicine, Institute for Basic Science (IBS), Seoul, Republic of Korea
- Yonsei-IBS Institute, Yonsei University, Seoul, Republic of Korea
| |
|
516
|
Zhang Y, An L, Xu J, Zhang B, Zheng WJ, Hu M, Tang J, Yue F. Enhancing Hi-C data resolution with deep convolutional neural network HiCPlus. Nat Commun 2018; 9:750. [PMID: 29467363 PMCID: PMC5821732 DOI: 10.1038/s41467-018-03113-2] [Citation(s) in RCA: 96] [Impact Index Per Article: 13.7] [Received: 05/07/2017] [Accepted: 01/19/2018] [Indexed: 12/31/2022] Open
Abstract
Although Hi-C technology is one of the most popular tools for studying 3D genome organization, due to sequencing cost, the resolution of most Hi-C datasets is coarse and cannot be used to link distal regulatory elements to their target genes. Here we develop HiCPlus, a computational approach based on a deep convolutional neural network, to infer high-resolution Hi-C interaction matrices from low-resolution Hi-C data. We demonstrate that HiCPlus can impute interaction matrices highly similar to the original ones, while only using 1/16 of the original sequencing reads. We show that the models learned from one cell type can be applied to make predictions in other cell or tissue types. Our work not only provides a computational framework to enhance Hi-C data resolution but also reveals features underlying the formation of 3D chromatin interactions. Despite its popularity for measuring the spatial organization of mammalian genomes, the resolution of most Hi-C datasets is coarse due to sequencing cost. Here, Zhang et al. develop HiCPlus, a computational approach based on a deep convolutional neural network, to infer high-resolution Hi-C interaction matrices from low-resolution Hi-C data.
Affiliation(s)
- Yan Zhang
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, 29208, USA
| | - Lin An
- Bioinformatics and Genomics Program, Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, 16802, USA
| | - Jie Xu
- Department of Biochemistry and Molecular Biology, College of Medicine, The Pennsylvania State University, Hershey, PA, 17033, USA
| | - Bo Zhang
- Bioinformatics and Genomics Program, Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, 16802, USA
| | - W Jim Zheng
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
| | - Ming Hu
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic Foundation, Cleveland, OH, 44195, USA
| | - Jijun Tang
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, 29208, USA
- School of Computer Science and Technology, Tianjin University, 300072, Tianjin, China
- Tianjin University Institute of Computational Biology, Tianjin University, 300072, Tianjin, China
| | - Feng Yue
- Bioinformatics and Genomics Program, Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, 16802, USA
- Department of Biochemistry and Molecular Biology, College of Medicine, The Pennsylvania State University, Hershey, PA, 17033, USA
| |
|
517
|
Cao C, Liu F, Tan H, Song D, Shu W, Li W, Zhou Y, Bo X, Xie Z. Deep Learning and Its Applications in Biomedicine. Genomics, Proteomics & Bioinformatics 2018; 16:17-32. [PMID: 29522900 PMCID: PMC6000200 DOI: 10.1016/j.gpb.2017.07.003] [Citation(s) in RCA: 253] [Impact Index Per Article: 36.1] [Received: 06/18/2017] [Revised: 06/18/2017] [Accepted: 07/05/2017] [Indexed: 12/19/2022]
Abstract
Advances in biological and medical technologies have provided us with explosive volumes of biological and physiological data, such as medical images, electroencephalography, genomic and protein sequences. Learning from these data facilitates the understanding of human health and disease. Developed from artificial neural networks, deep learning-based algorithms show great promise in extracting features and learning patterns from complex data. The aim of this paper is to provide an overview of deep learning techniques and some of the state-of-the-art applications in the biomedical field. We first introduce the development of artificial neural networks and deep learning. We then describe two main components of deep learning, i.e., deep learning architectures and model optimization. Subsequently, some examples of deep learning applications are presented, including medical image classification, genomic sequence analysis, as well as protein structure classification and prediction. Finally, we offer our perspectives on future directions in the field of deep learning.
Affiliation(s)
- Chensi Cao
- CapitalBio Corporation, Beijing 102206, China
| | - Feng Liu
- Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing 100850, China
| | - Hai Tan
- State Key Lab of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 500040, China
| | - Deshou Song
- State Key Lab of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 500040, China
| | - Wenjie Shu
- Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing 100850, China
| | - Weizhong Li
- Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou 500040, China
| | - Yiming Zhou
- CapitalBio Corporation, Beijing 102206, China; Department of Biomedical Engineering, Medical Systems Biology Research Center, Tsinghua University School of Medicine, Beijing 100084, China.
| | - Xiaochen Bo
- Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing 100850, China.
| | - Zhi Xie
- State Key Lab of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 500040, China.
| |
|
518
|
Alakwaa F, Chaudhary K, Garmire LX. Deep Learning Accurately Predicts Estrogen Receptor Status in Breast Cancer Metabolomics Data. J Proteome Res 2018; 17:337-347. [PMID: 29110491 PMCID: PMC5759031 DOI: 10.1021/acs.jproteome.7b00595] [Citation(s) in RCA: 134] [Impact Index Per Article: 19.1] [Received: 08/22/2017] [Indexed: 12/17/2022]
Abstract
Metabolomics holds promise as a new technology for diagnosing highly heterogeneous diseases. Conventionally, metabolomics data analysis for diagnosis is done using various statistical and machine learning based classification methods. However, it remains unknown whether deep neural networks, a class of increasingly popular machine learning methods, are suitable for classifying metabolomics data. Here we use a cohort of 271 breast cancer tissues, 204 estrogen receptor positive (ER+) and 67 estrogen receptor negative (ER-), to test the accuracies of feed-forward networks, a deep learning (DL) framework, as well as six widely used machine learning models, namely random forest (RF), support vector machines (SVM), recursive partitioning and regression trees (RPART), linear discriminant analysis (LDA), prediction analysis for microarrays (PAM), and generalized boosted models (GBM). The DL framework has the highest area under the curve (AUC), 0.93, in classifying ER+/ER- patients, compared to the other six machine learning algorithms. Furthermore, the biological interpretation of the first hidden layer reveals eight commonly enriched significant metabolomics pathways (adjusted P-value <0.05) that cannot be discovered by other machine learning methods. Among them, the protein digestion and absorption and ATP-binding cassette (ABC) transporters pathways are also confirmed in integrated analysis between metabolomics and gene expression data in these samples. In summary, the deep learning method shows advantages for metabolomics-based breast cancer ER status classification, with both the highest prediction accuracy (AUC = 0.93) and better revelation of disease biology. We encourage the adoption of feed-forward network-based deep learning methods in the metabolomics research community for classification.
Affiliation(s)
- Fadhl M. Alakwaa
- Epidemiology Program, University of Hawaii Cancer Center, Honolulu, Hawaii 96813, United States
| | - Kumardeep Chaudhary
- Epidemiology Program, University of Hawaii Cancer Center, Honolulu, Hawaii 96813, United States
| | - Lana X. Garmire
- Epidemiology Program, University of Hawaii Cancer Center, Honolulu, Hawaii 96813, United States
- Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, Hawaii 96822, United States
| |
|
519
|
Celesti F, Celesti A, Wan J, Villari M. Why Deep Learning Is Changing the Way to Approach NGS Data Processing: A Review. IEEE Rev Biomed Eng 2018; 11:68-76. [DOI: 10.1109/rbme.2018.2825987] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Indexed: 11/05/2022]
|
520
|
Kim M, Tagkopoulos I. Data integration and predictive modeling methods for multi-omics datasets. Mol Omics 2018; 14:8-25. [DOI: 10.1039/c7mo00051k] [Citation(s) in RCA: 56] [Impact Index Per Article: 8.0] [Indexed: 12/24/2022]
Abstract
We provide an overview of opportunities and challenges in multi-omics predictive analytics with particular emphasis on data integration and machine learning methods.
Affiliation(s)
- Minseung Kim
- Department of Computer Science, University of California, Davis, USA
- Genome Center
| | - Ilias Tagkopoulos
- Department of Computer Science, University of California, Davis, USA
- Genome Center
| |
|
521
|
Lee PH, Lee C, Li X, Wee B, Dwivedi T, Daly M. Principles and methods of in-silico prioritization of non-coding regulatory variants. Hum Genet 2018; 137:15-30. [PMID: 29288389 PMCID: PMC5892192 DOI: 10.1007/s00439-017-1861-0] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Received: 10/24/2017] [Accepted: 12/14/2017] [Indexed: 12/13/2022]
Abstract
Over a decade of genome-wide association studies has made great strides toward the detection of genes and genetic mechanisms underlying complex traits. However, the majority of associated loci reside in non-coding regions that are, in general, functionally uncharacterized. Now, the availability of large-scale tissue and cell type-specific transcriptome and epigenome data enables us to elucidate how non-coding genetic variants can affect gene expression and are associated with phenotypic changes. Here, we provide an overview of this emerging field in human genomics, summarizing available data resources and state-of-the-art analytic methods to facilitate in-silico prioritization of non-coding regulatory mutations. We also highlight the limitations of current approaches and discuss the direction of much-needed future research.
Affiliation(s)
- Phil H Lee
- Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Simches Research Building, 185 Cambridge St, Boston, MA, 02114, USA.
- Quantitative Genomics Program, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
| | - Christian Lee
- Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Simches Research Building, 185 Cambridge St, Boston, MA, 02114, USA
- Department of Life Sciences, Harvard University, Cambridge, MA, USA
| | - Xihao Li
- Quantitative Genomics Program, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Brian Wee
- Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Simches Research Building, 185 Cambridge St, Boston, MA, 02114, USA
| | - Tushar Dwivedi
- Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Simches Research Building, 185 Cambridge St, Boston, MA, 02114, USA
- John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA
| | - Mark Daly
- Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Simches Research Building, 185 Cambridge St, Boston, MA, 02114, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
| |
|
522
|
Ye F. Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data. PLoS One 2017; 12:e0188746. [PMID: 29236718 PMCID: PMC5728507 DOI: 10.1371/journal.pone.0188746] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 04/02/2017] [Accepted: 10/02/2017] [Indexed: 01/02/2023] Open
Abstract
In this paper, we propose a new automatic hyperparameter selection approach for determining the optimal network configuration (network structure and hyperparameters) for deep neural networks using particle swarm optimization (PSO) in combination with a steepest gradient descent algorithm. In the proposed approach, network configurations were coded as a set of real-number m-dimensional vectors as the individuals of the PSO algorithm in the search procedure. During the search procedure, the PSO algorithm is employed to search for optimal network configurations via the particles moving in a finite search space, and the steepest gradient descent algorithm is used to train the DNN classifier with a few training epochs (to find a local optimal solution) during the population evaluation of PSO. After the optimization scheme, the steepest gradient descent algorithm is performed with more epochs and the final solutions (pbest and gbest) of the PSO algorithm to train a final ensemble model and individual DNN classifiers, respectively. The local search ability of the steepest gradient descent algorithm and the global search capabilities of the PSO algorithm are exploited to determine an optimal solution that is close to the global optimum. We conducted several experiments on hand-written characters and biological activity prediction datasets to show that the DNN classifiers trained by the network configurations expressed by the final solutions of the PSO algorithm, employed to construct an ensemble model and individual classifier, outperform the random approach in terms of the generalization performance. Therefore, the proposed approach can be regarded as an alternative tool for automatic network structure and parameter selection for deep neural networks.
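The search loop described in this abstract can be sketched roughly as follows. This is a generic, minimal particle swarm optimizer in pure Python, not the author's implementation: the function `pso_minimize`, its coefficient values, and the objective `toy_validation_loss` are assumptions made for illustration. In the paper's setting the objective would be the validation error of a DNN trained for a few epochs with the configuration encoded by the particle; here a cheap analytic function stands in for it.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=200,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Minimize objective over a box using a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # best position seen per particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (pbest[i][d] - pos[i][d])
                             + c_global * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Keep particles inside the finite search space.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_validation_loss(config):
    """Hypothetical stand-in for 'train briefly, return validation error'."""
    log_lr, hidden_units = config
    return (log_lr + 3.0) ** 2 + ((hidden_units - 64.0) / 32.0) ** 2


# Hypothetical search space: log10 learning rate and hidden-layer width.
SEARCH_SPACE = [(-6.0, 0.0), (8.0, 256.0)]

if __name__ == "__main__":
    best_config, best_loss = pso_minimize(toy_validation_loss, SEARCH_SPACE)
    print(best_config, best_loss)
```

Encoding each configuration as a real-valued vector, as the abstract describes, is what lets the same update rule handle both structural choices (such as layer width, rounded before use) and continuous hyperparameters.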
Affiliation(s)
- Fei Ye
- School of information science and technology, Southwest Jiaotong University, ChengDu, China
| |
|
523
|
Banovich NE, Li YI, Raj A, Ward MC, Greenside P, Calderon D, Tung PY, Burnett JE, Myrthil M, Thomas SM, Burrows CK, Romero IG, Pavlovic BJ, Kundaje A, Pritchard JK, Gilad Y. Impact of regulatory variation across human iPSCs and differentiated cells. Genome Res 2017; 28:122-131. [PMID: 29208628 PMCID: PMC5749177 DOI: 10.1101/gr.224436.117] [Citation(s) in RCA: 78] [Impact Index Per Article: 9.8] [Received: 04/28/2017] [Accepted: 11/20/2017] [Indexed: 12/17/2022]
Abstract
Induced pluripotent stem cells (iPSCs) are an essential tool for studying cellular differentiation and cell types that are otherwise difficult to access. We investigated the use of iPSCs and iPSC-derived cells to study the impact of genetic variation on gene regulation across different cell types and as models for studies of complex disease. To do so, we established a panel of iPSCs from 58 well-studied Yoruba lymphoblastoid cell lines (LCLs); 14 of these lines were further differentiated into cardiomyocytes. We characterized regulatory variation across individuals and cell types by measuring gene expression levels, chromatin accessibility, and DNA methylation. Our analysis focused on a comparison of inter-individual regulatory variation across cell types. While most cell-type-specific regulatory quantitative trait loci (QTLs) lie in chromatin that is open only in the affected cell types, we found that 20% of cell-type-specific regulatory QTLs are in shared open chromatin. This observation motivated us to develop a deep neural network to predict open chromatin regions from DNA sequence alone. Using this approach, we were able to use the sequences of segregating haplotypes to predict the effects of common SNPs on cell-type-specific chromatin accessibility.
Affiliation(s)
- Nicholas E Banovich
- Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA
| | - Yang I Li
- Department of Genetics, Stanford University, Stanford, California 94305, USA
| | - Anil Raj
- Department of Genetics, Stanford University, Stanford, California 94305, USA
| | - Michelle C Ward
- Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA
- Department of Medicine, University of Chicago, Chicago, Illinois 60637, USA
| | - Peyton Greenside
- Department of Biomedical Informatics, Stanford University, Stanford, California 94305, USA
| | - Diego Calderon
- Department of Biomedical Informatics, Stanford University, Stanford, California 94305, USA
| | - Po Yuan Tung
- Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA
- Department of Medicine, University of Chicago, Chicago, Illinois 60637, USA
| | - Jonathan E Burnett
- Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA
| | - Marsha Myrthil
- Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA
| | - Samantha M Thomas
- Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA
| | - Courtney K Burrows
- Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA
| | - Irene Gallego Romero
- Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA
| | - Bryan J Pavlovic
- Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA
| | - Anshul Kundaje
- Department of Genetics, Stanford University, Stanford, California 94305, USA
| | - Jonathan K Pritchard
- Department of Genetics, Stanford University, Stanford, California 94305, USA
- Department of Biology, Stanford University, Stanford, California 94305, USA
- Howard Hughes Medical Institute, Stanford University, Stanford, California 94305, USA
| | - Yoav Gilad
- Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA
- Department of Medicine, University of Chicago, Chicago, Illinois 60637, USA
| |
|
524
|
Singh R, Lanchantin J, Sekhon A, Qi Y. Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin. Advances in Neural Information Processing Systems 2017; 30:6785-6795. [PMID: 30147283 PMCID: PMC6105294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
The past decade has seen a revolution in genomic technologies that enabled a flood of genome-wide profiling of chromatin marks. Recent literature tried to understand gene regulation by predicting gene expression from large-scale chromatin measurements. Two fundamental challenges exist for such learning tasks: (1) genome-wide chromatin signals are spatially structured, high-dimensional and highly modular; and (2) the core aim is to understand what the relevant factors are and how they work together. Previous studies either failed to model complex dependencies among input signals or relied on separate feature analysis to explain the decisions. This paper presents an attention-based deep learning approach, AttentiveChrome, that uses a unified architecture to model and to interpret dependencies among chromatin factors for controlling gene regulation. AttentiveChrome uses a hierarchy of multiple Long Short-Term Memory (LSTM) modules to encode the input signals and to model how various chromatin marks cooperate automatically. AttentiveChrome trains two levels of attention jointly with the target prediction, enabling it to attend differentially to relevant marks and to locate important positions per mark. We evaluate the model across 56 different cell types (tasks) in humans. Not only is the proposed architecture more accurate, but its attention scores provide a better interpretation than state-of-the-art feature visualization methods such as saliency maps.
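The weighting step behind the attention scores mentioned here can be sketched as follows. This is only the generic soft-attention computation (softmax over relevance scores, then a weighted sum), not the AttentiveChrome architecture: the LSTM encoders and the two attention levels are omitted, and the function names, mark labels and numbers are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(vectors, scores):
    """Return attention weights and the weighted average of the input vectors."""
    weights = softmax(scores)
    dim = len(vectors[0])
    context = [sum(w * v[d] for w, v in zip(weights, vectors))
               for d in range(dim)]
    return weights, context


if __name__ == "__main__":
    # Hypothetical per-mark embeddings and relevance scores for one gene.
    marks = ["H3K4me3", "H3K27me3", "H3K36me3"]
    embeddings = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.4]]
    relevance = [2.0, 0.5, 0.1]
    weights, context = attend(embeddings, relevance)
    for mark, weight in zip(marks, weights):
        print(mark, round(weight, 3))
    print("context:", [round(c, 3) for c in context])
```

Because the weights are non-negative and sum to one, they can be read directly as the relative contribution of each input, which is the property the abstract relies on for interpretation.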
Affiliation(s)
- Yanjun Qi
- Department of Computer Science, University of Virginia
| |
|
525
|
Xu Y, Wang Y, Luo J, Zhao W, Zhou X. Deep learning of the splicing (epi)genetic code reveals a novel candidate mechanism linking histone modifications to ESC fate decision. Nucleic Acids Res 2017; 45:12100-12112. [PMID: 29036709 PMCID: PMC5716079 DOI: 10.1093/nar/gkx870] [Citation(s) in RCA: 59] [Impact Index Per Article: 7.4] [Received: 03/29/2017] [Revised: 09/08/2017] [Accepted: 09/15/2017] [Indexed: 01/31/2023] Open
Abstract
Alternative splicing (AS) is a genetically and epigenetically regulated pre-mRNA processing step that increases transcriptome and proteome diversity. Comprehensively decoding these regulatory mechanisms holds promise for gaining deeper insights into a variety of biological contexts involved in AS, such as development and disease. We assembled a splicing (epi)genetic code, DeepCode, for human embryonic stem cell (hESC) differentiation by integrating heterogeneous features of genomic sequences and 16 histone modifications with a multi-label deep neural network. With the advantages of epigenetic features, DeepCode significantly improves the performance in predicting the splicing patterns and their changes during hESC differentiation. Meanwhile, DeepCode reveals the superiority of epigenomic features and their dominant roles in decoding AS patterns, highlighting the necessity of including the epigenetic properties when assembling a more comprehensive splicing code. Moreover, DeepCode allows robust predictions across cell lineages and datasets. In particular, we identified a putative H3K36me3-regulated AS event leading to nonsense-mediated mRNA decay of BARD1. Reduced BARD1 expression results in the attenuation of ATM/ATR signalling activities and, in turn, hESC differentiation. These results suggest a novel candidate mechanism linking histone modifications to hESC fate decision. In addition, when trained in different contexts, DeepCode can be extended to a variety of biological and biomedical fields.
Affiliation(s)
- Yungang Xu
  - Center for Systems Medicine, School of Biomedical Bioinformatics, University of Texas Health Science Center at Houston, TX 77030, USA
  - Center for Bioinformatics and Systems Biology, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA
- Yongcui Wang
  - Key Laboratory of Adaptation and Evolution of Plateau Biota, Northwest Institute of Plateau Biology, Chinese Academy of Sciences, Xining, Qinghai 810008, China
- Jiesi Luo
  - Center for Systems Medicine, School of Biomedical Bioinformatics, University of Texas Health Science Center at Houston, TX 77030, USA
- Weiling Zhao
  - Center for Systems Medicine, School of Biomedical Bioinformatics, University of Texas Health Science Center at Houston, TX 77030, USA
- Xiaobo Zhou
  - Center for Systems Medicine, School of Biomedical Bioinformatics, University of Texas Health Science Center at Houston, TX 77030, USA
  - Center for Bioinformatics and Systems Biology, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA
526
Min X, Zeng W, Chen S, Chen N, Chen T, Jiang R. Predicting enhancers with deep convolutional neural networks. BMC Bioinformatics 2017; 18:478. [PMID: 29219068 PMCID: PMC5773911 DOI: 10.1186/s12859-017-1878-3] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.8] [Indexed: 01/22/2023] Open
Abstract
Background With the rapid development of deep sequencing techniques in recent years, enhancers have been systematically identified in such projects as FANTOM and ENCODE, forming genome-wide landscapes in a series of human cell lines. Nevertheless, experimental approaches are still costly and time consuming for large-scale identification of enhancers across a variety of tissues under different disease states, making computational identification of enhancers indispensable. Results To facilitate the identification of enhancers, we propose a computational framework, named DeepEnhancer, to distinguish enhancers from background genomic sequences. Our method relies purely on DNA sequences to predict enhancers in an end-to-end manner by using a deep convolutional neural network (CNN). We train our deep learning model on permissive enhancers and then adopt a transfer learning strategy to fine-tune the model on enhancers specific to a cell line. Results demonstrate the effectiveness and efficiency of our method in the classification of enhancers against random sequences, exhibiting advantages of deep learning over traditional sequence-based classifiers. We then construct a variety of neural networks with different architectures and show the usefulness of such techniques as max-pooling and batch normalization in our method. To gain interpretability of our approach, we further visualize convolutional kernels as sequence logos and successfully identify similar motifs in the JASPAR database. Conclusions DeepEnhancer enables the identification of novel enhancers using only DNA sequences via a highly accurate deep learning model. The proposed computational framework can also be applied to similar problems, thereby prompting the use of machine learning methods in life sciences.
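The abstract above describes a CNN that reads raw DNA and whose convolutional kernels can be viewed as motifs. As a rough illustration of that idea only, the sketch below one-hot encodes a sequence and slides a single hand-written kernel over it, followed by max-pooling. It is not DeepEnhancer's code: the function names, the toy "TATA" kernel and the example sequence are all assumptions made for illustration, and a trained network would learn many real-valued kernels instead.

```python
# Illustrative sketch only; all names below are hypothetical, not DeepEnhancer's API.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T); unknown bases become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d_scan(encoded, kernel):
    """Slide a (width x 4) kernel along the encoded sequence; return one score per window."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(4):
                score += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(score)
    return scores


def max_pool(scores):
    """Global max-pooling: keep only the strongest match along the sequence."""
    return max(scores)


# A hand-written kernel that rewards the toy motif "TATA" (assumption for illustration).
toy_kernel = one_hot("TATA")
scores = conv1d_scan(one_hot("GGTATACC"), toy_kernel)
best = max_pool(scores)
```

Because the toy kernel is itself a one-hot motif, each window score is simply the number of positions that match "TATA"; reading a learned kernel column by column in the same way is what makes a sequence-logo visualization possible.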
Affiliation(s)
- Xu Min
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Beijing, 100084, China
  - Department of Computer Science and Technology, State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing, 100084, China
- Wanwen Zeng
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Beijing, 100084, China
  - Department of Automation, Tsinghua University, Beijing, 100084, China
- Shengquan Chen
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Beijing, 100084, China
  - Department of Automation, Tsinghua University, Beijing, 100084, China
- Ning Chen
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Beijing, 100084, China
  - Department of Computer Science and Technology, State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing, 100084, China
- Ting Chen
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Beijing, 100084, China
  - Department of Computer Science and Technology, State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing, 100084, China
  - Program in Computational Biology and Bioinformatics, University of Southern California, Los Angeles, CA, 90089, USA
- Rui Jiang
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Beijing, 100084, China
  - Department of Automation, Tsinghua University, Beijing, 100084, China
527
Abstract
BACKGROUND Gene expression is a key intermediate level through which genotypes lead to a particular trait. Gene expression is affected by various factors, including the genotypes of genetic variants. With the aim of delineating the genetic impact on gene expression, we build a deep auto-encoder model to assess how well genetic variants contribute to gene expression changes. This new deep learning model is a regression-based predictive model based on the MultiLayer Perceptron and Stacked Denoising Auto-encoder (MLP-SAE). The model is trained using a stacked denoising auto-encoder for feature selection and a multilayer perceptron framework for backpropagation. We further improve the model by introducing dropout to prevent overfitting and improve performance. RESULTS To demonstrate the usage of this model, we apply MLP-SAE to a real genomic dataset with genotypes and gene expression profiles measured in yeast. Our results show that the MLP-SAE model with dropout outperforms other models, including Lasso, Random Forests and the MLP-SAE model without dropout. Using the MLP-SAE model with dropout, we show that gene expression quantifications predicted by the model solely from genotypes align well with true gene expression patterns. CONCLUSION We provide a deep auto-encoder model for predicting gene expression from SNP genotypes. This study demonstrates that deep learning is appropriate for tackling another genomic problem, i.e., building predictive models to understand genotypes' contribution to gene expression. With the emerging availability of richer genomic data, we anticipate that deep learning models will play a bigger role in modeling and interpreting genomics.
Affiliation(s)
- Rui Xie
  - Department of Computer Science, University of Missouri at Columbia, Columbia, MO USA
- Jia Wen
  - Department of Bioinformatics and Genomics, College of Computing and Informatics, University of North Carolina at Charlotte, University City Blvd, Charlotte, NC USA
- Andrew Quitadamo
  - Department of Bioinformatics and Genomics, College of Computing and Informatics, University of North Carolina at Charlotte, University City Blvd, Charlotte, NC USA
- Jianlin Cheng
  - Department of Computer Science, University of Missouri at Columbia, Columbia, MO USA
- Xinghua Shi
  - Department of Bioinformatics and Genomics, College of Computing and Informatics, University of North Carolina at Charlotte, University City Blvd, Charlotte, NC USA
528
Ransohoff JD, Wei Y, Khavari PA. The functions and unique features of long intergenic non-coding RNA. Nat Rev Mol Cell Biol 2017; 19:143-157. [PMID: 29138516 DOI: 10.1038/nrm.2017.104] [Citation(s) in RCA: 923] [Impact Index Per Article: 115.4] [Indexed: 12/21/2022]
Abstract
Long intergenic non-coding RNA (lincRNA) genes have diverse features that distinguish them from mRNA-encoding genes and exercise functions such as remodelling chromatin and genome architecture, RNA stabilization and transcription regulation, including enhancer-associated activity. Some genes currently annotated as encoding lincRNAs include small open reading frames (smORFs) and encode functional peptides and thus may be more properly classified as coding RNAs. lincRNAs may broadly serve to fine-tune the expression of neighbouring genes with remarkable tissue specificity through a diversity of mechanisms, highlighting our rapidly evolving understanding of the non-coding genome.
Affiliation(s)
- Julia D Ransohoff
  - Program in Epithelial Biology, Stanford University School of Medicine, California 94305, USA
- Yuning Wei
  - Program in Epithelial Biology, Stanford University School of Medicine, California 94305, USA
- Paul A Khavari
  - Program in Epithelial Biology, Stanford University School of Medicine, California 94305, USA
  - Veterans Affairs Palo Alto Healthcare System, Palo Alto, California 94304, USA
529
Computational biology: deep learning. Emerg Top Life Sci 2017; 1:257-274. [PMID: 33525807 PMCID: PMC7289034 DOI: 10.1042/etls20160025] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.0] [Received: 04/21/2017] [Revised: 09/13/2017] [Accepted: 09/18/2017] [Indexed: 02/06/2023]
Abstract
Deep learning is the trendiest tool in a computational biologist's toolbox. This exciting class of methods, based on artificial neural networks, quickly became popular due to its competitive performance in prediction problems. In pioneering early work, applying simple network architectures to abundant data already provided gains over traditional counterparts in functional genomics, image analysis, and medical diagnostics. Now, ideas for constructing and training networks and even off-the-shelf models have been adapted from the rapidly developing machine learning subfield to improve performance in a range of computational biology tasks. Here, we review some of these advances in the last 2 years.
530
Gene Prediction in Metagenomic Fragments with Deep Learning. BIOMED RESEARCH INTERNATIONAL 2017; 2017:4740354. [PMID: 29250541 PMCID: PMC5698827 DOI: 10.1155/2017/4740354] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 06/30/2017] [Accepted: 10/08/2017] [Indexed: 01/14/2023]
Abstract
Next-generation sequencing technologies used in metagenomics yield numerous sequencing fragments that come from thousands of different species. Accurately identifying genes in metagenomic fragments is one of the most fundamental issues in metagenomics. In this article, by fusing multiple features (i.e., monocodon usage, monoamino acid usage, ORF length coverage, and Z-curve features) and using a deep stacking network learning model, we present a novel method (called Meta-MFDL) to predict metagenomic genes. The results of 10-fold cross-validation and independent tests show that Meta-MFDL is a powerful tool for identifying genes from metagenomic fragments.
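Two of the feature families named in the abstract, monocodon usage and Z-curve features, can be computed directly from a candidate ORF. The sketch below is a minimal, assumed formulation: it uses the standard three Z-curve components on whole-sequence base frequencies and in-frame codon frequencies. The exact feature definitions used by Meta-MFDL (for example, phase-specific Z-curve terms) are not given here, so the function names and this simplification are illustrative assumptions.

```python
from collections import Counter


def z_curve(seq):
    """Frequency-based Z-curve components (x, y, z) for a DNA string.

    x: purine (A, G) vs pyrimidine (C, T)
    y: amino (A, C) vs keto (G, T)
    z: weak (A, T) vs strong (G, C) hydrogen bonding
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    a, c, g, t = (counts[base] / len(seq) for base in "ACGT")
    return ((a + g) - (c + t), (a + c) - (g + t), (a + t) - (g + c))


def codon_usage(seq):
    """Relative frequency of each in-frame codon (frame 0); a trailing partial codon is ignored."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    if not codons:
        raise ValueError("sequence shorter than one codon")
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}


# Toy inputs chosen only so the arithmetic is easy to follow.
features = z_curve("AAGG")
usage = codon_usage("ATGATGTAA")
```

In a multi-feature setup such as the one the abstract describes, vectors like these would be concatenated with the other feature groups before being passed to the classifier.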
531
Cuperus JT, Groves B, Kuchina A, Rosenberg AB, Jojic N, Fields S, Seelig G. Deep learning of the regulatory grammar of yeast 5' untranslated regions from 500,000 random sequences. Genome Res 2017; 27:2015-2024. [PMID: 29097404 PMCID: PMC5741052 DOI: 10.1101/gr.224964.117] [Citation(s) in RCA: 119] [Impact Index Per Article: 14.9] [Received: 05/12/2017] [Accepted: 10/18/2017] [Indexed: 11/25/2022]
Abstract
Our ability to predict protein expression from DNA sequence alone remains poor, reflecting our limited understanding of cis-regulatory grammar and hampering the design of engineered genes for synthetic biology applications. Here, we generate a model that predicts the protein expression of the 5′ untranslated region (UTR) of mRNAs in the yeast Saccharomyces cerevisiae. We constructed a library of half a million 50-nucleotide-long random 5′ UTRs and assayed their activity in a massively parallel growth selection experiment. The resulting data allow us to quantify the impact on protein expression of Kozak sequence composition, upstream open reading frames (uORFs), and secondary structure. We trained a convolutional neural network (CNN) on the random library and showed that it performs well at predicting the protein expression of both a held-out set of the random 5′ UTRs as well as native S. cerevisiae 5′ UTRs. The model additionally was used to computationally evolve highly active 5′ UTRs. We confirmed experimentally that the great majority of the evolved sequences led to higher protein expression rates than the starting sequences, demonstrating the predictive power of this model.
Affiliation(s)
- Josh T Cuperus
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
- Benjamin Groves
  - Department of Electrical Engineering, University of Washington, Seattle, Washington 98195, USA
- Anna Kuchina
  - Department of Electrical Engineering, University of Washington, Seattle, Washington 98195, USA
- Alexander B Rosenberg
  - Department of Electrical Engineering, University of Washington, Seattle, Washington 98195, USA
- Stanley Fields
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
  - Department of Medicine, University of Washington, Seattle, Washington 98195, USA
- Georg Seelig
  - Department of Electrical Engineering, University of Washington, Seattle, Washington 98195, USA
  - Department of Computer Science & Engineering, University of Washington, Seattle, Washington 98195, USA
532
Zeng H, Gifford DK. Predicting the impact of non-coding variants on DNA methylation. Nucleic Acids Res 2017; 45:e99. [PMID: 28334830 PMCID: PMC5499808 DOI: 10.1093/nar/gkx177] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.9] [Received: 12/15/2016] [Accepted: 03/13/2017] [Indexed: 12/22/2022] Open
Abstract
DNA methylation plays a crucial role in the establishment of tissue-specific gene expression and the regulation of key biological processes. However, our present inability to predict the effect of genome sequence variation on DNA methylation precludes a comprehensive assessment of the consequences of non-coding variation. We introduce CpGenie, a sequence-based framework that learns a regulatory code of DNA methylation using a deep convolutional neural network and uses this network to predict the impact of sequence variation on proximal CpG site DNA methylation. CpGenie produces allele-specific DNA methylation prediction with single-nucleotide sensitivity that enables accurate prediction of methylation quantitative trait loci (meQTL). We demonstrate that CpGenie prioritizes validated GWAS SNPs, and contributes to the prediction of functional non-coding variants, including expression quantitative trait loci (eQTL) and disease-associated mutations. CpGenie is publicly available to assist in identifying and interpreting regulatory non-coding variants.
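Sequence-based variant scoring of the kind the abstract describes is commonly done by running a model on the reference sequence and on the same sequence carrying the alternative allele, then taking the difference. The sketch below shows only that wrapper. The `toy_model` function is a made-up stand-in (CpG dinucleotide density) and not CpGenie's network; the other function names are likewise hypothetical.

```python
def apply_snv(seq, pos, alt):
    """Return seq with the base at 0-based position pos replaced by alt."""
    if not 0 <= pos < len(seq):
        raise IndexError("variant position outside the sequence")
    return seq[:pos] + alt + seq[pos + 1:]


def variant_effect(model, seq, pos, alt):
    """Predicted change caused by the variant: model(alternative) - model(reference)."""
    return model(apply_snv(seq, pos, alt)) - model(seq)


def toy_model(seq):
    """Hypothetical placeholder for a trained network: CpG dinucleotide density."""
    return seq.count("CG") / (len(seq) - 1)


# Replacing the C at position 1 destroys one of the two CpG sites in this toy sequence.
effect = variant_effect(toy_model, "ACGTACGT", 1, "T")
```

Any callable that maps a sequence to a number can be passed as `model`, so the same wrapper would apply to a real predictor once its input encoding is handled inside that callable.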
Affiliation(s)
- Haoyang Zeng
  - Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
- David K Gifford
  - Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
533
Finnegan A, Song JS. Maximum entropy methods for extracting the learned features of deep neural networks. PLoS Comput Biol 2017; 13:e1005836. [PMID: 29084280 PMCID: PMC5679649 DOI: 10.1371/journal.pcbi.1005836] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 04/18/2017] [Revised: 11/09/2017] [Accepted: 10/23/2017] [Indexed: 11/19/2022] Open
Abstract
New architectures of multilayer artificial neural networks and new methods for training them are rapidly revolutionizing the application of machine learning in diverse fields, including business, social science, physical sciences, and biology. Interpreting deep neural networks, however, currently remains elusive, and a critical challenge lies in understanding which meaningful features a network is actually learning. We present a general method for interpreting deep neural networks and extracting network-learned features from input data. We describe our algorithm in the context of biological sequence analysis. Our approach, based on ideas from statistical physics, samples from the maximum entropy distribution over possible sequences, anchored at an input sequence and subject to constraints implied by the empirical function learned by a network. Using our framework, we demonstrate that local transcription factor binding motifs can be identified from a network trained on ChIP-seq data and that nucleosome positioning signals are indeed learned by a network trained on chemical cleavage nucleosome maps. Imposing a further constraint on the maximum entropy distribution also allows us to probe whether a network is learning global sequence features, such as the high GC content in nucleosome-rich regions. This work thus provides valuable mathematical tools for interpreting and extracting learned features from feed-forward neural networks. Deep learning is a state-of-the-art reformulation of artificial neural networks that have a long history of development. It can perform superbly well in diverse automated classification and prediction problems, including handwriting recognition, image identification, and biological pattern recognition. 
Its modern success can be attributed to improved training algorithms, clever network architecture, rapid explosion of available data, and advanced computing power–all of which have allowed the great expansion in the number of unknown parameters to be estimated by the model. These parameters, however, are so intricately connected through highly nonlinear functions that interpreting which essential features of given data are actually used by a deep neural network for its excellent performance has been difficult. We address this problem by using ideas from statistical physics to sample new unseen data that are likely to behave similarly to original data points when passed through the trained network. This synthetic data cloud around each original data point retains informative features while averaging out nonessential ones, ultimately allowing us to extract important network-learned features from the original data set and thus improving the human interpretability of deep learning methods. We demonstrate how our method can be applied to biological sequence analysis.
Affiliation(s)
- Alex Finnegan
  - Department of Physics, University of Illinois, Urbana-Champaign, Urbana, Illinois, United States of America
  - Carl R. Woese Institute for Genomic Biology, University of Illinois, Urbana-Champaign, Urbana, Illinois, United States of America
- Jun S. Song
  - Department of Physics, University of Illinois, Urbana-Champaign, Urbana, Illinois, United States of America
  - Carl R. Woese Institute for Genomic Biology, University of Illinois, Urbana-Champaign, Urbana, Illinois, United States of America
534
Alvarez RV, Li S, Landsman D, Ovcharenko I. SNPDelScore: combining multiple methods to score deleterious effects of noncoding mutations in the human genome. Bioinformatics 2017; 34:289-291. [PMID: 28968739 DOI: 10.1093/bioinformatics/btx583] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 06/29/2017] [Revised: 09/11/2017] [Accepted: 09/13/2017] [Indexed: 11/12/2022] Open
Abstract
SUMMARY Addressing deleterious effects of noncoding mutations is an essential step towards the identification of disease-causal mutations of gene regulatory elements. Several methods for quantifying the deleteriousness of noncoding mutations using artificial intelligence, deep learning and other approaches have been recently proposed. Although the majority of the proposed methods have demonstrated excellent accuracy on different test sets, there is rarely a consensus. In addition, advanced statistical and artificial learning approaches used by these methods make it difficult to port these methods outside of the labs that have developed them. To address these challenges and to transform the methodological advances in predicting deleterious noncoding mutations into a practical resource available for the broader functional genomics and population genetics communities, we developed SNPDelScore, which uses a panel of proposed methods for quantifying deleterious effects of noncoding mutations to precompute and compare the deleteriousness scores of all common SNPs in the human genome in 44 cell lines. The panel of deleteriousness scores of a SNP computed using different methods is supplemented by functional information from the GWAS Catalog, libraries of transcription factor-binding sites, and genic characteristics of mutations. SNPDelScore comes with a genome browser capable of displaying and comparing large sets of SNPs in a genomic locus and rapidly identifying consensus SNPs with the highest deleteriousness scores, making those prime candidates for phenotype-causal polymorphisms. AVAILABILITY AND IMPLEMENTATION https://www.ncbi.nlm.nih.gov/research/snpdelscore/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Roberto Vera Alvarez
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Shan Li
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- David Landsman
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Ivan Ovcharenko
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
535
Schwessinger R, Suciu MC, McGowan SJ, Telenius J, Taylor S, Higgs DR, Hughes JR. Sasquatch: predicting the impact of regulatory SNPs on transcription factor binding from cell- and tissue-specific DNase footprints. Genome Res 2017; 27:1730-1742. [PMID: 28904015 PMCID: PMC5630036 DOI: 10.1101/gr.220202.117] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 01/04/2017] [Accepted: 08/07/2017] [Indexed: 12/22/2022]
Abstract
In the era of genome-wide association studies (GWAS) and personalized medicine, predicting the impact of single nucleotide polymorphisms (SNPs) in regulatory elements is an important goal. Current approaches to determine the potential of regulatory SNPs depend on inadequate knowledge of cell-specific DNA binding motifs. Here, we present Sasquatch, a new computational approach that uses DNase footprint data to estimate and visualize the effects of noncoding variants on transcription factor binding. Sasquatch performs a comprehensive k-mer-based analysis of DNase footprints to determine any k-mer's potential for protein binding in a specific cell type and how this may be changed by sequence variants. Therefore, Sasquatch uses an unbiased approach, independent of known transcription factor binding sites and motifs. Sasquatch only requires a single DNase-seq data set per cell type, from any genotype, and produces consistent predictions from data generated by different experimental procedures and at different sequence depths. Here we demonstrate the effectiveness of Sasquatch using previously validated functional SNPs and benchmark its performance against existing approaches. Sasquatch is available as a versatile webtool incorporating publicly available data, including the human ENCODE collection. Thus, Sasquatch provides a powerful tool and repository for prioritizing likely regulatory SNPs in the noncoding genome.
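The k-mer view described in the abstract implies a simple bookkeeping step: for a SNP, collect every k-mer window that overlaps the variant position in the reference sequence and in the variant sequence, then compare per-k-mer scores. The sketch below shows that step only. The score table is an invented placeholder, whereas Sasquatch derives its k-mer measures from DNase footprint data; the function names and the summed-difference summary are assumptions for illustration.

```python
def kmers_overlapping(seq, pos, k):
    """All k-mers of seq whose window covers the 0-based position pos, in left-to-right order."""
    first_start = max(0, pos - k + 1)
    last_start = min(pos, len(seq) - k)
    return [seq[s:s + k] for s in range(first_start, last_start + 1)]


def snp_kmer_change(seq, pos, alt, k, kmer_score):
    """Sum of (reference score - variant score) over paired windows covering the SNP.

    kmer_score maps a k-mer to a hypothetical per-cell-type binding score;
    k-mers missing from the table are treated as 0.0.
    """
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    ref_kmers = kmers_overlapping(seq, pos, k)
    alt_kmers = kmers_overlapping(alt_seq, pos, k)
    return sum(kmer_score.get(r, 0.0) - kmer_score.get(a, 0.0)
               for r, a in zip(ref_kmers, alt_kmers))


# Invented scores, for illustration only.
toy_scores = {"ACG": 1.0, "ATG": 0.25}
windows = kmers_overlapping("AACGTT", 2, 3)
change = snp_kmer_change("AACGTT", 2, "T", 3, toy_scores)
```

A positive value here means the reference k-mers score higher than the variant k-mers under the supplied table; how such differences should be weighted or visualized is a design choice this sketch does not attempt to reproduce.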
Affiliation(s)
- Ron Schwessinger
  - MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
- Maria C Suciu
  - MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
- Simon J McGowan
  - Computational Biology Research Group, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
- Jelena Telenius
  - MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
- Stephen Taylor
  - Computational Biology Research Group, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
- Doug R Higgs
  - MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
- Jim R Hughes
  - MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
536
Kreimer A, Zeng H, Edwards MD, Guo Y, Tian K, Shin S, Welch R, Wainberg M, Mohan R, Sinnott-Armstrong NA, Li Y, Eraslan G, AMIN TB, Goke J, Mueller NS, Kellis M, Kundaje A, Beer MA, Keles S, Gifford DK, Yosef N. Predicting gene expression in massively parallel reporter assays: A comparative study. Hum Mutat 2017; 38:1240-1250. [PMID: 28220625 PMCID: PMC5560998 DOI: 10.1002/humu.23197] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 08/26/2016] [Revised: 01/19/2017] [Accepted: 02/12/2017] [Indexed: 02/03/2023]
Abstract
In many human diseases, associated genetic changes tend to occur within noncoding regions, whose effect might be related to transcriptional control. A central goal in human genetics is to understand the function of such noncoding regions: given a region that is statistically associated with changes in gene expression (expression quantitative trait locus [eQTL]), does it in fact play a regulatory role? And if so, how is this role "coded" in its sequence? These questions were the subject of the Critical Assessment of Genome Interpretation eQTL challenge. Participants were given a set of sequences that flank eQTLs in humans and were asked to predict whether these are capable of regulating transcription (as evaluated by massively parallel reporter assays), and whether this capability changes between alternative alleles. Here, we report lessons learned from this community effort. By inspecting predictive properties in isolation, and conducting meta-analysis over the competing methods, we find that using chromatin accessibility and transcription factor binding as features in an ensemble of classifiers or regression models leads to the most accurate results. We then characterize the loci that are harder to predict, putting the spotlight on areas of weakness, which we expect to be the subject of future studies.
Affiliation(s)
- Anat Kreimer
  - Department of Electrical Engineering and Computer Science and Center for Computational Biology, University of California, Berkeley, Berkeley, CA 94720, USA
  - Department of Bioengineering and Therapeutic Sciences, Institute for Human Genetics, University of California, San Francisco, San Francisco, California, USA
- Haoyang Zeng
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
- Matthew D. Edwards
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
- Yuchun Guo
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
- Kevin Tian
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
- Sunyoung Shin
  - Department of Statistics, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Rene Welch
  - Department of Statistics, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Michael Wainberg
  - Department of Genetics, Stanford University School of Medicine, Department of Computer Science, Stanford, California 94305, USA
- Rahul Mohan
  - Department of Genetics, Stanford University School of Medicine, Department of Computer Science, Stanford, California 94305, USA
- Nicholas A. Sinnott-Armstrong
  - Department of Genetics, Stanford University School of Medicine, Department of Computer Science, Stanford, California 94305, USA
- Yue Li
  - Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, Massachusetts 02139, USA
- Gökcen Eraslan
  - Computational Cell Maps, Institute of Computational Biology, Helmholtz Zentrum München, Ingolstädter Landstr. 1 85764 Neuherberg, Germany
- Talal Bin AMIN
  - Computational and Systems Biology, Genome Institute of Singapore, Singapore 138672, Singapore
- Jonathan Goke
  - Computational and Systems Biology, Genome Institute of Singapore, Singapore 138672, Singapore
- Nikola S. Mueller
  - Computational Cell Maps, Institute of Computational Biology, Helmholtz Zentrum München, Ingolstädter Landstr. 1 85764 Neuherberg, Germany
- Manolis Kellis
  - Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, Massachusetts 02139, USA
- Anshul Kundaje
  - Department of Genetics, Stanford University School of Medicine, Department of Computer Science, Stanford, California 94305, USA
- Michael A Beer
  - McKusick-Nathans Institute of Genetic Medicine, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Sunduz Keles
  - Department of Statistics, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin, USA
- David K. Gifford
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
- Nir Yosef
  - Department of Electrical Engineering and Computer Science and Center for Computational Biology, University of California, Berkeley, Berkeley, CA 94720, USA
  - Ragon Institute of Massachusetts General Hospital, MIT and Harvard, Cambridge, MA, 02139
537
Fu H, Zhang X. Noncoding Variants Functional Prioritization Methods Based on Predicted Regulatory Factor Binding Sites. Curr Genomics 2017; 18:322-331. [PMID: 29081688 PMCID: PMC5635616 DOI: 10.2174/1389202918666170228143619] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/01/2016] [Revised: 10/16/2016] [Accepted: 11/02/2016] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND With the advent of the post-genomic era, research into the genetic mechanisms of disease has come to depend increasingly on studies of genes, gene networks and gene-protein interaction networks. To explore gene expression and regulation, researchers have carried out many studies on transcription factors and their binding sites (TFBSs). Building on the large number of TFBS values predicted by deep learning models, further computation and analysis have been done to reveal the relationship between gene mutations and the occurrence of disease. Deep learning methods have been shown to outperform conventional methods in predicting the functions of noncoding variants. Research on predicting the functions of single nucleotide polymorphisms (SNPs) is expected to uncover the mechanisms by which gene mutations affect human traits and diseases. RESULTS We reviewed the conventional TFBS identification methods from different perspectives. For the deep learning methods that predict TFBSs, we discussed related problems such as raw data preprocessing, the structural design of the deep convolutional neural network (CNN) and the measurement of model performance. We then summarized the techniques commonly used to find functional noncoding variants from de novo sequence. CONCLUSION With the rapid development of high-throughput assays, more sample data and chromatin features will help improve the prediction accuracy of deep convolutional neural networks for TFBS identification. Meanwhile, gaining more insight into the deep CNN framework itself has proved useful both for improving model performance and for developing designs better suited to the sample data. Based on the feature values predicted by the deep CNN model, a prioritization model for functional noncoding variants would help reveal the effect of gene mutations on disease.
Affiliation(s)
- Haoyue Fu
- College of Sciences, Northeastern University, Shenyang, China
| | - Lianping Yang
- College of Sciences, Northeastern University, Shenyang, China
- University of Southern California, Dept. Biol. Sci., Program Mol & Computat Biol, USA
| | - Xiangde Zhang
- College of Sciences, Northeastern University, Shenyang, China
| |
|
538
|
Reiman D, Metwally A. Using convolutional neural networks to explore the microbiome. Annu Int Conf IEEE Eng Med Biol Soc 2017; 2017:4269-4272. [PMID: 29060840 DOI: 10.1109/embc.2017.8037799] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 06/07/2023]
Abstract
The microbiome has been shown to have an impact on the development of various diseases in the host. Being able to accurately predict the phenotype of a genomic sample from its microbial taxonomic abundance profile is an important problem for personalized medicine. In this paper, we examine the potential of a deep learning framework, a convolutional neural network (CNN), for such prediction. To facilitate CNN learning, we explore the structure of abundance profiles by creating the phylogenetic tree and by designing a scheme to embed the tree into a matrix that retains the spatial relationship of nodes in the tree and their quantitative characteristics. The proposed CNN framework is highly accurate, achieving 99.47% accuracy in an evaluation on a dataset of 1967 samples from three phenotypes. Our results demonstrate the feasibility and promise of CNNs for classifying sample phenotypes.
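As a rough illustration of embedding a tree into a matrix, the sketch below places each tree level in one row, keeps nodes in left-to-right order, and fills each cell with the abundance summed over that node's descendants. The layout, the function names and the toy taxonomy are all assumptions made for illustration; the abstract does not specify the authors' exact scheme.

```python
def subtree_abundance(tree, leaf_abundance, node):
    """Abundance of a node: its own value if it is a leaf, else the sum over its children."""
    children = tree.get(node, [])
    if not children:
        return leaf_abundance.get(node, 0.0)
    return sum(subtree_abundance(tree, leaf_abundance, c) for c in children)


def tree_to_matrix(tree, leaf_abundance, root):
    """Embed a tree as a zero-padded matrix with one row per tree level."""
    rows = []
    level = [root]
    while level:
        rows.append([subtree_abundance(tree, leaf_abundance, n) for n in level])
        # Children are appended in parent order, so siblings stay adjacent.
        level = [c for n in level for c in tree.get(n, [])]
    width = max(len(r) for r in rows)
    return [r + [0.0] * (width - len(r)) for r in rows]


# Hypothetical toy taxonomy: parent -> list of children.
tree = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1"]}
# Hypothetical relative abundances observed at the leaves.
leaf_abundance = {"a1": 0.2, "a2": 0.3, "b1": 0.5}

matrix = tree_to_matrix(tree, leaf_abundance, "root")
for row in matrix:
    print(row)
```

A matrix of this kind can be fed to a 2D convolution because related taxa end up in neighbouring cells; whether this particular padding and ordering matches the published method is not stated in the abstract.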
|
539
|
Zhang H, Zhu L, Huang DS. WSMD: weakly-supervised motif discovery in transcription factor ChIP-seq data. Sci Rep 2017; 7:3217. [PMID: 28607381 PMCID: PMC5468353 DOI: 10.1038/s41598-017-03554-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 12/15/2016] [Accepted: 05/02/2017] [Indexed: 01/24/2023] Open
Abstract
Although discriminative motif discovery (DMD) methods are promising for eliciting motifs from high-throughput experimental data, most existing DMD methods must, for reasons of computational expense, adopt approximate schemes that greatly restrict the search space, leading to a significant loss of predictive accuracy. In this paper, we propose Weakly-Supervised Motif Discovery (WSMD) to discover motifs from ChIP-seq datasets. In contrast to the learning strategies adopted by previous DMD methods, WSMD allows a "global" optimization scheme of the motif parameters in continuous space, thereby reducing the information loss of model representation and improving the quality of the resultant motifs. Meanwhile, by exploiting the connection between the DMD framework and existing weakly supervised learning (WSL) technologies, we also present highly scalable learning strategies for the proposed method. The experimental results on both real ChIP-seq datasets and synthetic datasets show that WSMD substantially outperforms former DMD methods (including DREME, HOMER, XXmotif, motifRG and DECOD) in terms of predictive accuracy, while also achieving a competitive computational speed.
Affiliation(s)
- Hongbo Zhang
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - Lin Zhu
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China.
| |
|
540
|
Gomez-Cabrero D, Tegnér J. Iterative Systems Biology for Medicine – Time for advancing from network signatures to mechanistic equations. Curr Opin Syst Biol 2017. [DOI: 10.1016/j.coisb.2017.05.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/23/2022]
|
541
|
Feigin ME, Garvin T, Bailey P, Waddell N, Chang DK, Kelley DR, Shuai S, Gallinger S, McPherson JD, Grimmond SM, Khurana E, Stein LD, Biankin AV, Schatz MC, Tuveson DA. Recurrent noncoding regulatory mutations in pancreatic ductal adenocarcinoma. Nat Genet 2017; 49:825-833. [PMID: 28481342 PMCID: PMC5659388 DOI: 10.1038/ng.3861] [Citation(s) in RCA: 44] [Impact Index Per Article: 5.5] [Received: 07/28/2016] [Accepted: 04/10/2017] [Indexed: 12/15/2022]
Abstract
The contributions of coding mutations to tumorigenesis are relatively well known; however, little is known about somatic alterations in noncoding DNA. Here we describe GECCO (Genomic Enrichment Computational Clustering Operation) to analyze somatic noncoding alterations in 308 pancreatic ductal adenocarcinomas (PDAs) and identify commonly mutated regulatory regions. We find recurrent noncoding mutations to be enriched in PDA pathways, including axon guidance and cell adhesion, and newly identified processes, including transcription and homeobox genes. We identified mutations in protein binding sites correlating with differential expression of proximal genes and experimentally validated effects of mutations on expression. We developed an expression modulation score that quantifies the strength of gene regulation imposed by each class of regulatory elements, and found the strongest elements were most frequently mutated, suggesting a selective advantage. Our detailed single-cancer analysis of noncoding alterations identifies regulatory mutations as candidates for diagnostic and prognostic markers, and suggests new mechanisms for tumor evolution.
Affiliation(s)
- Michael E Feigin
- Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA
- Lustgarten Foundation Pancreatic Cancer Research Laboratory, Cold Spring Harbor, New York, USA
| | - Tyler Garvin
- Watson School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA
| | - Peter Bailey
- Wolfson Wohl Cancer Research Centre, Institute of Cancer Sciences, University of Glasgow, Glasgow, Scotland, UK
| | - Nicola Waddell
- QIMR Berghofer Medical Research Institute, Brisbane, Queensland, Australia
- Queensland Centre for Medical Genomics, Institute for Molecular Bioscience, The University of Queensland, Brisbane, Queensland, Australia
| | - David K Chang
- Wolfson Wohl Cancer Research Centre, Institute of Cancer Sciences, University of Glasgow, Glasgow, Scotland, UK
- The Kinghorn Cancer Centre, Cancer Research Program, Garvan Institute of Medical Research, Darlinghurst, Sydney, New South Wales, Australia
- Department of Surgery, Bankstown Hospital, Bankstown, Sydney, New South Wales, Australia
- South Western Sydney Clinical School, Faculty of Medicine, University of New South Wales, Liverpool, New South Wales, Australia
| | - David R Kelley
- Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, Massachusetts, USA
| | - Shimin Shuai
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
| | - Steven Gallinger
- Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada
- Division of General Surgery, Toronto General Hospital, Toronto, Ontario, Canada
| | - John D McPherson
- Genome Technologies Program, Ontario Institute for Cancer Research, Toronto, Ontario, Canada
| | - Sean M Grimmond
- Wolfson Wohl Cancer Research Centre, Institute of Cancer Sciences, University of Glasgow, Glasgow, Scotland, UK
- Queensland Centre for Medical Genomics, Institute for Molecular Bioscience, The University of Queensland, Brisbane, Queensland, Australia
| | - Ekta Khurana
- Sandra and Edward Meyer Cancer Center, Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Medical College of Cornell University, New York, New York, USA
| | - Lincoln D Stein
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- Informatics and Biocomputing, Ontario Institute for Cancer Research, Toronto, Ontario, Canada
| | - Andrew V Biankin
- Wolfson Wohl Cancer Research Centre, Institute of Cancer Sciences, University of Glasgow, Glasgow, Scotland, UK
- South Western Sydney Clinical School, Faculty of Medicine, University of New South Wales, Liverpool, New South Wales, Australia
- West of Scotland Pancreatic Unit, Glasgow Royal Infirmary, Glasgow, Scotland, UK
| | - Michael C Schatz
- Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA
| | - David A Tuveson
- Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA
- Lustgarten Foundation Pancreatic Cancer Research Laboratory, Cold Spring Harbor, New York, USA
- Rubenstein Center for Pancreatic Cancer Research, Memorial Sloan Kettering Cancer Center, New York, New York, USA
| |
|
542
|
Pärnamaa T, Parts L. Accurate Classification of Protein Subcellular Localization from High-Throughput Microscopy Images Using Deep Learning. G3 (Bethesda) 2017; 7:1385-1392. [PMID: 28391243 PMCID: PMC5427497 DOI: 10.1534/g3.116.033654] [Citation(s) in RCA: 88] [Impact Index Per Article: 11.0] [Received: 07/18/2016] [Accepted: 11/22/2016] [Indexed: 11/29/2022]
Abstract
High-throughput microscopy of many single cells generates high-dimensional data that are far from straightforward to analyze. One important problem is automatically detecting the cellular compartment where a fluorescently-tagged protein resides, a task relatively simple for an experienced human, but difficult to automate on a computer. Here, we train an 11-layer neural network on data from mapping thousands of yeast proteins, achieving per cell localization classification accuracy of 91%, and per protein accuracy of 99% on held-out images. We confirm that low-level network features correspond to basic image characteristics, while deeper layers separate localization classes. Using this network as a feature calculator, we train standard classifiers that assign proteins to previously unseen compartments after observing only a small number of training examples. Our results are the most accurate subcellular localization classifications to date, and demonstrate the usefulness of deep learning for high-throughput microscopy.
Affiliation(s)
- Tanel Pärnamaa
- Institute of Computer Science, University of Tartu, 50409, Estonia
| | - Leopold Parts
- Institute of Computer Science, University of Tartu, 50409, Estonia
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1SA, United Kingdom
| |
|
543
|
Chasman D, Roy S. Inference of cell type specific regulatory networks on mammalian lineages. Curr Opin Syst Biol 2017; 2:130-139. [PMID: 29082337 DOI: 10.1016/j.coisb.2017.04.001] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 12/28/2022]
Abstract
Transcriptional regulatory networks are at the core of establishing cell type specific gene expression programs. In mammalian systems, such regulatory networks are determined by multiple levels of regulation, including by transcription factors, chromatin environment, and three-dimensional organization of the genome. Recent efforts to measure diverse regulatory genomic datasets across multiple cell types and tissues offer unprecedented opportunities to examine the context-specificity and dynamics of regulatory networks at a greater resolution and scale than before. In parallel, numerous computational approaches to analyze these data have emerged that serve as important tools for understanding mammalian cell type specific regulation. In this article, we review recent computational approaches to predict the expression and sequence-based regulators of a gene's expression level and examine long-range gene regulation. We highlight promising approaches, insights gained, and open challenges that need to be overcome to build a comprehensive picture of cell type specific transcriptional regulatory networks.
Affiliation(s)
- Deborah Chasman
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI 53715
| | - Sushmita Roy
- Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI 53715
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53792
| |
|
544
|
Bussemaker HJ, Causton HC, Fazlollahi M, Lee E, Muroff I. Network-based approaches that exploit inferred transcription factor activity to analyze the impact of genetic variation on gene expression. Curr Opin Syst Biol 2017; 2:98-102. [PMID: 28691107 DOI: 10.1016/j.coisb.2017.04.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 12/26/2022]
Abstract
Over the past decade, a number of methods have emerged for inferring protein-level transcription factor activities in individual samples based on prior information about the structure of the gene regulatory network. We discuss how this has enabled new methods for dissecting trans-acting mechanisms that underpin genetic variation in gene expression.
Affiliation(s)
- Harmen J Bussemaker
- Department of Biological Sciences, Columbia University, New York, NY 10027
- Department of Systems Biology, Columbia University, New York, NY 10032
| | - Helen C Causton
- Department of Pathology and Cell Biology, Columbia University Medical Center, New York, NY 10032
| | - Mina Fazlollahi
- Department of Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York, NY 10029
| | - Eunjee Lee
- Department of Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York, NY 10029
| | - Ivor Muroff
- Department of Biological Sciences, Columbia University, New York, NY 10027
| |
|
545
|
Angermueller C, Lee HJ, Reik W, Stegle O. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol 2017; 18:67. [PMID: 28395661 PMCID: PMC5387360 DOI: 10.1186/s13059-017-1189-z] [Citation(s) in RCA: 245] [Impact Index Per Article: 30.6] [Received: 01/24/2017] [Accepted: 03/07/2017] [Indexed: 12/31/2022] Open
Abstract
Recent technological advances have enabled DNA methylation to be assayed at single-cell resolution. However, current protocols are limited by incomplete CpG coverage and hence methods to predict missing methylation states are critical to enable genome-wide analyses. We report DeepCpG, a computational approach based on deep neural networks to predict methylation states in single cells. We evaluate DeepCpG on single-cell methylation data from five cell types generated using alternative sequencing protocols. DeepCpG yields substantially more accurate predictions than previous methods. Additionally, we show that the model parameters can be interpreted, thereby providing insights into how sequence composition affects methylation variability.
Affiliation(s)
- Christof Angermueller
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
| | - Heather J Lee
- Epigenetics Programme, Babraham Institute, Cambridge, UK
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Wolf Reik
- Epigenetics Programme, Babraham Institute, Cambridge, UK
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Oliver Stegle
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
| |
|
546
|
Huang YF, Gulko B, Siepel A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat Genet 2017; 49:618-624. [PMID: 28288115 PMCID: PMC5395419 DOI: 10.1038/ng.3810] [Citation(s) in RCA: 232] [Impact Index Per Article: 29.0] [Received: 08/15/2016] [Accepted: 02/13/2017] [Indexed: 12/17/2022]
Abstract
Many genetic variants that influence phenotypes of interest are located outside of protein-coding genes, yet existing methods for identifying such variants have poor predictive power. Here we introduce a new computational method, called LINSIGHT, that substantially improves the prediction of noncoding nucleotide sites at which mutations are likely to have deleterious fitness consequences, and which, therefore, are likely to be phenotypically important. LINSIGHT combines a generalized linear model for functional genomic data with a probabilistic model of molecular evolution. The method is fast and highly scalable, enabling it to exploit the 'big data' available in modern genomics. We show that LINSIGHT outperforms the best available methods in identifying human noncoding variants associated with inherited diseases. In addition, we apply LINSIGHT to an atlas of human enhancers and show that the fitness consequences at enhancers depend on cell type, tissue specificity, and constraints at associated promoters.
Affiliation(s)
- Yi-Fei Huang
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA
| | - Brad Gulko
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA
- Graduate Field of Computer Science, Cornell University, Ithaca, New York, USA
| | - Adam Siepel
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA
| |
|
547
|
Pan X, Shen HB. RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach. BMC Bioinformatics 2017; 18:136. [PMID: 28245811 PMCID: PMC5331642 DOI: 10.1186/s12859-017-1561-8] [Citation(s) in RCA: 114] [Impact Index Per Article: 14.3] [Received: 11/15/2016] [Accepted: 02/23/2017] [Indexed: 01/08/2023] Open
Abstract
Background RNAs play key roles in cells through their interactions with proteins known as RNA-binding proteins (RBPs), and the binding motifs of these proteins are crucial for understanding the post-transcriptional regulation of RNAs. How RBPs correctly recognize their target RNAs and why they bind specific positions is still far from clear. Machine learning-based algorithms are widely acknowledged to be capable of speeding up this process. Although many automatic tools have been developed to predict RNA-protein binding sites from the rapidly growing multi-resource data (e.g. sequence and structure), the domain-specific features and formats of these data have posed significant computational challenges. One current difficulty is that the common knowledge shared across sources sits at a higher level of abstraction than the observed data, so that direct integration of observed data across domains is inefficient. The other difficulty is how to interpret the prediction results. Existing approaches tend to terminate after outputting the potential discrete binding sites on the sequences, but how to assemble them into meaningful binding motifs is a topic worthy of further investigation. Results In view of these challenges, we propose a deep learning-based framework (iDeep) that uses a novel hybrid of a convolutional neural network and a deep belief network to predict RBP interaction sites and motifs on RNAs. This new protocol transforms the original observed data into a high-level abstraction feature space using multiple layers of learning blocks, where the shared representations across different domains are integrated. To validate our iDeep method, we performed experiments on 31 large-scale CLIP-seq datasets, and our results show that by integrating multiple sources of data, the average AUC can be improved by 8% compared to the best single-source-based predictor; and through cross-domain knowledge integration at an abstraction level, it outperforms the state-of-the-art predictors by 6%. Besides the overall enhanced prediction performance, the convolutional neural network module embedded in iDeep is also able to automatically capture interpretable binding motifs for RBPs. Large-scale experiments demonstrate that these mined binding motifs agree well with experimentally verified results, suggesting that iDeep is a promising approach for real-world applications. Conclusion The iDeep framework not only achieves better performance than the state-of-the-art predictors, but also readily captures interpretable binding motifs. iDeep is available at http://www.csbio.sjtu.edu.cn/bioinf/iDeep Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1561-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Xiaoyong Pan
- Department of Veterinary Clinical and Animal Sciences, University of Copenhagen, Copenhagen, Denmark.
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China.
| |
|
548
|
Abstract
Background The recent success of deep learning techniques in machine learning and artificial intelligence has stimulated a great deal of interest among bioinformaticians, who now wish to bring the power of deep learning to bear on a host of bioinformatics problems. Deep learning is ideally suited to biological problems that require automatic or hierarchical feature representation for biological data when prior knowledge is limited. In this work, we address the sequence-specific bias correction problem for RNA-seq data using Recurrent Neural Networks (RNNs) to model nucleotide sequences without pre-determining sequence structures. The sequence-specific bias of a read is then calculated based on the sequence probabilities estimated by RNNs, and used in the estimation of gene abundance. Results We explore the application of two popular RNN recurrent units for this task and demonstrate that RNN-based approaches provide a flexible way to model nucleotide sequences without knowledge of predetermined sequence structures. Our experiments show that training an RNN-based nucleotide sequence model is efficient and that RNN-based bias correction methods compare well with the state-of-the-art sequence-specific bias correction method on the commonly used MAQC-III data set. Conclusions RNNs provide an alternative and flexible way to calculate sequence-specific bias without explicitly pre-determining sequence structures. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-3262-5) contains supplementary material, which is available to authorized users.
|
549
|
Wouters J, Kalender Atak Z, Aerts S. Decoding transcriptional states in cancer. Curr Opin Genet Dev 2017; 43:82-92. [PMID: 28129557 DOI: 10.1016/j.gde.2017.01.003] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 09/09/2016] [Revised: 01/05/2017] [Accepted: 01/09/2017] [Indexed: 12/27/2022]
Abstract
Gene regulatory networks determine cellular identity. In cancer, aberrations of gene networks are caused by driver mutations that often affect transcription factors and chromatin modifiers. Nevertheless, gene transcription in cancer follows the same cis-regulatory rules as normal cells, and cancer cells have served as convenient model systems to study transcriptional regulation. Tumours often show regulatory heterogeneity, with subpopulations of cells in different transcriptional states, which has important therapeutic implications. Here, we review recent experimental and computational techniques to reverse engineer cancer gene networks using transcriptome and epigenome data. New algorithms, data integration strategies, and increasing amounts of single cell genomics data provide exciting opportunities to model dynamic regulatory states at unprecedented resolution.
Affiliation(s)
- Jasper Wouters
- Laboratory of Computational Biology, VIB Center for Brain & Disease Research, Leuven, Belgium; Department of Human Genetics, KU Leuven (University of Leuven), Leuven, Belgium
| | - Zeynep Kalender Atak
- Laboratory of Computational Biology, VIB Center for Brain & Disease Research, Leuven, Belgium; Department of Human Genetics, KU Leuven (University of Leuven), Leuven, Belgium
| | - Stein Aerts
- Laboratory of Computational Biology, VIB Center for Brain & Disease Research, Leuven, Belgium; Department of Human Genetics, KU Leuven (University of Leuven), Leuven, Belgium.
| |
|
550
|
Lanchantin J, Singh R, Wang B, Qi Y. Deep Motif Dashboard: visualizing and understanding genomic sequences using deep neural networks. Pac Symp Biocomput 2017; 22:254-265. [PMID: 27896980 PMCID: PMC5787355 DOI: 10.1142/9789813207813_0025] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.8] [Indexed: 05/22/2023]
Abstract
Deep neural network (DNN) models have recently obtained state-of-the-art prediction accuracy for the transcription factor binding site (TFBS) classification task. However, it remains unclear how these approaches identify meaningful DNA sequence signals and give insights as to why TFs bind to certain locations. In this paper, we propose a toolkit called the Deep Motif Dashboard (DeMo Dashboard), which provides a suite of visualization strategies to extract motifs, or sequence patterns, from deep neural network models for TFBS classification. We demonstrate how to visualize and understand three important DNN models: convolutional, recurrent, and convolutional-recurrent networks. Our first visualization method finds a test sequence's saliency map, which uses first-order derivatives to describe the importance of each nucleotide in making the final prediction. Second, considering that recurrent models make predictions in a temporal manner (from one end of a TFBS sequence to the other), we introduce temporal output scores, indicating the prediction score of a model over time for a sequential input. Lastly, a class-specific visualization strategy finds the optimal input sequence for a given TFBS positive class via stochastic gradient optimization. Our experimental results indicate that a convolutional-recurrent architecture performs the best among the three architectures. The visualization techniques indicate that CNN-RNN makes predictions by modeling both motifs and the dependencies among them.
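To make the saliency-map idea concrete, the sketch below computes a per-nucleotide saliency from the first-order derivative of a model score with respect to a one-hot encoded sequence. It is only a sketch under stated assumptions: the "model" is a toy logistic scorer with random, untrained weights (a stand-in for a trained DNN), the gradient is multiplied by the input to keep the entry of the observed base, and all names are hypothetical.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x


def score(x, w, b):
    """Toy stand-in model: sigmoid of a position-specific linear score."""
    return 1.0 / (1.0 + np.exp(-(np.sum(w * x) + b)))


def saliency(x, w, b):
    """Per-position saliency: |gradient * input| summed over the four bases."""
    s = score(x, w, b)
    grad = s * (1.0 - s) * w  # analytic derivative of the sigmoid score w.r.t. x
    return np.abs(np.sum(grad * x, axis=1))


rng = np.random.default_rng(0)
seq = "ACGTTGCA"
w = rng.normal(size=(len(seq), 4))  # hypothetical weights, not a trained model
sal = saliency(one_hot(seq), w, b=0.0)
print(sal.round(3))  # one non-negative importance value per nucleotide
```

With a real network the analytic gradient would be replaced by automatic differentiation; the surrounding logic (encode, differentiate the score with respect to the input, reduce to one value per position) stays the same.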
Affiliation(s)
- Jack Lanchantin
- Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA
| | | | | | | |
|