1
Liu Q, Hua K, Zhang X, Wong WH, Jiang R. DeepCAGE: Incorporating Transcription Factors in Genome-wide Prediction of Chromatin Accessibility. Genomics, Proteomics & Bioinformatics 2022; 20:496-507. [PMID: 35293310 PMCID: PMC9801045 DOI: 10.1016/j.gpb.2021.08.015] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/31/2021] [Revised: 05/31/2021] [Accepted: 09/27/2021] [Indexed: 01/26/2023]
Abstract
Although computational approaches have been complementing high-throughput biological experiments for the identification of functional regions in the human genome, it remains a great challenge to systematically decipher interactions between transcription factors (TFs) and regulatory elements to achieve interpretable annotations of chromatin accessibility across diverse cellular contexts. To solve this problem, we propose DeepCAGE, a deep learning framework that integrates sequence information and binding statuses of TFs, for the accurate prediction of chromatin accessible regions at a genome-wide scale in a variety of cell types. DeepCAGE takes advantage of a densely connected deep convolutional neural network architecture to automatically learn sequence signatures of known chromatin accessible regions and then incorporates such features with expression levels and binding activities of human core TFs to predict novel chromatin accessible regions. In a series of systematic comparisons with existing methods, DeepCAGE exhibits superior performance in not only the classification but also the regression of chromatin accessibility signals. In a detailed analysis of TF activities, DeepCAGE successfully extracts novel binding motifs and measures the contribution of a TF to the regulation with respect to a specific locus in a certain cell type. When applied to whole-genome sequencing data analysis, our method successfully prioritizes putative deleterious variants underlying a human complex trait and thus provides insights into the understanding of disease-associated genetic variants. DeepCAGE can be downloaded from https://github.com/kimmo1019/DeepCAGE.
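As a rough illustration of the sequence input that a convolutional model of this kind typically consumes, the sketch below one-hot encodes a DNA string. The abstract does not state DeepCAGE's actual encoding, so the function name `one_hot`, the A/C/G/T column order, and the all-zero treatment of ambiguous bases are assumptions made for illustration only.

```python
from typing import List

# Column order of the encoding; an assumption, not taken from the paper.
BASES = "ACGT"


def one_hot(seq: str) -> List[List[int]]:
    """Encode a DNA string as a len(seq) x 4 matrix.

    Bases outside A/C/G/T (for example N) are encoded as an all-zero row.
    """
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        idx = BASES.find(base)
        if idx >= 0:
            row[idx] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    for base, row in zip("ACGTN", one_hot("ACGTN")):
        print(base, row)
```

In a full model, a matrix like this would be the input to the convolutional layers, with the TF expression and binding features joined at a later stage.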
Affiliation(s)
- Qiao Liu
- Ministry of Education Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China; Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Kui Hua
- Ministry of Education Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Xuegong Zhang
- Ministry of Education Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Wing Hung Wong
- Department of Statistics, Stanford University, Stanford, CA 94305, USA. Corresponding authors.
- Rui Jiang
- Ministry of Education Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China. Corresponding authors.
2
SemanticCAP: Chromatin Accessibility Prediction Enhanced by Features Learning from a Language Model. Genes (Basel) 2022; 13:genes13040568. [PMID: 35456374 PMCID: PMC9028922 DOI: 10.3390/genes13040568] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2022] [Revised: 03/22/2022] [Accepted: 03/22/2022] [Indexed: 11/16/2022]
Abstract
A large number of inorganic and organic compounds are able to bind DNA and form complexes, among which drug-related molecules are important. Chromatin accessibility changes not only directly affect drug–DNA interactions, but they can promote or inhibit the expression of the critical genes associated with drug resistance by affecting the DNA binding capacity of TFs and transcriptional regulators. However, the biological experimental techniques for measuring it are expensive and time-consuming. In recent years, several kinds of computational methods have been proposed to identify accessible regions of the genome. Existing computational models mostly ignore the contextual information provided by the bases in gene sequences. To address these issues, we proposed a new solution called SemanticCAP. It introduces a gene language model that models the context of gene sequences and is thus able to provide an effective representation of a certain site in a gene sequence. Basically, we merged the features provided by the gene language model into our chromatin accessibility model. During the process, we designed methods called SFA and SFC to make feature fusion smoother. Compared to DeepSEA, gkm-SVM, and k-mer using public benchmarks, our model proved to have better performance, showing a 1.25% maximum improvement in auROC and a 2.41% maximum improvement in auPRC.
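The auROC figure quoted above can be read as the probability that a randomly chosen accessible region scores higher than a randomly chosen inaccessible one. The sketch below computes it directly from that pairwise definition. It is a generic illustration, not the evaluation code used for SemanticCAP, and the function name `auroc` and its arguments are hypothetical.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

The quadratic pairwise loop is chosen for clarity; on benchmark-sized data a rank-based computation would be the usual choice.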
3
Thakur RK, Prasad P, Bhardwaj SC, Gangwar OP, Kumar S. Epigenetics of wheat-rust interaction: an update. Planta 2022; 255:50. [PMID: 35084577 DOI: 10.1007/s00425-022-03829-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2021] [Accepted: 01/08/2022] [Indexed: 06/14/2023]
Abstract
The outcome of different host-pathogen interactions is influenced by both genetic and epigenetic systems, which determine the response of plants to pathogens and vice versa. This review highlights key molecular mechanisms and conceptual advances involved in epigenetic research and the progress made in epigenetics of wheat-rust interactions. Epigenetics implies the heritable changes in the way of gene expression as a consequence of the modification of DNA bases, histone proteins, and/or non-coding-RNA biogenesis without disturbing the underlying nucleotide sequence. The changes occurring between DNA and its surrounding chromatin without altering its DNA sequence and leading to significant changes in the genome of any organism are called epigenetic changes. Epigenetics has already been used successfully to explain the mechanism of human pathogens and in the identification of pathogen-induced modifications within various host plants. Wheat rusts are one of the most vital fungal diseases throughout the major wheat-growing areas of the world. The epigenome in plant pathogens causing diseases such as wheat rusts is mysterious. The investigations of host and pathogen epigenetics in the wheat rusts system can offer a piece of suitable evidence for elucidation of the molecular basis of host-pathogen interaction. Besides, the information on the epigenetic regulation of the genes involved in resistance or pathogenicity will provide better insights into the complex resistance signaling pathways and could provide answers to certain key questions, such as whether epigenetic regulation of certain genes is imparting resistance to host in response of certain pathogen elicitors or not. In the last few years, there has been an upsurge in research on the host as well as pathogen epigenetics and its outcome in plant-pathogen interactions. 
This review summarizes the progress made in the areas related to the epigenetic control of host-pathogen interaction with particular emphasis on wheat rusts.
Affiliation(s)
- Rajni Kant Thakur
- ICAR-Indian Institute of Wheat and Barley Research, Regional Station, Shimla, Himachal Pradesh, 171002, India
- Pramod Prasad
- ICAR-Indian Institute of Wheat and Barley Research, Regional Station, Shimla, Himachal Pradesh, 171002, India.
- S C Bhardwaj
- ICAR-Indian Institute of Wheat and Barley Research, Regional Station, Shimla, Himachal Pradesh, 171002, India.
- O P Gangwar
- ICAR-Indian Institute of Wheat and Barley Research, Regional Station, Shimla, Himachal Pradesh, 171002, India
- Subodh Kumar
- ICAR-Indian Institute of Wheat and Barley Research, Regional Station, Shimla, Himachal Pradesh, 171002, India
4
Liu H, Yan F. Gene Regulation Network Modeling and Mechanism Analysis Based on MicroRNA-Disease Related Data. Systems Medicine 2021. [DOI: 10.1016/b978-0-12-801238-3.11339-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
5
Targeting Chromatin Complexes in Myeloid Malignancies and Beyond: From Basic Mechanisms to Clinical Innovation. Cells 2020; 9:cells9122721. [PMID: 33371192 PMCID: PMC7767226 DOI: 10.3390/cells9122721] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 11/19/2020] [Revised: 12/13/2020] [Accepted: 12/20/2020] [Indexed: 12/12/2022]
Abstract
The aberrant function of chromatin regulatory networks (epigenetics) is a hallmark of cancer promoting oncogenic gene expression. A growing body of evidence suggests that the disruption of specific chromatin-associated protein complexes has therapeutic potential in malignant conditions, particularly those that are driven by aberrant chromatin modifiers. Of note, a number of enzymatic inhibitors that block the catalytic function of histone modifying enzymes have been established and entered clinical trials. Unfortunately, many of these molecules do not have potent single-agent activity. One potential explanation for this phenomenon is the fact that those drugs do not profoundly disrupt the integrity of the aberrant network of multiprotein complexes on chromatin. Recent advances in drug development have led to the establishment of novel inhibitors of protein–protein interactions as well as targeted protein degraders that may provide inroads into longstanding efforts to physically disrupt oncogenic multiprotein complexes on chromatin. In this review, we summarize some of the current concepts on the role of epigenetic modifiers in malignant chromatin states with a specific focus on myeloid malignancies and recent advances in early-phase clinical trials.
6
Li Y, Ma A, Mathé EA, Li L, Liu B, Ma Q. Elucidation of Biological Networks across Complex Diseases Using Single-Cell Omics. Trends Genet 2020; 36:951-966. [PMID: 32868128 DOI: 10.1016/j.tig.2020.08.004] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 05/17/2020] [Revised: 07/29/2020] [Accepted: 08/04/2020] [Indexed: 12/14/2022]
Abstract
Single-cell multimodal omics (scMulti-omics) technologies have made it possible to trace cellular lineages during differentiation and to identify new cell types in heterogeneous cell populations. The derived information is especially promising for computing cell-type-specific biological networks encoded in complex diseases and improving our understanding of the underlying gene regulatory mechanisms. The integration of these networks could, therefore, give rise to a heterogeneous regulatory landscape (HRL) in support of disease diagnosis and drug therapeutics. In this review, we provide an overview of this field and pay particular attention to how diverse biological networks can be inferred in a specific cell type based on integrative methods. Then, we discuss how HRL can advance our understanding of regulatory mechanisms underlying complex diseases and aid in the prediction of prognosis and therapeutic responses. Finally, we outline challenges and future trends that will be central to bringing the field of HRL in complex diseases forward.
Affiliation(s)
- Yang Li
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, 43210, USA
- Anjun Ma
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, 43210, USA
- Ewy A Mathé
- Division of Preclinical Innovation, National Center for Advancing Translational Sciences, National Institutes of Health (NIH), Rockville, MD, 20892, USA
- Lang Li
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, 43210, USA
- Bingqiang Liu
- School of Mathematics, Shandong University, Jinan, Shandong, 250100, China.
- Qin Ma
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, 43210, USA.
7
Pourkheirandish M, Golicz AA, Bhalla PL, Singh MB. Global Role of Crop Genomics in the Face of Climate Change. Frontiers in Plant Science 2020; 11:922. [PMID: 32765541 PMCID: PMC7378793 DOI: 10.3389/fpls.2020.00922] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 12/24/2019] [Accepted: 06/05/2020] [Indexed: 05/05/2023]
Abstract
The development of climate change resilient crops is necessary if we are to meet the challenge of feeding the growing world's population. We must be able to increase food production despite the projected decrease in arable land and unpredictable environmental conditions. This review summarizes the technological and conceptual advances that have the potential to transform plant breeding, help overcome the challenges of climate change, and initiate the next plant breeding revolution. Recent developments in genomics in combination with high-throughput and precision phenotyping facilitate the identification of genes controlling critical agronomic traits. The discovery of these genes can now be paired with genome editing techniques to rapidly develop climate change resilient crops, including plants with better biotic and abiotic stress tolerance and enhanced nutritional value. Utilizing the genetic potential of crop wild relatives (CWRs) enables the domestication of new species and the generation of synthetic polyploids. The high-quality crop plant genome assemblies and annotations provide new, exciting research targets, including long non-coding RNAs (lncRNAs) and cis-regulatory regions. Metagenomic studies give insights into plant-microbiome interactions and guide selection of optimal soils for plant cultivation. Together, all these advances will allow breeders to produce improved, resilient crops in relatively short timeframes meeting the demands of the growing population and changing climate.
Affiliation(s)
- Mohan B. Singh
- Plant Molecular Biology and Biotechnology Laboratory, Faculty of Veterinary and Agricultural Sciences, University of Melbourne, Parkville, VIC, Australia
8
Yan F, Powell DR, Curtis DJ, Wong NC. From reads to insight: a hitchhiker's guide to ATAC-seq data analysis. Genome Biol 2020; 21:22. [PMID: 32014034 PMCID: PMC6996192 DOI: 10.1186/s13059-020-1929-3] [Citation(s) in RCA: 196] [Impact Index Per Article: 49.0] [Received: 07/22/2019] [Accepted: 01/08/2020] [Indexed: 12/16/2022]
Abstract
Assay of Transposase Accessible Chromatin sequencing (ATAC-seq) is widely used in studying chromatin biology, but a comprehensive review of the analysis tools has not been completed yet. Here, we discuss the major steps in ATAC-seq data analysis, including pre-analysis (quality check and alignment), core analysis (peak calling), and advanced analysis (peak differential analysis and annotation, motif enrichment, footprinting, and nucleosome position analysis). We also review the reconstruction of transcriptional regulatory networks with multiomics data and highlight the current challenges of each step. Finally, we describe the potential of single-cell ATAC-seq and highlight the necessity of developing ATAC-seq specific analysis tools to obtain biologically meaningful insights.
Affiliation(s)
- Feng Yan
- Australian Centre for Blood Diseases, Central Clinical School, Monash University, Melbourne, VIC, Australia
- David R Powell
- Monash Bioinformatics Platform, Monash University, Melbourne, VIC, Australia
- David J Curtis
- Australian Centre for Blood Diseases, Central Clinical School, Monash University, Melbourne, VIC, Australia; Department of Clinical Haematology, Alfred Health, Melbourne, VIC, Australia
- Nicholas C Wong
- Australian Centre for Blood Diseases, Central Clinical School, Monash University, Melbourne, VIC, Australia; Monash Bioinformatics Platform, Monash University, Melbourne, VIC, Australia
9
Duren Z, Wang Y, Wang J, Zhao XM, Lv L, Li X, Liu J, Zhu XG, Chen L, Wang Y. Hierarchical graphical model reveals HFR1 bridging circadian rhythm and flower development in Arabidopsis thaliana. NPJ Syst Biol Appl 2019; 5:28. [PMID: 31428455 PMCID: PMC6690920 DOI: 10.1038/s41540-019-0106-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/26/2018] [Accepted: 07/23/2019] [Indexed: 01/02/2023]
Abstract
To study systems-level properties of the cell, it is necessary to go beyond individual regulators and target genes to study the regulatory network among transcription factors (TFs). However, it is difficult to directly dissect the TFs mediated genome-wide gene regulatory network (GRN) by experiment. Here, we proposed a hierarchical graphical model to estimate TF activity from mRNA expression by building TF complexes with protein cofactors and inferring TF's downstream regulatory network simultaneously. Then we applied our model on flower development and circadian rhythm processes in Arabidopsis thaliana. The computational results show that the sequence specific bHLH family TF HFR1 recruits the chromatin regulator HAC1 to flower development master regulator TF AG and further activates AG's expression by histone acetylation. Both independent data and experimental results supported this discovery. We also found a flower tissue specific H3K27ac ChIP-seq peak at AG gene body and a HFR1 motif in the center of this H3K27ac peak. Furthermore, we verified that HFR1 physically interacts with HAC1 by yeast two-hybrid experiment. This HFR1-HAC1-AG triplet relationship may imply that flower development and circadian rhythm are bridged by epigenetic regulation and enrich the classical ABC model in flower development. In addition, our TF activity network can serve as a general method to elucidate molecular mechanisms on other complex biological regulatory processes.
Affiliation(s)
- Zhana Duren
- CEMS, NCMIS, MDIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190 China
- University of Chinese Academy of Sciences, Beijing, 100049 China
- Yaling Wang
- State Key Laboratory of Molecular Plant Sciences and Center of Excellence for Molecular Plant Sciences, Chinese Academy of Sciences, Shanghai, 200032 China
- Jiguang Wang
- Division of Life Science, Department of Chemical and Biological Engineering, Center of Systems Biology and Human Health, State Key Laboratory of Molecular Neuroscience, The Hong Kong University of Science and Technology, Hong Kong, China
- Xing-Ming Zhao
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, 200433 China
- Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, Shanghai, China
- Le Lv
- Bayer U.S. – Crop Science, Monsanto Legal Entity, St. Louis, MO 63156 USA
- Xiaobo Li
- Department of Plant Biology, Carnegie Institution for Science, 260 Panama Street, Stanford, CA 94305 USA
- Jingdong Liu
- Bayer U.S. – Crop Science, Monsanto Legal Entity, St. Louis, MO 63156 USA
- Xin-Guang Zhu
- State Key Laboratory of Molecular Plant Sciences and Center of Excellence for Molecular Plant Sciences, Chinese Academy of Sciences, Shanghai, 200032 China
- Luonan Chen
- Key Laboratory of Systems Biology, Center for Excellence in Molecular Cell Science, Institute of Biochemistry and Cell Biology, Chinese Academy of Sciences, Shanghai, 200031 China
- Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223 China
- School of Life Science and Technology, ShanghaiTech University, Shanghai, 201210 China
- Research Center for Brain Science and Brain-Inspired Intelligence, 201210 Shanghai, China
- Yong Wang
- CEMS, NCMIS, MDIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190 China
- University of Chinese Academy of Sciences, Beijing, 100049 China
- Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223 China
10
Song S, Cui H, Chen S, Liu Q, Jiang R. EpiFIT: functional interpretation of transcription factors based on combination of sequence and epigenetic information. Quantitative Biology 2019. [DOI: 10.1007/s40484-019-0175-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/28/2022]
11
Ma S, Jiang T, Jiang R. Constructing tissue-specific transcriptional regulatory networks via a Markov random field. BMC Genomics 2018; 19:884. [PMID: 30598101 PMCID: PMC6311931 DOI: 10.1186/s12864-018-5277-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 12/16/2022]
Abstract
BACKGROUND Recent advances in sequencing technologies have enabled parallel assays of chromatin accessibility and gene expression for major human cell lines. Such innovation provides a great opportunity to decode phenotypic consequences of genetic variation via the construction of predictive gene regulatory network models. However, there still lacks a computational method to systematically integrate chromatin accessibility information with gene expression data to recover complicated regulatory relationships between genes in a tissue-specific manner. RESULTS We propose a Markov random field (MRF) model for constructing tissue-specific transcriptional regulatory networks via integrative analysis of DNase-seq and RNA-seq data. Our method, named CSNets (cell-line specific regulatory networks), first infers regulatory networks for individual cell lines using chromatin accessibility information, and then fine-tunes these networks using the MRF based on pairwise similarity between cell lines derived from gene expression data. Using this method, we constructed regulatory networks specific to 110 human cell lines and 13 major tissues with the use of ENCODE data. We demonstrated the high quality of these networks via comprehensive statistical analysis based on ChIP-seq profiles, functional annotations, taxonomic analysis, and literature surveys. We further applied these networks to analyze GWAS data of Crohn's disease and prostate cancer. Results were either consistent with the literature or provided biological insights into regulatory mechanisms of these two complex diseases. The website of CSNets is freely available at http://bioinfo.au.tsinghua.edu.cn/jianglab/CSNETS/ . CONCLUSIONS CSNets demonstrated the power of joint analysis on epigenomic and transcriptomic data towards the accurate construction of gene regulatory network. 
Our work provides not only a useful resource of regulatory networks to the community, but also valuable experiences in methodology development for multi-omics data integration.
Affiliation(s)
- Shining Ma
- Department of Statistics, Department of Biomedical Data Science, Bio-X Program Stanford University, Stanford, CA 94305 USA
- Tao Jiang
- Ministry of Education Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Department of Automation, Tsinghua University, Beijing, 100084 China
- Department of Computer Science and Engineering, University of California, Riverside, CA 92521 USA
- Rui Jiang
- Ministry of Education Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Department of Automation, Tsinghua University, Beijing, 100084 China
12
Min X, Zeng W, Chen N, Chen T, Jiang R. Chromatin accessibility prediction via convolutional long short-term memory networks with k-mer embedding. Bioinformatics 2018; 33:i92-i101. [PMID: 28881969 PMCID: PMC5870572 DOI: 10.1093/bioinformatics/btx234] [Citation(s) in RCA: 80] [Impact Index Per Article: 13.3] [Indexed: 01/30/2023]
Abstract
Motivation Experimental techniques for measuring chromatin accessibility are expensive and time consuming, appealing for the development of computational approaches to predict open chromatin regions from DNA sequences. Along this direction, existing methods fall into two classes: one based on handcrafted k-mer features and the other based on convolutional neural networks. Although both categories have shown good performance in specific applications thus far, there still lacks a comprehensive framework to integrate useful k-mer co-occurrence information with recent advances in deep learning. Results We fill this gap by addressing the problem of chromatin accessibility prediction with a convolutional Long Short-Term Memory (LSTM) network with k-mer embedding. We first split DNA sequences into k-mers and pre-train k-mer embedding vectors based on the co-occurrence matrix of k-mers by using an unsupervised representation learning approach. We then construct a supervised deep learning architecture comprised of an embedding layer, three convolutional layers and a Bidirectional LSTM (BLSTM) layer for feature learning and classification. We demonstrate that our method gains high-quality fixed-length features from variable-length sequences and consistently outperforms baseline methods. We show that k-mer embedding can effectively enhance model performance by exploring different embedding strategies. We also prove the efficacy of both the convolution and the BLSTM layers by comparing two variations of the network architecture. We confirm the robustness of our model to hyper-parameters by performing sensitivity analysis. We hope our method can eventually reinforce our understanding of employing deep learning in genomic studies and shed light on research regarding mechanisms of chromatin accessibility. Availability and implementation The source code can be downloaded from https://github.com/minxueric/ismb2017_lstm. 
Supplementary information Supplementary materials are available at Bioinformatics online.
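The first step described in the abstract above, splitting sequences into k-mers and tabulating their co-occurrence, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `split_kmers` and `cooccurrence`, the `stride` and `window` parameters, and the symmetric counting scheme are all assumptions. The resulting counts are the kind of input an unsupervised embedding method would be trained on.

```python
from collections import Counter
from typing import List


def split_kmers(seq: str, k: int, stride: int = 1) -> List[str]:
    """Slide a window of length k over the sequence and collect the k-mers."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def cooccurrence(kmers: List[str], window: int = 2) -> Counter:
    """Count unordered k-mer pairs that occur within `window` positions of each other."""
    counts: Counter = Counter()
    for i, center in enumerate(kmers):
        # Look only forward so that each pair of positions is counted once.
        for j in range(i + 1, min(i + window + 1, len(kmers))):
            pair = tuple(sorted((center, kmers[j])))
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    kmers = split_kmers("ACGTAC", k=3)
    print(kmers)
    print(cooccurrence(kmers, window=1))
```

Because the counts are keyed by sorted pairs, the table is symmetric by construction, which is what most co-occurrence-based embedding objectives expect.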
Affiliation(s)
- Xu Min
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Tsinghua University, Beijing, China; Department of Computer Science and Technology, State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing, China
- Wanwen Zeng
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Tsinghua University, Beijing, China; Department of Automation, Tsinghua University, Beijing, China
- Ning Chen
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Tsinghua University, Beijing, China; Department of Computer Science and Technology, State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing, China
- Ting Chen
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Tsinghua University, Beijing, China; Department of Computer Science and Technology, State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing, China; Program in Computational Biology and Bioinformatics, University of Southern California, CA, USA
- Rui Jiang
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Tsinghua University, Beijing, China; Department of Automation, Tsinghua University, Beijing, China
13
Wang L, Li X, Zhang L, Gao Q. Improved anticancer drug response prediction in cell lines using matrix factorization with similarity regularization. BMC Cancer 2017; 17:513. [PMID: 28768489 PMCID: PMC5541434 DOI: 10.1186/s12885-017-3500-5] [Citation(s) in RCA: 90] [Impact Index Per Article: 12.9] [Received: 02/19/2017] [Accepted: 07/24/2017] [Indexed: 11/10/2022]
Abstract
BACKGROUND Human cancer cell lines are used in research to study the biology of cancer and to test cancer treatments. Recently there are already some large panels of several hundred human cancer cell lines which are characterized with genomic and pharmacological data. The ability to predict drug responses using these pharmacogenomics data can facilitate the development of precision cancer medicines. Although several methods have been developed to address the drug response prediction, there are many challenges in obtaining accurate prediction. METHODS Based on the fact that similar cell lines and similar drugs exhibit similar drug responses, we adopted a similarity-regularized matrix factorization (SRMF) method to predict anticancer drug responses of cell lines using chemical structures of drugs and baseline gene expression levels in cell lines. Specifically, chemical structural similarity of drugs and gene expression profile similarity of cell lines were considered as regularization terms, which were incorporated to the drug response matrix factorization model. RESULTS We first demonstrated the effectiveness of SRMF using a set of simulation data and compared it with two typical similarity-based methods. Furthermore, we applied it to the Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE) datasets, and performance of SRMF exceeds three state-of-the-art methods. We also applied SRMF to estimate the missing drug response values in the GDSC dataset. Even though SRMF does not specifically model mutation information, it could correctly predict drug-cancer gene associations that are consistent with existing data, and identify novel drug-cancer gene associations that are not found in existing data as well. SRMF can also aid in drug repositioning. 
The newly predicted drug responses for the GDSC dataset suggest that non-small cell lung cancer (NSCLC) cell lines are sensitive to the mTOR inhibitor rapamycin, and that expression of AKR1C3 and HINT1 may be adjunct markers of cell line sensitivity to rapamycin. CONCLUSIONS Our analysis showed that the proposed data integration method is able to improve the accuracy of prediction of anticancer drug responses in cell lines, and can identify consistent and novel drug-cancer gene associations compared to existing data as well as aid in drug repositioning.
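One way to write down a similarity-regularized factorization of the kind described in this abstract is the objective below. The notation is an assumption introduced here for illustration and is not quoted from the abstract: $Y$ is the drug-by-cell-line response matrix, $W$ is a 0/1 mask marking observed entries, $U$ and $V$ are low-rank drug and cell-line factors, $S_d$ and $S_c$ are the drug (chemical structure) and cell-line (expression profile) similarity matrices, and $\lambda_l$, $\lambda_d$, $\lambda_c$ are regularization weights.

```latex
\min_{U,V}\;
  \left\lVert W \circ \left( Y - U V^{\top} \right) \right\rVert_F^2
  + \lambda_l \left( \lVert U \rVert_F^2 + \lVert V \rVert_F^2 \right)
  + \lambda_d \left\lVert S_d - U U^{\top} \right\rVert_F^2
  + \lambda_c \left\lVert S_c - V V^{\top} \right\rVert_F^2
```

The first term fits the observed responses, the second keeps the factors small, and the last two pull similar drugs and similar cell lines toward similar latent vectors, which is what lets missing entries of $Y$ be estimated from $U V^{\top}$.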
Affiliation(s)
- Lin Wang
- School of Computer Science and Information Engineering, Tianjin University of Science and Technology, Tianjin, 300457, China.
- Xiaozhong Li
- School of Computer Science and Information Engineering, Tianjin University of Science and Technology, Tianjin, 300457, China
- Louxin Zhang
- Department of Mathematics, National University of Singapore, Singapore, 119076, Singapore
- Qiang Gao
- Key Lab of Industrial Fermentation Microbiology, Ministry of Education & Tianjin City, College of Biotechnology, Tianjin University of Science and Technology, Tianjin, 300457, China