1. Chandrashekar PB, Chen H, Lee M, Ahmadinejad N, Liu L. DeepCORE: An interpretable multi-view deep neural network model to detect co-operative regulatory elements. Comput Struct Biotechnol J 2024; 23:679-687. [PMID: 38292477 PMCID: PMC10825326 DOI: 10.1016/j.csbj.2023.12.044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Revised: 12/14/2023] [Accepted: 12/27/2023] [Indexed: 02/01/2024] Open
Abstract
Gene transcription is an essential process involved in all aspects of cellular functions with significant impact on biological traits and diseases. This process is tightly regulated by multiple elements that co-operate to jointly modulate the transcription levels of target genes. To decipher the complicated regulatory network, we present a novel multi-view attention-based deep neural network that models the relationship between genetic, epigenetic, and transcriptional patterns and identifies co-operative regulatory elements (COREs). We applied this new method, named DeepCORE, to predict transcriptomes in various tissues and cell lines, which outperformed the state-of-the-art algorithms. Furthermore, DeepCORE contains an interpreter that extracts the attention values embedded in the deep neural network, maps the attended regions to putative regulatory elements, and infers COREs based on correlated attentions. The identified COREs are significantly enriched with known promoters and enhancers. Novel regulatory elements discovered by DeepCORE showed epigenetic signatures consistent with the status of histone modification marks.
Affiliation(s)
- Pramod Bharadwaj Chandrashekar
- Waisman Center, University of Wisconsin-Madison, Madison, WI 53705, USA
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53076, USA
- Hai Chen
- College of Health Solutions, Arizona State University, Phoenix, AZ, United States
- Biodesign Institute, Arizona State University, Tempe, AZ, United States
- Matthew Lee
- College of Health Solutions, Arizona State University, Phoenix, AZ, United States
- Navid Ahmadinejad
- College of Health Solutions, Arizona State University, Phoenix, AZ, United States
- Biodesign Institute, Arizona State University, Tempe, AZ, United States
- Li Liu
- College of Health Solutions, Arizona State University, Phoenix, AZ, United States
- Biodesign Institute, Arizona State University, Tempe, AZ, United States
2. Wang Z, Peng Y, Li J, Li J, Yuan H, Yang S, Ding X, Xie A, Zhang J, Wang S, Li K, Shi J, Xing G, Shi W, Yan J, Liu J. DeepCBA: A deep learning framework for gene expression prediction in maize based on DNA sequences and chromatin interactions. Plant Communications 2024; 5:100985. [PMID: 38859587 DOI: 10.1016/j.xplc.2024.100985] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2024] [Revised: 05/25/2024] [Accepted: 06/05/2024] [Indexed: 06/12/2024]
Abstract
Chromatin interactions create spatial proximity between distal regulatory elements and target genes in the genome, which has an important impact on gene expression, transcriptional regulation, and phenotypic traits. To date, several methods have been developed for predicting gene expression. However, existing methods do not take into consideration the effect of chromatin interactions on target gene expression, thus potentially reducing the accuracy of gene expression prediction and mining of important regulatory elements. In this study, we developed a highly accurate deep learning-based gene expression prediction model (DeepCBA) based on maize chromatin interaction data. Compared with existing models, DeepCBA exhibits higher accuracy in expression classification and expression value prediction. The average Pearson correlation coefficients (PCCs) for predicting gene expression using gene promoter proximal interactions, proximal-distal interactions, and both proximal and distal interactions were 0.818, 0.625, and 0.929, respectively, representing an increase of 0.357, 0.16, and 0.469 over the PCCs obtained with traditional methods that use only gene proximal sequences. Some important motifs were identified through DeepCBA; they were enriched in open chromatin regions and expression quantitative trait loci and showed clear tissue specificity. Importantly, experimental results for the maize flowering-related gene ZmRap2.7 and the tillering-related gene ZmTb1 demonstrated the feasibility of DeepCBA for exploration of regulatory elements that affect gene expression. Moreover, promoter editing and verification of two reported genes (ZmCLE7 and ZmVTE4) demonstrated the utility of DeepCBA for the precise design of gene expression and even for future intelligent breeding. DeepCBA is available at http://www.deepcba.com/ or http://124.220.197.196/.
Affiliation(s)
- Zhenye Wang
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Yong Peng
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Hongshan Laboratory, Wuhan 430070, China
- Jie Li
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Jiying Li
- Microsoft Corporation, Redmond, WA 98052, USA
- Hao Yuan
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Shangpo Yang
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Xinru Ding
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Ao Xie
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Jiangling Zhang
- College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Shouzhe Wang
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Hongshan Laboratory, Wuhan 430070, China; WIMI Biotechnology Co., Ltd., Changzhou 213000, China
- Keqin Li
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Jiaqi Shi
- College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Guangjie Xing
- College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Weihan Shi
- College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Jianbing Yan
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Hongshan Laboratory, Wuhan 430070, China
- Jianxiao Liu
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; College of Informatics, Huazhong Agricultural University, Wuhan 430070, China; Hubei Hongshan Laboratory, Wuhan 430070, China.
3. Gonzalez-Avalos E, Onodera A, Samaniego-Castruita D, Rao A, Ay F. Predicting gene expression state and prioritizing putative enhancers using 5hmC signal. Genome Biol 2024; 25:142. [PMID: 38825692 PMCID: PMC11145787 DOI: 10.1186/s13059-024-03273-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2023] [Accepted: 05/11/2024] [Indexed: 06/04/2024] Open
Abstract
BACKGROUND Like its parent base 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC) is a direct epigenetic modification of cytosines in the context of CpG dinucleotides. 5hmC is the most abundant oxidized form of 5mC, generated through the action of TET dioxygenases at gene bodies of actively-transcribed genes and at active or lineage-specific enhancers. Although such enrichments are reported for 5hmC, to date, predictive models of gene expression state or putative regulatory regions for genes using 5hmC have not been developed. RESULTS Here, by using only 5hmC enrichment in genic regions and their vicinity, we develop neural network models that predict gene expression state across 49 cell types. We show that our deep neural network models distinguish high vs low expression state utilizing only 5hmC levels and these predictive models generalize to unseen cell types. Further, in order to leverage 5hmC signal in distal enhancers for expression prediction, we employ an Activity-by-Contact model and also develop a graph convolutional neural network model with both utilizing Hi-C data and 5hmC enrichment to prioritize enhancer-promoter links. These approaches identify known and novel putative enhancers for key genes in multiple immune cell subsets. CONCLUSIONS Our work highlights the importance of 5hmC in gene regulation through proximal and distal mechanisms and provides a framework to link it to genome function. With the recent advances in 6-letter DNA sequencing by short and long-read techniques, profiling of 5mC and 5hmC may be done routinely in the near future, hence, providing a broad range of applications for the methods developed here.
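The Activity-by-Contact idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes the commonly used formulation in which a candidate element's score for a gene is its activity multiplied by its contact frequency with the gene's promoter, normalized by the sum of that product over all candidate elements near the gene. Treating 5hmC enrichment as the activity term and Hi-C contact frequency as the contact term is an assumption based on the abstract; the function name `abc_scores` and the toy numbers are hypothetical and are not taken from the paper.

```python
def abc_scores(activity, contact):
    """Return one Activity-by-Contact style score per candidate element.

    activity: per-element activity values (assumed here: 5hmC enrichment).
    contact:  per-element contact frequency with the target gene's promoter
              (assumed here: taken from Hi-C data).
    """
    if len(activity) != len(contact):
        raise ValueError("activity and contact must have the same length")
    # Activity x contact for each candidate element.
    products = [a * c for a, c in zip(activity, contact)]
    total = sum(products)
    if total == 0:
        # No element has both signal and contact; return zeros instead of dividing by 0.
        return [0.0] * len(products)
    # Normalize so the scores of all candidates for this gene sum to 1.
    return [p / total for p in products]


# Toy example (made-up numbers): three candidate enhancers for one gene.
print(abc_scores([2.0, 1.0, 1.0], [0.5, 1.0, 0.0]))  # [0.5, 0.5, 0.0]
```

Elements can then be ranked by score to prioritize enhancer-promoter links; the third element scores zero because it has no contact with the promoter.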
Affiliation(s)
- Edahi Gonzalez-Avalos
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, 92093, USA
- Atsushi Onodera
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA
- Department of Immunology, Graduate School of Medicine, Chiba University, Chiba, 260-8670, Japan
- Daniela Samaniego-Castruita
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA
- Biological Sciences Graduate Program, University of California San Diego, La Jolla, CA, 92093, USA
- Anjana Rao
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA.
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, 92093, USA.
- Department of Pharmacology, University of California San Diego, La Jolla, CA, 92093, USA.
- Sanford Consortium for Regenerative Medicine, La Jolla, CA, 92093, USA.
- Moores Cancer Center, University of California San Diego, La Jolla, CA, 92093, USA.
- Ferhat Ay
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA.
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, 92093, USA.
- Moores Cancer Center, University of California San Diego, La Jolla, CA, 92093, USA.
- Department of Pediatrics, University of California San Diego, La Jolla, CA, 92093, USA.
4. Hwang H, Jeon H, Yeo N, Baek D. Big data and deep learning for RNA biology. Exp Mol Med 2024; 56:1293-1321. [PMID: 38871816 PMCID: PMC11263376 DOI: 10.1038/s12276-024-01243-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/31/2024] [Revised: 02/27/2024] [Accepted: 03/05/2024] [Indexed: 06/15/2024] Open
Abstract
The exponential growth of big data in RNA biology (RB) has led to the development of deep learning (DL) models that have driven crucial discoveries. As constantly evidenced by DL studies in other fields, the successful implementation of DL in RB depends heavily on the effective utilization of large-scale datasets from public databases. In achieving this goal, data encoding methods, learning algorithms, and techniques that align well with biological domain knowledge have played pivotal roles. In this review, we provide guiding principles for applying these DL concepts to various problems in RB by demonstrating successful examples and associated methodologies. We also discuss the remaining challenges in developing DL models for RB and suggest strategies to overcome these challenges. Overall, this review aims to illuminate the compelling potential of DL for RB and ways to apply this powerful technology to investigate the intriguing biology of RNA more effectively.
Affiliation(s)
- Hyeonseo Hwang
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
- Hyeonseong Jeon
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Genome4me Inc., Seoul, Republic of Korea
- Nagyeong Yeo
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
- Daehyun Baek
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea.
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea.
- Genome4me Inc., Seoul, Republic of Korea.
5. Xin R, Cheng Q, Chi X, Feng X, Zhang H, Wang Y, Duan M, Xie T, Song X, Yu Q, Fan Y, Huang L, Zhou F. Computational Characterization of Undifferentially Expressed Genes with Altered Transcription Regulation in Lung Cancer. Genes (Basel) 2023; 14:2169. [PMID: 38136991 PMCID: PMC10742656 DOI: 10.3390/genes14122169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2023] [Revised: 11/19/2023] [Accepted: 11/27/2023] [Indexed: 12/24/2023] Open
Abstract
A transcriptome profiles the expression levels of genes in cells and has accumulated a huge amount of public data. Most of the existing biomarker-related studies investigated the differential expression of individual transcriptomic features under the assumption of inter-feature independence. Many transcriptomic features without differential expression were ignored from the biomarker lists. This study proposed a computational analysis protocol (mqTrans) to analyze transcriptomes from the view of high-dimensional inter-feature correlations. The mqTrans protocol trained a regression model to predict the expression of an mRNA feature from those of the transcription factors (TFs). The difference between the predicted and real expression of an mRNA feature in a query sample was defined as the mqTrans feature. The new mqTrans view facilitated the detection of thirteen transcriptomic features with differentially expressed mqTrans features, but without differential expression in the original transcriptomic values in three independent datasets of lung cancer. These features were called dark biomarkers because they would have been ignored in a conventional differential analysis. The detailed discussion of one dark biomarker, GBP5, and additional validation experiments suggested that the overlapping long non-coding RNAs might have contributed to this interesting phenomenon. In summary, this study aimed to find undifferentially expressed genes with significantly changed mqTrans values in lung cancer. These genes were usually ignored in most biomarker detection studies of undifferential expression. However, their differentially expressed mqTrans values in three independent datasets suggested their strong associations with lung cancer.
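The mqTrans feature described in this abstract (the gap between predicted and observed mRNA expression) can be sketched as a regression residual. This is a minimal illustration, not the authors' implementation: ordinary least squares stands in for the unspecified regression model, the sign convention (observed minus predicted) is an assumption, and the function names `fit_reference_model` and `mqtrans_feature` as well as the toy data are hypothetical.

```python
import numpy as np


def fit_reference_model(tf_expr, mrna_expr):
    """Fit a linear model mRNA ~ TFs + intercept on reference samples.

    tf_expr:   array of shape (n_samples, n_tfs) with TF expression levels.
    mrna_expr: array of shape (n_samples,) with the target mRNA's expression.
    Returns the coefficient vector; the intercept is the last element.
    """
    tf_expr = np.asarray(tf_expr, dtype=float)
    mrna_expr = np.asarray(mrna_expr, dtype=float)
    design = np.column_stack([tf_expr, np.ones(len(tf_expr))])
    coef, *_ = np.linalg.lstsq(design, mrna_expr, rcond=None)
    return coef


def mqtrans_feature(coef, tf_query, mrna_query):
    """Return observed minus predicted expression for each query sample."""
    tf_query = np.asarray(tf_query, dtype=float)
    mrna_query = np.asarray(mrna_query, dtype=float)
    design = np.column_stack([tf_query, np.ones(len(tf_query))])
    predicted = design @ coef
    return mrna_query - predicted


# Toy reference data in which mRNA = 2*TF1 + 3*TF2 + 1 exactly (made-up values).
ref_tfs = [[1, 0], [0, 1], [1, 1], [2, 1]]
ref_mrna = [3, 4, 6, 8]
coef = fit_reference_model(ref_tfs, ref_mrna)

# A query sample whose mRNA level is higher than its TFs would predict.
print(mqtrans_feature(coef, [[1, 2]], [10]))
```

A gene whose raw expression does not differ between groups can still show a shifted residual if its relationship to the TFs has changed, which is the situation the abstract calls a dark biomarker.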
Affiliation(s)
- Ruihao Xin
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Jilin Institute of Chemical Technology, College of Information and Control Engineering, Jilin 132000, China
- Qian Cheng
- Jilin Institute of Chemical Technology, College of Information and Control Engineering, Jilin 132000, China
- Xiaohang Chi
- Jilin Institute of Chemical Technology, College of Information and Control Engineering, Jilin 132000, China
- Xin Feng
- School of Science, Jilin Institute of Chemical Technology, Jilin 132000, China
- Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, Changchun 130012, China
- Hang Zhang
- Jilin Institute of Chemical Technology, College of Information and Control Engineering, Jilin 132000, China
- Yueying Wang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Meiyu Duan
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Tunyang Xie
- Centre for Mathematical Sciences, University of Cambridge, Wilberforce Road, Cambridge CB3 0WA, UK
- Xiaonan Song
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Software, Jilin University, Changchun 130012, China
- Qiong Yu
- Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, Changchun 130012, China
- Yusi Fan
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Software, Jilin University, Changchun 130012, China
- Lan Huang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Fengfeng Zhou
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- School of Biology and Engineering, Guizhou Medical University, Guiyang 550025, China
6. Bhogale S, Seward C, Stubbs L, Sinha S. SEAMoD: A fully interpretable neural network for cis-regulatory analysis of differentially expressed genes. bioRxiv: The Preprint Server for Biology 2023:2023.11.09.565900. [PMID: 38014229 PMCID: PMC10680628 DOI: 10.1101/2023.11.09.565900] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
A common way to investigate gene regulatory mechanisms is to identify differentially expressed genes using transcriptomics, find their candidate enhancers using epigenomics, and search for over-represented transcription factor (TF) motifs in these enhancers using bioinformatics tools. A related follow-up task is to model gene expression as a function of enhancer sequences and rank TF motifs by their contribution to such models, thus prioritizing among regulators. We present a new computational tool called SEAMoD that performs the above tasks of motif finding and sequence-to-expression modeling simultaneously. It trains a convolutional neural network model to relate enhancer sequences to differential expression in one or more biological conditions. The model uses TF motifs to interpret the sequences, learning these motifs and their relative importance to each biological condition from data. It also utilizes epigenomic information in the form of activity scores of putative enhancers and automatically searches for the most promising enhancer for each gene. Compared to existing neural network models of non-coding sequences, SEAMoD uses far fewer parameters, requires far less training data, and emphasizes biological interpretability. We used SEAMoD to understand regulatory mechanisms underlying the differentiation of neural stem cell (NSC) derived from mouse forebrain. We profiled gene expression and histone modifications in NSC and three differentiated cell types and used SEAMoD to model differential expression of nearly 12,000 genes with an accuracy of 81%, in the process identifying the Olig2, E2f family TFs, Foxo3, and Tcf4 as key transcriptional regulators of the differentiation process.
7. Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, Lyu Q, Dun Y. Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models. Int J Mol Sci 2023; 24:15858. [PMID: 37958843 PMCID: PMC10649223 DOI: 10.3390/ijms242115858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2023] [Revised: 10/24/2023] [Accepted: 10/30/2023] [Indexed: 11/15/2023] Open
Abstract
The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning, since we expect a superhuman intelligence that explores beyond our knowledge to interpret the genome from deep learning. A powerful deep learning model should rely on the insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with proper deep learning-based architecture, and we remark on practical considerations of developing deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out current challenges and potential research directions for future genomics applications. We believe the collaborative use of ever-growing diverse data and the fast iteration of deep learning models will continue to contribute to the future of genomics.
Affiliation(s)
- Tianwei Yue
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Yuanxin Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Longxiang Zhang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Chunming Gu
- Department of Biomedical Engineering, School of Medicine, Johns Hopkins University, Baltimore, MD 21218, USA
- Haoru Xue
- The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Wenping Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Qi Lyu
- Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Yujie Dun
- School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China
8. Groves SM, Quaranta V. Quantifying cancer cell plasticity with gene regulatory networks and single-cell dynamics. Frontiers in Network Physiology 2023; 3:1225736. [PMID: 37731743 PMCID: PMC10507267 DOI: 10.3389/fnetp.2023.1225736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2023] [Accepted: 08/25/2023] [Indexed: 09/22/2023]
Abstract
Phenotypic plasticity of cancer cells can lead to complex cell state dynamics during tumor progression and acquired resistance. Highly plastic stem-like states may be inherently drug-resistant. Moreover, cell state dynamics in response to therapy allow a tumor to evade treatment. In both scenarios, quantifying plasticity is essential for identifying high-plasticity states or elucidating transition paths between states. Currently, methods to quantify plasticity tend to focus on 1) quantification of quasi-potential based on the underlying gene regulatory network dynamics of the system; or 2) inference of cell potency based on trajectory inference or lineage tracing in single-cell dynamics. Here, we explore both of these approaches and associated computational tools. We then discuss implications of each approach to plasticity metrics, and relevance to cancer treatment strategies.
Affiliation(s)
- Sarah M. Groves
- Department of Pharmacology, Vanderbilt University, Nashville, TN, United States
- Vito Quaranta
- Department of Pharmacology, Vanderbilt University, Nashville, TN, United States
- Department of Biochemistry, Vanderbilt University, Nashville, TN, United States
9. Costa IG. Dissecting gene regulation with multimodal sequencing. Nat Methods 2023; 20:1282-1284. [PMID: 37537350 DOI: 10.1038/s41592-023-01957-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/05/2023]
Affiliation(s)
- Ivan G Costa
- Institute for Computational Genomics, Joint Research Center for Computational Biomedicine, RWTH Aachen Medical Faculty, Aachen, Germany.
10. Hepkema J, Lee NK, Stewart BJ, Ruangroengkulrith S, Charoensawan V, Clatworthy MR, Hemberg M. Predicting the impact of sequence motifs on gene regulation using single-cell data. Genome Biol 2023; 24:189. [PMID: 37582793 PMCID: PMC10426127 DOI: 10.1186/s13059-023-03021-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2022] [Accepted: 07/21/2023] [Indexed: 08/17/2023] Open
Abstract
The binding of transcription factors at proximal promoters and distal enhancers is central to gene regulation. Identifying regulatory motifs and quantifying their impact on expression remains challenging. Using a convolutional neural network trained on single-cell data, we infer putative regulatory motifs and cell type-specific importance. Our model, scover, explains 29% of the variance in gene expression in multiple mouse tissues. Applying scover to distal enhancers identified using scATAC-seq from the developing human brain, we identify cell type-specific motif activities in distal enhancers. Scover can identify regulatory motifs and their importance from single-cell data where all parameters and outputs are easily interpretable.
Affiliation(s)
- Jacob Hepkema
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Nicholas Keone Lee
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- The Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK
- Benjamin J Stewart
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Molecular Immunity Unit, Department of Medicine, University of Cambridge, Cambridge, CB2 0QQ, UK
- Cambridge University Hospitals NHS Foundation Trust and NIHR Cambridge Biomedical Research Centre, Cambridge, CB2 0QQ, UK
- Siwat Ruangroengkulrith
- Department of Biochemistry, Faculty of Science, Mahidol University, Bangkok, 10400, Thailand
- Varodom Charoensawan
- Department of Biochemistry, Faculty of Science, Mahidol University, Bangkok, 10400, Thailand
- Integrative Computational BioScience (ICBS) Center, Mahidol University, Nakhon Pathom, 7310, Thailand
- Systems Biology of Diseases Research Unit, Faculty of Science, Mahidol University, Bangkok, 10400, Thailand
- Menna R Clatworthy
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Molecular Immunity Unit, Department of Medicine, University of Cambridge, Cambridge, CB2 0QQ, UK
- Cambridge University Hospitals NHS Foundation Trust and NIHR Cambridge Biomedical Research Centre, Cambridge, CB2 0QQ, UK
- Martin Hemberg
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK.
- The Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK.
- Gene Lay Institute of Immunology and Inflammation, Brigham and Women's Hospital, Massachusetts General Hospital, and Harvard Medical School, Boston, MA, 02115, USA.
11. Komuro J, Kusumoto D, Hashimoto H, Yuasa S. Machine learning in cardiology: Clinical application and basic research. J Cardiol 2023; 82:128-133. [PMID: 37141938 DOI: 10.1016/j.jjcc.2023.04.020] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 02/08/2023] [Revised: 04/23/2023] [Accepted: 04/28/2023] [Indexed: 05/06/2023]
Abstract
Machine learning is a subfield of artificial intelligence. The quality and versatility of machine learning have been rapidly improving and playing a critical role in many aspects of social life. This trend is also observed in the medical field. Generally, there are three main types of machine learning: supervised, unsupervised, and reinforcement learning. Each type of learning is adequately selected for the purpose and type of data. In the field of medicine, various types of information are collected and used, and research using machine learning is becoming increasingly relevant. Many clinical studies are conducted using electronic health and medical records, including in the cardiovascular area. Machine learning has also been applied in basic research. Machine learning has been widely used for several types of data analysis, such as clustering of microarray analysis and RNA sequence analysis. Machine learning is essential for genome and multi-omics analyses. This review summarizes the recent advancements in the use of machine learning in clinical applications and basic cardiovascular research.
Affiliation(s)
- Jin Komuro
- Department of Cardiology, Keio University School of Medicine, Tokyo, Japan
- Dai Kusumoto
- Department of Cardiology, Keio University School of Medicine, Tokyo, Japan
- Hisayuki Hashimoto
- Department of Cardiology, Keio University School of Medicine, Tokyo, Japan
- Shinsuke Yuasa
- Department of Cardiology, Keio University School of Medicine, Tokyo, Japan.
12. Chiliński M, Lipiński J, Agarwal A, Ruan Y, Plewczynski D. Enhanced performance of gene expression predictive models with protein-mediated spatial chromatin interactions. Sci Rep 2023; 13:11693. [PMID: 37474564 PMCID: PMC10359366 DOI: 10.1038/s41598-023-38865-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Accepted: 07/16/2023] [Indexed: 07/22/2023] Open
Abstract
There have been multiple attempts to predict the expression of genes based on sequence, epigenetics, and various other factors. To improve those predictions, we investigated adding protein-specific 3D interactions that play a significant role in the condensation of the chromatin structure in the cell nucleus. To achieve this, we used the architecture of one of the state-of-the-art algorithms, ExPecto, and investigated the changes in the model metrics upon adding the spatially relevant data. We used ChIA-PET interactions that are mediated by cohesin (24 cell lines), CTCF (4 cell lines), and RNAPOL2 (4 cell lines). As the output of the study, we developed the Spatial Gene Expression (SpEx) algorithm, which shows statistically significant improvements in most cell lines. Compared with the baseline ExPecto model, which obtained a Spearman's rank correlation coefficient (SCC) of 0.82, and with the 0.85 reported by the newer Enformer, we obtained an average correlation score of 0.83. However, in some cases (e.g. RNAPOL2 on GM12878), our improvement reached 0.04, and in some cases (e.g. RNAPOL2 on H1), we reached an SCC of 0.86.
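The Spearman's rank correlation coefficient (SCC) quoted in this abstract can be illustrated with a minimal sketch: rank both vectors, then take the Pearson correlation of the ranks. The function names `average_ranks` and `spearman_scc` and the toy inputs are assumptions made for illustration; this is not the SpEx or ExPecto evaluation code.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_scc(predicted, observed):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(predicted), average_ranks(observed)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical predicted vs. measured expression values for four genes.
print(spearman_scc([0.1, 0.5, 2.0, 9.0], [1.0, 4.0, 9.0, 100.0]))
```

Because only the ordering matters, any monotone relationship between predicted and measured expression gives an SCC close to 1, which is why the metric suits expression values spanning several orders of magnitude.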
Affiliation(s)
- Mateusz Chiliński
- Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, 00-662, Warsaw, Poland
- Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, 02-097, Warsaw, Poland
| | | | - Abhishek Agarwal
- Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, 02-097, Warsaw, Poland
| | - Yijun Ruan
- The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT, 06030, USA
- Life Sciences Institute, Zhejiang University, Zhejiang, Hangzhou, China
| | - Dariusz Plewczynski
- Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, 00-662, Warsaw, Poland.
- Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, 02-097, Warsaw, Poland.
| |
|
13
|
Chandrashekar PB, Chen H, Lee M, Ahmadinejad N, Liu L. DeepCORE: An interpretable multi-view deep neural network model to detect co-operative regulatory elements. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.04.19.536807. [PMID: 37131697 PMCID: PMC10153112 DOI: 10.1101/2023.04.19.536807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Gene transcription is an essential process involved in all aspects of cellular functions with significant impact on biological traits and diseases. This process is tightly regulated by multiple elements that co-operate to jointly modulate the transcription levels of target genes. To decipher the complicated regulatory network, we present a novel multi-view attention-based deep neural network that models the relationship between genetic, epigenetic, and transcriptional patterns and identifies co-operative regulatory elements (COREs). We applied this new method, named DeepCORE, to predict transcriptomes in 25 different cell lines, which outperformed the state-of-the-art algorithms. Furthermore, DeepCORE translates the attention values embedded in the neural network into interpretable information, including locations of putative regulatory elements and their correlations, which collectively implies COREs. These COREs are significantly enriched with known promoters and enhancers. Novel regulatory elements discovered by DeepCORE showed epigenetic signatures consistent with the status of histone modification marks.
|
14
|
Chiliński M, Lipiński J, Agarwal A, Ruan Y, Plewczynski D. Enhanced performance of gene expression predictive models with protein-mediated spatial chromatin interactions. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.04.06.535849. [PMID: 37066361 PMCID: PMC10104055 DOI: 10.1101/2023.04.06.535849] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/18/2023]
Abstract
There have been multiple attempts to predict the expression of genes based on sequence, epigenetics, and various other factors. To improve those predictions, we investigated adding protein-specific 3D interactions that play a major role in the condensation of the chromatin structure in the cell nucleus. To achieve this, we used the architecture of one of the state-of-the-art algorithms, ExPecto (J. Zhou et al., 2018), and investigated the changes in the model metrics upon adding the spatially relevant data. We used ChIA-PET interactions that are mediated by cohesin (24 cell lines), CTCF (4 cell lines), and RNAPOL2 (4 cell lines). As the output of the study, we developed the Spatial Gene Expression (SpEx) algorithm, which shows statistically significant improvements in most cell lines.
|
15
|
Comparative Research: Regulatory Mechanisms of Ribosomal Gene Transcription in Saccharomyces cerevisiae and Schizosaccharomyces pombe. Biomolecules 2023; 13:biom13020288. [PMID: 36830657 PMCID: PMC9952952 DOI: 10.3390/biom13020288] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 01/06/2023] [Revised: 01/31/2023] [Accepted: 02/01/2023] [Indexed: 02/05/2023] Open
Abstract
Restricting ribosome biosynthesis and assembly in response to nutrient starvation is a universal phenomenon that enables cells to survive with limited intracellular resources. When cells experience starvation, nutrient signaling pathways, such as the target of rapamycin (TOR) and protein kinase A (PKA), become quiescent, leading to several transcription factors and histone modification enzymes cooperatively and rapidly repressing ribosomal genes. Fission yeast has factors for heterochromatin formation similar to mammalian cells, such as H3K9 methyltransferase and HP1 protein, which are absent in budding yeast. However, limited studies on heterochromatinization in ribosomal genes have been conducted on fission yeast. Herein, we shed light on and compare the regulatory mechanisms of ribosomal gene transcription in two species with the latest insights.
|
16
|
Chen Y, Xie M, Wen J. Predicting gene expression from histone modifications with self-attention based neural networks and transfer learning. Front Genet 2022; 13:1081842. [PMID: 36588793 PMCID: PMC9797047 DOI: 10.3389/fgene.2022.1081842] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/27/2022] [Accepted: 11/28/2022] [Indexed: 12/15/2022] Open
Abstract
It is well known that histone modifications play an important part in various chromatin-dependent processes such as DNA replication, repair, and transcription. Using computational models to predict gene expression based on histone modifications has been intensively studied. However, the accuracy of the proposed models still has room for improvement, especially in cross-cell-line gene expression prediction. In this work, we propose a new deep learning model, TransferChrome, to predict gene expression from histone modifications. The model uses a densely connected convolutional network to capture the features of histone modification data and uses self-attention layers to aggregate global features of the data. For cross-cell-line gene expression prediction, TransferChrome adopts transfer learning to improve prediction accuracy. We trained and tested our model on 56 different cell lines from the REMC database. The experimental results show that our model achieved an average Area Under the Curve (AUC) score of 84.79%. Compared to three state-of-the-art models, TransferChrome improves the prediction performance on most cell lines. The cross-cell-line experiments show that TransferChrome performs best and is an efficient model for predicting cross-cell-line gene expression.
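For readers unfamiliar with the AUC metric quoted in this abstract, the following minimal sketch computes it as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. It assumes expression has been binarized into high (1) and low (0) labels, which the abstract does not state; the function name `roc_auc` and the toy data are hypothetical, and this is not the TransferChrome evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 class labels (assumed binarized expression).
    scores: model scores, where higher means more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Each correctly ordered pair counts 1, each tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores for four genes.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```

The pairwise form is quadratic in the number of genes, so it is meant for illustration; rank-based implementations give the same value more efficiently.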
|
17
|
Dutta P, Patra AP, Saha S. DeePROG: Deep Attention-Based Model for Diseased Gene Prognosis by Fusing Multi-Omics Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2770-2781. [PMID: 34166198 DOI: 10.1109/tcbb.2021.3090302] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/13/2023]
Abstract
An in-depth exploration of gene prognosis using different methodologies aids in understanding various biological regulations of genes in disease pathobiology and molecular functions. Interpreting gene functions at biological and molecular levels remains a daunting yet crucial task in domains such as drug design, personalized medicine, and next-generation diagnostics. Recent advancements in omics technologies have produced diverse heterogeneous genomic datasets, such as micro-array gene expression, miRNA expression, DNA sequence, and 3D structures, which are significant resources for understanding gene functions. In this paper, we propose a novel self-attention based deep multi-modal model, named DeePROG, for the prognosis of disease-affected genes based on heterogeneous omics data. We use three NCBI datasets covering three modalities, namely the gene expression profile, the underlying DNA sequence, and the 3D protein structures. To extract useful features from each modality, we develop several context-specific deep learning models. In addition, we develop three attention-based deep bi-modal architectures along with DeePROG to leverage the prognosis of the underlying biomedical data. We assess the performance of the models in terms of the computational assessment of function annotation (CAFA2) metrics. Moreover, we analyze the results in terms of the receiver operating characteristic (ROC) curve in a high class-imbalance data setting and perform statistical significance tests using Welch's t-test. Experimental results show that DeePROG significantly outperforms baseline models across the performance metrics. The source code and all preprocessed datasets used in this study are available at https://github.com/duttaprat/DeePROG.
|
18
|
Al taweraqi N, King RD. Improved prediction of gene expression through integrating cell signalling models with machine learning. BMC Bioinformatics 2022; 23:323. [PMID: 35933367 PMCID: PMC9356471 DOI: 10.1186/s12859-022-04787-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2020] [Accepted: 04/13/2022] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND A key problem in bioinformatics is that of predicting gene expression levels. There are two broad approaches: use of mechanistic models that aim to directly simulate the underlying biology, and use of machine learning (ML) to empirically predict expression levels from descriptors of the experiments. There are advantages and disadvantages to both approaches: mechanistic models more directly reflect the underlying biological causation, but do not directly utilize the available empirical data; while ML methods do not fully utilize existing biological knowledge. RESULTS Here, we investigate overcoming these disadvantages by integrating mechanistic cell signalling models with ML. Our approach to integration is to augment ML with similarity features (attributes) computed from cell signalling models. Seven sets of different similarity features were generated using graph theory. Each set of features was in turn used to learn multi-target regression models. All the features significantly improved accuracy over the baseline model, which lacks the similarity features. Finally, the seven multi-target regression models were stacked together to form an overall prediction model that was significantly better than the baseline on 95% of genes on an independent test set. The similarity features enable this stacking model to provide interpretable knowledge about cancer, e.g. the role of ERBB3 in the MCF7 breast cancer cell line. CONCLUSION Integrating mechanistic models as graphs helps both to improve the predictive results of machine learning models and to provide biological knowledge about genes that can help in building state-of-the-art mechanistic models.
Affiliation(s)
- Nada Al taweraqi
- Department of Computer Science, University of Manchester, Manchester, UK
- Department of Computer Science, Taif University, Taif, Saudi Arabia
| | - Ross D. King
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, UK
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Alan Turing Institute, London, UK
| |
|
19
|
Pan-cancer identification of the relationship of metabolism-related differentially expressed transcription regulation with non-differentially expressed target genes via a gated recurrent unit network. Comput Biol Med 2022; 148:105883. [PMID: 35878490 DOI: 10.1016/j.compbiomed.2022.105883] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/06/2022] [Revised: 07/10/2022] [Accepted: 07/16/2022] [Indexed: 11/20/2022]
Abstract
The transcriptome describes the expression of all genes in a sample. Most studies have investigated the differential patterns or discrimination powers of transcript expression levels. In this study, we hypothesized that the quantitative correlations between the expression levels of transcription factors (TFs) and their regulated target genes (mRNAs) serve as a novel view of healthy status, and that a disease sample exhibits a differential landscape (mqTrans) of transcription regulations compared with healthy status. We formulated quantitative transcription regulation relationships of metabolism-related genes as a multi-input multi-output regression model via a gated recurrent unit (GRU) network. The GRU model was trained using healthy blood transcriptomes, and the expression levels of mRNAs were predicted from those of the TFs. The mqTrans feature of a gene was defined as the difference between its predicted and actual expression levels. A pan-cancer investigation of the differentially expressed mqTrans features was conducted between the early- and late-stage cancers in 26 cancer types of The Cancer Genome Atlas database. This study focused on the differentially expressed mqTrans features whose genes did not show differential expression in their actual expression levels. These genes could not be detected by conventional differential analysis. Such dark biomarkers are worthy of further wet-lab investigation. The experimental data also showed that the proposed mqTrans investigation improved the classification between early- and late-stage samples for some cancer types. Thus, the mqTrans features serve as a complementary view to transcriptomes, an OMIC type with mature high-throughput production technologies and abundant public resources.
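The mqTrans definition in this abstract (predicted minus actual expression of a target gene) can be sketched as follows. The study predicts mRNA levels with a GRU network; the linear predictor below is only a stand-in chosen to keep the example self-contained. The sign convention (predicted minus actual), the function names, and all numbers are assumptions made for illustration.

```python
def predict_mrna(tf_levels, weights, bias):
    """Stand-in predictor: linear map from TF levels to one mRNA level.

    Assumption: the study uses a GRU network trained on healthy samples;
    a fixed linear model is used here purely for illustration.
    """
    return bias + sum(w * x for w, x in zip(weights, tf_levels))


def mqtrans_feature(tf_levels, actual_mrna, weights, bias):
    """mqTrans value of a gene: predicted minus actual expression.

    Assumption: the abstract only says "difference"; the sign is a choice.
    """
    return predict_mrna(tf_levels, weights, bias) - actual_mrna


# Hypothetical sample: two TF levels and the measured target mRNA level.
weights, bias = [0.5, 2.0], 1.0
print(mqtrans_feature([2.0, 1.0], actual_mrna=3.0, weights=weights, bias=bias))  # 1.0
```

A gene can have unchanged actual expression between two groups while its mqTrans value differs, because the TF levels feeding the prediction differ; that is the situation the abstract describes as a "dark biomarker".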
|
20
|
DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat Genet 2022; 54:613-624. [PMID: 35551305 DOI: 10.1038/s41588-022-01048-5] [Citation(s) in RCA: 69] [Impact Index Per Article: 34.5] [Received: 09/20/2021] [Accepted: 03/08/2022] [Indexed: 02/06/2023]
Abstract
Enhancer sequences control gene expression and comprise binding sites (motifs) for different transcription factors (TFs). Despite extensive genetic and computational studies, the relationship between DNA sequence and regulatory activity is poorly understood, and de novo enhancer design has been challenging. Here, we built a deep-learning model, DeepSTARR, to quantitatively predict the activities of thousands of developmental and housekeeping enhancers directly from DNA sequence in Drosophila melanogaster S2 cells. The model learned relevant TF motifs and higher-order syntax rules, including functionally nonequivalent instances of the same TF motif that are determined by motif-flanking sequence and intermotif distances. We validated these rules experimentally and demonstrated that they can be generalized to humans by testing more than 40,000 wildtype and mutant Drosophila and human enhancers. Finally, we designed and functionally validated synthetic enhancers with desired activities de novo.
|
21
|
Park JJ, Chen S. Metaviromic identification of discriminative genomic features in SARS-CoV-2 using machine learning. PATTERNS 2022; 3:100407. [PMID: 34812427 PMCID: PMC8598947 DOI: 10.1016/j.patter.2021.100407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2021] [Revised: 08/12/2021] [Accepted: 11/11/2021] [Indexed: 01/18/2023]
Abstract
The COVID-19 pandemic caused by SARS-CoV-2 has become a major threat across the globe. Here, we developed machine learning approaches to identify key pathogenic regions in coronavirus genomes. We trained and evaluated 7,562,625 models on 3,665 genomes including SARS-CoV-2, MERS-CoV, SARS-CoV, and other coronaviruses of human and animal origins to return quantitative and biologically interpretable signatures at nucleotide and amino acid resolutions. We identified hotspots across the SARS-CoV-2 genome, including previously unappreciated features in spike, RdRp, and other proteins. Finally, we integrated pathogenicity genomic profiles with B cell and T cell epitope predictions for enrichment of sequence targets to help guide vaccine development. These results provide a systematic map of predicted pathogenicity in SARS-CoV-2 that incorporates sequence, structural, and immunologic features, providing an unbiased collection of genetic elements for functional studies. This metavirome-based framework can also be applied for rapid characterization of new coronavirus strains or emerging pathogenic viruses.
- Machine learning identifies discriminative signatures in coronavirus genomes
- Hotspots in key viral proteins have evolutionary and structural significance
- Integration of hotspots with B cell and T cell epitopes identifies joint features
- Hotspots correlate with emerging variants of concern for mutation prioritization
Identifying which genomic regions of the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus are pathogenic remains a major challenge in COVID-19 research. However, there is currently a lack of systematic and unbiased methods for such functional characterization. In this study, we set up a machine learning-based approach to identify which genomic regions distinguish SARS-CoV-2 and other high case fatality rate coronaviruses from other coronaviruses. Discriminative scores were obtained for every nucleotide in the SARS-CoV-2 genome. We then performed a series of evolutionary and structural analyses of candidate hotspots, as well as integrative analyses with predicted B cell and T cell epitopes and emerging variants of concern. Our approach can be extended to other viral genomes or microbial pathogens to gain insights on which sequence features are pathogenic or immunogenic.
|
22
|
Lagator M, Sarikas S, Steinrueck M, Toledo-Aparicio D, Bollback JP, Guet CC, Tkačik G. Predicting bacterial promoter function and evolution from random sequences. eLife 2022; 11:64543. [PMID: 35080492 PMCID: PMC8791639 DOI: 10.7554/elife.64543] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 11/02/2020] [Accepted: 01/09/2022] [Indexed: 12/12/2022] Open
Abstract
Predicting function from sequence is a central problem of biology. Currently, this is possible only locally in a narrow mutational neighborhood around a wildtype sequence rather than globally from any sequence. Using random mutant libraries, we developed a biophysical model that accounts for multiple features of σ70 binding bacterial promoters to predict constitutive gene expression levels from any sequence. We experimentally and theoretically estimated that 10–20% of random sequences lead to expression and ~80% of non-expressing sequences are one mutation away from a functional promoter. The potential for generating expression from random sequences is so pervasive that selection acts against σ70-RNA polymerase binding sites even within inter-genic, promoter-containing regions. This pervasiveness of σ70-binding sites implies that emergence of promoters is not the limiting step in gene regulatory evolution. Ultimately, the inclusion of novel features of promoter function into a mechanistic model enabled not only more accurate predictions of gene expression levels, but also identified that promoters evolve more rapidly than previously thought.
Affiliation(s)
- Mato Lagator
- School of Biological Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, United Kingdom; Institute of Science and Technology Austria, Klosterneuburg, Austria
| | - Srdjan Sarikas
- Institute of Science and Technology Austria, Klosterneuburg, Austria; Center for Physiology and Pharmacology, Medical University of Vienna, Klosterneuburg, Austria
| | | | | | - Jonathan P Bollback
- Institute of Integrative Biology, Functional and Comparative Genomics, University of Liverpool, Liverpool, United Kingdom
| | - Calin C Guet
- Institute of Science and Technology Austria, Klosterneuburg, Austria
| | - Gašper Tkačik
- Institute of Science and Technology Austria, Klosterneuburg, Austria
| |
|
23
|
Karanth S, Tanui CK, Meng J, Pradhan AK. Exploring the predictive capability of advanced machine learning in identifying severe disease phenotype in Salmonella enterica. Food Res Int 2022; 151:110817. [PMID: 34980422 DOI: 10.1016/j.foodres.2021.110817] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 09/17/2021] [Revised: 11/12/2021] [Accepted: 11/17/2021] [Indexed: 11/26/2022]
Abstract
The past few years have seen a significant increase in availability of whole genome sequencing information, allowing for its incorporation in predictive modeling for foodborne pathogens to account for inter- and intra-species differences in their virulence. However, this is hindered by the inability of traditional statistical methods to analyze such large amounts of data compared to the number of observations/isolates. In this study, we have explored the applicability of machine learning (ML) models to predict the disease outcome, while identifying features that exert a significant effect on the prediction. This study was conducted on Salmonella enterica, a major foodborne pathogen with considerable inter- and intra-serovar variation. WGS of isolates obtained from various sources (i.e., human, chicken, and swine) were used as input in four machine learning models (logistic regression with ridge, random forest, support vector machine, and AdaBoost) to classify isolates based on disease severity (extraintestinal vs. gastrointestinal) in the host. The predictive performances of all models were tested with and without Elastic Net regularization to combat dimensionality issues. Elastic Net-regularized logistic regression model showed the best area under the receiver operating characteristic curve (AUC-ROC; 0.86) and outcome prediction accuracy (0.76). Additionally, genes coding for transcriptional regulation, acidic, oxidative, and anaerobic stress response, and antibiotic resistance were found to be significant predictors of disease severity. These genes, which were significantly associated with each outcome, could possibly be input in amended, gene-expression-specific predictive models to estimate virulence pattern-specific effect of Salmonella and other foodborne pathogens on human health.
Affiliation(s)
- Shraddha Karanth
- Department of Nutrition and Food Science, University of Maryland, College Park, MD 20742, USA
| | - Collins K Tanui
- Department of Nutrition and Food Science, University of Maryland, College Park, MD 20742, USA; Center for Food Safety and Security Systems, University of Maryland, College Park, MD 20742, USA
| | - Jianghong Meng
- Department of Nutrition and Food Science, University of Maryland, College Park, MD 20742, USA; Center for Food Safety and Security Systems, University of Maryland, College Park, MD 20742, USA; Joint Institute for Food Safety and Applied Nutrition, University of Maryland, College Park, MD 20742, USA
| | - Abani K Pradhan
- Department of Nutrition and Food Science, University of Maryland, College Park, MD 20742, USA; Center for Food Safety and Security Systems, University of Maryland, College Park, MD 20742, USA.
| |
|
24
|
He L, Chen IW, Zhang Z, Zheng W, Sayadi A, Wang L, Sang W, Ji R, Lei J, Arnqvist G, Lei C, Zhu-Salzman K. In silico promoter analysis and functional validation identify CmZFH, the co-regulator of hypoxia-responsive genes CmScylla and CmLPCAT. INSECT BIOCHEMISTRY AND MOLECULAR BIOLOGY 2022; 140:103681. [PMID: 34800642 DOI: 10.1016/j.ibmb.2021.103681] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2021] [Revised: 09/30/2021] [Accepted: 11/06/2021] [Indexed: 06/13/2023]
Abstract
Oxygen (O2) plays an essential role in aerobic organisms including terrestrial insects. Under hypoxic stress, the cowpea bruchid (Callosobruchus maculatus) ceases feeding and growth. However, larvae, particularly 4th instar larvae, exhibit very high tolerance to hypoxia and can recover normal growth once brought to normoxia. To better understand the molecular mechanism that enables insects to cope with low O2 stress, we performed RNA-seq to identify hypoxia-responsive genes in midguts and subsequently identified potential common cis-elements in promoters of hypoxia-induced and -repressed genes, respectively. Selected elements were subjected to gel-shift and transient transfection assays to confirm their cis-regulatory function. Of these putative common cis-elements, AREB6 appeared to regulate the expression of CmLPCAT and CmScylla, two hypoxia-induced genes. CmZFH, the putative AREB6-binding protein, was hypoxia-inducible. Transient expression of CmZFH in Drosophila S2 cells activated CmLPCAT and CmScylla, and their induction was likely through interaction of CmZFH with AREB6. Binding to AREB6 was further confirmed by bacterially expressed CmZFH recombinant protein. Deletion analyses indicated that the N-terminal zinc-finger cluster of CmZFH was the key AREB6-binding domain. Through in silico and experimental exploration, we discovered novel transcriptional regulatory components associated with gene expression dynamics under hypoxia that facilitated insect survival.
Affiliation(s)
- Li He
- Hubei Insect Resources Utilization and Sustainable Pest Management Key Laboratory, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, 430070, China; Department of Entomology, Texas A&M University, College Station, TX, 77843, USA; Institute for Plant Genomics & Biotechnology, Texas A&M University, College Station, TX, 77843, USA
| | - Ivy W Chen
- Department of Entomology, Texas A&M University, College Station, TX, 77843, USA; Institute for Plant Genomics & Biotechnology, Texas A&M University, College Station, TX, 77843, USA
| | - Zan Zhang
- Key Laboratory of Entomology and Pest Control Engineering, College of Plant Protection, Academy of Agricultural Sciences, Southwest University, Chongqing, 400716, China
| | - Wenping Zheng
- Key Laboratory of Horticultural Plant Biology (MOE), Institute of Urban and Horticultural Entomology, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, 430070, China
| | - Ahmed Sayadi
- Animal Ecology, Department of Ecology and Genetics, Uppsala University, Uppsala, 75236, Sweden
| | - Lei Wang
- Department of Entomology, Texas A&M University, College Station, TX, 77843, USA; Institute for Plant Genomics & Biotechnology, Texas A&M University, College Station, TX, 77843, USA
| | - Wen Sang
- Department of Entomology, Texas A&M University, College Station, TX, 77843, USA; Institute for Plant Genomics & Biotechnology, Texas A&M University, College Station, TX, 77843, USA
| | - Rui Ji
- Department of Entomology, Texas A&M University, College Station, TX, 77843, USA; Institute for Plant Genomics & Biotechnology, Texas A&M University, College Station, TX, 77843, USA
| | - Jiaxin Lei
- Department of Entomology, Texas A&M University, College Station, TX, 77843, USA; Institute for Plant Genomics & Biotechnology, Texas A&M University, College Station, TX, 77843, USA
| | - Göran Arnqvist
- Animal Ecology, Department of Ecology and Genetics, Uppsala University, Uppsala, 75236, Sweden
| | - Chaoliang Lei
- Hubei Insect Resources Utilization and Sustainable Pest Management Key Laboratory, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, 430070, China
| | - Keyan Zhu-Salzman
- Department of Entomology, Texas A&M University, College Station, TX, 77843, USA; Institute for Plant Genomics & Biotechnology, Texas A&M University, College Station, TX, 77843, USA.
| |
|
25
|
Chien CH, Huang LY, Lo SF, Chen LJ, Liao CC, Chen JJ, Chu YW. Using Machine Learning Approaches to Predict Target Gene Expression in Rice T-DNA Insertional Mutants. Front Genet 2021; 12:798107. [PMID: 34976025 PMCID: PMC8718795 DOI: 10.3389/fgene.2021.798107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2021] [Accepted: 11/15/2021] [Indexed: 11/13/2022] Open
Abstract
Inserting T-DNA into the genome to change the expression of flanking genes is a common approach in rice functional gene research. However, whether the expression of a gene of interest is enhanced must be validated experimentally. Consequently, to improve the efficiency of screening activated genes, we established a model to predict gene expression in T-DNA mutants through machine learning methods. We gathered experimental datasets consisting of gene expression data in T-DNA mutants and captured the PROMOTER and MIDDLE sequences for encoding. In first-layer models, support vector machine (SVM) models were constructed with nine features consisting of information about biological function and local and global sequences. Feature encoding based on the PROMOTER sequence was weighted by logistic regression. The second-layer models integrated 16 first-layer models with minimum redundancy maximum relevance (mRMR) feature selection and the LADTree algorithm, which were selected from nine feature selection methods and 65 classification methods, respectively. The accuracy of the final two-layer machine learning model, referred to as TIMgo, was 99.3% based on fivefold cross-validation, and 85.6% based on independent testing. We discovered that the information within the local sequence had a greater contribution than the global sequence with respect to classification. TIMgo had a good predictive ability for target genes within 20 kb from the 35S enhancer. Based on the analysis of significant sequences, the G-box regulatory sequence may also play an important role in the activation mechanism of the 35S enhancer.
Affiliation(s)
- Ching-Hsuan Chien
- Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung, Taiwan
| | - Lan-Ying Huang
- Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung, Taiwan
| | - Shuen-Fang Lo
- Biotechnology Center, National Chung Hsing University, Taichung, Taiwan
| | - Liang-Jwu Chen
- Institute of Molecular Biology, National Chung Hsing University, Taichung, Taiwan
- Advanced Plant Biotechnology Center, National Chung Hsing University, Taichung, Taiwan
| | - Chi-Chou Liao
- Institute of Molecular Biology, National Chung Hsing University, Taichung, Taiwan
| | - Jia-Jyun Chen
- Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
| | - Yen-Wei Chu
- Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung, Taiwan
- Biotechnology Center, National Chung Hsing University, Taichung, Taiwan
- Institute of Molecular Biology, National Chung Hsing University, Taichung, Taiwan
- Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
- Agricultural Biotechnology Center, National Chung Hsing University, Taichung, Taiwan
- Ph.D. Program in Translational Medicine, National Chung Hsing University, Taichung, Taiwan
- Rong Hsing Research Center for Translational Medicine, National Chung Hsing University, Taichung, Taiwan
- *Correspondence: Yen-Wei Chu,
|
26
|
Gardiner LJ, Krishna R. Bluster or Lustre: Can AI Improve Crops and Plant Health? Plants (Basel, Switzerland) 2021; 10:plants10122707. [PMID: 34961177 PMCID: PMC8707749 DOI: 10.3390/plants10122707] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2021] [Revised: 11/24/2021] [Accepted: 12/06/2021] [Indexed: 06/14/2023]
Abstract
In a changing climate where future food security is a growing concern, researchers are exploring new methods and technologies in the effort to meet ambitious crop yield targets. The application of Artificial Intelligence (AI) including Machine Learning (ML) methods in this area has been proposed as a potential mechanism to support this. This review explores current research in the area to convey the state-of-the-art as to how AI/ML have been used to advance research, gain insights, and generally enable progress in this area. We address the question: Can AI improve crops and plant health? We further discriminate the bluster from the lustre by identifying the key challenges that AI has been shown to address, balanced with the potential issues with its usage, and the key requisites for its success. Overall, we hope to raise awareness and, as a result, promote usage of AI-related approaches where they can have appropriate impact to improve practices in agricultural and plant sciences.
|
27
|
Guharajan S, Chhabra S, Parisutham V, Brewster RC. Quantifying the regulatory role of individual transcription factors in Escherichia coli. Cell Rep 2021; 37:109952. [PMID: 34758318 PMCID: PMC8667592 DOI: 10.1016/j.celrep.2021.109952] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/16/2021] [Revised: 08/02/2021] [Accepted: 10/13/2021] [Indexed: 11/30/2022] Open
Abstract
Gene regulation often results from the action of multiple transcription factors (TFs) acting at a promoter, obscuring the individual regulatory effect of each TF on RNA polymerase (RNAP). Here we measure the fundamental regulatory interactions of TFs in E. coli by designing synthetic target genes that isolate individual TFs' regulatory effects. Using a thermodynamic model, each TF's regulatory interactions are decoupled from TF occupancy and interpreted as acting through (de)stabilization of RNAP and (de)acceleration of transcription initiation. We find that the contribution of each mechanism depends on TF identity and binding location; regulation immediately downstream of the promoter is insensitive to TF identity, but the same TFs regulate by distinct mechanisms upstream of the promoter. These two mechanisms are uncoupled and can act coherently, to reinforce the observed regulatory role (activation/repression), or incoherently, wherein the TF regulates two distinct steps with opposing effects.
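As a rough illustration of the kind of thermodynamic model this abstract refers to, the sketch below computes the fold-change in expression for a promoter with a single TF binding site. It is a generic four-state equilibrium model, not the authors' exact parameterization: the function name `fold_change` and the parameters `t`, `p`, `omega` (stabilization) and `alpha` (acceleration) are assumptions introduced here for illustration.

```python
def fold_change(t, p, omega, alpha):
    """Fold-change in expression caused by a TF in a four-state model.

    Hypothetical parameters (not taken from the paper):
      t     -- TF weight, [TF] / K_TF
      p     -- RNAP weight, [RNAP] / K_P
      omega -- TF-RNAP interaction; > 1 stabilizes, < 1 destabilizes RNAP
      alpha -- relative initiation rate with TF bound; > 1 accelerates
    """
    # Statistical weights of the states: empty, RNAP only, TF only, TF + RNAP.
    z_with_tf = 1 + p + t + p * t * omega
    # Expression is the rate-weighted probability of the RNAP-bound states.
    expr_with_tf = (p + alpha * p * t * omega) / z_with_tf
    expr_without_tf = p / (1 + p)
    return expr_with_tf / expr_without_tf


# Coherent repression: destabilization and deceleration both lower expression.
coherent = fold_change(t=10.0, p=0.01, omega=0.1, alpha=0.5)
# Incoherent regulation: stabilization (omega > 1) opposes deceleration (alpha < 1).
incoherent = fold_change(t=10.0, p=0.01, omega=5.0, alpha=0.1)
```

Under these assumptions, in the weak-promoter limit (p much smaller than 1) the expression reduces to (1 + alpha * omega * t) / (1 + t), which makes it easy to see how the two mechanisms can reinforce or oppose each other.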
Affiliation(s)
- Sunil Guharajan
- Department of Systems Biology, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA
| | - Shivani Chhabra
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Vinuselvi Parisutham
- Department of Systems Biology, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA
| | - Robert C Brewster
- Department of Systems Biology, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA; Department of Microbiology and Physiological Systems, University of Massachusetts Chan Medical School, Worcester, MA 01605, USA.
|
28
|
Dibaeinia P, Sinha S. Deciphering enhancer sequence using thermodynamics-based models and convolutional neural networks. Nucleic Acids Res 2021; 49:10309-10327. [PMID: 34508359 PMCID: PMC8501998 DOI: 10.1093/nar/gkab765] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/16/2021] [Revised: 08/18/2021] [Accepted: 08/25/2021] [Indexed: 11/18/2022] Open
Abstract
Deciphering the sequence-function relationship encoded in enhancers holds the key to interpreting non-coding variants and understanding mechanisms of transcriptomic variation. Several quantitative models exist for predicting enhancer function and underlying mechanisms; however, there has been no systematic comparison of these models characterizing their relative strengths and shortcomings. Here, we interrogated a rich data set of neuroectodermal enhancers in Drosophila, representing cis- and trans- sources of expression variation, with a suite of biophysical and machine learning models. We performed rigorous comparisons of thermodynamics-based models implementing different mechanisms of activation, repression and cooperativity. Moreover, we developed a convolutional neural network (CNN) model, called CoNSEPT, that learns enhancer 'grammar' in an unbiased manner. CoNSEPT is the first general-purpose CNN tool for predicting enhancer function in varying conditions, such as different cell types and experimental conditions, and we show that such complex models can suggest interpretable mechanisms. We found model-based evidence for mechanisms previously established for the studied system, including cooperative activation and short-range repression. The data also favored one hypothesized activation mechanism over another and suggested an intriguing role for a direct, distance-independent repression mechanism. Our modeling shows that while fundamentally different models can yield similar fits to data, they vary in their utility for mechanistic inference. CoNSEPT is freely available at: https://github.com/PayamDiba/CoNSEPT.
Affiliation(s)
- Payam Dibaeinia
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Saurabh Sinha
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Cancer Center at Illinois, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
|
29
|
Findley AS, Zhang X, Boye C, Lin YL, Kalita CA, Barreiro L, Lohmueller KE, Pique-Regi R, Luca F. A signature of Neanderthal introgression on molecular mechanisms of environmental responses. PLoS Genet 2021; 17:e1009493. [PMID: 34570765 PMCID: PMC8509894 DOI: 10.1371/journal.pgen.1009493] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 03/18/2021] [Revised: 10/12/2021] [Accepted: 08/18/2021] [Indexed: 12/17/2022] Open
Abstract
Ancient human migrations led to the settlement of population groups in varied environmental contexts worldwide. The extent to which adaptation to local environments has shaped human genetic diversity is a longstanding question in human evolution. Recent studies have suggested that introgression of archaic alleles in the genome of modern humans may have contributed to adaptation to environmental pressures such as pathogen exposure. Functional genomic studies have demonstrated that variation in gene expression across individuals and in response to environmental perturbations is a main mechanism underlying complex trait variation. We considered gene expression response to in vitro treatments as a molecular phenotype to identify genes and regulatory variants that may have played an important role in adaptations to local environments. We investigated if Neanderthal introgression in the human genome may contribute to the transcriptional response to environmental perturbations. To this end we used eQTLs for genes differentially expressed in a panel of 52 cellular environments, resulting from 5 cell types and 26 treatments, including hormones, vitamins, drugs, and environmental contaminants. We found that SNPs with introgressed Neanderthal alleles (N-SNPs) disrupt binding of transcription factors important for environmental responses, including ionizing radiation and hypoxia, and for glucose metabolism. We identified an enrichment for N-SNPs among eQTLs for genes differentially expressed in response to 8 treatments, including glucocorticoids, caffeine, and vitamin D. Using Massively Parallel Reporter Assays (MPRA) data, we validated the regulatory function of 21 introgressed Neanderthal variants in the human genome, corresponding to 8 eQTLs regulating 15 genes that respond to environmental perturbations. 
These findings expand the set of environments where archaic introgression may have contributed to adaptations to local environments in modern humans and provide experimental validation for the regulatory function of introgressed variants.
Affiliation(s)
- Anthony S. Findley
- Center for Molecular Medicine and Genetics, Wayne State University, Detroit, Michigan, United States of America
| | - Xinjun Zhang
- Department of Ecology and Evolutionary Biology, UCLA, Los Angeles, California, United States of America
| | - Carly Boye
- Center for Molecular Medicine and Genetics, Wayne State University, Detroit, Michigan, United States of America
| | - Yen Lung Lin
- Genetics Section, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America
| | - Cynthia A. Kalita
- Center for Molecular Medicine and Genetics, Wayne State University, Detroit, Michigan, United States of America
| | - Luis Barreiro
- Genetics Section, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America
| | - Kirk E. Lohmueller
- Department of Ecology and Evolutionary Biology, UCLA, Los Angeles, California, United States of America
- Department of Human Genetics, David Geffen School of Medicine, UCLA, Los Angeles, California, United States of America
| | - Roger Pique-Regi
- Center for Molecular Medicine and Genetics, Wayne State University, Detroit, Michigan, United States of America
- Department of Obstetrics and Gynecology, Wayne State University, Detroit, Michigan, United States of America
| | - Francesca Luca
- Center for Molecular Medicine and Genetics, Wayne State University, Detroit, Michigan, United States of America
- Department of Obstetrics and Gynecology, Wayne State University, Detroit, Michigan, United States of America
|
30
|
Interpreting machine learning models to investigate circadian regulation and facilitate exploration of clock function. Proc Natl Acad Sci U S A 2021; 118:2103070118. [PMID: 34353905 PMCID: PMC8364196 DOI: 10.1073/pnas.2103070118] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 01/03/2023] Open
Abstract
The circadian clock is an internal molecular 24-h timer that is critical to life on Earth. We describe a series of artificial intelligence (AI)– and machine learning (ML)–based approaches that enable more cost-effective analysis and insight into circadian regulation and function. Throughout the manuscript, we illuminate what is inside the ML “black box” via explanation or interpretation of predictive ML models. Using this interpretation of our models, we derive biological insights into why a prediction was made, alongside accurate predictions. Most innovatively, we use only DNA sequence features for accurate circadian gene expression prediction. Using explainable AI, we define possible, responsible regulatory elements as we make these predictions; this critically requires no prior knowledge of regulatory elements. The circadian clock is an important adaptation to life on Earth. Here, we use machine learning to predict complex, temporal, and circadian gene expression patterns in Arabidopsis. Most significantly, we classify circadian genes using DNA sequence features generated de novo from public, genomic resources, facilitating downstream application of our methods with no experimental work or prior knowledge needed. We use local model explanation that is transcript specific to rank DNA sequence features, providing a detailed profile of the potential circadian regulatory mechanisms for each transcript. Furthermore, we can discriminate the temporal phase of transcript expression using the local, explanation-derived, and ranked DNA sequence features, revealing hidden subclasses within the circadian class. Model interpretation/explanation provides the backbone of our methodological advances, giving insight into biological processes and experimental design. Next, we use model interpretation to optimize sampling strategies when we predict circadian transcripts using reduced numbers of transcriptomic timepoints. 
Finally, we predict the circadian time from a single, transcriptomic timepoint, deriving marker transcripts that are most impactful for accurate prediction; this could facilitate the identification of altered clock function from existing datasets.
|
31
|
Wang N, Lefaudeux D, Mazumder A, Li JJ, Hoffmann A. Identifying the combinatorial control of signal-dependent transcription factors. PLoS Comput Biol 2021; 17:e1009095. [PMID: 34166361 PMCID: PMC8263068 DOI: 10.1371/journal.pcbi.1009095] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 11/06/2020] [Revised: 07/07/2021] [Accepted: 05/18/2021] [Indexed: 12/13/2022] Open
Abstract
The effectiveness of immune responses depends on the precision of stimulus-responsive gene expression programs. Cells specify which genes to express by activating stimulus-specific combinations of stimulus-induced transcription factors (TFs). Their activities are decoded by a gene regulatory strategy (GRS) associated with each response gene. Here, we examined whether the GRSs of target genes may be inferred from stimulus-response (input-output) datasets, which remains an unresolved model-identifiability challenge. We developed a mechanistic modeling framework and computational workflow to determine the identifiability of all possible combinations of synergistic (AND) or non-synergistic (OR) GRSs involving three transcription factors. Considering different sets of perturbations for stimulus-response studies, we found that two thirds of GRSs are easily distinguishable but that substantially more quantitative data is required to distinguish the remaining third. To enhance the accuracy of the inference with timecourse experimental data, we developed an advanced error model that avoids error overestimates by distinguishing between value and temporal error. Incorporating this error model into a Bayesian framework, we show that GRS models can be identified for individual genes by considering multiple datasets. Our analysis rationalizes the allocation of experimental resources by identifying most informative TF stimulation conditions. Applying this computational workflow to experimental data of immune response genes in macrophages, we found that a much greater fraction of genes are combinatorially controlled than previously reported by considering compensation among transcription factors. Specifically, we revealed that a group of known NFκB target genes may also be regulated by IRF3, which is supported by chromatin immuno-precipitation analysis. 
Our study provides a computational workflow for designing and interpreting stimulus-response gene expression studies to identify underlying gene regulatory strategies and further a mechanistic understanding.
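The identifiability question raised in this abstract can be illustrated with a deliberately simplified Boolean sketch. The paper itself describes a quantitative mechanistic model with Bayesian inference; the code below only treats AND/OR gene regulatory strategies as logic gates over three hypothetical TFs (A, B, C) and lists the TF activity patterns under which two candidate strategies give different outputs. All names (`GRS`, `distinguishing_conditions`) are assumptions made for this example.

```python
from itertools import product

# Hypothetical Boolean stand-ins for gene regulatory strategies (GRSs).
GRS = {
    "A AND B": lambda a, b, c: a and b,
    "A OR B": lambda a, b, c: a or b,
    "(A AND B) OR C": lambda a, b, c: (a and b) or c,
    "(A OR B) AND C": lambda a, b, c: (a or b) and c,
}


def distinguishing_conditions(name1, name2):
    """Return the TF activity patterns (a, b, c) where two GRSs disagree."""
    g1, g2 = GRS[name1], GRS[name2]
    return [
        abc
        for abc in product([False, True], repeat=3)
        if bool(g1(*abc)) != bool(g2(*abc))
    ]


# AND and OR gates on A and B differ only when exactly one of A, B is active.
and_vs_or = distinguishing_conditions("A AND B", "A OR B")
```

In this toy setting, a stimulus panel that always activates A and B together could never separate the two gates, which is one intuition for why the choice of perturbations matters.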
Affiliation(s)
- Ning Wang
- Institute for Quantitative and Computational Biosciences (QCBio), University of California, Los Angeles, California, United States of America
- Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, California, United States of America
- Interdepartmental Program in Bioinformatics, University of California, Los Angeles, California, United States of America
| | - Diane Lefaudeux
- Institute for Quantitative and Computational Biosciences (QCBio), University of California, Los Angeles, California, United States of America
- Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, California, United States of America
| | - Anup Mazumder
- Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, California, United States of America
| | - Jingyi Jessica Li
- Institute for Quantitative and Computational Biosciences (QCBio), University of California, Los Angeles, California, United States of America
- Department of Statistics, University of California, Los Angeles, California, United States of America
| | - Alexander Hoffmann
- Institute for Quantitative and Computational Biosciences (QCBio), University of California, Los Angeles, California, United States of America
- Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, California, United States of America
- * E-mail:
|
32
|
Asif M, Orenstein Y. DeepSELEX: inferring DNA-binding preferences from HT-SELEX data using multi-class CNNs. Bioinformatics 2020; 36:i634-i642. [PMID: 33381817 DOI: 10.1093/bioinformatics/btaa789] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 11/15/2022] Open
Abstract
MOTIVATION: Transcription factor (TF) DNA-binding is a central mechanism in gene regulation. Biologists would like to know where and when these factors bind DNA. Hence, they require accurate DNA-binding models to enable binding prediction to any DNA sequence. Recent technological advancements measure the binding of a single TF to thousands of DNA sequences. One of the prevailing techniques, high-throughput SELEX, measures protein-DNA binding by high-throughput sequencing over several cycles of enrichment. Unfortunately, current computational methods to infer the binding preferences from high-throughput SELEX data do not exploit the richness of these data, and are under-using the most advanced computational technique, deep neural networks. RESULTS: To better characterize the binding preferences of TFs from these experimental data, we developed DeepSELEX, a new algorithm to infer intrinsic DNA-binding preferences using deep neural networks. DeepSELEX takes advantage of the richness of high-throughput sequencing data and learns the DNA-binding preferences by observing the changes in DNA sequences through the experimental cycles. DeepSELEX outperforms extant methods for the task of DNA-binding inference from high-throughput SELEX data in binding prediction in vitro and is on par with the state of the art in in vivo binding prediction. Analysis of model parameters reveals it learns biologically relevant features that shed light on TFs' binding mechanism. AVAILABILITY AND IMPLEMENTATION: DeepSELEX is available through github.com/OrensteinLab/DeepSELEX/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Maor Asif
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
| | - Yaron Orenstein
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
|
33
|
Blakely D, Collins E, Singh R, Norton A, Lanchantin J, Qi Y. FastSK: fast sequence analysis with gapped string kernels. Bioinformatics 2020; 36:i857-i865. [PMID: 33381828 DOI: 10.1093/bioinformatics/btaa817] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Accepted: 09/08/2020] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task's alphabet size. RESULTS: In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines. AVAILABILITY AND IMPLEMENTATION: Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
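The counting decomposition described in this abstract can be sketched as follows. This is a minimal, exact gapped k-mer kernel written for illustration only: it is not the FastSK package API, it does not include the Monte Carlo approximation, and the function name `gapped_kmer_kernel` and the parameters `g` (window length) and `k` (number of informative positions) are assumptions made here. For each choice of `k` kept positions within a window of length `g`, it counts the projected k-mers in both sequences and sums the products of matching counts.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_kernel(x, y, g=4, k=3):
    """Exact gapped k-mer kernel between sequences x and y (illustrative)."""
    total = 0
    # Each choice of kept positions is an independent counting problem.
    for keep in combinations(range(g), k):
        counts_x = Counter(
            tuple(x[i + j] for j in keep) for i in range(len(x) - g + 1)
        )
        counts_y = Counter(
            tuple(y[i + j] for j in keep) for i in range(len(y) - g + 1)
        )
        # Count pairs of windows that agree on the kept positions.
        total += sum(n * counts_y[feat] for feat, n in counts_x.items())
    return total


same = gapped_kmer_kernel("ACGT", "ACGT")      # all 4 position choices match
one_off = gapped_kmer_kernel("ACGT", "ACGA")   # only the mask dropping the last base matches
```

Because the outer loop treats every position choice independently, one could approximate the sum by sampling a subset of position choices, which is the general idea the abstract attributes to the Monte Carlo variant.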
Affiliation(s)
- Derrick Blakely
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
| | - Eamon Collins
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
| | - Ritambhara Singh
- Center for Computational Molecular Biology, Brown University, Providence, RI, USA
| | - Andrew Norton
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
| | - Jack Lanchantin
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
| | - Yanjun Qi
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
|
34
|
Abstract
Spatiotemporal control of gene expression during development requires orchestrated activities of numerous enhancers, which are cis-regulatory DNA sequences that, when bound by transcription factors, support selective activation or repression of associated genes. Proper activation of enhancers is critical during embryonic development, adult tissue homeostasis, and regeneration, and inappropriate enhancer activity is often associated with pathological conditions such as cancer. Multiple consortia [e.g., the Encyclopedia of DNA Elements (ENCODE) Consortium and National Institutes of Health Roadmap Epigenomics Mapping Consortium] and independent investigators have mapped putative regulatory regions in a large number of cell types and tissues, but the sequence determinants of cell-specific enhancers are not yet fully understood. Machine learning approaches trained on large sets of these regulatory regions can identify core transcription factor binding sites and generate quantitative predictions of enhancer activity and the impact of sequence variants on activity. Here, we review these computational methods in the context of enhancer prediction and gene regulatory network models specifying cell fate.
Affiliation(s)
- Michael A Beer
- Department of Biomedical Engineering and McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21205, USA
| | - Dustin Shigaki
- Department of Biomedical Engineering and McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21205, USA
|
35
|
Mahood EH, Kruse LH, Moghe GD. Machine learning: A powerful tool for gene function prediction in plants. Applications in Plant Sciences 2020; 8:e11376. [PMID: 32765975 PMCID: PMC7394712 DOI: 10.1002/aps3.11376] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Received: 10/01/2019] [Accepted: 03/19/2020] [Indexed: 05/06/2023]
Abstract
Recent advances in sequencing and informatic technologies have led to a deluge of publicly available genomic data. While it is now relatively easy to sequence, assemble, and identify genic regions in diploid plant genomes, functional annotation of these genes is still a challenge. Over the past decade, there has been a steady increase in studies utilizing machine learning algorithms for various aspects of functional prediction, because these algorithms are able to integrate large amounts of heterogeneous data and detect patterns inconspicuous through rule-based approaches. The goal of this review is to introduce experimental plant biologists to machine learning, by describing how it is currently being used in gene function prediction to gain novel biological insights. In this review, we discuss specific applications of machine learning in identifying structural features in sequenced genomes, predicting interactions between different cellular components, and predicting gene function and organismal phenotypes. Finally, we also propose strategies for stimulating functional discovery using machine learning-based approaches in plants.
Affiliation(s)
- Elizabeth H. Mahood
- Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York 14853, USA
| | - Lars H. Kruse
- Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York 14853, USA
| | - Gaurav D. Moghe
- Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York 14853, USA
|
36
|
Panchy NL, Lloyd JP, Shiu SH. Improved recovery of cell-cycle gene expression in Saccharomyces cerevisiae from regulatory interactions in multiple omics data. BMC Genomics 2020; 21:159. [PMID: 32054475 PMCID: PMC7020519 DOI: 10.1186/s12864-020-6554-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/02/2019] [Accepted: 02/04/2020] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND: Gene expression is regulated by DNA-binding transcription factors (TFs). Together with their target genes, these factors and their interactions collectively form a gene regulatory network (GRN), which is responsible for producing patterns of transcription, including cyclical processes such as genome replication and cell division. However, identifying how this network regulates the timing of these patterns, including important interactions and regulatory motifs, remains a challenging task. RESULTS: We employed four in vivo and in vitro regulatory data sets to investigate the regulatory basis of expression timing and phase-specific patterns of cell-cycle expression in Saccharomyces cerevisiae. Specifically, we considered interactions based on direct binding between TF and target gene, indirect effects of TF deletion on gene expression, and computational inference. We found that the source of regulatory information significantly impacts the accuracy and completeness of recovering known cell-cycle expressed genes. The best approach involved combining TF-target and TF-TF interaction features from multiple datasets in a single model. In addition, TFs important to multiple phases of cell-cycle expression also have the greatest impact on individual phases. Important TFs regulating a cell-cycle phase also tend to form modules in the GRN, including two sub-modules composed entirely of unannotated cell-cycle regulators (STE12-TEC1 and RAP1-HAP1-MSN4). CONCLUSION: Our findings illustrate the importance of integrating both multiple omics data and regulatory motifs in order to understand the significance of regulatory interactions involved in timing gene expression. This integrated approach allowed us to recover both known cell-cycle interactions and the overall pattern of phase-specific expression across the cell cycle better than any single data set. 
Likewise, by looking at regulatory motifs in the form of TF-TF interactions, we identified sets of TFs whose co-regulation of target genes was important for cell-cycle expression, even when regulation by individual TFs was not. Overall, this demonstrates the power of integrating multiple data sets and models of interaction in order to understand the regulatory basis of established biological processes and their associated gene regulatory networks.
Affiliation(s)
- Nicholas L Panchy
- Genetics Graduate Program, Michigan State University, East Lansing, MI, 48824, USA; Present address: National Institute for Mathematical and Biological Synthesis, University of Tennessee, 1122 Volunteer Blvd., Suite 106, Knoxville, TN, 37996-3410, USA
| | - John P Lloyd
- Department of Human Genetics and Internal Medicine, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Shin-Han Shiu
- Genetics Graduate Program, Michigan State University, East Lansing, MI, 48824, USA; Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA; Michigan State University, Plant Biology Laboratories, 612 Wilson Road, Room 166, East Lansing, MI, 48824-1312, USA
|
37
|
de Jongh RP, van Dijk AD, Julsing MK, Schaap PJ, de Ridder D. Designing Eukaryotic Gene Expression Regulation Using Machine Learning. Trends Biotechnol 2020; 38:191-201. [DOI: 10.1016/j.tibtech.2019.07.007] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 05/08/2019] [Revised: 07/12/2019] [Accepted: 07/19/2019] [Indexed: 12/11/2022]
|
38
|
Ren J, Lee J, Na D. Recent advances in genetic engineering tools based on synthetic biology. J Microbiol 2020; 58:1-10. [PMID: 31898252 DOI: 10.1007/s12275-020-9334-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 06/03/2019] [Revised: 08/19/2019] [Accepted: 11/05/2019] [Indexed: 12/26/2022]
Abstract
Genome-scale engineering is a crucial methodology to rationally regulate microbiological system operations, leading to expected biological behaviors or enhanced bioproduct yields. Over the past decade, innovative genome modification technologies have been developed for effectively regulating and manipulating genes at the genome level. Here, we discuss the current genome-scale engineering technologies used for microbial engineering. Recently developed strategies, such as clustered regularly interspaced short palindromic repeats (CRISPR)-Cas9, multiplex automated genome engineering (MAGE), promoter engineering, CRISPR-based regulations, and synthetic small regulatory RNA (sRNA)-based knockdown, are considered as powerful tools for genome-scale engineering in microbiological systems. MAGE, which modifies specific nucleotides of the genome sequence, is utilized as a genome-editing tool. Contrastingly, synthetic sRNA, CRISPRi, and CRISPRa are mainly used to regulate gene expression without modifying the genome sequence. This review introduces the recent genome-scale editing and regulating technologies and their applications in metabolic engineering.
Affiliation(s)
- Jun Ren
- School of Integrative Engineering, Chung-Ang University, Seoul, 06974, Republic of Korea
| | - Jingyu Lee
- School of Integrative Engineering, Chung-Ang University, Seoul, 06974, Republic of Korea
| | - Dokyun Na
- School of Integrative Engineering, Chung-Ang University, Seoul, 06974, Republic of Korea.
|
39
|
Identification and Characterization of Cis-Regulatory Elements for Photoreceptor-Type-Specific Transcription in ZebraFish. Methods Mol Biol 2020; 2092:123-145. [PMID: 31786786 DOI: 10.1007/978-1-0716-0175-4_10] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/06/2022]
Abstract
Tissue-specific or cell-type-specific transcription of protein-coding genes is controlled by both trans-regulatory elements (TREs) and cis-regulatory elements (CREs). However, it is challenging to identify TREs and CREs, which are unknown for most genes. Here, we describe a protocol for identifying two types of transcription-activating CREs, core promoters and enhancers, of zebrafish photoreceptor type-specific genes. This protocol is composed of three phases: bioinformatic prediction, experimental validation, and characterization of the CREs. To better illustrate the principles and logic of this protocol, we exemplify it with the discovery of the core promoter and enhancer of the mpp5b apical polarity gene (also known as ponli), whose red, green, and blue (RGB) cone-specific transcription requires its enhancer, a member of the rainbow enhancer family. While exemplified with an RGB-cone-specific gene, this protocol is general and can be used to identify the core promoters and enhancers of other protein-coding genes.
|
40
|
Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat Biotechnol 2019; 38:56-65. [PMID: 31792407 PMCID: PMC6954276 DOI: 10.1038/s41587-019-0315-8] [Citation(s) in RCA: 133] [Impact Index Per Article: 26.6] [Received: 01/24/2019] [Accepted: 10/16/2019] [Indexed: 11/26/2022]
Abstract
How transcription factors (TFs) interpret cis-regulatory DNA sequence to control gene expression remains unclear, largely because past studies using native and engineered sequences had insufficient scale. Here, we measure the expression output of >100 million synthetic yeast promoter sequences that are fully random. These sequences yield diverse, reproducible expression levels that can be explained by their chance inclusion of functional TF binding sites. We use machine learning to build interpretable models of transcriptional regulation that predict ~94% of the expression driven from independent test promoters and ~89% of the expression driven from native yeast promoter fragments. These models allow us to characterize each TF’s specificity, activity, and interactions with chromatin. TF activity depends on binding-site strand, position, DNA helical face and chromatin context. Notably, expression level is influenced by weak regulatory interactions, which confound designed-sequence studies. Our analyses show that massive-throughput assays of fully random DNA can provide the big data necessary to develop complex, predictive models of gene regulation. Gene expression levels in yeast are predicted using a massive dataset on promoters with random sequences.
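The "chance inclusion" argument in this abstract can be made concrete with a little arithmetic. The sketch below estimates how many exact matches to one fixed k-mer a fully random sequence is expected to contain, assuming independent, uniformly distributed bases. The promoter length of 80 and the motif length of 6 are illustrative assumptions, not values taken from the abstract, and real TF binding sites are degenerate rather than single k-mers.

```python
def expected_sites(length, k, both_strands=True):
    """Expected exact matches to one fixed k-mer in a uniform random sequence.

    Assumes independent bases with probability 1/4 each; palindromic
    motifs are not treated specially when both strands are counted.
    """
    windows = max(length - k + 1, 0)   # number of start positions
    per_strand = windows / 4 ** k      # each window matches with prob 4^-k
    return 2 * per_strand if both_strands else per_strand


# Hypothetical numbers: an 80-bp random promoter and a 6-bp motif.
per_promoter = expected_sites(length=80, k=6)
print(f"expected sites per promoter: {per_promoter:.4f}")
```

Because the expectation per sequence is small, a library has to be very large before every short motif is sampled in many sequence contexts, which is in line with the scale argument the abstract makes.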
|
41
|
Read DF, Cook K, Lu YY, Le Roch KG, Noble WS. Predicting gene expression in the human malaria parasite Plasmodium falciparum using histone modification, nucleosome positioning, and 3D localization features. PLoS Comput Biol 2019; 15:e1007329. [PMID: 31509524 PMCID: PMC6756558 DOI: 10.1371/journal.pcbi.1007329] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 02/11/2019] [Revised: 09/23/2019] [Accepted: 08/12/2019] [Indexed: 12/02/2022] Open
Abstract
Empirical evidence suggests that the malaria parasite Plasmodium falciparum employs a broad range of mechanisms to regulate gene transcription throughout the organism's complex life cycle. To better understand this regulatory machinery, we assembled a rich collection of genomic and epigenomic data sets, including information about transcription factor (TF) binding motifs, patterns of covalent histone modifications, nucleosome occupancy, GC content, and global 3D genome architecture. We used these data to train machine learning models to discriminate between high-expression and low-expression genes, focusing on three distinct stages of the red blood cell phase of the Plasmodium life cycle. Our results highlight the importance of histone modifications and 3D chromatin architecture in Plasmodium transcriptional regulation and suggest that AP2 transcription factors may play a limited regulatory role, perhaps operating in conjunction with epigenetic factors.
Affiliation(s)
- David F. Read
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
| | - Kate Cook
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
| | - Yang Y. Lu
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
| | - Karine G. Le Roch
- Department of Molecular, Cell and Systems Biology, University of California, Riverside, California, United States of America
| | - William Stafford Noble
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
|
42
|
Miskovic L, Béal J, Moret M, Hatzimanikatis V. Uncertainty reduction in biochemical kinetic models: Enforcing desired model properties. PLoS Comput Biol 2019; 15:e1007242. [PMID: 31430276 PMCID: PMC6716680 DOI: 10.1371/journal.pcbi.1007242] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 02/19/2019] [Revised: 08/30/2019] [Accepted: 07/03/2019] [Indexed: 11/18/2022] Open
Abstract
A persistent obstacle for constructing kinetic models of metabolism is uncertainty in the kinetic properties of enzymes. Currently, available methods for building kinetic models can cope indirectly with uncertainties by integrating data from different biological levels and origins into models. In this study, we use the recently proposed computational approach iSCHRUNK (in Silico Approach to Characterization and Reduction of Uncertainty in the Kinetic Models), which combines Monte Carlo parameter sampling methods and machine learning techniques, in the context of Bayesian inference. Monte Carlo parameter sampling methods allow us to exploit synergies between different data sources and generate a population of kinetic models that are consistent with the available data and physicochemical laws. The machine learning allows us to data-mine the a priori generated kinetic parameters together with the integrated datasets and derive posterior distributions of kinetic parameters consistent with the observed physiology. In this work, we used iSCHRUNK to address a design question: can we identify which are the kinetic parameters and what are their values that give rise to a desired metabolic behavior? Such information is important for a wide variety of studies ranging from biotechnology to medicine. To illustrate the proposed methodology, we performed Metabolic Control Analysis, computed the flux control coefficients of the xylose uptake (XTR), and identified parameters that ensure a rate improvement of XTR in a glucose-xylose co-utilizing S. cerevisiae strain. Our results indicate that only three kinetic parameters need to be accurately characterized to describe the studied physiology, and ultimately to design and control the desired responses of the metabolism. 
This framework paves the way for a new generation of methods that will systematically integrate the wealth of available omics data and efficiently extract the information necessary for metabolic engineering and synthetic biology decisions. Kinetic models are the most promising tool for understanding the complex dynamic behavior of living cells. The primary goal of kinetic models is to capture the properties of the metabolic networks as a whole, and thus we need large-scale models for dependable in silico analyses of metabolism. However, uncertainty in kinetic parameters impedes the development of kinetic models, and uncertainty levels increase with the model size. Tools that will address the issues with parameter uncertainty and that will be able to reduce the uncertainty propagation through the system are therefore needed. In this work, we applied a method called iSCHRUNK that combines parameter sampling and machine learning techniques to characterize the uncertainties and uncover intricate relationships between the parameters of kinetic models and the responses of the metabolic network. The proposed method allowed us to identify a small number of parameters that determine the responses in the network regardless of the values of other parameters. As a consequence, in future studies of metabolism, it will be sufficient to explore a reduced kinetic space, and more comprehensive analyses of large-scale and genome-scale metabolic networks will be computationally tractable.
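The flux control coefficients mentioned in this abstract are scaled sensitivities of a steady-state flux J to an enzyme level e_i, that is, d ln J / d ln e_i. The sketch below estimates them by finite differences for a hypothetical two-step toy pathway. The flux expression, the enzyme levels, and all names are assumptions for illustration only; this is not the iSCHRUNK method or the S. cerevisiae model from the paper.

```python
import math


def flux(e1, e2):
    """Toy steady-state flux of a two-step pathway (hypothetical model)."""
    return e1 * e2 / (e1 + e2)


def control_coefficient(flux_fn, levels, index, rel_step=1e-6):
    """Estimate d ln J / d ln e_i with a central finite difference."""
    up, down = list(levels), list(levels)
    up[index] *= 1 + rel_step
    down[index] *= 1 - rel_step
    d_ln_flux = math.log(flux_fn(*up)) - math.log(flux_fn(*down))
    d_ln_level = math.log(1 + rel_step) - math.log(1 - rel_step)
    return d_ln_flux / d_ln_level


levels = [1.0, 3.0]  # hypothetical enzyme levels
coeffs = [control_coefficient(flux, levels, i) for i in range(len(levels))]
print(coeffs)        # the scarcer enzyme carries most of the control
print(sum(coeffs))   # close to 1 for this toy flux (summation theorem)
```

For this toy flux the coefficients sum to one because J scales linearly when both enzyme levels are scaled together, which is the summation theorem of Metabolic Control Analysis.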
Affiliation(s)
- Ljubisa Miskovic
- Laboratory of Computational Systems Biology (LCSB), EPFL, CH, Lausanne, Switzerland
| | - Jonas Béal
- Master's Program in Life Sciences and Technology, EPFL, CH, Lausanne, Switzerland
| | - Michael Moret
- Master's Program in Life Sciences and Technology, EPFL, CH, Lausanne, Switzerland
|
43
|
Dossa K, Mmadi MA, Zhou R, Zhang T, Su R, Zhang Y, Wang L, You J, Zhang X. Depicting the Core Transcriptome Modulating Multiple Abiotic Stresses Responses in Sesame (Sesamum indicum L.). Int J Mol Sci 2019; 20:ijms20163930. [PMID: 31412539 PMCID: PMC6721054 DOI: 10.3390/ijms20163930] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Received: 07/08/2019] [Revised: 07/26/2019] [Accepted: 08/10/2019] [Indexed: 01/21/2023] Open
Abstract
Sesame is a source of a healthy vegetable oil, attracting a growing interest worldwide. Abiotic stresses have devastating effects on sesame yield; hence, studies have been performed to understand sesame molecular responses to abiotic stresses, but the core abiotic stress-responsive genes (CARG) that the plant reuses in response to an array of environmental stresses are unknown. We performed a meta-analysis of 72 RNA-Seq datasets from drought, waterlogging, salt and osmotic stresses and identified 543 genes constantly and differentially expressed in response to all stresses, representing the sesame CARG. Weighted gene co-expression network analysis of the CARG revealed three functional modules controlled by key transcription factors. Except for salt stress, the modules were positively correlated with the abiotic stresses. Network topology of the modules showed several hub genes predicted to play prominent functions. As proof of concept, we generated over-expressing Arabidopsis lines with hub and non-hub genes. Transgenic plants performed better under drought, waterlogging, and osmotic stresses than the wild-type plants but did not tolerate the salt treatment. As expected, the hub gene was significantly more potent than the non-hub gene. Overall, we discovered several novel candidate genes, which will fuel investigations on plant responses to multiple abiotic stresses.
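The core gene set described in this abstract is, at its simplest, the set of genes that are differentially expressed under every stress. A minimal sketch of that step follows, assuming each stress has already been reduced to a list of differentially expressed gene IDs. The gene IDs and the function name are hypothetical, and the actual meta-analysis pipeline may apply further filters (for example on direction or consistency of change) that are not shown here.

```python
def core_genes(de_by_stress):
    """Genes that appear in the differentially expressed list of every stress."""
    gene_sets = [set(genes) for genes in de_by_stress.values()]
    return set.intersection(*gene_sets) if gene_sets else set()


# Hypothetical gene IDs, for illustration only.
de_by_stress = {
    "drought": ["g1", "g2", "g3", "g7"],
    "waterlogging": ["g2", "g3", "g5"],
    "salt": ["g2", "g3", "g9"],
    "osmotic": ["g0", "g2", "g3"],
}
print(sorted(core_genes(de_by_stress)))
```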
Affiliation(s)
- Komivi Dossa
- Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops, Ministry of Agriculture, Wuhan 430062, China.
| | - Marie A Mmadi
- Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops, Ministry of Agriculture, Wuhan 430062, China
| | - Rong Zhou
- Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops, Ministry of Agriculture, Wuhan 430062, China
| | - Tianyuan Zhang
- State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan 430070, China
| | - Ruqi Su
- Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops, Ministry of Agriculture, Wuhan 430062, China
| | - Yujuan Zhang
- Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops, Ministry of Agriculture, Wuhan 430062, China
| | - Linhai Wang
- Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops, Ministry of Agriculture, Wuhan 430062, China
| | - Jun You
- Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops, Ministry of Agriculture, Wuhan 430062, China
| | - Xiurong Zhang
- Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops, Ministry of Agriculture, Wuhan 430062, China.
|
44
|
Bayrak T, Oğul H. A New Approach for Predicting the Value of Gene Expression: Two-way Collaborative Filtering. Curr Bioinform 2019. [DOI: 10.2174/1574893614666190126144139] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 01/16/2023]
Abstract
Background: Predicting the value of gene expression in a given condition is a challenging topic in computational systems biology. Only a limited number of studies in this area have provided solutions to predict the expression in a particular pattern, whether or not it can be done effectively. However, the value of expression for the measurement is usually needed for further meta-data analysis.
Methods: Because the problem is considered as a regression task where a feature representation of the gene under consideration is fed into a trained model to predict a continuous variable that refers to its exact expression level, we introduced a novel feature representation scheme to support work on such a task based on two-way collaborative filtering. At this point, our main argument is that the expressions of other genes in the current condition are as important as the expression of the current gene in other conditions. For regression analysis, linear regression and a recently popularized method, called Relevance Vector Machine (RVM), are used. Pearson and Spearman correlation coefficients and Root Mean Squared Error are used for evaluation. The effects of regression model type, RVM kernel functions, and parameters have been analysed in our study in a gene expression profiling data comprising a set of prostate cancer samples.
Results: According to the findings of this study, in addition to promising results from the experimental studies, integrating data from another disease type, such as colon cancer in our case, can significantly improve the prediction performance of the regression model.
Conclusion: The results also showed that the performed new feature representation approach and RVM regression model are promising for many machine learning problems in microarray and high-throughput sequencing analysis.
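The three evaluation measures named in the Methods (Pearson correlation, Spearman correlation, and Root Mean Squared Error) can be sketched in plain Python as below. The measured and predicted values are invented for illustration and are not results from the paper; Spearman is computed here as the Pearson correlation of average ranks.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based average ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical expression values, for illustration only.
measured = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 9.5]
print(pearson(measured, predicted), spearman(measured, predicted), rmse(measured, predicted))
```

Reporting all three is a reasonable design choice: the correlations capture agreement in trend and ordering, while RMSE captures the absolute size of the prediction error.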
Affiliation(s)
- Tuncay Bayrak
- Computer Engineering Department, Baskent University, Eskisehir Road 20. Km Baglica Campus, 06560, Ankara, Turkey
| | - Hasan Oğul
- Computer Engineering Department, Baskent University, Eskisehir Road 20. Km Baglica Campus, 06560, Ankara, Turkey
|
45
|
Liao CC, Chen LJ, Lo SF, Chen CW, Chu YW. EAT-Rice: A predictive model for flanking gene expression of T-DNA insertion activation-tagged rice mutants by machine learning approaches. PLoS Comput Biol 2019; 15:e1006942. [PMID: 31067213 PMCID: PMC6505892 DOI: 10.1371/journal.pcbi.1006942] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/12/2018] [Accepted: 03/09/2019] [Indexed: 11/17/2022] Open
Abstract
T-DNA activation-tagging technology is widely used to study rice gene functions. When T-DNA inserts into the genome, the expression of flanking genes may be altered by the CaMV 35S enhancer, but the affected genes still need to be validated by biological experiments. We have developed the EAT-Rice platform to predict the flanking gene expression of T-DNA insertion sites in rice mutants. Three kinds of DNA sequences, UPS1K, DISTANCE, and MIDDLE, were retrieved and encoded to build a two-layer machine learning prediction model. In the first-layer models, the features nucleotide context (N-gram), cis-regulatory elements (Motif), nucleotide physicochemical properties (NPC), and CG-island (CGI) were used to build SVM models by analysing the information embedded within the three kinds of sequences. Logistic regression was used to estimate the probability of gene activation, which served as the feature-encoding weighting within the first-layer models. In the second-layer models, the NaiveBayesUpdateable algorithm was used to integrate these first-layer models, and the system performance was 88.33% on 5-fold cross-validation and 79.17% on independent testing. Among the three kinds of sequences, the model constructed from MIDDLE contributed most to identifying the activated genes. The EAT-Rice system provided better performance and gene expression prediction at further distances when compared to the TRIM database. An online server based on EAT-Rice is available at http://predictor.nchu.edu.tw/EAT-Rice.
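Of the first-layer features listed in this abstract, the nucleotide context (N-gram) encoding is the easiest to illustrate. The sketch below turns a DNA sequence into relative frequencies of all n-grams over A, C, G, and T. The choice of n = 2, the normalisation, and the function name are assumptions for illustration; the abstract does not specify how EAT-Rice encodes its N-gram features.

```python
from collections import Counter
from itertools import product


def ngram_features(seq, n=2):
    """Relative frequency of every A/C/G/T n-gram, in alphabetical order.

    Windows containing other symbols (such as N) count towards the total
    but are not reported as features.
    """
    seq = seq.upper()
    windows = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    counts = Counter(windows)
    total = max(len(windows), 1)       # avoid division by zero on short input
    return {
        "".join(gram): counts["".join(gram)] / total
        for gram in product("ACGT", repeat=n)
    }


features = ngram_features("ACGTACGT", n=2)
print(len(features), features["AC"], features["TA"])
```

A fixed-length vector like this can be fed directly to a classifier such as an SVM, regardless of how long the original sequence was.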
Affiliation(s)
- Chi-Chou Liao
- Institute of Molecular Biology, National Chung Hsing University, Taichung, Taiwan
| | - Liang-Jwu Chen
- Institute of Molecular Biology, National Chung Hsing University, Taichung, Taiwan; Advanced Plant Biotechnology Center National Chung Hsing University, Taichung, Taiwan
| | - Shuen-Fang Lo
- Agricultural Biotechnology Center, National Chung Hsing University, Taichung, Taiwan; Institute of Molecular Biology, Academia Sinica, Taipei, Taiwan
| | - Chi-Wei Chen
- Department of Computer Science and Engineering, National Chung Hsing University, Taichung, Taiwan
| | - Yen-Wei Chu
- Institute of Molecular Biology, National Chung Hsing University, Taichung, Taiwan; Agricultural Biotechnology Center, National Chung Hsing University, Taichung, Taiwan; Biotechnology Center, National Chung Hsing University, Taichung, Taiwan; Ph.D. Program in Translational Medicine, National Chung Hsing University, Taichung, Taiwan; Rong Hsing Research Center For Translational Medicine, National Chung Hsing University, Taichung, Taiwan
|
46
|
Kabir MH, O'Connor MD. Stems cells, big data and compendium-based analyses for identifying cell types, signalling pathways and gene regulatory networks. Biophys Rev 2019; 11:41-50. [PMID: 30684132 DOI: 10.1007/s12551-018-0486-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 10/23/2018] [Accepted: 11/15/2018] [Indexed: 01/31/2023] Open
Abstract
Identification of new drug and cell therapy targets for disease treatment will be facilitated by a detailed molecular understanding of normal and disease development. Human pluripotent stem cells can provide a large in vitro source of human cell types and, in a growing number of instances, also three-dimensional multicellular tissues called organoids. The application of stem cell technology to discovery and development of new therapies will be aided by detailed molecular characterisation of cell identity, cell signalling pathways and target gene networks. Big data or 'omics' techniques, particularly transcriptomics and proteomics, facilitate cell and tissue characterisation using thousands to tens-of-thousands of genes or proteins. These gene and protein profiles are analysed using existing and/or emergent bioinformatics methods, including a growing number of methods that compare sample profiles against compendia of reference samples. This review assesses how compendium-based analyses can aid the application of stem cell technology for new therapy development. This includes robust definition of differentiated stem cell identity, as well as elucidation of complex signalling pathways and target gene networks involved in normal and diseased states.
Affiliation(s)
- Md Humayun Kabir
- School of Medicine, Western Sydney University, Campbelltown, NSW, Australia; Department of Computer Science and Engineering, University of Rajshahi, Rajshahi, Bangladesh
| | - Michael D O'Connor
- School of Medicine, Western Sydney University, Campbelltown, NSW, Australia; Medical Sciences Research Group, Western Sydney University, Campbelltown, NSW, Australia.
|
47
|
Samee MAH, Bruneau BG, Pollard KS. A De Novo Shape Motif Discovery Algorithm Reveals Preferences of Transcription Factors for DNA Shape Beyond Sequence Motifs. Cell Syst 2019; 8:27-42.e6. [PMID: 30660610 PMCID: PMC6368855 DOI: 10.1016/j.cels.2018.12.001] [Citation(s) in RCA: 43] [Impact Index Per Article: 8.6] [Received: 03/09/2018] [Revised: 08/18/2018] [Accepted: 12/03/2018] [Indexed: 12/17/2022]
Abstract
DNA shape adds specificity to sequence motifs but has not been explored systematically outside this context. We hypothesized that DNA-binding proteins (DBPs) preferentially occupy DNA with specific structures ("shape motifs") regardless of whether or not these correspond to high information content sequence motifs. We present ShapeMF, a Gibbs sampling algorithm that identifies de novo shape motifs. Using binding data from hundreds of in vivo and in vitro experiments, we show that most DBPs have shape motifs and can occupy these in the absence of sequence motifs. This "shape-only binding" is common for many DBPs and in regions co-bound by multiple DBPs. When shape and sequence motifs co-occur, they can be overlapping, flanking, or separated by consistent spacing. Finally, DBPs within the same protein family have different shape motifs, explaining their distinct genome-wide occupancy despite having similar sequence motifs. These results suggest that shape motifs not only complement sequence motifs but also facilitate recognition of DNA beyond conventionally defined sequence motifs.
Affiliation(s)
| | - Benoit G Bruneau
- Gladstone Institutes, San Francisco, CA 94158, USA; Department of Pediatrics and Cardiovascular Research Institute, University of California, San Francisco, San Francisco, CA 94158, USA
| | - Katherine S Pollard
- Gladstone Institutes, San Francisco, CA 94158, USA; Department of Epidemiology & Biostatistics, Institute for Human Genetics, Quantitative Biology Institute, and Institute for Computational Health Sciences, University of California, San Francisco, San Francisco, CA 94158, USA; Chan-Zuckerberg Biohub, San Francisco, CA 94158, USA.
|
48
|
Mishra B, Kumar N, Mukhtar MS. Systems Biology and Machine Learning in Plant-Pathogen Interactions. Mol Plant Microbe Interact 2019; 32:45-55. [PMID: 30418085 DOI: 10.1094/mpmi-08-18-0221-fi] [Citation(s) in RCA: 47] [Impact Index Per Article: 9.4] [Indexed: 05/02/2023]
Abstract
Systems biology is an inclusive approach to study the static and dynamic emergent properties on a global scale by integrating multiomics datasets to establish qualitative and quantitative associations among multiple biological components. With an abundance of improved high-throughput -omics datasets, network-based analyses and machine learning technologies are playing a pivotal role in the comprehensive understanding of biological systems. Network topological features reveal the most important nodes within a network and prioritize significant molecular components for diverse biological networks, including coexpression, protein-protein interaction, and gene regulatory networks. Machine learning techniques provide enormous predictive power through specific feature extraction from biological data. Deep learning, a subtype of machine learning, has plausible future applications because a domain expert for feature extraction is not needed in this algorithm. Inspired by diverse domains of biology, we here review classic systems biology techniques applied in plant immunity thus far. We also discuss additional advanced approaches in both graph theory and machine learning, which may provide new insights for understanding plant-microbe interactions. Finally, we propose a hybrid approach in plant immune systems that harnesses the power of both network biology and machine learning, with a potential to be applicable to both model systems and agronomically important crop plants.
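One of the simplest network topological features of the kind this abstract refers to is node degree, which is often used as a first pass at finding hub nodes. The sketch below ranks nodes of an undirected network by degree. The edge list and node names are hypothetical, and degree is only one of several centrality measures a review like this might cover.

```python
from collections import defaultdict


def degree(edges):
    """Number of edges touching each node of an undirected network."""
    deg = defaultdict(int)
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return dict(deg)


def hubs(edges, top=1):
    """Nodes with the highest degree; ties are broken alphabetically."""
    deg = degree(edges)
    return sorted(deg, key=lambda node: (-deg[node], node))[:top]


# Hypothetical interaction network, for illustration only.
edges = [("TF1", "geneA"), ("TF1", "geneB"), ("TF1", "geneC"), ("TF2", "geneC")]
print(degree(edges))
print(hubs(edges, top=2))
```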
Affiliation(s)
| | | | - M Shahid Mukhtar
- Department of Biology, and Nutrition Obesity Research Center, University of Alabama at Birmingham, 1300 University Blvd., Birmingham 35294, U.S.A
|
49
|
A Data Adaptive Biological Sequence Representation for Supervised Learning. J Healthc Inform Res 2018; 2:448-471. [DOI: 10.1007/s41666-018-0038-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2017] [Revised: 10/01/2018] [Accepted: 10/02/2018] [Indexed: 11/27/2022]
|
50
|
Raghunath A, Nagarajan R, Sundarraj K, Panneerselvam L, Perumal E. Genome-wide identification and analysis of Nrf2 binding sites - Antioxidant response elements in zebrafish. Toxicol Appl Pharmacol 2018; 360:236-248. [PMID: 30243843 DOI: 10.1016/j.taap.2018.09.013] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 04/28/2018] [Revised: 09/08/2018] [Accepted: 09/13/2018] [Indexed: 12/30/2022]
Abstract
In the post-genomic era, deciphering the Nrf2 binding sites, the antioxidant response elements (AREs) that underlie and govern the Keap1-Nrf2-ARE pathway, a cell survival response pathway to environmental stresses in the vertebrate model system, is an essential task. AREs regulate the transcription of a repertoire of phase II detoxifying and/or oxidative-stress responsive genes, offering protection against toxic chemicals, carcinogens, and xenobiotics. To identify AREs in zebrafish, a pattern search algorithm was developed, and computational tools available online were used to analyze the identified AREs. This study identified the AREs within 30 kb upstream from the transcription start site of antioxidant genes and mitochondrial genes. We report for the first time the AREs of all the known protein coding genes in the zebrafish genome. Western blotting, RT2 profiler array PCR, and qRT-PCR were performed to test whether AREs influence the expression of Nrf2 target genes in zebrafish larvae treated with sulforaphane. This study reveals unique AREs that have not been previously reported in the cytoprotective genes. Nine TGAG/CNNNTC and six TGAG/CNNNGC AREs were observed significantly. Our findings suggest that AREs drive the dynamic transcriptional events of Nrf2 target genes in the zebrafish larvae on exposure to sulforaphane. The identified abundant putative AREs will define the Keap1-Nrf2-ARE network and elucidate the precise regulation of the Nrf2-ARE pathway not only in diseases but also in embryonic development, inflammation, and aerobic respiration. Our results help to understand the dynamic complexity of the Nrf2-ARE system in zebrafish.
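A pattern search of the kind described in this abstract can be sketched with regular expressions. The example below scans an upstream sequence for the two ARE patterns named in the abstract, TGAG/CNNNTC and TGAG/CNNNGC. Reading "G/C" as "G or C" and "N" as any base is an assumption, as are the function name and the example sequence; only the forward strand is scanned, and this is not the authors' algorithm.

```python
import re

# Regex forms of the two patterns named in the abstract. Reading "G/C" as
# "G or C" and "N" as any base is an assumption. The lookahead lets
# overlapping sites be reported as well.
ARE_PATTERNS = {
    "TGAG/CNNNTC": re.compile(r"(?=(TGA[GC][ACGT]{3}TC))"),
    "TGAG/CNNNGC": re.compile(r"(?=(TGA[GC][ACGT]{3}GC))"),
}


def find_ares(upstream_seq):
    """Return (pattern name, 0-based start, matched site) for each hit.

    Only the forward strand is scanned in this sketch.
    """
    seq = upstream_seq.upper()
    hits = []
    for name, pattern in ARE_PATTERNS.items():
        for match in pattern.finditer(seq):
            hits.append((name, match.start(), match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical upstream sequence, for illustration only.
example = "ccTGACtcaGCaaTGAGgggTCtt"
for hit in find_ares(example):
    print(hit)
```

A genome-wide scan would apply the same search to the 30 kb upstream window of each gene and would typically also scan the reverse complement.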
Affiliation(s)
- Azhwar Raghunath
- Molecular Toxicology Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore 641 046, Tamilnadu, India
| | - Raju Nagarajan
- Department of Biotechnology, Indian Institute of Technology Madras, Chennai 600 036, Tamilnadu, India
| | - Kiruthika Sundarraj
- Molecular Toxicology Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore 641 046, Tamilnadu, India
| | - Lakshmikanthan Panneerselvam
- Molecular Toxicology Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore 641 046, Tamilnadu, India
| | - Ekambaram Perumal
- Molecular Toxicology Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore 641 046, Tamilnadu, India.
|