1. Friedman RZ, Ramu A, Lichtarge S, Wu Y, Tripp L, Lyon D, Myers CA, Granas DM, Gause M, Corbo JC, Cohen BA, White MA. Active learning of enhancers and silencers in the developing neural retina. Cell Syst 2025; 16:101163. [PMID: 39778579 DOI: 10.1016/j.cels.2024.12.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2024] [Revised: 10/17/2024] [Accepted: 12/06/2024] [Indexed: 01/11/2025]
Abstract
Deep learning is a promising strategy for modeling cis-regulatory elements. However, models trained on genomic sequences often fail to explain why the same transcription factor can activate or repress transcription in different contexts. To address this limitation, we developed an active learning approach to train models that distinguish between enhancers and silencers composed of binding sites for the photoreceptor transcription factor cone-rod homeobox (CRX). After training the model on nearly all bound CRX sites from the genome, we coupled synthetic biology with uncertainty sampling to generate additional rounds of informative training data. This allowed us to iteratively train models on data from multiple rounds of massively parallel reporter assays. The ability of the resulting models to discriminate between CRX sites with identical sequence but opposite functions establishes active learning as an effective strategy to train models of regulatory DNA. A record of this paper's transparent peer review process is included in the supplemental information.
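The uncertainty-sampling step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a binary enhancer/silencer classifier that outputs one probability per candidate sequence and ranks candidates by predictive entropy. The function names, the entropy criterion, and the toy probabilities are illustrative assumptions, not the authors' implementation.

```python
import math

def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully confident prediction carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def select_most_uncertain(probabilities, batch_size):
    """Return indices of the batch_size candidates with the highest entropy."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: binary_entropy(probabilities[i]),
        reverse=True,
    )
    return ranked[:batch_size]

# Hypothetical model outputs: P(enhancer) for five candidate sequences.
predicted = [0.98, 0.52, 0.10, 0.45, 0.80]
chosen = select_most_uncertain(predicted, batch_size=2)
print(chosen)  # indices of the sequences to assay in the next round
```

In an active learning loop of this kind, the selected sequences would be measured experimentally, added to the training set, and the model retrained before the next selection round.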
Affiliation(s)
- Ryan Z Friedman
- The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA
- Avinash Ramu
- The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA
- Sara Lichtarge
- The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA
- Yawei Wu
- The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA
- Lloyd Tripp
- The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA
- Daniel Lyon
- The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA
- Connie A Myers
- Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO 63110, USA
- David M Granas
- The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA
- Maria Gause
- Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO 63110, USA
- Joseph C Corbo
- Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO 63110, USA
- Barak A Cohen
- The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA
- Michael A White
- The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA.
2. Reimão-Pinto MM, Castillo-Hair SM, Seelig G, Schier AF. The regulatory landscape of 5' UTRs in translational control during zebrafish embryogenesis. Dev Cell 2025:S1534-5807(24)00777-9. [PMID: 39818206 DOI: 10.1016/j.devcel.2024.12.038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Revised: 07/22/2024] [Accepted: 12/19/2024] [Indexed: 01/18/2025]
Abstract
The 5' UTRs of mRNAs are critical for translation regulation during development, but their in vivo regulatory features are poorly characterized. Here, we report the regulatory landscape of 5' UTRs during early zebrafish embryogenesis using a massively parallel reporter assay of 18,154 sequences coupled to polysome profiling. We found that the 5' UTR suffices to confer temporal dynamics to translation initiation and identified 86 motifs enriched in 5' UTRs with distinct ribosome recruitment capabilities. A quantitative deep learning model, Danio Optimus 5-Prime (DaniO5P), identified a combined role for 5' UTR length, translation initiation site context, upstream AUGs, and sequence motifs on ribosome recruitment. DaniO5P predicts the activities of maternal and zygotic 5' UTR isoforms and indicates that modulating 5' UTR length and motif grammar contributes to translation initiation dynamics. This study provides a first quantitative model of 5' UTR-based translation regulation in development and lays the foundation for identifying the underlying molecular effectors.
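The sequence features named in the abstract (5' UTR length, upstream AUGs, and initiation site context) can be made concrete with a small sketch. It assumes the input is the 5' UTR alone, written 5' to 3' and ending immediately before the main start codon, and it summarizes initiation context only by whether the -3 position is a purine. The function name, the returned fields, and the example sequence are hypothetical and are not taken from DaniO5P.

```python
def utr_features(utr):
    """Compute simple features of a 5' UTR given as a DNA or RNA string (5'->3').

    Assumes the string ends right before the main start codon, so every
    AUG found here is an upstream AUG (uAUG).
    """
    seq = utr.upper().replace("T", "U")
    # Start positions (0-based) of every AUG triplet inside the UTR.
    uaug_positions = [i for i in range(len(seq) - 2) if seq[i:i + 3] == "AUG"]
    # Nucleotide at position -3 relative to the main start codon.
    minus3 = seq[-3] if len(seq) >= 3 else None
    return {
        "length": len(seq),
        "n_uaug": len(uaug_positions),
        "uaug_positions": uaug_positions,
        "minus3_is_purine": minus3 in ("A", "G"),
    }

# Hypothetical 21-nt UTR containing two upstream AUGs.
example = utr_features("GGCAUGCCAUGGACCGCCACC")
print(example)
```

Features like these are typically used as simple baselines or interpretability checks alongside a sequence-based model, rather than as a replacement for it.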
Affiliation(s)
- Sebastian M Castillo-Hair
- Department of Electrical & Computer Engineering, University of Washington, Seattle, WA 98195, USA; eScience Institute, University of Washington, Seattle, WA 98195, USA
- Georg Seelig
- Department of Electrical & Computer Engineering, University of Washington, Seattle, WA 98195, USA; Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA 98195, USA
- Alexander F Schier
- Biozentrum, University of Basel, 4056 Basel, Switzerland; Allen Discovery Center for Cell Lineage Tracing, Seattle, WA 98195, USA.
3. Strayer EC, Krishna S, Lee H, Vejnar C, Neuenkirchen N, Gupta A, Beaudoin JD, Giraldez AJ. NaP-TRAP reveals the regulatory grammar in 5'UTR-mediated translation regulation during zebrafish development. Nat Commun 2024; 15:10898. [PMID: 39738051 PMCID: PMC11685710 DOI: 10.1038/s41467-024-55274-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Accepted: 12/06/2024] [Indexed: 01/01/2025] Open
Abstract
The cis-regulatory elements encoded in an mRNA determine its stability and translational output. While there has been a considerable effort to understand the factors driving mRNA stability, the regulatory frameworks governing translational control remain more elusive. We have developed a novel massively parallel reporter assay (MPRA) to measure mRNA translation, named Nascent Peptide Translating Ribosome Affinity Purification (NaP-TRAP). NaP-TRAP measures translation in a frame-specific manner through the immunocapture of epitope tagged nascent peptides of reporter mRNAs. We benchmark NaP-TRAP to polysome profiling and use it to quantify Kozak strength and the regulatory landscapes of 5' UTRs in the developing zebrafish embryo and in human cells. Through this approach we identified general and developmentally dynamic cis-regulatory elements, as well as potential trans-acting proteins. We find that U-rich motifs are general enhancers, and upstream ORFs and GC-rich motifs are global repressors of translation. We also observe a translational switch during the maternal-to-zygotic transition, where C-rich motifs shift from repressors to prominent activators of translation. Conversely, we show that microRNA sites in the 5' UTR repress translation following the zygotic expression of miR-430. Together these results demonstrate that NaP-TRAP is a versatile, accessible, and powerful method to decode the regulatory functions of UTRs across different systems.
Affiliation(s)
- Ethan C Strayer
- Department of Genetics, Yale University, Yale School of Medicine, New Haven, 06510, CT, USA
- Srikar Krishna
- Department of Genetics, Yale University, Yale School of Medicine, New Haven, 06510, CT, USA
- Haejeong Lee
- Department of Genetics, Yale University, Yale School of Medicine, New Haven, 06510, CT, USA
- Charles Vejnar
- Department of Genetics, Yale University, Yale School of Medicine, New Haven, 06510, CT, USA
- Nils Neuenkirchen
- Department of Cell Biology, Yale University, Yale School of Medicine, New Haven, 06510, CT, USA
- Amit Gupta
- Department of Genetics and Genome Sciences, Institute for Systems Genomics, University of Connecticut Health Center, Farmington, CT, USA
- Jean-Denis Beaudoin
- Department of Genetics and Genome Sciences, Institute for Systems Genomics, University of Connecticut Health Center, Farmington, CT, USA.
- Yale Center for RNA Science and Medicine, Yale University, New Haven, 06510, CT, USA.
- Antonio J Giraldez
- Department of Genetics, Yale University, Yale School of Medicine, New Haven, 06510, CT, USA.
- Yale Center for RNA Science and Medicine, Yale University, New Haven, 06510, CT, USA.
- Yale Stem Cell Center, Yale University, Yale School of Medicine, New Haven, 06510, CT, USA.
4. Patil MR, Bihari A. Role of artificial intelligence in cancer detection using protein p53: A Review. Mol Biol Rep 2024; 52:46. [PMID: 39658610 DOI: 10.1007/s11033-024-10051-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2024] [Accepted: 10/22/2024] [Indexed: 12/12/2024]
Abstract
Normal cell development and prevention of tumor formation rely on the tumor-suppressor protein p53, which is encoded by the TP53 gene. The p53 protein plays a vital role in regulating cell growth, DNA repair, and apoptosis (programmed cell death), thereby maintaining the integrity of the genome and preventing the formation of tumors. Since p53 was discovered 43 years ago, many researchers have clarified its functions in the development of tumors. With the support of the protein p53 and targeted artificial intelligence modeling, it will be possible to detect cancer and tumor activity at an early stage. This will open up new research opportunities. In this review article, a comprehensive analysis was conducted on different machine learning techniques utilized in conjunction with the protein p53 to predict cancer. The study examined the types of data incorporated and evaluated the performance of these techniques. The aim was to provide a thorough understanding of the effectiveness of machine learning in predicting cancer using the protein p53.
Affiliation(s)
- Manisha R Patil
- School of Computer Science Engineering and Information System, Vellore Institute of Technology, Vellore, Tamil Nadu, India
- Anand Bihari
- Department of Computational Intelligence, School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India.
5. Li Z, Zhang Y, Peng B, Qin S, Zhang Q, Chen Y, Chen C, Bao Y, Zhu Y, Hong Y, Liu B, Liu Q, Xu L, Chen X, Ma X, Wang H, Xie L, Yao Y, Deng B, Li J, De B, Chen Y, Wang J, Li T, Liu R, Tang Z, Cao J, Zuo E, Mei C, Zhu F, Shao C, Wang G, Sun T, Wang N, Liu G, Ni JQ, Liu Y. A novel interpretable deep learning-based computational framework designed synthetic enhancers with broad cross-species activity. Nucleic Acids Res 2024; 52:13447-13468. [PMID: 39420601 PMCID: PMC11602155 DOI: 10.1093/nar/gkae912] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2024] [Revised: 09/25/2024] [Accepted: 10/03/2024] [Indexed: 10/19/2024] Open
Abstract
Enhancers play a critical role in dynamically regulating spatial-temporal gene expression and establishing cell identity, underscoring the significance of designing them with specific properties for applications in biosynthetic engineering and gene therapy. Despite numerous high-throughput methods facilitating genome-wide enhancer identification, deciphering the sequence determinants of their activity remains challenging. Here, we present the DREAM (DNA cis-Regulatory Elements with controllable Activity design platforM) framework, a novel deep learning-based approach for synthetic enhancer design. Proficient in uncovering subtle and intricate patterns within extensive enhancer screening data, DREAM achieves cutting-edge sequence-based enhancer activity prediction and highlights critical sequence features implicating strong enhancer activity. Leveraging DREAM, we have engineered enhancers that surpass the potency of the strongest enhancer within the Drosophila genome by approximately 3.6-fold. Remarkably, these synthetic enhancers exhibited conserved functionality across species that have diverged for more than a billion years, indicating that DREAM was able to learn highly conserved enhancer regulatory grammar. Additionally, we designed silencers and cell line-specific enhancers using DREAM, demonstrating its versatility. Overall, our study not only introduces an interpretable approach for enhancer design but also lays out a general framework applicable to the design of other types of cis-regulatory elements.
Affiliation(s)
- Zhaohong Li
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Yuanyuan Zhang
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Bo Peng
- Gene Regulatory Lab, School of Basic Medical Sciences, Tsinghua University, NO. 30 Shuangqing road, Haidian district, Beijing 100084, China
- State Key Laboratory of Molecular Oncology, Tsinghua University, NO. 30 Shuangqing road, Haidian district, Beijing 100084, China
- Shenghua Qin
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Qian Zhang
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, NO.1 Beichen West Road, Chaoyang District, Beijing 100101, China
- Yun Chen
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Choulin Chen
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Yongzhou Bao
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Yuqi Zhu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, NO. 7 Pengfei Road, Dapeng District, Shenzhen 518124, China
- Yi Hong
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, NO. 7 Pengfei Road, Dapeng District, Shenzhen 518124, China
- Binghua Liu
- State Key Laboratory of Maricultural Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, NO.106 Nanjing Road, Shinan District, Qingdao, Shandong 266071, China
- Qian Liu
- State Key Laboratory of Maricultural Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, NO.106 Nanjing Road, Shinan District, Qingdao, Shandong 266071, China
- Lingna Xu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Xi Chen
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Xinhao Ma
- College of Grassland Agriculture, National Beef Cattle Improvement Center, College of Animal Science and Technology, Northwest A&F University, NO. 3 Taicheng Road, Yangling District, Yangling, Shaanxi 712100, China
- Hongyan Wang
- State Key Laboratory of Maricultural Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, NO.106 Nanjing Road, Shinan District, Qingdao, Shandong 266071, China
- Long Xie
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Yilong Yao
- Green Healthy Aquaculture Research Center, Kunpeng Institute of Modern Agriculture at Foshan, Chinese Academy of Agricultural Sciences, Building 26 Lihe Technology Park, Auxiliary Road of Xinxi Avenue South, Nanhai District, Foshan 528226, China
- Biao Deng
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Jiaying Li
- Department of Ophthalmology, Beijing Institute of Ophthalmology, Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University, Dongjiaomin lane No1, Dongcheng District, Beijing 100101, China
- Baojun De
- College of Life Sciences, Inner Mongolia Autonomous Region Key Laboratory of Biomanufacturing, Inner Mongolia Agricultural University, NO. 306 Zhaowuda Road, Saihan District, Hohhot 010018, China
- Yuting Chen
- College of Life Sciences, Inner Mongolia Autonomous Region Key Laboratory of Biomanufacturing, Inner Mongolia Agricultural University, NO. 306 Zhaowuda Road, Saihan District, Hohhot 010018, China
- Jing Wang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Tian Li
- College of JUNCAO Science and Ecology, Haixia Institute of Science and Technology, National Engineering Research Center of JUNCAO, Fujian Agriculture and Forestry University (FAFU), NO.15 Shangxiadian Road, Cangshan District, Fuzhou 0350002, China
- Ranran Liu
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Yuanmingyuan West Road NO. 2, Haidian District, Beijing 100193, China
- Zhonglin Tang
- Green Healthy Aquaculture Research Center, Kunpeng Institute of Modern Agriculture at Foshan, Chinese Academy of Agricultural Sciences, Building 26 Lihe Technology Park, Auxiliary Road of Xinxi Avenue South, Nanhai District, Foshan 528226, China
- Junwei Cao
- College of Life Sciences, Inner Mongolia Autonomous Region Key Laboratory of Biomanufacturing, Inner Mongolia Agricultural University, NO. 306 Zhaowuda Road, Saihan District, Hohhot 010018, China
- Erwei Zuo
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Chugang Mei
- College of Grassland Agriculture, National Beef Cattle Improvement Center, College of Animal Science and Technology, Northwest A&F University, NO. 3 Taicheng Road, Yangling District, Yangling, Shaanxi 712100, China
- Fangjie Zhu
- College of JUNCAO Science and Ecology, Haixia Institute of Science and Technology, National Engineering Research Center of JUNCAO, Fujian Agriculture and Forestry University (FAFU), NO.15 Shangxiadian Road, Cangshan District, Fuzhou 0350002, China
- Changwei Shao
- State Key Laboratory of Maricultural Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, NO.106 Nanjing Road, Shinan District, Qingdao, Shandong 266071, China
- Guirong Wang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Tongjun Sun
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, NO. 7 Pengfei Road, Dapeng District, Shenzhen 518124, China
- Ningli Wang
- Department of Ophthalmology, Beijing Institute of Ophthalmology, Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University, Dongjiaomin lane No1, Dongcheng District, Beijing 100101, China
- Gang Liu
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, NO.1 Beichen West Road, Chaoyang District, Beijing 100101, China
- Jian-Quan Ni
- Gene Regulatory Lab, School of Basic Medical Sciences, Tsinghua University, NO. 30 Shuangqing road, Haidian district, Beijing 100084, China
- State Key Laboratory of Molecular Oncology, Tsinghua University, NO. 30 Shuangqing road, Haidian district, Beijing 100084, China
- SXMU-Tsinghua Collaborative Innovation Center for Frontier Medicine, Shanxi Medical University, NO. 56 Xinjian South Road, Yingze District, Taiyuan 030001, China
- Yuwen Liu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Green Healthy Aquaculture Research Center, Kunpeng Institute of Modern Agriculture at Foshan, Chinese Academy of Agricultural Sciences, Building 26 Lihe Technology Park, Auxiliary Road of Xinxi Avenue South, Nanhai District, Foshan 528226, China
6. La Fleur A, Shi Y, Seelig G. Decoding biology with massively parallel reporter assays and machine learning. Genes Dev 2024; 38:843-865. [PMID: 39362779 PMCID: PMC11535156 DOI: 10.1101/gad.351800.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/05/2024]
Abstract
Massively parallel reporter assays (MPRAs) are powerful tools for quantifying the impacts of sequence variation on gene expression. Reading out molecular phenotypes with sequencing enables interrogating the impact of sequence variation beyond genome scale. Machine learning models integrate and codify information learned from MPRAs and enable generalization by predicting sequences outside the training data set. Models can provide a quantitative understanding of cis-regulatory codes controlling gene expression, enable variant stratification, and guide the design of synthetic regulatory elements for applications from synthetic biology to mRNA and gene therapy. This review focuses on cis-regulatory MPRAs, particularly those that interrogate cotranscriptional and post-transcriptional processes: alternative splicing, cleavage and polyadenylation, translation, and mRNA decay.
Affiliation(s)
- Alyssa La Fleur
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
- Yongsheng Shi
- Department of Microbiology and Molecular Genetics, School of Medicine, University of California, Irvine, Irvine, California 92697, USA
- Georg Seelig
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
- Department of Electrical & Computer Engineering, University of Washington, Seattle, Washington 98195, USA
7. Schlusser N, González A, Pandey M, Zavolan M. Current limitations in predicting mRNA translation with deep learning models. Genome Biol 2024; 25:227. [PMID: 39164757 PMCID: PMC11337900 DOI: 10.1186/s13059-024-03369-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2023] [Accepted: 08/07/2024] [Indexed: 08/22/2024] Open
Abstract
BACKGROUND: The design of nucleotide sequences with defined properties is a long-standing problem in bioengineering. An important application is protein expression, be it in the context of research or the production of mRNA vaccines. The rate of protein synthesis depends on the 5' untranslated region (5'UTR) of the mRNAs, and recently, deep learning models were proposed to predict the translation output of mRNAs from the 5'UTR sequence. At the same time, large data sets of endogenous and reporter mRNA translation have become available. RESULTS: In this study, we use complementary data obtained in two different cell types to assess the accuracy and generality of currently available models for predicting translational output. We find that while performing well on the data sets on which they were trained, deep learning models do not generalize well to other data sets, in particular of endogenous mRNAs, which differ in many properties from reporter constructs. CONCLUSIONS: These differences limit the ability of deep learning models to uncover mechanisms of translation control and to predict the impact of genetic variation. We suggest directions that combine high-throughput measurements and machine learning to unravel mechanisms of translation control and improve construct design.
Affiliation(s)
- Niels Schlusser
- Biozentrum, University of Basel, Spitalstrasse 41, 4056, Basel, Switzerland.
- Asier González
- Biozentrum, University of Basel, Spitalstrasse 41, 4056, Basel, Switzerland
- Departament de Bioquímica i Biologia Molecular and Institut de Biotecnologia i Biomedicina, Universitat Autònoma de Barcelona, 08193, Cerdanyola del Vallès, Spain
- Muskan Pandey
- Biozentrum, University of Basel, Spitalstrasse 41, 4056, Basel, Switzerland
- Current address: Institute of Molecular Biology and Biophysics, Department of Biology, ETH Zurich, 8093, Zurich, Switzerland
- Mihaela Zavolan
- Biozentrum, University of Basel, Spitalstrasse 41, 4056, Basel, Switzerland.
8. Xu L, Bai X, Joong Oh E. Strategic approaches for designing yeast strains as protein secretion and display platforms. Crit Rev Biotechnol 2024:1-18. [PMID: 39138023 DOI: 10.1080/07388551.2024.2385996] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Revised: 07/03/2024] [Accepted: 07/04/2024] [Indexed: 08/15/2024]
Abstract
Yeast has been established as a versatile platform for expressing functional molecules, owing to its well-characterized biology and extensive genetic modification tools. Compared to prokaryotic systems, yeast possesses advanced cellular mechanisms that ensure accurate protein folding and post-translational modifications. These capabilities are particularly advantageous for the expression of human-derived functional proteins. However, designing yeast strains as an expression platform for proteins requires the integration of molecular and cellular functions. By delving into the complexities of yeast-based expression systems, this review aims to empower researchers with the knowledge to fully exploit yeast as a functional platform to produce a diverse range of proteins. This review includes an exploration of the host strains, gene cassette structures, as well as considerations for maximizing the efficiency of the expression system. Through this in-depth analysis, the review anticipates stimulating further innovation in the field of yeast biotechnology and protein engineering.
Affiliation(s)
- Luping Xu
- Department of Food Science, Purdue University, West Lafayette, IN, USA
- Whistler Center for Carbohydrate Research, Purdue University, West Lafayette, IN, USA
- Eun Joong Oh
- Department of Food Science, Purdue University, West Lafayette, IN, USA
- Whistler Center for Carbohydrate Research, Purdue University, West Lafayette, IN, USA
| |
|
9
|
Xu L, Liu Y. Identification, Design, and Application of Noncoding Cis-Regulatory Elements. Biomolecules 2024; 14:945. [PMID: 39199333 PMCID: PMC11352686 DOI: 10.3390/biom14080945] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2024] [Revised: 07/25/2024] [Accepted: 07/30/2024] [Indexed: 09/01/2024] Open
Abstract
Cis-regulatory elements (CREs) play a pivotal role in orchestrating interactions with trans-regulatory factors such as transcription factors, RNA-binding proteins, and noncoding RNAs. These interactions are fundamental to the molecular architecture underpinning complex and diverse biological functions in living organisms, facilitating a myriad of sophisticated and dynamic processes. The rapid advancement in the identification and characterization of these regulatory elements has been marked by initiatives such as the Encyclopedia of DNA Elements (ENCODE) project, which represents a significant milestone in the field. Concurrently, the development of CRE detection technologies, exemplified by massively parallel reporter assays, has progressed at an impressive pace, providing powerful tools for CRE discovery. The exponential growth of multimodal functional genomic data has necessitated the application of advanced analytical methods. Deep learning algorithms, particularly large language models, have emerged as invaluable tools for deconstructing the intricate nucleotide sequences governing CRE function. These advancements facilitate precise predictions of CRE activity and enable the de novo design of CREs. A deeper understanding of CRE operational dynamics is crucial for harnessing their versatile regulatory properties. Such insights are instrumental in refining gene therapy techniques, enhancing the efficacy of selective breeding programs, pushing the boundaries of genetic innovation, and opening new possibilities in microbial synthetic biology.
Affiliation(s)
- Lingna Xu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
| | - Yuwen Liu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Kunpeng Institute of Modern Agriculture at Foshan, Chinese Academy of Agricultural Sciences, Foshan 528226, China
| |
|
10
|
Zhang W, Zhang P, Wang H, Xu R, Xie Z, Wang Y, Du G, Kang Z. Enhancing the expression of chondroitin 4-O-sulfotransferase for one-pot enzymatic synthesis of chondroitin sulfate A. Carbohydr Polym 2024; 337:122158. [PMID: 38710555 DOI: 10.1016/j.carbpol.2024.122158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2023] [Revised: 03/28/2024] [Accepted: 04/11/2024] [Indexed: 05/08/2024]
Abstract
Chondroitin sulfate (CS) stands as a pivotal compound in dietary supplements for osteoarthritis treatment, propelling significant interest in the biotechnological pursuit of environmentally friendly and safe CS production. Enzymatic synthesis of CS for instance CSA has been considered as one of the most promising methods. However, the bottleneck consistently encountered is the active expression of chondroitin 4-O-sulfotransferase (C4ST) during CSA biosynthesis. This study meticulously delved into optimizing C4ST expression through systematic enhancements in transcription, translation, and secretion mechanisms via modifications in the 5' untranslated region, the N-terminal encoding sequence, and the Komagataella phaffii chassis. Ultimately, the active C4ST expression escalated to 2713.1 U/L, representing a striking 43.7-fold increase. By applying the culture broth supernatant of C4ST and integrating the 3'-phosphoadenosine-5'-phosphosulfate (PAPS) biosynthesis module, we constructed a one-pot enzymatic system for CSA biosynthesis, achieving a remarkable sulfonation degree of up to 97.0 %. The substantial enhancement in C4ST expression and the development of an engineered one-pot enzymatic synthesis system promises to expedite large-scale CSA biosynthesis with customizable sulfonation degrees.
Affiliation(s)
- Weijiao Zhang
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, Jiangnan University, Wuxi 214122, China; Science Center for Future Foods, Jiangnan University, Wuxi 214122, China; The Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, China
| | - Ping Zhang
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, Jiangnan University, Wuxi 214122, China; Science Center for Future Foods, Jiangnan University, Wuxi 214122, China; The Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, China
| | - Hao Wang
- Bloomage Biotechnology CO, LTD, 250000 Jinan, China
| | - Ruirui Xu
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, Jiangnan University, Wuxi 214122, China; Science Center for Future Foods, Jiangnan University, Wuxi 214122, China; The Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, China
| | - Zhuan Xie
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, Jiangnan University, Wuxi 214122, China; Science Center for Future Foods, Jiangnan University, Wuxi 214122, China; The Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, China
| | - Yang Wang
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, Jiangnan University, Wuxi 214122, China; Science Center for Future Foods, Jiangnan University, Wuxi 214122, China
| | - Guocheng Du
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, Jiangnan University, Wuxi 214122, China; Science Center for Future Foods, Jiangnan University, Wuxi 214122, China
| | - Zhen Kang
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, Jiangnan University, Wuxi 214122, China; Science Center for Future Foods, Jiangnan University, Wuxi 214122, China.
| |
|
11
|
Sokolova K, Chen KM, Hao Y, Zhou J, Troyanskaya OG. Deep Learning Sequence Models for Transcriptional Regulation. Annu Rev Genomics Hum Genet 2024; 25:105-122. [PMID: 38594933 DOI: 10.1146/annurev-genom-021623-024727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/11/2024]
Abstract
Deciphering the regulatory code of gene expression and interpreting the transcriptional effects of genome variation are critical challenges in human genetics. Modern experimental technologies have resulted in an abundance of data, enabling the development of sequence-based deep learning models that link patterns embedded in DNA to the biochemical and regulatory properties contributing to transcriptional regulation, including modeling epigenetic marks, 3D genome organization, and gene expression, with tissue and cell-type specificity. Such methods can predict the functional consequences of any noncoding variant in the human genome, even rare or never-before-observed variants, and systematically characterize their consequences beyond what is tractable from experiments or quantitative genetics studies alone. Recently, the development and application of interpretability approaches have led to the identification of key sequence patterns contributing to the predicted tasks, providing insights into the underlying biological mechanisms learned and revealing opportunities for improvement in future models.
Affiliation(s)
- Ksenia Sokolova
- Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
| | - Kathleen M Chen
- Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
| | - Yun Hao
- Flatiron Institute, Simons Foundation, New York, NY, USA
| | - Jian Zhou
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Olga G Troyanskaya
- Princeton Precision Health, Princeton University, Princeton, New Jersey, USA
- Flatiron Institute, Simons Foundation, New York, NY, USA
- Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
| |
|
12
|
Sejour R, Leatherwood J, Yurovsky A, Futcher B. Enrichment of rare codons at 5' ends of genes is a spandrel caused by evolutionary sequence turnover and does not improve translation. eLife 2024; 12:RP89656. [PMID: 39008347 PMCID: PMC11249729 DOI: 10.7554/elife.89656] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024] Open
Abstract
Previously, Tuller et al. found that the first 30-50 codons of the genes of yeast and other eukaryotes are slightly enriched for rare codons. They argued that this slowed translation, and was adaptive because it queued ribosomes to prevent collisions. Today, the translational speeds of different codons are known, and indeed rare codons are translated slowly. We re-examined this 5' slow translation 'ramp.' We confirm that 5' regions are slightly enriched for rare codons; in addition, they are depleted for downstream Start codons (which are fast), with both effects contributing to slow 5' translation. However, we also find that the 5' (and 3') ends of yeast genes are poorly conserved in evolution, suggesting that they are unstable and turnover relatively rapidly. When a new 5' end forms de novo, it is likely to include codons that would otherwise be rare. Because evolution has had a relatively short time to select against these codons, 5' ends are typically slightly enriched for rare, slow codons. Opposite to the expectation of Tuller et al., we show by direct experiment that genes with slowly translated codons at the 5' end are expressed relatively poorly, and that substituting faster synonymous codons improves expression. Direct experiment shows that slow codons do not prevent downstream ribosome collisions. Further informatic studies suggest that for natural genes, slow 5' ends are correlated with poor gene expression, opposite to the expectation of Tuller et al. Thus, we conclude that slow 5' translation is a 'spandrel'--a non-adaptive consequence of something else, in this case, the turnover of 5' ends in evolution, and it does not improve translation.
Affiliation(s)
- Richard Sejour
- Department of Pharmacological Sciences, Stony Brook University, Stony Brook, United States
| | - Janet Leatherwood
- Department of Microbiology and Immunology, Stony Brook University, Stony Brook, United States
| | - Alisa Yurovsky
- Department of Biomedical Informatics, Stony Brook University, Stony Brook, United States
| | - Bruce Futcher
- Department of Microbiology and Immunology, Stony Brook University, Stony Brook, United States
| |
|
13
|
Gorjifard S, Jores T, Tonnies J, Mueth NA, Bubb K, Wrightsman T, Buckler ES, Fields S, Cuperus JT, Queitsch C. Arabidopsis and maize terminator strength is determined by GC content, polyadenylation motifs and cleavage probability. Nat Commun 2024; 15:5868. [PMID: 38997252 PMCID: PMC11245536 DOI: 10.1038/s41467-024-50174-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Accepted: 07/03/2024] [Indexed: 07/14/2024] Open
Abstract
The 3' end of a gene, often called a terminator, modulates mRNA stability, localization, translation, and polyadenylation. Here, we adapted Plant STARR-seq, a massively parallel reporter assay, to measure the activity of over 50,000 terminators from the plants Arabidopsis thaliana and Zea mays. We characterize thousands of plant terminators, including many that outperform bacterial terminators commonly used in plants. Terminator activity is species-specific, differing in tobacco leaf and maize protoplast assays. While recapitulating known biology, our results reveal the relative contributions of polyadenylation motifs to terminator strength. We built a computational model to predict terminator strength and used it to conduct in silico evolution that generated optimized synthetic terminators. Additionally, we discover alternative polyadenylation sites across tens of thousands of terminators; however, the strongest terminators tend to have a dominant cleavage site. Our results establish features of plant terminator function and identify strong naturally occurring and synthetic terminators.
Affiliation(s)
- Sayeh Gorjifard
- Department of Genome Sciences, University of Washington, Seattle, WA, 98195, USA
| | - Tobias Jores
- Department of Genome Sciences, University of Washington, Seattle, WA, 98195, USA
| | - Jackson Tonnies
- Department of Genome Sciences, University of Washington, Seattle, WA, 98195, USA
- Graduate Program in Biology, University of Washington, Seattle, WA, 98195, USA
| | - Nicholas A Mueth
- Department of Genome Sciences, University of Washington, Seattle, WA, 98195, USA
| | - Kerry Bubb
- Department of Genome Sciences, University of Washington, Seattle, WA, 98195, USA
| | - Travis Wrightsman
- Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY, 14853, USA
| | - Edward S Buckler
- Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY, 14853, USA
- Agricultural Research Service, United States Department of Agriculture, Ithaca, NY, 14853, USA
- Institute for Genomic Diversity, Cornell University, Ithaca, NY, 14853, USA
| | - Stanley Fields
- Department of Genome Sciences, University of Washington, Seattle, WA, 98195, USA
- Department of Medicine, University of Washington, Seattle, WA, 98195, USA
| | - Josh T Cuperus
- Department of Genome Sciences, University of Washington, Seattle, WA, 98195, USA
| | - Christine Queitsch
- Department of Genome Sciences, University of Washington, Seattle, WA, 98195, USA.
| |
|
14
|
Routhier E, Joubert A, Westbrook A, Pierre E, Lancrey A, Cariou M, Boulé JB, Mozziconacci J. In silico design of DNA sequences for in vivo nucleosome positioning. Nucleic Acids Res 2024; 52:6802-6810. [PMID: 38828788 PMCID: PMC11229325 DOI: 10.1093/nar/gkae468] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2023] [Revised: 04/24/2024] [Accepted: 05/29/2024] [Indexed: 06/05/2024] Open
Abstract
The computational design of synthetic DNA sequences with designer in vivo properties is gaining traction in the field of synthetic genomics. We propose here a computational method which combines a kinetic Monte Carlo framework with a deep mutational screening based on deep learning predictions. We apply our method to build regular nucleosome arrays with tailored nucleosomal repeat lengths (NRL) in yeast. Our design was validated in vivo by successfully engineering and integrating thousands of kilobases long tandem arrays of computationally optimized sequences which could accommodate NRLs much larger than the yeast natural NRL (namely 197 and 237 bp, compared to the natural NRL of ∼165 bp). RNA-seq results show that transcription of the arrays can occur but is not driven by the NRL. The computational method proposed here delineates the key sequence rules for nucleosome positioning in yeast and should be easily applicable to other sequence properties and other genomes.
Affiliation(s)
- Etienne Routhier
- Laboratoire de Physique Théorique de la Matière Condensée, CNRS, Sorbonne Université, Paris, France
| | - Alexandra Joubert
- Structure et Instabilité des Génomes, Museum National d’Histoire Naturelle, CNRS, INSERM, Paris, France
| | - Alex Westbrook
- Structure et Instabilité des Génomes, Museum National d’Histoire Naturelle, CNRS, INSERM, Paris, France
| | - Edgard Pierre
- Laboratoire de Physique Théorique de la Matière Condensée, CNRS, Sorbonne Université, Paris, France
| | - Astrid Lancrey
- Structure et Instabilité des Génomes, Museum National d’Histoire Naturelle, CNRS, INSERM, Paris, France
| | - Marie Cariou
- Acquisition et Analyse de données pour l’histoire naturelle, Museum National d’Histoire Naturelle, CNRS, Paris, France
| | - Jean-Baptiste Boulé
- Structure et Instabilité des Génomes, Museum National d’Histoire Naturelle, CNRS, INSERM, Paris, France
| | - Julien Mozziconacci
- Laboratoire de Physique Théorique de la Matière Condensée, CNRS, Sorbonne Université, Paris, France
- Structure et Instabilité des Génomes, Museum National d’Histoire Naturelle, CNRS, INSERM, Paris, France
- Acquisition et Analyse de données pour l’histoire naturelle, Museum National d’Histoire Naturelle, CNRS, Paris, France
- Institut Universitaire de France, Paris, France
| |
|
15
|
Akirtava C, May G, McManus CJ. Deciphering the cis-regulatory landscape of natural yeast Transcript Leaders. bioRxiv 2024:2024.07.03.601937. [PMID: 39005336 PMCID: PMC11245039 DOI: 10.1101/2024.07.03.601937] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Protein synthesis is a vital process that is highly regulated at the initiation step of translation. Eukaryotic 5' transcript leaders (TLs) contain a variety of cis-regulatory features that influence translation and mRNA stability. However, the relative influences of these features in natural TLs are poorly characterized. To address this, we used massively parallel reporter assays (MPRAs) to quantify RNA levels, ribosome loading, and protein levels from 11,027 natural yeast TLs in vivo and systematically compared the relative impacts of their sequence features on gene expression. We found that yeast TLs influence gene expression over two orders of magnitude. While a leaky scanning model using Kozak contexts and uAUGs explained half of the variance in expression across transcript leaders, the addition of other features explained ~70% of gene expression variation. Our analyses detected key cis-acting sequence features, quantified their effects in vivo, and compared their roles to motifs reported from an in vitro study of ribosome recruitment. In addition, our work quantitated the effects of alternative transcription start site usage on gene expression in yeast. Thus, our study provides new quantitative insights into the roles of TL cis-acting sequences in regulating gene expression.
Affiliation(s)
- Christina Akirtava
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
- RNA Bioscience Initiative, University of Colorado - Anschutz, Aurora, CO, 80045, USA
| | - Gemma May
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
| | - C Joel McManus
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
| |
|
16
|
Andreani V, South EJ, Dunlop MJ. Generating information-dense promoter sequences with optimal string packing. PLoS Comput Biol 2024; 20:e1012276. [PMID: 39047028 PMCID: PMC11268586 DOI: 10.1371/journal.pcbi.1012276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2024] [Accepted: 06/25/2024] [Indexed: 07/27/2024] Open
Abstract
Dense arrangements of binding sites within nucleotide sequences can collectively influence downstream transcription rates or initiate biomolecular interactions. For example, natural promoter regions can harbor many overlapping transcription factor binding sites that influence the rate of transcription initiation. Despite the prevalence of overlapping binding sites in nature, rapid design of nucleotide sequences with many overlapping sites remains a challenge. Here, we show that this is an NP-hard problem, coined here as the nucleotide String Packing Problem (SPP). We then introduce a computational technique that efficiently assembles sets of DNA-protein binding sites into dense, contiguous stretches of double-stranded DNA. For the efficient design of nucleotide sequences spanning hundreds of base pairs, we reduce the SPP to an Orienteering Problem with integer distances, and then leverage modern integer linear programming solvers. Our method optimally packs sets of 20-100 binding sites into dense nucleotide arrays of 50-300 base pairs in 0.05-10 seconds. Unlike approximation algorithms or meta-heuristics, our approach finds provably optimal solutions. We demonstrate how our method can generate large sets of diverse sequences suitable for library generation, where the frequency of binding site usage across the returned sequences can be controlled by modulating the objective function. As an example, we then show how adding additional constraints, like the inclusion of sequence elements with fixed positions, allows for the design of bacterial promoters. The nucleotide string packing approach we present can accelerate the design of sequences with complex DNA-protein interactions. When used in combination with synthesis and high-throughput screening, this design strategy could help interrogate how complex binding site arrangements impact either gene expression or biomolecular mechanisms in varied cellular contexts.
Affiliation(s)
- Virgile Andreani
- Biomedical Engineering Department, Boston University, Boston, Massachusetts, United States of America
- Biological Design Center, Boston University, Boston, Massachusetts, United States of America
| | - Eric J. South
- Biological Design Center, Boston University, Boston, Massachusetts, United States of America
- Molecular Biology, Cell Biology & Biochemistry Program, Boston University, Boston, Massachusetts, United States of America
| | - Mary J. Dunlop
- Biomedical Engineering Department, Boston University, Boston, Massachusetts, United States of America
- Biological Design Center, Boston University, Boston, Massachusetts, United States of America
- Molecular Biology, Cell Biology & Biochemistry Program, Boston University, Boston, Massachusetts, United States of America
| |
|
17
|
Hwang H, Jeon H, Yeo N, Baek D. Big data and deep learning for RNA biology. Exp Mol Med 2024; 56:1293-1321. [PMID: 38871816 PMCID: PMC11263376 DOI: 10.1038/s12276-024-01243-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/31/2024] [Revised: 02/27/2024] [Accepted: 03/05/2024] [Indexed: 06/15/2024] Open
Abstract
The exponential growth of big data in RNA biology (RB) has led to the development of deep learning (DL) models that have driven crucial discoveries. As constantly evidenced by DL studies in other fields, the successful implementation of DL in RB depends heavily on the effective utilization of large-scale datasets from public databases. In achieving this goal, data encoding methods, learning algorithms, and techniques that align well with biological domain knowledge have played pivotal roles. In this review, we provide guiding principles for applying these DL concepts to various problems in RB by demonstrating successful examples and associated methodologies. We also discuss the remaining challenges in developing DL models for RB and suggest strategies to overcome these challenges. Overall, this review aims to illuminate the compelling potential of DL for RB and ways to apply this powerful technology to investigate the intriguing biology of RNA more effectively.
Affiliation(s)
- Hyeonseo Hwang
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
| | - Hyeonseong Jeon
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Genome4me Inc., Seoul, Republic of Korea
| | - Nagyeong Yeo
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
| | - Daehyun Baek
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea.
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea.
- Genome4me Inc., Seoul, Republic of Korea.
| |
|
18
|
Sinzger-D'Angelo M, Hanst M, Reinhardt F, Koeppl H. Effects of mRNA conformational switching on translational noise in gene circuits. J Chem Phys 2024; 160:134108. [PMID: 38573847 DOI: 10.1063/5.0186927] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2023] [Accepted: 03/08/2024] [Indexed: 04/06/2024] Open
Abstract
Intragenic translational heterogeneity describes the variation in translation at the level of transcripts for an individual gene. A factor that contributes to this source of variation is the mRNA structure. Both the composition of the thermodynamic ensemble, i.e., the stationary distribution of mRNA structures, and the switching dynamics between those play a role. The effect of the switching dynamics on intragenic translational heterogeneity remains poorly understood. We present a stochastic translation model that accounts for mRNA structure switching and is derived from a Markov model via approximate stochastic filtering. We assess the approximation on various timescales and provide a method to quantify how mRNA structure dynamics contributes to translational heterogeneity. With our approach, we allow quantitative information on mRNA switching from biophysical experiments or coarse-grain molecular dynamics simulations of mRNA structures to be included in gene regulatory chemical reaction network models without an increase in the number of species. Thereby, our model bridges a gap between mRNA structure kinetics and gene expression models, which we hope will further improve our understanding of gene regulatory networks and facilitate genetic circuit design.
Affiliation(s)
| | - Maleen Hanst
- Centre for Synthetic Biology, Technische Universität Darmstadt, Darmstadt, Germany
| | - Felix Reinhardt
- Centre for Synthetic Biology, Technische Universität Darmstadt, Darmstadt, Germany
| | - Heinz Koeppl
- Centre for Synthetic Biology, Technische Universität Darmstadt, Darmstadt, Germany
| |
|
19
|
Tang X, Huo M, Chen Y, Huang H, Qin S, Luo J, Qin Z, Jiang X, Liu Y, Duan X, Wang R, Chen L, Li H, Fan N, He Z, He X, Shen B, Li SC, Song X. A novel deep generative model for mRNA vaccine development: Designing 5' UTRs with N1-methyl-pseudouridine modification. Acta Pharm Sin B 2024; 14:1814-1826. [PMID: 38572113 PMCID: PMC10985129 DOI: 10.1016/j.apsb.2023.11.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Revised: 10/26/2023] [Accepted: 11/01/2023] [Indexed: 04/05/2024] Open
Abstract
Efficient translation mediated by the 5' untranslated region (5' UTR) is essential for the robust efficacy of mRNA vaccines. However, the N1-methyl-pseudouridine (m1Ψ) modification of mRNA can impact the translation efficiency of the 5' UTR. We discovered that the optimal 5' UTR for m1Ψ-modified mRNA (m1Ψ-5' UTR) differs significantly from its unmodified counterpart, highlighting the need for a specialized tool for designing m1Ψ-5' UTRs rather than directly utilizing high-expression endogenous gene 5' UTRs. In response, we developed a novel machine learning-based tool, Smart5UTR, which employs a deep generative model to identify superior m1Ψ-5' UTRs in silico. The tailored loss function and network architecture enable Smart5UTR to overcome limitations inherent in existing models. As a result, Smart5UTR can successfully design superior 5' UTRs, greatly benefiting mRNA vaccine development. Notably, Smart5UTR-designed superior 5' UTRs significantly enhanced antibody titers induced by COVID-19 mRNA vaccines against the Delta and Omicron variants of SARS-CoV-2, surpassing the performance of vaccines using high-expression endogenous gene 5' UTRs.
Affiliation(s)
- Xiaoshan Tang
- Institute of Systems Genetics, Department of Critical Care Medicine, Frontiers Science Center for Disease-related Molecular Network, State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu 610000, China
| | - Miaozhe Huo
- Department of Computer Science, City University of Hong Kong, Hong Kong 99907, China
| | - Yuting Chen
- Institute of Systems Genetics, Department of Critical Care Medicine, Frontiers Science Center for Disease-related Molecular Network, State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu 610000, China
| | - Hai Huang
- Institute of Systems Genetics, Department of Critical Care Medicine, Frontiers Science Center for Disease-related Molecular Network, State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu 610000, China
| | - Shugang Qin
- Institute of Systems Genetics, Department of Critical Care Medicine, Frontiers Science Center for Disease-related Molecular Network, State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu 610000, China
| | - Jiaqi Luo
- Department of Computer Science, City University of Hong Kong, Hong Kong 99907, China
| | - Zeyi Qin
- Department of Biology, Brandeis University, Boston, MA 02453, USA
| | - Xin Jiang
- Institute of Systems Genetics, Department of Critical Care Medicine, Frontiers Science Center for Disease-related Molecular Network, State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu 610000, China
| | - Yongmei Liu
- Institute of Systems Genetics, Department of Critical Care Medicine, Frontiers Science Center for Disease-related Molecular Network, State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu 610000, China
| | - Xing Duan
- Institute of Systems Genetics, Department of Critical Care Medicine, Frontiers Science Center for Disease-related Molecular Network, State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu 610000, China
| | - Ruohan Wang
- Department of Computer Science, City University of Hong Kong, Hong Kong 99907, China
| | - Lingxi Chen
- Department of Computer Science, City University of Hong Kong, Hong Kong 99907, China
| | - Hao Li
- Institute of Systems Genetics, Department of Critical Care Medicine, Frontiers Science Center for Disease-related Molecular Network, State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu 610000, China
| | - Na Fan
- Institute of Systems Genetics, Department of Critical Care Medicine, Frontiers Science Center for Disease-related Molecular Network, State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu 610000, China
| | - Zhongshan He
- Institute of Systems Genetics, Department of Critical Care Medicine, Frontiers Science Center for Disease-related Molecular Network, State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu 610000, China
| | - Xi He
- Institute of Systems Genetics, Department of Critical Care Medicine, Frontiers Science Center for Disease-related Molecular Network, State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu 610000, China
| | - Bairong Shen
- Institute of Systems Genetics, Department of Critical Care Medicine, Frontiers Science Center for Disease-related Molecular Network, State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu 610000, China
| | - Shuai Cheng Li
- Department of Computer Science, City University of Hong Kong, Hong Kong 99907, China
| | - Xiangrong Song
- Institute of Systems Genetics, Department of Critical Care Medicine, Frontiers Science Center for Disease-related Molecular Network, State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu 610000, China
| |
|
20
|
Goshisht MK. Machine Learning and Deep Learning in Synthetic Biology: Key Architectures, Applications, and Challenges. ACS OMEGA 2024; 9:9921-9945. [PMID: 38463314 PMCID: PMC10918679 DOI: 10.1021/acsomega.3c05913] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Revised: 01/19/2024] [Accepted: 01/30/2024] [Indexed: 03/12/2024]
Abstract
Machine learning (ML), particularly deep learning (DL), has made rapid and substantial progress in synthetic biology in recent years. Biotechnological applications of biosystems, including pathways, enzymes, and whole cells, are being probed frequently with time. The intricacy and interconnectedness of biosystems make it challenging to design them with the desired properties. ML and DL have a synergy with synthetic biology. Synthetic biology can be employed to produce large data sets for training models (for instance, by utilizing DNA synthesis), and ML/DL models can be employed to inform design (for example, by generating new parts or advising unrivaled experiments to perform). This potential has recently been brought to light by research at the intersection of engineering biology and ML/DL through achievements like the design of novel biological components, best experimental design, automated analysis of microscopy data, protein structure prediction, and biomolecular implementations of ANNs (Artificial Neural Networks). I have divided this review into three sections. In the first section, I describe predictive potential and basics of ML along with myriad applications in synthetic biology, especially in engineering cells, activity of proteins, and metabolic pathways. In the second section, I describe fundamental DL architectures and their applications in synthetic biology. Finally, I describe different challenges causing hurdles in the progress of ML/DL and synthetic biology along with their solutions.
Affiliation(s)
- Manoj Kumar Goshisht
- Department of Chemistry, Natural and Applied Sciences, University of Wisconsin-Green Bay, Green Bay, Wisconsin 54311-7001, United States
| |
|
21
|
Luthra I, Jensen C, Chen XE, Salaudeen AL, Rafi AM, de Boer CG. Regulatory activity is the default DNA state in eukaryotes. Nat Struct Mol Biol 2024; 31:559-567. [PMID: 38448573 DOI: 10.1038/s41594-024-01235-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 11/29/2023] [Accepted: 01/29/2024] [Indexed: 03/08/2024]
Abstract
Genomes encode for genes and non-coding DNA, both capable of transcriptional activity. However, unlike canonical genes, many transcripts from non-coding DNA have limited evidence of conservation or function. Here, to determine how much biological noise is expected from non-genic sequences, we quantify the regulatory activity of evolutionarily naive DNA using RNA-seq in yeast and computational predictions in humans. In yeast, more than 99% of naive DNA bases were transcribed. Unlike the evolved transcriptome, naive transcripts frequently overlapped with opposite sense transcripts, suggesting selection favored coherent gene structures in the yeast genome. In humans, regulation-associated chromatin activity is predicted to be common in naive dinucleotide-content-matched randomized DNA. Here, naive and evolved DNA have similar co-occurrence and cell-type specificity of chromatin marks, challenging these as indicators of selection. However, in both yeast and humans, extreme high activities were rare in naive DNA, suggesting they result from selection. Overall, basal regulatory activity seems to be the default, which selection can hone to evolve a function or, if detrimental, repress.
Affiliation(s)
- Ishika Luthra
- School of Biomedical Engineering, University of British Columbia, Vancouver, British Columbia, Canada
| | - Cassandra Jensen
- School of Biomedical Engineering, University of British Columbia, Vancouver, British Columbia, Canada
| | - Xinyi E Chen
- School of Biomedical Engineering, University of British Columbia, Vancouver, British Columbia, Canada
| | - Asfar Lathif Salaudeen
- School of Biomedical Engineering, University of British Columbia, Vancouver, British Columbia, Canada
| | - Abdul Muntakim Rafi
- School of Biomedical Engineering, University of British Columbia, Vancouver, British Columbia, Canada
| | - Carl G de Boer
- School of Biomedical Engineering, University of British Columbia, Vancouver, British Columbia, Canada.
| |
|
22
|
Hernández G, García A, Weingarten-Gabbay S, Mishra R, Hussain T, Amiri M, Moreno-Hagelsieb G, Montiel-Dávalos A, Lasko P, Sonenberg N. Functional analysis of the AUG initiator codon context reveals novel conserved sequences that disfavor mRNA translation in eukaryotes. Nucleic Acids Res 2024; 52:1064-1079. [PMID: 38038264 PMCID: PMC10853783 DOI: 10.1093/nar/gkad1152] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/22/2023] [Revised: 11/09/2023] [Accepted: 11/15/2023] [Indexed: 12/02/2023] Open
Abstract
mRNA translation is a fundamental process for life. Selection of the translation initiation site (TIS) is crucial, as it establishes the correct open reading frame for mRNA decoding. Studies in vertebrate mRNAs discovered that a purine at -3 and a G at +4 (where A of the AUG initiator codon is numbered +1) promote TIS recognition. However, the TIS context in other eukaryotes has been poorly experimentally analyzed. We analyzed in vitro the influence of the -3, -2, -1 and +4 positions of the TIS context in rabbit, Drosophila, wheat, and yeast. We observed that -3A conferred the best translational efficiency across these species. However, we found variability at the +4 position for optimal translation. In addition, the Kozak motif that was defined from mammalian cells was only weakly predictive for wheat and essentially non-predictive for yeast. We discovered eight conserved sequences that significantly disfavored translation. Due to the big differences in translational efficiency observed among weak TIS context sequences, we define a novel category that we termed 'barren AUG context sequences (BACS)', which represent sequences disfavoring translation. Analysis of mRNA-ribosomal complexes structures provided insights into the function of BACS. The gene ontology of the BACS-containing mRNAs is presented.
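The position numbering used in this abstract (A of AUG is +1, there is no position 0) can be made concrete with a short sketch. This is only an illustration of the coordinate convention and of the vertebrate rule quoted above (purine at -3, G at +4); the function name `tis_context` and its return format are hypothetical and are not taken from the paper.

```python
def tis_context(mrna: str, aug_index: int) -> dict:
    """Report the -3 and +4 nucleotides around an AUG.

    aug_index is the 0-based index of the A of the AUG (hypothetical interface).
    """
    seq = mrna.upper().replace("T", "U")  # accept DNA or RNA alphabets
    if seq[aug_index:aug_index + 3] != "AUG":
        raise ValueError("no AUG at the given index")
    # A of AUG is +1, so -3 lies three bases upstream of the A,
    # and +4 is the base immediately after the G of AUG.
    minus3 = seq[aug_index - 3] if aug_index >= 3 else None
    plus4 = seq[aug_index + 3] if aug_index + 3 < len(seq) else None
    return {
        "minus3": minus3,
        "plus4": plus4,
        "purine_at_minus3": minus3 in ("A", "G"),
        "g_at_plus4": plus4 == "G",
    }


if __name__ == "__main__":
    print(tis_context("GCCACCAUGG", 6))
```

Note that, per the abstract, this vertebrate-derived rule was only weakly predictive in wheat and essentially non-predictive in yeast, so such a check should not be read as a general score of translational efficiency.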
Affiliation(s)
- Greco Hernández
- mRNA and Cancer Laboratory, Unit of Biomedical Research on Cancer, National Institute of Cancer (INCan), Mexico City 14080, Mexico
| | - Alejandra García
- mRNA and Cancer Laboratory, Unit of Biomedical Research on Cancer, National Institute of Cancer (INCan), Mexico City 14080, Mexico
| | - Shira Weingarten-Gabbay
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA
- Laboratory of Virology and Infectious Disease, The Rockefeller University, New York, NY, USA
| | - Rishi Kumar Mishra
- Department of Developmental Biology and Genetics, Indian Institute of Science, Bengaluru-560012, India
| | - Tanweer Hussain
- Department of Developmental Biology and Genetics, Indian Institute of Science, Bengaluru-560012, India
| | - Mehdi Amiri
- Department of Biochemistry and Goodman Cancer Institute, McGill University, Montreal, QC H3A 1A3, Canada
| | - Gabriel Moreno-Hagelsieb
- Department of Biology, Wilfrid Laurier University. 75 University Ave. W, Waterloo, ON N2L 3C5, Canada
| | - Angélica Montiel-Dávalos
- mRNA and Cancer Laboratory, Unit of Biomedical Research on Cancer, National Institute of Cancer (INCan), Mexico City 14080, Mexico
| | - Paul Lasko
- Department of Biology, McGill University. Montreal, QC H3G 0B1, Canada
| | - Nahum Sonenberg
- Department of Biochemistry and Goodman Cancer Institute, McGill University, Montreal, QC H3A 1A3, Canada
| |
|
23
|
Andreani V, South EJ, Dunlop MJ. Generating information-dense promoter sequences with optimal string packing. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.11.01.565124. [PMID: 37961203 PMCID: PMC10635063 DOI: 10.1101/2023.11.01.565124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
Dense arrangements of binding sites within nucleotide sequences can collectively influence downstream transcription rates or initiate biomolecular interactions. For example, natural promoter regions can harbor many overlapping transcription factor binding sites that influence the rate of transcription initiation. Despite the prevalence of overlapping binding sites in nature, rapid design of nucleotide sequences with many overlapping sites remains a challenge. Here, we show that this is an NP-hard problem, coined here as the nucleotide String Packing Problem (SPP). We then introduce a computational technique that efficiently assembles sets of DNA-protein binding sites into dense, contiguous stretches of double-stranded DNA. For the efficient design of nucleotide sequences spanning hundreds of base pairs, we reduce the SPP to an Orienteering Problem with integer distances, and then leverage modern integer linear programming solvers. Our method optimally packs libraries of 20-100 binding sites into dense nucleotide arrays of 50-300 base pairs in 0.05-10 seconds. Unlike approximation algorithms or meta-heuristics, our approach finds provably optimal solutions. We demonstrate how our method can generate large sets of diverse sequences suitable for library generation, where the frequency of binding site usage across the returned sequences can be controlled by modulating the objective function. As an example, we then show how adding additional constraints, like the inclusion of sequence elements with fixed positions, allows for the design of bacterial promoters. The nucleotide string packing approach we present can accelerate the design of sequences with complex DNA-protein interactions. When used in combination with synthesis and high-throughput screening, this design strategy could help interrogate how complex binding site arrangements impact either gene expression or biomolecular mechanisms in varied cellular contexts. 
Author Summary: The way protein binding sites are arranged on DNA can control the regulation and transcription of downstream genes. Areas with a high concentration of binding sites can enable complex interplay between transcription factors, a feature that is exploited by natural promoters. However, designing synthetic promoters that contain dense arrangements of binding sites is a challenge. The task involves overlapping many binding sites, each typically about 10 nucleotides long, within a constrained sequence area, which becomes increasingly difficult as sequence length decreases, and binding site variety increases. We introduce an approach to design nucleotide sequences with optimally packed protein binding sites, which we call the nucleotide String Packing Problem (SPP). We show that the SPP can be solved efficiently using integer linear programming to identify the densest arrangements of binding sites for a specified sequence length. We show how adding additional constraints, like the inclusion of sequence elements with fixed positions, allows for the design of bacterial promoters. The presented approach enables the rapid design and study of nucleotide sequences with complex, dense binding site architectures.
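To illustrate what "overlapping binding sites" means in practice, the sketch below merges sites by their longest suffix-prefix overlap. It is deliberately not the authors' method: the abstract describes a provably optimal integer linear programming formulation, whereas this is a simple order-dependent greedy heuristic that ignores reverse complements. The function names `max_overlap` and `greedy_pack` and the example sites are hypothetical.

```python
def max_overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_pack(sites: list[str]) -> str:
    """Chain sites in the given order, sharing overlapping ends.

    Greedy and order-dependent: it gives no optimality guarantee,
    unlike the ILP approach described in the abstract.
    """
    if not sites:
        return ""
    packed = sites[0]
    for site in sites[1:]:
        if site in packed:  # already covered by earlier sites
            continue
        k = max_overlap(packed, site)
        packed += site[k:]  # append only the non-overlapping tail
    return packed


if __name__ == "__main__":
    sites = ["ATGCA", "GCATT", "TTGAC"]  # made-up example sites
    packed = greedy_pack(sites)
    print(packed, len(packed), "vs", sum(map(len, sites)))
```

Even this toy version shows the payoff of overlap: three 5-nt sites fit into fewer bases than their concatenation would need.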
|
24
|
Gorjifard S, Jores T, Tonnies J, Mueth NA, Bubb K, Wrightsman T, Buckler ES, Fields S, Cuperus JT, Queitsch C. Arabidopsis and Maize Terminator Strength is Determined by GC Content, Polyadenylation Motifs and Cleavage Probability. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.06.16.545379. [PMID: 37398426 PMCID: PMC10312805 DOI: 10.1101/2023.06.16.545379] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
The 3' end of a gene, often called a terminator, modulates mRNA stability, localization, translation, and polyadenylation. Here, we adapted Plant STARR-seq, a massively parallel reporter assay, to measure the activity of over 50,000 terminators from the plants Arabidopsis thaliana and Zea mays. We characterize thousands of plant terminators, including many that outperform bacterial terminators commonly used in plants. Terminator activity is species-specific, differing in tobacco leaf and maize protoplast assays. While recapitulating known biology, our results reveal the relative contributions of polyadenylation motifs to terminator strength. We built a computational model to predict terminator strength and used it to conduct in silico evolution that generated optimized synthetic terminators. Additionally, we discover alternative polyadenylation sites across tens of thousands of terminators; however, the strongest terminators tend to have a dominant cleavage site. Our results establish features of plant terminator function and identify strong naturally occurring and synthetic terminators.
Affiliation(s)
- Sayeh Gorjifard
- Department of Genome Sciences, University of Washington, Seattle, WA 98195
| | - Tobias Jores
- Department of Genome Sciences, University of Washington, Seattle, WA 98195
| | - Jackson Tonnies
- Department of Genome Sciences, University of Washington, Seattle, WA 98195
- Graduate Program in Biology, University of Washington, Seattle, WA 98195
| | - Nicholas A Mueth
- Department of Genome Sciences, University of Washington, Seattle, WA 98195
| | - Kerry Bubb
- Department of Genome Sciences, University of Washington, Seattle, WA 98195
| | - Travis Wrightsman
- Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853
| | - Edward S Buckler
- Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853
- Agricultural Research Service, United States Department of Agriculture, Ithaca, NY 14853
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853
| | - Stanley Fields
- Department of Genome Sciences, University of Washington, Seattle, WA 98195
- Department of Medicine, University of Washington, Seattle, WA 98195
| | - Josh T Cuperus
- Department of Genome Sciences, University of Washington, Seattle, WA 98195
| | - Christine Queitsch
- Department of Genome Sciences, University of Washington, Seattle, WA 98195
| |
|
25
|
Zeng J, Song K, Wang J, Wen H, Zhou J, Ni T, Lu H, Yu Y. Characterization and optimization of 5´ untranslated region containing poly-adenine tracts in Kluyveromyces marxianus using machine-learning model. Microb Cell Fact 2024; 23:7. [PMID: 38172836 PMCID: PMC10763412 DOI: 10.1186/s12934-023-02271-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2023] [Accepted: 12/12/2023] [Indexed: 01/05/2024] Open
Abstract
BACKGROUND: The 5´ untranslated region (5´ UTR) plays a key role in regulating translation efficiency and mRNA stability, making it a favored target in genetic engineering and synthetic biology. A common feature found in the 5´ UTR is the poly-adenine (poly(A)) tract. However, the effect of 5´ UTR poly(A) on protein production remains controversial. Machine-learning models are powerful tools for explaining the complex contributions of features, but models incorporating features of 5´ UTR poly(A) are currently lacking. Thus, our goal is to construct such a model, using natural 5´ UTRs from Kluyveromyces marxianus, a promising cell factory for producing heterologous proteins. RESULTS: We constructed a mini-library consisting of 207 5´ UTRs harboring poly(A) and 34 5´ UTRs without poly(A) from K. marxianus. The effects of each 5´ UTR on the production of a GFP reporter were evaluated individually in vivo, and the resulting protein abundance spanned an approximately 450-fold range throughout. The data were used to train a multi-layer perceptron neural network (MLP-NN) model that incorporated the length and position of poly(A) as features. The model exhibited good performance in predicting protein abundance (average R² = 0.7290). The model suggests that the length of poly(A) is negatively correlated with protein production, whereas poly(A) located between 10 and 30 nt upstream of the start codon (AUG) exhibits a weak positive effect on protein abundance. Using the model as guidance, the deletion or reduction of poly(A) upstream of 30 nt preceding AUG tended to improve the production of GFP and a feruloyl esterase. Deletions of poly(A) showed inconsistent effects on mRNA levels, suggesting that poly(A) represses protein production either with or without reducing mRNA levels. CONCLUSION: The effects of poly(A) on protein production depend on its length and position. Integrating poly(A) features into machine-learning models improves simulation accuracy. Deleting or reducing poly(A) upstream of 30 nt preceding AUG tends to enhance protein production. This optimization strategy can be applied to enhance the yield of K. marxianus and other microbial cell factories.
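The abstract names the length and position of poly(A) as model features. A minimal sketch of how such features might be extracted from a 5´ UTR sequence follows. The function name `poly_a_features`, the minimum tract length of 5 nt, and the assumption that the start codon immediately follows the supplied UTR are all illustrative choices; the abstract does not state the authors' definitions.

```python
import re


def poly_a_features(utr: str, min_len: int = 5) -> list[dict]:
    """List poly(A) tracts in a 5' UTR with their length and position.

    min_len is an assumed threshold for calling a run of A's a tract.
    The start codon is assumed to follow the UTR string directly.
    """
    seq = utr.upper()
    features = []
    for match in re.finditer(r"A{%d,}" % min_len, seq):
        features.append({
            "length": match.end() - match.start(),
            # nucleotides between the 3' end of the tract and the AUG
            "nt_upstream_of_aug": len(seq) - match.end(),
        })
    return features


if __name__ == "__main__":
    for feat in poly_a_features("GGAAAAAAGGCCAAAAACC"):
        print(feat)
```

Features of this kind could then be fed to a regression model alongside other sequence descriptors; the choice of model and encoding is left open here.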
Affiliation(s)
- Junyuan Zeng
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
- Shanghai Engineering Research Center of Industrial Microorganisms, Shanghai, 200438, China
| | - Kunfeng Song
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
- Shanghai Engineering Research Center of Industrial Microorganisms, Shanghai, 200438, China
| | - Jingqi Wang
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
- Shanghai Engineering Research Center of Industrial Microorganisms, Shanghai, 200438, China
| | - Haimei Wen
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
- Shanghai Engineering Research Center of Industrial Microorganisms, Shanghai, 200438, China
| | - Jungang Zhou
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
- Shanghai Engineering Research Center of Industrial Microorganisms, Shanghai, 200438, China
| | - Ting Ni
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
- Shanghai Engineering Research Center of Industrial Microorganisms, Shanghai, 200438, China
| | - Hong Lu
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
- Shanghai Engineering Research Center of Industrial Microorganisms, Shanghai, 200438, China
| | - Yao Yu
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China.
- Shanghai Engineering Research Center of Industrial Microorganisms, Shanghai, 200438, China.
| |
|
26
|
de Boer CG, Taipale J. Hold out the genome: a roadmap to solving the cis-regulatory code. Nature 2024; 625:41-50. [PMID: 38093018 DOI: 10.1038/s41586-023-06661-w] [Citation(s) in RCA: 17] [Impact Index Per Article: 17.0] [Received: 02/17/2023] [Accepted: 09/20/2023] [Indexed: 01/05/2024]
Abstract
Gene expression is regulated by transcription factors that work together to read cis-regulatory DNA sequences. The 'cis-regulatory code' - how cells interpret DNA sequences to determine when, where and how much genes should be expressed - has proven to be exceedingly complex. Recently, advances in the scale and resolution of functional genomics assays and machine learning have enabled substantial progress towards deciphering this code. However, the cis-regulatory code will probably never be solved if models are trained only on genomic sequences; regions of homology can easily lead to overestimation of predictive performance, and our genome is too short and has insufficient sequence diversity to learn all relevant parameters. Fortunately, randomly synthesized DNA sequences enable testing a far larger sequence space than exists in our genomes, and designed DNA sequences enable targeted queries to maximally improve the models. As the same biochemical principles are used to interpret DNA regardless of its source, models trained on these synthetic data can predict genomic activity, often better than genome-trained models. Here we provide an outlook on the field, and propose a roadmap towards solving the cis-regulatory code by a combination of machine learning and massively parallel assays using synthetic DNA.
Affiliation(s)
- Carl G de Boer
- School of Biomedical Engineering, University of British Columbia, Vancouver, British Columbia, Canada.
| | - Jussi Taipale
- Applied Tumor Genomics Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland.
- Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden.
- Department of Biochemistry, University of Cambridge, Cambridge, UK.
| |
|
27
|
Zheng W, Fong JHC, Wan YK, Chu AHY, Huang Y, Wong ASL, Ho JWK. Discovery of regulatory motifs in 5' untranslated regions using interpretable multi-task learning models. Cell Syst 2023; 14:1103-1112.e6. [PMID: 38016465 DOI: 10.1016/j.cels.2023.10.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Revised: 09/18/2023] [Accepted: 10/31/2023] [Indexed: 11/30/2023]
Abstract
The sequence in the 5' untranslated regions (UTRs) is known to affect mRNA translation rates. However, the underlying regulatory grammar remains elusive. Here, we propose MTtrans, a multi-task translation rate predictor capable of learning common sequence patterns from datasets across various experimental techniques. The core premise is that common motifs are more likely to be genuinely involved in translation control. MTtrans outperforms existing methods in both accuracy and the ability to capture transferable motifs across species, highlighting its strength in identifying evolutionarily conserved sequence motifs. Our independent fluorescence-activated cell sorting coupled with deep sequencing (FACS-seq) experiment validates the impact of most motifs identified by MTtrans. Additionally, we introduce "GRU-rewiring," a technique to interpret the hidden states of the recurrent units. Gated recurrent unit (GRU)-rewiring allows us to identify regulatory element-enriched positions and examine the local effects of 5' UTR mutations. MTtrans is a powerful tool for deciphering the translation regulatory motifs.
Affiliation(s)
- Weizhong Zheng
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
| | - John H C Fong
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
| | - Yuk Kei Wan
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
| | - Athena H Y Chu
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
| | - Yuanhua Huang
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong SAR, China; Center for Translational Stem Cell Biology, Hong Kong Science and Technology Park, Hong Kong SAR, China
| | - Alan S L Wong
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China; Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong SAR, China
| | - Joshua W K Ho
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Laboratory of Data Discovery for Health (D24H) Limited, Hong Kong Science Park, Hong Kong SAR, China.
| |
|
28
|
Perchlik M, Sasse A, Mostafavi S, Fields S, Cuperus JT. Impact on splicing in Saccharomyces cerevisiae of random 50-base sequences inserted into an intron. RNA (NEW YORK, N.Y.) 2023; 30:52-67. [PMID: 37879864 PMCID: PMC10726166 DOI: 10.1261/rna.079752.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Accepted: 10/18/2023] [Indexed: 10/27/2023]
Abstract
Intron splicing is a key regulatory step in gene expression in eukaryotes. Three sequence elements required for splicing (5' and 3' splice sites and a branchpoint) are especially well-characterized in Saccharomyces cerevisiae, but our understanding of additional intron features that impact splicing in this organism is incomplete, due largely to its small number of introns. To overcome this limitation, we constructed a library in S. cerevisiae of random 50-nt (N50) elements individually inserted into the intron of a reporter gene and quantified canonical splicing and the use of cryptic splice sites by sequencing analysis. More than 70% of approximately 140,000 N50 elements reduced splicing by at least 20%. N50 features, including higher GC content, presence of GU repeats, and stronger predicted secondary structure of its pre-mRNA, correlated with reduced splicing efficiency. A likely basis for the reduced splicing of such a large proportion of variants is the formation of RNA structures that pair N50 bases, such as the GU repeats, with other bases specifically within the reporter pre-mRNA analyzed. However, multiple models were unable to explain more than a small fraction of the variance in splicing efficiency across the library, suggesting that complex nonlinear interactions in RNA structures are not accurately captured by RNA structure prediction methods. Our results imply that the specific context of a pre-mRNA may determine the bases allowable in an intron to prevent secondary structures that reduce splicing. This large data set can serve as a resource for further exploration of splicing mechanisms.
Affiliation(s)
- Molly Perchlik
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
| | - Alexander Sasse
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
| | - Sara Mostafavi
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
| | - Stanley Fields
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
- Department of Medicine, University of Washington, Seattle, Washington 98195, USA
| | - Josh T Cuperus
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
| |
|
29
|
Irshad IU, Sharma AK. Decoding stoichiometric protein synthesis in E. coli through translation rate parameters. BIOPHYSICAL REPORTS 2023; 3:100131. [PMID: 37789867 PMCID: PMC10542608 DOI: 10.1016/j.bpr.2023.100131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2023] [Accepted: 09/11/2023] [Indexed: 10/05/2023]
Abstract
E. coli is one of the most widely used organisms for understanding the principles of cellular and molecular genetics. However, we are yet to understand the origin of several experimental observations related to the regulation of gene expression in E. coli. One of the prominent examples in this context is the proportional synthesis in multiprotein complexes where all of their obligate subunits are produced in proportion to their stoichiometry. In this work, by combining the next-generation sequencing data with the stochastic simulations of protein synthesis, we explain the origin of proportional protein synthesis in multicomponent complexes. We find that the estimated initiation rates for the translation of all subunits in those complexes are proportional to their stoichiometry. This constraint on protein synthesis kinetics enforces proportional protein synthesis without requiring any feedback mechanism. We also find that the translation initiation rates in E. coli are influenced by the coding sequence length and the enrichment of A and C nucleotides near the start codon. Thus, this study rationalizes the role of conserved and nonrandom features of genes in regulating the translation kinetics and unravels a key principle of the regulation of protein synthesis.
Affiliation(s)
| | - Ajeet K. Sharma
- Department of Physics, Indian Institute of Technology Jammu, Jammu, India
- Department of Biosciences and Bioengineering, Indian Institute of Technology Jammu, Jammu, India
| |
|
30
|
Reimão-Pinto MM, Castillo-Hair SM, Seelig G, Schier AF. The regulatory landscape of 5' UTRs in translational control during zebrafish embryogenesis. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.11.23.568470. [PMID: 38045294 PMCID: PMC10690280 DOI: 10.1101/2023.11.23.568470] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/05/2023]
Abstract
The 5' UTRs of mRNAs are critical for translation regulation, but their in vivo regulatory features are poorly characterized. Here, we report the regulatory landscape of 5' UTRs during early zebrafish embryogenesis using a massively parallel reporter assay of 18,154 sequences coupled to polysome profiling. We found that the 5' UTR is sufficient to confer temporal dynamics to translation initiation, and identified 86 motifs enriched in 5' UTRs with distinct ribosome recruitment capabilities. A quantitative deep learning model, DaniO5P, revealed a combined role for 5' UTR length, translation initiation site context, upstream AUGs and sequence motifs on in vivo ribosome recruitment. DaniO5P predicts the activities of 5' UTR isoforms and indicates that modulating 5' UTR length and motif grammar contributes to translation initiation dynamics. This study provides a first quantitative model of 5' UTR-based translation regulation in early vertebrate development and lays the foundation for identifying the underlying molecular effectors.
Affiliation(s)
| | - Sebastian M Castillo-Hair
- Department of Electrical & Computer Engineering, University of Washington, Seattle, Washington 98195, United States
| | - Georg Seelig
- Department of Electrical & Computer Engineering, University of Washington, Seattle, Washington 98195, United States
- Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, Washington 98195, United States
| | - Alex F Schier
- Biozentrum, University of Basel, 4056 Basel, Switzerland
- Allen Discovery Center for Cell Lineage Tracing, Seattle, Washington 98195, United States
| |
|
31
|
Friedman RZ, Ramu A, Lichtarge S, Myers CA, Granas DM, Gause M, Corbo JC, Cohen BA, White MA. Active learning of enhancer and silencer regulatory grammar in photoreceptors. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.08.21.554146. [PMID: 37662358 PMCID: PMC10473580 DOI: 10.1101/2023.08.21.554146] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 09/05/2023]
Abstract
Cis-regulatory elements (CREs) direct gene expression in health and disease, and models that can accurately predict their activities from DNA sequences are crucial for biomedicine. Deep learning represents one emerging strategy to model the regulatory grammar that relates CRE sequence to function. However, these models require training data on a scale that exceeds the number of CREs in the genome. We address this problem using active machine learning to iteratively train models on multiple rounds of synthetic DNA sequences assayed in live mammalian retinas. During each round of training the model actively selects sequence perturbations to assay, thereby efficiently generating informative training data. We iteratively trained a model that predicts the activities of sequences containing binding motifs for the photoreceptor transcription factor Cone-rod homeobox (CRX) using an order of magnitude less training data than current approaches. The model's internal confidence estimates of its predictions are reliable guides for designing sequences with high activity. The model correctly identified critical sequence differences between active and inactive sequences with nearly identical transcription factor binding sites, and revealed order and spacing preferences for combinations of motifs. Our results establish active learning as an effective method to train accurate deep learning models of cis-regulatory function after exhausting naturally occurring training examples in the genome.
Affiliation(s)
- Ryan Z. Friedman
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Avinash Ramu
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Sara Lichtarge
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Connie A. Myers
- Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO, 63110
| | - David M. Granas
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Maria Gause
- Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Joseph C. Corbo
- Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Barak A. Cohen
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Michael A. White
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| |
|
32
|
Wienecke AN, Barry ML, Pollard DA. Natural variation in codon bias and mRNA folding strength interact synergistically to modify protein expression in Saccharomyces cerevisiae. Genetics 2023; 224:iyad113. [PMID: 37310925 PMCID: PMC10411576 DOI: 10.1093/genetics/iyad113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2023] [Revised: 04/10/2023] [Accepted: 05/15/2023] [Indexed: 06/15/2023] Open
Abstract
Codon bias and mRNA folding strength (mF) are hypothesized molecular mechanisms by which polymorphisms in genes modify protein expression. Natural patterns of codon bias and mF across genes as well as effects of altering codon bias and mF suggest that the influence of these 2 mechanisms may vary depending on the specific location of polymorphisms within a transcript. Despite the central role codon bias and mF may play in natural trait variation within populations, systematic studies of how polymorphic codon bias and mF relate to protein expression variation are lacking. To address this need, we analyzed genomic, transcriptomic, and proteomic data for 22 Saccharomyces cerevisiae isolates, estimated protein accumulation for each allele of 1,620 genes as the log of protein molecules per RNA molecule (logPPR), and built linear mixed-effects models associating allelic variation in codon bias and mF with allelic variation in logPPR. We found that codon bias and mF interact synergistically in a positive association with logPPR, and this interaction explains almost all the effects of codon bias and mF. We examined how the locations of polymorphisms within transcripts influence their effects and found that codon bias primarily acts through polymorphisms in domain-encoding and 3' coding sequences, while mF acts most significantly through coding sequences with weaker effects from untranslated regions. Our results present the most comprehensive characterization to date of how polymorphisms in transcripts influence protein expression.
Affiliation(s)
- Anastacia N Wienecke
- Biology Department, Western Washington University, Bellingham, WA 98225, USA
- Department of Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| | - Margaret L Barry
- Biology Department, Western Washington University, Bellingham, WA 98225, USA
| | - Daniel A Pollard
- Biology Department, Western Washington University, Bellingham, WA 98225, USA
| |
|
33
|
Yang W, Li D, Huang R. EVMP: enhancing machine learning models for synthetic promoter strength prediction by Extended Vision Mutant Priority framework. Front Microbiol 2023; 14:1215609. [PMID: 37476664 PMCID: PMC10354429 DOI: 10.3389/fmicb.2023.1215609] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Accepted: 06/19/2023] [Indexed: 07/22/2023] Open
Abstract
Introduction: In metabolic engineering and synthetic biology applications, promoters with appropriate strengths are critical. However, it is time-consuming and laborious to annotate promoter strength by experiments. Nowadays, constructing mutation-based synthetic promoter libraries that span multiple orders of magnitude of promoter strength is receiving increasing attention. A number of machine learning (ML) methods are applied to synthetic promoter strength prediction, but existing models are limited by the excessive proximity between synthetic promoters. Methods: In order to enhance ML models to better predict synthetic promoter strength, we propose EVMP (Extended Vision Mutant Priority), a universal framework that utilizes mutation information more effectively. In EVMP, synthetic promoters are equivalently transformed into a base promoter and the corresponding k-mer mutations, which are input into BaseEncoder and VarEncoder, respectively. EVMP also provides optional data augmentation, which generates multiple copies of the data by selecting different base promoters for the same synthetic promoter. Results: In the Trc synthetic promoter library, EVMP was applied to multiple ML models and the model effect was enhanced to varying extents, up to 61.30% (MAE), while the SOTA (state-of-the-art) record was improved by 15.25% (MAE) and 4.03% (R2). Data augmentation based on multiple base promoters further improved the model performance by 17.95% (MAE) and 7.25% (R2) compared with the non-EVMP SOTA record. Discussion: In further study, extended vision (or k-mer) is shown to be essential for EVMP. We also found that EVMP can alleviate the over-smoothing phenomenon, which may contribute to its effectiveness. Our work suggests that EVMP can highlight the mutation information of synthetic promoters and significantly improve the prediction accuracy of strength. The source code is publicly available on GitHub: https://github.com/Tiny-Snow/EVMP.
Affiliation(s)
- Weiqin Yang
- Institute of Marine Science and Technology, Shandong University, Qingdao, China
- School of Computer Science and Technology, Shandong University, Qingdao, China
| | - Dexin Li
- Institute of Marine Science and Technology, Shandong University, Qingdao, China
- School of Computer Science and Technology, Shandong University, Qingdao, China
| | - Ranran Huang
- Institute of Marine Science and Technology, Shandong University, Qingdao, China
| |
|
34
|
Li J, Li P, Liu Q, Li J, Qi H. Translation initiation consistency between in vivo and in vitro bacterial protein expression systems. Front Bioeng Biotechnol 2023; 11:1201580. [PMID: 37304134 PMCID: PMC10248181 DOI: 10.3389/fbioe.2023.1201580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2023] [Accepted: 05/17/2023] [Indexed: 06/13/2023] Open
Abstract
Strict on-demand control of protein synthesis is a crucial aspect of synthetic biology. The 5'-terminal untranslated region (5'-UTR) is an essential bacterial genetic element that can be designed for the regulation of translation initiation. However, there are insufficient systematic data on the consistency of 5'-UTR function among various bacterial cells and in vitro protein synthesis systems, which is crucial for the standardization and modularization of genetic elements in synthetic biology. Here, more than 400 expression cassettes comprising the GFP gene under the regulation of various 5'-UTRs were systematically characterized to evaluate the protein translation consistency in the two popular Escherichia coli strains JM109 and BL21, as well as in an in vitro protein expression system based on cell lysate. In contrast to the very strong correlation between the two cellular systems, the consistency between in vivo and in vitro protein translation was lost, whereby both in vivo and in vitro translation evidently deviated from the estimation of the standard statistical thermodynamic model. Finally, we found that the absence of nucleotide C and of complex secondary structure in the 5'-UTR significantly improves the efficiency of protein translation, both in vitro and in vivo.
Affiliation(s)
- Jiaojiao Li
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
- Frontier Science Center for Synthetic Biology (Ministry of Education), Tianjin University, Tianjin, China
| | - Peixian Li
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
- Frontier Science Center for Synthetic Biology (Ministry of Education), Tianjin University, Tianjin, China
| | - Qian Liu
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
- Frontier Science Center for Synthetic Biology (Ministry of Education), Tianjin University, Tianjin, China
| | - Jinjin Li
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
- Frontier Science Center for Synthetic Biology (Ministry of Education), Tianjin University, Tianjin, China
| | - Hao Qi
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
- Frontier Science Center for Synthetic Biology (Ministry of Education), Tianjin University, Tianjin, China
- Zhejiang Shaoxing Research Institute of Tianjin University, Shaoxing, China
| |
|
35
|
May GE, Akirtava C, Agar-Johnson M, Micic J, Woolford J, McManus J. Unraveling the influences of sequence and position on yeast uORF activity using massively parallel reporter systems and machine learning. eLife 2023; 12:e69611. [PMID: 37227054 PMCID: PMC10259493 DOI: 10.7554/elife.69611] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 04/20/2021] [Accepted: 05/24/2023] [Indexed: 05/26/2023] Open
Abstract
Upstream open-reading frames (uORFs) are potent cis-acting regulators of mRNA translation and nonsense-mediated decay (NMD). While both AUG- and non-AUG initiated uORFs are ubiquitous in ribosome profiling studies, few uORFs have been experimentally tested. Consequently, the relative influences of sequence, structural, and positional features on uORF activity have not been determined. We quantified thousands of yeast uORFs using massively parallel reporter assays in wildtype and ∆upf1 yeast. While nearly all AUG uORFs were robust repressors, most non-AUG uORFs had relatively weak impacts on expression. Machine learning regression modeling revealed that both uORF sequences and locations within transcript leaders predict their effect on gene expression. Indeed, alternative transcription start sites highly influenced uORF activity. These results define the scope of natural uORF activity, identify features associated with translational repression and NMD, and suggest that the locations of uORFs in transcript leaders are nearly as predictive as uORF sequences.
Affiliation(s)
- Gemma E May
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, United States
| | - Christina Akirtava
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, United States
| | - Matthew Agar-Johnson
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, United States
| | - Jelena Micic
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, United States
| | - John Woolford
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, United States
| | - Joel McManus
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, United States
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, United States
| |
|
36
|
Nikolados EM, Oyarzún DA. Deep learning for optimization of protein expression. Curr Opin Biotechnol 2023; 81:102941. [PMID: 37087839 DOI: 10.1016/j.copbio.2023.102941] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 11/22/2022] [Revised: 02/02/2023] [Accepted: 03/17/2023] [Indexed: 04/25/2023]
Abstract
Recent progress in high-throughput DNA synthesis and sequencing has enabled the development of massively parallel reporter assays for strain characterization. These datasets map a large number of DNA sequences to protein expression levels, sparking increased interest in data-driven methods for sequence-to-expression modeling. Here, we highlight advances in deep learning models of protein expression and their potential for optimizing strains engineered to produce recombinant proteins. We review recent works that built highly accurate models and discuss challenges that hinder adoption by end users. There is a need to better align this technology with the constraints encountered in strain engineering, particularly the cost of acquiring large amounts of data and the requirement for interpretable models that generalize beyond the training data. Overcoming these barriers will help to incentivize academic and industrial laboratories to tap into a new era of data-centric strain engineering.
Affiliation(s)
| | - Diego A Oyarzún
- School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JH, UK; School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK; The Alan Turing Institute, London NW1 2DB, UK.
| |
|
37
|
Park JH, Bassalo MC, Lin GM, Chen Y, Doosthosseini H, Schmitz J, Roubos JA, Voigt CA. Design of Four Small-Molecule-Inducible Systems in the Yeast Chromosome, Applied to Optimize Terpene Biosynthesis. ACS Synth Biol 2023; 12:1119-1132. [PMID: 36943773 PMCID: PMC10127285 DOI: 10.1021/acssynbio.2c00607] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 03/23/2023]
Abstract
The optimization of cellular functions often requires the balancing of gene expression, but the physical construction and screening of alternative designs are costly and time-consuming. Here, we construct a strain of Saccharomyces cerevisiae that carries a "sensor array" of bacterial regulators that respond to four small-molecule inducers (vanillic acid, xylose, aTc, IPTG). Four promoters can be independently controlled with low background and a 40- to 5000-fold dynamic range. These systems can be used to study the impact of changing the level and timing of gene expression without requiring the construction of multiple strains. We apply this approach to the optimization of a four-gene heterologous pathway to the terpene linalool, which is a flavor and a precursor to energetic materials. Using this approach, we identify bottlenecks in the metabolic pathway. This work can aid the rapid automated strain development of yeasts for the bio-manufacturing of diverse products, including chemicals, materials, fuels, and food ingredients.
Affiliation(s)
- Jong Hyun Park
- Synthetic Biology Center, Department of Biological Engineering, Massachusetts Institute of Technology, 500 Technology Square, Cambridge, Massachusetts 02139, United States
| | - Marcelo C Bassalo
- Synthetic Biology Center, Department of Biological Engineering, Massachusetts Institute of Technology, 500 Technology Square, Cambridge, Massachusetts 02139, United States
| | - Geng-Min Lin
- Synthetic Biology Center, Department of Biological Engineering, Massachusetts Institute of Technology, 500 Technology Square, Cambridge, Massachusetts 02139, United States
| | - Ye Chen
- Synthetic Biology Center, Department of Biological Engineering, Massachusetts Institute of Technology, 500 Technology Square, Cambridge, Massachusetts 02139, United States
| | - Hamid Doosthosseini
- Synthetic Biology Center, Department of Biological Engineering, Massachusetts Institute of Technology, 500 Technology Square, Cambridge, Massachusetts 02139, United States
| | - Joep Schmitz
- DSM Science & Innovation, Biodata & Translational Sciences, P.O. Box 1, 2600 MA Delft, The Netherlands
| | - Johannes A Roubos
- DSM Science & Innovation, Biodata & Translational Sciences, P.O. Box 1, 2600 MA Delft, The Netherlands
| | - Christopher A Voigt
- Synthetic Biology Center, Department of Biological Engineering, Massachusetts Institute of Technology, 500 Technology Square, Cambridge, Massachusetts 02139, United States
| |
|
38
|
Li Z, Gao E, Zhou J, Han W, Xu X, Gao X. Applications of deep learning in understanding gene regulation. CELL REPORTS METHODS 2023; 3:100384. [PMID: 36814848 PMCID: PMC9939384 DOI: 10.1016/j.crmeth.2022.100384] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Indexed: 05/22/2023]
Abstract
Gene regulation is a central topic in cell biology. Advances in omics technologies and the accumulation of omics data have provided better opportunities for gene regulation studies than ever before. For this reason deep learning, as a data-driven predictive modeling approach, has been successfully applied to this field during the past decade. In this article, we aim to give a brief yet comprehensive overview of representative deep-learning methods for gene regulation. Specifically, we discuss and compare the design principles and datasets used by each method, creating a reference for researchers who wish to replicate or improve existing methods. We also discuss the common problems of existing approaches and prospectively introduce the emerging deep-learning paradigms that will potentially alleviate them. We hope that this article will provide a rich and up-to-date resource and shed light on future research directions in this area.
Affiliation(s)
- Zhongxiao Li
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Elva Gao
- The KAUST School, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Juexiao Zhou
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Wenkai Han
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Xiaopeng Xu
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Xin Gao
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| |
|
39
|
Zabolotskii AI, Kozlovskiy SV, Katrukha AG. The Influence of the Nucleotide Composition of Genes and Gene Regulatory Elements on the Efficiency of Protein Expression in Escherichia coli. BIOCHEMISTRY (MOSCOW) 2023; 88:S176-S191. [PMID: 37069120 DOI: 10.1134/s0006297923140109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/22/2023]
Abstract
Recombinant proteins expressed in Escherichia coli are widely used in biochemical research and industrial processes. At the same time, achieving higher protein expression levels and correct protein folding remains the key problem, since optimization of nutrient media, growth conditions, and methods for induction of protein synthesis does not always lead to the desired result. Often, low protein expression is determined by the sequences of the expressed genes and their regulatory regions. The genetic code is degenerate; 18 out of 20 amino acids are encoded by more than one codon. Choosing between synonymous codons in the coding sequence can significantly affect the level of protein expression and protein folding. The gene nucleotide composition influences the probability of formation of secondary mRNA structures, which affect ribosome binding at the translation initiation phase as well as ribosome movement along the mRNA during elongation; this, in turn, influences mRNA degradation and the folding of the nascent protein. The nucleotide composition of the mRNA untranslated regions, in particular the promoter and Shine-Dalgarno sequences, also affects the efficiency of mRNA transcription, translation, and degradation. In this review, we describe the genetic principles that determine the efficiency of protein production in Escherichia coli.
Affiliation(s)
- Artur I Zabolotskii
- Faculty of Biology, Lomonosov Moscow State University, Moscow, 119991, Russia
| | | | - Alexey G Katrukha
- Faculty of Biology, Lomonosov Moscow State University, Moscow, 119991, Russia
| |
|
40
|
Abstract
This chapter outlines the myriad applications of machine learning (ML) in synthetic biology, specifically in engineering cell and protein activity, and metabolic pathways. Though by no means comprehensive, the chapter highlights several prominent computational tools applied in the field and their potential use cases. The examples detailed reinforce how ML algorithms can enhance synthetic biology research by providing data-driven insights into the behavior of living systems, even without detailed knowledge of their underlying mechanisms. By doing so, ML promises to increase the efficiency of research projects by modeling hypotheses in silico that can then be tested through experiments. While challenges related to training dataset generation and computational costs remain, ongoing improvements in ML tools are paving the way for smarter and more streamlined synthetic biology workflows that can be readily employed to address grand challenges across manufacturing, medicine, engineering, agriculture, and beyond.
Affiliation(s)
- Brendan Fu-Long Sieow
- NUS Synthetic Biology for Clinical and Technological Innovation (SynCTI), National University of Singapore, Singapore, Singapore
- Synthetic Biology Translational Research Programme, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- NUS Graduate School for Integrative Sciences and Engineering Programme, National University of Singapore, Singapore, Singapore
| | - Ryan De Sotto
- NUS Synthetic Biology for Clinical and Technological Innovation (SynCTI), National University of Singapore, Singapore, Singapore
- Synthetic Biology Translational Research Programme, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
| | - Zhi Ren Darren Seet
- NUS Synthetic Biology for Clinical and Technological Innovation (SynCTI), National University of Singapore, Singapore, Singapore
- Synthetic Biology Translational Research Programme, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
| | - In Young Hwang
- NUS Synthetic Biology for Clinical and Technological Innovation (SynCTI), National University of Singapore, Singapore, Singapore
- Synthetic Biology Translational Research Programme, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
| | - Matthew Wook Chang
- NUS Synthetic Biology for Clinical and Technological Innovation (SynCTI), National University of Singapore, Singapore, Singapore
- Synthetic Biology Translational Research Programme, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
| |
|
41
|
Accuracy and data efficiency in deep learning models of protein expression. Nat Commun 2022; 13:7755. [PMID: 36517468 PMCID: PMC9751117 DOI: 10.1038/s41467-022-34902-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 03/21/2022] [Accepted: 11/10/2022] [Indexed: 12/23/2022] Open
Abstract
Synthetic biology often involves engineering microbial strains to express high-value proteins. Thanks to progress in rapid DNA synthesis and sequencing, deep learning has emerged as a promising approach to build sequence-to-expression models for strain optimization. But such models need large and costly training data that create steep entry barriers for many laboratories. Here we study the relation between accuracy and data efficiency in an atlas of machine learning models trained on datasets of varied size and sequence diversity. We show that deep learning can achieve good prediction accuracy with much smaller datasets than previously thought. We demonstrate that controlled sequence diversity leads to substantial gains in data efficiency and employ Explainable AI to show that convolutional neural networks can finely discriminate between input DNA sequences. Our results provide guidelines for designing genotype-phenotype screens that balance cost and quality of training data, thus helping promote the wider adoption of deep learning in the biotechnology sector.
|
42
|
Li K, Kong J, Zhang S, Zhao T, Qian W. Distance-dependent inhibition of translation initiation by downstream out-of-frame AUGs is consistent with a Brownian ratchet process of ribosome scanning. Genome Biol 2022; 23:254. [PMID: 36510274 PMCID: PMC9743702 DOI: 10.1186/s13059-022-02829-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 11/26/2021] [Accepted: 12/01/2022] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND: Eukaryotic ribosomes are widely presumed to scan mRNA for the AUG codon to initiate translation in a strictly 5'-3' movement (i.e., strictly unidirectional scanning model), so that ribosomes initiate translation exclusively at the 5' proximal AUG codon (i.e., the first-AUG rule). RESULTS: We generate 13,437 yeast variants, each with an ATG triplet placed downstream (dATGs) of the annotated ATG (aATG) codon of a green fluorescent protein. We find that out-of-frame dATGs can inhibit translation at the aATG, but with diminishing strength over increasing distance between aATG and dATG, undetectable beyond ~17 nt. This phenomenon is best explained by a Brownian ratchet mechanism of ribosome scanning, in which the ribosome uses small-amplitude 5'-3' and 3'-5' oscillations with a net 5'-3' movement to scan the AUG codon, thereby leading to competition for translation initiation between aAUG and a proximal dAUG. This scanning model further predicts that the inhibitory effect induced by an out-of-frame upstream AUG triplet (uAUG) will diminish as uAUG approaches aAUG, which is indeed observed among the 15,586 uATG variants generated in this study. Computational simulations suggest that each triplet is scanned back and forth approximately ten times until the ribosome eventually migrates to downstream regions. Moreover, this scanning process could constrain the evolution of sequences downstream of the aATG to minimize proximal out-of-frame dATG triplets in yeast and humans. CONCLUSIONS: Collectively, our findings uncover the basic process by which eukaryotic ribosomes scan for initiation codons, and how this process could shape eukaryotic genome evolution.
Affiliation(s)
- Ke Li
  - State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Innovation Academy for Seed Design, Chinese Academy of Sciences, Beijing, 100101, China
- Jinhui Kong
  - State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Innovation Academy for Seed Design, Chinese Academy of Sciences, Beijing, 100101, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Shuo Zhang
  - State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Innovation Academy for Seed Design, Chinese Academy of Sciences, Beijing, 100101, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Tong Zhao
  - Institute of Microbiology, Chinese Academy of Sciences, Beijing, 100101, China
- Wenfeng Qian
  - State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Innovation Academy for Seed Design, Chinese Academy of Sciences, Beijing, 100101, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
|
43
|
Wang J, Shin BS, Alvarado C, Kim JR, Bohlen J, Dever TE, Puglisi JD. Rapid 40S scanning and its regulation by mRNA structure during eukaryotic translation initiation. Cell 2022; 185:4474-4487.e17. [PMID: 36334590 PMCID: PMC9691599 DOI: 10.1016/j.cell.2022.10.005] [Citation(s) in RCA: 39] [Impact Index Per Article: 13.0] [Received: 01/05/2022] [Revised: 08/22/2022] [Accepted: 10/05/2022] [Indexed: 11/06/2022]
Abstract
How the eukaryotic 43S preinitiation complex scans along the 5' untranslated region (5' UTR) of a capped mRNA to locate the correct start codon remains elusive. Here, we directly track yeast 43S-mRNA binding, scanning, and 60S subunit joining by real-time single-molecule fluorescence spectroscopy. 43S engagement with mRNA occurs through a slow, ATP-dependent process driven by multiple initiation factors including the helicase eIF4A. Once engaged, 43S scanning occurs rapidly and directionally at ∼100 nucleotides per second, independent of multiple cycles of ATP hydrolysis by RNA helicases post ribosomal loading. Scanning ribosomes can proceed through RNA secondary structures, but 5' UTR hairpin sequences near start codons drive scanning ribosomes at start codons backward in the 5' direction, requiring rescanning to arrive once more at a start codon. Direct observation of scanning ribosomes provides a mechanistic framework for translational regulation by 5' UTR structures and upstream near-cognate start codons.
Affiliation(s)
- Jinfan Wang
  - Department of Structural Biology, Stanford University School of Medicine, Stanford, CA, USA
- Byung-Sik Shin
  - Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA
- Carlos Alvarado
  - Department of Structural Biology, Stanford University School of Medicine, Stanford, CA, USA
- Joo-Ran Kim
  - Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA
- Jonathan Bohlen
  - Laboratory of Human Genetics of Infectious Diseases, Necker Branch, Institut National de la Santé et de la Recherche Médicale U1163, Paris, France; University of Paris, Imagine Institute, Paris, France
- Thomas E Dever
  - Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA
- Joseph D Puglisi
  - Department of Structural Biology, Stanford University School of Medicine, Stanford, CA, USA
|
44
|
High-throughput approaches to functional characterization of genetic variation in yeast. Curr Opin Genet Dev 2022; 76:101979. [PMID: 36075138 DOI: 10.1016/j.gde.2022.101979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2022] [Revised: 07/29/2022] [Accepted: 08/02/2022] [Indexed: 11/20/2022]
Abstract
Expansion of sequencing efforts to include thousands of genomes is providing a fundamental resource for determining the genetic diversity that exists in a population. Now, high-throughput approaches are necessary to begin to understand the role these genotypic changes play in affecting phenotypic variation. Saccharomyces cerevisiae maintains its position as an excellent model system to determine the function of unknown variants with its exceptional genetic diversity, phenotypic diversity, and reliable genetic manipulation tools. Here, we review strategies and techniques developed in yeast that scale classic approaches of assessing variant function. These approaches improve our ability to better map quantitative trait loci at a higher resolution, even for rare variants, and are already providing greater insight into the role that different types of mutations play in phenotypic variation and evolution not just in yeast but across taxa.
|
45
|
Controlling gene expression with deep generative design of regulatory DNA. Nat Commun 2022; 13:5099. [PMID: 36042233 PMCID: PMC9427793 DOI: 10.1038/s41467-022-32818-8] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Received: 01/27/2022] [Accepted: 08/18/2022] [Indexed: 11/25/2022] Open
Abstract
Design of de novo synthetic regulatory DNA is a promising avenue to control gene expression in biotechnology and medicine. Using mutagenesis typically requires screening sizable random DNA libraries, which limits the designs to span merely a short section of the promoter and restricts their control of gene expression. Here, we prototype a deep learning strategy based on generative adversarial networks (GAN) by learning directly from genomic and transcriptomic data. Our ExpressionGAN can traverse the entire regulatory sequence-expression landscape in a gene-specific manner, generating regulatory DNA with prespecified target mRNA levels spanning the whole gene regulatory structure including coding and adjacent non-coding regions. Despite high sequence divergence from natural DNA, in vivo measurements show that 57% of the highly-expressed synthetic sequences surpass the expression levels of highly-expressed natural controls. This demonstrates the applicability and relevance of deep generative design to expand our knowledge and control of gene expression regulation in any desired organism, condition or tissue. Here the authors present ExpressionGAN, a generative adversarial network that uses genomic and transcriptomic data to generate regulatory sequences.
|
46
|
Design of 5′-UTR to Enhance Keratinase Activity in Bacillus subtilis. FERMENTATION-BASEL 2022. [DOI: 10.3390/fermentation8090426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Keratinase is an important industrial enzyme, but its application performance is limited by its low activity. A rational design of 5′-UTRs that increases translation efficiency is an important approach to enhance protein expression. Herein, we optimized the 5′-UTR of the recombinant keratinase KerZ1 expression element to enhance its secretory activity in Bacillus subtilis WB600 through Spacer design, RBS screening, and sequence simplification. First, the A/U content in Spacer was increased by the site-directed saturation mutation of G/C bases, and the activity of keratinase secreted by mutant strain B. subtilis WB600-SP was 7.94 times higher than that of KerZ1. Subsequently, the keratinase activity secreted by the mutant strain B. subtilis WB600-SP-R was further increased to 13.45 times that of KerZ1 based on the prediction of RBS translation efficiency and the multi-site saturation mutation screening. Finally, the keratinase activity secreted by the mutant strain B. subtilis WB600-SP-R-D reached 204.44 KU mL−1 by reducing the length of the 5′ end of the 5′-UTR, which was 19.70 times that of KerZ1. In a 5 L fermenter, the keratinase activity secreted by B. subtilis WB600-SP-R-D after 25 h fermentation was 797.05 KU mL−1, which indicated its high production intensity. Overall, the strategy of this study and the obtained keratinase mutants will provide a good reference for the expression regulation of keratinase and other industrial enzymes.
|
47
|
Fages-Lartaud M, Tietze L, Elie F, Lale R, Hohmann-Marriott MF. mCherry contains a fluorescent protein isoform that interferes with its reporter function. Front Bioeng Biotechnol 2022; 10:892138. [PMID: 36017355 PMCID: PMC9395592 DOI: 10.3389/fbioe.2022.892138] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/08/2022] [Accepted: 06/30/2022] [Indexed: 11/13/2022] Open
Abstract
Fluorescent proteins are essential reporters in cell and molecular biology. Here, we found that red-fluorescent proteins possess an alternative translation initiation site that produces a short functional protein isoform in both prokaryotes and eukaryotes. The short isoform creates significant background fluorescence that biases the outcome of expression studies. In this study, we identified the short protein isoform, traced its origin, and determined the extent of the issue within the family of red fluorescent protein. Our analysis showed that the short isoform defect of the red fluorescent protein family may affect the interpretation of many published studies. We provided a re-engineered mCherry variant that lacks background expression as an improved tool for imaging and protein expression studies.
Affiliation(s)
- Maxime Fages-Lartaud
  - Department of Biotechnology, Norwegian University of Science and Technology, Trondheim, Norway
- Lisa Tietze
  - Department of Biotechnology, Norwegian University of Science and Technology, Trondheim, Norway
- Florence Elie
  - Department of Biotechnology, Norwegian University of Science and Technology, Trondheim, Norway
- Rahmi Lale
  - Department of Biotechnology, Norwegian University of Science and Technology, Trondheim, Norway
- Martin Frank Hohmann-Marriott
  - Department of Biotechnology, Norwegian University of Science and Technology, Trondheim, Norway
  - United Scientists CORE (Limited), Dunedin, New Zealand
|
48
|
Beardall WA, Stan GB, Dunlop MJ. Deep Learning Concepts and Applications for Synthetic Biology. GEN BIOTECHNOLOGY 2022; 1:360-371. [PMID: 36061221 PMCID: PMC9428732 DOI: 10.1089/genbio.2022.0017] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/27/2022] [Accepted: 07/14/2022] [Indexed: 12/24/2022]
Abstract
Synthetic biology has a natural synergy with deep learning. It can be used to generate large data sets to train models, for example by using DNA synthesis, and deep learning models can be used to inform design, such as by generating novel parts or suggesting optimal experiments to conduct. Recently, research at the interface of engineering biology and deep learning has highlighted this potential through successes including the design of novel biological parts, protein structure prediction, automated analysis of microscopy data, optimal experimental design, and biomolecular implementations of artificial neural networks. In this review, we present an overview of synthetic biology-relevant classes of data and deep learning architectures. We also highlight emerging studies in synthetic biology that capitalize on deep learning to enable novel understanding and design, and discuss challenges and future opportunities in this space.
Affiliation(s)
- William A.V. Beardall
  - Department of Bioengineering, Imperial College London, London, United Kingdom
  - Imperial College Centre of Excellence in Synthetic Biology, Imperial College London, London, United Kingdom
- Guy-Bart Stan
  - Department of Bioengineering, Imperial College London, London, United Kingdom
  - Imperial College Centre of Excellence in Synthetic Biology, Imperial College London, London, United Kingdom
- Mary J. Dunlop
  - Department of Biomedical Engineering, Boston University, Boston, Massachusetts, USA
  - Biological Design Center, Boston University, Boston, Massachusetts, USA
|
49
|
Villalobos-Alva J, Ochoa-Toledo L, Villalobos-Alva MJ, Aliseda A, Pérez-Escamirosa F, Altamirano-Bustamante NF, Ochoa-Fernández F, Zamora-Solís R, Villalobos-Alva S, Revilla-Monsalve C, Kemper-Valverde N, Altamirano-Bustamante MM. Protein Science Meets Artificial Intelligence: A Systematic Review and a Biochemical Meta-Analysis of an Inter-Field. Front Bioeng Biotechnol 2022; 10:788300. [PMID: 35875501 PMCID: PMC9301016 DOI: 10.3389/fbioe.2022.788300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2021] [Accepted: 05/25/2022] [Indexed: 11/23/2022] Open
Abstract
Proteins are some of the most fascinating and challenging molecules in the universe, and they pose a big challenge for artificial intelligence. The implementation of machine learning/AI in protein science gives rise to a world of knowledge adventures in the workhorse of the cell and proteome homeostasis, which are essential for making life possible. This opens up epistemic horizons thanks to a coupling of human tacit-explicit knowledge with machine learning power, the benefits of which are already tangible, such as important advances in protein structure prediction. Moreover, the driving force behind the protein processes of self-organization, adjustment, and fitness requires a space corresponding to gigabytes of life data in its order of magnitude. There are many tasks such as novel protein design, protein folding pathways, and synthetic metabolic routes, as well as protein-aggregation mechanisms, pathogenesis of protein misfolding and disease, and proteostasis networks that are currently unexplored or unrevealed. In this systematic review and biochemical meta-analysis, we aim to contribute to bridging the gap between what we call binomial artificial intelligence (AI) and protein science (PS), a growing research enterprise with exciting and promising biotechnological and biomedical applications. We undertake our task by exploring "the state of the art" in AI and machine learning (ML) applications to protein science in the scientific literature to address some critical research questions in this domain, including What kind of tasks are already explored by ML approaches to protein sciences? What are the most common ML algorithms and databases used? What is the situational diagnostic of the AI-PS inter-field? What do ML processing steps have in common? We also formulate novel questions such as Is it possible to discover what the rules of protein evolution are with the binomial AI-PS? How do protein folding pathways evolve? What are the rules that dictate the folds? 
What are the minimal nuclear protein structures? How do protein aggregates form and why do they exhibit different toxicities? What are the structural properties of amyloid proteins? How can we design an effective proteostasis network to deal with misfolded proteins? We are a cross-functional group of scientists from several academic disciplines, and we have conducted the systematic review using a variant of the PICO and PRISMA approaches. The search was carried out in four databases (PubMed, Bireme, OVID, and EBSCO Web of Science), resulting in 144 research articles. After three rounds of quality screening, 93 articles were finally selected for further analysis. A summary of our findings is as follows: regarding AI applications, there are mainly four types: 1) genomics, 2) protein structure and function, 3) protein design and evolution, and 4) drug design. In terms of the ML algorithms and databases used, supervised learning was the most common approach (85%). As for the databases used for the ML models, PDB and UniprotKB/Swissprot were the most common ones (21 and 8%, respectively). Moreover, we identified that approximately 63% of the articles organized their results into three steps, which we labeled pre-process, process, and post-process. A few studies combined data from several databases or created their own databases after the pre-process. Our main finding is that, as of today, there are no research road maps serving as guides to address gaps in our knowledge of the AI-PS binomial. All research efforts to collect, integrate multidimensional data features, and then analyze and validate them are, so far, uncoordinated and scattered throughout the scientific literature without a clear epistemic goal or connection between the studies. 
Therefore, our main contribution to the scientific literature is to offer a road map to help solve problems in drug design, protein structures, design, and function prediction while also presenting the "state of the art" on research in the AI-PS binomial until February 2021. Thus, we pave the way toward future advances in the synthetic redesign of novel proteins and protein networks and artificial metabolic pathways, learning lessons from nature for the welfare of humankind. Many of the novel proteins and metabolic pathways are currently non-existent in nature, nor are they used in the chemical industry or biomedical field.
Affiliation(s)
- Jalil Villalobos-Alva
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Luis Ochoa-Toledo
  - Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Mario Javier Villalobos-Alva
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Atocha Aliseda
  - Instituto de Investigaciones Filosóficas, Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Fernando Pérez-Escamirosa
  - Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Francine Ochoa-Fernández
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Ricardo Zamora-Solís
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Sebastián Villalobos-Alva
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Cristina Revilla-Monsalve
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Nicolás Kemper-Valverde
  - Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Myriam M. Altamirano-Bustamante
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
|
50
|
Abstract
The tremendous amount of biological sequence data available, combined with the recent methodological breakthrough in deep learning in domains such as computer vision or natural language processing, is leading today to the transformation of bioinformatics through the emergence of deep genomics, the application of deep learning to genomic sequences. We review here the new applications that the use of deep learning enables in the field, focusing on three aspects: the functional annotation of genomes, the sequence determinants of the genome functions and the possibility to write synthetic genomic sequences.
|