1
Chandrashekar PB, Chen H, Lee M, Ahmadinejad N, Liu L. DeepCORE: An interpretable multi-view deep neural network model to detect co-operative regulatory elements. Comput Struct Biotechnol J 2024; 23:679-687. [PMID: 38292477 PMCID: PMC10825326 DOI: 10.1016/j.csbj.2023.12.044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Revised: 12/14/2023] [Accepted: 12/27/2023] [Indexed: 02/01/2024] [Open Access]
Abstract
Gene transcription is an essential process involved in all aspects of cellular functions with significant impact on biological traits and diseases. This process is tightly regulated by multiple elements that co-operate to jointly modulate the transcription levels of target genes. To decipher the complicated regulatory network, we present a novel multi-view attention-based deep neural network that models the relationship between genetic, epigenetic, and transcriptional patterns and identifies co-operative regulatory elements (COREs). We applied this new method, named DeepCORE, to predict transcriptomes in various tissues and cell lines, which outperformed the state-of-the-art algorithms. Furthermore, DeepCORE contains an interpreter that extracts the attention values embedded in the deep neural network, maps the attended regions to putative regulatory elements, and infers COREs based on correlated attentions. The identified COREs are significantly enriched with known promoters and enhancers. Novel regulatory elements discovered by DeepCORE showed epigenetic signatures consistent with the status of histone modification marks.
Affiliation(s)
- Pramod Bharadwaj Chandrashekar
- Waisman Center, University of Wisconsin-Madison, Madison, WI 53705, USA
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53076, USA
- Hai Chen
- College of Health Solutions, Arizona State University, Phoenix, AZ, United States
- Biodesign Institute, Arizona State University, Tempe, AZ, United States
- Matthew Lee
- College of Health Solutions, Arizona State University, Phoenix, AZ, United States
- Navid Ahmadinejad
- College of Health Solutions, Arizona State University, Phoenix, AZ, United States
- Biodesign Institute, Arizona State University, Tempe, AZ, United States
- Li Liu
- College of Health Solutions, Arizona State University, Phoenix, AZ, United States
- Biodesign Institute, Arizona State University, Tempe, AZ, United States
2
Zhu W, Li W, Zhang H, Li L. Big data and artificial intelligence-aided crop breeding: Progress and prospects. Journal of Integrative Plant Biology 2024. [PMID: 39467106 DOI: 10.1111/jipb.13791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2024] [Revised: 08/25/2024] [Accepted: 09/10/2024] [Indexed: 10/30/2024]
Abstract
The past decade has witnessed rapid developments in gene discovery, biological big data (BBD), artificial intelligence (AI)-aided technologies, and molecular breeding. These advancements are expected to accelerate crop breeding under the pressure of increasing demands for food. Here, we first summarize current breeding methods and discuss the need for new ways to support breeding efforts. Then, we review how to combine BBD and AI technologies for genetic dissection, exploring functional genes, predicting regulatory elements and functional domains, and phenotypic prediction. Finally, we propose the concept of intelligent precision design breeding (IPDB) driven by AI technology and offer ideas about how to implement IPDB. We hope that IPDB will improve the predictability and efficiency of crop breeding and reduce its cost compared with current technologies. As an example of IPDB, we explore the possibilities offered by CropGPT, which combines biological techniques, bioinformatics, and breeding art from breeders, and presents an open, shareable, and cooperative breeding system. IPDB provides integrated services and communication platforms for biologists, bioinformatics experts, germplasm resource specialists, breeders, dealers, and farmers, and should be well suited for future breeding.
Affiliation(s)
- Wanchao Zhu
- Key Laboratory of Biology and Genetic Improvement of Maize in Arid Area of Northwest Region, College of Agronomy, Northwest A&F University, Yangling, 712100, China
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Weifu Li
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, 430070, China
- Hongwei Zhang
- State Key Laboratory of Crop Gene Resources and Breeding, National Key Facility for Crop Gene Resources and Genetic Improvement, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Lin Li
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
3
He AY, Palamuttam NP, Danko CG. Training deep learning models on personalized genomic sequences improves variant effect prediction. bioRxiv: The Preprint Server for Biology 2024:2024.10.15.618510. [PMID: 39463940 PMCID: PMC11507713 DOI: 10.1101/2024.10.15.618510] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/29/2024]
Abstract
Sequence-to-function models have broad applications in interpreting the molecular impact of genetic variation, yet have been criticized for poor performance in this task. Here we show that training models on functional genomic data with matched personal genomes improves their performance at variant effect prediction. Variant effect representations are retained even when transfer learning models to unseen cellular contexts and experimental readouts. Our results have implications for interpreting trait-associated genetic variation.
4
Li Z, Zhang Y, Peng B, Qin S, Zhang Q, Chen Y, Chen C, Bao Y, Zhu Y, Hong Y, Liu B, Liu Q, Xu L, Chen X, Ma X, Wang H, Xie L, Yao Y, Deng B, Li J, De B, Chen Y, Wang J, Li T, Liu R, Tang Z, Cao J, Zuo E, Mei C, Zhu F, Shao C, Wang G, Sun T, Wang N, Liu G, Ni JQ, Liu Y. A novel interpretable deep learning-based computational framework designed synthetic enhancers with broad cross-species activity. Nucleic Acids Res 2024:gkae912. [PMID: 39420601 DOI: 10.1093/nar/gkae912] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2024] [Revised: 09/25/2024] [Accepted: 10/03/2024] [Indexed: 10/19/2024] [Open Access]
Abstract
Enhancers play a critical role in dynamically regulating spatial-temporal gene expression and establishing cell identity, underscoring the significance of designing them with specific properties for applications in biosynthetic engineering and gene therapy. Despite numerous high-throughput methods facilitating genome-wide enhancer identification, deciphering the sequence determinants of their activity remains challenging. Here, we present the DREAM (DNA cis-Regulatory Elements with controllable Activity design platforM) framework, a novel deep learning-based approach for synthetic enhancer design. Proficient in uncovering subtle and intricate patterns within extensive enhancer screening data, DREAM achieves cutting-edge sequence-based enhancer activity prediction and highlights critical sequence features implicating strong enhancer activity. Leveraging DREAM, we have engineered enhancers that surpass the potency of the strongest enhancer within the Drosophila genome by approximately 3.6-fold. Remarkably, these synthetic enhancers exhibited conserved functionality across species that have diverged for more than a billion years, indicating that DREAM was able to learn highly conserved enhancer regulatory grammar. Additionally, we designed silencers and cell line-specific enhancers using DREAM, demonstrating its versatility. Overall, our study not only introduces an interpretable approach for enhancer design but also lays out a general framework applicable to the design of other types of cis-regulatory elements.
Affiliation(s)
- Zhaohong Li
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Yuanyuan Zhang
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Bo Peng
- Gene Regulatory Lab, School of Basic Medical Sciences, Tsinghua University, NO. 30 Shuangqing road, Haidian district, Beijing 100084, China
- State Key Laboratory of Molecular Oncology, Tsinghua University, NO. 30 Shuangqing road, Haidian district, Beijing 100084, China
- Shenghua Qin
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Qian Zhang
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, NO.1 Beichen West Road, Chaoyang District, Beijing 100101, China
- Yun Chen
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Choulin Chen
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Yongzhou Bao
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Yuqi Zhu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, NO. 7 Pengfei Road, Dapeng District, Shenzhen 518124, China
- Yi Hong
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, NO. 7 Pengfei Road, Dapeng District, Shenzhen 518124, China
- Binghua Liu
- State Key Laboratory of Maricultural Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, NO.106 Nanjing Road, Shinan District, Qingdao, Shandong 266071, China
- Qian Liu
- State Key Laboratory of Maricultural Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, NO.106 Nanjing Road, Shinan District, Qingdao, Shandong 266071, China
- Lingna Xu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Xi Chen
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Xinhao Ma
- College of Grassland Agriculture, National Beef Cattle Improvement Center, College of Animal Science and Technology, Northwest A&F University, NO. 3 Taicheng Road, Yangling District, Yangling, Shaanxi 712100, China
- Hongyan Wang
- State Key Laboratory of Maricultural Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, NO.106 Nanjing Road, Shinan District, Qingdao, Shandong 266071, China
- Long Xie
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Yilong Yao
- Green Healthy Aquaculture Research Center, Kunpeng Institute of Modern Agriculture at Foshan, Chinese Academy of Agricultural Sciences, Building 26 Lihe Technology Park, Auxiliary Road of Xinxi Avenue South, Nanhai District, Foshan 528226, China
- Biao Deng
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Jiaying Li
- Department of Ophthalmology, Beijing Institute of Ophthalmology, Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University, Dongjiaomin lane No1, Dongcheng District, Beijing 100101, China
- Baojun De
- College of Life Sciences, Inner Mongolia Autonomous Region Key Laboratory of Biomanufacturing, Inner Mongolia Agricultural University, NO. 306 Zhaowuda Road, Saihan District, Hohhot 010018, China
- Yuting Chen
- College of Life Sciences, Inner Mongolia Autonomous Region Key Laboratory of Biomanufacturing, Inner Mongolia Agricultural University, NO. 306 Zhaowuda Road, Saihan District, Hohhot 010018, China
- Jing Wang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Tian Li
- College of JUNCAO Science and Ecology, Haixia Institute of Science and Technology, National Engineering Research Center of JUNCAO, Fujian Agriculture and Forestry University (FAFU), NO.15 Shangxiadian Road, Cangshan District, Fuzhou 0350002, China
- Ranran Liu
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Yuanmingyuan West Road NO. 2, Haidian District, Beijing 100193, China
- Zhonglin Tang
- Green Healthy Aquaculture Research Center, Kunpeng Institute of Modern Agriculture at Foshan, Chinese Academy of Agricultural Sciences, Building 26 Lihe Technology Park, Auxiliary Road of Xinxi Avenue South, Nanhai District, Foshan 528226, China
- Junwei Cao
- College of Life Sciences, Inner Mongolia Autonomous Region Key Laboratory of Biomanufacturing, Inner Mongolia Agricultural University, NO. 306 Zhaowuda Road, Saihan District, Hohhot 010018, China
- Erwei Zuo
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Chugang Mei
- College of Grassland Agriculture, National Beef Cattle Improvement Center, College of Animal Science and Technology, Northwest A&F University, NO. 3 Taicheng Road, Yangling District, Yangling, Shaanxi 712100, China
- Fangjie Zhu
- College of JUNCAO Science and Ecology, Haixia Institute of Science and Technology, National Engineering Research Center of JUNCAO, Fujian Agriculture and Forestry University (FAFU), NO.15 Shangxiadian Road, Cangshan District, Fuzhou 0350002, China
- Changwei Shao
- State Key Laboratory of Maricultural Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, NO.106 Nanjing Road, Shinan District, Qingdao, Shandong 266071, China
- Guirong Wang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Tongjun Sun
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, NO. 7 Pengfei Road, Dapeng District, Shenzhen 518124, China
- Ningli Wang
- Department of Ophthalmology, Beijing Institute of Ophthalmology, Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University, Dongjiaomin lane No1, Dongcheng District, Beijing 100101, China
- Gang Liu
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, NO.1 Beichen West Road, Chaoyang District, Beijing 100101, China
- Jian-Quan Ni
- Gene Regulatory Lab, School of Basic Medical Sciences, Tsinghua University, NO. 30 Shuangqing road, Haidian district, Beijing 100084, China
- State Key Laboratory of Molecular Oncology, Tsinghua University, NO. 30 Shuangqing road, Haidian district, Beijing 100084, China
- SXMU-Tsinghua Collaborative Innovation Center for Frontier Medicine, Shanxi Medical University, NO. 56 Xinjian South Road, Yingze District, Taiyuan 030001, China
- Yuwen Liu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Green Healthy Aquaculture Research Center, Kunpeng Institute of Modern Agriculture at Foshan, Chinese Academy of Agricultural Sciences, Building 26 Lihe Technology Park, Auxiliary Road of Xinxi Avenue South, Nanhai District, Foshan 528226, China
5
Ramprasad P, Pai N, Pan W. Enhancing personalized gene expression prediction from DNA sequences using genomic foundation models. HGG Advances 2024; 5:100347. [PMID: 39205391 PMCID: PMC11416237 DOI: 10.1016/j.xhgg.2024.100347] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2024] [Revised: 08/23/2024] [Accepted: 08/23/2024] [Indexed: 09/04/2024] [Open Access]
Abstract
Artificial intelligence (AI)/deep learning (DL) models that predict molecular phenotypes like gene expression directly from DNA sequences have recently emerged. While these models have proven effective at capturing the variation across genes, their ability to explain inter-individual differences has been limited. We hypothesize that the performance gap can be narrowed through the use of pre-trained embeddings from the Nucleotide Transformer, a large foundation model trained on 3,000+ genomes. We train a transformer model using the pre-trained embeddings and compare its predictive performance to Enformer, the current state-of-the-art model, using genotype and expression data from 290 individuals. Our model significantly outperforms Enformer in terms of correlation across individuals, and narrows the performance gap with an elastic net regression approach that uses just the genetic variants as predictors. Although simple regression models have their advantages in personalized prediction tasks, DL approaches based on foundation models pre-trained on diverse genomes have unique strengths in flexibility and interpretability. With further methodological and computational improvements with more training data, these models may eventually predict molecular phenotypes from DNA sequences with an accuracy surpassing that of regression-based approaches. Our work demonstrates the potential for large pre-trained AI/DL models to advance functional genomics.
Affiliation(s)
- Pratik Ramprasad
- Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, MN, USA
- Nidhi Pai
- Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, MN, USA
- Wei Pan
- Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, MN, USA
6
Koido M, Tomizuka K, Terao C. Fundamentals for predicting transcriptional regulations from DNA sequence patterns. J Hum Genet 2024; 69:499-504. [PMID: 38730006 PMCID: PMC11422166 DOI: 10.1038/s10038-024-01256-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2024] [Revised: 04/10/2024] [Accepted: 04/25/2024] [Indexed: 05/12/2024]
Abstract
Cell-type-specific regulatory elements, cataloged through extensive experiments and bioinformatics in large-scale consortiums, have enabled enrichment analyses of genetic associations that primarily utilize positional information of the regulatory elements. These analyses have identified cell types and pathways genetically associated with human complex traits. However, our understanding of detailed allelic effects on these elements' activities and on-off states remains incomplete, hampering the interpretation of human genetic study results. This review introduces machine learning methods to learn sequence-dependent transcriptional regulation mechanisms from DNA sequences for predicting such allelic effects (not associations). We provide a concise history of machine-learning-based approaches, the requirements, and the key computational processes, focusing on primers in machine learning. Convolution and self-attention, pivotal in modern deep-learning models, are explained through geometrical interpretations using dot products. This facilitates understanding of the concept and why these have been used for machine learning for DNA sequences. These will inspire further research in this genetics and genomics field.
Affiliation(s)
- Masaru Koido
- Laboratory of Complex Trait Genomics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan.
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
- Kohei Tomizuka
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Chikashi Terao
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
- Clinical Research Center, Shizuoka General Hospital, Shizuoka, Japan.
- The Department of Applied Genetics, The School of Pharmaceutical Sciences, University of Shizuoka, Shizuoka, Japan.
7
Naito T, Okada Y. Genotype imputation methods for whole and complex genomic regions utilizing deep learning technology. J Hum Genet 2024; 69:481-486. [PMID: 38225263 PMCID: PMC11422162 DOI: 10.1038/s10038-023-01213-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2023] [Revised: 11/23/2023] [Accepted: 12/04/2023] [Indexed: 01/17/2024]
Abstract
The imputation of unmeasured genotypes is essential in human genetic research, particularly in enhancing the power of genome-wide association studies and conducting subsequent fine-mapping. Recently, several deep learning-based genotype imputation methods for genome-wide variants with the capability of learning complex linkage disequilibrium patterns have been developed. Additionally, deep learning-based imputation has been applied to a distinct genomic region known as the major histocompatibility complex, referred to as HLA imputation. Despite their various advantages, the current deep learning-based genotype imputation methods do have certain limitations and have not yet become standard. These limitations include the modest accuracy improvement over statistical and conventional machine learning-based methods. However, their benefits include other aspects, such as their "reference-free" nature, which ensures complete privacy protection, and their higher computational efficiency. Furthermore, the continuing evolution of deep learning technologies is expected to contribute to further improvements in prediction accuracy and usability in the future.
Affiliation(s)
- Tatsuhiko Naito
- Department of Statistical Genetics, Osaka University Graduate School of Medicine, 2-2, Yamadaoka, Suita-shi, Osaka, 565-0871, Japan.
- Laboratory for Systems Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22, Suehiro-cho, Tsurumi-ku, Yokohama City, Kanagawa, 230-0045, Japan.
- Yukinori Okada
- Department of Statistical Genetics, Osaka University Graduate School of Medicine, 2-2, Yamadaoka, Suita-shi, Osaka, 565-0871, Japan
- Laboratory for Systems Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22, Suehiro-cho, Tsurumi-ku, Yokohama City, Kanagawa, 230-0045, Japan
- Department of Genome Informatics, Graduate School of Medicine, the University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan
- Integrated Frontier Research for Medical Science Division, Institute for Open and Transdisciplinary Research Initiatives, Osaka University, 2-2, Yamadaoka, Suita-shi, Osaka, 565-0871, Japan
- Premium Research Institute for Human Metaverse Medicine (WPI-PRIMe), Osaka University, 2-2, Yamadaoka, Suita-shi, Osaka, 565-0871, Japan
8
Shams A. Leveraging State-of-the-Art AI Algorithms in Personalized Oncology: From Transcriptomics to Treatment. Diagnostics (Basel) 2024; 14:2174. [PMID: 39410578 PMCID: PMC11476216 DOI: 10.3390/diagnostics14192174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2024] [Revised: 09/17/2024] [Accepted: 09/23/2024] [Indexed: 10/20/2024] [Open Access]
Abstract
BACKGROUND: Continuous breakthroughs in computational algorithms have positioned AI-based models as some of the most sophisticated technologies in the healthcare system. AI shows dynamic contributions in advancing various medical fields involving data interpretation and monitoring, imaging screening and diagnosis, and treatment response and survival prediction. Despite advances in clinical oncology, more effort must be employed to tailor therapeutic plans based on each patient's unique transcriptomic profile within the precision/personalized oncology frame. Furthermore, the standard analysis method is not compatible with the comprehensive deciphering of significant data streams, thus precluding the prediction of accurate treatment options. METHODOLOGY: We proposed a novel approach that includes obtaining different tumour tissues and preparing RNA samples for comprehensive transcriptomic interpretation using specifically trained, programmed, and optimized AI-based models for extracting large data volumes, refining, and analyzing them. Next, the transcriptomic results will be scanned against an expansive drug library to predict the response of each target to the tested drugs. The obtained target-drug combination/s will be then validated using in vitro and in vivo experimental models. Finally, the best treatment combination option/s will be introduced to the patient. We also provided a comprehensive review discussing AI models' recent innovations and implementations to aid in molecular diagnosis and treatment planning. RESULTS: The expected transcriptomic analysis generated by the AI-based algorithms will provide an inclusive genomic profile for each patient, containing statistical and bioinformatics analyses, identification of the dysregulated pathways, detection of the targeted genes, and recognition of molecular biomarkers. Subjecting these results to the prediction and pairing AI-based processes will result in statistical graphs presenting each target's likely response rate to various treatment options. Different in vitro and in vivo investigations will further validate the selection of the target drug/s pairs. CONCLUSIONS: Leveraging AI models will provide more rigorous manipulation of large-scale datasets on specific cancer care paths. Such a strategy would shape treatment according to each patient's demand, thus fortifying the avenue of personalized/precision medicine. Undoubtedly, this will assist in improving the oncology domain and alleviate the burden of clinicians in the coming decade.
Affiliation(s)
- Anwar Shams
- Department of Pharmacology, College of Medicine, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia; Tel.: +00966-548638099
- Research Center for Health Sciences, Deanship of Graduate Studies and Scientific Research, Taif University, Taif 26432, Saudi Arabia
- High Altitude Research Center, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia
|
9
|
Woo BJ, Moussavi-Baygi R, Karner H, Karimzadeh M, Yousefi H, Lee S, Garcia K, Joshi T, Yin K, Navickas A, Gilbert LA, Wang B, Asgharian H, Feng FY, Goodarzi H. Integrative identification of non-coding regulatory regions driving metastatic prostate cancer. Cell Rep 2024; 43:114764. [PMID: 39276353 PMCID: PMC11466230 DOI: 10.1016/j.celrep.2024.114764] [Received: 06/13/2023] [Revised: 07/08/2024] [Accepted: 08/29/2024] [Indexed: 09/17/2024] Open
Abstract
Large-scale sequencing efforts have been undertaken to understand the mutational landscape of the coding genome. However, the vast majority of variants occur within non-coding genomic regions. We designed an integrative computational and experimental framework to identify recurrently mutated non-coding regulatory regions that drive tumor progression. Applying this framework to sequencing data from a large prostate cancer patient cohort revealed a large set of candidate drivers. We used (1) in silico analyses, (2) massively parallel reporter assays, and (3) in vivo CRISPR interference screens to systematically validate metastatic castration-resistant prostate cancer (mCRPC) drivers. One identified enhancer region, GH22I030351, acts on a bidirectional promoter to simultaneously modulate expression of the U2-associated splicing factor SF3A1 and chromosomal protein CCDC157. SF3A1 and CCDC157 promote tumor growth in vivo. We nominated a number of transcription factors, notably SOX6, to regulate expression of SF3A1 and CCDC157. Our integrative approach enables the systematic detection of non-coding regulatory regions that drive human cancers.
Affiliation(s)
- Brian J Woo
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Arc Institute, Palo Alto, CA 94305, USA
| | - Ruhollah Moussavi-Baygi
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA
| | - Heather Karner
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Arc Institute, Palo Alto, CA 94305, USA
| | - Mehran Karimzadeh
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Vector Institute, Toronto, ON, Canada; Peter Munk Cardiac Centre, University Health Network, Toronto, ON, Canada; Arc Institute, Palo Alto, CA 94305, USA
| | - Hassan Yousefi
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Arc Institute, Palo Alto, CA 94305, USA
| | - Sean Lee
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Arc Institute, Palo Alto, CA 94305, USA
| | - Kristle Garcia
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA
| | - Tanvi Joshi
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA
| | - Keyi Yin
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA
| | - Albertas Navickas
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA
| | - Luke A Gilbert
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Arc Institute, Palo Alto, CA 94305, USA
| | - Bo Wang
- Vector Institute, Toronto, ON, Canada; Peter Munk Cardiac Centre, University Health Network, Toronto, ON, Canada
| | - Hosseinali Asgharian
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA.
| | - Felix Y Feng
- Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Department of Radiation Oncology, University of California, San Francisco, San Francisco, CA, USA.
| | - Hani Goodarzi
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Arc Institute, Palo Alto, CA 94305, USA; Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA.
|
10
|
Zhang Y, Wang Z, Ge F, Wang X, Zhang Y, Li S, Guo Y, Song J, Yu DJ. MLSNet: a deep learning model for predicting transcription factor binding sites. Brief Bioinform 2024; 25:bbae489. [PMID: 39350338 PMCID: PMC11442149 DOI: 10.1093/bib/bbae489] [Received: 07/22/2024] [Revised: 09/05/2024] [Accepted: 09/16/2024] [Indexed: 10/04/2024] Open
Abstract
Accurate prediction of transcription factor binding sites (TFBSs) is essential for understanding gene regulation mechanisms and the etiology of diseases. Despite numerous advances in deep learning for predicting TFBSs, their performance can still be enhanced. In this study, we propose MLSNet, a novel deep learning architecture designed specifically to predict TFBSs. MLSNet innovatively integrates multisize convolutional fusion with long short-term memory (LSTM) networks to effectively capture DNA-sparse higher-order sequence features. Further, MLSNet incorporates super token attention and Bi-LSTM to systematically extract and integrate higher-order DNA shape features. Experimental results on 165 ChIP-seq (chromatin immunoprecipitation followed by sequencing) datasets indicate that MLSNet consistently outperforms several state-of-the-art algorithms in the prediction of TFBSs. Specifically, MLSNet reports average metrics: 0.8306 for ACC, 0.8992 for AUROC, and 0.9035 for AUPRC, surpassing the second-best methods by 1.82%, 1.68%, and 1.54%, respectively. This research delineates the effectiveness of combining multi-size convolutional layers with LSTM and DNA shape-based features in enhancing predictive accuracy. Moreover, this study comprehensively assesses the variability in model performance across different cell lines and transcription factors. The source code of MLSNet is available at https://github.com/minghaidea/MLSNet.
Affiliation(s)
- Yuchuan Zhang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Zhikang Wang
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Wellington Rd, Clayton, Melbourne, VIC 3800, Australia
| | - Fang Ge
- State Key Laboratory of Organic Electronics and Information Displays & Institute of Advanced Materials (IAM), Nanjing University of Posts & Telecommunications, 9 Wenyuan, Nanjing, 210023, China
| | - Xiaoyu Wang
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Wellington Rd, Clayton, Melbourne, VIC 3800, Australia
| | - Yiwen Zhang
- Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, 553 St Kilda Road, Melbourne, VIC 3004, Australia
| | - Shanshan Li
- Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, 553 St Kilda Road, Melbourne, VIC 3004, Australia
| | - Yuming Guo
- Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, 553 St Kilda Road, Melbourne, VIC 3004, Australia
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Wellington Rd, Clayton, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, Wellington Rd, Clayton, Melbourne, VIC 3800, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
|
11
|
Sasse A, Chikina M, Mostafavi S. Quick and effective approximation of in silico saturation mutagenesis experiments with first-order Taylor expansion. iScience 2024; 27:110807. [PMID: 39286491 PMCID: PMC11404212 DOI: 10.1016/j.isci.2024.110807] [Received: 07/19/2024] [Revised: 08/08/2024] [Accepted: 08/20/2024] [Indexed: 09/19/2024] Open
Abstract
To understand the decision process of genomic sequence-to-function models, explainable AI algorithms determine the importance of each nucleotide in a given input sequence to the model's predictions and enable discovery of cis-regulatory motifs for gene regulation. The most commonly applied method is in silico saturation mutagenesis (ISM) because its per-nucleotide importance scores can be intuitively understood as the computational counterpart to in vivo saturation mutagenesis experiments. While ISM is highly interpretable, it is computationally challenging to perform for many sequences, and becomes prohibitive as the length of the input sequences and size of the model grows. Here, we use the first-order Taylor approximation to approximate ISM values from the model's gradient, which reduces its computation cost to a single forward pass for an input sequence. We show that the Taylor ISM (TISM) approximation is robust across different model ablations, random initializations, training parameters, and dataset sizes.
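Editorial illustration (not taken from the paper): the idea in this abstract can be sketched on a toy model. Exact ISM scores a substitution to base b at position i as f(x with position i set to b) minus f(x); a first-order Taylor expansion approximates that difference as the gradient entry for (i, b) minus the gradient entry for the reference base at i. The scalar model `toy_model`, its analytic gradient `toy_gradient`, the weights `w`, and the example sequence are all assumptions made for this sketch, standing in for a trained sequence-to-function network and automatic differentiation.

```python
import numpy as np


def one_hot(seq, alphabet="ACGT"):
    """Encode a DNA string as an (L, 4) one-hot matrix."""
    x = np.zeros((len(seq), len(alphabet)))
    for i, base in enumerate(seq):
        x[i, alphabet.index(base)] = 1.0
    return x


def toy_model(x, w):
    """Hypothetical stand-in for a trained model: scalar output from one-hot input."""
    return np.tanh(np.sum(w * x))


def toy_gradient(x, w):
    """Analytic gradient of toy_model with respect to every entry of x."""
    s = np.sum(w * x)
    return (1.0 - np.tanh(s) ** 2) * w


def exact_ism(x, w):
    """Exact ISM: one model evaluation per (position, base) substitution."""
    base_output = toy_model(x, w)
    ism = np.zeros_like(x)
    for i in range(x.shape[0]):
        for b in range(x.shape[1]):
            mutant = x.copy()
            mutant[i, :] = 0.0
            mutant[i, b] = 1.0
            ism[i, b] = toy_model(mutant, w) - base_output
    return ism


def taylor_ism(x, w):
    """First-order approximation: grad[i, b] - grad[i, reference base at i]."""
    grad = toy_gradient(x, w)
    ref_grad = np.sum(grad * x, axis=1, keepdims=True)
    return grad - ref_grad


rng = np.random.default_rng(0)
x = one_hot("ACGTTGCA")
w = 0.01 * rng.standard_normal(x.shape)  # small weights keep the model near-linear

ism = exact_ism(x, w)
tism = taylor_ism(x, w)
max_abs_error = np.max(np.abs(ism - tism))
```

In this sketch the exact version needs one model evaluation per substitution (L x 4 in total), while the approximation needs only a single gradient of the reference sequence. The two agree closely here because the toy model is nearly linear around x; how well the approximation holds for a real network is what the paper evaluates.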
Affiliation(s)
- Alexander Sasse
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA
| | - Maria Chikina
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 16354, USA
| | - Sara Mostafavi
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA
- Canadian Institute for Advanced Research, Toronto, ON MG5 1ZB, Canada
|
12
|
Ghafoor H, Asim MN, Ibrahim MA, Dengel A. ProSol-multi: Protein solubility prediction via amino acids multi-level correlation and discriminative distribution. Heliyon 2024; 10:e36041. [PMID: 39281576 PMCID: PMC11401092 DOI: 10.1016/j.heliyon.2024.e36041] [Received: 05/16/2024] [Revised: 08/01/2024] [Accepted: 08/08/2024] [Indexed: 09/18/2024] Open
Abstract
Protein solubility prediction is useful for the careful selection of highly effective candidate proteins for drug development. In recombinant protein synthesis, solubility prediction is valuable for optimizing key protein characteristics, including stability, functionality, and ease of purification. It contains valuable information about potential biomarkers or therapeutic targets and helps in early forecasting of neurodegenerative diseases, cancer, and cardiovascular disorders. Traditional wet-lab experimental approaches to determining protein solubility are error-prone, time-consuming, and costly. Researchers have harnessed Artificial Intelligence approaches to replace experimental approaches with computational predictors. These predictors infer the solubility of proteins by analyzing amino acid distributions in raw protein sequences. There is still considerable room for the development of robust computational predictors because existing predictors fail to extract a comprehensive discriminative distribution of amino acids. To more precisely discriminate soluble proteins from insoluble proteins, this paper presents the ProSol-Multi predictor, which makes use of a novel MLCDE encoder and a Random Forest classifier. The MLCDE encoder transforms protein sequences into informative statistical vectors by capturing the multi-level correlation and discriminative distribution of amino acids within raw protein sequences. The performance of the proposed encoder is evaluated against 56 existing protein sequence encoding methods on a widely used protein solubility prediction benchmark dataset under two different experimental settings, namely intrinsic and extrinsic. Intrinsic evaluation reveals that, among all sequence encoders, the proposed MLCDE encoder manages to generate non-overlapping clusters of soluble and insoluble classes. In extrinsic evaluation, 10 machine learning classifiers achieve better performance with the proposed MLCDE encoder than with the 56 existing protein sequence encoders. Moreover, across 4 public benchmark datasets, the proposed ProSol-Multi predictor outperforms 20 existing predictors by an average of 3% in accuracy and 2% in MCC and AU-ROC. The ProSol-Multi interactive web application is available at https://sds_genetic_analysis.opendfki.de/ProSol-Multi.
Affiliation(s)
- Hina Ghafoor
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
| | - Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
| | - Muhammad Ali Ibrahim
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
| | - Andreas Dengel
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
|
13
|
Wang Z, Peng Y, Li J, Li J, Yuan H, Yang S, Ding X, Xie A, Zhang J, Wang S, Li K, Shi J, Xing G, Shi W, Yan J, Liu J. DeepCBA: A deep learning framework for gene expression prediction in maize based on DNA sequences and chromatin interactions. Plant Communications 2024; 5:100985. [PMID: 38859587 DOI: 10.1016/j.xplc.2024.100985] [Received: 03/28/2024] [Revised: 05/25/2024] [Accepted: 06/05/2024] [Indexed: 06/12/2024]
Abstract
Chromatin interactions create spatial proximity between distal regulatory elements and target genes in the genome, which has an important impact on gene expression, transcriptional regulation, and phenotypic traits. To date, several methods have been developed for predicting gene expression. However, existing methods do not take into consideration the effect of chromatin interactions on target gene expression, thus potentially reducing the accuracy of gene expression prediction and mining of important regulatory elements. In this study, we developed a highly accurate deep learning-based gene expression prediction model (DeepCBA) based on maize chromatin interaction data. Compared with existing models, DeepCBA exhibits higher accuracy in expression classification and expression value prediction. The average Pearson correlation coefficients (PCCs) for predicting gene expression using gene promoter proximal interactions, proximal-distal interactions, and both proximal and distal interactions were 0.818, 0.625, and 0.929, respectively, representing an increase of 0.357, 0.16, and 0.469 over the PCCs obtained with traditional methods that use only gene proximal sequences. Some important motifs were identified through DeepCBA; they were enriched in open chromatin regions and expression quantitative trait loci and showed clear tissue specificity. Importantly, experimental results for the maize flowering-related gene ZmRap2.7 and the tillering-related gene ZmTb1 demonstrated the feasibility of DeepCBA for exploration of regulatory elements that affect gene expression. Moreover, promoter editing and verification of two reported genes (ZmCLE7 and ZmVTE4) demonstrated the utility of DeepCBA for the precise design of gene expression and even for future intelligent breeding. DeepCBA is available at http://www.deepcba.com/ or http://124.220.197.196/.
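Editorial illustration (not taken from the paper): the baseline PCCs implied by the figures in this abstract can be backed out with simple subtraction, assuming the reported increases are absolute differences in PCC rather than relative gains. The dictionary keys are shorthand labels chosen for this sketch.

```python
# (reported DeepCBA PCC, reported increase over proximal-sequence-only methods)
reported = {
    "proximal": (0.818, 0.357),
    "proximal-distal": (0.625, 0.16),
    "proximal+distal": (0.929, 0.469),
}

# Implied baseline PCC = DeepCBA PCC minus the stated increase,
# under the assumption that the increase is an absolute difference.
implied_baseline = {
    setting: round(pcc - gain, 3) for setting, (pcc, gain) in reported.items()
}
```

Under that assumption the three implied baselines all fall between 0.46 and 0.465, which is consistent with a single sequence-only baseline being compared against each interaction setting.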
Affiliation(s)
- Zhenye Wang
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - Yong Peng
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Hongshan Laboratory, Wuhan 430070, China
| | - Jie Li
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - Jiying Li
- Microsoft Corporation, Redmond, WA 98052, USA
| | - Hao Yuan
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - Shangpo Yang
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - Xinru Ding
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - Ao Xie
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - Jiangling Zhang
- College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - Shouzhe Wang
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Hongshan Laboratory, Wuhan 430070, China; WIMI Biotechnology Co., Ltd., Changzhou 213000, China
| | - Keqin Li
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - Jiaqi Shi
- College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - Guangjie Xing
- College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - Weihan Shi
- College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - Jianbing Yan
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Hongshan Laboratory, Wuhan 430070, China
| | - Jianxiao Liu
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China; College of Informatics, Huazhong Agricultural University, Wuhan 430070, China; Hubei Hongshan Laboratory, Wuhan 430070, China.
|
14
|
Barozzi I, Slaven N, Canale E, Lopes R, Amorim Monteiro Barbosa I, Bleu M, Ivanoiu D, Pacini C, Mensa’ E, Chambers A, Bravaccini S, Ravaioli S, Győrffy B, Dieci MV, Pruneri G, Galli GG, Magnani L. A Functional Survey of the Regulatory Landscape of Estrogen Receptor-Positive Breast Cancer Evolution. Cancer Discov 2024; 14:1612-1630. [PMID: 38753319 PMCID: PMC11372371 DOI: 10.1158/2159-8290.cd-23-1157] [Received: 10/04/2023] [Revised: 03/12/2024] [Accepted: 05/14/2024] [Indexed: 09/05/2024]
Abstract
Only a handful of somatic alterations have been linked to endocrine therapy resistance in hormone-dependent breast cancer, potentially explaining ∼40% of relapses. Whether other mechanisms underlie the evolution of hormone-dependent breast cancer under adjuvant therapy is currently unknown. In this work, we employ functional genomics to dissect the contribution of cis-regulatory elements (CRE) to cancer evolution by focusing on 12 megabases of noncoding DNA, including clonal enhancers, gene promoters, and boundaries of topologically associating domains. Parallel epigenetic perturbation (CRISPRi) in vitro reveals context-dependent roles for many of these CREs, with a specific impact on dormancy entrance and endocrine therapy resistance. Profiling of CRE somatic alterations in a unique, longitudinal cohort of patients treated with endocrine therapies identifies a limited set of noncoding changes potentially involved in therapy resistance. Overall, our data uncover how endocrine therapies trigger the emergence of transient features which could ultimately be exploited to hinder the adaptive process. Significance: This study shows that cells adapting to endocrine therapies undergo changes in the usage of regulatory regions. Dormant cells are less vulnerable to regulatory perturbation but gain transient dependencies which can be exploited to decrease the formation of dormant persisters.
Affiliation(s)
- Iros Barozzi
- Center for Cancer Research, Medical University of Vienna, Vienna, Austria.
| | - Neil Slaven
- Department of Surgery and Cancer, Imperial College London, London, United Kingdom.
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, California.
| | - Eleonora Canale
- Department of Surgery and Cancer, Imperial College London, London, United Kingdom.
| | - Rui Lopes
- Disease area Oncology, Novartis Biomedical Research, Basel, Switzerland.
| | | | - Melusine Bleu
- Disease area Oncology, Novartis Biomedical Research, Basel, Switzerland.
| | - Diana Ivanoiu
- Department of Surgery and Cancer, Imperial College London, London, United Kingdom.
| | - Claudia Pacini
- Department of Surgery and Cancer, Imperial College London, London, United Kingdom.
| | - Emanuela Mensa’
- Department of Surgery and Cancer, Imperial College London, London, United Kingdom.
| | - Alfie Chambers
- Department of Surgery and Cancer, Imperial College London, London, United Kingdom.
| | - Sara Bravaccini
- IRCCS Istituto Romagnolo per lo Studio dei Tumori (IRST) “Dino Amadori”, Meldola, Italy.
- Faculty of Medicine and Surgery, “Kore” University of Enna, Enna, Italy.
| | - Sara Ravaioli
- IRCCS Istituto Romagnolo per lo Studio dei Tumori (IRST) “Dino Amadori”, Meldola, Italy.
| | - Balázs Győrffy
- Department of Bioinformatics, Semmelweis University, Budapest, Hungary.
- Department of Biophysics, Medical School, University of Pecs, Pecs, Hungary.
- Cancer Biomarker Research Group, Institute of Molecular Life Sciences, Research Centre for Natural Sciences, Budapest, Hungary.
| | - Maria Vittoria Dieci
- Oncology 2, Veneto Institute of Oncology IOV-IRCCS, Padova, Italy.
- Department of Surgery, Oncology and Gastroenterology, University of Padova, Padova, Italy.
| | - Giancarlo Pruneri
- Department of Diagnostic Innovation, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy.
- Department of Oncology and Hemato-Oncology, University of Milan, Milan, Italy.
| | | | - Luca Magnani
- Department of Surgery and Cancer, Imperial College London, London, United Kingdom.
- The Breast Cancer Now Toby Robins Research Centre, The Institute of Cancer Research, London, United Kingdom.
|
15
|
Lin YJ, Menon AS, Hu Z, Brenner SE. Variant Impact Predictor database (VIPdb), version 2: trends from three decades of genetic variant impact predictors. Hum Genomics 2024; 18:90. [PMID: 39198917 PMCID: PMC11360829 DOI: 10.1186/s40246-024-00663-z] [Received: 06/22/2024] [Accepted: 08/19/2024] [Indexed: 09/01/2024] Open
Abstract
BACKGROUND Variant interpretation is essential for identifying patients' disease-causing genetic variants amongst the millions detected in their genomes. Hundreds of Variant Impact Predictors (VIPs), also known as Variant Effect Predictors (VEPs), have been developed for this purpose, with a variety of methodologies and goals. To facilitate the exploration of available VIP options, we have created the Variant Impact Predictor database (VIPdb). RESULTS The Variant Impact Predictor database (VIPdb) version 2 presents a collection of VIPs developed over the past three decades, summarizing their characteristics, ClinGen calibrated scores, CAGI assessment results, publication details, access information, and citation patterns. We previously summarized 217 VIPs and their features in VIPdb in 2019. Building upon this foundation, we identified and categorized an additional 190 VIPs, resulting in a total of 407 VIPs in VIPdb version 2. The majority of the VIPs have the capacity to predict the impacts of single nucleotide variants and nonsynonymous variants. More VIPs tailored to predict the impacts of insertions and deletions have been developed since the 2010s. In contrast, relatively few VIPs are dedicated to the prediction of splicing, structural, synonymous, and regulatory variants. The increasing rate of citations to VIPs reflects the ongoing growth in their use, and the evolving trends in citations reveal development in the field and individual methods. CONCLUSIONS VIPdb version 2 summarizes 407 VIPs and their features, potentially facilitating VIP exploration for various variant interpretation applications. VIPdb is available at https://genomeinterpretation.org/vipdb.
Affiliation(s)
- Yu-Jen Lin
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
- Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
| | - Arul S Menon
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
- College of Computing, Data Science, and Society, University of California, Berkeley, CA, 94720, USA
| | - Zhiqiang Hu
- Department of Plant and Microbial Biology, University of California, 111 Koshland Hall #3102, Berkeley, CA, 94720-3102, USA
- Illumina, Foster City, CA, 94404, USA
| | - Steven E Brenner
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA.
- Center for Computational Biology, University of California, Berkeley, CA, 94720, USA.
- College of Computing, Data Science, and Society, University of California, Berkeley, CA, 94720, USA.
- Department of Plant and Microbial Biology, University of California, 111 Koshland Hall #3102, Berkeley, CA, 94720-3102, USA.
|
16
|
Read DF, Booth GT, Daza RM, Jackson DL, Gladden RG, Srivatsan SR, Ewing B, Franks JM, Spurrell CH, Gomes AR, O'Day D, Gogate AA, Martin BK, Larson H, Pfleger C, Starita L, Lin Y, Shendure J, Lin S, Trapnell C. Single-cell analysis of chromatin and expression reveals age- and sex-associated alterations in the human heart. Commun Biol 2024; 7:1052. [PMID: 39187646 PMCID: PMC11347658 DOI: 10.1038/s42003-024-06582-y] [Received: 05/30/2022] [Accepted: 07/11/2024] [Indexed: 08/28/2024] Open
Abstract
Sex differences and age-related changes in the human heart at the tissue, cell, and molecular level have been well-documented and many may be relevant for cardiovascular disease. However, how molecular programs within individual cell types vary across individuals by age and sex remains poorly characterized. To better understand this variation, we performed single-nucleus combinatorial indexing (sci) ATAC- and RNA-Seq in human heart samples from nine donors. We identify hundreds of differentially expressed genes by age and sex and find epigenetic signatures of variation in ATAC-Seq data in this discovery cohort. We then scale up our single-cell RNA-Seq analysis by combining our data with five recently published single nucleus RNA-Seq datasets of healthy adult hearts. We find variation such as metabolic alterations by sex and immune changes by age in differential expression tests, as well as alterations in abundance of cardiomyocytes by sex and neurons with age. In addition, we compare our adult-derived ATAC-Seq profiles to analogous fetal cell types to identify putative developmental-stage-specific regulatory factors. Finally, we train predictive models of cell-type-specific RNA expression levels utilizing ATAC-Seq profiles to link distal regulatory sequences to promoters, quantifying the predictive value of a simple TF-to-expression regulatory grammar and identifying cell-type-specific TFs. Our analysis represents the largest single-cell analysis of cardiac variation by age and sex to date and provides a resource for further study of healthy cardiac variation and transcriptional regulation at single-cell resolution.
Affiliation(s)
- David F Read
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Gregory T Booth
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Riza M Daza
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Dana L Jackson
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Rula Green Gladden
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Sanjay R Srivatsan
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Brent Ewing
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Jennifer M Franks
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Diana O'Day
- Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
| | - Aishwarya A Gogate
- Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
- Seattle Children's Research Institute, Seattle, WA, USA
| | - Beth K Martin
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Haleigh Larson
- Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
| | - Christian Pfleger
- University of Washington School of Medicine, Division of Cardiology, Seattle, WA, USA
| | - Lea Starita
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
| | - Yiing Lin
- Department of Surgery, Washington University, St Louis, MO, USA
| | - Jay Shendure
- Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
- Seattle Children's Research Institute, Seattle, WA, USA
- Howard Hughes Medical Institute, Seattle, WA, USA
- Allen Discovery Center for Cell Lineage Tracing, Seattle, WA, USA
| | - Shin Lin
- University of Washington School of Medicine, Division of Cardiology, Seattle, WA, USA
| | - Cole Trapnell
- Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
|
17
|
Chauhan V, Baptista ISC, Arsh AM, Jagadeesan R, Dash S, Ribeiro AS. Transcription Attenuation in Synthetic Promoters in Nonoverlapping Tandem Formation. Biochemistry 2024; 63:2009-2022. [PMID: 38997112 PMCID: PMC11339919 DOI: 10.1021/acs.biochem.4c00012] [Received: 01/06/2024] [Revised: 07/01/2024] [Accepted: 07/01/2024] [Indexed: 07/14/2024]
Abstract
Closely spaced promoters are ubiquitous in prokaryotic and eukaryotic genomes. How their structure and dynamics relate remains unclear, particularly for tandem formations. To study their transcriptional interference, we engineered two pairs and one trio of synthetic promoters in nonoverlapping, tandem formation, in single-copy plasmids transformed into Escherichia coli cells. From in vivo measurements, we found that these promoters in tandem formation can have attenuated transcription rates. The attenuation strength can be widely fine-tuned by the promoters' positioning, natural regulatory mechanisms, and other factors, including the antibiotic rifampicin, which is known to hamper RNAP promoter escape. From this, and supported by in silico models, we concluded that the attenuation in these constructs emerges from premature terminations generated by collisions between RNAPs elongating from upstream promoters and RNAPs occupying downstream promoters. Moreover, we found that these collisions can cause one or both RNAPs to fall off. Finally, the broad spectrum of possible, externally regulated, attenuation strengths observed in our synthetic tandem promoters suggests that they could become useful as externally controllable regulators of future synthetic circuits.
Affiliation(s)
- Vatsala Chauhan
- Faculty of Medicine and Health Technology, Tampere University, 33520 Tampere, Finland
- Department of Cell and Molecular Biology (ICM), Uppsala University, 751 24 Uppsala, Sweden
| | - Ines S. C. Baptista
- Faculty of Medicine and Health Technology, Tampere University, 33520 Tampere, Finland
| | - Amir M. Arsh
- Faculty of Medicine and Health Technology, Tampere University, 33520 Tampere, Finland
| | - Rahul Jagadeesan
- Faculty of Medicine and Health Technology, Tampere University, 33520 Tampere, Finland
| | - Suchintak Dash
- Faculty of Medicine and Health Technology, Tampere University, 33520 Tampere, Finland
| | - Andre S. Ribeiro
- Faculty of Medicine and Health Technology, Tampere University, 33520 Tampere, Finland
|
18
|
Yuan H, Mancuso CA, Johnson K, Braasch I, Krishnan A. Computational strategies for cross-species knowledge transfer and translational biomedicine. arXiv 2024:arXiv:2408.08503v1. [PMID: 39184546 PMCID: PMC11343225] [Indexed: 08/27/2024]
Abstract
Research organisms provide invaluable insights into human biology and diseases, serving as essential tools for functional experiments, disease modeling, and drug testing. However, evolutionary divergence between humans and research organisms hinders effective knowledge transfer across species. Here, we review state-of-the-art methods for computationally transferring knowledge across species, primarily focusing on methods that utilize transcriptome data and/or molecular networks. We introduce the term "agnology" to describe the functional equivalence of molecular components regardless of evolutionary origin, as this concept is becoming pervasive in integrative data-driven models where the role of evolutionary origin can become unclear. Our review addresses four key areas of information and knowledge transfer across species: (1) transferring disease and gene annotation knowledge, (2) identifying agnologous molecular components, (3) inferring equivalent perturbed genes or gene sets, and (4) identifying agnologous cell types. We conclude with an outlook on future directions and several key challenges that remain in cross-species knowledge transfer.
Affiliation(s)
- Hao Yuan
- Genetics and Genome Science Program; Ecology, Evolution, and Behavior Program, Michigan State University
| | - Christopher A. Mancuso
- Department of Biostatistics & Informatics, University of Colorado Anschutz Medical Campus
| | - Kayla Johnson
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus
| | - Ingo Braasch
- Department of Integrative Biology; Genetics and Genome Science Program; Ecology, Evolution, and Behavior Program, Michigan State University
| | - Arjun Krishnan
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus
|
19
|
Zhou H, Gelernter J. Human genetics and epigenetics of alcohol use disorder. J Clin Invest 2024; 134:e172885. [PMID: 39145449 PMCID: PMC11324314 DOI: 10.1172/jci172885] [Indexed: 08/16/2024] Open
Abstract
Alcohol use disorder (AUD) is a prominent contributor to global morbidity and mortality. Its complex etiology involves genetics, epigenetics, and environmental factors. We review progress in understanding the genetics and epigenetics of AUD, summarizing the key findings. Advancements in technology over the decades have elevated research from early candidate gene studies to present-day genome-wide scans, unveiling numerous genetic and epigenetic risk factors for AUD. The latest GWAS on more than one million participants identified more than 100 genetic variants, and the largest epigenome-wide association studies (EWAS) in blood and brain samples have revealed tissue-specific epigenetic changes. Downstream analyses revealed enriched pathways, genetic correlations with other traits, transcriptome-wide association in brain tissues, and drug-gene interactions for AUD. We also discuss limitations and future directions, including increasing the power of GWAS and EWAS studies as well as expanding the diversity of populations included in these analyses. Larger samples, novel technologies, and analytic approaches are essential; these include whole-genome sequencing, multiomics, single-cell sequencing, spatial transcriptomics, deep-learning prediction of variant function, and integrated methods for disease risk prediction.
Affiliation(s)
- Hang Zhou
- Department of Psychiatry, Yale School of Medicine, New Haven, Connecticut, USA
- Veterans Affairs Connecticut Healthcare System, West Haven, Connecticut, USA
- Department of Biomedical Informatics and Data Science
- Center for Brain and Mind Health
| | - Joel Gelernter
- Department of Psychiatry, Yale School of Medicine, New Haven, Connecticut, USA
- Veterans Affairs Connecticut Healthcare System, West Haven, Connecticut, USA
- Department of Genetics and Department of Neuroscience, Yale School of Medicine, New Haven, Connecticut, USA
|
20
|
Shin T, Song JHT, Kosicki M, Kenny C, Beck SG, Kelley L, Antony I, Qian X, Bonacina J, Papandile F, Gonzalez D, Scotellaro J, Bushinsky EM, Andersen RE, Maury E, Pennacchio LA, Doan RN, Walsh CA. Rare variation in non-coding regions with evolutionary signatures contributes to autism spectrum disorder risk. Cell Genomics 2024; 4:100609. [PMID: 39019033 PMCID: PMC11406188 DOI: 10.1016/j.xgen.2024.100609] [Received: 09/20/2023] [Revised: 03/11/2024] [Accepted: 06/24/2024] [Indexed: 07/19/2024]
Abstract
Little is known about the role of non-coding regions in the etiology of autism spectrum disorder (ASD). We examined three classes of non-coding regions: human accelerated regions (HARs), which show signatures of positive selection in humans; experimentally validated neural VISTA enhancers (VEs); and conserved regions predicted to act as neural enhancers (CNEs). Targeted and whole-genome analysis of >16,600 samples and >4,900 ASD probands revealed that likely recessive, rare, inherited variants in HARs, VEs, and CNEs substantially contribute to ASD risk in probands whose parents share ancestry, which enriches for recessive contributions, but modestly contribute, if at all, in simplex family structures. We identified multiple patient variants in HARs near IL1RAPL1 and in VEs near OTX1 and SIM1 and showed that they change enhancer activity. Our results implicate both human-evolved and evolutionarily conserved non-coding regions in ASD risk and suggest potential mechanisms of how regulatory changes can modulate social behavior.
Affiliation(s)
- Taehwan Shin
- Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA 02115, USA; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA; Allen Discovery Center for Human Brain Evolution, Boston, MA 02115, USA; Department of Neurology, Harvard Medical School, Boston, MA 02115, USA; Howard Hughes Medical Institute, Boston Children's Hospital, Boston, MA 02115, USA
| | - Janet H T Song
- Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA 02115, USA; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA; Allen Discovery Center for Human Brain Evolution, Boston, MA 02115, USA; Department of Neurology, Harvard Medical School, Boston, MA 02115, USA; Howard Hughes Medical Institute, Boston Children's Hospital, Boston, MA 02115, USA
| | - Michael Kosicki
- Environmental Genomics & System Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Connor Kenny
- Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA 02115, USA; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA; Allen Discovery Center for Human Brain Evolution, Boston, MA 02115, USA; Department of Neurology, Harvard Medical School, Boston, MA 02115, USA; Howard Hughes Medical Institute, Boston Children's Hospital, Boston, MA 02115, USA
| | - Samantha G Beck
- Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA 02115, USA; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA; Allen Discovery Center for Human Brain Evolution, Boston, MA 02115, USA; Department of Neurology, Harvard Medical School, Boston, MA 02115, USA; Howard Hughes Medical Institute, Boston Children's Hospital, Boston, MA 02115, USA
| | - Lily Kelley
- Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA 02115, USA; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA; Allen Discovery Center for Human Brain Evolution, Boston, MA 02115, USA
| | - Irene Antony
- Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA 02115, USA; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA; Allen Discovery Center for Human Brain Evolution, Boston, MA 02115, USA; Department of Neurology, Harvard Medical School, Boston, MA 02115, USA; Howard Hughes Medical Institute, Boston Children's Hospital, Boston, MA 02115, USA
| | - Xuyu Qian
- Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA 02115, USA; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA; Allen Discovery Center for Human Brain Evolution, Boston, MA 02115, USA; Department of Neurology, Harvard Medical School, Boston, MA 02115, USA; Howard Hughes Medical Institute, Boston Children's Hospital, Boston, MA 02115, USA
| | - Julieta Bonacina
- Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA 02115, USA; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA; Allen Discovery Center for Human Brain Evolution, Boston, MA 02115, USA
| | - Frances Papandile
- Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA 02115, USA; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA; Allen Discovery Center for Human Brain Evolution, Boston, MA 02115, USA; Department of Neurology, Harvard Medical School, Boston, MA 02115, USA; Howard Hughes Medical Institute, Boston Children's Hospital, Boston, MA 02115, USA
| | - Dilenny Gonzalez
- Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA 02115, USA; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA; Allen Discovery Center for Human Brain Evolution, Boston, MA 02115, USA; Department of Neurology, Harvard Medical School, Boston, MA 02115, USA; Howard Hughes Medical Institute, Boston Children's Hospital, Boston, MA 02115, USA
| | - Julia Scotellaro
- Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA 02115, USA; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA
| | - Evan M Bushinsky
- Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA 02115, USA; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA; Allen Discovery Center for Human Brain Evolution, Boston, MA 02115, USA; Department of Neurology, Harvard Medical School, Boston, MA 02115, USA; Howard Hughes Medical Institute, Boston Children's Hospital, Boston, MA 02115, USA
| | - Rebecca E Andersen
- Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA 02115, USA; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA; Allen Discovery Center for Human Brain Evolution, Boston, MA 02115, USA; Department of Neurology, Harvard Medical School, Boston, MA 02115, USA; Howard Hughes Medical Institute, Boston Children's Hospital, Boston, MA 02115, USA
| | - Eduardo Maury
- Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA 02115, USA; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA; Allen Discovery Center for Human Brain Evolution, Boston, MA 02115, USA; Department of Neurology, Harvard Medical School, Boston, MA 02115, USA; Howard Hughes Medical Institute, Boston Children's Hospital, Boston, MA 02115, USA
| | - Len A Pennacchio
- Environmental Genomics & System Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Ryan N Doan
- Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA 02115, USA; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA; Allen Discovery Center for Human Brain Evolution, Boston, MA 02115, USA
| | - Christopher A Walsh
- Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA 02115, USA; Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA; Allen Discovery Center for Human Brain Evolution, Boston, MA 02115, USA; Department of Neurology, Harvard Medical School, Boston, MA 02115, USA; Howard Hughes Medical Institute, Boston Children's Hospital, Boston, MA 02115, USA
|
21
|
Zheng D, Wang J, Persyn L, Liu Y, Montoya FU, Cenik C, Agarwal V. Predicting the translation efficiency of messenger RNA in mammalian cells. bioRxiv 2024:2024.08.11.607362. [PMID: 39149337 PMCID: PMC11326250 DOI: 10.1101/2024.08.11.607362] [Indexed: 08/17/2024]
Abstract
The degree to which translational control is specified by mRNA sequence is poorly understood in mammalian cells. Here, we constructed and leveraged a compendium of 3,819 ribosomal profiling datasets, distilling them into a transcriptome-wide atlas of translation efficiency (TE) measurements encompassing >140 human and mouse cell types. We subsequently developed RiboNN, a multitask deep convolutional neural network, and classic machine learning models to predict TEs in hundreds of cell types from sequence-encoded mRNA features, achieving state-of-the-art performance (r=0.79 in human and r=0.78 in mouse for mean TE across cell types). While the majority of earlier models solely considered 5' UTR sequence, RiboNN integrates contributions from the full-length mRNA sequence, learning that the 5' UTR, CDS, and 3' UTR respectively possess ~67%, 31%, and 2% per-nucleotide information density in the specification of mammalian TEs. Interpretation of RiboNN revealed that the spatial positioning of low-level di- and tri-nucleotide features (including codons) largely explains model performance, capturing mechanistic principles such as how ribosomal processivity and tRNA abundance control translational output. RiboNN is predictive of the translational behavior of base-modified therapeutic RNA, and can explain evolutionary selection pressures in human 5' UTRs. Finally, it detects a common language governing mRNA regulatory control and highlights the interconnectedness of mRNA translation, stability, and localization in mammalian organisms.
Affiliation(s)
- Dinghai Zheng
- mRNA Center of Excellence, Sanofi, Waltham, MA 02451, USA
| | - Jun Wang
- mRNA Center of Excellence, Sanofi, Waltham, MA 02451, USA
| | - Logan Persyn
- Department of Molecular Biosciences, University of Texas at Austin, Austin, TX 78712, USA
| | - Yue Liu
- Department of Molecular Biosciences, University of Texas at Austin, Austin, TX 78712, USA
| | - Can Cenik
- Department of Molecular Biosciences, University of Texas at Austin, Austin, TX 78712, USA
| | - Vikram Agarwal
- mRNA Center of Excellence, Sanofi, Waltham, MA 02451, USA
|
22
|
Xu L, Liu Y. Identification, Design, and Application of Noncoding Cis-Regulatory Elements. Biomolecules 2024; 14:945. [PMID: 39199333 PMCID: PMC11352686 DOI: 10.3390/biom14080945] [Received: 05/26/2024] [Revised: 07/25/2024] [Accepted: 07/30/2024] [Indexed: 09/01/2024] Open
Abstract
Cis-regulatory elements (CREs) play a pivotal role in orchestrating interactions with trans-regulatory factors such as transcription factors, RNA-binding proteins, and noncoding RNAs. These interactions are fundamental to the molecular architecture underpinning complex and diverse biological functions in living organisms, facilitating a myriad of sophisticated and dynamic processes. The rapid advancement in the identification and characterization of these regulatory elements has been marked by initiatives such as the Encyclopedia of DNA Elements (ENCODE) project, which represents a significant milestone in the field. Concurrently, the development of CRE detection technologies, exemplified by massively parallel reporter assays, has progressed at an impressive pace, providing powerful tools for CRE discovery. The exponential growth of multimodal functional genomic data has necessitated the application of advanced analytical methods. Deep learning algorithms, particularly large language models, have emerged as invaluable tools for deconstructing the intricate nucleotide sequences governing CRE function. These advancements facilitate precise predictions of CRE activity and enable the de novo design of CREs. A deeper understanding of CRE operational dynamics is crucial for harnessing their versatile regulatory properties. Such insights are instrumental in refining gene therapy techniques, enhancing the efficacy of selective breeding programs, pushing the boundaries of genetic innovation, and opening new possibilities in microbial synthetic biology.
Affiliation(s)
- Lingna Xu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
| | - Yuwen Liu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Kunpeng Institute of Modern Agriculture at Foshan, Chinese Academy of Agricultural Sciences, Foshan 528226, China
|
23
|
Lin J, Luo R, Pinello L. EPInformer: a scalable deep learning framework for gene expression prediction by integrating promoter-enhancer sequences with multimodal epigenomic data. bioRxiv 2024:2024.08.01.606099. [PMID: 39131276 PMCID: PMC11312614 DOI: 10.1101/2024.08.01.606099] [Indexed: 08/13/2024]
Abstract
Transcriptional regulation, critical for cellular differentiation and adaptation to environmental changes, involves coordinated interactions among DNA sequences, regulatory proteins, and chromatin architecture. Despite extensive data from consortia like ENCODE, understanding the dynamics of cis-regulatory elements (CREs) in gene expression remains challenging. Deep learning is a powerful tool for learning gene expression and epigenomic signals from DNA sequences, exhibiting superior performance compared to conventional machine learning approaches. However, even the most advanced deep learning-based methods may fall short in capturing the regulatory effects of distal elements such as enhancers, limiting their predictive accuracy. In addition, these methods may require significant resources to train or to adapt to newly generated data. To address these challenges, we present EPInformer, a scalable deep-learning framework for predicting gene expression by integrating promoter-enhancer interactions with their sequences, epigenomic signals, and chromatin contacts. Our model outperforms existing gene expression prediction models in rigorous cross-chromosome validation, accurately recapitulates enhancer-gene interactions validated by CRISPR perturbation experiments, and identifies crucial transcription factor motifs within regulatory sequences. EPInformer is available as open-source software at https://github.com/pinellolab/EPInformer.
Affiliation(s)
- Jiecong Lin
- Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Department of Pathology, Harvard Medical School, Boston, Massachusetts 02129, USA
- Department of Computer Science, The University of Hong Kong, Hong Kong, China
| | - Ruibang Luo
- Department of Computer Science, The University of Hong Kong, Hong Kong, China
| | - Luca Pinello
- Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Department of Pathology, Harvard Medical School, Boston, Massachusetts 02129, USA
|
24
|
Sasse A, Chikina M, Mostafavi S. Unlocking gene regulation with sequence-to-function models. Nat Methods 2024; 21:1374-1377. [PMID: 39122947 DOI: 10.1038/s41592-024-02331-5] [Indexed: 08/12/2024]
Affiliation(s)
- Alexander Sasse
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
| | - Maria Chikina
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
| | - Sara Mostafavi
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
|
25
|
Sokolova K, Chen KM, Hao Y, Zhou J, Troyanskaya OG. Deep Learning Sequence Models for Transcriptional Regulation. Annu Rev Genomics Hum Genet 2024; 25:105-122. [PMID: 38594933 DOI: 10.1146/annurev-genom-021623-024727] [Indexed: 04/11/2024]
Abstract
Deciphering the regulatory code of gene expression and interpreting the transcriptional effects of genome variation are critical challenges in human genetics. Modern experimental technologies have resulted in an abundance of data, enabling the development of sequence-based deep learning models that link patterns embedded in DNA to the biochemical and regulatory properties contributing to transcriptional regulation, including modeling epigenetic marks, 3D genome organization, and gene expression, with tissue and cell-type specificity. Such methods can predict the functional consequences of any noncoding variant in the human genome, even rare or never-before-observed variants, and systematically characterize their consequences beyond what is tractable from experiments or quantitative genetics studies alone. Recently, the development and application of interpretability approaches have led to the identification of key sequence patterns contributing to the predicted tasks, providing insights into the underlying biological mechanisms learned and revealing opportunities for improvement in future models.
Affiliation(s)
- Ksenia Sokolova
- Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
| | - Kathleen M Chen
- Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
| | - Yun Hao
- Flatiron Institute, Simons Foundation, New York, NY, USA
| | - Jian Zhou
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Olga G Troyanskaya
- Princeton Precision Health, Princeton University, Princeton, New Jersey, USA
- Flatiron Institute, Simons Foundation, New York, NY, USA
- Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
|
26
|
Kathail P, Shuai RW, Chung R, Ye CJ, Loeb GB, Ioannidis NM. Current genomic deep learning models display decreased performance in cell type-specific accessible regions. Genome Biol 2024; 25:202. [PMID: 39090688 PMCID: PMC11293111 DOI: 10.1186/s13059-024-03335-2] [Received: 09/15/2023] [Accepted: 07/10/2024] [Indexed: 08/04/2024] Open
Abstract
BACKGROUND: A number of deep learning models have been developed to predict epigenetic features such as chromatin accessibility from DNA sequence. Model evaluations commonly report performance genome-wide; however, cis regulatory elements (CREs), which play critical roles in gene regulation, make up only a small fraction of the genome. Furthermore, cell type-specific CREs contain a large proportion of complex disease heritability. RESULTS: We evaluate genomic deep learning models in chromatin accessibility regions with varying degrees of cell type specificity. We assess two modeling directions in the field: general purpose models trained across thousands of outputs (cell types and epigenetic marks) and models tailored to specific tissues and tasks. We find that the accuracy of genomic deep learning models, including two state-of-the-art general purpose models (Enformer and Sei), varies across the genome and is reduced in cell type-specific accessible regions. Using accessibility models trained on cell types from specific tissues, we find that increasing model capacity to learn cell type-specific regulatory syntax, through single-task learning or high capacity multi-task models, can improve performance in cell type-specific accessible regions. We also observe that improving reference sequence predictions does not consistently improve variant effect predictions, indicating that novel strategies are needed to improve performance on variants. CONCLUSIONS: Our results provide a new perspective on the performance of genomic deep learning models, showing that performance varies across the genome and is particularly reduced in cell type-specific accessible regions. We also identify strategies to maximize performance in cell type-specific accessible regions.
Affiliation(s)
- Pooja Kathail
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
| | - Richard W Shuai
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
| | - Ryan Chung
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
| | - Chun Jimmie Ye
- Division of Rheumatology, Department of Medicine, University of California, San Francisco, CA, USA
- Institute for Human Genetics, University of California, San Francisco, CA, USA
- Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA
- Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, USA
- Parker Institute for Cancer Immunotherapy, San Francisco, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
| | - Gabriel B Loeb
- Division of Nephrology, Department of Medicine, University of California, San Francisco, CA, USA
- Cardiovascular Research Institute, University of California, San Francisco, CA, USA
| | - Nilah M Ioannidis
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
| |
|
27
|
Wang X, Li F, Zhang Y, Imoto S, Shen HH, Li S, Guo Y, Yang J, Song J. Deep learning approaches for non-coding genetic variant effect prediction: current progress and future prospects. Brief Bioinform 2024; 25:bbae446. [PMID: 39276327 PMCID: PMC11401448 DOI: 10.1093/bib/bbae446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Revised: 08/08/2024] [Accepted: 08/27/2024] [Indexed: 09/16/2024] Open
Abstract
Recent advancements in high-throughput sequencing technologies have significantly enhanced our ability to unravel the intricacies of gene regulatory processes. A critical challenge in this endeavor is the identification of variant effects, a key factor in comprehending the mechanisms underlying gene regulation. Non-coding variants, constituting over 90% of all variants, have garnered increasing attention in recent years. The exploration of gene variant impacts and regulatory mechanisms has spurred the development of various deep learning approaches, providing new insights into the global regulatory landscape through the analysis of extensive genetic data. Here, we provide a comprehensive overview of the development of non-coding variant models based on bulk and single-cell sequencing data, together with their model-based interpretation and downstream tasks. This review delineates the popular sequencing technologies for epigenetic profiling and deep learning approaches for discerning the effects of non-coding variants. Additionally, we summarize the limitations of current approaches in variant effect prediction research and outline opportunities for improvement. We anticipate that our study will offer a practical and useful guide for the bioinformatic community to further advance the unraveling of genetic variant effects.
Affiliation(s)
- Xiaoyu Wang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Fuyi Li
- South Australian immunoGENomics Cancer Institute (SAiGENCI), Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, SA 5005, Australia
| | - Yiwen Zhang
- School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
| | - Seiya Imoto
- Genome Center, Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo 108-8639, Japan
- Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, Bunkyo-ku, Tokyo 113-8657, Japan
| | - Hsin-Hui Shen
- Department of Materials Science and Engineering, Faculty of Engineering, Monash University, Clayton, VIC 3800, Australia
| | - Shanshan Li
- School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
| | - Yuming Guo
- School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
| | - Jian Yang
- School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310030, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang 310024, China
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| |
|
28
|
Labani M, Beheshti A, O’Brien TA. GENet: A Graph-Based Model Leveraging Histone Marks and Transcription Factors for Enhanced Gene Expression Prediction. Genes (Basel) 2024; 15:938. [PMID: 39062717 PMCID: PMC11275947 DOI: 10.3390/genes15070938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2024] [Revised: 07/16/2024] [Accepted: 07/17/2024] [Indexed: 07/28/2024] Open
Abstract
Understanding the regulatory mechanisms of gene expression is a crucial objective in genomics. Although the DNA sequence near the transcription start site (TSS) offers valuable insights, recent methods suggest that analyzing only the surrounding DNA may not suffice to accurately predict gene expression levels. We developed GENet (Gene Expression Network from Histone and Transcription Factor Integration), a novel approach that integrates essential regulatory signals from transcription factors and histone modifications into a graph-based model. GENet extends beyond simple DNA sequence analysis by incorporating additional layers of genetic control, which are vital for determining gene expression. Our method markedly enhances the prediction of mRNA levels compared to previous models that depend solely on DNA sequence data. The results underscore the significance of including comprehensive regulatory information in gene expression studies. GENet emerges as a promising tool for researchers, with potential applications extending from fundamental biological research to the development of medical therapies.
Affiliation(s)
- Mahdieh Labani
- School of Computing, Macquarie University, Sydney 2109, Australia
| | - Amin Beheshti
- School of Computing, Macquarie University, Sydney 2109, Australia
| | - Tracey A. O’Brien
- School of Computing, Macquarie University, Sydney 2109, Australia
- Cancer Institute NSW, Sydney 2065, Australia
- School of Clinical Medicine, Medicine & Health, University of New South Wales (UNSW), Sydney 2052, Australia
| |
|
29
|
Dai Y, Itai T, Pei G, Yan F, Chu Y, Jiang X, Weinberg SM, Mukhopadhyay N, Marazita ML, Simon LM, Jia P, Zhao Z. DeepFace: Deep-learning-based framework to contextualize orofacial-cleft-related variants during human embryonic craniofacial development. HGG Adv 2024; 5:100312. [PMID: 38796699 PMCID: PMC11193024 DOI: 10.1016/j.xhgg.2024.100312] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2024] [Revised: 05/23/2024] [Accepted: 05/23/2024] [Indexed: 05/28/2024] Open
Abstract
Orofacial clefts (OFCs) are among the most common human congenital birth defects. Previous multiethnic studies have identified dozens of associated loci for both cleft lip with or without cleft palate (CL/P) and cleft palate alone (CP). Although several nearby genes have been highlighted, the causal variants are largely unknown. Here, we developed DeepFace, a convolutional neural network model, to assess the functional impact of variants by SNP activity difference (SAD) scores. The DeepFace model is trained with 204 epigenomic assays from crucial human embryonic craniofacial developmental stages of post-conception week (pcw) 4 to pcw 10. The Pearson correlation coefficient between the predicted and actual values for 12 epigenetic features achieved a median range of 0.50-0.83. Specifically, our model revealed that SNPs significantly associated with OFCs tended to exhibit higher SAD scores across various variant categories compared to less related groups, indicating a context-specific impact of OFC-related SNPs. Notably, we identified six SNPs with a significant linear relationship to SAD scores throughout developmental progression, suggesting that these SNPs could play a temporal regulatory role. Furthermore, our cell-type specificity analysis pinpointed the trophoblast cell as having the highest enrichment of risk signals associated with OFCs. Overall, DeepFace can harness distal regulatory signals from extensive epigenomic assays, offering new perspectives for prioritizing OFC variants using contextualized functional genomic features. We expect DeepFace to be instrumental in assessing and predicting the regulatory roles of variants associated with OFCs, and the model can be extended to study other complex diseases or traits.
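As a rough illustration of the SAD score named in this abstract, the sketch below assumes the common definition (model prediction for the alternate allele minus the prediction for the reference allele, per output track). It is not the DeepFace implementation: `toy_model`, `one_hot`, and `sad_score` are hypothetical names, and the toy model simply counts bases so that the example runs without a trained network.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot array."""
    idx = np.array([BASES.index(b) for b in seq])
    return np.eye(4)[idx]

def toy_model(x):
    """Hypothetical stand-in for a trained network with two output tracks."""
    weights = np.array([[0.0, 1.0, 1.0, 0.0],   # track 0: counts C and G
                        [1.0, 0.0, 0.0, 1.0]])  # track 1: counts A and T
    return x.sum(axis=0) @ weights.T

def sad_score(model, ref_seq, pos, alt_base):
    """Assumed definition: SAD = model(alt sequence) - model(ref sequence)."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return model(one_hot(alt_seq)) - model(one_hot(ref_seq))

# A>G substitution at position 0 of a short reference sequence
sad = sad_score(toy_model, "ACGTACGT", pos=0, alt_base="G")
```

With a real model, each entry of `sad` would correspond to one epigenomic assay, so a variant can be compared across developmental stages track by track.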
Affiliation(s)
- Yulin Dai
- Center for Precision Health, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Toshiyuki Itai
- Center for Precision Health, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Guangsheng Pei
- Center for Precision Health, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Fangfang Yan
- Center for Precision Health, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Yan Chu
- Center for Secure Artificial Intelligence for Healthcare, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Xiaoqian Jiang
- Center for Secure Artificial Intelligence for Healthcare, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Seth M Weinberg
- Department of Oral and Craniofacial Sciences, School of Dental Medicine, Center for Craniofacial and Dental Genetics, University of Pittsburgh, Pittsburgh, PA 15213, USA; Department of Human Genetics, School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA
| | - Nandita Mukhopadhyay
- Department of Oral and Craniofacial Sciences, School of Dental Medicine, Center for Craniofacial and Dental Genetics, University of Pittsburgh, Pittsburgh, PA 15213, USA
| | - Mary L Marazita
- Department of Oral and Craniofacial Sciences, School of Dental Medicine, Center for Craniofacial and Dental Genetics, University of Pittsburgh, Pittsburgh, PA 15213, USA; Department of Human Genetics, School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA; Clinical and Translational Science Institute, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA
| | - Lukas M Simon
- Therapeutic Innovation Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Peilin Jia
- Center for Precision Health, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Zhongming Zhao
- Center for Precision Health, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA; MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA
| |
|
30
|
Kathail P, Shuai RW, Chung R, Ye CJ, Loeb GB, Ioannidis NM. Current genomic deep learning models display decreased performance in cell type specific accessible regions. bioRxiv 2024:2024.07.05.602265. [PMID: 39026761 PMCID: PMC11257480 DOI: 10.1101/2024.07.05.602265] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/20/2024]
Abstract
Background A number of deep learning models have been developed to predict epigenetic features such as chromatin accessibility from DNA sequence. Model evaluations commonly report performance genome-wide; however, cis regulatory elements (CREs), which play critical roles in gene regulation, make up only a small fraction of the genome. Furthermore, cell type specific CREs contain a large proportion of complex disease heritability. Results We evaluate genomic deep learning models in chromatin accessibility regions with varying degrees of cell type specificity. We assess two modeling directions in the field: general purpose models trained across thousands of outputs (cell types and epigenetic marks), and models tailored to specific tissues and tasks. We find that the accuracy of genomic deep learning models, including two state-of-the-art general purpose models - Enformer and Sei - varies across the genome and is reduced in cell type specific accessible regions. Using accessibility models trained on cell types from specific tissues, we find that increasing model capacity to learn cell type specific regulatory syntax - through single-task learning or high capacity multi-task models - can improve performance in cell type specific accessible regions. We also observe that improving reference sequence predictions does not consistently improve variant effect predictions, indicating that novel strategies are needed to improve performance on variants. Conclusions Our results provide a new perspective on the performance of genomic deep learning models, showing that performance varies across the genome and is particularly reduced in cell type specific accessible regions. We also identify strategies to maximize performance in cell type specific accessible regions.
Affiliation(s)
- Pooja Kathail
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
| | - Richard W. Shuai
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
| | - Ryan Chung
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
| | - Chun Jimmie Ye
- Division of Rheumatology, Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
- Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, USA
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Parker Institute for Cancer Immunotherapy, San Francisco, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
| | - Gabriel B. Loeb
- Division of Nephrology, Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
- Cardiovascular Research Institute, University of California, San Francisco, San Francisco, CA, USA
| | - Nilah M. Ioannidis
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
| |
|
31
|
Mendoza-Revilla J, Trop E, Gonzalez L, Roller M, Dalla-Torre H, de Almeida BP, Richard G, Caton J, Lopez Carranza N, Skwark M, Laterre A, Beguir K, Pierrot T, Lopez M. A foundational large language model for edible plant genomes. Commun Biol 2024; 7:835. [PMID: 38982288 PMCID: PMC11233511 DOI: 10.1038/s42003-024-06465-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Accepted: 06/17/2024] [Indexed: 07/11/2024] Open
Abstract
Significant progress has been made in the field of plant genomics, as demonstrated by the increased use of high-throughput methodologies that enable the characterization of multiple genome-wide molecular phenotypes. These findings have provided valuable insights into plant traits and their underlying genetic mechanisms, particularly in model plant species. Nonetheless, effectively leveraging them to make accurate predictions represents a critical step in crop genomic improvement. We present AgroNT, a foundational large language model trained on genomes from 48 plant species with a predominant focus on crop species. We show that AgroNT can obtain state-of-the-art predictions for regulatory annotations, promoter/terminator strength, and tissue-specific gene expression, and can prioritize functional variants. We conduct a large-scale in silico saturation mutagenesis analysis on cassava to evaluate the regulatory impact of over 10 million mutations and provide their predicted effects as a resource for variant characterization. Finally, we propose the use of the diverse datasets compiled here as the Plants Genomic Benchmark (PGB), providing a comprehensive benchmark for deep learning-based methods in plant genomic research. The pre-trained AgroNT model is publicly available on HuggingFace at https://huggingface.co/InstaDeepAI/agro-nucleotide-transformer-1b for future research purposes.
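The in silico saturation mutagenesis mentioned in this abstract can be sketched generically as follows: every position of a sequence is substituted with each alternative base, and the change in a model score is recorded. This is a minimal sketch, not the AgroNT pipeline; `toy_score` is a hypothetical stand-in for a trained model (it only counts occurrences of the motif `TATA`), and `saturation_mutagenesis` is an illustrative name.

```python
import numpy as np

BASES = "ACGT"

def toy_score(seq):
    """Hypothetical stand-in for a trained model: counts 'TATA' occurrences."""
    return float(sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3)))

def saturation_mutagenesis(score_fn, seq):
    """Return a (len(seq), 4) array of score changes for every single-base substitution."""
    ref = score_fn(seq)
    effects = np.zeros((len(seq), 4))
    for pos in range(len(seq)):
        for j, base in enumerate(BASES):
            if base == seq[pos]:
                continue  # reference base: effect stays 0
            mutant = seq[:pos] + base + seq[pos + 1:]
            effects[pos, j] = score_fn(mutant) - ref
    return effects

# Columns follow the order of BASES; rows are sequence positions
effects = saturation_mutagenesis(toy_score, "GGTATAGG")
```

In this toy case only substitutions inside the motif (positions 2 to 5) change the score, which is the kind of pattern such an analysis is meant to surface when the scoring function is a real regulatory model.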
|
32
|
Huang D, Ovcharenko I. The contribution of silencer variants to human diseases. Genome Biol 2024; 25:184. [PMID: 38978133 PMCID: PMC11232194 DOI: 10.1186/s13059-024-03328-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2023] [Accepted: 06/28/2024] [Indexed: 07/10/2024] Open
Abstract
BACKGROUND Although disease-causal genetic variants have been found within silencer sequences, we still lack a comprehensive analysis of the association of silencers with diseases. Here, we profiled GWAS variants in 2.8 million candidate silencers across 97 human samples derived from a diverse panel of tissues and developmental time points, using deep learning models. RESULTS We show that candidate silencers exhibit strong enrichment in disease-associated variants, and several diseases display a much stronger association with silencer variants than enhancer variants. Close to 52% of candidate silencers cluster, forming silencer-rich loci, and, in the loci of Parkinson's-disease-hallmark genes TRIM31 and MAL, the associated SNPs densely populate clustered candidate silencers rather than enhancers, displaying an overall twofold enrichment in silencers versus enhancers. The disruption of apoptosis in neuronal cells is associated with both schizophrenia and bipolar disorder and can largely be attributed to variants within candidate silencers. Our model permits a mechanistic explanation of causative SNP effects by identifying altered binding of tissue-specific repressors and activators, validated with 70% directional concordance using SNP-SELEX. Narrowing the focus of the analysis to individual silencer variants, experimental data confirm the role of the rs62055708 SNP in Parkinson's disease, rs2535629 in schizophrenia, and rs6207121 in type 1 diabetes. CONCLUSIONS In summary, our results indicate that advances in deep learning models for the discovery of disease-causal variants within candidate silencers effectively "double" the number of functionally characterized GWAS variants. This provides a basis for explaining mechanisms of action and designing novel diagnostics and therapeutics.
Affiliation(s)
- Di Huang
- Intramural Research Program, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20892, USA
| | - Ivan Ovcharenko
- Intramural Research Program, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20892, USA
| |
|
33
|
Oguchi A, Suzuki A, Komatsu S, Yoshitomi H, Bhagat S, Son R, Bonnal RJP, Kojima S, Koido M, Takeuchi K, Myouzen K, Inoue G, Hirai T, Sano H, Takegami Y, Kanemaru A, Yamaguchi I, Ishikawa Y, Tanaka N, Hirabayashi S, Konishi R, Sekito S, Inoue T, Kere J, Takeda S, Takaori-Kondo A, Endo I, Kawaoka S, Kawaji H, Ishigaki K, Ueno H, Hayashizaki Y, Pagani M, Carninci P, Yanagita M, Parrish N, Terao C, Yamamoto K, Murakawa Y. An atlas of transcribed enhancers across helper T cell diversity for decoding human diseases. Science 2024; 385:eadd8394. [PMID: 38963856 DOI: 10.1126/science.add8394] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2022] [Accepted: 05/01/2024] [Indexed: 07/06/2024]
Abstract
Transcribed enhancer maps can reveal nuclear interactions underpinning each cell type and connect specific cell types to diseases. Using a 5' single-cell RNA sequencing approach, we defined transcription start sites of enhancer RNAs and other classes of coding and noncoding RNAs in human CD4+ T cells, revealing cellular heterogeneity and differentiation trajectories. Integration of these datasets with single-cell chromatin profiles showed that active enhancers with bidirectional RNA transcription are highly cell type-specific and that disease heritability is strongly enriched in these enhancers. The resulting cell type-resolved multimodal atlas of bidirectionally transcribed enhancers, which we linked with promoters using fine-scale chromatin contact maps, enabled us to systematically interpret genetic variants associated with a range of immune-mediated diseases.
Affiliation(s)
- Akiko Oguchi
- RIKEN-IFOM Joint Laboratory for Cancer Genomics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Institute for the Advanced Study of Human Biology, Kyoto University, Kyoto, Japan
- Department of Nephrology, Graduate School of Medicine, Kyoto University, Kyoto, Japan
| | - Akari Suzuki
- Laboratory for Autoimmune Diseases, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Shuichiro Komatsu
- RIKEN-IFOM Joint Laboratory for Cancer Genomics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- IFOM ETS - the AIRC Institute of Molecular Oncology, Milan, Italy
| | - Hiroyuki Yoshitomi
- Institute for the Advanced Study of Human Biology, Kyoto University, Kyoto, Japan
- Department of Immunology, Graduate School of Medicine, Kyoto University, Kyoto, Japan
| | - Shruti Bhagat
- Institute for the Advanced Study of Human Biology, Kyoto University, Kyoto, Japan
| | - Raku Son
- RIKEN-IFOM Joint Laboratory for Cancer Genomics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Institute for the Advanced Study of Human Biology, Kyoto University, Kyoto, Japan
- Department of Nephrology, Graduate School of Medicine, Kyoto University, Kyoto, Japan
| | - Shohei Kojima
- Genome Immunobiology RIKEN Hakubi Research Team, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Masaru Koido
- Division of Molecular Pathology, Department of Cancer Biology, Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- Laboratory of Complex Trait Genomics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Kazuhiro Takeuchi
- RIKEN-IFOM Joint Laboratory for Cancer Genomics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Institute for the Advanced Study of Human Biology, Kyoto University, Kyoto, Japan
- Department of Medical Systems Genomics, Graduate School of Medicine, Kyoto University, Kyoto, Japan
| | - Keiko Myouzen
- Laboratory for Autoimmune Diseases, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Gyo Inoue
- Laboratory for Autoimmune Diseases, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Tomoya Hirai
- RIKEN-IFOM Joint Laboratory for Cancer Genomics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Department of Gastroenterological Surgery, Yokohama City University Graduate School of Medicine, Yokohama City University, Yokohama, Japan
| | - Hiromi Sano
- RIKEN-IFOM Joint Laboratory for Cancer Genomics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Yuki Ishikawa
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Nao Tanaka
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Shigeki Hirabayashi
- RIKEN-IFOM Joint Laboratory for Cancer Genomics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Department of Hematology and Oncology, Graduate School of Medicine, Kyoto University, Kyoto, Japan
- Division of Precision Medicine, Kyushu University Graduate School of Medical Sciences, Kyushu University, Fukuoka, Japan
| | - Riyo Konishi
- Inter-Organ Communication Research Team, Institute for Life and Medical Sciences, Kyoto University, Kyoto, Japan
| | - Sho Sekito
- Institute for the Advanced Study of Human Biology, Kyoto University, Kyoto, Japan
- Department of Nephro-Urologic Surgery and Andrology, Mie University Graduate School of Medicine, Mie University, Tsu, Japan
| | - Takahiro Inoue
- Department of Nephro-Urologic Surgery and Andrology, Mie University Graduate School of Medicine, Mie University, Tsu, Japan
| | - Juha Kere
- Department of Biosciences and Nutrition, Karolinska Institutet, Stockholm, Sweden
- Stem Cells and Metabolism Research Program, University of Helsinki, Helsinki, Finland
- Folkhalsan Research Center, Helsinki, Finland
| | - Shunichi Takeda
- Department of Radiation Genetics, Graduate School of Medicine, Kyoto University, Kyoto, Japan
- Shenzhen University School of Medicine, Shenzhen, Guangdong, China
| | - Akifumi Takaori-Kondo
- Department of Hematology and Oncology, Graduate School of Medicine, Kyoto University, Kyoto, Japan
| | - Itaru Endo
- Department of Gastroenterological Surgery, Yokohama City University Graduate School of Medicine, Yokohama City University, Yokohama, Japan
| | - Shinpei Kawaoka
- Inter-Organ Communication Research Team, Institute for Life and Medical Sciences, Kyoto University, Kyoto, Japan
- Department of Integrative Bioanalytics, Institute of Development, Aging and Cancer, Tohoku University, Sendai, Japan
| | - Hideya Kawaji
- Research Center for Genome & Medical Sciences, Tokyo Metropolitan Institute of Medical Science, Tokyo, Japan
- Preventive Medicine and Applied Genomics Unit, RIKEN Center for Integrative Medical Science, Yokohama, Japan
- RIKEN Preventive Medicine and Diagnosis Innovation Program, Wako, Japan
| | - Kazuyoshi Ishigaki
- Laboratory for Human Immunogenetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Hideki Ueno
- Institute for the Advanced Study of Human Biology, Kyoto University, Kyoto, Japan
- Department of Immunology, Graduate School of Medicine, Kyoto University, Kyoto, Japan
| | - Yoshihide Hayashizaki
- K.K. DNAFORM, Yokohama, Japan
- RIKEN Preventive Medicine and Diagnosis Innovation Program, Wako, Japan
| | - Massimiliano Pagani
- IFOM ETS - the AIRC Institute of Molecular Oncology, Milan, Italy
- Department of Medical Biotechnology and Translational Medicine, Università degli Studi, Milan, Italy
| | - Piero Carninci
- Laboratory for Transcriptome Technology, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Human Technopole, Milan, Italy
| | - Motoko Yanagita
- Institute for the Advanced Study of Human Biology, Kyoto University, Kyoto, Japan
- Department of Nephrology, Graduate School of Medicine, Kyoto University, Kyoto, Japan
| | - Nicholas Parrish
- Genome Immunobiology RIKEN Hakubi Research Team, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Chikashi Terao
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Clinical Research Center, Shizuoka General Hospital, Shizuoka, Japan
- Department of Applied Genetics, School of Pharmaceutical Sciences, University of Shizuoka, Shizuoka, Japan
| | - Kazuhiko Yamamoto
- Laboratory for Autoimmune Diseases, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Yasuhiro Murakawa
- RIKEN-IFOM Joint Laboratory for Cancer Genomics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Institute for the Advanced Study of Human Biology, Kyoto University, Kyoto, Japan
- IFOM ETS - the AIRC Institute of Molecular Oncology, Milan, Italy
- Department of Medical Systems Genomics, Graduate School of Medicine, Kyoto University, Kyoto, Japan
| |
|
34
|
Fu X, Mo S, Buendia A, Laurent A, Shao A, del Mar Alvarez-Torres M, Yu T, Tan J, Su J, Sagatelian R, Ferrando AA, Ciccia A, Lan Y, Owens DM, Palomero T, Xing EP, Rabadan R. GET: a foundation model of transcription across human cell types. bioRxiv 2024:2023.09.24.559168. [PMID: 39005360 PMCID: PMC11244937 DOI: 10.1101/2023.09.24.559168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Transcriptional regulation, involving the complex interplay between regulatory sequences and proteins, directs all biological processes. Computational models of transcription lack generalizability to accurately extrapolate in unseen cell types and conditions. Here, we introduce GET, an interpretable foundation model designed to uncover regulatory grammars across 213 human fetal and adult cell types. Relying exclusively on chromatin accessibility data and sequence information, GET achieves experimental-level accuracy in predicting gene expression even in previously unseen cell types. GET showcases remarkable adaptability across new sequencing platforms and assays, enabling regulatory inference across a broad range of cell types and conditions, and uncovering universal and cell type specific transcription factor interaction networks. We evaluated its performance on prediction of regulatory activity, inference of regulatory elements and regulators, and identification of physical interactions between transcription factors. Specifically, we show GET outperforms current models in predicting lentivirus-based massively parallel reporter assay readout with reduced input data. In fetal erythroblasts, we identify distal (>1Mbp) regulatory regions that were missed by previous models. In B cells, we identified a lymphocyte-specific transcription factor-transcription factor interaction that explains the functional significance of a leukemia-risk predisposing germline mutation. In sum, we provide a generalizable and accurate model for transcription together with catalogs of gene regulation and transcription factor interactions, all with cell type specificity.
Affiliation(s)
- Xi Fu
- Department of Systems Biology, Columbia University, New York, NY, USA
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| | - Shentong Mo
- Department of Machine Learning, Carnegie Mellon University, Pittsburgh, PA, USA
- Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
| | - Alejandro Buendia
- Department of Systems Biology, Columbia University, New York, NY, USA
| | - Anouchka Laurent
- Institute for Cancer Genetics, Columbia University, New York, NY, USA
| | - Anqi Shao
- Department of Dermatology, Columbia University, New York, NY, USA
| | - Tianji Yu
- Department of Systems Biology, Columbia University, New York, NY, USA
| | - Jimin Tan
- Regeneron Genetics Center, Regeneron, Tarrytown, NY, USA
| | - Jiayu Su
- Department of Systems Biology, Columbia University, New York, NY, USA
| | - Adolfo A. Ferrando
- Department of Dermatology, Columbia University, New York, NY, USA
- Regeneron Genetics Center, Regeneron, Tarrytown, NY, USA
| | - Alberto Ciccia
- Department of Genetics and Development, Columbia University, New York, NY, USA
| | - Yanyan Lan
- Institute for AI Industry Research, Tsinghua University, Beijing, China
| | - David M. Owens
- Institute for Cancer Genetics, Columbia University, New York, NY, USA
- Department of Pathology & Cell Biology, Columbia University, New York, NY, USA
| | - Teresa Palomero
- Institute for Cancer Genetics, Columbia University, New York, NY, USA
- Department of Pathology & Cell Biology, Columbia University, New York, NY, USA
| | - Eric P. Xing
- Department of Machine Learning, Carnegie Mellon University, Pittsburgh, PA, USA
- Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
| | - Raul Rabadan
- Department of Systems Biology, Columbia University, New York, NY, USA
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
|
35
|
Fan X, Chang T, Chen C, Hafner M, Wang Z. Analysis of RNA translation with a deep learning architecture provides new insight into translation control. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.07.08.548206. [PMID: 39005319 PMCID: PMC11244891 DOI: 10.1101/2023.07.08.548206] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Accurate annotation of coding regions in RNAs is essential for understanding gene translation. We developed a deep neural network to directly predict and analyze translation initiation and termination sites from RNA sequences. Trained with human transcripts, our model learned hidden rules of translation control and achieved a near-perfect prediction of canonical translation sites across the entire human transcriptome. Surprisingly, this model revealed a new role of codon usage in regulating translation termination, which was experimentally validated. We also identified thousands of new open reading frames in mRNAs or lncRNAs, some of which were confirmed experimentally. The model trained with human mRNAs achieved high prediction accuracy of canonical translation sites in all eukaryotes and good prediction in polycistronic transcripts from prokaryotes or RNA viruses, suggesting a high degree of conservation in translation control. Collectively, we present a general and efficient deep learning model for RNA translation, generating new insights into the complexity of translation regulation.
Affiliation(s)
- Xiaojuan Fan
- Bio-med Big Data Center, CAS Key Laboratory of Computational Biology, CAS Center for Excellence in Molecular Cell Science, Shanghai Institute of Nutrition and Health
- RNA Molecular Biology Laboratory, National Institute of Arthritis and Musculoskeletal and Skin Disease, Bethesda, MD, USA
| | - Tiangen Chang
- Laboratory of Cancer Data Science, National Cancer Institute, Bethesda, MD, USA
| | - Chuyun Chen
- Bio-med Big Data Center, CAS Key Laboratory of Computational Biology, CAS Center for Excellence in Molecular Cell Science, Shanghai Institute of Nutrition and Health
- University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Markus Hafner
- RNA Molecular Biology Laboratory, National Institute of Arthritis and Musculoskeletal and Skin Disease, Bethesda, MD, USA
| | - Zefeng Wang
- Bio-med Big Data Center, CAS Key Laboratory of Computational Biology, CAS Center for Excellence in Molecular Cell Science, Shanghai Institute of Nutrition and Health
- University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
|
36
|
Moeckel C, Mouratidis I, Chantzi N, Uzun Y, Georgakopoulos-Soares I. Advances in computational and experimental approaches for deciphering transcriptional regulatory networks: Understanding the roles of cis-regulatory elements is essential, and recent research utilizing MPRAs, STARR-seq, CRISPR-Cas9, and machine learning has yielded valuable insights. Bioessays 2024; 46:e2300210. [PMID: 38715516 PMCID: PMC11444527 DOI: 10.1002/bies.202300210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 04/22/2024] [Accepted: 04/23/2024] [Indexed: 05/16/2024]
Abstract
Understanding the influence of cis-regulatory elements on gene regulation poses numerous challenges given complexities stemming from variations in transcription factor (TF) binding, chromatin accessibility, structural constraints, and cell-type differences. This review discusses the role of gene regulatory networks in enhancing understanding of transcriptional regulation and covers construction methods ranging from expression-based approaches to supervised machine learning. Additionally, key experimental methods, including MPRAs and CRISPR-Cas9-based screening, which have significantly contributed to understanding TF binding preferences and cis-regulatory element functions, are explored. Lastly, the potential of machine learning and artificial intelligence to unravel cis-regulatory logic is analyzed. These computational advances have far-reaching implications for precision medicine, therapeutic target discovery, and the study of genetic variations in health and disease.
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
| | - Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Yasin Uzun
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
- Department of Pediatrics, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
|
37
|
Lin YJ, Menon AS, Hu Z, Brenner SE. Variant Impact Predictor database (VIPdb), version 2: Trends from 25 years of genetic variant impact predictors. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.25.600283. [PMID: 38979289 PMCID: PMC11230257 DOI: 10.1101/2024.06.25.600283] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/10/2024]
Abstract
Background: Variant interpretation is essential for identifying patients' disease-causing genetic variants amongst the millions detected in their genomes. Hundreds of Variant Impact Predictors (VIPs), also known as Variant Effect Predictors (VEPs), have been developed for this purpose, with a variety of methodologies and goals. To facilitate the exploration of available VIP options, we have created the Variant Impact Predictor database (VIPdb).
Results: The Variant Impact Predictor database (VIPdb) version 2 presents a collection of VIPs developed over the past 25 years, summarizing their characteristics, ClinGen calibrated scores, CAGI assessment results, publication details, access information, and citation patterns. We previously summarized 217 VIPs and their features in VIPdb in 2019. Building upon this foundation, we identified and categorized an additional 186 VIPs, resulting in a total of 403 VIPs in VIPdb version 2. The majority of the VIPs have the capacity to predict the impacts of single nucleotide variants and nonsynonymous variants. More VIPs tailored to predict the impacts of insertions and deletions have been developed since the 2010s. In contrast, relatively few VIPs are dedicated to the prediction of splicing, structural, synonymous, and regulatory variants. The increasing rate of citations to VIPs reflects the ongoing growth in their use, and the evolving trends in citations reveal development in the field and individual methods.
Conclusions: VIPdb version 2 summarizes 403 VIPs and their features, potentially facilitating VIP exploration for various variant interpretation applications.
Availability: VIPdb version 2 is available at https://genomeinterpretation.org/vipdb.
Affiliation(s)
- Yu-Jen Lin
- Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
- Center for Computational Biology, University of California, Berkeley, California 94720, USA
| | - Arul S. Menon
- Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
- College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, USA
| | - Zhiqiang Hu
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
- Currently at: Illumina, Foster City, California 94404, USA
| | - Steven E. Brenner
- Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
- Center for Computational Biology, University of California, Berkeley, California 94720, USA
- College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, USA
- Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
|
38
|
Li T, Xu H, Teng S, Suo M, Bahitwa R, Xu M, Qian Y, Ramstein GP, Song B, Buckler ES, Wang H. Modeling 0.6 million genes for the rational design of functional cis-regulatory variants and de novo design of cis-regulatory sequences. Proc Natl Acad Sci U S A 2024; 121:e2319811121. [PMID: 38889146 PMCID: PMC11214048 DOI: 10.1073/pnas.2319811121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2023] [Accepted: 05/14/2024] [Indexed: 06/20/2024] Open
Abstract
Rational design of plant cis-regulatory DNA sequences without expert intervention or prior domain knowledge is still a daunting task. Here, we developed PhytoExpr, a deep learning framework capable of predicting both mRNA abundance and plant species using the proximal regulatory sequence as the sole input. PhytoExpr was trained over 17 species representative of major clades of the plant kingdom to enhance its generalizability. Via input perturbation, quantitative functional annotation of the input sequence was achieved at single-nucleotide resolution, revealing an abundance of predicted high-impact nucleotides in conserved noncoding sequences and transcription factor binding sites. Evaluation of maize HapMap3 single-nucleotide polymorphisms (SNPs) by PhytoExpr demonstrates an enrichment of predicted high-impact SNPs in cis-eQTL. Additionally, we provided two algorithms that harnessed the power of PhytoExpr in designing functional cis-regulatory variants, and de novo creation of species-specific cis-regulatory sequences through in silico evolution of random DNA sequences. Our model represents a general and robust approach for functional variant discovery in population genetics and rational design of regulatory sequences for genome editing and synthetic biology.
Affiliation(s)
- Tianyi Li
- State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding, Department of Plant Genetics and Breeding, China Agricultural University, Beijing 100193, People’s Republic of China
| | - Hui Xu
- State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding, Department of Plant Genetics and Breeding, China Agricultural University, Beijing 100193, People’s Republic of China
| | - Shouzhen Teng
- State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding, Department of Plant Genetics and Breeding, China Agricultural University, Beijing 100193, People’s Republic of China
| | - Mingrui Suo
- State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding, Department of Plant Genetics and Breeding, China Agricultural University, Beijing 100193, People’s Republic of China
| | - Revocatus Bahitwa
- State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding, Department of Plant Genetics and Breeding, China Agricultural University, Beijing 100193, People’s Republic of China
- Legumes Research Program, Research and Innovation Division, Tanzania Agricultural Research Institute, Ilonga, Kilosa, Morogoro 67410, Tanzania
| | - Mingchi Xu
- State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding, Department of Plant Genetics and Breeding, China Agricultural University, Beijing 100193, People’s Republic of China
| | - Yiheng Qian
- State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding, Department of Plant Genetics and Breeding, China Agricultural University, Beijing 100193, People’s Republic of China
| | - Guillaume P. Ramstein
- Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus 8000, Denmark
| | - Baoxing Song
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agriculture Sciences in Weifang, Weifang, Shandong 261325, People’s Republic of China
- Key Laboratory of Maize Biology and Genetic Breeding in Arid Area of Northwest Region of the Ministry of Agriculture, College of Agronomy, Northwest A&F University, Yangling, Shaanxi 712100, People’s Republic of China
| | - Edward S. Buckler
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853
- Agricultural Research Service, United States Department of Agriculture, Ithaca, NY 14853
| | - Hai Wang
- State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding, Department of Plant Genetics and Breeding, China Agricultural University, Beijing 100193, People’s Republic of China
- Center for Crop Functional Genomics and Molecular Breeding, China Agricultural University, Beijing 100193, People’s Republic of China
- Sanya Institute of China Agricultural University, Sanya 572025, People’s Republic of China
|
39
|
Gjoni K, Pollard KS. SuPreMo: a computational tool for streamlining in silico perturbation using sequence-based predictive models. Bioinformatics 2024; 40:btae340. [PMID: 38796686 PMCID: PMC11153836 DOI: 10.1093/bioinformatics/btae340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Revised: 05/04/2024] [Accepted: 05/24/2024] [Indexed: 05/28/2024] Open
Abstract
Summary: The increasing development of sequence-based machine learning models has raised the demand for manipulating sequences for this application. However, existing approaches to edit and evaluate genome sequences using models have limitations, such as incompatibility with structural variants, challenges in identifying responsible sequence perturbations, and the need for VCF file inputs and phased data. To address these bottlenecks, we present Sequence Mutator for Predictive Models (SuPreMo), a scalable and comprehensive tool for performing and supporting in silico mutagenesis experiments. We then demonstrate how pairs of reference and perturbed sequences can be used with machine learning models to prioritize pathogenic variants or discover new functional sequences.
Availability and implementation: SuPreMo was written in Python, and can be run using only one line of code to generate both sequences and 3D genome disruption scores. The codebase, instructions for installation and use, and tutorials are on the GitHub page: https://github.com/ketringjoni/SuPreMo.
Affiliation(s)
- Ketrin Gjoni
- Institute of Data Science and Biotechnology, Gladstone Institutes, 1650 Owens Street, San Francisco, CA 94158, United States
- Department of Epidemiology & Biostatistics, University of California, San Francisco, CA 94158, United States
| | - Katherine S Pollard
- Institute of Data Science and Biotechnology, Gladstone Institutes, 1650 Owens Street, San Francisco, CA 94158, United States
- Department of Epidemiology & Biostatistics, University of California, San Francisco, CA 94158, United States
- Chan Zuckerberg Biohub, San Francisco, CA 94158, United States
|
40
|
Hwang H, Jeon H, Yeo N, Baek D. Big data and deep learning for RNA biology. Exp Mol Med 2024; 56:1293-1321. [PMID: 38871816 PMCID: PMC11263376 DOI: 10.1038/s12276-024-01243-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/31/2024] [Revised: 02/27/2024] [Accepted: 03/05/2024] [Indexed: 06/15/2024] Open
Abstract
The exponential growth of big data in RNA biology (RB) has led to the development of deep learning (DL) models that have driven crucial discoveries. As constantly evidenced by DL studies in other fields, the successful implementation of DL in RB depends heavily on the effective utilization of large-scale datasets from public databases. In achieving this goal, data encoding methods, learning algorithms, and techniques that align well with biological domain knowledge have played pivotal roles. In this review, we provide guiding principles for applying these DL concepts to various problems in RB by demonstrating successful examples and associated methodologies. We also discuss the remaining challenges in developing DL models for RB and suggest strategies to overcome these challenges. Overall, this review aims to illuminate the compelling potential of DL for RB and ways to apply this powerful technology to investigate the intriguing biology of RNA more effectively.
Affiliation(s)
- Hyeonseo Hwang
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
| | - Hyeonseong Jeon
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Genome4me Inc., Seoul, Republic of Korea
| | - Nagyeong Yeo
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
| | - Daehyun Baek
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Genome4me Inc., Seoul, Republic of Korea
|
41
|
Ghafoor H, Asim MN, Ibrahim MA, Ahmed S, Dengel A. CAPTURE: Comprehensive anti-cancer peptide predictor with a unique amino acid sequence encoder. Comput Biol Med 2024; 176:108538. [PMID: 38759585 DOI: 10.1016/j.compbiomed.2024.108538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Revised: 04/26/2024] [Accepted: 04/28/2024] [Indexed: 05/19/2024]
Abstract
The key properties of anticancer peptides (ACPs), including bioactivity, high efficacy, low toxicity, and lack of drug resistance, make them ideal candidates for cancer therapies. To explore the potential of ACPs and accelerate the development of cancer therapies, 53 artificial intelligence-supported computational predictors have been developed for ACP versus non-ACP classification, but only one predictor has been developed for annotating ACP functional types. Moreover, these predictors extract amino acid distribution patterns to transform peptide sequences into statistical vectors, which are then fed to classifiers that discriminate peptide sequences and annotate peptide functional classes. Overall, these predictors fail to extract diverse types of amino acid distribution patterns from peptide sequences. This paper presents a unique CARE encoder that transforms peptide sequences into statistical vectors by extracting four types of patterns: correlation, distribution, composition, and transition. On public benchmark datasets, the potential of the proposed encoder is explored under two evaluation settings, intrinsic and extrinsic. The extrinsic evaluation indicates that 12 different machine learning classifiers achieve superior performance with the proposed encoder compared to 55 existing encoders. Furthermore, the intrinsic evaluation reveals that, unlike existing encoders, the proposed encoder generates more discriminative clusters for the ACP and non-ACP classes. Across 8 public benchmark ACP versus non-ACP classification datasets, the CAPTURE predictor, which combines the proposed encoder with an AdaBoost classifier, outperforms existing predictors by average margins of 1%, 4%, and 2% in accuracy, recall, and MCC, respectively. In a generalizability case study across 7 benchmark antimicrobial peptide classification datasets, CAPTURE surpasses existing predictors by an average of 2% in AU-ROC.
Combined with the label powerset method, the CAPTURE predictive pipeline outperforms the state-of-the-art ACP functional type predictor by 5%, 5%, 5%, 6%, and 3% in average accuracy, subset accuracy, precision, recall, and F1, respectively. The CAPTURE web application is available at https://sds_genetic_analysis.opendfki.de/CAPTURE.
Affiliation(s)
- Hina Ghafoor
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany; German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
| | - Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
| | - Muhammad Ali Ibrahim
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany; German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
| | - Sheraz Ahmed
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
| | - Andreas Dengel
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany; German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
|
42
|
Chin IM, Gardell ZA, Corces MR. Decoding polygenic diseases: advances in noncoding variant prioritization and validation. Trends Cell Biol 2024; 34:465-483. [PMID: 38719704 DOI: 10.1016/j.tcb.2024.03.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2023] [Revised: 03/12/2024] [Accepted: 03/21/2024] [Indexed: 06/09/2024]
Abstract
Genome-wide association studies (GWASs) provide a key foundation for elucidating the genetic underpinnings of common polygenic diseases. However, these studies have limitations in their ability to assign causality to particular genetic variants, especially those residing in the noncoding genome. Over the past decade, technological and methodological advances in both analytical and empirical prioritization of noncoding variants have enabled the identification of causative variants by leveraging orthogonal functional evidence at increasing scale. In this review, we present an overview of these approaches and describe how this workflow provides the groundwork necessary to move beyond associations toward genetically informed studies on the molecular and cellular mechanisms of polygenic disease.
Affiliation(s)
- Iris M Chin
- Gladstone Institute of Neurological Disease, Gladstone Institutes, San Francisco, CA, USA; Gladstone Institute of Data Science and Biotechnology, Gladstone Institutes, San Francisco, CA, USA; Department of Neurology, University of California San Francisco, San Francisco, CA, USA
| | - Zachary A Gardell
- Gladstone Institute of Neurological Disease, Gladstone Institutes, San Francisco, CA, USA; Gladstone Institute of Data Science and Biotechnology, Gladstone Institutes, San Francisco, CA, USA; Department of Neurology, University of California San Francisco, San Francisco, CA, USA
| | - M Ryan Corces
- Gladstone Institute of Neurological Disease, Gladstone Institutes, San Francisco, CA, USA; Gladstone Institute of Data Science and Biotechnology, Gladstone Institutes, San Francisco, CA, USA; Department of Neurology, University of California San Francisco, San Francisco, CA, USA
|
43
|
Ni P, Wu S, Su Z. Validated Negative Regions (VNRs) in the VISTA Database might be Truncated Forms of Bona Fide Enhancers. ADVANCED GENETICS (HOBOKEN, N.J.) 2024; 5:2300209. [PMID: 38884049 PMCID: PMC11170074 DOI: 10.1002/ggn2.202300209] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2023] [Revised: 03/16/2024] [Indexed: 06/18/2024]
Abstract
The VISTA enhancer database is a valuable resource for evaluating predicted enhancers in humans and mice. In addition to thousands of validated positive regions (VPRs) in the human and mouse genomes, the database also contains similar numbers of validated negative regions (VNRs). It was previously shown that the VPRs are on average half as long as predicted overlapping enhancers that are highly conserved, leading to the hypothesis that the VPRs may be truncated forms of long bona fide enhancers. Here, it is shown that, like the VPRs, the VNRs are also under strong evolutionary constraints and overlap predicted enhancers in the genomes. The VNRs are also on average half as long as predicted overlapping enhancers that are highly conserved. Moreover, the VNRs and the VPRs display similar cell/tissue-specific modification patterns of key epigenetic marks of active enhancers. Furthermore, the VNRs and the VPRs show similar impact score spectra of in silico mutagenesis. These highly similar properties between the VPRs and the VNRs suggest that, like the VPRs, the VNRs may also be truncated forms of long bona fide enhancers.
Affiliation(s)
- Pengyu Ni
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Present address: Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, USA
| | - Siwen Wu
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC 28223, USA
| | - Zhengchang Su
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC 28223, USA
|
44
|
Wang S, Wang W. Interpretable prediction of mRNA abundance from promoter sequence using contextual regression models. NAR Genom Bioinform 2024; 6:lqae055. [PMID: 38807713 PMCID: PMC11131020 DOI: 10.1093/nargab/lqae055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 04/08/2024] [Accepted: 05/12/2024] [Indexed: 05/30/2024] Open
Abstract
While machine learning models have been successfully applied to predicting gene expression from promoter sequences, it remains a great challenge to derive an intuitive interpretation of the model and reveal DNA motif grammar, such as motif cooperation and distance constraints between motif sites. Previous interpretation approaches are often time-consuming or have difficulty learning the combinatorial rules. In this work, we designed interpretable neural network models to predict mRNA expression levels from DNA sequences. By applying the Contextual Regression framework we developed, we extracted weighted features to cluster samples into groups with different gene expression levels. We performed motif analysis in each cluster and found motifs with activating or repressive effects on gene expression. By comparing the co-occurrence locations of discovered motifs, we also uncovered multiple grammars of motif combination, including communities of cooperative motifs and distance constraints between motif pairs. These results revealed new insights into the regulatory architecture of promoter sequences.
Affiliation(s)
- Song Wang
- Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA 92093-0359, USA
| | - Wei Wang
- Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA 92093-0359, USA
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA 92093-0359, USA
|
45
|
Cochran K, Yin M, Mantripragada A, Schreiber J, Marinov GK, Kundaje A. Dissecting the cis-regulatory syntax of transcription initiation with deep learning. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.28.596138. [PMID: 38853896 PMCID: PMC11160661 DOI: 10.1101/2024.05.28.596138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]
Abstract
Despite extensive characterization of mammalian Pol II transcription, the DNA sequence determinants of transcription initiation at a third of human promoters and most enhancers remain poorly understood. Hence, we trained and interpreted a neural network called ProCapNet that accurately models base-resolution initiation profiles from PRO-cap experiments using local DNA sequence. ProCapNet learns sequence motifs with distinct effects on initiation rates and TSS positioning and uncovers context-specific cryptic initiator elements intertwined within other TF motifs. ProCapNet annotates predictive motifs in nearly all actively transcribed regulatory elements across multiple cell-lines, revealing a shared cis-regulatory logic across promoters and enhancers mediated by a highly epistatic sequence syntax of cooperative and competitive motif interactions. ProCapNet models of RAMPAGE profiles measuring steady-state RNA abundance at TSSs distill initiation signals on par with models trained directly on PRO-cap profiles. ProCapNet learns a largely cell-type-agnostic cis-regulatory code of initiation complementing sequence drivers of cell-type-specific chromatin state critical for accurate prediction of cell-type-specific transcription initiation.
Affiliation(s)
- Kelly Cochran
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | | | | | - Jacob Schreiber
- Department of Genetics, Stanford University, Stanford, CA, USA
| | | | - Anshul Kundaje
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
|
46
|
Zhang S, Shu H, Zhou J, Rubin-Sigler J, Yang X, Liu Y, Cooper-Knock J, Monte E, Zhu C, Tu S, Li H, Tong M, Ecker JR, Ichida JK, Shen Y, Zeng J, Tsao PS, Snyder MP. Deconvolution of polygenic risk score in single cells unravels cellular and molecular heterogeneity of complex human diseases. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.14.594252. [PMID: 38798507 PMCID: PMC11118500 DOI: 10.1101/2024.05.14.594252] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]
Abstract
Polygenic risk scores (PRSs) are commonly used for predicting an individual's genetic risk of complex diseases. Yet, their implication for disease pathogenesis remains largely limited. Here, we introduce scPRS, a geometric deep learning model that constructs single-cell-resolved PRS leveraging reference single-cell chromatin accessibility profiling data to enhance biological discovery as well as disease prediction. Real-world applications across multiple complex diseases, including type 2 diabetes (T2D), hypertrophic cardiomyopathy (HCM), and Alzheimer's disease (AD), showcase the superior prediction power of scPRS compared to traditional PRS methods. Importantly, scPRS not only predicts disease risk but also uncovers disease-relevant cells, such as hormone-high alpha and beta cells for T2D, cardiomyocytes and pericytes for HCM, and astrocytes, microglia and oligodendrocyte progenitor cells for AD. Facilitated by a layered multi-omic analysis, scPRS further identifies cell-type-specific genetic underpinnings, linking disease-associated genetic variants to gene regulation within corresponding cell types. We substantiate the disease relevance of scPRS-prioritized HCM genes and demonstrate that the suppression of these genes in HCM cardiomyocytes is rescued by Mavacamten treatment. Additionally, we establish a novel microglia-specific regulatory relationship between the AD risk variant rs7922621 and its target genes ANXA11 and TSPAN14. We further illustrate the detrimental effects of suppressing these two genes on microglia phagocytosis. Our work provides a multi-tasking, interpretable framework for precise disease prediction and systematic investigation of the genetic, cellular, and molecular basis of complex diseases, laying the methodological foundation for single-cell genetics.
Affiliation(s)
- Sai Zhang
- Department of Epidemiology, University of Florida, Gainesville, FL, USA
- Departments of Biostatistics & Biomedical Engineering, Genetics Institute, McKnight Brain Institute, University of Florida, Gainesville, FL, USA
- Department of Genetics, Center for Genomics and Personalized Medicine, Stanford University School of Medicine, Stanford, CA, USA
- These authors contributed equally: Sai Zhang, Hantao Shu, and Jingtian Zhou
| | - Hantao Shu
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- These authors contributed equally: Sai Zhang, Hantao Shu, and Jingtian Zhou
| | - Jingtian Zhou
- Arc Institute, Palo Alto, CA, USA
- Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
- These authors contributed equally: Sai Zhang, Hantao Shu, and Jingtian Zhou
| | - Jasper Rubin-Sigler
- Department of Stem Cell Biology and Regenerative Medicine, Eli and Edythe Broad Center for Regenerative Medicine and Stem Cell Research, University of Southern California, Los Angeles, CA, USA
| | - Xiaoyu Yang
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA
| | - Yuxi Liu
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA
| | - Johnathan Cooper-Knock
- Sheffield Institute for Translational Neuroscience, University of Sheffield, Sheffield, UK
| | - Emma Monte
- Department of Genetics, Center for Genomics and Personalized Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Chenchen Zhu
- Department of Genetics, Center for Genomics and Personalized Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Sharon Tu
- Department of Stem Cell Biology and Regenerative Medicine, Eli and Edythe Broad Center for Regenerative Medicine and Stem Cell Research, University of Southern California, Los Angeles, CA, USA
| | - Han Li
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
| | - Mingming Tong
- Department of Genetics, Center for Genomics and Personalized Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Joseph R. Ecker
- Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA
- Howard Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA, USA
| | - Justin K. Ichida
- Department of Stem Cell Biology and Regenerative Medicine, Eli and Edythe Broad Center for Regenerative Medicine and Stem Cell Research, University of Southern California, Los Angeles, CA, USA
| | - Yin Shen
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA
- Department of Neurology, Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA, USA
| | - Jianyang Zeng
- School of Engineering, Research Center for Industries of the Future, Westlake University, Hangzhou, Zhejiang, China
| | - Philip S. Tsao
- VA Palo Alto Healthcare System, Palo Alto, CA, USA
- Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Michael P. Snyder
- Department of Genetics, Center for Genomics and Personalized Medicine, Stanford University School of Medicine, Stanford, CA, USA
| |
|
47
|
Siraj L, Castro RI, Dewey H, Kales S, Nguyen TTL, Kanai M, Berenzy D, Mouri K, Wang QS, McCaw ZR, Gosai SJ, Aguet F, Cui R, Vockley CM, Lareau CA, Okada Y, Gusev A, Jones TR, Lander ES, Sabeti PC, Finucane HK, Reilly SK, Ulirsch JC, Tewhey R. Functional dissection of complex and molecular trait variants at single nucleotide resolution. bioRxiv: the preprint server for biology 2024:2024.05.05.592437. [PMID: 38766054 PMCID: PMC11100724 DOI: 10.1101/2024.05.05.592437] [Indexed: 05/22/2024]
Abstract
Identifying the causal variants and mechanisms that drive complex traits and diseases remains a core problem in human genetics. The majority of these variants have individually weak effects and lie in non-coding gene-regulatory elements where we lack a complete understanding of how single nucleotide alterations modulate transcriptional processes to affect human phenotypes. To address this, we measured the activity of 221,412 trait-associated variants that had been statistically fine-mapped using a Massively Parallel Reporter Assay (MPRA) in 5 diverse cell-types. We show that MPRA is able to discriminate between likely causal variants and controls, identifying 12,025 regulatory variants with high precision. Although the effects of these variants largely agree with orthogonal measures of function, only 69% can plausibly be explained by the disruption of a known transcription factor (TF) binding motif. We dissect the mechanisms of 136 variants using saturation mutagenesis and assign impacted TFs for 91% of variants without a clear canonical mechanism. Finally, we provide evidence that epistasis is prevalent for variants in close proximity and identify multiple functional variants on the same haplotype at a small, but important, subset of trait-associated loci. Overall, our study provides a systematic functional characterization of likely causal common variants underlying complex and molecular human traits, enabling new insights into the regulatory grammar underlying disease risk.
Affiliation(s)
- Layla Siraj
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Program in Biophysics, Harvard Graduate School of Arts and Sciences, Boston, MA, USA
- Harvard-Massachusetts Institute of Technology MD/PhD Program, Harvard Medical School, Boston, MA, USA
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
| | | | | | | | | | - Masahiro Kanai
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Stanley Center for Psychiatric Research, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, MA, USA
| | | | | | - Qingbo S. Wang
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Stanley Center for Psychiatric Research, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Department of Statistical Genetics, Osaka University Graduate School of Medicine, Suita, Japan
- Department of Genome Informatics, Graduate School of Medicine, the University of Tokyo, Tokyo, Japan
| | | | - Sager J. Gosai
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Program in Biological and Biomedical Sciences, Harvard Medical School, Boston, MA, USA
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - François Aguet
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Illumina Artificial Intelligence Laboratory, Illumina, San Diego, CA, USA
| | - Ran Cui
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Stanley Center for Psychiatric Research, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | | | - Caleb A. Lareau
- Program in Computational and Systems Biology, Memorial Sloan Kettering Cancer Center, New York, NY, USA
| | - Yukinori Okada
- Department of Statistical Genetics, Osaka University Graduate School of Medicine, Suita, Japan
- Department of Genome Informatics, Graduate School of Medicine, the University of Tokyo, Tokyo, Japan
- Laboratory for Systems Genetics, RIKEN Center for Integrative Medical Sciences, Kanagawa, Japan
| | - Alexander Gusev
- Harvard Medical School and Dana-Farber Cancer Institute, Boston, MA, USA
| | - Thouis R. Jones
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
- The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Eric S. Lander
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Department of Biology, MIT, Cambridge, MA, USA
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | - Pardis C. Sabeti
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - Hilary K. Finucane
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Stanley Center for Psychiatric Research, Broad Institute of Harvard and MIT, Cambridge, MA, USA
| | - Steven K. Reilly
- Department of Genetics, Yale School of Medicine, New Haven, CT, USA
- Wu Tsai Institute, Yale University, New Haven, CT, USA
| | - Jacob C. Ulirsch
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Stanley Center for Psychiatric Research, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Program in Biological and Biomedical Sciences, Harvard Medical School, Boston, MA, USA
- Illumina Artificial Intelligence Laboratory, Illumina, San Diego, CA, USA
| | - Ryan Tewhey
- The Jackson Laboratory, Bar Harbor, ME, USA
- Graduate School of Biomedical Sciences and Engineering, University of Maine, Orono, ME, USA
- Graduate School of Biomedical Sciences, Tufts University School of Medicine, Boston, MA, USA
| |
|
48
|
Wang J, Agarwal V. How DNA encodes the start of transcription. Science 2024; 384:382-383. [PMID: 38662850 DOI: 10.1126/science.adp0869] [Indexed: 05/03/2024]
Abstract
A deep-learning model reveals the rules that define transcription initiation.
Affiliation(s)
- Jun Wang
- mRNA Center of Excellence, Sanofi Pasteur, Inc., Waltham, MA, USA
| | - Vikram Agarwal
- mRNA Center of Excellence, Sanofi Pasteur, Inc., Waltham, MA, USA
| |
|
49
|
Dudnyk K, Cai D, Shi C, Xu J, Zhou J. Sequence basis of transcription initiation in the human genome. Science 2024; 384:eadj0116. [PMID: 38662817 PMCID: PMC11223672 DOI: 10.1126/science.adj0116] [Received: 06/01/2023] [Accepted: 02/28/2024] [Indexed: 05/03/2024]
Abstract
Transcription initiation is a process that is essential to ensuring the proper function of any gene, yet we still lack a unified understanding of sequence patterns and rules that explain most transcription start sites in the human genome. By predicting transcription initiation at base-pair resolution from sequences with a deep learning-inspired explainable model called Puffin, we show that a small set of simple rules can explain transcription initiation at most human promoters. We identify key sequence patterns that contribute to human promoter activity, each activating transcription with distinct position-specific effects. Furthermore, we explain the sequence basis of bidirectional transcription at promoters, identify the links between promoter sequence and gene expression variation across cell types, and explore the conservation of sequence determinants of transcription initiation across mammalian species.
Affiliation(s)
- Kseniia Dudnyk
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center; Dallas, Texas, United States of America
| | - Donghong Cai
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center; Dallas, Texas, United States of America
- Center of Excellence for Leukemia Studies (CELS), Department of Pathology, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
| | - Chenlai Shi
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center; Dallas, Texas, United States of America
| | - Jian Xu
- Center of Excellence for Leukemia Studies (CELS), Department of Pathology, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
| | - Jian Zhou
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center; Dallas, Texas, United States of America
| |
|
50
|
Song W, Shi Y, Lin GN. Haplotype function score improves biological interpretation and cross-ancestry polygenic prediction of human complex traits. eLife 2024; 12:RP92574. [PMID: 38639992 PMCID: PMC11031082 DOI: 10.7554/elife.92574] [Indexed: 04/20/2024] Open
Abstract
We propose a new framework for human genetic association studies: at each locus, a deep learning model (in this study, Sei) is used to calculate the functional genomic activity score for two haplotypes per individual. This score, defined as the Haplotype Function Score (HFS), replaces the original genotype in association studies. Applying the HFS framework to 14 complex traits in the UK Biobank, we identified 3619 independent HFS-trait associations with a significance of p < 5 × 10⁻⁸. Fine-mapping revealed 2699 causal associations, corresponding to a median increase of 63 causal findings per trait compared with single-nucleotide polymorphism (SNP)-based analysis. HFS-based enrichment analysis uncovered 727 pathway-trait associations and 153 tissue-trait associations with strong biological interpretability, including 'circadian pathway-chronotype' and 'arachidonic acid-intelligence'. Lastly, we applied least absolute shrinkage and selection operator (LASSO) regression to integrate HFS prediction score with SNP-based polygenic risk scores, which showed an improvement of 16.1-39.8% in cross-ancestry polygenic prediction. We concluded that HFS is a promising strategy for understanding the genetic basis of human complex traits.
Affiliation(s)
- Weichen Song
- Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Bioengineering, Shanghai Jiao Tong University, Shanghai, China
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders (Ministry of Education), Collaborative Innovation Center for Brain Science, Shanghai Jiao Tong University, Shanghai, China
| | - Yongyong Shi
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders (Ministry of Education), Collaborative Innovation Center for Brain Science, Shanghai Jiao Tong University, Shanghai, China
- Biomedical Sciences Institute of Qingdao University (Qingdao Branch of SJTU Bio-X12 Institutes), Qingdao University, Qingdao, China
| | - Guan Ning Lin
- Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Bioengineering, Shanghai Jiao Tong University, Shanghai, China
| |
|