1
An Z, Jiang A, Chen J. Toward understanding the role of genomic repeat elements in neurodegenerative diseases. Neural Regen Res 2025; 20:646-659. [PMID: 38886931 PMCID: PMC11433896 DOI: 10.4103/nrr.nrr-d-23-01568] [Received: 09/18/2023] [Revised: 12/21/2023] [Accepted: 03/02/2024] [Indexed: 06/20/2024] Open
Abstract
Neurodegenerative diseases cause great medical and economic burdens for both patients and society; however, the complex molecular mechanisms thereof are not yet well understood. With the development of high-coverage sequencing technology, researchers have started to notice that genomic repeat regions, previously neglected in search of disease culprits, are active contributors to multiple neurodegenerative diseases. In this review, we describe the association between repeat element variants and multiple degenerative diseases through genome-wide association studies and targeted sequencing. We discuss the identification of disease-relevant repeat element variants, further powered by the advancement of long-read sequencing technologies and their related tools, and summarize recent findings in the molecular mechanisms of repeat element variants in brain degeneration, such as those causing transcriptional silencing or RNA-mediated gain of toxic function. Furthermore, we describe how in silico predictions using innovative computational models, such as deep learning language models, could enhance and accelerate our understanding of the functional impact of repeat element variants. Finally, we discuss future directions to advance current findings for a better understanding of neurodegenerative diseases and the clinical applications of genomic repeat elements.
Affiliation(s)
- Zhengyu An
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- Aidi Jiang
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- Jingqi Chen
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
  - MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
  - MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Fudan University, Shanghai, China
  - Zhangjiang Fudan International Innovation Center, Shanghai, China
2
Tian Y, Wu X, Luo S, Xiong D, Liu R, Hu L, Yuan Y, Shi G, Yao J, Huang Z, Fu F, Yang X, Tang Z, Zhang J, Hu K. A multi-omic single-cell landscape of cellular diversification in the developing human cerebral cortex. Comput Struct Biotechnol J 2024; 23:2173-2189. [PMID: 38827229 PMCID: PMC11141146 DOI: 10.1016/j.csbj.2024.05.019] [Received: 02/20/2024] [Revised: 05/09/2024] [Accepted: 05/13/2024] [Indexed: 06/04/2024] Open
Abstract
The vast neuronal diversity in the human neocortex is vital for high-order brain functions, necessitating elucidation of the regulatory mechanisms underlying such unparalleled diversity. However, recent studies that profile transcriptomic or epigenomic landscapes have yet to comprehensively reveal the diversity of neurons and the molecular logic of neocortical origin in humans at single-cell resolution, because they apply unimodal data alone to depict exceedingly heterogeneous populations of neurons. In this study, we generated a comprehensive compendium of the developing human neocortex by simultaneously profiling gene expression and open chromatin from the same cell. We computationally reconstructed the differentiation trajectories of excitatory projection neurons of cortical origin and inferred the regulatory logic governing lineage bifurcation decisions for neuronal diversification. We demonstrated that neuronal diversity arises from progenitor cell lineage specificity and postmitotic differentiation at distinct stages. Our data pave the way for understanding the primarily coordinated regulatory logic for neuronal diversification in the neocortex.
Affiliation(s)
- Yuhan Tian
  - Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou 510275, China
- Xia Wu
  - Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou 510275, China
- Songhao Luo
  - School of Mathematics, Sun Yat-sen University, Guangzhou 510275, China
- Dan Xiong
  - Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou 510275, China
- Rong Liu
  - Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou 510275, China
- Lanqi Hu
  - Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou 510275, China
- Yuchen Yuan
  - Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou 510275, China
- Guowei Shi
  - Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou 510275, China
- Junjie Yao
  - Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou 510275, China
- Zhiwei Huang
  - School of Mathematics, Sun Yat-sen University, Guangzhou 510275, China
- Fang Fu
  - Guangzhou Women and Children’s Medical Center, Guangzhou Medical University, Guangzhou 511436, China
- Xin Yang
  - Guangzhou Women and Children’s Medical Center, Guangzhou Medical University, Guangzhou 511436, China
- Zhonghui Tang
  - Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou 510275, China
- Jiajun Zhang
  - School of Mathematics, Sun Yat-sen University, Guangzhou 510275, China
- Kunhua Hu
  - Guangdong Provincial Key Laboratory of Brain Function and Disease, Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou 510275, China
  - Public Platform Laboratory, The Third Affiliated Hospital of Sun Yat-Sen University, Guangzhou 510630, China
3
Moghimianavval H, Gispert I, Castillo SR, Corning OBWH, Liu AP, Cuba Samaniego C. Engineering Sequestration-Based Biomolecular Classifiers with Shared Resources. ACS Synth Biol 2024; 13:3231-3245. [PMID: 39303290 PMCID: PMC11494701 DOI: 10.1021/acssynbio.4c00270] [Received: 04/16/2024] [Revised: 09/08/2024] [Accepted: 09/09/2024] [Indexed: 09/22/2024]
Abstract
Constructing molecular classifiers that enable cells to recognize linear and nonlinear input patterns would expand the biocomputational capabilities of engineered cells, thereby unlocking their potential in diagnostics and therapeutic applications. While several biomolecular classifier schemes have been designed, the effects of biological constraints such as resource limitation and competitive binding on the function of those classifiers have been left unexplored. Here, we first demonstrate the design of a sigma factor-based perceptron as a molecular classifier working based on the principles of molecular sequestration between the sigma factor and its antisigma molecule. We then investigate how the output of the biomolecular perceptron, i.e., its response pattern or decision boundary, is affected by the competitive binding of sigma factors to a pool of shared and limited resources of core RNA polymerase. Finally, we reveal the influence of sharing limited resources on multilayer perceptron neural networks and outline design principles that enable the construction of nonlinear classifiers using sigma-based biomolecular neural networks in the presence of competitive resource-sharing effects.
Affiliation(s)
- Hossein Moghimianavval
  - CSHL Course in Synthetic Biology 2022, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, United States
  - Department of Mechanical Engineering, University of Michigan, Ann Arbor, Michigan 48109, United States
- Ignacio Gispert
  - CSHL Course in Synthetic Biology 2022, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, United States
  - Chemical Engineering Department, Imperial College London, London SW7 2AZ, U.K.
- Santiago R. Castillo
  - CSHL Course in Synthetic Biology 2022, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, United States
  - Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, Minnesota 55905, United States
- Olaf B. W. H. Corning
  - CSHL Course in Synthetic Biology 2022, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, United States
  - Department of Bioengineering, University of Washington, Seattle, Washington 98125, United States
- Allen P. Liu
  - Department of Mechanical Engineering, University of Michigan, Ann Arbor, Michigan 48109, United States
  - Department of Biomedical Engineering, University of Michigan, Ann Arbor, Michigan 48109, United States
  - Department of Biophysics, University of Michigan, Ann Arbor, Michigan 48109, United States
  - Cellular and Molecular Biology Program, University of Michigan, Ann Arbor, Michigan 48109, United States
- Christian Cuba Samaniego
  - CSHL Course in Synthetic Biology 2022, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, United States
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
4
La Fleur A, Shi Y, Seelig G. Decoding biology with massively parallel reporter assays and machine learning. Genes Dev 2024; 38:843-865. [PMID: 39362779 PMCID: PMC11535156 DOI: 10.1101/gad.351800.124] [Indexed: 10/05/2024]
Abstract
Massively parallel reporter assays (MPRAs) are powerful tools for quantifying the impacts of sequence variation on gene expression. Reading out molecular phenotypes with sequencing enables interrogating the impact of sequence variation beyond genome scale. Machine learning models integrate and codify information learned from MPRAs and enable generalization by predicting sequences outside the training data set. Models can provide a quantitative understanding of cis-regulatory codes controlling gene expression, enable variant stratification, and guide the design of synthetic regulatory elements for applications from synthetic biology to mRNA and gene therapy. This review focuses on cis-regulatory MPRAs, particularly those that interrogate cotranscriptional and post-transcriptional processes: alternative splicing, cleavage and polyadenylation, translation, and mRNA decay.
Affiliation(s)
- Alyssa La Fleur
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
- Yongsheng Shi
  - Department of Microbiology and Molecular Genetics, School of Medicine, University of California, Irvine, Irvine, California 92697, USA
- Georg Seelig
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
  - Department of Electrical & Computer Engineering, University of Washington, Seattle, Washington 98195, USA
5
Zhang Y. Path of career planning and employment strategy based on deep learning in the information age. PLoS One 2024; 19:e0308654. [PMID: 39405324 PMCID: PMC11478877 DOI: 10.1371/journal.pone.0308654] [Received: 01/07/2024] [Accepted: 07/27/2024] [Indexed: 10/19/2024] Open
Abstract
As education levels rise and higher education expands, more students have the opportunity to obtain a better education, and the pressure of employment competition is also increasing. How to improve students' employment competitiveness, overall quality, and ability to explore paths for career planning and employment strategies has become a common concern in today's society. Against the background of today's informatization, the paths of career planning and employment strategies are becoming increasingly informatized. Internet support is essential for obtaining more employment information. As a representative product of the information age, deep learning provides people with a better path. This paper conducts an in-depth study of career planning and employment strategy paths based on deep learning in the information age. The research shows that, in the current information age, career planning and employment strategy paths based on deep learning can help students solve the main problems they face in career planning education and better meet the needs of today's society. Career awareness increased by 35% and self-improvement by 15%. This indicates that, in the information age, career planning and employment strategies based on deep learning conform to the trend of the times and can better help college students improve their understanding, promote employment, and promote self-development. This study combines quantitative and qualitative methods, collects data through questionnaires, and uses a deep learning model for analysis. A control group and an experimental group were set up to evaluate the effect of career planning education. Descriptive statistics and correlation analysis were used to ensure the accuracy and reliability of the results.
Affiliation(s)
- Yichi Zhang
  - Enrollment and Employment Division, Southwest Petroleum University, Chengdu, Sichuan, China
6
Awdeh A, Turcotte M, Perkins TJ. Identifying transcription factors with cell-type specific DNA binding signatures. BMC Genomics 2024; 25:957. [PMID: 39402535 PMCID: PMC11472444 DOI: 10.1186/s12864-024-10859-1] [Received: 07/10/2024] [Accepted: 10/02/2024] [Indexed: 10/19/2024] Open
Abstract
BACKGROUND: Transcription factors (TFs) bind to different parts of the genome in different types of cells, but it is usually assumed that the inherent DNA-binding preferences of a TF are invariant to cell type. Yet, there are several known examples of TFs that switch their DNA-binding preferences in different cell types, and yet more examples of other mechanisms, such as steric hindrance or cooperative binding, that may result in a "DNA signature" of differential binding. RESULTS: To survey this phenomenon systematically, we developed a deep learning method we call SigTFB (Signatures of TF Binding) to detect and quantify cell-type specificity in a TF's known genomic binding sites. We used ENCODE ChIP-seq data to conduct a wide-scale investigation of 169 distinct TFs in up to 14 distinct cell types. SigTFB detected statistically significant DNA binding signatures in approximately two-thirds of TFs, far more than might have been expected from the relatively sparse evidence in prior literature. We found that the presence or absence of a cell-type specific DNA binding signature is distinct from, and indeed largely uncorrelated to, the degree of overlap between ChIP-seq peaks in different cell types, and that signatures tended to arise by two mechanisms: using established motifs in different frequencies, and selective inclusion of motifs for distinct TFs. CONCLUSIONS: While recent results have highlighted cell state features such as chromatin accessibility and gene expression in predicting TF binding, our results emphasize that, for some TFs, the DNA sequences of the binding sites contain substantial cell-type specific motifs.
Affiliation(s)
- Aseel Awdeh
  - School of Electrical Engineering and Computer Science, University of Ottawa, 800 King Edward Ave., Ottawa, K1N 6N5, Ontario, Canada
  - Regenerative Medicine Program, Ottawa Hospital Research Institute, 501 Smyth Rd., Ottawa, K1H 8L6, Ontario, Canada
- Marcel Turcotte
  - School of Electrical Engineering and Computer Science, University of Ottawa, 800 King Edward Ave., Ottawa, K1N 6N5, Ontario, Canada
- Theodore J Perkins
  - School of Electrical Engineering and Computer Science, University of Ottawa, 800 King Edward Ave., Ottawa, K1N 6N5, Ontario, Canada
  - Regenerative Medicine Program, Ottawa Hospital Research Institute, 501 Smyth Rd., Ottawa, K1H 8L6, Ontario, Canada
  - Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa, 451 Smyth Rd., Ottawa, K1H 8M5, Ontario, Canada
7
Ramprasad P, Pai N, Pan W. Enhancing personalized gene expression prediction from DNA sequences using genomic foundation models. HGG Advances 2024; 5:100347. [PMID: 39205391 PMCID: PMC11416237 DOI: 10.1016/j.xhgg.2024.100347] [Received: 03/07/2024] [Revised: 08/23/2024] [Accepted: 08/23/2024] [Indexed: 09/04/2024] Open
Abstract
Artificial intelligence (AI)/deep learning (DL) models that predict molecular phenotypes like gene expression directly from DNA sequences have recently emerged. While these models have proven effective at capturing the variation across genes, their ability to explain inter-individual differences has been limited. We hypothesize that the performance gap can be narrowed through the use of pre-trained embeddings from the Nucleotide Transformer, a large foundation model trained on 3,000+ genomes. We train a transformer model using the pre-trained embeddings and compare its predictive performance to Enformer, the current state-of-the-art model, using genotype and expression data from 290 individuals. Our model significantly outperforms Enformer in terms of correlation across individuals, and narrows the performance gap with an elastic net regression approach that uses just the genetic variants as predictors. Although simple regression models have their advantages in personalized prediction tasks, DL approaches based on foundation models pre-trained on diverse genomes have unique strengths in flexibility and interpretability. With further methodological and computational improvements with more training data, these models may eventually predict molecular phenotypes from DNA sequences with an accuracy surpassing that of regression-based approaches. Our work demonstrates the potential for large pre-trained AI/DL models to advance functional genomics.
Affiliation(s)
- Pratik Ramprasad
  - Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, MN, USA
- Nidhi Pai
  - Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, MN, USA
- Wei Pan
  - Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, MN, USA
8
Harding-Larsen D, Funk J, Madsen NG, Gharabli H, Acevedo-Rocha CG, Mazurenko S, Welner DH. Protein representations: Encoding biological information for machine learning in biocatalysis. Biotechnol Adv 2024; 77:108459. [PMID: 39366493 DOI: 10.1016/j.biotechadv.2024.108459] [Received: 04/18/2024] [Revised: 09/19/2024] [Accepted: 09/29/2024] [Indexed: 10/06/2024]
Abstract
Enzymes offer a more environmentally friendly and low-impact solution to conventional chemistry, but they often require additional engineering for their application in industrial settings, an endeavour that is challenging and laborious. To address this issue, the power of machine learning can be harnessed to produce predictive models that enable the in silico study and engineering of improved enzymatic properties. Such machine learning models, however, require the conversion of the complex biological information to a numerical input, also called protein representations. These inputs demand special attention to ensure the training of accurate and precise models, and, in this review, we therefore examine the critical step of encoding protein information to numeric representations for use in machine learning. We selected the most important approaches for encoding the three distinct biological protein representations - primary sequence, 3D structure, and dynamics - to explore their requirements for employment and inductive biases. Combined representations of proteins and substrates are also introduced as emergent tools in biocatalysis. We propose the division of fixed representations, a collection of rule-based encoding strategies, and learned representations extracted from the latent spaces of large neural networks. To select the most suitable protein representation, we propose two main factors to consider. The first one is the model setup, which is influenced by the size of the training dataset and the choice of architecture. The second factor is the model objectives such as consideration about the assayed property, the difference between wild-type models and mutant predictors, and requirements for explainability. This review is aimed at serving as a source of information and guidance for properly representing enzymes in future machine learning models for biocatalysis.
Affiliation(s)
- David Harding-Larsen
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Jonathan Funk
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Niklas Gesmar Madsen
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Hani Gharabli
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Carlos G Acevedo-Rocha
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Stanislav Mazurenko
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
  - International Clinical Research Center, St. Anne's University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
- Ditte Hededam Welner
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
9
Capitanchik C, Wilkins OG, Wagner N, Gagneur J, Ule J. From computational models of the splicing code to regulatory mechanisms and therapeutic implications. Nat Rev Genet 2024:10.1038/s41576-024-00774-2. [PMID: 39358547 DOI: 10.1038/s41576-024-00774-2] [Accepted: 08/27/2024] [Indexed: 10/04/2024]
Abstract
Since the discovery of RNA splicing and its role in gene expression, researchers have sought a set of rules, an algorithm or a computational model that could predict the splice isoforms, and their frequencies, produced from any transcribed gene in a specific cellular context. Over the past 30 years, these models have evolved from simple position weight matrices to deep-learning models capable of integrating sequence data across vast genomic distances. Most recently, new model architectures are moving the field closer to context-specific alternative splicing predictions, and advances in sequencing technologies are expanding the type of data that can be used to inform and interpret such models. Together, these developments are driving improved understanding of splicing regulatory mechanisms and emerging applications of the splicing code to the rational design of RNA- and splicing-based therapeutics.
Affiliation(s)
- Charlotte Capitanchik
  - The Francis Crick Institute, London, UK
  - UK Dementia Research Institute at King's College London, London, UK
  - Department of Basic and Clinical Neuroscience, Institute of Psychiatry Psychology & Neuroscience, King's College London, London, UK
- Oscar G Wilkins
  - The Francis Crick Institute, London, UK
  - UCL Queen Square Motor Neuron Disease Centre, Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, UCL, London, UK
- Nils Wagner
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Helmholtz Association - Munich School for Data Science (MUDS), Munich, Germany
- Julien Gagneur
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany
  - Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany
- Jernej Ule
  - The Francis Crick Institute, London, UK
  - UK Dementia Research Institute at King's College London, London, UK
  - Department of Basic and Clinical Neuroscience, Institute of Psychiatry Psychology & Neuroscience, King's College London, London, UK
  - National Institute of Chemistry, Ljubljana, Slovenia
10
Gasparini K, Figueiredo YG, Araújo WL, Peres LE, Zsögön A. De novo domestication in the Solanaceae: advances and challenges. Curr Opin Biotechnol 2024; 89:103177. [PMID: 39106791 DOI: 10.1016/j.copbio.2024.103177] [Received: 04/03/2024] [Revised: 06/21/2024] [Accepted: 07/19/2024] [Indexed: 08/09/2024]
Abstract
The advent of highly efficient genome editing (GE) tools, coupled with high-throughput genome sequencing, has paved the way for the accelerated domestication of crop wild relatives. New crops could thus be rapidly created that are well adapted to cope with drought, flooding, soil salinity, or insect damage. De novo domestication avoids the complexity of transferring polygenic stress resistance from wild species to crops. Instead, new crops can be created by manipulating major genes in stress-resistant wild species. However, the genetic basis of certain relevant domestication-related traits often involves epistasis and pleiotropy. Furthermore, pan-genome analyses show that structural variation driving gene expression changes has been selected during domestication. A growing body of work suggests that the Solanaceae family, which includes crop species such as tomatoes, potatoes, eggplants, peppers, and tobacco, is a suitable model group to dissect these phenomena and make changes in wild relatives to improve agronomic traits rapidly with GE. We briefly discuss the prospects of this exciting novel field at the interface between fundamental and applied plant biology and its potential impact in the coming years.
Affiliation(s)
- Karla Gasparini
  - National Institute of Science and Technology on Plant Physiology Under Stress Conditions, Departamento de Biologia Vegetal, Universidade Federal de Viçosa, 36570-900 Viçosa, MG, Brazil
- Yuri G Figueiredo
  - National Institute of Science and Technology on Plant Physiology Under Stress Conditions, Departamento de Biologia Vegetal, Universidade Federal de Viçosa, 36570-900 Viçosa, MG, Brazil
- Wagner L Araújo
  - National Institute of Science and Technology on Plant Physiology Under Stress Conditions, Departamento de Biologia Vegetal, Universidade Federal de Viçosa, 36570-900 Viçosa, MG, Brazil
- Lázaro Ep Peres
  - Laboratory of Hormonal Control of Plant Development, Departamento de Ciências Biológicas, Escola Superior de Agricultura "Luiz de Queiroz", Universidade de São Paulo, 13418-900 Piracicaba, SP, Brazil
- Agustin Zsögön
  - National Institute of Science and Technology on Plant Physiology Under Stress Conditions, Departamento de Biologia Vegetal, Universidade Federal de Viçosa, 36570-900 Viçosa, MG, Brazil
11
Guo Z, Zhang K, Cai C, Li X, Zhang L, Yang Y, Wang X, Chen S, Zhang L, Cheng F. Deep learning can predict subgenome dominance in ancient but not in neo/synthetic polyploidized genomes. The Plant Journal: For Cell and Molecular Biology 2024; 120:174-186. [PMID: 39133828 DOI: 10.1111/tpj.16979] [Received: 01/12/2024] [Revised: 06/28/2024] [Accepted: 07/31/2024] [Indexed: 09/27/2024]
Abstract
Deep learning offers new approaches to investigate the mechanisms underlying complex biological phenomena, such as subgenome dominance. Subgenome dominance refers to the dominant expression and/or biased fractionation of genes in one subgenome of allopolyploids, which has shaped the evolution of a large group of plants. However, the underlying cause of subgenome dominance remains elusive. Here, we adopt deep learning to construct two convolutional neural network (CNN) models, binary expression model (BEM) and homoeolog contrast model (HCM), to investigate the mechanism underlying subgenome dominance using DNA sequence and methylation sites. We apply these CNN models to analyze three representative polyploidization systems, Brassica, Gossypium, and Cucurbitaceae, each with available ancient and neo/synthetic polyploidized genomes. The BEM shows that DNA sequence of the promoter region can accurately predict whether a gene is expressed or not. More importantly, the HCM shows that the DNA sequence of the promoter region predicts dominant expression status between homoeologous gene pairs retained from ancient polyploidizations, thus predicting subgenome dominance associated with these events. However, HCM fails to predict gene expression dominance between new homoeologous gene pairs arising from the neo/synthetic polyploidizations. These results are consistent across the three plant polyploidization systems, indicating broad applicability of our models. Furthermore, the two models based on methylation sites produce similar results. These results show that subgenome dominance is associated with long-term sequence differentiation between the promoters of homoeologs, suggesting that subgenome expression dominance precedes and is the driving force or even the determining factor for sequence divergence between subgenomes following polyploidization.
Collapse
Affiliation(s)
- Zhongwei Guo
- State Key Laboratory of Vegetable Biobreeding, Key Laboratory of Biology and Genetic Improvement of Horticultural Crops of the Ministry of Agriculture and Rural Affairs, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Kang Zhang
- State Key Laboratory of Vegetable Biobreeding, Key Laboratory of Biology and Genetic Improvement of Horticultural Crops of the Ministry of Agriculture and Rural Affairs, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Chengcheng Cai
- State Key Laboratory of Vegetable Biobreeding, Key Laboratory of Biology and Genetic Improvement of Horticultural Crops of the Ministry of Agriculture and Rural Affairs, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Xing Li
- State Key Laboratory of Vegetable Biobreeding, Key Laboratory of Biology and Genetic Improvement of Horticultural Crops of the Ministry of Agriculture and Rural Affairs, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Lingkui Zhang
- State Key Laboratory of Vegetable Biobreeding, Key Laboratory of Biology and Genetic Improvement of Horticultural Crops of the Ministry of Agriculture and Rural Affairs, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Yinqing Yang
- State Key Laboratory of Vegetable Biobreeding, Key Laboratory of Biology and Genetic Improvement of Horticultural Crops of the Ministry of Agriculture and Rural Affairs, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Xiang Wang
- State Key Laboratory of Vegetable Biobreeding, Key Laboratory of Biology and Genetic Improvement of Horticultural Crops of the Ministry of Agriculture and Rural Affairs, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Shumin Chen
- State Key Laboratory of Vegetable Biobreeding, Key Laboratory of Biology and Genetic Improvement of Horticultural Crops of the Ministry of Agriculture and Rural Affairs, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Lei Zhang
- State Key Laboratory of Vegetable Biobreeding, Key Laboratory of Biology and Genetic Improvement of Horticultural Crops of the Ministry of Agriculture and Rural Affairs, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Feng Cheng
- State Key Laboratory of Vegetable Biobreeding, Key Laboratory of Biology and Genetic Improvement of Horticultural Crops of the Ministry of Agriculture and Rural Affairs, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China
|
12
|
Sigfstead S, Jiang R, Avram R, Davies B, Krahn AD, Cheung CC. Applying Artificial Intelligence for Phenotyping of Inherited Arrhythmia Syndromes. Can J Cardiol 2024; 40:1841-1851. [PMID: 38670456 DOI: 10.1016/j.cjca.2024.04.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2024] [Revised: 04/08/2024] [Accepted: 04/21/2024] [Indexed: 04/28/2024] Open
Abstract
Inherited arrhythmia disorders account for a significant proportion of sudden cardiac death, particularly among young individuals. Recent advances in our understanding of these syndromes have improved patient diagnosis and care, yet certain clinical gaps remain, particularly within case ascertainment, access to genetic testing, and risk stratification. Artificial intelligence (AI), specifically machine learning and its subset deep learning, presents promising solutions to these challenges. The capacity of AI models to process vast amounts of patient data and identify disease patterns differentiates them from traditional methods, which are time- and resource-intensive. To date, AI models have shown immense potential in condition detection (including asymptomatic/concealed disease) and genotype and phenotype identification, exceeding expert cardiologists in these tasks. Additionally, they have exhibited applicability for general population screening, improving case ascertainment in a set of conditions that are often asymptomatic, such as left ventricular dysfunction. Finally, models have shown the ability to improve testing protocols; through model identification of disease and genotype, specific clinical testing (eg, drug challenges or further diagnostic imaging) can be avoided, reducing health care expenses, speeding diagnosis, and possibly allowing for more incremental or targeted genetic testing approaches. These significant benefits warrant continued investigation of AI, particularly regarding the development and implementation of clinically applicable screening tools. In this review we summarize key developments in AI, including studies in long QT syndrome, Brugada syndrome, hypertrophic cardiomyopathy, and arrhythmogenic cardiomyopathies, and provide direction for effective future AI implementation in clinical practice.
Affiliation(s)
- Sophie Sigfstead
- Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Alberta, Canada
| | - River Jiang
- Division of Cardiology, University of British Columbia, Vancouver, British Columbia, Canada
| | - Robert Avram
- Heartwise (heartwise.ai), Montreal Heart Institute, Montreal, Quebec, Canada; Department of Medicine, Montreal Heart Institute, Université de Montréal, Montreal, Quebec, Canada
| | - Brianna Davies
- Center for Cardiovascular Innovation, Division of Cardiology, Department of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
| | - Andrew D Krahn
- Center for Cardiovascular Innovation, Division of Cardiology, Department of Medicine, University of British Columbia, Vancouver, British Columbia, Canada.
| | - Christopher C Cheung
- Division of Cardiology, Sunnybrook Health Sciences Centre, University of Toronto, Toronto, Ontario, Canada
|
13
|
Li Q, Hu Z, Wang Y, Li L, Fan Y, King I, Jia G, Wang S, Song L, Li Y. Progress and opportunities of foundation models in bioinformatics. Brief Bioinform 2024; 25:bbae548. [PMID: 39461902 PMCID: PMC11512649 DOI: 10.1093/bib/bbae548] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2024] [Revised: 08/20/2024] [Accepted: 10/12/2024] [Indexed: 10/29/2024] Open
Abstract
Bioinformatics has undergone a paradigm shift in artificial intelligence (AI), particularly through foundation models (FMs), which address longstanding challenges in bioinformatics such as limited annotated data and data noise. These AI techniques have demonstrated remarkable efficacy across various downstream validation tasks, effectively representing diverse biological entities and heralding a new era in computational biology. The primary goal of this survey is to conduct a general investigation and summary of FMs in bioinformatics, tracing their evolutionary trajectory, current research landscape, and methodological frameworks. Our primary focus is on elucidating the application of FMs to specific biological problems, offering insights to guide the research community in choosing appropriate FMs for tasks like sequence analysis, structure prediction, and function annotation. Each section delves into the intricacies of the targeted challenges, contrasting the architectures and advancements of FMs with conventional methods and showcasing their utility across different biological domains. Further, this review scrutinizes the hurdles and constraints encountered by FMs in biology, including issues of data noise, model interpretability, and potential biases. This analysis provides a theoretical groundwork for understanding the circumstances under which certain FMs may exhibit suboptimal performance. Lastly, we outline prospective pathways and methodologies for the future development of FMs in biological research, facilitating ongoing innovation in the field. This comprehensive examination not only serves as an academic reference but also as a roadmap for forthcoming explorations and applications of FMs in biology.
Affiliation(s)
- Qing Li
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
| | - Zhihang Hu
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
| | - Yixuan Wang
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
| | - Lei Li
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
| | - Yimin Fan
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
| | - Irwin King
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
| | - Gengjie Jia
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, 518120, China
| | - Sheng Wang
- Shanghai Zelixir Biotech Company Ltd., Shanghai, 200030, China
- Shenzhen Institute of Advanced Technology, Xueyuan Avenue, Shenzhen University Town, Nanshan District, Shenzhen, Guangdong, 518055, China
| | - Le Song
- BioMap, Zhongguancun Life Science Park, Haidian District, Beijing, 100085, China
| | - Yu Li
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
|
14
|
Jo T, Bice P, Nho K, Saykin AJ. Linkage Disequilibrium-Informed Deep Learning Framework to Identify Genetic Loci for Alzheimer's Disease Using Whole Genome Sequencing Data. medRxiv 2024:2024.09.19.24313993. [PMID: 39371140 PMCID: PMC11451815 DOI: 10.1101/2024.09.19.24313993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/08/2024]
Abstract
The exponential growth of genomic datasets necessitates advanced analytical tools to effectively identify genetic loci from large-scale high-throughput sequencing data. This study presents Deep-Block, a multi-stage deep learning framework that incorporates biological knowledge into its AI architecture to identify genetic regions significantly associated with Alzheimer's disease (AD). The framework employs a three-stage approach: (1) genome segmentation based on linkage disequilibrium (LD) patterns, (2) selection of relevant LD blocks using sparse attention mechanisms, and (3) application of TabNet and Random Forest algorithms to quantify single nucleotide polymorphism (SNP) feature importance, thereby identifying genetic factors contributing to AD risk. Deep-Block was applied to a large-scale whole genome sequencing (WGS) dataset from the Alzheimer's Disease Sequencing Project (ADSP), comprising 7,416 non-Hispanic white participants (3,150 cognitively normal older adults (CN), 4,266 AD). First, 30,218 LD blocks were identified and then ranked based on their relevance to Alzheimer's disease. Subsequently, Deep-Block identified novel SNPs within the top 1,500 LD blocks and confirmed previously known variants, including APOE rs429358 and rs769449. The results were cross-validated against established AD-associated loci from the European Alzheimer's and Dementia Biobank (EADB) and the GWAS catalog. The Deep-Block framework effectively processes large-scale high-throughput sequencing data while preserving interactions between SNPs during dimensionality reduction, a step that can otherwise introduce bias or lead to information loss. The Deep-Block approach identified both known and novel genetic variation, enhancing our understanding of the genetic architecture of AD and demonstrating the framework's potential for application in large-scale sequencing studies.
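Stage (1) above, genome segmentation by LD pattern, can be pictured with a deliberately simple sketch: compute the squared correlation (r²) of genotype dosages between adjacent SNPs and start a new block whenever it falls below a threshold. This is an assumption-laden toy, not Deep-Block's segmentation procedure; the 0.5 threshold, the adjacent-pair rule, the toy genotypes, and the function names are all hypothetical.

```python
# Toy LD-based segmentation: adjacent SNPs stay in one block while r^2 >= threshold.

def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_x * var_y)

def segment_ld_blocks(genotypes, threshold=0.5):
    """genotypes: one dosage list per SNP, in genomic order. Returns lists of SNP indices."""
    if not genotypes:
        return []
    blocks = [[0]]
    for i in range(1, len(genotypes)):
        if r_squared(genotypes[i - 1], genotypes[i]) >= threshold:
            blocks[-1].append(i)   # still in LD with the previous SNP
        else:
            blocks.append([i])     # LD breaks down: open a new block
    return blocks

# Hypothetical dosages for 4 SNPs across 6 individuals.
toy_genotypes = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 0, 1, 1, 2, 0],
    [2, 0, 1, 1, 2, 0],
]
ld_blocks = segment_ld_blocks(toy_genotypes)
print(ld_blocks)
```

Established tools use more careful block definitions (for example, confidence bounds on pairwise LD within a window), so the adjacent-pair rule here should be read only as the simplest possible instance of the idea.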
Affiliation(s)
- Taeho Jo
- Indiana Alzheimer Disease Research Center and Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, IN 46202, USA
| | - Paula Bice
- Indiana Alzheimer Disease Research Center and Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, IN 46202, USA
| | - Kwangsik Nho
- Indiana Alzheimer Disease Research Center and Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
| | - Andrew J Saykin
- Indiana Alzheimer Disease Research Center and Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
|
15
|
Wabro A, Herrmann M, Winkler EC. When time is of the essence: ethical reconsideration of XAI in time-sensitive environments. J Med Ethics 2024:jme-2024-110046. [PMID: 39299730 DOI: 10.1136/jme-2024-110046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2024] [Accepted: 09/06/2024] [Indexed: 09/22/2024]
Abstract
The objective of explainable artificial intelligence systems designed for clinical decision support (XAI-CDSS) is to enhance physicians' diagnostic performance, confidence and trust through the implementation of interpretable methods, thus providing for a superior epistemic positioning, a robust foundation for critical reflection and trustworthiness in times of heightened technological dependence. However, recent studies have revealed shortcomings in achieving these goals, questioning the widespread endorsement of XAI by medical professionals, ethicists and policy-makers alike. Based on a surgical use case, this article challenges generalising calls for XAI-CDSS and emphasises the significance of time-sensitive clinical environments which frequently preclude adequate consideration of system explanations. Therefore, XAI-CDSS may not be able to meet expectations of augmenting clinical decision-making in specific circumstances where time is of the essence. This article, by employing a principled ethical balancing methodology, highlights several fallacies associated with XAI deployment in time-sensitive clinical situations and recommends XAI endorsement only where scientific evidence or stakeholder assessments do not contradict such deployment in specific target settings.
Affiliation(s)
- Andreas Wabro
- National Center for Tumor Diseases (NCT) Heidelberg, NCT Heidelberg, a partnership between DKFZ and Heidelberg University Hospital, Germany, Heidelberg University, Medical Faculty Heidelberg, Heidelberg University Hospital, Department of Medical Oncology, Section Translational Medical Ethics, Heidelberg, Germany
| | - Markus Herrmann
- National Center for Tumor Diseases (NCT) Heidelberg, NCT Heidelberg, a partnership between DKFZ and Heidelberg University Hospital, Germany, Heidelberg University, Medical Faculty Heidelberg, Heidelberg University Hospital, Department of Medical Oncology, Section Translational Medical Ethics, Heidelberg, Germany
| | - Eva C Winkler
- National Center for Tumor Diseases (NCT) Heidelberg, NCT Heidelberg, a partnership between DKFZ and Heidelberg University Hospital, Germany, Heidelberg University, Medical Faculty Heidelberg, Heidelberg University Hospital, Department of Medical Oncology, Section Translational Medical Ethics, Heidelberg, Germany
|
16
|
Trapnell C. Revealing gene function with statistical inference at single-cell resolution. Nat Rev Genet 2024; 25:623-638. [PMID: 38951690 DOI: 10.1038/s41576-024-00750-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/21/2024] [Indexed: 07/03/2024]
Abstract
Single-cell and spatial molecular profiling assays have shown large gains in sensitivity, resolution and throughput. Applying these technologies to specimens from human and model organisms promises to comprehensively catalogue cell types, reveal their lineage origins in development and discern their contributions to disease pathogenesis. Moreover, rapidly dropping costs have made well-controlled perturbation experiments and cohort studies widely accessible, illuminating mechanisms that give rise to phenotypes at the scale of the cell, the tissue and the whole organism. Interpreting the coming flood of single-cell data, much of which will be spatially resolved, will place a tremendous burden on existing computational pipelines. However, statistical concepts, models, tools and algorithms can be repurposed to solve problems now arising in genetic and molecular biology studies of development and disease. Here, I review how the questions that recent technological innovations promise to answer can be addressed by the major classes of statistical tools.
Affiliation(s)
- Cole Trapnell
- Department of Genome Sciences, University of Washington, Seattle, WA, USA.
- Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA, USA.
- Allen Discovery Center for Cell Lineage Tracing, Seattle, WA, USA.
- Seattle Hub for Synthetic Biology, Seattle, WA, USA.
|
17
|
Yang B, Zhou X, Liu S. Tracing the genealogy origin of geographic populations based on genomic variation and deep learning. Mol Phylogenet Evol 2024; 198:108142. [PMID: 38964594 DOI: 10.1016/j.ympev.2024.108142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Revised: 05/30/2024] [Accepted: 07/01/2024] [Indexed: 07/06/2024]
Abstract
Assigning a query individual animal or plant to its source population is a prime task in diverse applications related to organismal genealogy. Such endeavors have conventionally relied on short DNA sequences under a phylogenetic framework. These methods naturally show constraints when the inferred population sources are ambiguously phylogenetically structured, a scenario demanding substantially more informative genetic signals. Recent advances in cost-effective production of whole-genome sequences and artificial intelligence have created an unprecedented opportunity to trace the population origin for essentially any given individual, as long as the genome reference data are comprehensive and standardized. Here, we developed a convolutional neural network method to identify population origins using genomic SNPs. Three empirical datasets (Asian honeybee, red fire ant, and chicken) and two simulated populations are used for the proof of concept. The performance tests indicate that our method can accurately identify the genealogy origin of query individuals, with success rates ranging from 93% to 100%. We further showed that the accuracy of the model can be significantly increased by refining the informative sites through FST filtering. Our method is robust to configurations related to batch sizes and epochs, whereas model learning benefits from a properly chosen preset learning rate. Moreover, we explained the importance scores of key sites for algorithm interpretability and credibility, an aspect that has been largely ignored. We anticipate that by coupling genomics and deep learning, our method will see broad potential in conservation and management applications that involve natural resources, invasive pests and weeds, and illegal trades of wildlife products.
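The FST filtering step mentioned above can be sketched as follows, assuming biallelic SNPs summarized by per-population allele frequencies. The abstract does not say which FST estimator or cutoff was used; this sketch uses the textbook Wright definition, FST = (H_T - H_S) / H_T, with equal population weights. The frequency table, the 0.3 cutoff, and the function names are hypothetical.

```python
# Minimal per-site FST filter using Wright's definition (equal population weights).

def site_fst(freqs):
    """FST for one biallelic site given the allele frequency in each population."""
    p_bar = sum(freqs) / len(freqs)
    h_total = 2 * p_bar * (1 - p_bar)                            # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                                               # site is fixed everywhere
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)  # mean within-population value
    return (h_total - h_within) / h_total

def filter_sites(freq_table, min_fst):
    """Return indices of sites whose FST is at least min_fst."""
    return [i for i, freqs in enumerate(freq_table) if site_fst(freqs) >= min_fst]

# Hypothetical allele frequencies: one row per SNP, one column per population.
toy_freqs = [
    [0.5, 0.5],  # no differentiation
    [0.0, 1.0],  # fixed difference between populations
    [0.2, 0.8],  # moderate differentiation
]
fst_values = [site_fst(row) for row in toy_freqs]
informative = filter_sites(toy_freqs, min_fst=0.3)
print(fst_values, informative)
```

Sites that pass the filter are the ones whose frequencies differ most between populations, which is why restricting the classifier input to them can sharpen population assignment.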
Affiliation(s)
- Bing Yang
- Department of Entomology, China Agricultural University, Beijing 100193, China
| | - Xin Zhou
- Department of Entomology, China Agricultural University, Beijing 100193, China.
| | - Shanlin Liu
- Department of Entomology, China Agricultural University, Beijing 100193, China; Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China.
|
18
|
Chen Z, Liang N, Li H, Zhang H, Li H, Yan L, Hu Z, Chen Y, Zhang Y, Wang Y, Ke D, Shi N. Exploring explainable AI features in the vocal biomarkers of lung disease. Comput Biol Med 2024; 179:108844. [PMID: 38981214 DOI: 10.1016/j.compbiomed.2024.108844] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2024] [Revised: 05/15/2024] [Accepted: 06/04/2024] [Indexed: 07/11/2024]
Abstract
This review delves into the burgeoning field of explainable artificial intelligence (XAI) in the detection and analysis of lung diseases through vocal biomarkers. Lung diseases, often elusive in their early stages, pose a significant public health challenge. Recent advancements in AI have ushered in innovative methods for early detection, yet the black-box nature of many AI models limits their clinical applicability. XAI emerges as a pivotal tool, enhancing transparency and interpretability in AI-driven diagnostics. This review synthesizes current research on the application of XAI in analyzing vocal biomarkers for lung diseases, highlighting how these techniques elucidate the connections between specific vocal features and lung pathology. We critically examine the methodologies employed, the types of lung diseases studied, and the performance of various XAI models. The potential for XAI to aid in early detection, monitor disease progression, and personalize treatment strategies in pulmonary medicine is emphasized. Furthermore, this review identifies current challenges, including data heterogeneity and model generalizability, and proposes future directions for research. By offering a comprehensive analysis of explainable AI features in the context of lung disease detection, this review aims to bridge the gap between advanced computational approaches and clinical practice, paving the way for more transparent, reliable, and effective diagnostic tools.
Affiliation(s)
- Zhao Chen
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, China
| | - Ning Liang
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, China
| | - Haoyuan Li
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, China
| | - Haili Zhang
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, China
| | - Huizhen Li
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, China
| | - Lijiao Yan
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, China
| | - Ziteng Hu
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, China
| | - Yaxin Chen
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, China
| | - Yujing Zhang
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, China
| | - Yanping Wang
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, China
| | - Dandan Ke
- Special Disease Clinic, Huaishuling Branch of Beijing Fengtai Hospital of Integrated Traditional Chinese and Western Medicine, Beijing, China.
| | - Nannan Shi
- State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Beijing, China; Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, China.
|
19
|
Rakowski A, Monti R, Huryn V, Lemanczyk M, Ohler U, Lippert C. Metadata-guided feature disentanglement for functional genomics. Bioinformatics 2024; 40:ii4-ii10. [PMID: 39230700 PMCID: PMC11373386 DOI: 10.1093/bioinformatics/btae403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024] Open
Abstract
With the development of high-throughput technologies, genomics datasets rapidly grow in size, including functional genomics data. This has allowed the training of large Deep Learning (DL) models to predict epigenetic readouts, such as protein binding or histone modifications, from genome sequences. However, large dataset sizes come at the price of data consistency, often aggregating results from a large number of studies conducted under varying experimental conditions. While data from large-scale consortia are useful as they allow studying the effects of different biological conditions, they can also contain unwanted biases from confounding experimental factors. Here, we introduce Metadata-guided Feature Disentanglement (MFD), an approach that allows disentangling biologically relevant features from potential technical biases. MFD incorporates target metadata into model training by conditioning the weights of the model output layer on different experimental factors. It then separates the factors into disjoint groups and enforces independence of the corresponding feature subspaces with an adversarially learned penalty. We show that the metadata-driven disentanglement approach allows for better model introspection by connecting latent features to experimental factors, without compromising, and in some cases even improving, performance in downstream tasks such as enhancer prediction or genetic variant discovery. The code will be made available at https://github.com/HealthML/MFD.
Affiliation(s)
- Alexander Rakowski
- Digital Health Machine Learning, Hasso Plattner Institute for Digital Engineering, Digital Engineering, University of Potsdam, Campus III Building G2, Rudolf-Breitscheid-Strasse 187, Potsdam, Brandenburg, 14482, Germany
| | - Remo Monti
- Digital Health Machine Learning, Hasso Plattner Institute for Digital Engineering, Digital Engineering, University of Potsdam, Campus III Building G2, Rudolf-Breitscheid-Strasse 187, Potsdam, Brandenburg, 14482, Germany
- Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Berlin Institute for Medical Systems Biology, Department of Biology, Humboldt Universität Berlin, Hannoversche Strasse 28, Building 101, Room 1.05, Berlin, 10115, Germany
| | - Viktoriia Huryn
- Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Berlin Institute for Medical Systems Biology, Department of Biology, Humboldt Universität Berlin, Hannoversche Strasse 28, Building 101, Room 1.05, Berlin, 10115, Germany
| | - Marta Lemanczyk
- Data Analytics and Computational Statistics, Hasso Plattner Institute for Digital Engineering, Digital Engineering, University of Potsdam, Potsdam, Brandenburg, 14482, Germany
| | - Uwe Ohler
- Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Berlin Institute for Medical Systems Biology, Department of Biology, Humboldt Universität Berlin, Hannoversche Strasse 28, Building 101, Room 1.05, Berlin, 10115, Germany
| | - Christoph Lippert
- Digital Health Machine Learning, Hasso Plattner Institute for Digital Engineering, Digital Engineering, University of Potsdam, Campus III Building G2, Rudolf-Breitscheid-Strasse 187, Potsdam, Brandenburg, 14482, Germany
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, United States of America
|
20
|
Liu GY, Yan MD, Mai YY, Fu FJ, Pan L, Zhu JM, Ji WJ, Hu J, Li WP, Xie W. Frontiers and hotspots in anxiety disorders: A bibliometric analysis from 2004 to 2024. Heliyon 2024; 10:e35701. [PMID: 39220967 PMCID: PMC11365340 DOI: 10.1016/j.heliyon.2024.e35701] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2024] [Revised: 06/05/2024] [Accepted: 08/01/2024] [Indexed: 09/04/2024] Open
Abstract
Objective: This study aimed to analyze research on anxiety disorders using VOSviewer and CiteSpace to identify research hotspots and future directions. Methods: We conducted a comprehensive search of the Web of Science Core Collection (WoSCC) for relevant studies on anxiety disorders published within the past two decades (from 2004 to 2024). VOSviewer and CiteSpace were mainly used to analyze the authors, institutions, countries, publishing journals, reference co-citation patterns, keyword co-occurrence, keyword clustering, and other aspects to construct a knowledge atlas. Results: A total of 22,267 publications related to anxiety disorders were retrieved. The number of publications about anxiety disorders has generally increased over time, with some fluctuations. The United States emerged as the most productive country, with Harvard University identified as the most prolific institution and Brenda W. J. H. Penninx as the most prolific author in the field. Conclusion: This research identified the most influential publications, authors, journals, institutions, and countries in the field of anxiety research. Future research directions include advanced treatments based on pharmacotherapy, psychotherapy and digital interventions; exploration of the neurobiological and genetic mechanisms of anxiety disorders; and the influence of social and environmental factors on the onset of anxiety disorders.
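Keyword co-occurrence, one of the analyses named in the Methods above, reduces to counting how often two keywords appear on the same publication. The sketch below shows that counting step with the Python standard library; it is not how VOSviewer or CiteSpace are implemented, and the three keyword lists are invented for illustration.

```python
# Count keyword pairs that appear together on the same publication record.
from collections import Counter
from itertools import combinations

def keyword_cooccurrence(records):
    """records: one keyword list per publication. Returns a Counter of sorted keyword pairs."""
    counts = Counter()
    for keywords in records:
        # Normalize case and whitespace, and count each keyword once per record.
        unique = sorted({k.strip().lower() for k in keywords if k.strip()})
        for pair in combinations(unique, 2):
            counts[pair] += 1
    return counts

# Hypothetical author keywords for three publications.
toy_records = [
    ["Anxiety", "Depression", "fMRI"],
    ["anxiety", "depression"],
    ["Anxiety", "amygdala", "fMRI"],
]
pair_counts = keyword_cooccurrence(toy_records)
for pair, n in pair_counts.most_common(3):
    print(pair, n)
```

The resulting pair counts are the edge weights of a co-occurrence network, which bibliometric tools then cluster and lay out visually.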
Affiliation(s)
- Gui-Yu Liu
- School of Traditional Chinese Medicine, Southern Medical University, Guangzhou, 510515, PR China, China
| | - Ming-De Yan
- School of Traditional Chinese Medicine, Southern Medical University, Guangzhou, 510515, PR China, China
| | - Yi-Yin Mai
- The Second School of Clinical Medicine, Southern Medical University, Guangzhou, China
| | - Fan-Jia Fu
- School of Traditional Chinese Medicine, Southern Medical University, Guangzhou, 510515, PR China, China
| | - Lei Pan
- School of Traditional Chinese Medicine, Southern Medical University, Guangzhou, 510515, PR China, China
| | - Jun-Ming Zhu
- School of Traditional Chinese Medicine, Southern Medical University, Guangzhou, 510515, PR China, China
| | - Wen-Juan Ji
- School of Traditional Chinese Medicine, Southern Medical University, Guangzhou, 510515, PR China, China
| | - Jiao Hu
- School of Traditional Chinese Medicine, Southern Medical University, Guangzhou, 510515, PR China, China
| | - Wei-Peng Li
- School of Traditional Chinese Medicine, Southern Medical University, Guangzhou, 510515, PR China, China
- Department of Neurology, Integrated Hospital of Traditional Chinese Medicine, Southern Medical University, Guangzhou, China
| | - Wei Xie
- School of Traditional Chinese Medicine, Southern Medical University, Guangzhou, 510515, PR China, China
- Department of Traditional Chinese Medicine, Nanfang Hospital, Southern Medical University, Guangzhou, China
|
21
|
Zhao Y, Jin J, Gao W, Qiao J, Wei L. Moss-m7G: A Motif-Based Interpretable Deep Learning Method for RNA N7-Methlguanosine Site Prediction. J Chem Inf Model 2024; 64:6230-6240. [PMID: 39011571 DOI: 10.1021/acs.jcim.4c00802] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/17/2024]
Abstract
N-7methylguanosine (m7G) modification plays a crucial role in various biological processes and is closely associated with the development and progression of many cancers. Accurate identification of m7G modification sites is essential for understanding their regulatory mechanisms and advancing cancer therapy. Previous studies often suffered from insufficient research data, underutilization of motif information, and lack of interpretability. In this work, we designed a novel motif-based interpretable method for m7G modification site prediction, called Moss-m7G. This approach enables the analysis of RNA sequences from a motif-centric perspective. Our proposed word-detection module and motif-embedding module within Moss-m7G extract motif information from sequences, transforming the raw sequences from base-level into motif-level and generating embeddings for these motif sequences. Compared with base sequences, motif sequences contain richer contextual information, which is further analyzed and integrated through the Transformer model. We constructed a comprehensive m7G data set to implement the training and testing process to address the data insufficiency noted in prior research. Our experimental results affirm the effectiveness and superiority of Moss-m7G in predicting m7G modification sites. Moreover, the introduction of the word-detection module enhances the interpretability of the model, providing insights into the predictive mechanisms.
Affiliation(s)
- Yanxi Zhao
  - School of Software, Shandong University, Jinan 250101, China
  - Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
- Junru Jin
  - School of Software, Shandong University, Jinan 250101, China
  - Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
- Wenjia Gao
  - School of Software, Shandong University, Jinan 250101, China
  - Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
- Jianbo Qiao
  - School of Software, Shandong University, Jinan 250101, China
  - Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
- Leyi Wei
  - School of Software, Shandong University, Jinan 250101, China
  - Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
  - School of Informatics, Xiamen University, Xiamen 361104, China
|
22
|
Chen V, Yang M, Cui W, Kim JS, Talwalkar A, Ma J. Applying interpretable machine learning in computational biology-pitfalls, recommendations and opportunities for new developments. Nat Methods 2024; 21:1454-1461. [PMID: 39122941 PMCID: PMC11348280 DOI: 10.1038/s41592-024-02359-7] [Received: 10/27/2022] [Accepted: 06/24/2024] [Indexed: 08/12/2024]
Abstract
Recent advances in machine learning have enabled the development of next-generation predictive models for complex computational biology problems, thereby spurring the use of interpretable machine learning (IML) to unveil biological insights. However, guidelines for using IML in computational biology are generally underdeveloped. We provide an overview of IML methods and evaluation techniques and discuss common pitfalls encountered when applying IML methods to computational biology problems. We also highlight open questions, especially in the era of large language models, and call for collaboration between IML and computational biology researchers.
Affiliation(s)
- Valerie Chen
  - Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Muyu Yang
  - Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Wenbo Cui
  - Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Joon Sik Kim
  - Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Ameet Talwalkar
  - Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Jian Ma
  - Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
|
23
|
van Hilten A, Katz S, Saccenti E, Niessen WJ, Roshchupkin GV. Designing interpretable deep learning applications for functional genomics: a quantitative analysis. Brief Bioinform 2024; 25:bbae449. [PMID: 39293804 PMCID: PMC11410376 DOI: 10.1093/bib/bbae449] [Received: 04/05/2024] [Revised: 08/07/2024] [Accepted: 08/28/2024] [Indexed: 09/20/2024] Open
Abstract
Deep learning applications have had a profound impact on many scientific fields, including functional genomics. Deep learning models can learn complex interactions between and within omics data; however, interpreting and explaining these models can be challenging. Interpretability is essential not only to advance our understanding of the biological mechanisms underlying traits and diseases but also to establish trust in these models' efficacy for healthcare applications. Recognizing this importance, recent years have seen the development of numerous diverse interpretability strategies, making it increasingly difficult to navigate the field. In this review, we present a quantitative analysis of the challenges arising when designing interpretable deep learning solutions in functional genomics. We explore design choices related to the characteristics of genomics data, the neural network architectures applied, and strategies for interpretation. By quantifying the current state of the field with a predefined set of criteria, we find the most frequent solutions, highlight exceptional examples, and identify unexplored opportunities for developing interpretable deep learning models in genomics.
Affiliation(s)
- Arno van Hilten
  - Department of Radiology and Nuclear Medicine, Erasmus MC, 3015 GD Rotterdam, The Netherlands
- Sonja Katz
  - Department of Radiology and Nuclear Medicine, Erasmus MC, 3015 GD Rotterdam, The Netherlands
  - Laboratory of Systems and Synthetic Biology, Wageningen University & Research, 6700 HB Wageningen WE, The Netherlands
- Edoardo Saccenti
  - Laboratory of Systems and Synthetic Biology, Wageningen University & Research, 6700 HB Wageningen WE, The Netherlands
- Wiro J Niessen
  - Department of Imaging Physics, Delft University of Technology, 2628 CD Delft, The Netherlands
- Gennady V Roshchupkin
  - Department of Radiology and Nuclear Medicine, Erasmus MC, 3015 GD Rotterdam, The Netherlands
  - Department of Epidemiology, Erasmus MC, 3015 GD Rotterdam, The Netherlands
|
24
|
Gilliot PA, Gorochowski TE. Transfer learning for cross-context prediction of protein expression from 5'UTR sequence. Nucleic Acids Res 2024; 52:e58. [PMID: 38864396 PMCID: PMC11260469 DOI: 10.1093/nar/gkae491] [Received: 06/10/2023] [Revised: 04/28/2024] [Accepted: 05/28/2024] [Indexed: 06/13/2024] Open
Abstract
Model-guided DNA sequence design can accelerate the reprogramming of living cells. It allows us to engineer more complex biological systems by removing the need to physically assemble and test each potential design. While mechanistic models of gene expression have seen some success in supporting this goal, data-centric, deep learning-based approaches often provide more accurate predictions. This accuracy, however, comes at a cost - a lack of generalization across genetic and experimental contexts that has limited their wider use outside the context in which they were trained. Here, we address this issue by demonstrating how a simple transfer learning procedure can effectively tune a pre-trained deep learning model to predict protein translation rate from 5' untranslated region (5'UTR) sequence for diverse contexts in Escherichia coli using a small number of new measurements. This allows for important model features learnt from expensive massively parallel reporter assays to be easily transferred to new settings. By releasing our trained deep learning model and complementary calibration procedure, this study acts as a starting point for continually refined model-based sequence design that builds on previous knowledge and future experimental efforts.
Affiliation(s)
- Pierre-Aurélien Gilliot
  - School of Biological Sciences, University of Bristol, 24 Tyndall Avenue, Bristol BS8 1TQ, UK
- Thomas E Gorochowski
  - School of Biological Sciences, University of Bristol, 24 Tyndall Avenue, Bristol BS8 1TQ, UK
  - BrisEngBio, School of Chemistry, University of Bristol, Cantock’s Close, Bristol BS8 1TS, UK
|
25
|
Chen Z, Chen C, Yang G, He X, Chi X, Zeng Z, Chen X. Research integrity in the era of artificial intelligence: Challenges and responses. Medicine (Baltimore) 2024; 103:e38811. [PMID: 38968491 PMCID: PMC11224801 DOI: 10.1097/md.0000000000038811] [Received: 01/28/2024] [Accepted: 06/13/2024] [Indexed: 07/07/2024] Open
Abstract
The application of artificial intelligence (AI) technologies in scientific research has significantly enhanced efficiency and accuracy but has also introduced new forms of academic misconduct, such as data fabrication and text plagiarism using AI algorithms. These practices jeopardize research integrity and can mislead scientific directions. This study addresses these challenges, underscoring the need for the academic community to strengthen ethical norms, enhance researcher qualifications, and establish rigorous review mechanisms. To ensure responsible and transparent research processes, we recommend the following specific key actions: (1) development and enforcement of comprehensive AI research integrity guidelines that include clear protocols for AI use in data analysis and publication, ensuring transparency and accountability in AI-assisted research; (2) implementation of mandatory AI ethics and integrity training for researchers, aimed at fostering an in-depth understanding of potential AI misuses and promoting ethical research practices; and (3) establishment of international collaboration frameworks to facilitate the exchange of best practices and the development of unified ethical standards for AI in research. Protecting research integrity is paramount for maintaining public trust in science, making these recommendations urgent for the scientific community's consideration and action.
Affiliation(s)
- Ziyu Chen
  - The First Affiliated Hospital of Shenzhen University, Shenzhen University, Shenzhen, China
- Changye Chen
  - The First Affiliated Hospital of Shenzhen University, Shenzhen University, Shenzhen, China
- Guozhao Yang
  - The First Affiliated Hospital of Shenzhen University, Shenzhen University, Shenzhen, China
- Xiangpeng He
  - The First Affiliated Hospital of Shenzhen University, Shenzhen University, Shenzhen, China
- Xiaoxia Chi
  - The First Affiliated Hospital of Shenzhen University, Shenzhen University, Shenzhen, China
- Zhuoying Zeng
  - The First Affiliated Hospital of Shenzhen University, Shenzhen University, Shenzhen, China
  - Chemical Analysis & Physical Testing Institute, Shenzhen Center for Disease Control and Prevention, Shenzhen, China
- Xuhong Chen
  - The First Affiliated Hospital of Shenzhen University, Shenzhen University, Shenzhen, China
|
26
|
Moeckel C, Mouratidis I, Chantzi N, Uzun Y, Georgakopoulos-Soares I. Advances in computational and experimental approaches for deciphering transcriptional regulatory networks: Understanding the roles of cis-regulatory elements is essential, and recent research utilizing MPRAs, STARR-seq, CRISPR-Cas9, and machine learning has yielded valuable insights. Bioessays 2024; 46:e2300210. [PMID: 38715516 PMCID: PMC11444527 DOI: 10.1002/bies.202300210] [Received: 10/31/2023] [Revised: 04/22/2024] [Accepted: 04/23/2024] [Indexed: 05/16/2024]
Abstract
Understanding the influence of cis-regulatory elements on gene regulation poses numerous challenges given complexities stemming from variations in transcription factor (TF) binding, chromatin accessibility, structural constraints, and cell-type differences. This review discusses the role of gene regulatory networks in enhancing understanding of transcriptional regulation and covers construction methods ranging from expression-based approaches to supervised machine learning. Additionally, key experimental methods, including MPRAs and CRISPR-Cas9-based screening, which have significantly contributed to understanding TF binding preferences and cis-regulatory element functions, are explored. Lastly, the potential of machine learning and artificial intelligence to unravel cis-regulatory logic is analyzed. These computational advances have far-reaching implications for precision medicine, therapeutic target discovery, and the study of genetic variations in health and disease.
Affiliation(s)
- Camille Moeckel
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Yasin Uzun
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
  - Department of Pediatrics, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
|
27
|
Posfai A, Zhou J, McCandlish DM, Kinney JB. Gauge fixing for sequence-function relationships. bioRxiv 2024:2024.05.12.593772. [PMID: 38798671 PMCID: PMC11118547 DOI: 10.1101/2024.05.12.593772] [Indexed: 05/29/2024]
Abstract
Quantitative models of sequence-function relationships are ubiquitous in computational biology, e.g., for modeling the DNA binding of transcription factors or the fitness landscapes of proteins. Interpreting these models, however, is complicated by the fact that the values of model parameters can often be changed without affecting model predictions. Before the values of model parameters can be meaningfully interpreted, one must remove these degrees of freedom (called "gauge freedoms" in physics) by imposing additional constraints (a process called "fixing the gauge"). However, strategies for fixing the gauge of sequence-function relationships have received little attention. Here we derive an analytically tractable family of gauges for a large class of sequence-function relationships. These gauges are derived in the context of models with all-order interactions, but an important subset of these gauges can be applied to diverse types of models, including additive models, pairwise-interaction models, and models with higher-order interactions. Many commonly used gauges are special cases of gauges within this family. We demonstrate the utility of this family of gauges by showing how different choices of gauge can be used both to explore complex activity landscapes and to reveal simplified models that are approximately correct within localized regions of sequence space. The results provide practical gauge-fixing strategies and demonstrate the utility of gauge-fixing for model exploration and interpretation.
Affiliation(s)
- Anna Posfai
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- Juannan Zhou
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
  - Department of Biology, University of Florida, Gainesville, FL, 32611
- David M. McCandlish
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- Justin B. Kinney
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
|
28
|
Zhan C, Dai Z, Yin S, Carroll KC, Soltanian MR. Conceptualizing future groundwater models through a ternary framework of multisource data, human expertise, and machine intelligence. Water Research 2024; 257:121679. [PMID: 38696982 DOI: 10.1016/j.watres.2024.121679] [Received: 02/23/2024] [Revised: 04/24/2024] [Accepted: 04/25/2024] [Indexed: 05/04/2024]
Abstract
Groundwater models are essential for understanding aquifer systems behavior and effective water resources spatio-temporal distributions, yet they are often hindered by challenges related to model assumptions, parametrization, uncertainty, and computational efficiency. Machine intelligence, especially deep learning, promises a paradigm shift in overcoming these challenges. A critical examination of existing machine-driven methods reveals the inherent limitations, particularly in terms of the interpretability and the ability to generalize findings. To overcome these challenges, we develop a ternary framework that synergizes the valuable insights from multisource data, human expertise, and machine intelligence. This framework capitalizes on the distinct strengths of each element: the value and relevance of multisource data, the innovative capacity of human expertise, and the analytical efficiency of machine intelligence. Our goal is to conceptualize sustainable water management practices and enhance our understanding and predictive capabilities of groundwater systems. Unlike approaches that rely solely on abundant data, our framework emphasizes the quality and strategic use of available data, combined with human intellect and advanced computing, to overcome current limitations and pave the way for more realistic groundwater simulations.
Affiliation(s)
- Chuanjun Zhan
  - School of Environmental and Municipal Engineering, Qingdao University of Technology, Qingdao 266520, China
- Zhenxue Dai
  - School of Environmental and Municipal Engineering, Qingdao University of Technology, Qingdao 266520, China; College of Construction Engineering, Jilin University, Changchun 130026, China; Institute of Intelligent Simulation and Early Warning for Subsurface Environment, Jilin University, Changchun 130026, China
- Shangxian Yin
  - North China Institute of Science & Technology, Langfang 065201, China
- Kenneth C Carroll
  - Department of Plant & Environmental Science, New Mexico State University, Las Cruces, NM, USA
- Mohamad Reza Soltanian
  - Departments of Geosciences and Environmental Engineering, University of Cincinnati, Cincinnati, OH, USA
|
29
|
Wagle MM, Long S, Chen C, Liu C, Yang P. Interpretable deep learning in single-cell omics. Bioinformatics 2024; 40:btae374. [PMID: 38889275 PMCID: PMC11211213 DOI: 10.1093/bioinformatics/btae374] [Received: 02/25/2024] [Revised: 05/11/2024] [Accepted: 06/12/2024] [Indexed: 06/20/2024] Open
Abstract
MOTIVATION: Single-cell omics technologies have enabled the quantification of molecular profiles in individual cells at an unparalleled resolution. Deep learning, a rapidly evolving sub-field of machine learning, has instilled a significant interest in single-cell omics research due to its remarkable success in analysing heterogeneous high-dimensional single-cell omics data. Nevertheless, the inherent multi-layer nonlinear architecture of deep learning models often makes them 'black boxes' as the reasoning behind predictions is often unknown and not transparent to the user. This has stimulated an increasing body of research for addressing the lack of interpretability in deep learning models, especially in single-cell omics data analyses, where the identification and understanding of molecular regulators are crucial for interpreting model predictions and directing downstream experimental validations. RESULTS: In this work, we introduce the basics of single-cell omics technologies and the concept of interpretable deep learning. This is followed by a review of the recent interpretable deep learning models applied to various single-cell omics research. Lastly, we highlight the current limitations and discuss potential future directions.
Affiliation(s)
- Manoj M Wagle
  - Computational Systems Biology Unit, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - School of Mathematics and Statistics, Faculty of Science, The University of Sydney, Camperdown, NSW 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Camperdown, NSW 2006, Australia
- Siqu Long
  - Computational Systems Biology Unit, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - School of Mathematics and Statistics, Faculty of Science, The University of Sydney, Camperdown, NSW 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Camperdown, NSW 2006, Australia
- Carissa Chen
  - Computational Systems Biology Unit, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Camperdown, NSW 2006, Australia
- Chunlei Liu
  - Computational Systems Biology Unit, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Camperdown, NSW 2006, Australia
- Pengyi Yang
  - Computational Systems Biology Unit, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - School of Mathematics and Statistics, Faculty of Science, The University of Sydney, Camperdown, NSW 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Camperdown, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Camperdown, NSW 2006, Australia
|
30
|
Hou H, Zhang R, Li J. Artificial intelligence in the clinical laboratory. Clin Chim Acta 2024; 559:119724. [PMID: 38734225 DOI: 10.1016/j.cca.2024.119724] [Received: 04/17/2024] [Revised: 05/07/2024] [Accepted: 05/08/2024] [Indexed: 05/13/2024]
Abstract
Laboratory medicine has become a highly automated medical discipline. Artificial intelligence (AI) applied to laboratory medicine is gaining increasing attention, as it can optimize the entire laboratory workflow and may even revolutionize laboratory medicine in the future. However, only a few commercially available AI models are currently approved for use in clinical laboratories, and they have drawbacks such as high cost, lack of accuracy, and the need for manual review of model results. Furthermore, few literature reviews comprehensively address the research status, challenges, and future opportunities of AI applications in laboratory medicine. Our article begins with a brief introduction to AI and some of its subsets, then reviews AI models that are currently being used in clinical laboratories or that have been described in emerging studies, explains the existing challenges associated with their application along with possible solutions, and finally provides insights into the future opportunities of the field. We highlight the current status of implementation and potential applications of AI models in different stages of the clinical testing process.
Affiliation(s)
- Hanjing Hou
  - National Center for Clinical Laboratories, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing Hospital/National Center of Gerontology, PR China; National Center for Clinical Laboratories, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, PR China; Beijing Engineering Research Center of Laboratory Medicine, Beijing Hospital, Beijing, PR China
- Rui Zhang
  - National Center for Clinical Laboratories, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing Hospital/National Center of Gerontology, PR China; National Center for Clinical Laboratories, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, PR China; Beijing Engineering Research Center of Laboratory Medicine, Beijing Hospital, Beijing, PR China
- Jinming Li
  - National Center for Clinical Laboratories, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing Hospital/National Center of Gerontology, PR China; National Center for Clinical Laboratories, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, PR China; Beijing Engineering Research Center of Laboratory Medicine, Beijing Hospital, Beijing, PR China
|
31
|
Qi T, Zhou Y, Sheng Y, Li Z, Yang Y, Liu Q, Ge Q. Prediction of Transcription Factor Binding Sites on Cell-Free DNA Based on Deep Learning. J Chem Inf Model 2024; 64:4002-4008. [PMID: 38798191 DOI: 10.1021/acs.jcim.4c00047] [Indexed: 05/29/2024]
Abstract
Transcription factors (TFs) are important regulatory elements for vital cellular activities, and the identification of transcription factor binding sites (TFBSs) can help to explore gene regulatory mechanisms. Previous studies have shown that cell-free DNA (cfDNA) has relatively higher coverage at TFBSs, because bound TFs protect the DNA from degradation by nucleases, and that short cfDNA fragments are enriched in TFBSs. However, noninvasive identification of TFBSs with experimental techniques remains very difficult. In this study, we propose a deep learning-based approach that can noninvasively predict TFBSs of cfDNA by learning sequence information from known TFBSs through convolutional neural networks. With the addition of long short-term memory, our model achieved an area under the curve of 84%. Applying this model to cfDNA, we found consistent motifs in cfDNA fragments and lower coverage upstream and downstream of these fragments, which is consistent with a previous study. We also found that the binding sites of the same TF differ in different cell lines. TF-specific target genes were detected from cfDNA and were enriched in cancer-related pathways. In summary, our method of locating TFBSs from plasma has the potential to reflect the intrinsic regulatory mechanism from a noninvasive perspective and provide technical guidance for dynamic monitoring of disease in clinical practice.
Affiliation(s)
- Ting Qi
  - State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, People's Republic of China
- Ying Zhou
  - State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, People's Republic of China
- Yuqi Sheng
  - State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, People's Republic of China
- Zhihui Li
  - State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, People's Republic of China
- Yuwei Yang
  - State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, People's Republic of China
- Quanjun Liu
  - State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, People's Republic of China
- Qinyu Ge
  - State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, People's Republic of China
|
32
|
Zhang D, Gao S, Liu ZP, Gao R. LogicGep: Boolean networks inference using symbolic regression from time-series transcriptomic profiling data. Brief Bioinform 2024; 25:bbae286. [PMID: 38886006 PMCID: PMC11182660 DOI: 10.1093/bib/bbae286] [Received: 03/30/2024] [Revised: 05/09/2024] [Accepted: 06/06/2024] [Indexed: 06/20/2024] Open
Abstract
Reconstructing the topology of gene regulatory networks from gene expression data has been extensively studied. With the abundance of functional transcriptomic data available, it is now feasible to systematically decipher regulatory interaction dynamics in a logic form such as a Boolean network (BN) framework, which qualitatively indicates how multiple regulators aggregate to affect a common target gene. However, inferring both the network topology and gene interaction dynamics simultaneously is still a challenging problem, since gene expression data are typically noisy and data discretization is prone to information loss. We propose a new method for BN inference from time-series transcriptional profiles, called LogicGep. LogicGep formulates the identification of Boolean functions as a symbolic regression problem that learns the Boolean function expression, and solves it efficiently through multi-objective optimization using an improved gene expression programming algorithm. To avoid overly emphasizing dynamic characteristics at the expense of topological structure, as traditional methods often do, a set of promising Boolean formulas for each target gene is first evolved, and a feed-forward neural network trained with continuous expression data is subsequently employed to pick out the final solution. We validated the efficacy of LogicGep using multiple datasets including both synthetic and real-world experimental data. The results show that LogicGep infers accurate BN models, outperforming other representative BN inference algorithms in both network topology reconstruction and the identification of Boolean functions. Moreover, the execution of LogicGep is hundreds of times faster than other methods, especially in the case of large network inference.
Affiliation(s)
- Dezhen Zhang
  - Center of Intelligent Medicine, School of Control Science and Engineering, Shandong University, Jinan, Shandong 250061, China
- Shuhua Gao
  - Center of Intelligent Medicine, School of Control Science and Engineering, Shandong University, Jinan, Shandong 250061, China
- Zhi-Ping Liu
  - Center of Intelligent Medicine, School of Control Science and Engineering, Shandong University, Jinan, Shandong 250061, China
- Rui Gao
  - Center of Intelligent Medicine, School of Control Science and Engineering, Shandong University, Jinan, Shandong 250061, China
|
33
|
Assis de Souza A, Stubbs AP, Hesselink DA, Baan CC, Boer K. Cherry on Top or Real Need? A Review of Explainable Machine Learning in Kidney Transplantation. Transplantation 2024:00007890-990000000-00768. [PMID: 38773859 DOI: 10.1097/tp.0000000000005063] [Indexed: 05/24/2024]
Abstract
Research on solid organ transplantation has taken advantage of the substantial acquisition of medical data and the use of artificial intelligence (AI) and machine learning (ML) to answer diagnostic, prognostic, and therapeutic questions for many years. Nevertheless, despite the question of whether AI models add value to traditional modeling approaches, such as regression models, their "black box" nature is one of the factors that have hindered the translation from research to clinical practice. Several techniques that make such models understandable to humans were developed with the promise of increasing transparency in the support of medical decision-making. These techniques should help AI to close the gap between theory and practice by yielding trust in the model by doctors and patients, allowing model auditing, and facilitating compliance with emergent AI regulations. But is this also happening in the field of kidney transplantation? This review reports the use and explanation of "black box" models to diagnose and predict kidney allograft rejection, delayed graft function, graft failure, and other related outcomes after kidney transplantation. In particular, we emphasize the discussion on the need (or not) to explain ML models for biological discovery and clinical implementation in kidney transplantation. We also discuss promising future research paths for these computational tools.
Affiliation(s)
- Alvaro Assis de Souza
- Department of Internal Medicine, Erasmus MC Transplant Institute, University Medical Center Rotterdam, Rotterdam, the Netherlands
| | - Andrew P Stubbs
- Department of Pathology and Clinical Bioinformatics, Erasmus MC Stubbs Group, University Medical Center Rotterdam, Rotterdam, the Netherlands
| | - Dennis A Hesselink
- Department of Internal Medicine, Erasmus MC Transplant Institute, University Medical Center Rotterdam, Rotterdam, the Netherlands
| | - Carla C Baan
- Department of Internal Medicine, Erasmus MC Transplant Institute, University Medical Center Rotterdam, Rotterdam, the Netherlands
| | - Karin Boer
- Department of Internal Medicine, Erasmus MC Transplant Institute, University Medical Center Rotterdam, Rotterdam, the Netherlands
| |
|
34
|
Gao Z, Liu X, Kang Y, Hu P, Zhang X, Yan W, Yan M, Yu P, Zhang Q, Xiao W, Zhang Z. Improving the Prognostic Evaluation Precision of Hospital Outcomes for Heart Failure Using Admission Notes and Clinical Tabular Data: Multimodal Deep Learning Model. J Med Internet Res 2024; 26:e54363. [PMID: 38696251 PMCID: PMC11099809 DOI: 10.2196/54363] [Received: 11/07/2023] [Revised: 01/01/2024] [Accepted: 03/19/2024] [Indexed: 05/04/2024] Open
Abstract
BACKGROUND Clinical notes contain contextualized information beyond structured data related to patients' past and current health status. OBJECTIVE This study aimed to design a multimodal deep learning approach to improve the evaluation precision of hospital outcomes for heart failure (HF) using admission clinical notes and easily collected tabular data. METHODS Data for the development and validation of the multimodal model were retrospectively derived from 3 open-access US databases, including the Medical Information Mart for Intensive Care III v1.4 (MIMIC-III) and MIMIC-IV v1.0, collected from a teaching hospital from 2001 to 2019, and the eICU Collaborative Research Database v1.2, collected from 208 hospitals from 2014 to 2015. The study cohorts consisted of all patients with critical HF. The clinical notes, including chief complaint, history of present illness, physical examination, medical history, and admission medication, as well as clinical variables recorded in electronic health records, were analyzed. We developed a deep learning mortality prediction model for in-hospital patients, which underwent complete internal, prospective, and external evaluation. The Integrated Gradients and SHapley Additive exPlanations (SHAP) methods were used to analyze the importance of risk factors. RESULTS The study included 9989 (16.4%) patients in the development set, 2497 (14.1%) patients in the internal validation set, 1896 (18.3%) in the prospective validation set, and 7432 (15%) patients in the external validation set. The area under the receiver operating characteristic curve of the models was 0.838 (95% CI 0.827-0.851), 0.849 (95% CI 0.841-0.856), and 0.767 (95% CI 0.762-0.772), for the internal, prospective, and external validation sets, respectively. The area under the receiver operating characteristic curve of the multimodal model outperformed that of the unimodal models in all test sets, and tabular data contributed to higher discrimination. 
The medical history and physical examination were more useful than other factors in early assessments. CONCLUSIONS The multimodal deep learning model for combining admission notes and clinical tabular data showed promising efficacy as a potentially novel method in evaluating the risk of mortality in patients with HF, providing more accurate and timely decision support.
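The discrimination figures reported in this abstract are areas under the receiver operating characteristic curve. As a minimal sketch of what that number measures, the function below computes it as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `auroc` and the toy labels and scores are hypothetical and are not data from the study.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: 1 for the positive class (e.g. in-hospital death), 0 otherwise.
    scores: predicted risk, where higher means more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative case")

    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical values): three of the four
# positive/negative pairs are ranked correctly.
toy_auc = auroc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

The pairwise form is quadratic in the number of cases, which is fine for illustration; for cohorts of the size reported above, a rank-based implementation would be the usual choice.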
Affiliation(s)
- Zhenyue Gao
- Beijing Engineering Research Center of Industrial Spectrum Imaging, School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing, China
| | - Xiaoli Liu
- Center for Artificial Intelligence in Medicine, The General Hospital of People's Liberation Army, Beijing, China
| | - Yu Kang
- Department of Cardiology, West China Hospital, Sichuan University, Chengdu, China
| | - Pan Hu
- Center for Artificial Intelligence in Medicine, The General Hospital of People's Liberation Army, Beijing, China
| | - Xiu Zhang
- Department of Cardiology, West China Hospital, Sichuan University, Chengdu, China
| | - Wei Yan
- Center for Artificial Intelligence in Medicine, The General Hospital of People's Liberation Army, Beijing, China
| | - Muyang Yan
- Center for Artificial Intelligence in Medicine, The General Hospital of People's Liberation Army, Beijing, China
| | - Pengming Yu
- Department of Cardiology, West China Hospital, Sichuan University, Chengdu, China
| | - Qing Zhang
- Department of Cardiology, West China Hospital, Sichuan University, Chengdu, China
| | - Wendong Xiao
- Beijing Engineering Research Center of Industrial Spectrum Imaging, School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing, China
| | - Zhengbo Zhang
- Center for Artificial Intelligence in Medicine, The General Hospital of People's Liberation Army, Beijing, China
| |
|
35
|
Sandell FL, Holzweber T, Street NR, Dohm JC, Himmelbauer H. Genomic basis of seed colour in quinoa inferred from variant patterns using extreme gradient boosting. PLANT BIOTECHNOLOGY JOURNAL 2024; 22:1312-1324. [PMID: 38213076 PMCID: PMC11022794 DOI: 10.1111/pbi.14267] [Received: 07/07/2023] [Revised: 11/03/2023] [Accepted: 11/28/2023] [Indexed: 01/13/2024]
Abstract
Quinoa is an agriculturally important crop species originally domesticated in the Andes of central South America. One of its most important phenotypic traits is seed colour. Seed colour variation is determined by contrasting abundance of betalains, a class of strongly antioxidant, free-radical-scavenging colour pigments found only in plants of the order Caryophyllales. However, the genetic basis for these pigments in seeds remains to be identified. Here we demonstrate the application of machine learning (extreme gradient boosting) to identify genetic variants predictive of seed colour. We show that extreme gradient boosting outperforms the classical genome-wide association approach. We provide re-sequencing and phenotypic data for 156 South American quinoa accessions and identify candidate genes potentially controlling betalain content in quinoa seeds. Genes identified include novel cytochrome P450 genes and known members of the betalain synthesis pathway, as well as genes annotated as being involved in seed development. Our work showcases the power of modern machine learning methods to extract biologically meaningful information from large sequencing data sets.
Affiliation(s)
- Felix L. Sandell
- Department of Biotechnology, Institute of Computational Biology, University of Natural Resources and Life Sciences (BOKU), Vienna, Austria
| | - Thomas Holzweber
- Department of Biotechnology, Institute of Computational Biology, University of Natural Resources and Life Sciences (BOKU), Vienna, Austria
| | - Nathaniel R. Street
- Department of Plant Physiology, Umeå Plant Science Centre, Umeå University, Umeå, Sweden
- SciLifeLab, Umeå University, Umeå, Sweden
| | - Juliane C. Dohm
- Department of Biotechnology, Institute of Computational Biology, University of Natural Resources and Life Sciences (BOKU), Vienna, Austria
| | - Heinz Himmelbauer
- Department of Biotechnology, Institute of Computational Biology, University of Natural Resources and Life Sciences (BOKU), Vienna, Austria
| |
|
36
|
Xu Y, Cao L, Chen Y, Zhang Z, Liu W, Li H, Ding C, Pu J, Qian K, Xu W. Integrating Machine Learning in Metabolomics: A Path to Enhanced Diagnostics and Data Interpretation. SMALL METHODS 2024:e2400305. [PMID: 38682615 DOI: 10.1002/smtd.202400305] [Received: 03/02/2024] [Revised: 04/07/2024] [Indexed: 05/01/2024]
Abstract
Metabolomics, leveraging techniques like NMR and MS, is crucial for understanding biochemical processes in pathophysiological states. This field, however, faces challenges in metabolite sensitivity, data complexity, and omics data integration. Recent machine learning advancements have enhanced data analysis and disease classification in metabolomics. This study explores machine learning integration with metabolomics to improve metabolite identification, data efficiency, and diagnostic methods. Using deep learning and traditional machine learning, it presents advancements in metabolic data analysis, including novel algorithms for accurate peak identification, robust disease classification from metabolic profiles, and improved metabolite annotation. It also highlights multiomics integration, demonstrating machine learning's potential in elucidating biological phenomena and advancing disease diagnostics. This work contributes significantly to metabolomics by merging it with machine learning, offering innovative solutions to analytical challenges and setting new standards for omics data analysis.
Affiliation(s)
- Yudian Xu
- Department of Traditional Chinese Medicine, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, P. R. China
| | - Linlin Cao
- State Key Laboratory for Oncogenes and Related Genes, Division of Cardiology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, 160 Pujian Road, Shanghai, 200127, P. R. China
| | - Yifan Chen
- State Key Laboratory for Oncogenes and Related Genes, Division of Cardiology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, 160 Pujian Road, Shanghai, 200127, P. R. China
| | - Ziyue Zhang
- School of Biomedical Engineering, Institute of Medical Robotics and Med-X Research Institute, Shanghai Jiao Tong University, Shanghai, 200030, P. R. China
| | - Wanshan Liu
- School of Biomedical Engineering, Institute of Medical Robotics and Med-X Research Institute, Shanghai Jiao Tong University, Shanghai, 200030, P. R. China
| | - He Li
- Department of Traditional Chinese Medicine, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, P. R. China
| | - Chenhuan Ding
- Department of Traditional Chinese Medicine, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, P. R. China
| | - Jun Pu
- State Key Laboratory for Oncogenes and Related Genes, Division of Cardiology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, 160 Pujian Road, Shanghai, 200127, P. R. China
| | - Kun Qian
- State Key Laboratory for Oncogenes and Related Genes, Division of Cardiology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, 160 Pujian Road, Shanghai, 200127, P. R. China
- School of Biomedical Engineering, Institute of Medical Robotics and Med-X Research Institute, Shanghai Jiao Tong University, Shanghai, 200030, P. R. China
| | - Wei Xu
- State Key Laboratory for Oncogenes and Related Genes, Division of Cardiology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, 160 Pujian Road, Shanghai, 200127, P. R. China
| |
|
37
|
Dudnyk K, Cai D, Shi C, Xu J, Zhou J. Sequence basis of transcription initiation in the human genome. Science 2024; 384:eadj0116. [PMID: 38662817 PMCID: PMC11223672 DOI: 10.1126/science.adj0116] [Received: 06/01/2023] [Accepted: 02/28/2024] [Indexed: 05/03/2024]
Abstract
Transcription initiation is a process that is essential to ensuring the proper function of any gene, yet we still lack a unified understanding of sequence patterns and rules that explain most transcription start sites in the human genome. By predicting transcription initiation at base-pair resolution from sequences with a deep learning-inspired explainable model called Puffin, we show that a small set of simple rules can explain transcription initiation at most human promoters. We identify key sequence patterns that contribute to human promoter activity, each activating transcription with distinct position-specific effects. Furthermore, we explain the sequence basis of bidirectional transcription at promoters, identify the links between promoter sequence and gene expression variation across cell types, and explore the conservation of sequence determinants of transcription initiation across mammalian species.
Affiliation(s)
- Kseniia Dudnyk
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center; Dallas, Texas, United States of America
| | - Donghong Cai
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center; Dallas, Texas, United States of America
- Center of Excellence for Leukemia Studies (CELS), Department of Pathology, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
| | - Chenlai Shi
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center; Dallas, Texas, United States of America
| | - Jian Xu
- Center of Excellence for Leukemia Studies (CELS), Department of Pathology, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
| | - Jian Zhou
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center; Dallas, Texas, United States of America
| |
|
38
|
Zhou Z, Zhang J, Zheng X, Pan Z, Zhao F, Gao Y. CIRI-Deep Enables Single-Cell and Spatial Transcriptomic Analysis of Circular RNAs with Deep Learning. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2024; 11:e2308115. [PMID: 38308181 PMCID: PMC11005702 DOI: 10.1002/advs.202308115] [Received: 10/25/2023] [Revised: 01/03/2024] [Indexed: 02/04/2024]
Abstract
Circular RNAs (circRNAs) are a crucial yet relatively unexplored class of transcripts known for their tissue- and cell-type-specific expression patterns. Despite the advances in single-cell and spatial transcriptomics, these technologies face difficulties in effectively profiling circRNAs due to inherent limitations in circRNA sequencing efficiency. To address this gap, a deep learning model, CIRI-deep, is presented for comprehensive prediction of circRNA regulation on diverse types of RNA-seq data. CIRI-deep is trained on an extensive dataset of 25 million high-confidence circRNA regulation events and achieves high performance on both test and leave-out data, ensuring its accuracy in inferring differential events from RNA-seq data. It is demonstrated that CIRI-deep and its adapted version enable various circRNA analyses, including cluster- or region-specific circRNA detection, BSJ ratio map visualization, and trans and cis feature importance evaluation. Collectively, CIRI-deep's adaptability extends to all major types of RNA-seq datasets, including single-cell and spatial transcriptomic data, which will undoubtedly broaden the horizons of circRNA research.
Affiliation(s)
- Zihan Zhou
- National Genomics Data Center & CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100101, China
| | - Jinyang Zhang
- Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100101, China
| | - Xin Zheng
- National Genomics Data Center & CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100101, China
| | - Zhicheng Pan
- Center for Computational Biology, Flatiron Institute, New York 10010, USA
| | - Fangqing Zhao
- Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100101, China
| | - Yuan Gao
- National Genomics Data Center & CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100101, China
| |
|
39
|
Fang Y, Bansal K, Mostafavi S, Benoist C, Mathis D. AIRE relies on Z-DNA to flag gene targets for thymic T cell tolerization. Nature 2024; 628:400-407. [PMID: 38480882 PMCID: PMC11091860 DOI: 10.1038/s41586-024-07169-7] [Received: 02/14/2023] [Accepted: 02/06/2024] [Indexed: 03/18/2024]
Abstract
AIRE is an unconventional transcription factor that enhances the expression of thousands of genes in medullary thymic epithelial cells and promotes clonal deletion or phenotypic diversion of self-reactive T cells [1-4]. The biological logic of AIRE's target specificity remains largely unclear as, in contrast to many transcription factors, it does not bind to a particular DNA sequence motif. Here we implemented two orthogonal approaches to investigate AIRE's cis-regulatory mechanisms: construction of a convolutional neural network and leveraging natural genetic variation through analysis of F1 hybrid mice [5]. Both approaches nominated Z-DNA and NFE2-MAF as putative positive influences on AIRE's target choices. Genome-wide mapping studies revealed that Z-DNA-forming and NFE2L2-binding motifs were positively associated with the inherent ability of a gene's promoter to generate DNA double-stranded breaks, and promoters showing strong double-stranded break generation were more likely to enter a poised state with accessible chromatin and already-assembled transcriptional machinery. Consequently, AIRE preferentially targets genes with poised promoters. We propose a model in which Z-DNA anchors the AIRE-mediated transcriptional program by enhancing double-stranded break generation and promoter poising. Beyond resolving a long-standing mechanistic conundrum, these findings suggest routes for manipulating T cell tolerance.
Affiliation(s)
- Yuan Fang
- Department of Immunology, Harvard Medical School, Boston, MA, USA
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA
| | - Kushagra Bansal
- Molecular Biology and Genetics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research, Bangalore, India
| | - Sara Mostafavi
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
- Canadian Institute for Advanced Research, Toronto, Ontario, Canada
| | | | - Diane Mathis
- Department of Immunology, Harvard Medical School, Boston, MA, USA.
| |
|
40
|
Xu Z, Liao H, Huang L, Chen Q, Lan W, Li S. IBPGNET: lung adenocarcinoma recurrence prediction based on neural network interpretability. Brief Bioinform 2024; 25:bbae080. [PMID: 38557672 PMCID: PMC10982951 DOI: 10.1093/bib/bbae080] [Received: 11/15/2023] [Revised: 01/31/2024] [Accepted: 02/07/2024] [Indexed: 04/04/2024] Open
Abstract
Lung adenocarcinoma (LUAD) is the most common histologic subtype of lung cancer. Early-stage patients have a 30-50% probability of metastatic recurrence after surgical treatment. Here, we propose a new computational framework, Interpretable Biological Pathway Graph Neural Networks (IBPGNET), based on pathway hierarchy relationships to predict LUAD recurrence and explore the internal regulatory mechanisms of LUAD. IBPGNET can integrate different omics data efficiently and provide global interpretability. In addition, our experimental results show that IBPGNET outperforms other classification methods in 5-fold cross-validation. IBPGNET identified PSMC1 and PSMD11 as genes associated with LUAD recurrence, and their expression levels were significantly higher in LUAD cells than in normal cells. The knockdown of PSMC1 and PSMD11 in LUAD cells increased their sensitivity to afatinib and decreased cell migration, invasion and proliferation. In addition, the cells showed significantly lower EGFR expression, indicating that PSMC1 and PSMD11 may mediate therapeutic sensitivity through EGFR expression.
Affiliation(s)
- Zhanyu Xu
- Department of Thoracic and Cardiovascular Surgery, The First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi Zhuang Autonomous Region 530021, China
| | - Haibo Liao
- School of computer, Electronic and Information, Guangxi University, Nanning, Guangxi Zhuang Autonomous Region 530021, China
| | - Liuliu Huang
- Department of Thoracic and Cardiovascular Surgery, The First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi Zhuang Autonomous Region 530021, China
| | - Qingfeng Chen
- School of computer, Electronic and Information, Guangxi University, Nanning, Guangxi Zhuang Autonomous Region 530021, China
| | - Wei Lan
- School of computer, Electronic and Information, Guangxi University, Nanning, Guangxi Zhuang Autonomous Region 530021, China
| | - Shikang Li
- Department of Thoracic and Cardiovascular Surgery, The First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi Zhuang Autonomous Region 530021, China
| |
|
41
|
Liang P, Li H, Long C, Liu M, Zhou J, Zuo Y. Chromatin region binning of gene expression for improving embryo cell subtype identification. Comput Biol Med 2024; 170:108049. [PMID: 38290319 DOI: 10.1016/j.compbiomed.2024.108049] [Received: 10/10/2023] [Revised: 01/01/2024] [Accepted: 01/26/2024] [Indexed: 02/01/2024]
Abstract
Mammalian embryonic development is a complex process, characterized by intricate spatiotemporal dynamics and distinct chromatin preferences. However, the rapid diversification in early embryogenesis leads to significant cellular diversity and sparsity in scRNA-seq data, posing challenges in accurately determining cell fate decisions. In this study, we introduce a chromatin region binning method, scChrBin, designed to identify chromatin regions that elucidate the dynamics of embryonic development and lineage differentiation. This method transforms scRNA-seq data into a chromatin-based matrix, leveraging genomic annotations. Our results showed that the scChrBin method achieves high accuracy, with 98.0% and 89.2% on two single-cell embryonic datasets, demonstrating its effectiveness in analyzing complex developmental processes. We also systematically and comprehensively analyze these key chromatin binning regions and their associated genes, focusing on their roles in lineage and stage development. The chromatin region binning perspective enables a comprehensive analysis of transcriptome data at the chromatin level, allowing us to unveil the dynamic expression of chromatin regions across temporal and spatial development. The tool is available as an application at https://github.com/liameihao/scChrBin.
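The step of turning a gene-level expression matrix into a chromatin-based matrix can be sketched as follows. This is only an illustration under stated assumptions: bins are fixed-width windows keyed by each gene's start coordinate, and expression is summed per bin. The function name `bin_expression`, the `bin_size` default, and the toy genes are hypothetical; the actual binning scheme and aggregation used by scChrBin are not described in the abstract and may differ.

```python
from collections import defaultdict
from typing import Dict, Tuple

Bin = Tuple[str, int]  # (chromosome, bin index)


def bin_expression(
    expression: Dict[str, Dict[str, float]],
    gene_positions: Dict[str, Tuple[str, int]],
    bin_size: int = 1_000_000,
) -> Dict[str, Dict[Bin, float]]:
    """Aggregate per-gene counts into fixed-width chromatin bins.

    expression: cell id -> {gene -> expression value}.
    gene_positions: gene -> (chromosome, start coordinate) from an annotation.
    Genes missing from the annotation are skipped (an assumption of this sketch).
    """
    binned: Dict[str, Dict[Bin, float]] = {}
    for cell, counts in expression.items():
        totals: Dict[Bin, float] = defaultdict(float)
        for gene, value in counts.items():
            if gene not in gene_positions:
                continue
            chrom, start = gene_positions[gene]
            totals[(chrom, start // bin_size)] += value
        binned[cell] = dict(totals)
    return binned


# Hypothetical annotation and a single toy cell.
positions = {
    "g1": ("chr1", 100_000),
    "g2": ("chr1", 900_000),
    "g3": ("chr1", 1_500_000),
}
cells = {"cell1": {"g1": 2.0, "g2": 3.0, "g3": 1.0, "g4": 7.0}}
chromatin_matrix = bin_expression(cells, positions)
```

Summing neighbouring genes into one feature reduces the sparsity of the per-cell vector, which is the motivation the abstract gives for working at the chromatin-region level.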
Affiliation(s)
- Pengfei Liang
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - Hanshuang Li
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - Chunshen Long
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - Mingzhu Liu
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - Jian Zhou
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - Yongchun Zuo
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China.
| |
|
42
|
Kim MJ, Kulkarni V, Goode MA, Hernandez J, Graham S, Sivesind TE, Manchadi ML. Utilizing systems genetics to enhance understanding into molecular targets of skin cancer. Exp Dermatol 2024; 33:e15043. [PMID: 38459629 PMCID: PMC11018140 DOI: 10.1111/exd.15043] [Received: 12/31/2023] [Revised: 02/12/2024] [Accepted: 02/16/2024] [Indexed: 03/10/2024]
Abstract
Despite progress made with immune checkpoint inhibitors and targeted therapies, skin cancer remains a significant public health concern in the United States. The intricacies of the disease, encompassing genetics, immune responses, and external factors, call for a comprehensive approach. Techniques in systems genetics, including transcriptional correlation analysis, functional pathway enrichment analysis, and protein-protein interaction network analysis, prove valuable in deciphering intricate molecular mechanisms and identifying potential diagnostic and therapeutic targets for skin cancer. Recent studies demonstrate the efficacy of these techniques in uncovering molecular processes and pinpointing diagnostic markers for various skin cancer types, highlighting the potential of systems genetics in advancing innovative therapies. While certain limitations exist, such as generalizability and contextualization of external factors, the ongoing progress in AI technologies provides hope in overcoming these challenges. By providing protocols and a practical example involving Braf, we aim to inspire early-career experimental dermatologists to adopt these tools and seamlessly integrate these techniques into their skin cancer research, positioning them at the forefront of innovative approaches in combating this devastating disease.
Affiliation(s)
- Minjae J Kim
- University of Tennessee Health Science Center School of Medicine, Memphis, Tennessee, USA
| | | | - Micah A Goode
- University of Tennessee Health Science Center School of Medicine, Memphis, Tennessee, USA
| | - Jacob Hernandez
- University of Tennessee Health Science Center School of Medicine, Memphis, Tennessee, USA
| | - Sean Graham
- University of Tennessee Health Science Center School of Medicine, Memphis, Tennessee, USA
| | - Torunn E Sivesind
- Department of Dermatology, University of Colorado Anschutz Medical Campus, Aurora, Colorado, USA
| | | |
|
43
|
Knudsen JE, Rich JM, Ma R. Artificial Intelligence in Pathomics and Genomics of Renal Cell Carcinoma. Urol Clin North Am 2024; 51:47-62. [PMID: 37945102 DOI: 10.1016/j.ucl.2023.06.002] [Indexed: 11/12/2023]
Abstract
The integration of artificial intelligence (AI) with histopathology images and gene expression patterns has led to the emergence of the dynamic fields of pathomics and genomics. These fields have revolutionized renal cell carcinoma (RCC) diagnosis and subtyping and improved survival prediction models. Machine learning has identified unique gene patterns across RCC subtypes and grades, providing insights into RCC origins and potential treatments, such as targeted therapies. The combination of pathomics and genomics using AI opens new avenues in RCC research, promising future breakthroughs and innovations that patients and physicians can anticipate.
Affiliation(s)
- J Everett Knudsen
- Catherine & Joseph Aresty Department of Urology, USC Institute of Urology, Center for Robotic Simulation & Education, University of Southern California, Los Angeles, CA, USA
| | - Joseph M Rich
- Catherine & Joseph Aresty Department of Urology, USC Institute of Urology, Center for Robotic Simulation & Education, University of Southern California, Los Angeles, CA, USA
| | - Runzhuo Ma
- Catherine & Joseph Aresty Department of Urology, USC Institute of Urology, Center for Robotic Simulation & Education, University of Southern California, Los Angeles, CA, USA.
| |
|
44
|
Tjärnberg A, Beheler-Amass M, Jackson CA, Christiaen LA, Gresham D, Bonneau R. Structure-primed embedding on the transcription factor manifold enables transparent model architectures for gene regulatory network and latent activity inference. Genome Biol 2024; 25:24. [PMID: 38238840 PMCID: PMC10797903 DOI: 10.1186/s13059-023-03134-1] [Received: 02/28/2023] [Accepted: 11/30/2023] [Indexed: 01/22/2024] Open
Abstract
BACKGROUND Modeling of gene regulatory networks (GRNs) is limited due to a lack of direct measurements of genome-wide transcription factor activity (TFA), making it difficult to separate covariance and regulatory interactions. Inference of regulatory interactions and TFA requires aggregation of complementary evidence. Estimating TFA explicitly is problematic as it disconnects GRN inference and TFA estimation and is unable to account for, for example, contextual transcription factor-transcription factor interactions and other higher-order features. Deep learning offers a potential solution, as it can model complex interactions and higher-order latent features, although it does not provide interpretable models and latent features. RESULTS We propose a novel autoencoder-based framework, StrUcture Primed Inference of Regulation using latent Factor ACTivity (SupirFactor), for modeling, and a metric, explained relative variance (ERV), for interpretation of GRNs. We evaluate SupirFactor with ERV in a wide set of contexts. Compared to current state-of-the-art GRN inference methods, SupirFactor performs favorably. We evaluate latent feature activity as an estimate of TFA and biological function in S. cerevisiae as well as in peripheral blood mononuclear cells (PBMC). CONCLUSION Here we present a framework for structure-primed inference and interpretation of GRNs, SupirFactor, demonstrating interpretability using ERV in multiple biological and experimental settings. SupirFactor enables TFA estimation and pathway analysis using latent factor activity, demonstrated here on two large-scale single-cell datasets, modeling S. cerevisiae and PBMC. We find that the SupirFactor model facilitates biological analysis, yielding novel functional and regulatory insight.
Affiliation(s)
- Andreas Tjärnberg
- Center for Developmental Genetics, New York University, New York, NY, 10003, USA.
- Center For Genomics and Systems Biology, NYU, New York, NY, 10008, USA.
- Department of Biology, NYU, New York, NY, 10008, USA.
- Mortimer B. Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY, 10010, USA.
- Department of Neuro-Science, University of Wisconsin-Madison - Waisman Center, Madison, USA.
| | - Maggie Beheler-Amass
- Center For Genomics and Systems Biology, NYU, New York, NY, 10008, USA
- Department of Biology, NYU, New York, NY, 10008, USA
| | - Christopher A Jackson
- Center For Genomics and Systems Biology, NYU, New York, NY, 10008, USA
- Department of Biology, NYU, New York, NY, 10008, USA
| | - Lionel A Christiaen
- Center for Developmental Genetics, New York University, New York, NY, 10003, USA
- Department of Biology, NYU, New York, NY, 10008, USA
- Sars International Centre for Marine Molecular Biology, University of Bergen, Bergen, Norway
- Department of Heart Disease, Haukeland University Hospital, Bergen, Norway
| | - David Gresham
- Center For Genomics and Systems Biology, NYU, New York, NY, 10008, USA
- Department of Biology, NYU, New York, NY, 10008, USA
| | - Richard Bonneau
- Center For Genomics and Systems Biology, NYU, New York, NY, 10008, USA.
- Department of Biology, NYU, New York, NY, 10008, USA.
- Flatiron Institute, Center for Computational Biology, Simons Foundation, New York, NY, 10010, USA.
- Courant Institute of Mathematical Sciences, Computer Science Department, New York University, New York, NY, 10003, USA.
- Center For Data Science, NYU, New York, NY, 10008, USA.
- Prescient Design, a Genentech accelerator, New York, NY, 10010, USA.
| |
|
45
|
Sen SK, Green ED, Hutter CM, Craven M, Ideker T, Di Francesco V. Opportunities for basic, clinical, and bioethics research at the intersection of machine learning and genomics. Cell Genom 2024; 4:100466. [PMID: 38190108 PMCID: PMC10794834 DOI: 10.1016/j.xgen.2023.100466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2022] [Revised: 07/14/2023] [Accepted: 11/20/2023] [Indexed: 01/09/2024]
Abstract
The data-intensive fields of genomics and machine learning (ML) are in an early stage of convergence. Genomics researchers increasingly seek to harness the power of ML methods to extract knowledge from their data; conversely, ML scientists recognize that genomics offers a wealth of large, complex, and well-annotated datasets that can be used as a substrate for developing biologically relevant algorithms and applications. The National Human Genome Research Institute (NHGRI) inquired with researchers working in these two fields to identify common challenges and receive recommendations to better support genomic research efforts using ML approaches. Those included increasing the amount and variety of training datasets by integrating genomic with multiomics, context-specific (e.g., by cell type), and social determinants of health datasets; reducing the inherent biases of training datasets; prioritizing transparency and interpretability of ML methods; and developing privacy-preserving technologies for research participants' data.
Affiliation(s)
- Shurjo K Sen
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.
| | - Eric D Green
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Carolyn M Hutter
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Mark Craven
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53792, USA; Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53792, USA
| | - Trey Ideker
- Division of Genetics, Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA
| | - Valentina Di Francesco
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.
| |
|
46
|
Rauluseviciute I, Riudavets-Puig R, Blanc-Mathieu R, Castro-Mondragon J, Ferenc K, Kumar V, Lemma RB, Lucas J, Chèneby J, Baranasic D, Khan A, Fornes O, Gundersen S, Johansen M, Hovig E, Lenhard B, Sandelin A, Wasserman W, Parcy F, Mathelier A. JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic Acids Res 2024; 52:D174-D182. [PMID: 37962376 PMCID: PMC10767809 DOI: 10.1093/nar/gkad1059] [Citation(s) in RCA: 88] [Impact Index Per Article: 88.0] [Received: 09/15/2023] [Revised: 10/20/2023] [Accepted: 10/31/2023] [Indexed: 11/15/2023] Open
Abstract
JASPAR (https://jaspar.elixir.no/) is a widely used open-access database presenting manually curated, high-quality, and non-redundant DNA-binding profiles for transcription factors (TFs) across taxa. In this 10th release and 20th-anniversary update, the CORE collection has expanded with 329 new profiles. We updated three existing profiles and provided orthogonal support for 72 profiles from the previous release's UNVALIDATED collection. Altogether, the JASPAR 2024 update provides a 20% increase in CORE profiles from the previous release. A trimming algorithm enhanced profiles by removing low information content flanking base pairs, which were likely uninformative (within the capacity of the PFM models) for TFBS predictions and modelling TF-DNA interactions. This release includes enhanced metadata, featuring a refined classification for plant TFs' structural DNA-binding domains. The new JASPAR collections prompt updates to the genomic tracks of predicted TF binding sites (TFBSs) in 8 organisms, with human and mouse tracks available as native tracks in the UCSC Genome Browser. All data are available through the JASPAR web interface and programmatically through its API and the updated Bioconductor and pyJASPAR packages. Finally, a new TFBS extraction tool enables users to retrieve predicted JASPAR TFBSs intersecting their genomic regions of interest.
Affiliation(s)
- Ieva Rauluseviciute
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Rafael Riudavets-Puig
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Romain Blanc-Mathieu
- Laboratoire Physiologie Cellulaire et Végétale, Univ. Grenoble Alpes, CNRS, CEA, INRAE, IRIG-DBSCI-LPCV, 17 avenue des martyrs, F-38054, Grenoble, France
| | - Jaime A Castro-Mondragon
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Katalin Ferenc
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Vipin Kumar
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Roza Berhanu Lemma
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Jérémy Lucas
- Laboratoire Physiologie Cellulaire et Végétale, Univ. Grenoble Alpes, CNRS, CEA, INRAE, IRIG-DBSCI-LPCV, 17 avenue des martyrs, F-38054, Grenoble, France
| | - Jeanne Chèneby
- Center for Bioinformatics, Department of Informatics, University of Oslo, Oslo, Norway
| | - Damir Baranasic
- MRC London Institute of Medical Sciences, Du Cane Road, London W12 0NN, UK
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Hospital Campus, Du Cane Road, London W12 0NN, UK
- Division of Electronics, Ruđer Bošković Institute, Bijenička cesta, 10000 Zagreb, Croatia
| | - Aziz Khan
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Stanford Cancer Institute, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Oriol Fornes
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - Sveinung Gundersen
- Center for Bioinformatics, Department of Informatics, University of Oslo, Oslo, Norway
| | - Morten Johansen
- Center for Bioinformatics, Department of Informatics, University of Oslo, Oslo, Norway
| | - Eivind Hovig
- Center for Bioinformatics, Department of Informatics, University of Oslo, Oslo, Norway
- Department of Tumor Biology, Institute for Cancer Research, Oslo University Hospital, 0424 Oslo, Norway
| | - Boris Lenhard
- MRC London Institute of Medical Sciences, Du Cane Road, London W12 0NN, UK
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Hospital Campus, Du Cane Road, London W12 0NN, UK
| | - Albin Sandelin
- Department of Biology and Biotech Research and Innovation Centre, University of Copenhagen, Ole Maaløes Vej 5, DK2200 Copenhagen N, Denmark
| | - Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - François Parcy
- Laboratoire Physiologie Cellulaire et Végétale, Univ. Grenoble Alpes, CNRS, CEA, INRAE, IRIG-DBSCI-LPCV, 17 avenue des martyrs, F-38054, Grenoble, France
| | - Anthony Mathelier
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Center for Bioinformatics, Department of Informatics, University of Oslo, Oslo, Norway
- Department of Medical Genetics, Institute of Clinical Medicine, University of Oslo and Oslo University Hospital, Oslo, Norway
| |
|
47
|
Xu A, Tang LC, Jovanovic M, Regev O. Uncovering Distinct Peptide Charging Behaviors in Electrospray Ionization Mass Spectrometry Using a Large-Scale Dataset. J Am Soc Mass Spectrom 2024; 35:90-99. [PMID: 38095561 PMCID: PMC10767741 DOI: 10.1021/jasms.3c00325] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/14/2023] [Revised: 11/20/2023] [Accepted: 11/22/2023] [Indexed: 12/26/2023]
Abstract
Electrospray ionization is a powerful and prevalent technique used to ionize analytes in mass spectrometry. The distribution of charges that an analyte receives (charge state distribution, CSD) is an important consideration for interpreting mass spectra. However, due to an incomplete understanding of the ionization mechanism, the analyte properties that influence CSDs are not fully understood. Here, we employ a machine learning-based approach and analyze CSDs of hundreds of thousands of peptides. Interestingly, half of the peptides exhibit charges that differ from what one would naively expect (the number of basic sites). We find that these peptides can be classified into two regimes (undercharging and overcharging) and that these two regimes display markedly different charging characteristics. Notably, peptides in the overcharging regime show minimal dependence on basic site count, and more generally, the two regimes exhibit distinct sequence determinants. These findings highlight the rich ionization behavior of peptides and the potential of CSDs for enhancing peptide identification.
Affiliation(s)
- Allyn M. Xu
- Department of Mathematics, Courant Institute of Mathematical Sciences, New York University, New York, New York 10012, United States
| | - Lauren C. Tang
- Department of Biological Sciences, Columbia University, New York, New York 10027, United States
| | - Marko Jovanovic
- Department of Biological Sciences, Columbia University, New York, New York 10027, United States
| | - Oded Regev
- Computer Science Department, Courant Institute of Mathematical Sciences, New York University, New York, New York 10012, United States
| |
|
48
|
Huang X, Rymbekova A, Dolgova O, Lao O, Kuhlwilm M. Harnessing deep learning for population genetic inference. Nat Rev Genet 2024; 25:61-78. [PMID: 37666948 DOI: 10.1038/s41576-023-00636-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Accepted: 07/11/2023] [Indexed: 09/06/2023]
Abstract
In population genetics, the emergence of large-scale genomic data for various species and populations has provided new opportunities to understand the evolutionary forces that drive genetic diversity using statistical inference. However, the era of population genomics presents new challenges in analysing the massive amounts of genomes and variants. Deep learning has demonstrated state-of-the-art performance for numerous applications involving large-scale data. Recently, deep learning approaches have gained popularity in population genetics; facilitated by the advent of massive genomic data sets, powerful computational hardware and complex deep learning architectures, they have been used to identify population structure, infer demographic history and investigate natural selection. Here, we introduce common deep learning architectures and provide comprehensive guidelines for implementing deep learning models for population genetic inference. We also discuss current challenges and future directions for applying deep learning in population genetics, focusing on efficiency, robustness and interpretability.
Affiliation(s)
- Xin Huang
- Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria.
- Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria.
| | - Aigerim Rymbekova
- Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria
- Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria
| | - Olga Dolgova
- Integrative Genomics Laboratory, CIC bioGUNE - Centro de Investigación Cooperativa en Biociencias, Derio, Biscaya, Spain
| | - Oscar Lao
- Institute of Evolutionary Biology, CSIC-Universitat Pompeu Fabra, Barcelona, Spain.
| | - Martin Kuhlwilm
- Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria.
- Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria.
| |
|
49
|
Zheng W, Fong JHC, Wan YK, Chu AHY, Huang Y, Wong ASL, Ho JWK. Discovery of regulatory motifs in 5' untranslated regions using interpretable multi-task learning models. Cell Syst 2023; 14:1103-1112.e6. [PMID: 38016465 DOI: 10.1016/j.cels.2023.10.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Revised: 09/18/2023] [Accepted: 10/31/2023] [Indexed: 11/30/2023]
Abstract
The sequence in the 5' untranslated regions (UTRs) is known to affect mRNA translation rates. However, the underlying regulatory grammar remains elusive. Here, we propose MTtrans, a multi-task translation rate predictor capable of learning common sequence patterns from datasets across various experimental techniques. The core premise is that common motifs are more likely to be genuinely involved in translation control. MTtrans outperforms existing methods in both accuracy and the ability to capture transferable motifs across species, highlighting its strength in identifying evolutionarily conserved sequence motifs. Our independent fluorescence-activated cell sorting coupled with deep sequencing (FACS-seq) experiment validates the impact of most motifs identified by MTtrans. Additionally, we introduce "GRU-rewiring," a technique to interpret the hidden states of the recurrent units. Gated recurrent unit (GRU)-rewiring allows us to identify regulatory element-enriched positions and examine the local effects of 5' UTR mutations. MTtrans is a powerful tool for deciphering the translation regulatory motifs.
Affiliation(s)
- Weizhong Zheng
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
| | - John H C Fong
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
| | - Yuk Kei Wan
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
| | - Athena H Y Chu
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
| | - Yuanhua Huang
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong SAR, China; Center for Translational Stem Cell Biology, Hong Kong Science and Technology Park, Hong Kong SAR, China
| | - Alan S L Wong
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China; Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong SAR, China
| | - Joshua W K Ho
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Laboratory of Data Discovery for Health (D24H) Limited, Hong Kong Science Park, Hong Kong SAR, China.
| |
|
50
|
Sasse A, Ng B, Spiro AE, Tasaki S, Bennett DA, Gaiteri C, De Jager PL, Chikina M, Mostafavi S. Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings. Nat Genet 2023; 55:2060-2064. [PMID: 38036778 DOI: 10.1038/s41588-023-01524-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Accepted: 09/08/2023] [Indexed: 12/02/2023]
Abstract
Deep learning methods have recently become the state of the art in a variety of regulatory genomic tasks [1-6], including the prediction of gene expression from genomic DNA. As such, these methods promise to serve as important tools in interpreting the full spectrum of genetic variation observed in personal genomes. Previous evaluation strategies have assessed their predictions of gene expression across genomic regions; however, systematic benchmarking is lacking to assess their predictions across individuals, which would directly evaluate their utility as personal DNA interpreters. We used paired whole genome sequencing and gene expression from 839 individuals in the ROSMAP study [7] to evaluate the ability of current methods to predict gene expression variation across individuals at varied loci. Our approach identifies a limitation of current methods to correctly predict the direction of variant effects. We show that this limitation stems from insufficiently learned sequence motif grammar and suggest new model training strategies to improve performance.
Affiliation(s)
- Alexander Sasse
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
| | - Bernard Ng
- Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, IL, USA
| | - Anna E Spiro
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
| | - Shinya Tasaki
- Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, IL, USA
| | - David A Bennett
- Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, IL, USA
| | - Christopher Gaiteri
- Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, IL, USA
- Department of Psychiatry, SUNY Upstate Medical University, Syracuse, NY, USA
| | - Philip L De Jager
- Center for Translational & Computational Neuroimmunology, Department of Neurology, and the Taub Institute for the Study of Alzheimer's Disease and the Aging Brain, Columbia University Irving Medical Center, New York, NY, USA
| | - Maria Chikina
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA.
| | - Sara Mostafavi
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA.
- Canadian Institute for Advanced Research, Toronto, Ontario, Canada.
| |
|