1
Politano G, Benso A, Rehman HU, Re A. PRONTO-TK: a user-friendly PROtein Neural neTwOrk tool-kit for accessible protein function prediction. NAR Genom Bioinform 2024; 6:lqae112. [PMID: 39193069 PMCID: PMC11348006 DOI: 10.1093/nargab/lqae112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2024] [Revised: 08/01/2024] [Accepted: 08/15/2024] [Indexed: 08/29/2024] Open
Abstract
Associating one or more Gene Ontology (GO) terms with a protein means making a statement about a particular functional characteristic of the protein. This association provides scientists with a snapshot of the biological context of the protein activity. This paper introduces PRONTO-TK, a Python-based software toolkit designed to democratize access to complex neural-network-based protein function prediction workflows. PRONTO-TK provides a user-friendly graphical user interface (GUI) that empowers researchers, even those with minimal programming experience, to leverage state-of-the-art Deep Learning architectures for protein function annotation using GO terms. We demonstrate PRONTO-TK's effectiveness on a running example, showing how its intuitive configuration makes it easy to generate complex analyses while avoiding the complexities of building such a pipeline from scratch.
Affiliation(s)
- Gianfranco Politano
- Department of Control and Computer Engineering, Politecnico di Torino, Torino, 10129, Italy
- Alfredo Benso
- Department of Control and Computer Engineering, Politecnico di Torino, Torino, 10129, Italy
- Hafeez Ur Rehman
- School of Computing and Data Sciences, Oryx Universal College with Liverpool John Moores University, Qatar
- Angela Re
- Department of Applied Science and Technology, Politecnico di Torino, Torino, 10129, Italy
2
Romero R, Menichelli C, Vroland C, Marin JM, Lèbre S, Lecellier CH, Bréhélin L. TFscope: systematic analysis of the sequence features involved in the binding preferences of transcription factors. Genome Biol 2024; 25:187. [PMID: 38987807 DOI: 10.1186/s13059-024-03321-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2022] [Accepted: 06/24/2024] [Indexed: 07/12/2024] Open
Abstract
Characterizing the binding preferences of transcription factors (TFs) in different cell types and conditions is key to understanding how they orchestrate gene expression. Here, we develop TFscope, a machine learning approach that identifies sequence features explaining the binding differences observed between two ChIP-seq experiments targeting either the same TF in two conditions or two TFs with similar motifs (paralogous TFs). TFscope systematically investigates differences in the core motif, nucleotide environment and co-factor motifs, and provides the contribution of each key feature in the two experiments. TFscope was applied to > 305 ChIP-seq pairs, and several examples are discussed.
Affiliation(s)
- Raphaël Romero
- LIRMM, Univ Montpellier, CNRS, Montpellier, France
- IMAG, Univ Montpellier, CNRS, Montpellier, France
- Christophe Vroland
- LIRMM, Univ Montpellier, CNRS, Montpellier, France
- Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
- Sophie Lèbre
- IMAG, Univ Montpellier, CNRS, Montpellier, France.
- AMIS, Université Paul-Valéry-Montpellier 3, Montpellier, France.
- Charles-Henri Lecellier
- LIRMM, Univ Montpellier, CNRS, Montpellier, France.
- Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France.
3
Li J, Zhang D, Yang F, Zhang Q, Pan S, Zhao X, Zhang Q, Han Y, Yang J, Wang K, Zhao C. TrG2P: A transfer-learning-based tool integrating multi-trait data for accurate prediction of crop yield. Plant Commun 2024; 5:100975. [PMID: 38751121 PMCID: PMC11287160 DOI: 10.1016/j.xplc.2024.100975] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2023] [Revised: 04/14/2024] [Accepted: 05/11/2024] [Indexed: 06/24/2024]
Abstract
Yield prediction is the primary goal of genomic selection (GS)-assisted crop breeding. Because yield is a complex quantitative trait, making predictions from genotypic data is challenging. Transfer learning can produce an effective model for a target task by leveraging knowledge from a different, but related, source domain and is considered a great potential method for improving yield prediction by integrating multi-trait data. However, it has not previously been applied to genotype-to-phenotype prediction owing to the lack of an efficient implementation framework. We therefore developed TrG2P, a transfer-learning-based framework. TrG2P first employs convolutional neural networks (CNN) to train models using non-yield-trait phenotypic and genotypic data, thus obtaining pre-trained models. Subsequently, the convolutional layer parameters from these pre-trained models are transferred to the yield prediction task, and the fully connected layers are retrained, thus obtaining fine-tuned models. Finally, the convolutional layer and the first fully connected layer of the fine-tuned models are fused, and the last fully connected layer is trained to enhance prediction performance. We applied TrG2P to five sets of genotypic and phenotypic data from maize (Zea mays), rice (Oryza sativa), and wheat (Triticum aestivum) and compared its model precision to that of seven other popular GS tools: ridge regression best linear unbiased prediction (rrBLUP), random forest, support vector regression, light gradient boosting machine (LightGBM), CNN, DeepGS, and deep neural network for genomic prediction (DNNGP). TrG2P improved the accuracy of yield prediction by 39.9%, 6.8%, and 1.8% in rice, maize, and wheat, respectively, compared with predictions generated by the best-performing comparison model. Our work therefore demonstrates that transfer learning is an effective strategy for improving yield prediction by integrating information from non-yield-trait data. 
We attribute its enhanced prediction accuracy to the valuable information available from traits associated with yield and to training dataset augmentation. The Python implementation of TrG2P is available at https://github.com/lijinlong1991/TrG2P. The web-based tool is available at http://trg2p.ebreed.cn:81.
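The transfer step described in this abstract (carry over convolutional-layer parameters, then retrain the fully connected layers) can be illustrated with a minimal, framework-free sketch. This is not the TrG2P implementation: the dictionary-based model, the names `transfer_conv_parameters` and `sgd_step`, the parameter values, and the choice to keep the transferred layers frozen during retraining are all assumptions made for illustration.

```python
import copy


def transfer_conv_parameters(pretrained, target):
    """Copy every parameter whose name starts with 'conv' into a copy of target.

    Returns the new model and the set of transferred (frozen) parameter names.
    Freezing the transferred layers is an assumption of this sketch.
    """
    new_model = copy.deepcopy(target)
    frozen = set()
    for name, values in pretrained.items():
        if name.startswith("conv"):
            new_model[name] = list(values)
            frozen.add(name)
    return new_model, frozen


def sgd_step(model, grads, frozen, lr=0.1):
    """One gradient step that leaves frozen parameters untouched."""
    return {
        name: list(values) if name in frozen
        else [v - lr * g for v, g in zip(values, grads[name])]
        for name, values in model.items()
    }


# Hypothetical parameters: a model pre-trained on a non-yield trait
# and a freshly initialised model for the yield task.
pretrained = {"conv1": [0.5, -0.2], "fc1": [1.0, 1.0]}
fresh = {"conv1": [0.0, 0.0], "fc1": [0.0, 0.0]}

model, frozen = transfer_conv_parameters(pretrained, fresh)
grads = {"conv1": [1.0, 1.0], "fc1": [1.0, -1.0]}  # made-up gradients
updated = sgd_step(model, grads, frozen)
print(updated)
```

Only `fc1` changes after the update; `conv1` keeps the values copied from the pre-trained model, which is the behaviour the abstract describes for the fine-tuning stage.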
Affiliation(s)
- Jinlong Li
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
- Dongfeng Zhang
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
- Feng Yang
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
- Qiusi Zhang
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
- Shouhui Pan
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
- Xiangyu Zhao
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
- Qi Zhang
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
- Yanyun Han
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
- Jinliang Yang
- Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, NE 68583, USA; Center for Plant Science Innovation, University of Nebraska-Lincoln, Lincoln, NE 68583, USA
- Kaiyi Wang
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China.
- Chunjiang Zhao
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China.
4
Naqvi S, Kim S, Tabatabaee S, Pampari A, Kundaje A, Pritchard JK, Wysocka J. Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage. bioRxiv 2024:2024.05.28.596078. [PMID: 38853998 PMCID: PMC11160683 DOI: 10.1101/2024.05.28.596078] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]
Abstract
Deep learning approaches have made significant advances in predicting cell type-specific chromatin patterns from the identity and arrangement of transcription factor (TF) binding motifs. However, most models have been applied in unperturbed contexts, precluding a predictive understanding of how chromatin state responds to TF perturbation. Here, we used transfer learning to train and interpret deep learning models that use DNA sequence to predict, with accuracy approaching experimental reproducibility, how the concentration of two dosage-sensitive TFs (TWIST1, SOX9) affects regulatory element (RE) chromatin accessibility in facial progenitor cells. High-affinity motifs that allow for heterotypic TF co-binding and are concentrated at the center of REs buffer against quantitative changes in TF dosage and strongly predict unperturbed accessibility. In contrast, motifs with low-affinity or homotypic binding distributed throughout REs lead to sensitive responses with minimal contributions to unperturbed accessibility. Both buffering and sensitizing features show signatures of purifying selection. We validated these predictive sequence features using reporter assays and showed that a biophysical model of TF-nucleosome competition can explain the sensitizing effect of low-affinity motifs. Our approach of combining transfer learning and quantitative measurements of the chromatin response to TF dosage therefore represents a powerful method to reveal additional layers of the cis-regulatory code.
Affiliation(s)
- Sahin Naqvi
- Departments of Chemical and Systems Biology and Developmental Biology, Stanford University School of Medicine, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, California, USA
- Division of Gastroenterology, Hepatology, and Nutrition, Boston Children’s Hospital, Boston, MA, USA
- Department of Pediatrics, Harvard Medical School, Boston, MA, USA
- Lead contact
- Seungsoo Kim
- Departments of Chemical and Systems Biology and Developmental Biology, Stanford University School of Medicine, Stanford, CA, USA
- Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, CA, USA
- These authors contributed equally
- Saman Tabatabaee
- Departments of Chemical and Systems Biology and Developmental Biology, Stanford University School of Medicine, Stanford, CA, USA
- These authors contributed equally
- Anusri Pampari
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Anshul Kundaje
- Department of Genetics, Stanford University, Stanford, California, USA
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Jonathan K Pritchard
- Department of Genetics, Stanford University, Stanford, California, USA
- Department of Biology, Stanford University, Stanford, CA, USA
- Joanna Wysocka
- Departments of Chemical and Systems Biology and Developmental Biology, Stanford University School of Medicine, Stanford, CA, USA
- Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, CA, USA
5
Wu Y, Shao W, Yan M, Wang Y, Xu P, Huang G, Li X, Gregory BD, Yang J, Wang H, Yu X. Transfer learning enables identification of multiple types of RNA modifications using nanopore direct RNA sequencing. Nat Commun 2024; 15:4049. [PMID: 38744925 PMCID: PMC11094168 DOI: 10.1038/s41467-024-48437-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Accepted: 04/26/2024] [Indexed: 05/16/2024] Open
Abstract
Nanopore direct RNA sequencing (DRS) has emerged as a powerful tool for RNA modification identification. However, concurrently detecting multiple types of modifications in a single DRS sample remains a challenge. Here, we develop TandemMod, a transferable deep learning framework capable of detecting multiple types of RNA modifications in single DRS data. To train high-performance TandemMod models, we generate in vitro epitranscriptome datasets from cDNA libraries, containing thousands of transcripts labeled with various types of RNA modifications. We validate the performance of TandemMod on both in vitro transcripts and in vivo human cell lines, confirming its high accuracy for profiling m6A and m5C modification sites. Furthermore, we perform transfer learning for identifying other modifications such as m7G, Ψ, and inosine, significantly reducing training data size and running time without compromising performance. Finally, we apply TandemMod to identify 3 types of RNA modifications in rice grown in different environments, demonstrating its applicability across species and conditions. In summary, we provide a resource with ground-truth labels that can serve as benchmark datasets for nanopore-based modification identification methods, and TandemMod for identifying diverse RNA modifications using a single DRS sample.
Affiliation(s)
- You Wu
- Joint International Research Laboratory of Metabolic & Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
- Wenna Shao
- Joint International Research Laboratory of Metabolic & Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
- Mengxiao Yan
- Shanghai Key Laboratory of Plant Functional Genomics and Resources, Shanghai Chenshan Botanical Garden, Shanghai, 201602, China
- Yuqin Wang
- Shanghai Key Laboratory of Plant Functional Genomics and Resources, Shanghai Chenshan Botanical Garden, Shanghai, 201602, China
- Pengfei Xu
- Joint International Research Laboratory of Metabolic & Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
- Guoqiang Huang
- Joint International Research Laboratory of Metabolic & Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
- Xiaofei Li
- Joint International Research Laboratory of Metabolic & Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
- Brian D Gregory
- Department of Biology, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Jun Yang
- Shanghai Key Laboratory of Plant Functional Genomics and Resources, Shanghai Chenshan Botanical Garden, Shanghai, 201602, China.
- Chenshan Scientific Research Center of CAS Center for Excellence in Molecular Plant Sciences, Shanghai, 201602, China.
- Hongxia Wang
- Shanghai Key Laboratory of Plant Functional Genomics and Resources, Shanghai Chenshan Botanical Garden, Shanghai, 201602, China.
- Chenshan Scientific Research Center of CAS Center for Excellence in Molecular Plant Sciences, Shanghai, 201602, China.
- Xiang Yu
- Joint International Research Laboratory of Metabolic & Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China.
6
de Almeida BP, Schaub C, Pagani M, Secchia S, Furlong EEM, Stark A. Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo. Nature 2024; 626:207-211. [PMID: 38086418 PMCID: PMC10830412 DOI: 10.1038/s41586-023-06905-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 06/19/2023] [Accepted: 11/28/2023] [Indexed: 01/19/2024]
Abstract
Enhancers control gene expression and have crucial roles in development and homeostasis. However, the targeted de novo design of enhancers with tissue-specific activities has remained challenging. Here we combine deep learning and transfer learning to design tissue-specific enhancers for five tissues in the Drosophila melanogaster embryo: the central nervous system, epidermis, gut, muscle and brain. We first train convolutional neural networks using genome-wide single-cell assay for transposase-accessible chromatin with sequencing (ATAC-seq) datasets and then fine-tune the convolutional neural networks with smaller-scale data from in vivo enhancer activity assays, yielding models with 13% to 76% positive predictive value according to cross-validation. We designed and experimentally assessed 40 synthetic enhancers (8 per tissue) in vivo, of which 31 (78%) were active and 27 (68%) functioned in the target tissue (100% for central nervous system and muscle). The strategy of combining genome-wide and small-scale functional datasets by transfer learning is generally applicable and should enable the design of tissue-, cell type- and cell state-specific enhancers in any system.
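The success rates quoted in this abstract follow directly from the counts it reports. The short check below recomputes them with integer arithmetic; the helper name `percent_half_up` and the round-half-up convention are assumptions of this sketch, chosen because 31/40 and 27/40 both fall exactly on a half-percent boundary.

```python
def percent_half_up(count, total):
    """Integer percentage of count/total, rounding halves up, without floats."""
    return (200 * count + total) // (2 * total)


tissues = 5
per_tissue = 8
designed = tissues * per_tissue  # 40 synthetic enhancers in total
active = 31
on_target = 27

active_pct = percent_half_up(active, designed)
on_target_pct = percent_half_up(on_target, designed)

print(f"active:    {active}/{designed} = {active_pct}%")
print(f"on target: {on_target}/{designed} = {on_target_pct}%")
```

Both values are exact half percentages (77.5% and 67.5%), so the reported 78% and 68% correspond to rounding halves upward.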
Affiliation(s)
- Bernardo P de Almeida
- Research Institute of Molecular Pathology (IMP), Vienna BioCenter (VBC), Vienna, Austria
- Vienna BioCenter PhD Program, Doctoral School of the University of Vienna and Medical University of Vienna, Vienna, Austria
- InstaDeep, Paris, France
- Christoph Schaub
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
- Michaela Pagani
- Research Institute of Molecular Pathology (IMP), Vienna BioCenter (VBC), Vienna, Austria
- Stefano Secchia
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
- Eileen E M Furlong
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
- Alexander Stark
- Research Institute of Molecular Pathology (IMP), Vienna BioCenter (VBC), Vienna, Austria.
- Medical University of Vienna, Vienna BioCenter (VBC), Vienna, Austria.
7
Zheng W, Fong JHC, Wan YK, Chu AHY, Huang Y, Wong ASL, Ho JWK. Discovery of regulatory motifs in 5' untranslated regions using interpretable multi-task learning models. Cell Syst 2023; 14:1103-1112.e6. [PMID: 38016465 DOI: 10.1016/j.cels.2023.10.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Revised: 09/18/2023] [Accepted: 10/31/2023] [Indexed: 11/30/2023]
Abstract
The sequence in the 5' untranslated regions (UTRs) is known to affect mRNA translation rates. However, the underlying regulatory grammar remains elusive. Here, we propose MTtrans, a multi-task translation rate predictor capable of learning common sequence patterns from datasets across various experimental techniques. The core premise is that common motifs are more likely to be genuinely involved in translation control. MTtrans outperforms existing methods in both accuracy and the ability to capture transferable motifs across species, highlighting its strength in identifying evolutionarily conserved sequence motifs. Our independent fluorescence-activated cell sorting coupled with deep sequencing (FACS-seq) experiment validates the impact of most motifs identified by MTtrans. Additionally, we introduce "GRU-rewiring," a technique to interpret the hidden states of the recurrent units. Gated recurrent unit (GRU)-rewiring allows us to identify regulatory element-enriched positions and examine the local effects of 5' UTR mutations. MTtrans is a powerful tool for deciphering the translation regulatory motifs.
Affiliation(s)
- Weizhong Zheng
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- John H C Fong
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Yuk Kei Wan
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Athena H Y Chu
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
- Yuanhua Huang
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong SAR, China; Center for Translational Stem Cell Biology, Hong Kong Science and Technology Park, Hong Kong SAR, China
- Alan S L Wong
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China; Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong SAR, China
- Joshua W K Ho
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Laboratory of Data Discovery for Health (D24H) Limited, Hong Kong Science Park, Hong Kong SAR, China.
8
Tahara S, Tsuchiya T, Matsumoto H, Ozaki H. Transcription factor-binding k-mer analysis clarifies the cell type dependency of binding specificities and cis-regulatory SNPs in humans. BMC Genomics 2023; 24:597. [PMID: 37805453 PMCID: PMC10560430 DOI: 10.1186/s12864-023-09692-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Accepted: 09/21/2023] [Indexed: 10/09/2023] Open
Abstract
BACKGROUND Transcription factors (TFs) exhibit heterogeneous DNA-binding specificities in individual cells and whole organisms under natural conditions, and de novo motif discovery usually provides multiple motifs, even from a single chromatin immunoprecipitation-sequencing (ChIP-seq) sample. Despite the accumulation of ChIP-seq data and ChIP-seq-derived motifs, the diversity of DNA-binding specificities across different TFs and cell types remains largely unexplored. RESULTS Here, we applied MOCCS2, our k-mer-based motif discovery method, to a collection of human TF ChIP-seq samples across diverse TFs and cell types, and systematically computed profiles of TF-binding specificity scores for all k-mers. After quality control, we compiled a set of TF-binding specificity score profiles for 2,976 high-quality ChIP-seq samples, comprising 473 TFs and 398 cell types. Using these high-quality samples, we confirmed that the k-mer-based TF-binding specificity profiles reflected TF- or TF-family dependent DNA-binding specificities. We then compared the binding specificity scores of ChIP-seq samples with the same TFs but with different cell type classes and found that half of the analyzed TFs exhibited differences in DNA-binding specificities across cell type classes. Additionally, we devised a method to detect differentially bound k-mers between two ChIP-seq samples and detected k-mers exhibiting statistically significant differences in binding specificity scores. Moreover, we demonstrated that differences in the binding specificity scores between k-mers on the reference and alternative alleles could be used to predict the effect of variants on TF binding, as validated by in vitro and in vivo assay datasets. Finally, we demonstrated that binding specificity score differences can be used to interpret disease-associated non-coding single-nucleotide polymorphisms (SNPs) as TF-affecting SNPs and provide candidates responsible for TFs and cell types. 
CONCLUSIONS Our study provides a basis for investigating the regulation of gene expression in a TF-, TF family-, or cell-type-dependent manner. Furthermore, our differential analysis of binding-specificity scores highlights noncoding disease-associated variants in humans.
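The idea of comparing binding specificity scores between k-mers on the reference and alternative alleles can be sketched as follows. This is an illustrative toy, not MOCCS2 itself: the function names, the score table, and the choice to sum scores over all k-mers covering the variant are assumptions made here, and the abstract does not specify how the published method aggregates scores.

```python
def kmers_overlapping(seq, pos, k):
    """Return every k-mer of seq that covers the 0-based position pos."""
    start = max(0, pos - k + 1)
    end = min(pos, len(seq) - k)
    return [seq[i:i + k] for i in range(start, end + 1)]


def allele_score_difference(ref_seq, pos, alt_base, scores, k):
    """Summed k-mer score on the alternative allele minus the reference allele.

    scores maps k-mers to a binding specificity score; k-mers missing from
    the table count as 0.0 (an assumption of this sketch).
    """
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    ref_total = sum(scores.get(km, 0.0) for km in kmers_overlapping(ref_seq, pos, k))
    alt_total = sum(scores.get(km, 0.0) for km in kmers_overlapping(alt_seq, pos, k))
    return alt_total - ref_total


# Hypothetical sequence, variant, and score table.
ref_seq = "ACGTAC"
scores = {"CGT": 2.0, "ACT": 0.5}

ref_kmers = kmers_overlapping(ref_seq, pos=2, k=3)
delta = allele_score_difference(ref_seq, pos=2, alt_base="T", scores=scores, k=3)
print(ref_kmers, delta)
```

A negative difference in this toy setup would suggest that the alternative allele weakens binding relative to the reference, under the stated scoring assumptions.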
Affiliation(s)
- Saeko Tahara
- Bioinformatics Laboratory, Institute of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
- School of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
- Takaho Tsuchiya
- Bioinformatics Laboratory, Institute of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
- Center for Artificial Intelligence Research, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
- Hirotaka Matsumoto
- School of Information and Data Sciences, Nagasaki University, 1-14, Bunkyo-Machi, Nagasaki City, Nagasaki, 852-8521, Japan
- Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics, Wako, Saitama, 351-0198, Japan
- Haruka Ozaki
- Bioinformatics Laboratory, Institute of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan.
- Center for Artificial Intelligence Research, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan.
- Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics, Wako, Saitama, 351-0198, Japan.
9
Wu KE, Zou JY, Chang H. Machine learning modeling of RNA structures: methods, challenges and future perspectives. Brief Bioinform 2023; 24:bbad210. [PMID: 37280185 DOI: 10.1093/bib/bbad210] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 01/13/2023] [Revised: 05/12/2023] [Accepted: 05/17/2023] [Indexed: 06/08/2023] Open
Abstract
The three-dimensional structure of RNA molecules plays a critical role in a wide range of cellular processes encompassing functions from riboswitches to epigenetic regulation. These RNA structures are incredibly dynamic and can indeed be described aptly as an ensemble of structures that shifts in distribution depending on different cellular conditions. Thus, the computational prediction of RNA structure poses a unique challenge, even as computational protein folding has seen great advances. In this review, we focus on a variety of machine learning-based methods that have been developed to predict RNA molecules' secondary structure, as well as more complex tertiary structures. We survey commonly used modeling strategies, and how many are inspired by or incorporate thermodynamic principles. We discuss the shortcomings that various design decisions entail and propose future directions that could build off these methods to yield more robust, accurate RNA structure predictions.
Affiliation(s)
- Kevin E Wu
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, USA
- James Y Zou
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, USA
- Howard Chang
- Howard Hughes Medical Institute, Stanford University, Stanford, CA 94305, USA
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, USA
10
Toseef M, Olayemi Petinrin O, Wang F, Rahaman S, Liu Z, Li X, Wong KC. Deep transfer learning for clinical decision-making based on high-throughput data: comprehensive survey with benchmark results. Brief Bioinform 2023:bbad254. [PMID: 37455245 DOI: 10.1093/bib/bbad254] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2023] [Revised: 06/04/2023] [Accepted: 06/20/2023] [Indexed: 07/18/2023] Open
Abstract
The rapid growth of omics-based data has revolutionized biomedical research and precision medicine, allowing machine learning models to be developed for cutting-edge performance. However, despite the wealth of high-throughput data available, the performance of these models is hindered by the lack of sufficient training data, particularly in clinical research (in vivo experiments). As a result, translating this knowledge into clinical practice, such as predicting drug responses, remains a challenging task. Transfer learning is a promising tool that bridges the gap between data domains by transferring knowledge from the source to the target domain. Researchers have proposed transfer learning to predict clinical outcomes by leveraging pre-clinical data (mouse, zebrafish), highlighting its vast potential. In this work, we present a comprehensive literature review of deep transfer learning methods for health informatics and clinical decision-making, focusing on high-throughput molecular data. Previous reviews mostly covered image-based transfer learning works, while we present a more detailed analysis of transfer learning papers. Furthermore, we evaluated original studies based on different evaluation settings across cross-validations, data splits and model architectures. The result shows that those transfer learning methods have great potential; high-throughput sequencing data and state-of-the-art deep learning models lead to significant insights and conclusions. Additionally, we explored various datasets in transfer learning papers with statistics and visualization.
Affiliation(s)
- Muhammad Toseef
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
- Fuzhou Wang
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
- Saifur Rahaman
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
- Zhe Liu
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
- Xiangtao Li
- School of Artificial Intelligence, Jilin University, Jilin, China
- Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
- Hong Kong Institute for Data Science, City University of Hong Kong, Hong Kong SAR
11
Novakovsky G, Fornes O, Saraswat M, Mostafavi S, Wasserman WW. ExplaiNN: interpretable and transparent neural networks for genomics. Genome Biol 2023; 24:154. [PMID: 37370113 DOI: 10.1186/s13059-023-02985-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 11/20/2022] [Accepted: 06/12/2023] [Indexed: 06/29/2023] Open
Abstract
Deep learning models such as convolutional neural networks (CNNs) excel in genomic tasks but lack interpretability. We introduce ExplaiNN, which combines the expressiveness of CNNs with the interpretability of linear models. ExplaiNN can predict TF binding, chromatin accessibility, and de novo motifs, achieving performance comparable to state-of-the-art methods. Its predictions are transparent, providing global (cell state level) as well as local (individual sequence level) biological insights into the data. ExplaiNN can serve as a plug-and-play platform for pretrained models and annotated position weight matrices. ExplaiNN aims to accelerate the adoption of deep learning in genomic sequence analysis by domain experts.
Affiliation(s)
- Gherman Novakovsky
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
- Oriol Fornes
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
- Manu Saraswat
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
  - Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany
  - European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
- Sara Mostafavi
  - Paul G. Allen School of Computer Science and Engineering, University of Washington (UW), Seattle, USA
- Wyeth W Wasserman
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada

12
Salvatore M, Horlacher M, Marsico A, Winther O, Andersson R. Transfer learning identifies sequence determinants of cell-type specific regulatory element accessibility. NAR Genom Bioinform 2023; 5:lqad026. [PMID: 37007588 PMCID: PMC10052367 DOI: 10.1093/nargab/lqad026] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/22/2022] [Revised: 03/01/2023] [Accepted: 03/07/2023] [Indexed: 04/03/2023] Open
Abstract
Dysfunction of regulatory elements through genetic variants is a central mechanism in the pathogenesis of disease. To better understand disease etiology, there is consequently a need to understand how DNA encodes regulatory activity. Deep learning methods show great promise for modeling of biomolecular data from DNA sequence but are limited to large input data for training. Here, we develop ChromTransfer, a transfer learning method that uses a pre-trained, cell-type agnostic model of open chromatin regions as a basis for fine-tuning on regulatory sequences. We demonstrate superior performances with ChromTransfer for learning cell-type specific chromatin accessibility from sequence compared to models not informed by a pre-trained model. Importantly, ChromTransfer enables fine-tuning on small input data with minimal decrease in accuracy. We show that ChromTransfer uses sequence features matching binding site sequences of key transcription factors for prediction. Together, these results demonstrate ChromTransfer as a promising tool for learning the regulatory code.
Affiliation(s)
- Marco Salvatore
  - Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, 2200, Copenhagen, Denmark
  - Abzu ApS, 2150, Copenhagen, Denmark
- Marc Horlacher
  - Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, 2200, Copenhagen, Denmark
  - Department of Computer Science, Technical University Munich, Munich, Germany
  - Computational Health Center, Helmholtz Center Munich, Munich, Germany
- Annalisa Marsico
  - Computational Health Center, Helmholtz Center Munich, Munich, Germany
- Ole Winther
  - Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, 2200, Copenhagen, Denmark
  - Section for Cognitive Systems, DTU Compute, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
  - Department of Genomic Medicine, Rigshospitalet, 2100 Copenhagen, Denmark
- Robin Andersson
  - Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, 2200, Copenhagen, Denmark
  - The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA

13
Smith GD, Ching WH, Cornejo-Páramo P, Wong ES. Decoding enhancer complexity with machine learning and high-throughput discovery. Genome Biol 2023; 24:116. [PMID: 37173718 PMCID: PMC10176946 DOI: 10.1186/s13059-023-02955-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/18/2022] [Accepted: 04/28/2023] [Indexed: 05/15/2023] Open
Abstract
Enhancers are genomic DNA elements controlling spatiotemporal gene expression. Their flexible organization and functional redundancies make deciphering their sequence-function relationships challenging. This article provides an overview of the current understanding of enhancer organization and evolution, with an emphasis on factors that influence these relationships. Technological advancements, particularly in machine learning and synthetic biology, are discussed in light of how they provide new ways to understand this complexity. Exciting opportunities lie ahead as we continue to unravel the intricacies of enhancer function.
Affiliation(s)
- Gabrielle D Smith
  - Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
  - School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
- Wan Hern Ching
  - Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
- Paola Cornejo-Páramo
  - Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
  - School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
- Emily S Wong
  - Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
  - School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia

14
Steyaert S, Pizurica M, Nagaraj D, Khandelwal P, Hernandez-Boussard T, Gentles AJ, Gevaert O. Multimodal data fusion for cancer biomarker discovery with deep learning. Nat Mach Intell 2023; 5:351-362. [PMID: 37693852 PMCID: PMC10484010 DOI: 10.1038/s42256-023-00633-5] [Citation(s) in RCA: 27] [Impact Index Per Article: 27.0] [Received: 07/28/2022] [Accepted: 02/17/2023] [Indexed: 09/12/2023]
Abstract
Technological advances now make it possible to study a patient from multiple angles with high-dimensional, high-throughput multi-scale biomedical data. In oncology, massive amounts of data are being generated ranging from molecular, histopathology, radiology to clinical records. The introduction of deep learning has significantly advanced the analysis of biomedical data. However, most approaches focus on single data modalities leading to slow progress in methods to integrate complementary data types. Development of effective multimodal fusion approaches is becoming increasingly important as a single modality might not be consistent and sufficient to capture the heterogeneity of complex diseases to tailor medical care and improve personalised medicine. Many initiatives now focus on integrating these disparate modalities to unravel the biological processes involved in multifactorial diseases such as cancer. However, many obstacles remain, including lack of usable data as well as methods for clinical validation and interpretation. Here, we cover these current challenges and reflect on opportunities through deep learning to tackle data sparsity and scarcity, multimodal interpretability, and standardisation of datasets.
Affiliation(s)
- Sandra Steyaert
  - Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University
- Marija Pizurica
  - Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University
- Tina Hernandez-Boussard
  - Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University
  - Department of Biomedical Data Science, Stanford University
- Andrew J Gentles
  - Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University
  - Department of Biomedical Data Science, Stanford University
- Olivier Gevaert
  - Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University
  - Department of Biomedical Data Science, Stanford University

15
Zheng A, Shen Z, Glass CK, Gymrek M. Deep learning predicts the impact of regulatory variants on cell-type-specific enhancers in the brain. Bioinform Adv 2023; 3:vbad002. [PMID: 36726730 PMCID: PMC9887460 DOI: 10.1093/bioadv/vbad002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2022] [Revised: 11/11/2022] [Accepted: 01/11/2023] [Indexed: 01/13/2023]
Abstract
Motivation: Previous studies have shown that the heritability of multiple brain-related traits and disorders is highly enriched in transcriptional enhancer regions. However, these regions often contain many individual variants, while only a subset of them are likely to causally contribute to a trait. Statistical fine-mapping techniques can identify putative causal variants, but their resolution is often limited, especially in regions with multiple variants in high linkage disequilibrium. In these cases, alternative computational methods to estimate the impact of individual variants can aid in variant prioritization. Results: Here, we develop a deep learning pipeline to predict cell-type-specific enhancer activity directly from genomic sequences and quantify the impact of individual genetic variants in these regions. We show that the variants highlighted by our deep learning models are targeted by purifying selection in the human population, likely indicating a functional role. We integrate our deep learning predictions with statistical fine-mapping results for 8 brain-related traits, identifying 63 distinct candidate causal variants predicted to contribute to these traits by modulating enhancer activity, representing 6% of all genome-wide association study signals analyzed. Overall, our study provides a valuable computational method that can prioritize individual variants based on their estimated regulatory impact, but also highlights the limitations of existing methods for variant prioritization and fine-mapping. Availability and implementation: The data underlying this article, nucleotide-level importance scores, and code for running the deep learning pipeline are available at https://github.com/Pandaman-Ryan/AgentBind-brain. Contact: mgymrek@ucsd.edu. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- An Zheng
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, USA
- Zeyang Shen
  - Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA 92093, USA
  - Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
- Christopher K Glass
  - Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA 92093, USA
  - Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA
- Melissa Gymrek
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, USA
  - Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA

16
Nora LC, Cassiano MHA, Santana ÍP, Guazzaroni ME, Silva-Rocha R, da Silva RR. Mining novel cis-regulatory elements from the emergent host Rhodosporidium toruloides using transcriptomic data. Front Microbiol 2023; 13:1069443. [PMID: 36687612 PMCID: PMC9853887 DOI: 10.3389/fmicb.2022.1069443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2022] [Accepted: 12/14/2022] [Indexed: 01/07/2023] Open
Abstract
The demand for robust microbial cell factories that produce valuable biomaterials while resisting stresses imposed by current bioprocesses is rapidly growing. Rhodosporidium toruloides is an emerging host that presents desirable features for bioproduction, since it can grow in a wide range of substrates and tolerate a variety of toxic compounds. To explore R. toruloides suitability for application as a cell factory in biorefineries, we sought to understand the transcriptional responses of this yeast when growing under experimental settings that simulated those used in biofuels-related industries. Thus, we performed RNA sequencing of the oleaginous, carotenogenic yeast in different contexts. The first ones were stress-related: two conditions of high temperature (37 and 42°C) and two ethanol concentrations (2 and 4%), while the other used the inexpensive and abundant sugarcane juice as substrate. Differential expression and functional analysis were implemented using transcriptomic data to select differentially expressed genes and enriched pathways from each set-up. A reproducible bioinformatics workflow was developed for mining new regulatory elements. We then predicted, for the first time in this yeast, binding motifs for several transcription factors, including HAC1, ARG80, RPN4, ADR1, and DAL81. Most putative transcription factors uncovered here were involved in stress responses and found in the yeast genome. Our method for motif discovery provides a new realm of possibilities in studying gene regulatory networks, not only for the emerging host R. toruloides, but for other organisms of biotechnological importance.
Affiliation(s)
- Luísa Czamanski Nora
  - Cell and Molecular Biology Department, Ribeirão Preto Medical School, University of São Paulo, Ribeirão Preto, SP, Brazil
- Ítalo Paulino Santana
  - Faculty of Philosophy, Sciences and Letters of Ribeirão Preto, University of São Paulo, Ribeirão Preto, SP, Brazil
- María-Eugenia Guazzaroni
  - Faculty of Philosophy, Sciences and Letters of Ribeirão Preto, University of São Paulo, Ribeirão Preto, SP, Brazil
- Rafael Silva-Rocha
  - Cell and Molecular Biology Department, Ribeirão Preto Medical School, University of São Paulo, Ribeirão Preto, SP, Brazil
- Ricardo Roberto da Silva
  - Faculty of Pharmaceutical Sciences of Ribeirão Preto, University of São Paulo, Ribeirão Preto, SP, Brazil

17
Toneyan S, Tang Z, Koo PK. Evaluating deep learning for predicting epigenomic profiles. Nat Mach Intell 2022. [DOI: 10.1038/s42256-022-00570-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2022]

18
Towards a better understanding of TF-DNA binding prediction from genomic features. Comput Biol Med 2022; 149:105993. [DOI: 10.1016/j.compbiomed.2022.105993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Revised: 07/12/2022] [Accepted: 08/14/2022] [Indexed: 11/17/2022]

19
Yi R, Cho K, Bonneau R. NetTIME: a Multitask and Base-pair Resolution Framework for Improved Transcription Factor Binding Site Prediction. Bioinformatics 2022; 38:4762-4770. [PMID: 35997560 PMCID: PMC9563695 DOI: 10.1093/bioinformatics/btac569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2021] [Revised: 08/16/2022] [Accepted: 08/20/2022] [Indexed: 12/05/2022] Open
Abstract
Motivation: Machine learning models for predicting cell-type-specific transcription factor (TF) binding sites have become increasingly accurate thanks to the increased availability of next-generation sequencing data and more standardized model evaluation criteria. However, knowledge transfer from data-rich to data-limited TFs and cell types remains crucial for improving TF binding prediction models because available binding labels are highly skewed towards a small collection of TFs and cell types. Transfer prediction of TF binding sites can potentially benefit from a multitask learning approach; however, existing methods typically use shallow single-task models to generate low-resolution predictions. Here, we propose NetTIME, a multitask learning framework for predicting cell-type-specific TF binding sites with base-pair resolution. Results: We show that the multitask learning strategy for TF binding prediction is more efficient than the single-task approach due to the increased data availability. NetTIME trains high-dimensional embedding vectors to distinguish TF and cell-type identities. We show that this approach is critical for the success of the multitask learning strategy and allows our model to make accurate transfer predictions within and beyond the training panels of TFs and cell types. We additionally train a linear-chain conditional random field (CRF) to classify binding predictions and show that this CRF eliminates the need for setting a probability threshold and reduces classification noise. We compare our method's predictive performance with two state-of-the-art methods, Catchitt and Leopard, and show that our method outperforms previous methods under both supervised and transfer learning settings. Availability and implementation: NetTIME is freely available at https://github.com/ryi06/NetTIME and the code is also archived at https://doi.org/10.5281/zenodo.6994897. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ren Yi
  - Department of Computer Science, New York University, New York, NY, 10011, USA
- Kyunghyun Cho
  - Department of Computer Science, New York University, New York, NY, 10011, USA
  - Center for Data Science, New York University, New York, NY, 10011, USA
  - Prescient Design, a Genentech accelerator, New York, NY, 10010, USA
- Richard Bonneau
  - Department of Computer Science, New York University, New York, NY, 10011, USA
  - Center for Data Science, New York University, New York, NY, 10011, USA
  - Department of Biology, New York University, New York, NY, 10003, USA
  - Prescient Design, a Genentech accelerator, New York, NY, 10010, USA

20
Zhang X, Yang Y, Shen YW, Zhang KR, Jiang ZK, Ma LT, Ding C, Wang BY, Meng Y, Liu H. Diagnostic accuracy and potential covariates of artificial intelligence for diagnosing orthopedic fractures: a systematic literature review and meta-analysis. Eur Radiol 2022; 32:7196-7216. [PMID: 35754091 DOI: 10.1007/s00330-022-08956-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 03/11/2022] [Revised: 05/07/2022] [Accepted: 06/08/2022] [Indexed: 02/05/2023]
Abstract
OBJECTIVES: To systematically quantify the diagnostic accuracy and identify potential covariates affecting the performance of artificial intelligence (AI) in diagnosing orthopedic fractures. METHODS: PubMed, Embase, Web of Science, and Cochrane Library were systematically searched for studies on AI applications in diagnosing orthopedic fractures from inception to September 29, 2021. Pooled sensitivity and specificity and the area under the receiver operating characteristic curves (AUC) were obtained. This study was registered in the PROSPERO database prior to initiation (CRD 42021254618). RESULTS: Thirty-nine studies were eligible for quantitative analysis. The overall pooled AUC, sensitivity, and specificity were 0.96 (95% CI 0.94-0.98), 90% (95% CI 87-92%), and 92% (95% CI 90-94%), respectively. In subgroup analyses, multicenter designed studies yielded higher sensitivity (92% vs. 88%) and specificity (94% vs. 91%) than single-center studies. AI demonstrated higher sensitivity with transfer learning (with vs. without: 92% vs. 87%) or data augmentation (with vs. without: 92% vs. 87%). Utilizing plain X-rays as input images for AI achieved results comparable to CT (AUC 0.96 vs. 0.96). Moreover, AI achieved comparable results to humans (AUC 0.97 vs. 0.97) and better results than non-expert human readers (AUC 0.98 vs. 0.96; sensitivity 95% vs. 88%). CONCLUSIONS: AI demonstrated high accuracy in diagnosing orthopedic fractures from medical images. Larger-scale studies with higher design quality are needed to validate our findings. KEY POINTS: • Multicenter study design, application of transfer learning, and data augmentation are closely related to improving the performance of artificial intelligence models in diagnosing orthopedic fractures. • Utilizing plain X-rays as input images for AI to diagnose fractures achieved results comparable to CT (AUC 0.96 vs. 0.96). • AI achieved comparable results to humans (AUC 0.97 vs. 0.97) but was superior to non-expert human readers (AUC 0.98 vs. 0.96, sensitivity 95% vs. 88%) in diagnosing fractures.
Affiliation(s)
- Xiang Zhang
  - Department of Orthopedics, Orthopedic Research Institute, West China Hospital, Sichuan University, No. 37 Guo Xue Rd, Chengdu, 610041, China
- Yi Yang
  - Department of Orthopedics, Orthopedic Research Institute, West China Hospital, Sichuan University, No. 37 Guo Xue Rd, Chengdu, 610041, China
- Yi-Wei Shen
  - Department of Orthopedics, Orthopedic Research Institute, West China Hospital, Sichuan University, No. 37 Guo Xue Rd, Chengdu, 610041, China
- Ke-Rui Zhang
  - Department of Orthopedics, Orthopedic Research Institute, West China Hospital, Sichuan University, No. 37 Guo Xue Rd, Chengdu, 610041, China
- Ze-Kun Jiang
  - West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, 610000, China
- Li-Tai Ma
  - Department of Orthopedics, Orthopedic Research Institute, West China Hospital, Sichuan University, No. 37 Guo Xue Rd, Chengdu, 610041, China
- Chen Ding
  - Department of Orthopedics, Orthopedic Research Institute, West China Hospital, Sichuan University, No. 37 Guo Xue Rd, Chengdu, 610041, China
- Bei-Yu Wang
  - Department of Orthopedics, Orthopedic Research Institute, West China Hospital, Sichuan University, No. 37 Guo Xue Rd, Chengdu, 610041, China
- Yang Meng
  - Department of Orthopedics, Orthopedic Research Institute, West China Hospital, Sichuan University, No. 37 Guo Xue Rd, Chengdu, 610041, China
- Hao Liu
  - Department of Orthopedics, Orthopedic Research Institute, West China Hospital, Sichuan University, No. 37 Guo Xue Rd, Chengdu, 610041, China

21
Peng X, Wang X, Guo Y, Ge Z, Li F, Gao X, Song J. RBP-TSTL is a two-stage transfer learning framework for genome-scale prediction of RNA-binding proteins. Brief Bioinform 2022; 23:6596984. [PMID: 35649392 PMCID: PMC9294422 DOI: 10.1093/bib/bbac215] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 03/17/2022] [Revised: 04/25/2022] [Accepted: 05/06/2022] [Indexed: 11/27/2022] Open
Abstract
RNA-binding proteins (RBPs) are critical for the post-transcriptional control of RNAs and play vital roles in a myriad of biological processes, such as RNA localization and gene regulation. Therefore, computational methods that are capable of accurately identifying RBPs are highly desirable and have important implications for biomedical and biotechnological applications. Here, we propose a two-stage deep transfer learning-based framework, termed RBP-TSTL, for accurate prediction of RBPs. In the first stage, the knowledge from the self-supervised pre-trained model was extracted as feature embeddings and used to represent the protein sequences, while in the second stage, a customized deep learning model was initialized based on an annotated pre-training RBP dataset before being fine-tuned on each corresponding target species dataset. This two-stage transfer learning framework enables the RBP-TSTL model to be trained effectively and improves its prediction performance. Extensive performance benchmarking of the RBP-TSTL models trained using the features generated by the self-supervised pre-trained model and other models trained using hand-crafted encoding features demonstrated the effectiveness of the proposed two-stage knowledge transfer strategy based on the self-supervised pre-trained models. Using the best-performing RBP-TSTL models, we further conducted genome-scale RBP predictions for Homo sapiens, Arabidopsis thaliana, Escherichia coli, and Salmonella and established a computational compendium containing all the predicted putative RBP candidates. We anticipate that the proposed RBP-TSTL approach will be explored as a useful tool for the characterization of RNA-binding proteins and exploration of their sequence–structure–function relationships.
Affiliation(s)
- Xinxin Peng
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
- Xiaoyu Wang
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
- Yuming Guo
  - Department of Epidemiology and Preventive Medicine, School of Public Health and Preventive Medicine, Monash University, Melbourne, Victoria 3004, Australia
- Zongyuan Ge
  - Monash e-Research Centre and Faculty of Engineering, Monash University, Melbourne, VIC 3800, Australia
- Fuyi Li
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
  - Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia
  - College of Information Engineering, Northwest A&F University, Yangling, 712100, China
- Xin Gao
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
  - KAUST Computational Bioscience Research Center, King Abdullah University of Science and Technology
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia