1
Xu S, Wei J, Sun S, Zhang J, Chan TF, Li Y. SSBlazer: a genome-wide nucleotide-resolution model for predicting single-strand break sites. Genome Biol 2024; 25:46. [PMID: 38347618 PMCID: PMC10863285 DOI: 10.1186/s13059-024-03179-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2022] [Accepted: 01/24/2024] [Indexed: 02/15/2024] Open
Abstract
Single-strand breaks are the major DNA damage in the genome and serve a crucial role in various biological processes. To reveal the significance of single-strand breaks, multiple sequencing-based single-strand break detection methods have been developed, which are costly and unfeasible for large-scale analysis. Hence, we propose SSBlazer, an explainable and scalable deep learning framework for single-strand break site prediction at the nucleotide level. SSBlazer is a lightweight model with robust generalization capabilities across various species and is capable of numerous unexplored SSB-related applications.
Affiliation(s)
- Sheng Xu
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, 100871, Hong Kong SAR, China
  - Research Institute of Intelligent Complex Systems, Fudan University, 220 Handan Rd, Shanghai, 200437, China
  - Shanghai AI Lab, 422 Jingan Rd, 200041, Shanghai, China
- Junkang Wei
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, 100871, Hong Kong SAR, China
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA
- Siqi Sun
  - Research Institute of Intelligent Complex Systems, Fudan University, 220 Handan Rd, Shanghai, 200437, China
  - Shanghai AI Lab, 422 Jingan Rd, 200041, Shanghai, China
- Jizhou Zhang
  - School of Life Sciences, The Chinese University of Hong Kong, 100871, Hong Kong SAR, China
  - State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, 100871, Hong Kong SAR, China
- Ting-Fung Chan
  - School of Life Sciences, The Chinese University of Hong Kong, 100871, Hong Kong SAR, China
  - State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, 100871, Hong Kong SAR, China
- Yu Li
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, 100871, Hong Kong SAR, China
  - The CUHK Shenzhen Research Institute, Hi-Tech Park, Nanshan, 518057, Shenzhen, China
2
Zhang T, Li L, Sun H, Xu D, Wang G. DeepICSH: a complex deep learning framework for identifying cell-specific silencers and their strength from the human genome. Brief Bioinform 2023; 24:bbad316. [PMID: 37643374 DOI: 10.1093/bib/bbad316] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/01/2023] [Revised: 07/25/2023] [Accepted: 08/11/2023] [Indexed: 08/31/2023] Open
Abstract
Silencers are noncoding DNA sequence fragments located on the genome that suppress gene expression. The variation of silencers in specific cells is closely related to gene expression and cancer development. Computational approaches that exclusively rely on DNA sequence information for silencer identification fail to account for the cell specificity of silencers, resulting in diminished accuracy. Despite the discovery of several transcription factors and epigenetic modifications associated with silencers on the genome, there is still no definitive biological signal or combination thereof to fully characterize silencers, posing challenges in selecting suitable biological signals for their identification. Therefore, we propose a sophisticated deep learning framework called DeepICSH, which is based on multiple biological data sources. Specifically, DeepICSH leverages a deep convolutional neural network to automatically capture biologically relevant signal combinations strongly associated with silencers, originating from a diverse array of biological signals. Furthermore, the utilization of attention mechanisms facilitates the scoring and visualization of these signal combinations, whereas the employment of skip connections facilitates the fusion of multilevel sequence features and signal combinations, thereby empowering the accurate identification of silencers within specific cells. Extensive experiments on HepG2 and K562 cell line data sets demonstrate that DeepICSH outperforms state-of-the-art methods in silencer identification. Notably, we introduce for the first time a deep learning framework based on multi-omics data for classifying strong and weak silencers, achieving favorable performance. In conclusion, DeepICSH shows great promise for advancing the study and analysis of silencers in complex diseases. The source code is available at https://github.com/lyli1013/DeepICSH.
Affiliation(s)
- Tianjiao Zhang
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Liangyu Li
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Hailong Sun
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Dali Xu
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Guohua Wang
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
3
Xu H, Jia J, Jeong HH, Zhao Z. Deep learning for detecting and elucidating human T-cell leukemia virus type 1 integration in the human genome. PATTERNS (NEW YORK, N.Y.) 2023; 4:100674. [PMID: 36873907 PMCID: PMC9982299 DOI: 10.1016/j.patter.2022.100674] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2022] [Revised: 11/02/2022] [Accepted: 12/13/2022] [Indexed: 02/12/2023]
Abstract
Human T-cell leukemia virus type 1 (HTLV-1), a retrovirus, is the causative agent for adult T cell leukemia/lymphoma and many other human diseases. Accurate and high throughput detection of HTLV-1 virus integration sites (VISs) across the host genomes plays a crucial role in the prevention and treatment of HTLV-1-associated diseases. Here, we developed DeepHTLV, the first deep learning framework for VIS prediction de novo from genome sequence, motif discovery, and cis-regulatory factor identification. We demonstrated the high accuracy of DeepHTLV with more efficient and interpretive feature representations. Decoding the informative features captured by DeepHTLV resulted in eight representative clusters with consensus motifs for potential HTLV-1 integration. Furthermore, DeepHTLV revealed interesting cis-regulatory elements in regulation of VISs that have significant association with the detected motifs. Literature evidence demonstrated nearly half (34) of the predicted transcription factors enriched with VISs were involved in HTLV-1-associated diseases. DeepHTLV is freely available at https://github.com/bsml320/DeepHTLV.
Affiliation(s)
- Haodong Xu
  - Center for Precision Health, School of Biomedical Informatics, UTHealth Science Center at Houston, Houston, TX 77030, USA
- Johnathan Jia
  - Center for Precision Health, School of Biomedical Informatics, UTHealth Science Center at Houston, Houston, TX 77030, USA
  - MD Anderson UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA
- Hyun-Hwan Jeong
  - Center for Precision Health, School of Biomedical Informatics, UTHealth Science Center at Houston, Houston, TX 77030, USA
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, UTHealth Science Center at Houston, Houston, TX 77030, USA
  - MD Anderson UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
4
Ding W, Abdel-Basset M, Hawash H, Ali AM. Explainability of artificial intelligence methods, applications and challenges: A comprehensive survey. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.10.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
5
An attention-based hybrid deep neural networks for accurate identification of transcription factor binding sites. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07502-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
6
Abstract
The tremendous amount of biological sequence data available, combined with the recent methodological breakthrough in deep learning in domains such as computer vision or natural language processing, is leading today to the transformation of bioinformatics through the emergence of deep genomics, the application of deep learning to genomic sequences. We review here the new applications that the use of deep learning enables in the field, focusing on three aspects: the functional annotation of genomes, the sequence determinants of the genome functions and the possibility to write synthetic genomic sequences.
7
Predicting CRISPR/Cas9 Repair Outcomes by Attention-Based Deep Learning Framework. Cells 2022; 11:cells11111847. [PMID: 35681543 PMCID: PMC9180579 DOI: 10.3390/cells11111847] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/06/2022] [Revised: 05/24/2022] [Accepted: 06/02/2022] [Indexed: 02/01/2023] Open
Abstract
As a simple and programmable nuclease-based genome editing tool, the CRISPR/Cas9 system has been widely used in target-gene repair and gene-expression regulation. The DNA mutation generated by CRISPR/Cas9-mediated double-strand breaks determines its biological and phenotypic effects. Experiments have demonstrated that CRISPR/Cas9-generated cellular-repair outcomes depend on local sequence features. Therefore, the repair outcomes after a DNA break can be predicted from sequences near the cleavage sites. However, existing prediction methods rely on manually constructed features or insufficiently detailed prediction labels. They cannot reach clinical-level prediction accuracy, and their performance is limited to existing knowledge about CRISPR/Cas9 editing. We predict 557 repair labels of DNA, covering the vast majority of Cas9-generated mutational outcomes, and build a deep learning model called Apindel to predict CRISPR/Cas9 editing outcomes. Apindel automatically trains the sequence features of DNA with the GloVe model, introduces location information through Positional Encoding (PE), and embeds the trained word-vector matrices into a deep learning model containing a BiLSTM and an attention mechanism. Apindel has better performance and more detailed prediction categories than the most advanced DNA-mutation-predicting models. It also reveals that nucleotides at different positions relative to the cleavage sites have different influences on CRISPR/Cas9 editing outcomes.
8
Yu L, Zhang Y, Xue L, Liu F, Chen Q, Luo J, Jing R. Systematic Analysis and Accurate Identification of DNA N4-Methylcytosine Sites by Deep Learning. Front Microbiol 2022; 13:843425. [PMID: 35401453 PMCID: PMC8989013 DOI: 10.3389/fmicb.2022.843425] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2021] [Accepted: 02/21/2022] [Indexed: 11/13/2022] Open
Abstract
DNA N4-methylcytosine (4mC) is a pivotal epigenetic modification that plays an essential role in DNA replication, repair, expression and differentiation. To gain insight into the biological functions of 4mC, it is critical to identify its modification sites in the genome. Deep learning has become increasingly popular in recent years and is frequently employed for 4mC site identification. However, a systematic analysis of how to build predictive models using deep learning techniques is still lacking. In this work, we first summarized all existing deep learning-based predictors and systematically analyzed their models, features and datasets. Then, using a typical standard dataset with three species (A. thaliana, C. elegans, and D. melanogaster), we assessed the contribution of different model architectures, encoding methods and the attention mechanism in establishing a deep learning-based model for 4mC site prediction. After a series of optimizations, a convolutional-recurrent neural network architecture using one-hot encoding and the attention mechanism achieved the best overall prediction performance. Extensive comparison experiments were conducted based on the same dataset. This work will be helpful for researchers who would like to build 4mC prediction models using deep learning in the future.
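The one-hot encoding named in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `one_hot_encode`, the A, C, G, T column order, and the all-zero handling of ambiguous bases are assumptions made for illustration.

```python
# Hypothetical helper for illustration; not taken from any predictor in the survey.
NUCLEOTIDES = "ACGT"  # assumed column order


def one_hot_encode(sequence):
    """Return one 4-element row per base, with columns ordered A, C, G, T."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in index:  # assumption: ambiguous bases such as N stay all-zero
            row[index[base]] = 1
        encoded.append(row)
    return encoded


# Print the encoding of a short toy sequence, one base per line.
for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
    print(base, row)
```

The resulting matrix has one row per nucleotide, which is the kind of input a convolutional layer would typically scan along the sequence axis.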
Affiliation(s)
- Lezheng Yu
  - School of Chemistry and Materials Science, Guizhou Education University, Guiyang, China
- Yonglin Zhang
  - Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China
- Li Xue
  - School of Public Health, Southwest Medical University, Luzhou, China
- Fengjuan Liu
  - School of Geography and Resources, Guizhou Education University, Guiyang, China
- Qi Chen
  - Department of Endocrinology and Metabolism, The Affiliated Hospital of Southwest Medical University, Luzhou, China
- Jiesi Luo
  - Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China
  - Department of Pharmacy, The Affiliated Hospital of Southwest Medical University, Luzhou, China
- Runyu Jing
  - School of Cyber Science and Engineering, Sichuan University, Chengdu, China
9
de Santana Correia A, Colombini EL. Attention, please! A survey of neural attention models in deep learning. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10148-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 01/11/2023]
10
Ekpenyong ME, Adegoke AA, Edoho ME, Inyang UG, Udo IJ, Ekaidem IS, Osang F, Uto NP, Geoffery JI. Collaborative Mining of Whole Genome Sequences for Intelligent HIV-1 Sub-Strain(s) Discovery. Curr HIV Res 2022; 20:163-183. [PMID: 35142269 DOI: 10.2174/1570162x20666220210142209] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2021] [Revised: 11/30/2021] [Accepted: 12/20/2021] [Indexed: 11/22/2022]
Abstract
BACKGROUND Effective global antiretroviral vaccines and therapeutic strategies depend on the diversity, evolution, and epidemiology of their various strains as well as their transmission and pathogenesis. Most viral disease-causing particles are clustered into a taxonomy of subtypes to suggest pointers toward nucleotide-specific vaccines or therapeutic applications of clinical significance sufficient for sequence-specific diagnosis and homologous viral studies. These are very useful to formulate predictors to induce cross-resistance to some retroviral control drugs being used across study areas. OBJECTIVE This research proposed a collaborative framework of hybridized (Machine Learning and Natural Language Processing) techniques to discover hidden genome patterns and feature predictors, for HIV-1 genome sequences mining. METHOD 630 human HIV-1 genome sequences above 8500 bps were excavated from the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov) for 21 countries across different continents, Antarctica exempt. These sequences were transformed and learned using a self-organizing map (SOM). To discriminate emerging/new sub-strain(s), the HIV-1 reference genome was included as part of the input isolates/samples during the training. After training the SOM, component planes defining pattern clusters of the input datasets were generated, for cognitive knowledge mining and subsequent labelling of the datasets. Additional genome features including dinucleotide transmission recurrences, codon recurrences, and mutation recurrences, were finally extracted from the raw genomes to construct output classification targets for supervised learning. RESULTS SOM training explains the inherent pattern diversity of HIV-1 genomes as well as inter- and intra-country transmissions in which mobility might play an active role, as corroborated by literature. 
Nine sub-strains were discovered after disassembling the SOM correlation hunting matrix space attributed to disparate clusters. Cognitive knowledge mining separated similar pattern clusters bounded by a certain degree of correlation range, discovered by the SOM. A Kruskal-Wallis rank-sum test and Wilcoxon rank-sum test showed statistically significant variations in dinucleotide, codon, and mutation patterns. CONCLUSION Results of the discovered sub-strains and response clusters visualizations corroborate existing literature, with significant haplotype variations. The proposed framework would assist in the development of decision support systems for easy contact tracing, infectious disease surveillance, and studying the progressive evolution of the reference HIV-1 genome.
Affiliation(s)
- Moses E Ekpenyong
  - Department of Computer Science, Faculty of Science, University of Uyo, Uyo, Nigeria
  - Centre for Research and Development, University of Uyo, Uyo, Nigeria
- Anthony A Adegoke
  - Department of Microbiology, Faculty of Science, University of Uyo, Uyo, Nigeria
- Mercy E Edoho
  - Department of Computer Science, Faculty of Science, University of Uyo, Uyo, Nigeria
- Udoinyang G Inyang
  - Department of Computer Science, Faculty of Science, University of Uyo, Uyo, Nigeria
- Ifiok J Udo
  - Department of Computer Science, Faculty of Science, University of Uyo, Uyo, Nigeria
- Itemobong S Ekaidem
  - Department of Chemical Pathology, College of Health Sciences, University of Uyo, Uyo, Nigeria
- Francis Osang
  - Department of Computer Science, Faculty of Science, National Open University, Abuja, Nigeria
- Nseobong P Uto
  - School of Mathematics and Statistics, University of St Andrews, Scotland, United Kingdom
- Joseph I Geoffery
  - Department of Computer Science, Faculty of Science, University of Uyo, Uyo, Nigeria
11
Yang G, Ye Q, Xia J. Unbox the black-box for the medical explainable AI via multi-modal and multi-centre data fusion: A mini-review, two showcases and beyond. AN INTERNATIONAL JOURNAL ON INFORMATION FUSION 2022; 77:29-52. [PMID: 34980946 PMCID: PMC8459787 DOI: 10.1016/j.inffus.2021.07.016] [Citation(s) in RCA: 140] [Impact Index Per Article: 70.0] [Received: 01/27/2021] [Revised: 05/25/2021] [Accepted: 07/25/2021] [Indexed: 05/04/2023]
Abstract
Explainable Artificial Intelligence (XAI) is an emerging research topic of machine learning aimed at unboxing how AI systems' black-box choices are made. This research field inspects the measures and models involved in decision-making and seeks solutions to explain them explicitly. Many of the machine learning algorithms cannot manifest how and why a decision has been cast. This is particularly true of the most popular deep neural network approaches currently in use. Consequently, our confidence in AI systems can be hindered by the lack of explainability in these black-box models. The XAI becomes more and more crucial for deep learning powered applications, especially for medical and healthcare studies, although in general these deep neural networks can return an arresting dividend in performance. The insufficient explainability and transparency in most existing AI systems can be one of the major reasons that successful implementation and integration of AI tools into routine clinical practice are uncommon. In this study, we first surveyed the current progress of XAI and in particular its advances in healthcare applications. We then introduced our solutions for XAI leveraging multi-modal and multi-centre data fusion, and subsequently validated in two showcases following real clinical scenarios. Comprehensive quantitative and qualitative analyses can prove the efficacy of our proposed XAI solutions, from which we can envisage successful applications in a broader range of clinical questions.
Affiliation(s)
- Guang Yang
  - National Heart and Lung Institute, Imperial College London, London, UK
  - Royal Brompton Hospital, London, UK
  - Imperial Institute of Advanced Technology, Hangzhou, China
- Qinghao Ye
  - Hangzhou Ocean’s Smart Boya Co., Ltd, China
  - University of California, San Diego, La Jolla, CA, USA
- Jun Xia
  - Radiology Department, Shenzhen Second People’s Hospital, Shenzhen, China
12
AIM in Medical Informatics. Artif Intell Med 2022. [DOI: 10.1007/978-3-030-64573-1_32] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
13
Huminiecki Ł. Virtual Gene Concept and a Corresponding Pragmatic Research Program in Genetical Data Science. ENTROPY (BASEL, SWITZERLAND) 2021; 24:17. [PMID: 35052043 PMCID: PMC8774939 DOI: 10.3390/e24010017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2021] [Revised: 12/02/2021] [Accepted: 12/14/2021] [Indexed: 06/14/2023]
Abstract
Mendel proposed an experimentally verifiable paradigm of particle-based heredity that has been influential for over 150 years. The historical arguments have been reflected in the near past as Mendel's concept has been diversified by new types of omics data. As an effect of the accumulation of omics data, a virtual gene concept forms, giving rise to genetical data science. The concept integrates genetical, functional, and molecular features of the Mendelian paradigm. I argue that the virtual gene concept should be deployed pragmatically. Indeed, the concept has already inspired a practical research program related to systems genetics. The program includes questions about functionality of structural and categorical gene variants, about regulation of gene expression, and about roles of epigenetic modifications. The methodology of the program includes bioinformatics, machine learning, and deep learning. Education, funding, careers, standards, benchmarks, and tools to monitor research progress should be provided to support the research program.
Affiliation(s)
- Łukasz Huminiecki
  - Evolutionary, Computational, and Statistical Genetics, Department of Molecular Biology, Institute of Genetics and Animal Biotechnology, Polish Academy of Sciences, Postępu 36A, Jastrzębiec, 05-552 Warsaw, Poland
14
Xu Z, Luo M, Lin W, Xue G, Wang P, Jin X, Xu C, Zhou W, Cai Y, Yang W, Nie H, Jiang Q. DLpTCR: an ensemble deep learning framework for predicting immunogenic peptide recognized by T cell receptor. Brief Bioinform 2021; 22:6355415. [PMID: 34415016 DOI: 10.1093/bib/bbab335] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Received: 05/20/2021] [Revised: 07/25/2021] [Accepted: 07/28/2021] [Indexed: 12/30/2022] Open
Abstract
Accurate prediction of immunogenic peptides recognized by the T cell receptor (TCR) can greatly benefit vaccine development and cancer immunotherapy. However, identifying immunogenic peptides accurately is still a huge challenge. Most antigen peptides predicted in silico without considering the TCR as a key factor fail to elicit immune responses in vivo. This inevitably leads to costly and time-consuming experimental validation tests for predicted antigens. Therefore, it is necessary to develop novel computational methods for precisely and effectively predicting immunogenic peptides recognized by the TCR. Here, we described DLpTCR, a multimodal ensemble deep learning framework for predicting the likelihood of interaction between single/paired chain(s) of TCR and peptide presented by major histocompatibility complex molecules. To investigate the generality and robustness of the proposed model, COVID-19 data and IEDB data were constructed for independent evaluation. The DLpTCR model exhibited high predictive power, with an area under the curve of up to 0.91 on COVID-19 data when predicting the interaction between peptide and single TCR chain. Additionally, the DLpTCR model achieved an overall accuracy of 81.03% on IEDB data when predicting the interaction between peptide and paired TCR chains. The results demonstrate that DLpTCR has the ability to learn general interaction rules and generalize to antigen peptide recognition by TCR. A user-friendly webserver is available at http://jianglab.org.cn/DLpTCR/. Additionally, a stand-alone software package can be downloaded from https://github.com/jiangBiolab/DLpTCR.
Affiliation(s)
- Zhaochun Xu
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Meng Luo
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Weizhong Lin
  - Center for Bioinformatics, Computer Department, Jingdezhen Ceramic Institute, Jingdezhen 333403, China
- Guangfu Xue
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Pingping Wang
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Xiyun Jin
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Chang Xu
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Wenyang Zhou
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Yideng Cai
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Wenyi Yang
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Huan Nie
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
- Qinghua Jiang
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
  - Key Laboratory of Biological Data (Harbin Institute of Technology), Ministry of Education, China
15
Luo X, Gandhi P, Zhang Z, Shao W, Han Z, Chandrasekaran V, Turzhitsky V, Bali V, Roberts AR, Metzger M, Baker J, La Rosa C, Weaver J, Dexter P, Huang K. Applying interpretable deep learning models to identify chronic cough patients using EHR data. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2021; 210:106395. [PMID: 34525412 DOI: 10.1016/j.cmpb.2021.106395] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/05/2021] [Accepted: 08/30/2021] [Indexed: 06/13/2023]
Abstract
BACKGROUND AND OBJECTIVE Chronic cough (CC) affects approximately 10% of adults. Many disease states are associated with chronic cough, such as asthma, upper airway cough syndrome, bronchitis, and gastroesophageal reflux disease. The lack of an ICD code specific for chronic cough makes it challenging to identify such patients from electronic health records (EHRs). For clinical and research purposes, computational methods using EHR data are urgently needed to identify chronic cough cases. This research aims to investigate the data representations and deep learning algorithms for chronic cough prediction. METHODS Utilizing real-world EHR data from a large academic healthcare system from October 2005 to September 2015, we investigated Natural Language Representation of the EHR data and systematically evaluated deep learning and traditional machine learning models to predict chronic cough patients. We built these machine learning models using structured data (medication and diagnosis) and unstructured data (clinical notes). RESULTS The sensitivity and specificity of a transformer-based deep learning algorithm, specifically BERT with attention model, was 0.856 and 0.866, respectively, using structured data (medication and diagnosis). Sensitivity and specificity improved to 0.952 and 0.930 when we combined structured data with symptoms extracted from clinical notes. We further found that the attention mechanism of deep learning models can be used to extract important features that drive the prediction decisions. Compared with our previously published rule-based algorithm, the deep learning algorithm can identify more chronic cough patients with structured data. CONCLUSIONS By applying deep learning models, chronic cough patients can be reliably identified for prospective or retrospective research through medication and diagnosis data, widely available in EHR and electronic claims data, thus improving the generalizability of the patient identification algorithm. 
Deep learning models can identify chronic cough patients with even higher sensitivity and specificity when structured and unstructured EHR data are utilized. We anticipate language-based data representation and deep learning models developed in this research could also be productively used for other disease prediction and case identification.
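For reference, the sensitivity and specificity figures quoted above are ratios over confusion-matrix counts. A minimal Python sketch follows, assuming binary labels where 1 marks a chronic cough case; the function name and the toy labels are hypothetical and are not drawn from the study's data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-cases correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-cases wrongly flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels (hypothetical): 4 true cases and 6 non-cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, round(spec, 3))  # 0.75 0.667
```

Sensitivity is the fraction of true cases the model finds, and specificity is the fraction of non-cases it correctly leaves out, so the two can move in opposite directions as a decision threshold changes.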
Collapse
Affiliation(s)
- Xiao Luo
- Purdue School of Engineering and Technology, IUPUI, 799 W Michigan St, Indianapolis, IN 46202, United States.
| | - Priyanka Gandhi
- Purdue School of Engineering and Technology, IUPUI, 799 W Michigan St, Indianapolis, IN 46202, United States.
| | - Zuoyi Zhang
- Indiana University School of Medicine, 340 W 10th St #6200, Indianapolis, IN 46202, United States.
| | - Wei Shao
- Indiana University School of Medicine, 340 W 10th St #6200, Indianapolis, IN 46202, United States.
| | - Zhi Han
- Indiana University School of Medicine, 340 W 10th St #6200, Indianapolis, IN 46202, United States; Regenstrief Institute, 1101 W 10th Street, Indianapolis, IN, 46202, United States.
| | - Vasu Chandrasekaran
- Center for Observational and Real-World Evidence, Merck Co., Inc, 2000 Galloping Hill Rd, Kenilworth, NJ, 07033 United States.
| | - Vladimir Turzhitsky
- Center for Observational and Real-World Evidence, Merck Co., Inc, 2000 Galloping Hill Rd, Kenilworth, NJ, 07033 United States.
| | - Vishal Bali
- Center for Observational and Real-World Evidence, Merck Co., Inc, 2000 Galloping Hill Rd, Kenilworth, NJ, 07033 United States.
| | - Anna R Roberts
- Regenstrief Institute, 1101 W 10th Street, Indianapolis, IN, 46202, United States.
| | - Megan Metzger
- Regenstrief Institute, 1101 W 10th Street, Indianapolis, IN, 46202, United States.
| | - Jarod Baker
- Regenstrief Institute, 1101 W 10th Street, Indianapolis, IN, 46202, United States.
| | - Carmen La Rosa
- Center for Observational and Real-World Evidence, Merck Co., Inc, 2000 Galloping Hill Rd, Kenilworth, NJ, 07033 United States.
| | - Jessica Weaver
- Center for Observational and Real-World Evidence, Merck Co., Inc, 2000 Galloping Hill Rd, Kenilworth, NJ, 07033 United States.
| | - Paul Dexter
- Indiana University School of Medicine, 340 W 10th St #6200, Indianapolis, IN 46202, United States; Regenstrief Institute, 1101 W 10th Street, Indianapolis, IN, 46202, United States; Eskenazi Health, 720 Eskenazi Ave, Indianapolis, IN 46202, United States.
| | - Kun Huang
- Indiana University School of Medicine, 340 W 10th St #6200, Indianapolis, IN 46202, United States; Regenstrief Institute, 1101 W 10th Street, Indianapolis, IN, 46202, United States.
|
16
|
Wu C, Guo X, Li M, Shen J, Fu X, Xie Q, Hou Z, Zhai M, Qiu X, Cui Z, Xie H, Qin P, Weng X, Hu Z, Liang J. DeepHBV: a deep learning model to predict hepatitis B virus (HBV) integration sites. BMC Ecol Evol 2021; 21:138. [PMID: 34233610 PMCID: PMC8261932 DOI: 10.1186/s12862-021-01869-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/15/2021] [Accepted: 06/29/2021] [Indexed: 01/05/2023] Open
Abstract
Background The hepatitis B virus (HBV) is one of the main causes of viral hepatitis and liver cancer. HBV integration is one of the key steps in the virus-promoted malignant transformation. Results An attention-based deep learning model, DeepHBV, was developed to predict HBV integration sites. By learning local genomic features automatically, DeepHBV was trained and tested using HBV integration site data from the dsVIS database. Initially, DeepHBV showed an AUROC of 0.6363 and an AUPR of 0.5471 for the dataset. The integration of genomic features of repeat peaks and TCGA Pan-Cancer peaks significantly improved model performance, with AUROCs of 0.8378 and 0.9430 and AUPRs of 0.7535 and 0.9310, respectively. The transcription factor binding sites (TFBS) were significantly enriched near the genomic positions that were considered. The binding sites of the AR-halfsite, Arnt, Atf1, bHLHE40, bHLHE41, BMAL1, CLOCK, c-Myc, COUP-TFII, E2A, EBF1, Erra, and Foxo3 were highlighted by DeepHBV in both the dsVIS and VISDB datasets, revealing a novel integration preference for HBV. Conclusions DeepHBV is a useful tool for predicting HBV integration sites, revealing novel insights into HBV integration-related carcinogenesis. Supplementary Information The online version contains supplementary material available at 10.1186/s12862-021-01869-8.
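AUROC values such as those reported above can be read as the probability that a randomly chosen positive site is scored higher than a randomly chosen negative one. A minimal sketch of that pairwise definition follows; it is an illustration under that assumption, not the DeepHBV evaluation code, and the function name and toy scores are hypothetical.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts, over all (positive, negative) pairs, how often the positive
    example receives the higher score; ties count as half a win.
    Hypothetical helper, quadratic in the number of examples.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores): 1 = integration site, 0 = background.
toy_auroc = auroc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

For large datasets a rank-based or library implementation would be preferable; the quadratic loop is kept here only because it mirrors the definition directly.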
Affiliation(s)
- Canbiao Wu
- Institute for Brain Research and Rehabilitation, South China Normal University, Guangzhou, 510631, Guangdong, China
| | - Xiaofang Guo
- Department of Medical Oncology of the Eastern Hospital, the First Affiliated Hospital, Sun Yat-Sen University, Guangdong, 510700, Guangzhou, China
| | - Mengyuan Li
- Department of Gynecological Oncology, the First Affiliated Hospital, Sun Yat-Sen University, Guangdong, 510080, Guangzhou, China
| | - Jingxian Shen
- Institute for Brain Research and Rehabilitation, South China Normal University, Guangzhou, 510631, Guangdong, China
| | - Xiayu Fu
- Department of Thoracic Surgery, the First Affiliated Hospital, Sun Yat-Sen University, Guangdong, 510080, Guangzhou, China
| | - Qingyu Xie
- Institute for Brain Research and Rehabilitation, South China Normal University, Guangzhou, 510631, Guangdong, China.,School of Computer Science, South China Normal University, Guangzhou, 510631, China
| | - Zeliang Hou
- Institute for Brain Research and Rehabilitation, South China Normal University, Guangzhou, 510631, Guangdong, China
| | - Manman Zhai
- Institute for Brain Research and Rehabilitation, South China Normal University, Guangzhou, 510631, Guangdong, China.,School of Psychology, South China Normal University, Guangzhou, 510080, Guangdong, China
| | - Xiaofan Qiu
- Institute for Brain Research and Rehabilitation, South China Normal University, Guangzhou, 510631, Guangdong, China
| | - Zifeng Cui
- Department of Gynecological Oncology, the First Affiliated Hospital, Sun Yat-Sen University, Guangdong, 510080, Guangzhou, China
| | - Hongxian Xie
- Generulor Company Bio-X Lab, Guangzhou, 510006, Guangdong, China
| | - Pengmin Qin
- School of Psychology, South China Normal University, Guangzhou, 510080, Guangdong, China
| | - Xuchu Weng
- Institute for Brain Research and Rehabilitation, South China Normal University, Guangzhou, 510631, Guangdong, China.,Key Laboratory of Brain, Cognition and Education Sciences (South China Normal University), Ministry of Education, Guangzhou, 510080, Guangdong, China
| | - Zheng Hu
- Department of Gynecological Oncology, the First Affiliated Hospital, Sun Yat-Sen University, Guangdong, 510080, Guangzhou, China.,Department of Obstetrics and Gynecology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430030, Hubei, China.
| | - Jiuxing Liang
- Institute for Brain Research and Rehabilitation, South China Normal University, Guangzhou, 510631, Guangdong, China.,Key Laboratory of Brain, Cognition and Education Sciences (South China Normal University), Ministry of Education, Guangzhou, 510080, Guangdong, China.
|
17
|
Liang J, Cui Z, Wu C, Yu Y, Tian R, Xie H, Jin Z, Fan W, Xie W, Huang Z, Xu W, Zhu J, You Z, Guo X, Qiu X, Ye J, Lang B, Li M, Tan S, Hu Z. DeepEBV: A deep learning model to predict Epstein-Barr virus (EBV) integration sites. Bioinformatics 2021; 37:3405-3411. [PMID: 34009299 DOI: 10.1093/bioinformatics/btab388] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/27/2020] [Revised: 03/26/2021] [Accepted: 05/17/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Epstein-Barr virus (EBV) is one of the most prevalent DNA oncogenic viruses. The integration of EBV into the host genome has been reported to play an important role in cancer development. The preference of EBV integration showed strong dependence on the local genomic environment, which enables the prediction of EBV integration sites. RESULTS An attention-based deep learning model, DeepEBV, was developed to predict EBV integration sites by learning local genomic features automatically. First, DeepEBV was trained and tested using the data from the dsVIS database. The results showed that DeepEBV with EBV integration sequences plus Repeat peaks and 2-fold data augmentation performed the best on the training dataset. Furthermore, the performance of the model was validated in an independent dataset. In addition, the motifs of DNA-binding proteins could influence the selection preference of viral insertional mutagenesis. Furthermore, the results showed that DeepEBV can predict EBV integration hotspot genes accurately. In summary, DeepEBV is a robust, accurate and explainable deep learning model, providing novel insights into EBV integration preferences and mechanisms. AVAILABILITY DeepEBV is available as open-source software and can be downloaded from https://github.com/JiuxingLiang/DeepEBV.git. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jiuxing Liang
- Key Laboratory of Brain, Cognition and Education Sciences, Ministry of Education, China; Institute for Brain Research and Rehabilitation, South China Normal University, Guangzhou 510631, China
| | - Zifeng Cui
- Department of Gynaecological oncology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou 510080, Guangdong, China
| | - Canbiao Wu
- Key Laboratory of Brain, Cognition and Education Sciences, Ministry of Education, China; Institute for Brain Research and Rehabilitation, South China Normal University, Guangzhou 510631, China
| | - Yao Yu
- Department of Urology, The First Medical Center of Chinese PLA General Hospital, Beijing 100853 China.,School of Medicine, Nankai University, Tianjin 300071, China
| | - Rui Tian
- Center for Translational Medicine, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou 510080, Guangdong, China
| | - Hongxian Xie
- STech Company Bio-X Lab, Zhuhai 519000, Guangdong, China
| | - Zhuang Jin
- Department of Gynaecological oncology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou 510080, Guangdong, China
| | - Weiwen Fan
- Department of Gynaecological oncology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou 510080, Guangdong, China
| | - Weiling Xie
- Department of Gynaecological oncology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou 510080, Guangdong, China
| | - Zhaoyue Huang
- Department of Gynaecological oncology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou 510080, Guangdong, China
| | - Wei Xu
- Department of Gynaecological oncology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou 510080, Guangdong, China
| | - Jingjing Zhu
- Department of Gynaecological oncology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou 510080, Guangdong, China
| | - Zeshan You
- Department of Gynaecological oncology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou 510080, Guangdong, China
| | - Xiaofang Guo
- Department of Medical Oncology of the Eastern Hospital, the First Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510700, China
| | - Xiaofan Qiu
- Key Laboratory of Brain, Cognition and Education Sciences, Ministry of Education, China; Institute for Brain Research and Rehabilitation, South China Normal University, Guangzhou 510631, China
| | - Jiahao Ye
- Key Laboratory of Brain, Cognition and Education Sciences, Ministry of Education, China; Institute for Brain Research and Rehabilitation, South China Normal University, Guangzhou 510631, China.,School of Computer Science, South China Normal University, Guangzhou 510631, China
| | - Bin Lang
- School of Health Sciences and Sports, Macao Polytechnic Institute, China
| | - Mengyuan Li
- Department of Gynaecological oncology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou 510080, Guangdong, China
| | - Songwei Tan
- School of Pharmacy, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
| | - Zheng Hu
- Department of Gynaecological oncology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou 510080, Guangdong, China.,Department of Obstetrics and Gynaecology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, Hubei, China
|
18
|
Xu H, Jia P, Zhao Z. DeepVISP: Deep Learning for Virus Site Integration Prediction and Motif Discovery. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2021; 8:2004958. [PMID: 33977077 PMCID: PMC8097320 DOI: 10.1002/advs.202004958] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/23/2020] [Indexed: 05/08/2023]
Abstract
Approximately 15% of human cancers are estimated to be attributed to viruses. Virus sequences can be integrated into the host genome, leading to genomic instability and carcinogenesis. Here, a new deep convolutional neural network (CNN) model is developed with attention architecture, namely DeepVISP, for accurately predicting oncogenic virus integration sites (VISs) in the human genome. Using the curated benchmark integration data of three viruses, hepatitis B virus (HBV), human papillomavirus (HPV), and Epstein-Barr virus (EBV), DeepVISP achieves high accuracy and robust performance for all three viruses through automatically learning informative features and essential genomic positions only from the DNA sequences. In comparison, DeepVISP outperforms conventional machine learning methods by 8.43-34.33% measured by area under curve (AUC) value enhancement in three viruses. Moreover, DeepVISP can decode cis-regulatory factors that are potentially involved in virus integration and tumorigenesis, such as HOXB7, IKZF1, and LHX6. These findings are supported by multiple lines of evidence in literature. The clustering analysis of the informative motifs reveals that the representative k-mers in clusters could help guide virus recognition of the host genes. A user-friendly web server is developed for predicting putative oncogenic VISs in the human genome using DeepVISP.
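Models that learn "only from the DNA sequences" generally need the sequence converted to a numeric matrix first. One common choice is one-hot encoding, sketched below as an assumption; the abstract does not state which encoding DeepVISP uses, and the function name and base ordering are hypothetical.

```python
BASES = "ACGT"  # assumed column order; any fixed order works


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element 0/1 rows.

    Each row marks which of A, C, G, T occurs at that position.
    Characters outside ACGT (for example N) become an all-zero row.
    Hypothetical helper for illustration only.
    """
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == b else 0 for b in BASES])
    return rows


# Toy example: a 5-base window containing one ambiguous base.
toy_matrix = one_hot("acgtN")
```

The resulting matrix has one row per nucleotide, which is the shape a 1D convolutional layer would typically scan over.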
Affiliation(s)
- Haodong Xu
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston (UTHealth), Houston, TX 77030, USA
| | - Peilin Jia
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston (UTHealth), Houston, TX 77030, USA
| | - Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston (UTHealth), Houston, TX 77030, USA
- MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
|
19
|
GS-9822, a preclinical LEDGIN candidate, displays a block-and-lock phenotype in cell culture. Antimicrob Agents Chemother 2021; 65:AAC.02328-20. [PMID: 33619061 PMCID: PMC8092873 DOI: 10.1128/aac.02328-20] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Indexed: 12/15/2022] Open
Abstract
The ability of HIV to integrate into the host genome and establish latent reservoirs is the main hurdle preventing an HIV cure. LEDGINs are small-molecule integrase inhibitors that target the binding pocket of LEDGF/p75, a cellular cofactor that substantially contributes to HIV integration site selection. They are potent antivirals that inhibit HIV integration and maturation. In addition, they retarget residual integrants away from transcription units and towards a more repressive chromatin environment. As a result, treatment with the LEDGIN CX14442 yielded residual provirus that proved more latent and more refractory to reactivation, supporting the use of LEDGINs as research tools to study HIV latency and a functional cure strategy. In this study we compared GS-9822, a potent, pre-clinical lead compound, with CX14442 with respect to antiviral potency, integration site selection, latency and reactivation. GS-9822 was more potent than CX14442 in most assays. For the first time, the combined effects on viral replication, integrase-LEDGF/p75 interaction, integration sites, epigenetic landscape, immediate latency and latency reversal were demonstrated at nanomolar concentrations achievable in the clinic. GS-9822 profiles as a preclinical candidate for future functional cure research.
|
20
|
Xiong Y, He X, Zhao D, Tian T, Hong L, Jiang T, Zeng J. Modeling multi-species RNA modification through multi-task curriculum learning. Nucleic Acids Res 2021; 49:3719-3734. [PMID: 33744973 PMCID: PMC8053129 DOI: 10.1093/nar/gkab124] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 01/03/2021] [Accepted: 02/12/2021] [Indexed: 01/01/2023] Open
Abstract
N6-methyladenosine (m6A) is the most pervasive modification in eukaryotic mRNAs. Numerous biological processes are regulated by this critical post-transcriptional mark, such as gene expression, RNA stability, RNA structure and translation. Recently, various experimental techniques and computational methods have been developed to characterize the transcriptome-wide landscapes of m6A modification for understanding its underlying mechanisms and functions in mRNA regulation. However, the experimental techniques are generally costly and time-consuming, while the existing computational models are usually designed only for m6A site prediction in a single species and have significant limitations in accuracy, interpretability and generalizability. Here, we propose a highly interpretable computational framework, called MASS, based on a multi-task curriculum learning strategy to capture m6A features across multiple species simultaneously. Extensive computational experiments demonstrate the superior performance of MASS when compared to the state-of-the-art prediction methods. Furthermore, the contextual sequence features of m6A captured by MASS can be explained by the known critical binding motifs of the related RNA-binding proteins, which also help elucidate the similarities and differences among m6A features across species. In addition, based on the predicted m6A profiles, we further delineate the relationships between m6A and various properties of gene regulation, including gene expression, RNA stability, translation, RNA structure and histone modification. In summary, MASS may serve as a useful tool for characterizing m6A modification and studying its regulatory code. The source code of MASS can be downloaded from https://github.com/mlcb-thu/MASS.
Affiliation(s)
- Yuanpeng Xiong
- Bioinformatics Division, BNRIST/Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
| | - Xuan He
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China
| | - Dan Zhao
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China
| | - Tingzhong Tian
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China
| | - Lixiang Hong
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China
| | - Tao Jiang
- Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
- Bioinformatics Division, BNRIST/Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
| | - Jianyang Zeng
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China
|
21
|
A machine learning-based framework for modeling transcription elongation. Proc Natl Acad Sci U S A 2021; 118:2007450118. [PMID: 33526657 DOI: 10.1073/pnas.2007450118] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 12/15/2022] Open
Abstract
RNA polymerase II (Pol II) generally pauses at certain positions along gene bodies, thereby interrupting the transcription elongation process, which is often coupled with various important biological functions, such as precursor mRNA splicing and gene expression regulation. Characterizing the transcriptional elongation dynamics can thus help us understand many essential biological processes in eukaryotic cells. However, experimentally measuring Pol II elongation rates is generally time- and resource-consuming. We developed PEPMAN (polymerase II elongation pausing modeling through attention-based deep neural network), a deep learning-based model that accurately predicts Pol II pausing sites based on the native elongating transcript sequencing (NET-seq) data. By taking full advantage of the attention mechanism, PEPMAN is able to decipher important sequence features underlying Pol II pausing. More importantly, we demonstrated that the analyses of the PEPMAN-predicted results around various types of alternative splicing sites can provide useful clues for understanding cotranscriptional splicing events. In addition, associating the PEPMAN prediction results with different epigenetic features can help reveal important factors related to the transcription elongation process. All these results demonstrated that PEPMAN can provide a useful and effective tool for modeling transcription elongation and understanding the related biological factors from available high-throughput sequencing data.
|
22
|
Hu H, Liu X, Xiao A, Li Y, Zhang C, Jiang T, Zhao D, Song S, Zeng J. Riboexp: an interpretable reinforcement learning framework for ribosome density modeling. Brief Bioinform 2021; 22:6105941. [PMID: 33479731 DOI: 10.1093/bib/bbaa412] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/17/2020] [Revised: 12/11/2020] [Indexed: 11/13/2022] Open
Abstract
Translation elongation is a crucial phase during protein biosynthesis. In this study, we develop a novel deep reinforcement learning-based framework, named Riboexp, to model the determinants of the uneven distribution of ribosomes on mRNA transcripts during translation elongation. In particular, our model employs a policy network to perform a context-dependent feature selection in the setting of ribosome density prediction. Our extensive tests demonstrated that Riboexp can significantly outperform the state-of-the-art methods in predicting ribosome density by up to 5.9% in terms of per-gene Pearson correlation coefficient on the datasets from three species. In addition, Riboexp can indicate more informative sequence features for the prediction task than other commonly used attribution methods in deep learning. In-depth analyses also revealed the meaningful biological insights generated by the Riboexp framework. Moreover, the application of Riboexp in codon optimization resulted in an increase of protein production by around 31% over the previous state-of-the-art method that models ribosome density. These results have established Riboexp as a powerful and useful computational tool in the studies of translation dynamics and protein synthesis. Availability: The data and code of this study are available on GitHub: https://github.com/Liuxg16/Riboexp. Contact: zengjy321@tsinghua.edu.cn; songsen@tsinghua.edu.cn.
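The "per-gene Pearson correlation coefficient" used as the metric above can be illustrated with a short sketch: compute Pearson's r between observed and predicted ribosome density along each gene, then average across genes. The averaging step, the function names, and the toy densities are assumptions for illustration; the abstract does not specify how per-gene values are aggregated, and this is not the Riboexp code.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def mean_per_gene_pearson(observed, predicted):
    """Average Pearson r over genes; both arguments map gene -> densities.

    Hypothetical aggregation: a plain unweighted mean across genes.
    """
    values = [pearson(observed[g], predicted[g]) for g in observed]
    return sum(values) / len(values)


# Toy data (made up): per-codon densities for two hypothetical genes.
toy_observed = {"geneA": [1.0, 2.0, 3.0], "geneB": [1.0, 2.0, 3.0]}
toy_predicted = {"geneA": [2.0, 4.0, 6.0], "geneB": [3.0, 2.0, 1.0]}
toy_mean_r = mean_per_gene_pearson(toy_observed, toy_predicted)
```

In the toy data one gene is perfectly correlated and the other perfectly anti-correlated, so the average is close to zero, which shows why a per-gene summary can differ sharply from a pooled correlation.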
Affiliation(s)
- Hailin Hu
- School of Medicine, Tsinghua University, Beijing, 100084, China
| | - Xianggen Liu
- Laboratory for Brain and Intelligence and Department of Biomedical Engineering, Tsinghua University, Beijing, 100084, China.,Beijing Innovation Center for Future Chip, Tsinghua University, Beijing, 100084, China
| | - An Xiao
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, 100084, China
| | - YangYang Li
- Comprehensive AIDS Research Center, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, School of Life Sciences, and School of Medicine, Tsinghua University, Beijing, 100084, China
| | | | - Tao Jiang
- Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA.,Bioinformatics Division, BNRIST/Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China.,Institute of Integrative Genome Biology, University of California, Riverside, CA 92521, USA
| | - Dan Zhao
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, 100084, China
| | - Sen Song
- Laboratory for Brain and Intelligence and Department of Biomedical Engineering, Tsinghua University, Beijing, 100084, China.,Beijing Innovation Center for Future Chip, Tsinghua University, Beijing, 100084, China
| | - Jianyang Zeng
- School of Medicine, Tsinghua University, Beijing, 100084, China
|
23
|
Bruno P, Calimeri F, Greco G. AIM in Medical Informatics. Artif Intell Med 2021. [DOI: 10.1007/978-3-030-58080-3_32-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
|
24
|
Tian R, Zhou P, Li M, Tan J, Cui Z, Xu W, Wei J, Zhu J, Jin Z, Cao C, Fan W, Xie W, Huang Z, Xie H, You Z, Niu G, Wu C, Guo X, Weng X, Tian X, Yu F, Yu Z, Liang J, Hu Z. DeepHPV: a deep learning model to predict human papillomavirus integration sites. Brief Bioinform 2020; 22:5924410. [PMID: 33059369 DOI: 10.1093/bib/bbaa242] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/27/2020] [Revised: 08/26/2020] [Accepted: 08/28/2020] [Indexed: 01/09/2023] Open
Abstract
Human papillomavirus (HPV) integration into the human genome is the main cause of cervical carcinogenesis. HPV integration selection preference shows strong dependence on the local genomic environment, which makes it possible to predict HPV integration sites. However, no published bioinformatic tool is available to date. Thus, we developed an attention-based deep learning model, DeepHPV, to predict HPV integration sites by learning environment features automatically. In total, 3608 known HPV integration sites were applied to train the model, and 584 reviewed HPV integration sites were used as the testing dataset. DeepHPV showed an area under the receiver operating characteristic curve (AUROC) of 0.6336 and an area under the precision-recall curve (AUPR) of 0.5670. Adding RepeatMasker and TCGA Pan Cancer peaks improved the model performance to 0.8464 and 0.8501 in AUROC and 0.7985 and 0.8106 in AUPR, respectively. Next, we tested these trained models on the independent database VISDB and found that the model adding TCGA Pan Cancer peaks performed better (AUROC: 0.7175, AUPR: 0.6284) than the model adding RepeatMasker peaks (AUROC: 0.6102, AUPR: 0.5577). Moreover, we introduced an attention mechanism in DeepHPV and found enrichment of transcription factor binding sites, including BHLHA15, CHR, COUP-TFII, DMRTA2, E2A, HIC1, INR, NPAS, Nr5a2, RARa, SCL, Snail1, Sox10, Sox3, Sox4, Sox6, STAT6, Tbet, Tbx5, TEAD, Tgif2, ZNF189 and ZNF416, near attention-intensive sites. Together, DeepHPV is a robust and explainable deep learning model, providing new insights into HPV integration preference and mechanism. Availability: DeepHPV is available as open-source software and can be downloaded from https://github.com/JiuxingLiang/DeepHPV.git. Contact: huzheng1998@163.com, liangjiuxing@m.scnu.edu.cn, lizheyzy@163.com.
Affiliation(s)
- Rui Tian
- Translational Medicine of the First Affiliated Hospital, Sun Yat-sen University
| | - Ping Zhou
- Dongguan Maternal and Child Health Care Hospital
| | - Mengyuan Li
- Department of Obstetrics and Gynecology at the First Affiliated Hospital, Sun Yat-sen University
| | - Jinfeng Tan
- First Affiliated Hospital, Sun Yat-sen University
| | - Zifeng Cui
- First Affiliated Hospital, Sun Yat-sen University
| | - Wei Xu
- Department of Obstetrics and Gynecology at the First Affiliated Hospital, Sun Yat-sen University
| | - Jingyue Wei
- Department of Obstetrics and Gynecology at the First Affiliated Hospital, Sun Yat-sen University
| | - Jingjing Zhu
- Department of Obstetrics and Gynecology of the First Affiliated Hospital, Sun Yat-sen University
| | - Zhuang Jin
- First Affiliated Hospital, Sun Yat-sen University
| | - Chen Cao
- Central Hospital of Wuhan, China
| | - Weiwen Fan
- College of Medicine at the Sun Yat-sen University
| | - Weiling Xie
- First Affiliated Hospital, Sun Yat-sen University
| | | | | | - Zeshan You
- First Affiliated Hospital, Sun Yat-sen University
| | - Gang Niu
- Department of Obstetrics and Gynecology of the First Affiliated Hospital, Sun Yat-sen University
| | - Canbiao Wu
- Institute for Brain Research and Rehabilitation at the South China Normal University
| | - Xiaofang Guo
- Department of Medical Oncology of the Eastern Hospital at the First Affiliated Hospital, Sun Yat-sen University
| | - Xuchu Weng
- Institute for Brain Research and Rehabilitation at the South China Normal University
| | | | - Fubing Yu
- Dongguan Maternal and Child Health Care Hospital
| | - Zhiying Yu
- Department of Gynecology, Shenzhen Second People's Hospital/the First Affiliated Hospital of Shenzhen University Health Science Center
| | - Jiuxing Liang
- Institute for Brain Research and Rehabilitation at the South China Normal University
| | - Zheng Hu
- Gynecological Oncology of the First Affiliated Hospital, Precision Medicine Institute, Sun Yat-sen University
|
25
|
Fu H, Cao Z, Li M, Wang S. ACEP: improving antimicrobial peptides recognition through automatic feature fusion and amino acid embedding. BMC Genomics 2020; 21:597. [PMID: 32859150 PMCID: PMC7455913 DOI: 10.1186/s12864-020-06978-0] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 01/31/2020] [Accepted: 08/11/2020] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND Antimicrobial resistance is one of our most serious health threats. Antimicrobial peptides (AMPs), effector molecules of the innate immune system, can defend host organisms against microbes, and most have shown a lower likelihood of inducing bacterial resistance than many conventional drugs. Thus, AMPs are gaining popularity as a better substitute for antibiotics. To aid researchers in novel AMP discovery, we design computational approaches to screen promising candidates. RESULTS In this work, we design a deep learning model that can learn amino acid embedding patterns, automatically extract sequence features, and fuse heterogeneous information. Results show that the proposed model outperforms state-of-the-art methods on recognition of AMPs. By visualizing data in some layers of the model, we overcome the black-box nature of deep learning, explain the working mechanism of the model, and find some important motifs in sequences. CONCLUSIONS The ACEP model can capture similarity between amino acids, calculate attention scores for different parts of a peptide sequence in order to spot important parts that significantly contribute to final predictions, and automatically fuse a variety of heterogeneous information or features. For high-throughput AMP recognition, open source software and datasets are made freely available at https://github.com/Fuhaoyi/ACEP.
Affiliation(s)
- Haoyi Fu
- School of Information Science and Engineering, Yunnan University, Kunming, 650500, China
| | - Zicheng Cao
- School of Public Health (Shenzhen), Sun Yat-sen University, Guangzhou, 510006, China
| | - Mingyuan Li
- School of Information Science and Engineering, Yunnan University, Kunming, 650500, China
| | - Shunfang Wang
- School of Information Science and Engineering, Yunnan University, Kunming, 650500, China.
|
26
|
|
27
|
Payrovnaziri SN, Chen Z, Rengifo-Moreno P, Miller T, Bian J, Chen JH, Liu X, He Z. Explainable artificial intelligence models using real-world electronic health record data: a systematic scoping review. J Am Med Inform Assoc 2020; 27:1173-1185. [PMID: 32417928 PMCID: PMC7647281 DOI: 10.1093/jamia/ocaa053] [Citation(s) in RCA: 87] [Impact Index Per Article: 21.8] [Received: 01/22/2020] [Revised: 04/01/2020] [Accepted: 04/07/2020] [Indexed: 01/08/2023] Open
Abstract
OBJECTIVE To conduct a systematic scoping review of explainable artificial intelligence (XAI) models that use real-world electronic health record data, categorize these techniques according to different biomedical applications, identify gaps of current studies, and suggest future research directions. MATERIALS AND METHODS We searched MEDLINE, IEEE Xplore, and the Association for Computing Machinery (ACM) Digital Library to identify relevant papers published between January 1, 2009 and May 1, 2019. We summarized these studies based on the year of publication, prediction tasks, machine learning algorithm, dataset(s) used to build the models, the scope, category, and evaluation of the XAI methods. We further assessed the reproducibility of the studies in terms of the availability of data and code and discussed open issues and challenges. RESULTS Forty-two articles were included in this review. We reported the research trend and most-studied diseases. We grouped XAI methods into 5 categories: knowledge distillation and rule extraction (N = 13), intrinsically interpretable models (N = 9), data dimensionality reduction (N = 8), attention mechanism (N = 7), and feature interaction and importance (N = 5). DISCUSSION XAI evaluation is an open issue that requires a deeper focus in the case of medical applications. We also discuss the importance of reproducibility of research work in this field, as well as the challenges and opportunities of XAI from 2 medical professionals' point of view. CONCLUSION Based on our review, we found that XAI evaluation in medicine has not been adequately and formally practiced. Reproducibility remains a critical concern. Ample opportunities exist to advance XAI research in medicine.
Affiliation(s)
- Zhaoyi Chen
  - Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, Florida, USA
- Pablo Rengifo-Moreno
  - College of Medicine, Florida State University, Tallahassee, Florida, USA
  - Tallahassee Memorial Hospital, Tallahassee, Florida, USA
- Tim Miller
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia
- Jiang Bian
  - Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, Florida, USA
- Jonathan H Chen
  - Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Stanford, California, USA
  - Division of Hospital Medicine, Department of Medicine, Stanford University, Stanford, California, USA
- Xiuwen Liu
  - Department of Computer Science, Florida State University, Tallahassee, Florida, USA
- Zhe He
  - School of Information, Florida State University, Tallahassee, Florida, USA
|
28
|
Xu H, Jia P, Zhao Z. Deep4mC: systematic assessment and computational prediction for DNA N4-methylcytosine sites by deep learning. Brief Bioinform 2020; 22:5856341. [PMID: 32578842 DOI: 10.1093/bib/bbaa099] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Received: 02/24/2020] [Revised: 04/16/2020] [Accepted: 05/02/2020] [Indexed: 12/11/2022]
Abstract
DNA N4-methylcytosine (4mC) modification represents a novel form of epigenetic regulation. It is involved in various cellular processes, including DNA replication, cell cycle and gene expression, among others. In addition to experimental identification of 4mC sites, in silico prediction of 4mC sites in the genome has emerged as an alternative and promising approach. In this study, we first reviewed the current progress in the computational prediction of 4mC sites and systematically evaluated the predictive capacity of eight conventional machine learning algorithms as well as 12 feature types commonly used in previous studies in six species. Using a representative benchmark dataset, we investigated the contribution of feature selection and the stacking approach to model construction, and found that feature optimization and proper reinforcement learning could improve the performance. We next recollected newly added 4mC sites in the six species' genomes and developed a novel deep learning-based 4mC site predictor, namely Deep4mC. Deep4mC applies convolutional neural networks with four representative features. For species with small numbers of samples, we extended our deep learning framework with a bootstrapping method. Our evaluation indicated that Deep4mC could obtain high accuracy and robust performance, with average area under curve (AUC) values greater than 0.9 in all species (range: 0.9005-0.9722). Compared to previous tools, Deep4mC achieved an AUC improvement of 10.14% to 46.21% in these six species. A user-friendly web server (https://bioinfo.uth.edu/Deep4mC) was built for predicting putative 4mC sites in a genome.
Affiliation(s)
- Haodong Xu
  - Center for Precision Health, School of Biomedical Informatics
- Peilin Jia
  - Center for Precision Health, School of Biomedical Informatics
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics
|
29
|
Wan F, Zhu Y, Hu H, Dai A, Cai X, Chen L, Gong H, Xia T, Yang D, Wang MW, Zeng J. DeepCPI: A Deep Learning-based Framework for Large-scale in silico Drug Screening. GENOMICS PROTEOMICS & BIOINFORMATICS 2020; 17:478-495. [PMID: 32035227 PMCID: PMC7056933 DOI: 10.1016/j.gpb.2019.04.003] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 03/19/2019] [Accepted: 04/29/2019] [Indexed: 12/13/2022]
Abstract
Accurate identification of compound–protein interactions (CPIs) in silico may deepen our understanding of the underlying mechanisms of drug action and thus remarkably facilitate drug discovery and development. Conventional similarity- or docking-based computational methods for predicting CPIs rarely exploit latent features from currently available large-scale unlabeled compound and protein data and often limit their usage to relatively small-scale datasets. In the present study, we propose DeepCPI, a novel general and scalable computational framework that combines effective feature embedding (a technique of representation learning) with powerful deep learning methods to accurately predict CPIs at a large scale. DeepCPI automatically learns the implicit yet expressive low-dimensional features of compounds and proteins from a massive amount of unlabeled data. Evaluations of the measured CPIs in large-scale databases, such as ChEMBL and BindingDB, as well as of the known drug–target interactions from DrugBank, demonstrated the superior predictive performance of DeepCPI. Furthermore, several interactions among small-molecule compounds and three G protein-coupled receptor targets (glucagon-like peptide-1 receptor, glucagon receptor, and vasoactive intestinal peptide receptor) predicted using DeepCPI were experimentally validated. The present study suggests that DeepCPI is a useful and powerful tool for drug discovery and repositioning. The source code of DeepCPI can be downloaded from https://github.com/FangpingWan/DeepCPI.
Affiliation(s)
- Fangping Wan
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China
- Yue Zhu
  - The National Center for Drug Screening and the CAS Key Laboratory of Receptor Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- Hailin Hu
  - School of Medicine, Tsinghua University, Beijing 100084, China
- Antao Dai
  - The National Center for Drug Screening and the CAS Key Laboratory of Receptor Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- Xiaoqing Cai
  - The National Center for Drug Screening and the CAS Key Laboratory of Receptor Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- Ligong Chen
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing 100084, China
- Haipeng Gong
  - School of Life Science, Tsinghua University, Beijing 100084, China
- Tian Xia
  - Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan 430074, China
- Dehua Yang
  - The National Center for Drug Screening and the CAS Key Laboratory of Receptor Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China.
- Ming-Wei Wang
  - The National Center for Drug Screening and the CAS Key Laboratory of Receptor Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China; Shanghai Medical College, Fudan University, Shanghai 200032, China.
- Jianyang Zeng
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China; MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China.
|
30
|
Trivedi J, Mahajan D, Jaffe RJ, Acharya A, Mitra D, Byrareddy SN. Recent Advances in the Development of Integrase Inhibitors for HIV Treatment. Curr HIV/AIDS Rep 2020; 17:63-75. [PMID: 31965427 PMCID: PMC7004278 DOI: 10.1007/s11904-019-00480-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 01/10/2023]
Abstract
PURPOSE OF THE REVIEW: The complex multistep life cycle of HIV allows it to proliferate within the host and integrate its genome into the host chromosomal DNA. This provirus can remain dormant for an indefinite period. The process of integration, governed by integrase (IN), is highly conserved across the Retroviridae family. Hence, targeting integration is not only expected to block HIV replication but may also reveal new therapeutic strategies to treat HIV as well as other retrovirus infections.
RECENT FINDINGS: HIV integrase (IN) has gained attention as the most promising therapeutic target because no equivalent homologues of IN have been discovered in humans. Although current nano-formulated long-acting IN inhibitors have demonstrated a phenomenal ability to block HIV integration and replication with an extraordinary half-life, they also have certain limitations. In this review, we summarize the current literature on clinically established IN inhibitors, their mechanism of action, the advantages and disadvantages associated with their therapeutic application, and finally current HIV cure strategies using these inhibitors.
Affiliation(s)
- Jay Trivedi
  - National Centre for Cell Science, Pune University Campus, Pune, Maharashtra, India
  - Department of Pharmacology and Experimental Neuroscience, University of Nebraska Medical Center, Omaha, NE, USA
- Dinesh Mahajan
  - Drug Discovery Research Centre, Translational Health Science and Technology Institute, NCR Biotech Science Cluster, 3rd Milestone, Faridabad, Haryana, India
- Russell J Jaffe
  - Department of Pharmacology and Experimental Neuroscience, University of Nebraska Medical Center, Omaha, NE, USA
- Arpan Acharya
  - Department of Pharmacology and Experimental Neuroscience, University of Nebraska Medical Center, Omaha, NE, USA
- Debashis Mitra
  - National Centre for Cell Science, Pune University Campus, Pune, Maharashtra, India.
  - Centre for DNA Fingerprinting and Diagnostics, Uppal Telangana state, Hyderabad, India.
- Siddappa N Byrareddy
  - Department of Pharmacology and Experimental Neuroscience, University of Nebraska Medical Center, Omaha, NE, USA.
  - Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE, USA.
  - Department of Biochemistry and Molecular Biology, University of Nebraska Medical Center, Omaha, NE, USA.
|