1
Xu H, Jia J, Jeong HH, Zhao Z. Deep learning for detecting and elucidating human T-cell leukemia virus type 1 integration in the human genome. PATTERNS (NEW YORK, N.Y.) 2023; 4:100674. [PMID: 36873907 PMCID: PMC9982299 DOI: 10.1016/j.patter.2022.100674] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2022] [Revised: 11/02/2022] [Accepted: 12/13/2022] [Indexed: 02/12/2023]
Abstract
Human T-cell leukemia virus type 1 (HTLV-1), a retrovirus, is the causative agent of adult T cell leukemia/lymphoma and many other human diseases. Accurate and high-throughput detection of HTLV-1 virus integration sites (VISs) across the host genomes plays a crucial role in the prevention and treatment of HTLV-1-associated diseases. Here, we developed DeepHTLV, the first deep learning framework for VIS prediction de novo from genome sequence, motif discovery, and cis-regulatory factor identification. We demonstrated the high accuracy of DeepHTLV with more efficient and interpretable feature representations. Decoding the informative features captured by DeepHTLV resulted in eight representative clusters with consensus motifs for potential HTLV-1 integration. Furthermore, DeepHTLV revealed interesting cis-regulatory elements in the regulation of VISs that have significant association with the detected motifs. Literature evidence demonstrated that nearly half (34) of the predicted transcription factors enriched with VISs were involved in HTLV-1-associated diseases. DeepHTLV is freely available at https://github.com/bsml320/DeepHTLV.
Affiliation(s)
- Haodong Xu
  - Center for Precision Health, School of Biomedical Informatics, UTHealth Science Center at Houston, Houston, TX 77030, USA
- Johnathan Jia
  - Center for Precision Health, School of Biomedical Informatics, UTHealth Science Center at Houston, Houston, TX 77030, USA; MD Anderson UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA
- Hyun-Hwan Jeong
  - Center for Precision Health, School of Biomedical Informatics, UTHealth Science Center at Houston, Houston, TX 77030, USA
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, UTHealth Science Center at Houston, Houston, TX 77030, USA; MD Anderson UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
2
Xu H, Zhao Z. NetBCE: An Interpretable Deep Neural Network for Accurate Prediction of Linear B-cell Epitopes. GENOMICS, PROTEOMICS & BIOINFORMATICS 2022; 20:1002-1012. [PMID: 36526218 PMCID: PMC10025766 DOI: 10.1016/j.gpb.2022.11.009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/05/2022] [Revised: 10/27/2022] [Accepted: 11/11/2022] [Indexed: 12/15/2022]
Abstract
Identification of B-cell epitopes (BCEs) plays an essential role in the development of peptide vaccines and immuno-diagnostic reagents, as well as antibody design and production. In this work, we generated a large benchmark dataset comprising 124,879 experimentally supported linear epitope-containing regions in 3567 protein clusters from over 1.3 million B cell assays. Analysis of this curated dataset showed large pathogen diversity covering 176 different families. The accuracy of linear BCE prediction was found to vary strongly with different features, while all sequence-derived and structural features were informative. To search for more efficient and interpretable feature representations, a ten-layer deep learning framework for linear BCE prediction, namely NetBCE, was developed. NetBCE achieved high accuracy and robust performance, with an average area under the curve (AUC) value of 0.8455 in five-fold cross-validation, through automatically learning the informative classification features. NetBCE substantially outperformed the conventional machine learning algorithms and other tools, with more than 22.06% improvement of AUC value compared to other tools using an independent dataset. Through investigating the output of important network modules in NetBCE, epitopes and non-epitopes tended to be presented in distinct regions with efficient feature representation along the network layer hierarchy. NetBCE is freely available at https://github.com/bsml320/NetBCE.
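The abstract reports performance as an average AUC over five cross-validation folds. As a minimal sketch of what that metric measures, the snippet below computes AUC as the probability that a randomly chosen positive scores higher than a randomly chosen negative, then averages it over folds. The function names and the toy labels and scores are illustrative assumptions; they are not taken from NetBCE or its dataset.

```python
def auc(labels, scores):
    """AUC via pairwise comparison of positives and negatives.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auc_over_folds(folds):
    """Average the per-fold AUC; each fold is a (labels, scores) pair."""
    return sum(auc(labels, scores) for labels, scores in folds) / len(folds)


# Toy folds (hypothetical values, for illustration only).
toy_folds = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7]),
]
toy_mean_auc = mean_auc_over_folds(toy_folds)
```

The pairwise form is quadratic in the number of examples, so it suits small illustrations; a rank-based computation would be the usual choice for a dataset of the size described above.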
Affiliation(s)
- Haodong Xu
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA; Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA; The University of Texas MD Anderson Cancer Center UTHealth Houston Graduate School of Biomedical Sciences, Houston, TX 77030, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
3
Designing optimal convolutional neural network architecture using differential evolution algorithm. PATTERNS 2022; 3:100567. [PMID: 36124301 PMCID: PMC9481963 DOI: 10.1016/j.patter.2022.100567] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/10/2022] [Revised: 06/04/2022] [Accepted: 07/13/2022] [Indexed: 01/08/2023]
Abstract
Convolutional neural networks (CNNs) are deep learning models used widely for solving various tasks like computer vision and speech recognition. CNNs are developed manually based on problem-specific domain knowledge and tricky settings, which are laborious, time consuming, and challenging. To solve these, our study develops an improved differential evolution of convolutional neural network (IDECNN) algorithm to design CNN layer architectures for image classification. Variable-length encoding is utilized to represent the flexible layer architecture of a CNN model in IDECNN. An efficient heuristic mechanism is proposed in IDECNN to evolve CNN architecture through mutation and crossover to prevent premature convergence during the evolutionary process. Eight well-known imaging datasets were utilized. The results showed that IDECNN could design suitable architecture compared with 20 existing CNN models. Finally, CNN architectures are applied to pneumonia and coronavirus disease 2019 (COVID-19) X-ray biomedical image data. The results demonstrated the usefulness of the proposed approach to generate a suitable CNN model.
- Introduce DE algorithm to automatically design CNN architectures
- Variable-length encoding strategy is proposed to encode each CNN model
- For the DE framework, two CNN architectures undergo a refinement difference approach
- Design a heuristic mechanism for mutation operation to evolve CNN architectures
Convolutional neural networks (CNNs) are a class of deep learning (DL) methods that have demonstrated improved performance in various computer vision tasks. With the growing popularity of CNNs, several CNN architectures have been introduced with a large number of design options that are problem dependent. In most situations, the constructed CNN model performs well on the dataset used to train it. There is no guarantee that the designed CNN model can achieve sufficient classification accuracy for other datasets. Designing an appropriate CNN model architecture for a particular problem requires human interaction and trial-and-error procedures, which are laborious and time consuming. This study uses an improved differential evolution of convolutional neural network (IDECNN) technique to automatically construct effective CNN architectures for several image classification problems, which mitigates the issues found with manually designed CNN models.
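The mutation and crossover steps mentioned in the abstract come from differential evolution (DE). The sketch below shows the classic DE/rand/1/bin scheme on fixed-length real-valued vectors, minimizing a toy objective. It is not IDECNN: the paper's variable-length CNN encoding and its heuristic mutation are not reproduced here, and the function name, parameter defaults, and the `sphere` objective are all illustrative assumptions.

```python
import random


def differential_evolution(fitness, dim, bounds, pop_size=20, f=0.5,
                           cr=0.9, generations=100, seed=0):
    """Minimize `fitness` with classic DE/rand/1/bin (illustrative sketch)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Random initial population inside the bounds.
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    scores = [fitness(ind) for ind in pop]

    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals, all different from the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees one mutated gene
            trial = []
            for j in range(dim):
                if rng.random() < cr or j == j_rand:
                    # Mutation: base vector plus a scaled difference vector.
                    value = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    value = min(max(value, lo), hi)  # clip to the bounds
                else:
                    # Crossover: inherit the gene from the target vector.
                    value = pop[i][j]
                trial.append(value)
            # Greedy selection: keep the trial only if it is no worse.
            trial_score = fitness(trial)
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score

    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best_vector, best_score = differential_evolution(sphere, dim=3, bounds=(-5.0, 5.0))
```

In an architecture search setting, the vector would encode layer choices and the fitness would be a validation error, which makes each evaluation far more expensive than in this toy problem.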
4
Tang D, Li Y, Tan D, Fu J, Tang Y, Lin J, Zhao R, Du H, Zhao Z. KCOSS: an ultra-fast k-mer counter for assembled genome analysis. Bioinformatics 2022; 38:933-940. [PMID: 34849595 DOI: 10.1093/bioinformatics/btab797] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/13/2021] [Revised: 10/13/2021] [Accepted: 11/19/2021] [Indexed: 02/03/2023]
Abstract
MOTIVATION: The k-mer frequency in whole genome sequences provides researchers with an insightful perspective on genomic complexity, comparative genomics, metagenomics and phylogeny. The current k-mer counting tools are typically slow, and they require large amounts of memory and disk space for assembled genome analysis.
RESULTS: We propose a novel and ultra-fast k-mer counting algorithm, KCOSS, which counts k-mers mainly for assembled genomes using a segmented Bloom filter, a lock-free queue, a lock-free thread pool and a cuckoo hash table. We optimize running time and memory consumption by recycling memory blocks, merging multiple consecutive first-occurrence k-mers into a C-read, and writing a set of C-reads to disk asynchronously. KCOSS was comparatively tested with Jellyfish2, CHTKC and KMC3 on seven assembled genomes and three sequencing datasets in running time, memory consumption, and hard disk occupation. The experimental results show that KCOSS counts k-mers with less memory and disk while having a shorter running time on assembled genomes. KCOSS can be used to calculate the k-mer frequency not only for assembled genomes but also for sequencing data.
AVAILABILITY AND IMPLEMENTATION: The KCOSS software is implemented in C++. It is freely available on GitHub: https://github.com/kcoss-2021/KCOSS.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
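To illustrate why a Bloom filter helps with k-mer counting, here is a minimal single-threaded sketch: the filter flags k-mers that have probably been seen before, so only repeated k-mers occupy a slot in the exact counting table. This is not the KCOSS implementation. It omits the segmented filter, lock-free structures, C-reads and disk output described above, a plain Python `dict` stands in for the cuckoo hash table, and all names and sizes are illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter over strings (illustrative sizes, not tuned)."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive several hash positions by salting BLAKE2b with a seed byte.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add_and_check(self, item):
        """Set the item's bits; return True if all were already set."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def count_kmers(sequence, k):
    """Return (counts of repeated k-mers, number of first occurrences).

    Assumptions: no reverse-complement canonicalization, and Bloom filter
    false positives (which would inflate a count) are ignored.
    """
    bloom = BloomFilter()
    repeated = {}  # stand-in for the hash table of k-mers seen at least twice
    n_first = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in repeated:
            repeated[kmer] += 1
        elif bloom.add_and_check(kmer):
            repeated[kmer] = 2  # second sighting: promote to the exact table
        else:
            n_first += 1  # first sighting: recorded only in the filter
    return repeated, n_first


repeated_counts, first_occurrences = count_kmers("ACGTACGTAC", 4)
```

A k-mer that never reaches the table has an implied count of one, which is where the memory saving comes from when most k-mers in an assembled genome occur only once.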
Affiliation(s)
- Deyou Tang
  - School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China; Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Yucheng Li
  - School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Daqiang Tan
  - School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Juan Fu
  - School of Medicine, South China University of Technology, Guangzhou, Guangdong 510006, China
- Yelei Tang
  - School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Jiabin Lin
  - School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Rong Zhao
  - School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Hongli Du
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA; Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA; MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA