1
Alipour F, Hill KA, Kari L. CGRclust: Chaos Game Representation for twin contrastive clustering of unlabelled DNA sequences. BMC Genomics 2024; 25:1214. [PMID: 39695938 DOI: 10.1186/s12864-024-11135-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2024] [Accepted: 12/06/2024] [Indexed: 12/20/2024]
Abstract
BACKGROUND: Traditional supervised learning methods applied to DNA sequence taxonomic classification rely on the labor-intensive and time-consuming step of labelling the primary DNA sequences. Additionally, standard DNA classification/clustering methods involve time-intensive multiple sequence alignments, which limits their applicability to large genomic datasets or distantly related organisms. These limitations indicate a need for robust, efficient, and scalable unsupervised DNA sequence clustering methods that do not depend on sequence labels or alignment.
RESULTS: This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences, with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (herein applied to two-dimensional CGR images) for clustering datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. Moreover, CGRclust also consistently demonstrates superior performance across all the viral genomic datasets. The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in terms of sequence length, number of genomes, number of clusters, and level of taxonomy, demonstrates its robustness, scalability, and versatility.
CONCLUSION: CGRclust is a novel, scalable, alignment-free DNA sequence clustering method that uses CGR images of DNA sequences and CNNs for twin contrastive clustering of unlabelled primary DNA sequences, achieving superior or comparable accuracy and performance over current approaches. CGRclust demonstrated enhanced reliability, by consistently achieving over 80% accuracy in more than 90% of the datasets analyzed. In particular, CGRclust performed especially well in clustering viral DNA datasets, where it consistently outperformed all competing methods.
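The accuracies quoted in this abstract compare unsupervised cluster assignments with known taxonomic labels, which first requires deciding which cluster corresponds to which label. The Python sketch below shows one common way to do that; it is an illustration under stated assumptions, not the authors' evaluation code. The function name `clustering_accuracy` is hypothetical, and a brute-force search over cluster-to-label mappings stands in for the Hungarian algorithm, so it is only practical for a handful of clusters.

```python
from collections import Counter
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Best achievable accuracy over one-to-one cluster-to-label mappings.

    Hypothetical helper: tries every injective mapping of clusters to
    labels (brute force), so it is only suitable for small cluster counts.
    """
    if len(true_labels) != len(cluster_ids) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")

    labels = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    # counts[(cluster, label)] = number of items in that cluster with that label
    counts = Counter(zip(cluster_ids, true_labels))

    # If there are more clusters than labels, pad with None so that the
    # surplus clusters can be left unmatched (they contribute zero hits).
    padded = labels + [None] * max(0, len(clusters) - len(labels))

    best = 0
    for assignment in permutations(padded, len(clusters)):
        hits = sum(counts[(c, lab)] for c, lab in zip(clusters, assignment))
        best = max(best, hits)
    return best / len(true_labels)


if __name__ == "__main__":
    truth = ["a", "a", "b", "b", "c"]
    found = [1, 1, 0, 0, 0]
    print(clustering_accuracy(truth, found))
```

Because the score is maximised over mappings, relabelling the clusters does not change the result; only the grouping itself matters.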
Affiliation(s)
- Fatemeh Alipour
  - School of Computer Science, University of Waterloo, Waterloo, Canada
- Kathleen A Hill
  - Department of Biology, University of Western Ontario, London, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, Canada
2
Zhang YZ, Imoto S. Genome analysis through image processing with deep learning models. J Hum Genet 2024; 69:519-525. [PMID: 39085457 PMCID: PMC11422167 DOI: 10.1038/s10038-024-01275-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Revised: 07/08/2024] [Accepted: 07/08/2024] [Indexed: 08/02/2024]
Abstract
Genomic sequences are traditionally represented as strings of characters: A (adenine), C (cytosine), G (guanine), and T (thymine). However, an alternative approach involves depicting sequence-related information through image representations, such as Chaos Game Representation (CGR) and read pileup images. With rapid advancements in deep learning (DL) methods within computer vision and natural language processing, there is growing interest in applying image-based DL methods to genomic sequence analysis. These methods involve encoding genomic information as images or integrating spatial information from images into the analytical process. In this review, we summarize three typical applications that use image processing with DL models for genome analysis. We examine the utilization and advantages of these image-based approaches.
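As an illustration of the image encoding mentioned in this abstract, the Python sketch below builds a Chaos Game Representation by starting at the centre of the unit square and repeatedly moving halfway towards the corner assigned to each nucleotide. It is a minimal sketch, not code from the review: the function names `cgr_points` and `fcgr` are hypothetical, the corner assignment (A bottom-left, C top-left, G top-right, T bottom-right) is one convention among several, and skipping non-ACGT symbols is a simplifying assumption.

```python
# Assumed corner convention; other assignments are also used in the literature.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR point generated by each nucleotide of seq."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # assumption: ignore ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0  # move halfway to the corner
        points.append((x, y))
    return points


def fcgr(seq, k):
    """Count CGR points on a 2^k x 2^k grid (a frequency CGR image).

    After at least k steps, the cell a point falls in is determined by the
    last k nucleotides, so each cell count is a k-mer count.
    """
    n = 2 ** k
    grid = [[0] * n for _ in range(n)]
    for i, (x, y) in enumerate(cgr_points(seq)):
        if i < k - 1:
            continue  # the first k-1 points do not yet encode a full k-mer
        col = min(int(x * n), n - 1)
        row = min(int(y * n), n - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for row in fcgr("ACGTACGTTTGA", 2):
        print(row)
```

The resulting grid can be treated as a single-channel image, which is what makes image-based deep learning models applicable to sequence data.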
Affiliation(s)
- Yao-Zhong Zhang
  - Division of Health Medical Intelligence, Human Genome Center, the Institute of Medical Science, the University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
- Seiya Imoto
  - Division of Health Medical Intelligence, Human Genome Center, the Institute of Medical Science, the University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
3
Fahmy AM, Hammad MS, Mabrouk MS, Al-Atabany WI. On leveraging self-supervised learning for accurate HCV genotyping. Sci Rep 2024; 14:15463. [PMID: 38965254 PMCID: PMC11224313 DOI: 10.1038/s41598-024-64209-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2024] [Accepted: 06/06/2024] [Indexed: 07/06/2024]
Abstract
Hepatitis C virus (HCV) is a major global health concern, affecting millions of individuals worldwide. While existing literature predominantly focuses on disease classification using clinical data, there exists a critical research gap concerning HCV genotyping based on genomic sequences. Accurate HCV genotyping is essential for patient management and treatment decisions. Although neural models excel at capturing complex patterns, they still face challenges, such as data scarcity, that are common in computational genomics. To overcome these challenges, this paper introduces an advanced deep learning approach for HCV genotyping based on the graphical representation of nucleotide sequences that outperforms classical approaches. Notably, it is effective for both partial and complete HCV genomes and addresses challenges associated with imbalanced datasets. In this work, ten HCV genotypes (1a, 1b, 2a, 2b, 2c, 3a, 3b, 4, 5, and 6) were used in the analysis. This study utilizes Chaos Game Representation for 2D mapping of genomic sequences and employs self-supervised learning with a convolutional autoencoder for deep feature extraction, resulting in outstanding performance for HCV genotyping compared to various machine learning and deep learning models. This baseline provides a benchmark against which the performance of the proposed approach and other models can be evaluated. The experimental results showcase a remarkable classification accuracy of over 99%, outperforming traditional deep learning models. This performance demonstrates the capability of the proposed model to accurately identify HCV genotypes in both partial and complete sequences and to deal with data scarcity for certain genotypes. The results of the proposed model are compared with those of the NCBI genotyping tool.
Affiliation(s)
- Ahmed M Fahmy
  - Computer Science program, School of Information Technology and Computer Science (ITCS), Nile University, Sheikh Zayed City, Egypt
- Muhammed S Hammad
  - Biomedical Engineering Department, Faculty of Engineering, Helwan University, Cairo, Egypt
- Mai S Mabrouk
  - Biomedical informatics program, School of Information Technology and Computer Science (ITCS), Nile University, Sheikh Zayed City, Egypt
- Walid I Al-Atabany
  - Biomedical informatics program, School of Information Technology and Computer Science (ITCS), Nile University, Sheikh Zayed City, Egypt
4
Antar S, Abd El-Sattar HKH, Abdel-Rahman MH, F M Ghaleb F. COVID-19 infection segmentation using hybrid deep learning and image processing techniques. Sci Rep 2023; 13:22737. [PMID: 38123587 PMCID: PMC10733411 DOI: 10.1038/s41598-023-49337-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2023] [Accepted: 12/07/2023] [Indexed: 12/23/2023]
Abstract
The coronavirus disease 2019 (COVID-19) epidemic has become a worldwide problem that continues to affect people's lives daily, and the early diagnosis of COVID-19 is critically important for the treatment of infected patients by medical and healthcare organizations. To detect COVID-19 infections, medical imaging techniques, including computed tomography (CT) scan images and X-ray images, are considered some of the helpful medical tests that healthcare providers carry out. However, in addition to the difficulty of segmenting contaminated areas from CT scan images, these approaches also offer limited accuracy for identifying the virus. Accordingly, this paper addresses the effectiveness of using deep learning (DL) and image processing techniques, which serve to expand the dataset without the need for any augmentation strategies, and it also presents a novel approach for detecting COVID-19 virus infections in lung images, particularly the infection prediction issue. In our proposed method, to reveal the infection, the input images are first preprocessed using a threshold and then resized to 128 × 128. After that, a density heat map tool is used for coloring the resized lung images. The three channels (red, green, and blue) are then separated from the colored image, further preprocessed through image inversion and histogram equalization, and subsequently fed independently into three separate U-Nets with the same architecture for segmentation. Finally, the segmentation results are combined and run through a convolution layer one by one to obtain the detection. Several evaluation metrics on the CT scan dataset were used to measure the performance of the proposed approach in comparison with other state-of-the-art techniques in terms of accuracy, sensitivity, precision, and the Dice coefficient. The experimental results of the proposed approach reached 99.71%, 0.83, 0.87, and 0.85, respectively. These results show that coloring the CT scan images and then dividing each image into its RGB channels can enhance COVID-19 detection, and that merging the channel segmentation results increases the segmentation power of the U-Net. In comparison to other existing segmentation techniques employing bigger 512 × 512 images, this study is one of the few that can rapidly and correctly detect the COVID-19 virus with high accuracy on smaller 128 × 128 images using the metrics of accuracy, sensitivity, precision, and Dice coefficient.
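The four metrics reported in this abstract can all be derived from pixel-wise true/false positive and negative counts. The Python sketch below shows the standard definitions on binary masks; it is an illustration, not the authors' evaluation code. The function name `segmentation_metrics` and the nested-list mask format are assumptions, and returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def segmentation_metrics(pred, truth):
    """Pixel-wise accuracy, sensitivity, precision and Dice for binary masks.

    pred and truth are equally sized nested lists of 0/1 values
    (1 = infected pixel, 0 = background); this format is an assumption.
    """
    tp = fp = fn = tn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
            else:
                tn += 1

    def ratio(num, den):
        return num / den if den else 0.0  # sketch choice: 0.0 on empty denominator

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),          # recall on infected pixels
        "precision": ratio(tp, tp + fp),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),    # overlap of the two masks
    }


if __name__ == "__main__":
    pred = [[1, 1, 0, 0], [0, 0, 0, 0]]
    truth = [[1, 0, 1, 0], [0, 0, 0, 0]]
    print(segmentation_metrics(pred, truth))
```

Accuracy counts correctly labelled background pixels while the Dice coefficient does not, so when the infected region is a small part of the image, accuracy can be much higher than Dice for the same prediction.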
Affiliation(s)
- Samar Antar
  - Computer Science Division, Department of Mathematics, Faculty of Science, Ain Shams University, Abbassia, Cairo, 11566, Egypt
- Mohammad H Abdel-Rahman
  - Computer Science Division, Department of Mathematics, Faculty of Science, Ain Shams University, Abbassia, Cairo, 11566, Egypt
- Fayed F M Ghaleb
  - Computer Science Division, Department of Mathematics, Faculty of Science, Ain Shams University, Abbassia, Cairo, 11566, Egypt
5
Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, Lyu Q, Dun Y. Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models. Int J Mol Sci 2023; 24:15858. [PMID: 37958843 PMCID: PMC10649223 DOI: 10.3390/ijms242115858] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/30/2023] [Revised: 10/24/2023] [Accepted: 10/30/2023] [Indexed: 11/15/2023]
Abstract
The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning, since we expect a superhuman intelligence that explores beyond our knowledge to interpret the genome from deep learning. A powerful deep learning model should rely on the insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with proper deep learning-based architecture, and we remark on practical considerations of developing deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out current challenges and potential research directions for future genomics applications. We believe the collaborative use of ever-growing diverse data and the fast iteration of deep learning models will continue to contribute to the future of genomics.
Affiliation(s)
- Tianwei Yue
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Yuanxin Wang
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Longxiang Zhang
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Chunming Gu
  - Department of Biomedical Engineering, School of Medicine, Johns Hopkins University, Baltimore, MD 21218, USA
- Haoru Xue
  - The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Wenping Wang
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Qi Lyu
  - Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Yujie Dun
  - School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China
6
Marwan M, Han M, Khan R. Generation of multi-scrolls in corona virus disease 2019 (COVID-19) chaotic system and its impact on the zero-covid policy. Sci Rep 2023; 13:13954. [PMID: 37626140 PMCID: PMC10457353 DOI: 10.1038/s41598-023-40651-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2023] [Accepted: 08/16/2023] [Indexed: 08/27/2023]
Abstract
In this paper, we discuss, with the help of fuzzy theory, the impossibility of achieving zero COVID cases per day for all time, and we use multi-scrolls to elaborate how a single case can trigger a chaotic situation in a nearby city. To accomplish this goal, we consider the number of new cases per day, [Formula: see text], to be the preferred state variable by restricting its value to the interval (0, 1). One needs to think of [Formula: see text] as a member of a fuzzy set and provide that set with appropriate membership functions. Moreover, how a single incident in one city can spread chaos to other cities is also addressed at length, using multi-scroll attractors and the signal excitation function. In addition, a bifurcation diagram of daily new cases versus the parameter [Formula: see text] is shown, elaborating that daily new cases may decrease under strict rules and regulations, but can again lead to chaos. Apart from biologists, this paper can also play a vital role for engineers, in the sense that a signal function can be embedded in non-symmetric systems for the creation of multi-scroll attractors in all directions using a generalized algorithm that has been designed in the current work. Finally, our future target is to show that COVID is trending towards influenza and will no longer be as dangerous as it was in the past.
Affiliation(s)
- Muhammad Marwan
  - School of Mathematical Sciences, Zhejiang Normal University, Jinhua, 321004, China
- Maoan Han
  - School of Mathematical Sciences, Zhejiang Normal University, Jinhua, 321004, China
- Rizwan Khan
  - School of Computer Science, Zhejiang Normal University, Jinhua, 321004, China