1
Chong LC, Khan AM. A Systematic Bioinformatics Approach for Mapping the Minimal Set of a Viral Peptidome. Curr Protoc 2024; 4:e1056. [PMID: 38856995 DOI: 10.1002/cpz1.1056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]
Abstract
Sequence changes in viral genomes generate protein sequence diversity that enables viruses to evade the host immune system, hindering the development of effective preventive and therapeutic interventions. The massive proliferation of sequence data provides unprecedented opportunities to study viral adaptation and evolution. An alignment-free approach removes various restrictions posed by an alignment-dependent approach for studying sequence diversity. The publicly available tool, UNIQmin, offers an alignment-free approach for studying viral sequence diversity at any given rank of taxonomy lineage and is big data ready. The tool performs an exhaustive search to determine the minimal set of sequences required to capture the peptidome diversity within a given dataset. This compression is possible through the removal of identical sequences and unique sequences that do not contribute effectively to the peptidome diversity pool. Herein, we describe a detailed four-part protocol utilizing UNIQmin to generate the minimal set for the purpose of viral diversity analyses, alignment-free at any rank of the taxonomy lineage, using the recent global public health threat Monkeypox virus (MPX) sequence data as a case study. The protocol enables a systematic bioinformatics approach to study sequence diversity across taxonomic lineages, which is crucial for our future preparedness against viral epidemics. This is particularly important when data are abundant, freely available, and alignment is not an option. © 2024 Wiley Periodicals LLC. Basic Protocol 1: Tool installation and input file preparation; Basic Protocol 2: Generation of a minimal set of sequences for a given dataset; Basic Protocol 3: Comparative minimal set analysis across taxonomic lineage ranks; Basic Protocol 4: Factors affecting the minimal set of sequences.
Affiliation(s)
- Li Chuin Chong
- Centre for Bioinformatics, School of Data Sciences, Perdana University, Kuala Lumpur, Malaysia
- Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz, Turkey
- Current affiliation: Institute for Experimental Virology, TWINCORE Centre for Experimental and Clinical Infection Research, a Medical School Hannover (MHH) and Helmholtz Centre for Infection Research (HZI) joint venture, Hannover, Germany
- Asif M Khan
- Centre for Bioinformatics, School of Data Sciences, Perdana University, Kuala Lumpur, Malaysia
- Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz, Turkey
- Current affiliation: College of Computing and Information Technology, University of Doha for Science and Technology, Doha, Qatar
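The minimal-set idea in the Chong and Khan abstract above (drop identical sequences, then drop sequences that add nothing to the peptide pool) can be illustrated with a small greedy sketch. This is not UNIQmin's actual algorithm, which the abstract describes as an exhaustive search; the function names and the default k-mer length `k=9` are assumptions made purely for illustration.

```python
def kmers(seq, k):
    """Return the set of overlapping k-mers (peptides of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_minimal_set(seqs, k=9):
    """Greedy approximation of a minimal set (hypothetical helper, not UNIQmin).

    Keeps picking the sequence that covers the most not-yet-covered k-mers
    until every k-mer seen in the dataset is covered.
    """
    unique = list(dict.fromkeys(seqs))  # remove identical sequences, keep order
    kmer_sets = {s: kmers(s, k) for s in unique}
    uncovered = set()
    for ks in kmer_sets.values():
        uncovered |= ks
    chosen = []
    while uncovered:
        # The sequence contributing the most new k-mers; ties go to the earliest.
        best = max(unique, key=lambda s: len(kmer_sets[s] & uncovered))
        chosen.append(best)
        uncovered -= kmer_sets[best]
    return chosen


if __name__ == "__main__":
    toy = ["ABCDE", "ABCDE", "BCD", "CDEFG"]
    print(greedy_minimal_set(toy, k=3))
```

In the toy input, the duplicate `ABCDE` and the fully contained `BCD` are dropped because they contribute no new 3-mers. A greedy pass does not guarantee the smallest possible set.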
2
Wang T, Yu ZG, Li J. CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model. Front Microbiol 2024; 15:1339156. [PMID: 38572227 PMCID: PMC10987876 DOI: 10.3389/fmicb.2024.1339156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Accepted: 02/23/2024] [Indexed: 04/05/2024] Open
Abstract
Traditional alignment-based methods meet serious challenges in genome sequence comparison and phylogeny reconstruction due to their high computational complexity. Here, we propose a new alignment-free method to analyze the phylogenetic relationships (classification) among species. In our method, the dynamical language (DL) model and the chaos game representation (CGR) method are used to characterize the frequency information and the context information of k-mers in a sequence, respectively. Then for each DNA sequence or protein sequence in a dataset, our method converts the sequence into a feature vector that represents the sequence information based on CGR weighted by the DL model to infer phylogenetic relationships. We name our method CGRWDL. Its performance was tested on both DNA and protein sequences of 8 datasets of viruses to construct the phylogenetic trees. We compared the Robinson-Foulds (RF) distance between the phylogenetic tree constructed by CGRWDL and the reference tree by other advanced methods for each dataset. The results show that the phylogenetic trees constructed by CGRWDL can accurately classify the viruses, and the RF scores between the trees and the reference trees are smaller than that with other methods.
Affiliation(s)
- Ting Wang
- National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Zu-Guo Yu
- National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Jinyan Li
- School of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Shenzhen, Guangdong, China
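For readers unfamiliar with the chaos game representation (CGR) mentioned in the Wang et al. abstract above, the following is a minimal sketch of the plain CGR of a DNA sequence. It does not include the dynamical language weighting that defines CGRWDL, and the corner assignment and the function name `cgr_points` are assumptions chosen for illustration.

```python
# One common corner layout for the unit square; other layouts are in use.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory of a DNA sequence as a list of (x, y) points.

    Starting from the centre of the unit square, each base moves the current
    point halfway towards that base's corner. Characters other than A, C, G, T
    (for example N) are skipped.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


if __name__ == "__main__":
    print(cgr_points("AG"))
```

Binning these points on a 2^k by 2^k grid gives k-mer counts, which is why CGR is often used as a fixed-length feature vector for alignment-free comparison.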
3
Qiu X, Liu Y, Sha A. SARS-CoV-2 and natural infection in animals. J Med Virol 2023; 95:e28147. [PMID: 36121159 PMCID: PMC9538246 DOI: 10.1002/jmv.28147] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Received: 06/20/2022] [Revised: 09/02/2022] [Accepted: 09/12/2022] [Indexed: 01/11/2023]
Abstract
Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) is the causative agent of the novel coronavirus disease (COVID-19) pandemic, which has caused serious challenges for public health systems worldwide. Due to the close relationship between animals and humans, confirmed transmission from humans to numerous animal species has been reported. Understanding the cross-species transmission of SARS-CoV-2 and the infection and transmission dynamics of SARS-CoV-2 in different animals is crucial to control COVID-19 and protect animal health. In this review, the possible animal origins of SARS-CoV-2 and animal species naturally susceptible to SARS-CoV-2 infection are discussed. Furthermore, this review categorizes the SARS-CoV-2 susceptible animals by families, so as to better understand the relationship between SARS-CoV-2 and animals.
Affiliation(s)
- Xinyu Qiu
- School of Biology and Food Engineering, Chongqing Three Gorges University, Chongqing, China
- Yi Liu
- School of Biology and Food Engineering, Chongqing Three Gorges University, Chongqing, China
- Ailong Sha
- School of Teacher Education, Chongqing Three Gorges University, Chongqing, China
4
Jain M, Patil N, Gor D, Sharma MK, Goel N, Kaushik P. Proteomic Approach for Comparative Analysis of the Spike Protein of SARS-CoV-2 Omicron (B.1.1.529) Variant and Other Pango Lineages. Proteomes 2022; 10:34. [PMID: 36278694 PMCID: PMC9624331 DOI: 10.3390/proteomes10040034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2022] [Revised: 10/11/2022] [Accepted: 10/13/2022] [Indexed: 11/24/2022] Open
Abstract
The novel SARS-CoV-2 variant Omicron (B.1.1.529) is under investigation, and the WHO has characterized it as a variant of concern due to its higher transmissibility, highly contagious behavior, and immunization breakthrough cases. Here, a comparative proteomic study was conducted on the spike protein and hACE2 for five lineages (α, β, δ, γ, and Omicron). Docking of the spike protein with hACE2 was performed using HADDOCK, and PRODIGY was used to analyze the binding affinity using a reduced HADDOCK score. The variant protein structures were then superimposed and the root mean square deviation (RMSD) was calculated. The study reveals that Omicron forms a monophyletic clade. Further, because the α variant was the first major strain to emerge after Wuhan SARS-CoV-2, it showed the lowest similarity to Omicron, suggesting that Omicron emerged recently with a very large number of mutations. The α variant showed the highest binding affinity with hACE2, followed by the β and γ variants. Omicron showed the second-lowest binding affinity, while the δ variant had the lowest. This in silico proteomic analysis of the variable spike proteins of the variants will shed light on vaccine development and on the identification of mutations occurring in upcoming variants.
Affiliation(s)
- Mukul Jain
- Parul Institute of Applied Sciences, Parul University, Vadodara 391760, Gujarat, India
- Lab 209—Cell & Developmental Biology, Centre of Research for Development, Parul University, Vadodara 391760, Gujarat, India
- Nil Patil
- Parul Institute of Applied Sciences, Parul University, Vadodara 391760, Gujarat, India
- Lab 209—Cell & Developmental Biology, Centre of Research for Development, Parul University, Vadodara 391760, Gujarat, India
- Darshil Gor
- Parul Institute of Applied Sciences, Parul University, Vadodara 391760, Gujarat, India
- Mohit Kumar Sharma
- School of Molecular Medicine, Medical University of Warsaw, ul. Żwirki i Wigury 61, 02-091 Warsaw, Poland
- Malopolska Center of Biotechnology, 30-387 Krakow, Poland
- Neha Goel
- Institute of Biomedicine, University of Turku, 20500 Turku, Finland
5
Silva JM, Pratas D, Caetano T, Matos S. The complexity landscape of viral genomes. Gigascience 2022; 11:6661051. [PMID: 35950839 PMCID: PMC9366995 DOI: 10.1093/gigascience/giac079] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/03/2022] [Revised: 05/25/2022] [Accepted: 07/26/2022] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND: Viruses are among the shortest yet highly abundant species that harbor minimal instructions to infect cells, adapt, multiply, and exist. However, with the current substantial availability of viral genome sequences, the scientific repertory lacks a complexity landscape that automatically illuminates viral genomes' organization, relation, and fundamental characteristics. RESULTS: This work provides a comprehensive landscape of the viral genome's complexity (or quantity of information), identifying the most redundant and complex groups regarding their genome sequence while providing their distribution and characteristics at a large and local scale. Moreover, we identify and quantify inverted repeats abundance in viral genomes. For this purpose, we measure the sequence complexity of each available viral genome using data compression, demonstrating that adequate data compressors can efficiently quantify the complexity of viral genome sequences, including subsequences better represented by algorithmic sources (e.g., inverted repeats). Using a state-of-the-art genomic compressor on an extensive viral genomes database, we show that double-stranded DNA viruses are, on average, the most redundant viruses while single-stranded DNA viruses are the least. In contrast, double-stranded RNA viruses show a lower redundancy relative to single-stranded RNA. Furthermore, we extend the ability of data compressors to quantify local complexity (or information content) in viral genomes using complexity profiles, unprecedentedly providing a direct complexity analysis of human herpesviruses. We also conceive a features-based classification methodology that can accurately distinguish viral genomes at different taxonomic levels without direct comparisons between sequences. This methodology combines data compression with simple measures such as GC-content percentage and sequence length, followed by machine learning classifiers.
CONCLUSIONS: This article presents methodologies and findings that are highly relevant for understanding the patterns of similarity and singularity between viral groups, opening new frontiers for studying viral genomes' organization while depicting the complexity trends and classification components of these genomes at different taxonomic levels. The whole study is supported by an extensive website (https://asilab.github.io/canvas/) for comprehending the viral genome characterization using dynamic and interactive approaches.
Affiliation(s)
- Jorge Miguel Silva
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Diogo Pratas
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal; Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193 Aveiro, Portugal; Department of Virology, University of Helsinki, Haartmaninkatu 3, 00014 Helsinki, Finland
- Tânia Caetano
- Department of Biology, University of Aveiro, Campus Universitario de Santiago, 3810-193 Aveiro, Portugal
- Sérgio Matos
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal; Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193 Aveiro, Portugal
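The Silva et al. abstract above describes measuring genome complexity with data compression and combining it with GC content and sequence length as classification features. A rough sketch of that kind of feature extraction follows. It assumes the general-purpose `zlib` compressor as a stand-in for the specialised genomic compressor used in the study, so the values it produces are not comparable with the paper's results; the function names are hypothetical.

```python
import zlib


def compression_ratio(seq):
    """Compressed size divided by raw size; lower means more redundant.

    zlib is a stand-in here, not the genomic compressor used in the study.
    """
    raw = seq.upper().encode("ascii")
    if not raw:
        raise ValueError("sequence must not be empty")
    return len(zlib.compress(raw, 9)) / len(raw)


def gc_content(seq):
    """Fraction of bases that are G or C."""
    s = seq.upper()
    if not s:
        raise ValueError("sequence must not be empty")
    return (s.count("G") + s.count("C")) / len(s)


def features(seq):
    """Feature vector: (compression ratio, GC content, length)."""
    return (compression_ratio(seq), gc_content(seq), len(seq))


if __name__ == "__main__":
    print(features("ACGT" * 500))
```

A feature vector like this could then be passed to any standard classifier; the choice of classifier is outside the scope of this sketch.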
6
Ahmad SU, Hafeez Kiani B, Abrar M, Jan Z, Zafar I, Ali Y, Alanazi AM, Malik A, Rather MA, Ahmad A, Khan AA. A comprehensive genomic study, mutation screening, phylogenetic and statistical analysis of SARS-CoV-2 and its variant omicron among different countries. J Infect Public Health 2022; 15:878-891. [PMID: 35839568 PMCID: PMC9262654 DOI: 10.1016/j.jiph.2022.07.002] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 02/21/2022] [Revised: 06/16/2022] [Accepted: 07/03/2022] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND: With the rapid development of the genomic sequence data for the Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and its variants Delta (B.1.617.2) and Omicron (B.1.1.529), it is vital to successfully identify mutations within the genome. OBJECTIVE: The main objective of the study is to investigate the full-length genome mutation analysis of 157 SARS-CoV-2 and its variant Delta and Omicron isolates. This study also provides possible effects at the structural level to understand the role of mutations and new insights into the evolution of COVID-19 and evaluates the differential level analysis in viral genome sequence among different nations. We have also tried to offer a mutation snapshot for these differences that could help in vaccine formulation. This study utilizes a unique and efficient method of targeting the stable genes for the drug discovery approach. METHODS: Complete genome sequence information of SARS-CoV-2, Delta, and Omicron from online resources were used to predict structure domain identification, data mining, and screening; employing different bioinformatics tools. BioEdit software was used to perform their genomic alignments across countries and a phylogenetic tree as per the confidence of 500 bootstrapping values was constructed. Heterozygosity ratios were determined in-silico. A minimum spanning network (MSN) of selected populations was determined by Bruvo's distance role-based framework. RESULTS: Out of all 157 different strains of SARS-CoV-2 and its variants, and their complete genome sequences from different countries, Corona nucleoca and DUF5515 were observed to be the most conserved domains. All genomes obtained changes in comparison to the Wuhan-Hu-1 strain, mainly in the TRS region (CUAAAC or ACGAAC). We discovered 596 mutations in all genes, with the highest number (321) found in ORF1ab (QHD43415.1), or TRS site mutations found only in ORF7a (1) and ORF10 (2). The Omicron variant has 30 mutations in the Spike protein and has a higher alpha-helix shape (23.46%) than the Delta version (22.03%). T478 was also discovered to be a prevalent polymorphism in Delta and Omicron variations, as well as genomic gaps ranging from 45 to 65aa. All 157 sequences contained variations and conformed to Nei's Genetic distance. We discovered heterozygosity (Hs) 0.01, mean anticipated Hs 0.32, the genetic diversity index (GDI) 0.01943989, and GD within population 0.01266951. The Hedrick value was 0.52324978, the GD coefficient was 0.52324978, the average Hs was 0.01371452, and the GD coefficient was 0.52324978. Among other countries, Brazil has the highest standard error (SE) rate (1.398), whereas Japan has the highest ratio of Nei's gene diversity (0.01). CONCLUSIONS: The study's findings will assist in comprehending the shape and kind of complete genome, their streaming genomic sequences, and mutations in various additions of SARS-CoV-2, as well as its different variant strains like Omicron. These results will provide a scientific basis to design the vaccines and understand the genomic study of these viruses.
Affiliation(s)
- Syed Umair Ahmad
- Department of Bioinformatics, Hazara University, Mansehra, Pakistan
- Bushra Hafeez Kiani
- Department of Biological Sciences, Faculty of Basic and Applied Sciences, International Islamic University Islamabad, 44000, Pakistan
- Muhammad Abrar
- Department of Anesthesia, DHQ Teaching Hospital, Sahiwal Medical College, Sahiwal, Pakistan
- Zainab Jan
- Department of Bioinformatics, Hazara University, Mansehra, Pakistan
- Imran Zafar
- Department of Bioinformatics and Computational Biology, Virtual University, Pakistan
- Yasir Ali
- National Centre for Bioinformatics, Quaid-i-Azam University, Islamabad, Pakistan
- Amer M. Alanazi
- Pharmaceutical Biotechnology Laboratory, Department of Pharmaceutical Chemistry, College of Pharmacy, King Saud University, P.O. Box 2457, Riyadh 11451, Saudi Arabia
- Abdul Malik
- Department of Pharmaceutics, College of Pharmacy, King Saud University, P.O. Box 2457, Riyadh 11451, Saudi Arabia
- Mohd Ashraf Rather
- Division of Fish Genetics and Biotechnology, Faculty of Fisheries Ganderbal, Sher-e-Kashmir University of Agricultural Science and Technology, Kashmir, India
- Asrar Ahmad
- Center for Sickle Cell Disease, College of Medicine, Howard University, Washington DC, USA
- Azmat Ali Khan (corresponding author)
- Pharmaceutical Biotechnology Laboratory, Department of Pharmaceutical Chemistry, College of Pharmacy, King Saud University, P.O. Box 2457, Riyadh 11451, Saudi Arabia