1
|
Fahmy AM, Hammad MS, Mabrouk MS, Al-Atabany WI. On leveraging self-supervised learning for accurate HCV genotyping. Sci Rep 2024; 14:15463. [PMID: 38965254 PMCID: PMC11224313 DOI: 10.1038/s41598-024-64209-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2024] [Accepted: 06/06/2024] [Indexed: 07/06/2024] Open
Abstract
Hepatitis C virus (HCV) is a major global health concern, affecting millions of individuals worldwide. While existing literature predominantly focuses on disease classification using clinical data, there is a critical research gap concerning HCV genotyping based on genomic sequences. Accurate HCV genotyping is essential for patient management and treatment decisions. Although neural models excel at capturing complex patterns, they still face challenges, such as data scarcity, that are common in computational genomics. To overcome these challenges, this paper introduces a deep learning approach for HCV genotyping based on the graphical representation of nucleotide sequences that outperforms classical approaches. Notably, it is effective for both partial and complete HCV genomes and addresses challenges associated with imbalanced datasets. Ten HCV genotypes and subtypes were used in the analysis: 1a, 1b, 2a, 2b, 2c, 3a, 3b, 4, 5, and 6. The study uses Chaos Game Representation for 2D mapping of genomic sequences and self-supervised learning with a convolutional autoencoder for deep feature extraction, resulting in strong HCV genotyping performance compared to various machine learning and deep learning models. This baseline provides a benchmark against which the performance of the proposed approach and other models can be evaluated. The experimental results show a classification accuracy of over 99%, outperforming traditional deep learning models. This performance demonstrates the capability of the proposed model to accurately identify HCV genotypes in both partial and complete sequences and to deal with data scarcity for certain genotypes. The results of the proposed model are compared to those of the NCBI genotyping tool.
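The Chaos Game Representation step mentioned in this abstract can be illustrated with a minimal sketch. The corner assignment (A, C, G, T), the grid resolution `k`, and the function names `cgr_points` and `fcgr` are assumptions chosen for illustration; the abstract does not specify the authors' exact settings, and the autoencoder stage is not shown.

```python
# Hypothetical corner layout; a common convention, not confirmed by the abstract.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory of a nucleotide sequence in the unit square."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in seq.upper():
        if base == "U":  # treat RNA uracil like thymine
            base = "T"
        corner = CORNERS.get(base)
        if corner is None:
            continue  # skip ambiguous symbols such as N
        # Move halfway from the current point towards the base's corner.
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        points.append((x, y))
    return points


def fcgr(seq, k=3):
    """Count CGR points on a 2^k x 2^k grid (a frequency CGR image)."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for row in fcgr("ACGTACGGTTAC", k=2):
        print(row)
```

Because the grid size depends only on `k`, sequences of any length (partial or complete genomes) map to an image of the same shape, which is what makes such a representation convenient as input to a convolutional model.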
Affiliation(s)
- Ahmed M Fahmy
- Computer Science program, School of Information Technology and Computer Science (ITCS), Nile University, Sheikh Zayed City, Egypt
- Muhammed S Hammad
- Biomedical Engineering Department, Faculty of Engineering, Helwan University, Cairo, Egypt
- Mai S Mabrouk
- Biomedical informatics program, School of Information Technology and Computer Science (ITCS), Nile University, Sheikh Zayed City, Egypt
- Walid I Al-Atabany
- Biomedical informatics program, School of Information Technology and Computer Science (ITCS), Nile University, Sheikh Zayed City, Egypt
|
2
|
PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences. BIOLOGY 2022; 11:biology11030418. [PMID: 35336792 PMCID: PMC8945605 DOI: 10.3390/biology11030418] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/04/2022] [Revised: 02/24/2022] [Accepted: 03/07/2022] [Indexed: 01/14/2023]
Abstract
Simple Summary: The family of coronaviruses comprises a diverse set of strains and variants which cause diseases from the common cold to COVID-19. Moreover, they infect a wide array of hosts, from bats, camels, and birds to humans. Studying coronaviruses through the lens of host specificity provides a unique perspective for understanding the evolution, diversity and dynamics of this family. In particular, this can reveal groups of different hosts infected by similar strains, giving clues on strains which were more likely to have evolved to jump from one host to another. In this work, we frame host specificity as a classification task by designing a very compact numerical representation of the spike sequences of different coronaviruses. Based on this numerical representation, classification methods are able to detect the target host with high accuracy. Such an approach can be used to efficiently scale to large volumes of sequences, in order to unveil trends in the host specificity of different coronavirus strains.
Abstract: The study of host specificity has important connections to the question of the origin of SARS-CoV-2 in humans, which led to the COVID-19 pandemic and remains an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona)viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating, and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is important in determining host specificity, since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among birds, bats, camels, swine, humans, and weasels, to name a few. 
We propose a feature embedding based on the well-known position weight matrix (PWM), which we call PWM2Vec, and we use it to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications, such as determining protein function and identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs from viral sequences to generate fixed-length feature vector representations, and use them in the context of host classification. The results on real world data show that when using PWM2Vec, machine learning classifiers are able to perform comparably to the baseline models in terms of predictive performance and runtime—in some cases, the performance is better. We also measure the importance of different amino acids using information gain to show the amino acids which are important for predicting the host of a given coronavirus. Finally, we perform some statistical analyses on these results to show that our embedding is more compact than the embeddings of the baseline models.
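The general idea of turning a position weight matrix into a feature vector can be sketched as follows. This is an illustrative reading, not the authors' confirmed procedure: the k-mer length, the Laplace pseudocount, the uniform background frequency, and the function names `pwm_from_kmers` and `pwm_embedding` are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def pwm_from_kmers(kmers, pseudocount=1.0):
    """Build a log-odds position weight matrix from equal-length k-mers."""
    k = len(kmers[0])
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(kmers) + pseudocount * len(AMINO_ACIDS)
    pwm = []
    for pos in range(k):
        counts = Counter(kmer[pos] for kmer in kmers)
        pwm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pwm


def pwm_embedding(seq, k=3):
    """Score every k-mer of seq against the PWM built from seq's own k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return []  # sequence shorter than k
    pwm = pwm_from_kmers(kmers)
    # Non-standard symbols (e.g. X) contribute a weight of 0.
    return [
        sum(pwm[pos].get(aa, 0.0) for pos, aa in enumerate(kmer))
        for kmer in kmers
    ]


if __name__ == "__main__":
    print(pwm_embedding("MFVFLVLLPLVSSQ", k=3))
```

In this sketch the vector has one score per k-mer, so its length is fixed only when the input sequences have equal (for example, aligned) length; unaligned sequences would need padding or truncation before being passed to a classifier.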
|
3
|
Singh A, Mankotia DS, Irshad M. A Single-step Multiplex Quantitative Real Time Polymerase Chain Reaction Assay for Hepatitis C Virus Genotypes. J Transl Int Med 2017; 5:34-42. [PMID: 28680837 DOI: 10.1515/jtim-2017-0010] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 01/15/2023] Open
Abstract
BACKGROUND AND OBJECTIVES The variable response of hepatitis C virus (HCV) genotypes to antiviral treatment requires prior information on the genotype status before planning a therapeutic strategy. Although assays for typing or subtyping of HCV are available, a fast and reliable assay system is still needed. The present study was planned to develop a single-step multiplex quantitative real time polymerase chain reaction (qPCR) assay to determine HCV genotypes in patients' sera. METHODS The conserved sequences from the 5' UTR, core and NS5b regions of the HCV genome were used to design primers and hydrolysis probes labeled with fluorophores. Starting with the standardization of singleplex qPCR for each individual HCV genotype, the experimental conditions were finally optimized for the development of the multiplex assay. The sensitivity and specificity were assessed for both singleplex and multiplex assays. Using a template concentration of 10² copies per microliter, the quantification cycle (Cq) value and the limit of detection (LOD) were also compared for both singleplex and multiplex assays. Similarly, the merit of the multiplex assay was compared with sequence analysis and restriction fragment length polymorphism (RFLP) techniques used for HCV genotyping. To test the application of the multiplex qPCR assay, it was used for genotyping in a panel of 98 sera positive for HCV RNA after screening a total of 239 patients with various liver diseases. RESULTS The results demonstrated the presence of genotype 1 in 26 of 98 (26.53%) sera, genotype 3 in 65 (66.32%) and genotype 4 in 2 (2.04%). One sample showed mixed infection with genotypes 1 and 3. No genotype was detected in five samples. Genotypes 2, 5 and 6 were not detected in these serum samples. The analysis of sera by singleplex and RFLP indicated the results of multiplex to be comparable with singleplex, with a clear merit of multiplex over RFLP. 
In addition, the results of multiplex assay were also found to be comparable with those from sequence analysis. The sensitivity, specificity, Cq values and LOD values were compared and found to be closely associated both for singleplex and multiplex assays. CONCLUSION The multiplex qPCR assay was found to be a fast, specific and sensitive method that can be used as a technique of choice for HCV genotyping in all routine laboratories.
Affiliation(s)
- Akanksha Singh
- Clinical Biochemistry Division, Department of Laboratory Medicine, All India Institute of Medical Sciences, New Delhi-110029, India
- Dhananjay Singh Mankotia
- Clinical Biochemistry Division, Department of Laboratory Medicine, All India Institute of Medical Sciences, New Delhi-110029, India
- Mohammad Irshad
- Clinical Biochemistry Division, Department of Laboratory Medicine, All India Institute of Medical Sciences, New Delhi-110029, India
|
4
|
Qiu P, Stevens R, Wei B, Lahser F, Howe AYM, Klappenbach JA, Marton MJ. HCV genotyping from NGS short reads and its application in genotype detection from HCV mixed infected plasma. PLoS One 2015; 10:e0122082. [PMID: 25830316 PMCID: PMC4382110 DOI: 10.1371/journal.pone.0122082] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 12/15/2014] [Accepted: 02/10/2015] [Indexed: 12/12/2022] Open
Abstract
Genotyping of hepatitis C virus (HCV) plays an important role in the treatment of HCV. As new genotype-specific treatment options become available, it has become increasingly important to have accurate HCV genotype and subtype information to ensure that the most appropriate treatment regimen is selected. Most current genotyping methods are unable to detect mixed genotypes from two or more HCV infections. Next generation sequencing (NGS) allows for rapid and low cost mass sequencing of viral genomes and provides an opportunity to probe the viral population from a single host. In this paper, the possibility of using short NGS reads for direct HCV genotyping without genome assembly was evaluated. We surveyed the publicly-available genetic content of three HCV drug target regions (NS3, NS5A, NS5B) in terms of whether these genes contained genotype-specific regions that could predict genotype. Six genotypes and 38 subtypes were included in this study. An automated phylogenetic analysis based HCV genotyping method was implemented and used to assess different HCV target gene regions. Candidate regions of 250-bp each were found for all three genes that have enough genetic information to predict HCV genotypes/subtypes. Validation using public datasets shows 100% genotyping accuracy. To test whether these 250-bp regions were sufficient to identify mixed genotypes, we developed a random primer-based method to sequence HCV plasma samples containing mixtures of two HCV genotypes in different ratios. We were able to determine the genotypes without ambiguity and to quantify the ratio of the abundances of the mixed genotypes in the samples. These data provide a proof-of-concept that this random primed, NGS-based short-read genotyping approach does not need prior information about the viral population and is capable of detecting mixed viral infection.
Affiliation(s)
- Ping Qiu
- Molecular Biomarker and Diagnostics, Merck Research Laboratories, Rahway, New Jersey, United States of America
- Richard Stevens
- Target & Pathway Biology, Merck Research Laboratories, Boston, Massachusetts, United States of America
- Bo Wei
- Molecular Biomarker and Diagnostics, Merck Research Laboratories, Rahway, New Jersey, United States of America
- Fred Lahser
- Infectious Diseases and Clinical Virology, Merck Research Laboratories, Kenilworth, New Jersey, United States of America
- Anita Y. M. Howe
- Infectious Diseases and Clinical Virology, Merck Research Laboratories, Kenilworth, New Jersey, United States of America
- Joel A. Klappenbach
- Target & Pathway Biology, Merck Research Laboratories, Boston, Massachusetts, United States of America
- Matthew J. Marton
- Molecular Biomarker and Diagnostics, Merck Research Laboratories, Rahway, New Jersey, United States of America
|
5
|
Identification of novel small molecules as inhibitors of hepatitis C virus by structure-based virtual screening. Int J Mol Sci 2013; 14:22845-56. [PMID: 24264035 PMCID: PMC3856094 DOI: 10.3390/ijms141122845] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 09/12/2013] [Revised: 11/06/2013] [Accepted: 11/07/2013] [Indexed: 12/30/2022] Open
Abstract
Hepatitis C virus (HCV) NS3/NS4A serine protease is essential for viral replication and is regarded as a promising drug target for developing direct-acting anti-HCV agents. In this study, sixteen novel compounds with cell-based HCV replicon activity ranging from 3.0 to 28.2 μM (IC50) were identified by means of structure-based virtual screening. Compound 5 and compound 11, with IC50 values of 3.0 μM and 5.1 μM, respectively, are the two most potent molecules with low cytotoxicity.
|
6
|
Abstract
INTRODUCTION Boceprevir was the first direct acting agent developed for the treatment of hepatitis C virus infection. Boceprevir functions by targeting NS3 protease, a viral enzyme essential for replication. This peptidomimetic molecule was optimized from a peptide lead to provide a potent, selective and orally bioavailable drug that can be combined with ribavirin and peg interferon to achieve sustained viral response (undetectable HCV RNA levels for 24 weeks after completion of therapy) in patients infected with Genotype 1 of the virus. AREAS COVERED This article provides a review of the pre-clinical and clinical discovery of boceprevir. This review includes the role and function of its molecular target, NS3 protease, as well as the assays used to measure in vitro efficacy, compound optimization and clinical studies to demonstrate safety and efficacy. EXPERT OPINION As the first direct acting anti-HCV agent, boceprevir represents an important advance in therapy of this widespread chronic disease. Yet, while this therapy is a valuable approach, it does have limitations. Studies have suggested that 30% of patients do not achieve sustained viral response and 11% of patients have developed anemia and/or neutropenia. Current drug discovery and development efforts are underway to develop novel therapeutic options that address these issues.
Affiliation(s)
- David P Rotella
- Montclair State University, Department of Chemistry and Biochemistry, 1 Normal Avenue, Montclair, NJ 07043, USA; +1 973 655 7204; +1 973 655 7772
|
7
|
Den Boer JW, Euser SM, Nagelkerke NJ, Schuren F, Jarraud S, Etienne J. Prediction of the origin of French Legionella pneumophila strains using a mixed-genome microarray. BMC Genomics 2013; 14:435. [PMID: 23815549 PMCID: PMC3701591 DOI: 10.1186/1471-2164-14-435] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 02/22/2013] [Accepted: 06/19/2013] [Indexed: 11/17/2022] Open
Abstract
Background Legionella is a water and soil bacterium that can infect humans, causing a pneumonia known as Legionnaires’ disease. The pneumonia is almost exclusively caused by the species L. pneumophila, of which serogroup 1 is responsible for 90% of patients. Within serogroup 1, large differences in prevalence in clinical isolates have been described. A recent study, using a Dutch Legionella strain collection, identified five virulence-associated markers. In our study, we verify whether these five Dutch markers can predict the patient or environmental origin of a French Legionella strain collection. In addition, we identify new potential virulence markers and verify whether these are better predictors. A total of 219 French patient isolates and environmental strains were compared using a mixed-genome microarray. The microarray data were analysed to identify predictive markers, using a Random Forest algorithm combined with a logistic regression model. The sequences of the identified markers were compared with eleven known Legionella genomes, using BlastN and BlastX; the functionality of each of the predictive markers was checked in the literature. Results The five Dutch markers insufficiently predicted the patient or environmental origin of the French Legionella strains. Subsequent analyses identified four predictive markers for the French collection that were used for the logistic regression model. This model showed a negative predictive value of 91%. Three of the French markers differed from the Dutch markers; one showed considerable overlap and was found in one of the Legionella genomes (Lorraine strain). This marker encodes a structural toxin protein, RtxA, described for L. pneumophila as a factor involved in virulence and entry in both human cells and amoebae. Conclusions The combination of a mixed-genome microarray and statistical analysis using a Random Forest algorithm has identified virulence markers in a consistent way. 
The Lorraine strain and related Dutch and French Legionella strains contain a marker that encodes an RtxA protein that is probably involved in the increased prevalence in clinical isolates. The current set of predictive markers is insufficient to justify its use as a reliable test in the public health field in France. Our results suggest that genetic differences in Legionella strains exist between geographically distinct entities. It may be necessary to develop region-specific mixed-genome microarrays that are constantly adapted and updated.
Affiliation(s)
- Jeroen W Den Boer
- Regional Public Health Laboratory Kennemerland, Boerhaavelaan 26, 2035 RC, Haarlem, the Netherlands.
|
8
|
Prevalence of hepatitis C virus genotypes in Mashhad, Northeast Iran. IRANIAN JOURNAL OF PUBLIC HEALTH 2012; 41. [PMID: 23193507 PMCID: PMC3494216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2022]
Abstract
BACKGROUND Hepatitis C is a disease with significant global impact. The distribution of hepatitis C virus (HCV) genotypes in Mashhad (in the northeast of Iran, and the country's biggest city after the capital) is unknown. The purpose of this study was to determine the prevalence of HCV genotypes among HCV-seropositive patients, and to study the relationship between genotypes and the virologic and demographic features of patients in Mashhad. METHODS Three hundred and eighty-two clinical specimens obtained from HCV-infected patients referred to Ghaem Hospital in Mashhad from 2009 to 2010 were selected. HCV genotype was determined by nested PCR amplification of the HCV core gene using genotype-specific primers. RESULTS In total, 299 patients were male (79.9%). The most common HCV genotype was 3a, found in 150 (40%) subjects. Genotype 1a was the next most frequent, with 147 (39.2%) subjects. The frequencies of genotypes 1b, 5 and 2 were 41 (10.9%), 13 (3.4%) and 9 (2.4%), respectively. Mixed genotypes were found in 7 patients: 1a+1b in 4 (1.04%) and 1a+3a in 3 (0.8%). Four percent of the samples had an undetermined genotype. Among the hemophilia patients, 13 (48.1%) had genotype 1a, 3 (11.1%) had 1b and 10 (37%) had 3a. CONCLUSION The dominant HCV genotype among patients living in Mashhad was 3a. This study gives added evidence of the predominant HCV genotypes in Iran.
|
9
|
Sarwar MT, Kausar H, Ijaz B, Ahmad W, Ansar M, Sumrin A, Ashfaq UA, Asad S, Gull S, Shahid I, Hassan S. NS4A protein as a marker of HCV history suggests that different HCV genotypes originally evolved from genotype 1b. Virol J 2011; 8:317. [PMID: 21696641 PMCID: PMC3145594 DOI: 10.1186/1743-422x-8-317] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 05/18/2011] [Accepted: 06/23/2011] [Indexed: 01/06/2023] Open
Abstract
BACKGROUND The 9.6 kb RNA genome of hepatitis C virus (HCV) depends on RNA-dependent RNA polymerase, an error-prone enzyme, for its transcription and replication. A high rate of mutation has been found to be associated with RNA viruses like HCV. Based on genetic variability, HCV has been classified into 6 different major genotypes and 11 different subtypes. However, this classification system does not provide significant information about the origin of the virus, primarily due to the high mutation rate at the nucleotide level. The HCV genome codes for a single polyprotein of about 3011 amino acids, which is processed into structural and non-structural proteins inside the host cell by viral and cellular proteases. RESULTS We have identified a conserved NS4A protein sequence for HCV genotype 3a reported from four different continents, i.e., Europe, America, Australia and Asia. We investigated 346 sequences and compared the amino acid composition of the NS4A protein of different HCV genotypes through multiple sequence alignment, and observed amino acid substitutions C22, V29, V30, V38, Q46 and Q47 in the NS4A protein of genotype 1b. Furthermore, we observed C22 and V30 as more consistent members of the NS4A protein of genotype 1a. Similarly, we observed Q46 and Q47 in genotype 5; V29, V30, Q46 and Q47 in genotype 4; C22, Q46 and Q47 in genotype 6; C22, V38, Q46 and Q47 in genotype 3; and C22 in genotype 2 as more consistent members of the NS4A protein of these genotypes. So the different amino acids that were introduced as substitutions in the NS4A protein of genotype 1 subtype 1b have been retained as consistent members of the NS4A protein of other known genotypes. 
CONCLUSION These observations indicate that the NS4A protein of different HCV genotypes originally evolved from the NS4A protein of genotype 1 subtype 1b, which in turn indicates that HCV genotype 1 subtype 1b established itself earlier in the human population and all other known genotypes evolved later as a result of mutations in HCV genotype 1b. These results were further confirmed through phylogenetic analysis by constructing a phylogenetic tree using the NS4A protein as a phylogenetic marker.
Affiliation(s)
- Muhammad T Sarwar
- Centre of Excellence in Molecular Biology, University of the Punjab, Lahore, Pakistan
|
10
|
Chevaliez S, Bouvier-Alias M, Brillet R, Pawlotsky JM. Hepatitis C virus (HCV) genotype 1 subtype identification in new HCV drug development and future clinical practice. PLoS One 2009; 4:e8209. [PMID: 19997618 PMCID: PMC2785465 DOI: 10.1371/journal.pone.0008209] [Citation(s) in RCA: 98] [Impact Index Per Article: 6.5] [Received: 10/09/2009] [Accepted: 11/09/2009] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND With the development of new specific inhibitors of hepatitis C virus (HCV) enzymes and functions that may yield different antiviral responses and resistance profiles according to the HCV subtype, correct HCV genotype 1 subtype identification is mandatory in clinical trials for stratification and interpretation purposes and will likely become necessary in future clinical practice. The goal of this study was to identify the appropriate molecular tool(s) for accurate HCV genotype 1 subtype determination. METHODOLOGY/PRINCIPAL FINDINGS A large cohort of 500 treatment-naïve patients eligible for HCV drug trials and infected with either subtype 1a or 1b was studied. Methods based on the sole analysis of the 5' non-coding region (5'NCR) by sequence analysis or reverse hybridization failed to correctly identify HCV subtype 1a in 22.8%-29.5% of cases, and HCV subtype 1b in 9.5%-8.7% of cases. Natural polymorphisms at positions 107, 204 and/or 243 were responsible for mis-subtyping with these methods. A real-time PCR method using genotype- and subtype-specific primers and probes located in both the 5'NCR and the NS5B-coding region failed to correctly identify HCV genotype 1 subtype in approximately 10% of cases. The second-generation line probe assay, a reverse hybridization assay that uses probes targeting both the 5'NCR and core-coding region, correctly identified HCV subtypes 1a and 1b in more than 99% of cases. CONCLUSIONS/SIGNIFICANCE In the context of new HCV drug development, HCV genotyping methods based on the exclusive analysis of the 5'NCR should be avoided. The second-generation line probe assay is currently the best commercial assay for determination of HCV genotype 1 subtypes 1a and 1b in clinical trials and practice.
Affiliation(s)
- Stéphane Chevaliez
- French National Reference Center for Viral Hepatitis B, C and delta, Department of Virology, Hôpital Henri Mondor, Université Paris 12, Créteil, France
- INSERM U955, Créteil, France
- Magali Bouvier-Alias
- French National Reference Center for Viral Hepatitis B, C and delta, Department of Virology, Hôpital Henri Mondor, Université Paris 12, Créteil, France
- INSERM U955, Créteil, France
- Jean-Michel Pawlotsky
- French National Reference Center for Viral Hepatitis B, C and delta, Department of Virology, Hôpital Henri Mondor, Université Paris 12, Créteil, France
- INSERM U955, Créteil, France
|