1
Liu G, Chen X, Luan Y, Li D. VirusPredictor: XGBoost-based software to predict virus-related sequences in human data. Bioinformatics 2024; 40:btae192. [PMID: 38597887 PMCID: PMC11052659 DOI: 10.1093/bioinformatics/btae192] [Received: 10/12/2023] [Revised: 02/29/2024] [Accepted: 04/08/2024] [Indexed: 04/11/2024]
Abstract
MOTIVATION: Discovering disease causative pathogens, particularly viruses without reference genomes, poses a technical challenge as they are often unidentifiable through sequence alignment. Machine learning prediction of patient high-throughput sequences unmappable to human and pathogen genomes may reveal sequences originating from uncharacterized viruses. Currently, there is a lack of software specifically designed for accurately predicting such viral sequences in human data.
RESULTS: We developed a fast XGBoost method and software VirusPredictor leveraging an in-house viral genome database. Our two-step XGBoost models first classify each query sequence into one of three groups: infectious virus, endogenous retrovirus (ERV) or non-ERV human. The prediction accuracies increased as the sequences became longer, i.e. 0.76, 0.93, and 0.98 for 150-350 (Illumina short reads), 850-950 (Sanger sequencing data), and 2000-5000 bp sequences, respectively. Then, sequences predicted to be from infectious viruses are further classified into one of six virus taxonomic subgroups, and the accuracies increased from 0.92 to >0.98 when query sequences increased from 150-350 to >850 bp. The results suggest that Illumina short reads should be de novo assembled into contigs (e.g. ∼1000 bp or longer) before prediction whenever possible. We applied VirusPredictor to multiple real genomic and metagenomic datasets and obtained high accuracies. VirusPredictor, a user-friendly open-source Python software, is useful for predicting the origins of patients' unmappable sequences. This study is the first to classify ERVs in infectious viral sequence prediction. This is also the first study combining virus sub-group predictions.
AVAILABILITY AND IMPLEMENTATION: www.dllab.org/software/VirusPredictor.html.
Affiliation(s)
- Guangchen Liu
  - Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, Vermont 05405, United States
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
  - School of Mathematics and Statistics, Ludong University, Yantai, Shandong 264025, China
- Xun Chen
  - Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, Vermont 05405, United States
- Yihui Luan
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
- Dawei Li
  - Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, Vermont 05405, United States
  - Department of Immunology and Molecular Microbiology, Texas Tech University Health Sciences Center, Lubbock, Texas 79430, United States
  - ICanCME Research Network, Sainte-Justine University Hospital Research Center, Montreal, Quebec H3T 1C5, Canada
2
Chitcharoen S, Phokaew C, Mauleekoonphairoj J, Khongphatthanayothin A, Sutjaporn B, Wandee P, Poovorawan Y, Nademanee K, Payungporn S. Metagenomic analysis of viral genes integrated in whole genome sequencing data of Thai patients with Brugada syndrome. Genomics Inform 2022; 20:e44. [PMID: 36617651 PMCID: PMC9847385 DOI: 10.5808/gi.22047] [Received: 07/26/2022] [Accepted: 09/25/2022] [Indexed: 12/31/2022]
Abstract
Brugada syndrome (BS) is an autosomal dominant inherited cardiac arrhythmia disorder associated with sudden death in young adults. Thailand has the highest prevalence of BS worldwide, and over 60% of patients with BS still have unclear disease etiology. Here, we performed a new viral metagenome analysis pipeline called VIRIN and validated it with whole genome sequencing (WGS) data of HeLa cell lines and hepatocellular carcinoma. Then the VIRIN pipeline was applied to identify viral integration positions from unmapped WGS data of Thai males, including 100 BS patients (case) and 100 controls. Even though the sample preparation had no viral enrichment step, we could identify several virus genes with our analysis pipeline. The predominance of human endogenous retrovirus K (HERV-K) viruses was found in both cases and controls by blastn and blastx analysis. This study is the first report on the full-length HERV-K assembled genomes in the Thai population. Furthermore, the HERV-K integration breakpoint positions were validated and compared between the case and control datasets. Interestingly, Brugada cases contained HERV-K integration breakpoints at promoters five times more often than controls. Overall, the highlight of this study is the BS-specific HERV-K breakpoint positions that were found at the gene coding region "NBPF11" (n = 9), "NBPF12" (n = 8) and long non-coding RNA (lncRNA) "PCAT14" (n = 4) region. The genes and the lncRNA have been reported to be associated with congenital heart and arterial diseases. These findings provide another aspect of the BS etiology associated with viral genome integrations within the human genome.
Affiliation(s)
- Suwalak Chitcharoen
  - Program in Bioinformatics and Computational Biology, Graduate School, Chulalongkorn University, Bangkok 10330, Thailand
  - Research Unit of Systems Microbiology, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Bangkok 10330, Thailand
- Chureerat Phokaew
  - Center of Excellence for Medical Genomics, Medical Genomics Cluster, Faculty of Medicine, Chulalongkorn University, Bangkok 10330, Thailand
  - Excellence Center for Genomics and Precision Medicine, King Chulalongkorn Memorial Hospital, The Thai Red Cross Society, Bangkok 10330, Thailand
  - Research Affairs, Faculty of Medicine, Chulalongkorn University, Bangkok 10330, Thailand
  - Corresponding author
- John Mauleekoonphairoj
  - Department of Medicine, Faculty of Medicine, Center of Excellence in Arrhythmia Research Chulalongkorn University, Chulalongkorn University, Bangkok 10330, Thailand
  - Interdisciplinary Program of Biomedical Sciences, Graduate School, Chulalongkorn University, Bangkok 10330, Thailand
- Apichai Khongphatthanayothin
  - Department of Medicine, Faculty of Medicine, Center of Excellence in Arrhythmia Research Chulalongkorn University, Chulalongkorn University, Bangkok 10330, Thailand
  - Division of Cardiology, Department of Pediatrics, Faculty of Medicine, Chulalongkorn University, Bangkok 10330, Thailand
  - Bangkok General Hospital, Bangkok 10330, Thailand
- Boosamas Sutjaporn
  - Excellence Center for Genomics and Precision Medicine, King Chulalongkorn Memorial Hospital, The Thai Red Cross Society, Bangkok 10330, Thailand
  - Department of Medicine, Faculty of Medicine, Center of Excellence in Arrhythmia Research Chulalongkorn University, Chulalongkorn University, Bangkok 10330, Thailand
- Pharawee Wandee
  - Department of Medicine, Faculty of Medicine, Center of Excellence in Arrhythmia Research Chulalongkorn University, Chulalongkorn University, Bangkok 10330, Thailand
- Yong Poovorawan
  - Department of Pediatrics, Faculty of Medicine, Chulalongkorn University, Bangkok 10330, Thailand
- Koonlawee Nademanee
  - Department of Medicine, Faculty of Medicine, Center of Excellence in Arrhythmia Research Chulalongkorn University, Chulalongkorn University, Bangkok 10330, Thailand
  - Department of Medicine, Faculty of Medicine, Chulalongkorn University, Bangkok 10330, Thailand
  - Pacific Rim Electrophysiology Research Institute, Bumrungrad Hospital, Bangkok 10110, Thailand
- Sunchai Payungporn
  - Research Unit of Systems Microbiology, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Bangkok 10330, Thailand
  - Corresponding author
4
Mathkar PP, Chen X, Sulovari A, Li D. Characterization of Hepatitis B Virus Integrations Identified in Hepatocellular Carcinoma Genomes. Viruses 2021; 13:v13020245. [PMID: 33557409 PMCID: PMC7915589 DOI: 10.3390/v13020245] [Received: 12/31/2020] [Revised: 01/31/2021] [Accepted: 02/02/2021] [Indexed: 12/19/2022]
Abstract
Hepatocellular carcinoma (HCC) is a leading cause of cancer-related mortality. Almost half of HCC cases are associated with hepatitis B virus (HBV) infections, which often lead to HBV sequence integrations in the human genome. Accurate identification of HBV integration sites at single-nucleotide resolution is critical for developing a better understanding of the cancer genome landscape and of the disease itself. Here, we performed further analyses and characterization of HBV integrations identified by our recently reported VIcaller platform in recurrent or known HCC genes (such as TERT, MLL4, and CCNE1) as well as non-recurrent cancer-related genes (such as CSMD2, NKD2, and RHOU). Our pathway enrichment analysis revealed multiple pathways involving the alcohol dehydrogenase 4 gene, such as the metabolism pathways of retinol, tyrosine, and fatty acid. Further analysis of the HBV integration sites revealed distinct patterns involving the integration upper breakpoints, integrated genome lengths, and integration allele fractions between tumor and normal tissues. Our analysis also implies that the VIcaller method has diagnostic potential through discovering novel clonal integrations in cancer-related genes. In conclusion, although VIcaller is a hypothesis-free virome-wide approach, it can still be applied to accurately identify genome-wide integration events of a specific candidate virus and their integration allele fractions.
Affiliation(s)
- Pranav P. Mathkar
  - Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, VT 05405, USA; (P.P.M.); (A.S.)
- Xun Chen
  - Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, VT 05405, USA; (P.P.M.); (A.S.)
  - Institute for the Advanced Study of Human Biology, Kyoto University, Kyoto 606-8501, Japan
  - Correspondence: (X.C.); (D.L.)
- Arvis Sulovari
  - Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, VT 05405, USA; (P.P.M.); (A.S.)
  - Cajal Neuroscience Inc., Seattle, WA 98102, USA
- Dawei Li
  - Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, VT 05405, USA; (P.P.M.); (A.S.)
  - Department of Biomedical Science, Charles E. Schmidt College of Medicine, Florida Atlantic University, Boca Raton, FL 33431, USA
  - Correspondence: (X.C.); (D.L.)