1. Oligonucleotide usage in coronavirus genomes mimics that in exon regions in host genomes. Virol J 2023; 20:39. [PMID: 36859385 PMCID: PMC9976658 DOI: 10.1186/s12985-023-01995-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2022] [Accepted: 02/19/2023] [Indexed: 03/03/2023]
Abstract
BACKGROUND: Viruses use various host factors for their growth, and efficient growth requires efficient use of these factors. Our previous study revealed that the occurrence frequency of oligonucleotides in the influenza virus genome is distinctly different among derived hosts, and the frequency tends to adapt to the host cells in which they grow. We aimed to study the adaptation mechanisms of a zoonotic virus to host cells.
METHODS: Herein, we compared the frequency of oligonucleotides in the genome of alpha- and betacoronavirus with those in the genomes of humans and bats, which are typical hosts of the viruses.
RESULTS: By comparing the oligonucleotide frequency in coronaviruses and their host genomes, we found a statistically tested positive correlation between the frequency of coronaviruses and that of the exon regions of the host from which the virus is derived. To examine the characteristics of early-stage changes in the viral genome, which are assumed to accompany the host change from non-humans to humans, we compared the oligonucleotide frequency between severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) at the beginning of the pandemic and the prevalent variants thereafter, and found changes towards the frequency of the host exon regions.
CONCLUSIONS: In alpha- and betacoronaviruses, the genome oligonucleotide frequency is thought to change in response to the cellular environment in which the virus is replicating, and actually the frequency has approached the frequency in exon regions in the host.
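Illustrative sketch (not taken from the paper): the comparison described in this abstract can be pictured as counting k-mers in a viral sequence and in host exon sequence, then correlating the two frequency vectors. The function names, the choice of Pearson correlation, and the handling of ambiguous bases are assumptions made for this example; the abstract does not state which statistic or preprocessing the authors used.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k):
    """Relative frequency of every A/C/G/T k-mer, in lexicographic order.

    Windows containing other symbols (e.g. N) are ignored (an assumption).
    """
    seq = seq.upper().replace("U", "T")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[m] / total for m in kmers]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sd_x * sd_y)


# Hypothetical toy sequences standing in for a viral genome and host exons.
virus = "ATGGCTAAATTTGGCATTAACGTTAAT"
host_exons = "ATGGCCAAGTTTGGAATCAACGTGAAC"
r = pearson(kmer_frequencies(virus, 2), kmer_frequencies(host_exons, 2))
```

In practice the sequences would be read from FASTA files and k would range over several oligonucleotide lengths; the toy strings above only show the shape of the calculation.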
2. Huang Q, Qiu H, Bible PW, Huang Y, Zheng F, Gu J, Sun J, Hao Y, Liu Y. Early detection of SARS-CoV-2 variants through dynamic co-mutation network surveillance. Front Public Health 2023; 11:1015969. [PMID: 36755900 PMCID: PMC9901361 DOI: 10.3389/fpubh.2023.1015969] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2022] [Accepted: 01/02/2023] [Indexed: 01/25/2023]
Abstract
Background: Precise public health and clinical interventions for the COVID-19 pandemic have spurred a global rush on SARS-CoV-2 variant tracking, but current approaches to variant tracking are challenged by the flood of viral genome sequences leading to a loss of timeliness, accuracy, and reliability. Here, we devised a new co-mutation network framework, aiming to tackle these difficulties in variant surveillance.
Methods: To avoid simultaneous input and modeling of the whole large-scale data, we dynamically investigate the nucleotide covarying pattern of weekly sequences. The community detection algorithm is applied to a co-occurring genomic alteration network constructed from mutation corpora of weekly collected data. Co-mutation communities are identified, extracted, and characterized as variant markers. They contribute to the creation and weekly updates of a community-based variant dictionary tree representing SARS-CoV-2 evolution, where highly similar ones between weeks have been merged to represent the same variants. Emerging communities imply the presence of novel viral variants or new branches of existing variants. This process was benchmarked with worldwide GISAID data and validated using national level data from six COVID-19 hotspot countries.
Results: A total of 235 co-mutation communities were identified after a 120-week investigation of worldwide sequence data, from March 2020 to mid-June 2022. The dictionary tree progressively developed from these communities perfectly recorded the time course of SARS-CoV-2 branching, coinciding with GISAID clades. The time-varying prevalence of these communities in the viral population showed a good match with the emergence and circulation of the variants they represented. All these benchmark results not only exhibited the methodology features but also demonstrated high efficiency in detection of the pandemic variants. When it was applied to regional variant surveillance, our method displayed significantly earlier identification of feature communities of major WHO-named SARS-CoV-2 variants in contrast with Pangolin's monitoring.
Conclusion: An efficient genomic surveillance framework built from weekly co-mutation networks and a dynamic community-based variant dictionary tree enables early detection and continuous investigation of SARS-CoV-2 variants overcoming genomic data flood, aiding in the response to the COVID-19 pandemic.
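Illustrative sketch (not the authors' implementation): the idea of a weekly co-mutation network can be pictured as counting how often pairs of mutations occur in the same sequence, keeping frequent pairs as edges, and grouping connected mutations. The abstract does not name the community detection algorithm, so this example substitutes plain connected components as a much simpler stand-in; the function names, the `min_count` threshold, and the toy mutation labels are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def co_mutation_edges(samples, min_count=2):
    """Count mutation pairs seen together in one sequence; keep frequent pairs."""
    counts = defaultdict(int)
    for mutations in samples:
        for a, b in combinations(sorted(set(mutations)), 2):
            counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


def connected_groups(edges):
    """Group mutations linked by retained edges (stand-in for community detection)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        groups.append(group)
    return groups


# Hypothetical one-week sample: each inner list is the mutation set of one sequence.
week = [
    ["A23403G", "C241T"],
    ["A23403G", "C241T"],
    ["G28881A", "G28882A"],
    ["G28881A", "G28882A", "C241T"],
]
groups = connected_groups(co_mutation_edges(week))
```

A real pipeline would use a modularity-based or similar community detection method on a weighted graph, and would repeat the step for each week before merging similar groups across weeks.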
Affiliation(s)
- Qiang Huang
  - Department of Medical Statistics, School of Public Health, Sun Yat-sen University, Guangzhou, China
- Huining Qiu
  - Guangdong Artificial Intelligence Machine Vision Engineering Technology Research Center, Guangzhou, China
- Paul W. Bible
  - College of Arts and Sciences, Marian University, Indianapolis, IN, United States
- Yong Huang
  - Institute of Public Health, Guangzhou Medical University & Guangzhou Center for Disease Control and Prevention, Guangzhou, China
- Fangfang Zheng
  - School of Traditional Chinese Medicine Healthcare, Guangdong Food and Drug Vocational College, Guangzhou, China
- Jing Gu
  - Department of Medical Statistics, School of Public Health, Sun Yat-sen University, Guangzhou, China
- Jian Sun
  - Department of Clinical Research, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
  - Corresponding author.
- Yuantao Hao
  - Peking University Center for Public Health and Epidemic Preparedness & Response, Beijing, China
  - Corresponding author.
- Yu Liu
  - Department of Medical Statistics, School of Public Health, Sun Yat-sen University, Guangzhou, China
  - Corresponding author.
3. Al Khalaf R, Bernasconi A, Pinoli P, Ceri S. Analysis of co-occurring and mutually exclusive amino acid changes and detection of convergent and divergent evolution events in SARS-CoV-2. Comput Struct Biotechnol J 2022; 20:4238-4250. [PMID: 35945925 PMCID: PMC9352683 DOI: 10.1016/j.csbj.2022.07.051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2022] [Revised: 07/29/2022] [Accepted: 07/29/2022] [Indexed: 11/28/2022]
Affiliation(s)
- Ruba Al Khalaf
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
- Anna Bernasconi
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
  - Corresponding author.
- Pietro Pinoli
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
- Stefano Ceri
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
4. Iwasaki Y, Ikemura T, Wada K, Wada Y, Abe T. Comparative genomic analysis of the human genome and six bat genomes using unsupervised machine learning: Mb-level CpG and TFBS islands. BMC Genomics 2022; 23:497. [PMID: 35804296 PMCID: PMC9264310 DOI: 10.1186/s12864-022-08664-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2022] [Accepted: 05/31/2022] [Indexed: 11/25/2022]
Abstract
Background: Emerging infectious disease-causing RNA viruses, such as the SARS-CoV-2 and Ebola viruses, are thought to rely on bats as natural reservoir hosts. Since these zoonotic viruses pose a great threat to humans, it is important to characterize the bat genome from multiple perspectives. Unsupervised machine learning methods for extracting novel information from big sequence data without prior knowledge or particular models are highly desirable for obtaining unexpected insights. We previously established a batch-learning self-organizing map (BLSOM) of the oligonucleotide composition that reveals novel genome characteristics from big sequence data.
Results: In this study, using the oligonucleotide BLSOM, we conducted a comparative genomic study of humans and six bat species. BLSOM is an explainable-type machine learning algorithm that reveals the diagnostic oligonucleotides contributing to sequence clustering (self-organization). When unsupervised machine learning reveals unexpected and/or characteristic features, these features can be studied in more detail via the much simpler and more direct standard distribution map method. Based on this combined strategy, we identified the Mb-level enrichment of CG dinucleotide (Mb-level CpG islands) around the termini of bat long-scaffold sequences. In addition, a class of CG-containing oligonucleotides were enriched in the centromeric and pericentromeric regions of human chromosomes. Oligonucleotides longer than tetranucleotides often represent binding motifs for a wide variety of proteins (e.g., transcription factor binding sequences (TFBSs)). By analyzing the penta- and hexanucleotide composition, we observed the evident enrichment of a wide range of hexanucleotide TFBSs in centromeric and pericentromeric heterochromatin regions on all human chromosomes.
Conclusion: Function of transcription factors (TFs) beyond their known regulation of gene expression (e.g., TF-mediated looping interactions between two different genomic regions) has received wide attention. The Mb-level TFBS and CpG islands are thought to be involved in the large-scale nuclear organization, such as centromere and telomere clustering. TFBSs, which are enriched in centromeric and pericentromeric heterochromatin regions, are thought to play an important role in the formation of nuclear 3D structures. Our machine learning-based analysis will help us to understand the differential features of nuclear 3D structures in the human and bat genomes.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-022-08664-9.
Affiliation(s)
- Yuki Iwasaki
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Tamura-cho 1266, Nagahama-shi, Shiga-ken, 526-0829, Japan
- Toshimichi Ikemura
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Tamura-cho 1266, Nagahama-shi, Shiga-ken, 526-0829, Japan
- Kennosuke Wada
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Tamura-cho 1266, Nagahama-shi, Shiga-ken, 526-0829, Japan
- Yoshiko Wada
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Tamura-cho 1266, Nagahama-shi, Shiga-ken, 526-0829, Japan
- Takashi Abe
  - Smart Information Systems, Faculty of Engineering, Niigata University, Niigata-ken, 950-2181, Japan
5. Huang Q, Zhang Q, Bible PW, Liang Q, Zheng F, Wang Y, Hao Y, Liu Y. A New Way to Trace SARS-CoV-2 Variants Through Weighted Network Analysis of Frequency Trajectories of Mutations. Front Microbiol 2022; 13:859241. [PMID: 35369526 PMCID: PMC8966897 DOI: 10.3389/fmicb.2022.859241] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/21/2022] [Accepted: 02/18/2022] [Indexed: 11/13/2022]
Abstract
Early detection of SARS-CoV-2 variants enables timely tracking of clinically important strains in order to inform the public health response. Current subtype-based variant surveillance depending on prior subtype assignment according to lag features and their continuous risk assessment may delay this process. We proposed a weighted network framework to model the frequency trajectories of mutations (FTMs) for SARS-CoV-2 variant tracing, without requiring prior subtype assignment. This framework modularizes the FTMs and conglomerates synchronous FTMs together to represent the variants. It also generates module clusters to unveil the epidemic stages and their contemporaneous variants. Eventually, the module-based variants are assessed by phylogenetic tree through sub-sampling to facilitate communication and control of the epidemic. This process was benchmarked using worldwide GISAID data, which not only demonstrated all the methodology features but also showed the module-based variant identification had highly specific and sensitive mapping with the global phylogenetic tree. When applying this process to regional data like India and South Africa for SARS-CoV-2 variant surveillance, the approach clearly elucidated the national dispersal history of the viral variants and their co-circulation pattern, and provided much earlier warning of Beta (B.1.351), Delta (B.1.617.2), and Omicron (B.1.1.529). In summary, our work showed that the weighted network modeling of FTMs enables us to rapidly and easily track down SARS-CoV-2 variants overcoming prior viral subtyping with lag features, accelerating the understanding and surveillance of COVID-19.
Affiliation(s)
- Qiang Huang
  - Department of Medical Statistics and Epidemiology, School of Public Health, Sun Yat-sen University, Guangzhou, China
- Qiang Zhang
  - College of Computer, Chengdu University, Chengdu, China
- Paul W Bible
  - College of Arts and Sciences, Marian University, Indianapolis, IN, United States
- Qiaoxing Liang
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Fangfang Zheng
  - School of Traditional Chinese Medicine Healthcare, Guangdong Food and Drug Vocational College, Guangzhou, China
- Ying Wang
  - Department of Medical Statistics and Epidemiology, School of Public Health, Sun Yat-sen University, Guangzhou, China
- Yuantao Hao
  - Department of Medical Statistics and Epidemiology, School of Public Health, Sun Yat-sen University, Guangzhou, China
- Yu Liu
  - Department of Medical Statistics and Epidemiology, School of Public Health, Sun Yat-sen University, Guangzhou, China
6. Bernasconi A, Mari L, Casagrandi R, Ceri S. Data-driven analysis of amino acid change dynamics timely reveals SARS-CoV-2 variant emergence. Sci Rep 2021; 11:21068. [PMID: 34702903 PMCID: PMC8548498 DOI: 10.1038/s41598-021-00496-z] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 08/04/2021] [Accepted: 10/12/2021] [Indexed: 02/07/2023]
Abstract
Since its emergence in late 2019, the diffusion of SARS-CoV-2 is associated with the evolution of its viral genome. The co-occurrence of specific amino acid changes, collectively named ‘virus variant’, requires scrutiny (as variants may hugely impact the agent’s transmission, pathogenesis, or antigenicity); variant evolution is studied using phylogenetics. Yet, never has this problem been tackled by digging into data with ad hoc analysis techniques. Here we show that the emergence of variants can in fact be traced through data-driven methods, further capitalizing on the value of large collections of SARS-CoV-2 sequences. For all countries with sufficient data, we compute weekly counts of amino acid changes, unveil time-varying clusters of changes with similar—rapidly growing—dynamics, and then follow their evolution. Our method succeeds in timely associating clusters to variants of interest/concern, provided their change composition is well characterized. This allows us to detect variants’ emergence, rise, peak, and eventual decline under competitive pressure of another variant. Our early warning system, exclusively relying on deposited sequences, shows the power of big data in this context, and concurs to calling for the wide spreading of public SARS-CoV-2 genome sequencing for improved surveillance and control of the COVID-19 pandemic.
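Illustrative sketch (not the authors' code): the weekly counting step mentioned in this abstract can be pictured as bucketing sequences by ISO week and tallying the amino acid changes each one carries, then flagging changes whose counts rise sharply from one week to the next. The record layout, the function names, and the `min_ratio` growth rule are assumptions for this example; the paper's clustering of change dynamics is not reproduced here.

```python
from collections import Counter, defaultdict
from datetime import date


def weekly_change_counts(records):
    """Tally amino acid changes per (ISO year, ISO week).

    records: iterable of (collection_date, iterable_of_changes).
    """
    weekly = defaultdict(Counter)
    for day, changes in records:
        iso = day.isocalendar()
        weekly[(iso[0], iso[1])].update(set(changes))
    return dict(weekly)


def growing_changes(weekly, week_a, week_b, min_ratio=2.0):
    """Changes whose count in week_b is at least min_ratio times that in week_a."""
    before = weekly.get(week_a, Counter())
    after = weekly.get(week_b, Counter())
    return sorted(
        change for change, n in after.items()
        if n >= min_ratio * max(before[change], 1)
    )


# Hypothetical records: one entry per deposited sequence.
records = [
    (date(2021, 1, 4), ["S:N501Y"]),
    (date(2021, 1, 5), ["S:D614G"]),
    (date(2021, 1, 11), ["S:N501Y", "S:D614G"]),
    (date(2021, 1, 12), ["S:N501Y"]),
    (date(2021, 1, 13), ["S:N501Y"]),
]
weekly = weekly_change_counts(records)
rising = growing_changes(weekly, (2021, 1), (2021, 2))
```

Raw counts depend on how many sequences were deposited each week, so a fuller version would normalize by weekly sequence totals before comparing weeks.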
Affiliation(s)
- Anna Bernasconi
  - Department of Electronics, Information, and Bioengineering, Politecnico di Milano, 20133, Milan, Italy
- Lorenzo Mari
  - Department of Electronics, Information, and Bioengineering, Politecnico di Milano, 20133, Milan, Italy
- Renato Casagrandi
  - Department of Electronics, Information, and Bioengineering, Politecnico di Milano, 20133, Milan, Italy
- Stefano Ceri
  - Department of Electronics, Information, and Bioengineering, Politecnico di Milano, 20133, Milan, Italy
7. Ikemura T, Iwasaki Y, Wada K, Wada Y, Abe T. AI for the collective analysis of a massive number of genome sequences: various examples from the small genome of pandemic SARS-CoV-2 to the human genome. Genes Genet Syst 2021; 96:165-176. [PMID: 34565757 DOI: 10.1266/ggs.21-00025] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/23/2022]
Abstract
In genetics and related fields, huge amounts of data, such as genome sequences, are accumulating, and the use of artificial intelligence (AI) suitable for big data analysis has become increasingly important. Unsupervised AI that can reveal novel knowledge from big data without prior knowledge or particular models is highly desirable for analyses of genome sequences, particularly for obtaining unexpected insights. We have developed a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions that can reveal various novel genome characteristics. Here, we explain the data mining by the BLSOM: an unsupervised AI. As a specific target, we first selected SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) because a large number of viral genome sequences have been accumulated via worldwide efforts. We analyzed more than 0.6 million sequences collected primarily in the first year of the pandemic. BLSOMs for short oligonucleotides (e.g., 4-6-mers) allowed separation into known clades, but longer oligonucleotides further increased the separation ability and revealed subgrouping within known clades. In the case of 15-mers, there is mostly one copy in the genome; thus, 15-mers that appeared after the epidemic started could be connected to mutations, and the BLSOM for 15-mers revealed the mutations that contributed to separation into known clades and their subgroups. After introducing the detailed methodological strategies, we explain BLSOMs for various topics, such as the tetranucleotide BLSOM for over 5 million 5-kb fragment sequences derived from almost all microorganisms currently available and its use in metagenome studies. We also explain BLSOMs for various eukaryotes, including fishes, frogs and Drosophila species, and found a high separation ability among closely related species. When analyzing the human genome, we found enrichments in transcription factor-binding sequences in centromeric and pericentromeric heterochromatin regions. The tDNAs (tRNA genes) could be separated according to their corresponding amino acid.
Affiliation(s)
- Yuki Iwasaki
  - Faculty of Bioscience, Nagahama Institute of Bio-Science and Technology
- Kennosuke Wada
  - Faculty of Bioscience, Nagahama Institute of Bio-Science and Technology
- Yoshiko Wada
  - Faculty of Bioscience, Nagahama Institute of Bio-Science and Technology
- Takashi Abe
  - Department of Information Engineering, Faculty of Engineering, Niigata University
8. Iwasaki Y, Abe T, Ikemura T. Human cell-dependent, directional, time-dependent changes in the mono- and oligonucleotide compositions of SARS-CoV-2 genomes. BMC Microbiol 2021; 21:89. [PMID: 33757449 PMCID: PMC7987243 DOI: 10.1186/s12866-021-02158-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/26/2020] [Accepted: 03/15/2021] [Indexed: 12/24/2022]
Abstract
Background: When a virus that has grown in a nonhuman host starts an epidemic in the human population, human cells may not provide growth conditions ideal for the virus. Therefore, the invasion of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), which is usually prevalent in the bat population, into the human population is thought to have necessitated changes in the viral genome for efficient growth in the new environment. In the present study, to understand host-dependent changes in coronavirus genomes, we focused on the mono- and oligonucleotide compositions of SARS-CoV-2 genomes and investigated how these compositions changed time-dependently in the human cellular environment. We also compared the oligonucleotide compositions of SARS-CoV-2 and other coronaviruses prevalent in humans or bats to investigate the causes of changes in the host environment.
Results: Time-series analyses of changes in the nucleotide compositions of SARS-CoV-2 genomes revealed a group of mono- and oligonucleotides whose compositions changed in a common direction for all clades, even though viruses belonging to different clades should evolve independently. Interestingly, the compositions of these oligonucleotides changed towards those of coronaviruses that have been prevalent in humans for a long period and away from those of bat coronaviruses.
Conclusions: Clade-independent, time-dependent changes are thought to have biological significance and should relate to viral adaptation to a new host environment, providing important clues for understanding viral host adaptation mechanisms.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12866-021-02158-6.
Affiliation(s)
- Yuki Iwasaki
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Shiga, Japan
- Takashi Abe
  - Graduate School of Science and Technology, Niigata University, Niigata, Japan
- Toshimichi Ikemura
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Shiga, Japan