1
|
Ikemura T, Iwasaki Y, Wada K, Wada Y, Abe T. AI for the collective analysis of a massive number of genome sequences: various examples from the small genome of pandemic SARS-CoV-2 to the human genome. Genes Genet Syst 2021; 96:165-176. [PMID: 34565757 DOI: 10.1266/ggs.21-00025] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/23/2022]
Abstract
In genetics and related fields, huge amounts of data, such as genome sequences, are accumulating, and the use of artificial intelligence (AI) suitable for big data analysis has become increasingly important. Unsupervised AI that can reveal novel knowledge from big data without prior knowledge or particular models is highly desirable for analyses of genome sequences, particularly for obtaining unexpected insights. We have developed a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions that can reveal various novel genome characteristics. Here, we explain the data mining by the BLSOM: an unsupervised AI. As a specific target, we first selected SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) because a large number of viral genome sequences have been accumulated via worldwide efforts. We analyzed more than 0.6 million sequences collected primarily in the first year of the pandemic. BLSOMs for short oligonucleotides (e.g., 4-6-mers) allowed separation into known clades, but longer oligonucleotides further increased the separation ability and revealed subgrouping within known clades. In the case of 15-mers, there is mostly one copy in the genome; thus, 15-mers that appeared after the epidemic started could be connected to mutations, and the BLSOM for 15-mers revealed the mutations that contributed to separation into known clades and their subgroups. After introducing the detailed methodological strategies, we explain BLSOMs for various topics, such as the tetranucleotide BLSOM for over 5 million 5-kb fragment sequences derived from almost all microorganisms currently available and its use in metagenome studies. We also explain BLSOMs for various eukaryotes, including fishes, frogs and Drosophila species, and found a high separation ability among closely related species. When analyzing the human genome, we found enrichments in transcription factor-binding sequences in centromeric and pericentromeric heterochromatin regions. The tDNAs (tRNA genes) could be separated according to their corresponding amino acid.
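The link between newly appearing 15-mers and mutations described in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the function names `kmers` and `novel_kmers` are hypothetical, and the comparison against a single reference sequence is a simplifying assumption. Because a 15-mer usually occurs only once in a small viral genome, any 15-mer present in a sample but absent from the reference must overlap a changed position.

```python
def kmers(seq, k=15):
    """Return the set of k-mers in seq, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    valid = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    }


def novel_kmers(reference, sample, k=15):
    """k-mers found in the sample but not in the reference (hypothetical helper).

    A single substitution creates at most k novel k-mers, one for each
    window that covers the changed position.
    """
    return kmers(sample, k) - kmers(reference, k)
```

Calling `novel_kmers(reference, sample)` on two aligned or unaligned sequences returns the set of 15-mers that would count as "new" under these assumptions; identical sequences give an empty set.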
Affiliation(s)
- Yuki Iwasaki
- Faculty of Bioscience, Nagahama Institute of Bio-Science and Technology
- Kennosuke Wada
- Faculty of Bioscience, Nagahama Institute of Bio-Science and Technology
- Yoshiko Wada
- Faculty of Bioscience, Nagahama Institute of Bio-Science and Technology
- Takashi Abe
- Department of Information Engineering, Faculty of Engineering, Niigata University
|
2
|
Qiu Y, Abe T, Nakao R, Satoh K, Sugimoto C. Viral population analysis of the taiga tick, Ixodes persulcatus, by using Batch Learning Self-Organizing Maps and BLAST search. J Vet Med Sci 2019; 81:401-410. [PMID: 30674747 PMCID: PMC6451905 DOI: 10.1292/jvms.18-0483] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 11/22/2022]
Abstract
Ticks transmit a wide range of viral, bacterial, and protozoal pathogens, which are often zoonotic. Several novel tick-borne viral pathogens have been reported during the past few years. The aim of this study was to investigate the diversity of tick viral populations, which may contain as-yet unidentified viruses, using a combination of high-throughput pyrosequencing and the Batch Learning Self-Organizing Map (BLSOM) program, which enables phylogenetic estimation based on the similarity of oligonucleotide frequencies. DNA/cDNA prepared from virus-enriched fractions obtained from Ixodes persulcatus ticks was pyrosequenced. After de novo assembly, contigs were cataloged by the BLSOM program. In total, 41 different viral families and orders, including those previously associated with human and animal diseases such as Bunyavirales, Flaviviridae, and Reoviridae, were detected. Therefore, our strategy is applicable to viral population analysis of other arthropods of medical and veterinary importance, such as mosquitoes and lice. The results contribute to the prediction of emerging tick-borne viral diseases. A sufficient understanding of tick viral populations will also help to analyze and understand tick biology, including vector competency and interactions with other pathogens.
Affiliation(s)
- Yongjin Qiu
- Division of Collaboration and Education, Hokkaido University Research Center for Zoonosis Control, Kita 20 Nishi 10, Kita-ku, Sapporo, Hokkaido 001-0020 Japan; Hokudai Center for Zoonosis Control in Zambia, the University of Zambia, Lusaka, 10101 Zambia
- Takashi Abe
- Graduate School of Science and Technology, Niigata University, Ikarashi 2 no-cho 8050, Nishi-ku, Niigata, Niigata 950-2181 Japan
- Ryo Nakao
- Unit of Risk Analysis and Management, Hokkaido University Research Center for Zoonosis Control, Kita 20 Nishi 10, Kita-ku, Sapporo, Hokkaido 001-0020 Japan; Laboratory of Parasitology, Faculty of Veterinary Medicine, Graduate School of Infectious Diseases, Hokkaido University, Kita 18 Nishi 9, Kita-ku, Sapporo, Hokkaido 060-0818 Japan
- Kenro Satoh
- Graduate School of Science and Technology, Niigata University, Ikarashi 2 no-cho 8050, Nishi-ku, Niigata, Niigata 950-2181 Japan
- Chihiro Sugimoto
- Division of Collaboration and Education, Hokkaido University Research Center for Zoonosis Control, Kita 20 Nishi 10, Kita-ku, Sapporo, Hokkaido 001-0020 Japan; Global Institution for Collaborative Research and Education (GI-CoRE), Hokkaido University, Kita 20 Nishi 10, Kita-ku, Sapporo, Hokkaido 001-0020 Japan
|
3
|
Wada Y, Iwasaki Y, Abe T, Wada K, Tooyama I, Ikemura T. CG-containing oligonucleotides and transcription factor-binding motifs are enriched in human pericentric regions. Genes Genet Syst 2016; 90:43-53. [PMID: 26119665 DOI: 10.1266/ggs.90.43] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/23/2022]
Abstract
Unsupervised data mining capable of extracting a wide range of information from big sequence data without prior knowledge or particular models is highly desirable in an era of big data accumulation for research on genes, genomes and genetic systems. By handling oligonucleotide compositions in genomic sequences as high-dimensional data, we have previously modified the conventional SOM (self-organizing map) for genome informatics and established BLSOM for oligonucleotide composition, which can analyze more than ten million sequences simultaneously and is thus suitable for big data analyses. Oligonucleotides often represent motif sequences responsible for sequence-specific binding of proteins such as transcription factors. The distribution of such functionally important oligonucleotides is probably biased in genomic sequences, and may differ among genomic regions. When constructing BLSOMs to analyze pentanucleotide composition in 50-kb sequences derived from the human genome in this study, we found that BLSOMs did not classify human sequences according to chromosome but revealed several specific zones, which are enriched for a class of CG-containing pentanucleotides; these zones are composed primarily of sequences derived from pericentric regions. The biological significance of enrichment of these pentanucleotides in pericentric regions is discussed in connection with cell type- and stage-dependent formation of the condensed heterochromatin in the chromocenter, which is formed through association of pericentric regions of multiple chromosomes.
Affiliation(s)
- Yoshiko Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology
|
4
|
A novel bioinformatics method for efficient knowledge discovery by BLSOM from big genomic sequence data. Biomed Res Int 2014; 2014:765648. [PMID: 24804244 PMCID: PMC3996302 DOI: 10.1155/2014/765648] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 10/24/2013] [Accepted: 02/14/2014] [Indexed: 11/17/2022]
Abstract
With the remarkable increase in genomic sequence data from a wide range of species, novel tools are needed for comprehensive analyses of this big sequence data. The Self-Organizing Map (SOM) is an effective tool for clustering and visualizing high-dimensional data such as oligonucleotide composition on one map. By modifying the conventional SOM, we have previously developed Batch-Learning SOM (BLSOM), which allows classification of sequence fragments according to species, solely depending on the oligonucleotide composition. In the present study, we introduce the oligonucleotide BLSOM used for characterization of vertebrate genome sequences. We first analyzed pentanucleotide compositions in 100 kb sequences derived from a wide range of vertebrate genomes and then the compositions in the human and mouse genomes in order to investigate an efficient method for detecting differences between the closely related genomes. BLSOM can recognize the species-specific key combination of oligonucleotide frequencies in each genome, which is called a "genome signature," and the regions specifically enriched in transcription-factor-binding sequences. Because its classification and visualization power is very high, BLSOM is an efficient and powerful tool for extracting a wide range of information from massive amounts of genomic sequences (i.e., big sequence data).
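The input to such a map is an oligonucleotide composition vector for each sequence fragment. A minimal sketch of how that vector might be computed follows; the function names `composition` and `windows` are hypothetical, the lexicographic ordering of k-mers is an arbitrary choice, and dropping a trailing partial window is an assumption rather than something stated in the abstract.

```python
from itertools import product


def composition(seq, k=5):
    """Relative frequency of every k-mer over ACGT, in lexicographic order.

    Returns a list of 4**k floats. Windows containing other symbols
    (for example N) are ignored.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def windows(seq, size):
    """Yield non-overlapping fragments of the given size, dropping any remainder."""
    for start in range(0, len(seq) - size + 1, size):
        yield seq[start:start + size]
```

For the setting in the abstract, each genome would be cut with `windows(genome, 100_000)` and each fragment turned into a 1024-dimensional vector with `composition(fragment, k=5)`.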
|
5
|
Iwasaki Y, Abe T, Wada K, Wada Y, Ikemura T. A Novel Bioinformatics Strategy to Analyze Microbial Big Sequence Data for Efficient Knowledge Discovery: Batch-Learning Self-Organizing Map (BLSOM). Microorganisms 2013; 1:137-157. [PMID: 27694768 PMCID: PMC5029494 DOI: 10.3390/microorganisms1010137] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 09/26/2013] [Revised: 11/05/2013] [Accepted: 11/08/2013] [Indexed: 11/24/2022]
Abstract
With the remarkable increase of genomic sequence data of microorganisms, novel tools are needed for comprehensive analyses of the big sequence data available. The self-organizing map (SOM) is an effective tool for clustering and visualizing high-dimensional data, such as oligonucleotide composition on one map. By modifying the conventional SOM, we developed batch-learning SOM (BLSOM), which allowed classification of sequence fragments (e.g., 1 kb) according to phylotypes, solely depending on oligonucleotide composition. Metagenomics studies of uncultivable microorganisms in clinical and environmental samples should allow extensive surveys of genes important in life sciences. BLSOM is most suitable for phylogenetic assignment of metagenomic sequences, because fragmental sequences can be clustered according to phylotypes, solely depending on oligonucleotide composition. We first constructed oligonucleotide BLSOMs for all available sequences from genomes of known species, and by mapping metagenomic sequences on these large-scale BLSOMs, we can predict phylotypes of individual metagenomic sequences, revealing a microbial community structure of uncultured microorganisms, including viruses. BLSOM has shown that influenza viruses isolated from humans and birds clearly differ in oligonucleotide composition. Based on this host-dependent oligonucleotide composition, we have proposed strategies for predicting directional changes of virus sequences and for surveilling potentially hazardous strains when introduced into humans from non-human sources.
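To make the idea of batch learning concrete, here is a generic batch self-organizing map in plain Python. It is a sketch under stated assumptions and not the BLSOM program itself: the linear initialization between the per-dimension minimum and maximum, the Gaussian neighborhood, the linear decay of its width, and the names `best_match` and `batch_som` are all illustrative choices. The point it shows is that weights are recomputed once per pass from all inputs together, so the result does not depend on the order of the input vectors.

```python
import math


def best_match(weights, x):
    """Index of the node whose weight vector is closest to x (squared Euclidean)."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))


def batch_som(data, rows, cols, epochs=10):
    """Train a rows x cols map on data (a list of equal-length float lists)."""
    dim = len(data[0])
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    # Order-independent initialization (an assumption for this sketch):
    # spread the nodes linearly between the per-dimension min and max.
    lo = [min(x[d] for x in data) for d in range(dim)]
    hi = [max(x[d] for x in data) for d in range(dim)]
    steps = max(len(nodes) - 1, 1)
    weights = [[lo[d] + (hi[d] - lo[d]) * i / steps for d in range(dim)]
               for i in range(len(nodes))]
    sigma0 = max(rows, cols) / 2
    for epoch in range(epochs):
        sigma = max(sigma0 * (1 - epoch / epochs), 0.5)
        # Step 1: assign every input to its best-matching node, using the
        # weights as they stood at the start of this pass.
        bmus = [best_match(weights, x) for x in data]
        # Step 2: replace each node by the neighborhood-weighted mean of
        # all inputs.
        updated = []
        for n, (r, c) in enumerate(nodes):
            num, den = [0.0] * dim, 0.0
            for x, b in zip(data, bmus):
                br, bc = nodes[b]
                h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * sigma ** 2))
                den += h
                for d in range(dim):
                    num[d] += h * x[d]
            updated.append([v / den for v in num] if den > 0 else weights[n])
        weights = updated
    return weights
```

After training on composition vectors from known genomes, a new fragment can be placed on the map with `best_match(weights, vector)`, which is the kind of mapping step the abstract describes for metagenomic sequences.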
Affiliation(s)
- Yuki Iwasaki
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken 526-0829, Japan.
- Japan Society for the Promotion of Science, Chiyoda-ku, Tokyo 102-0083, Japan.
- Takashi Abe
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken 526-0829, Japan.
- Department of Information Engineering, Faculty of Engineering, Niigata University, Niigata-ken 950-2181, Japan.
- Kennosuke Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken 526-0829, Japan.
- Yoshiko Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken 526-0829, Japan.
- Faculty of Medicine, Shiga University of Medical Science, Shiga-ken 520-2121, Japan.
- Toshimichi Ikemura
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken 526-0829, Japan.
|
6
|
Ikeda S, Abe T, Nakamura Y, Kibinge N, Hirai Morita A, Nakatani A, Ono N, Ikemura T, Nakamura K, Altaf-Ul-Amin M, Kanaya S. Systematization of the protein sequence diversity in enzymes related to secondary metabolic pathways in plants, in the context of big data biology inspired by the KNApSAcK motorcycle database. Plant Cell Physiol 2013; 54:711-727. [PMID: 23509110 DOI: 10.1093/pcp/pct041] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 06/01/2023]
Abstract
Biology is increasingly becoming a data-intensive science with the recent progress of the omics fields, e.g. genomics, transcriptomics, proteomics and metabolomics. The species-metabolite relationship database, KNApSAcK Core, has been widely utilized and cited in metabolomics research, and chronological analysis of that research work has helped to reveal recent trends in metabolomics research. To meet the needs of these trends, the KNApSAcK database has been extended by incorporating a secondary metabolic pathway database called Motorcycle DB. We examined the enzyme sequence diversity related to secondary metabolism by means of batch-learning self-organizing maps (BL-SOMs). Initially, we constructed a map by using a big data matrix consisting of the frequencies of all possible dipeptides in the protein sequence segments of plants and bacteria. The enzyme sequence diversity of the secondary metabolic pathways was examined by identifying clusters of segments associated with certain enzyme groups in the resulting map. The extent of diversity of 15 secondary metabolic enzyme groups is discussed. Data-intensive approaches such as BL-SOM applied to big data matrices are needed for systematizing protein sequences. Handling big data has become an inevitable part of biology.
Affiliation(s)
- Shun Ikeda
- Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma-shi, Nara, 630-0192 Japan
|
7
|
A novel approach, based on BLSOMs (Batch Learning Self-Organizing Maps), to the microbiome analysis of ticks. ISME J 2013; 7:1003-1015. [PMID: 23303373 DOI: 10.1038/ismej.2012.171] [Citation(s) in RCA: 81] [Impact Index Per Article: 7.4] [Indexed: 01/30/2023]
Abstract
Ticks transmit a variety of viral, bacterial and protozoal pathogens, which are often zoonotic. The aim of this study was to identify diverse tick microbiomes, which may contain as-yet unidentified pathogens, using a metagenomic approach. DNA prepared from bacteria/archaea-enriched fractions obtained from seven tick species, namely Amblyomma testudinarium, Amblyomma variegatum, Haemaphysalis formosensis, Haemaphysalis longicornis, Ixodes ovatus, Ixodes persulcatus and Ixodes ricinus, was subjected to pyrosequencing after whole-genome amplification. The resulting sequence reads were phylotyped using a Batch Learning Self-Organizing Map (BLSOM) program, which allowed phylogenetic estimation based on similarity of oligonucleotide frequencies, and functional annotation by BLASTX similarity searches. In addition to bacteria previously associated with human/animal diseases, such as Anaplasma, Bartonella, Borrelia, Ehrlichia, Francisella and Rickettsia, BLSOM analysis detected microorganisms belonging to the phylum Chlamydiae in some tick species. This was confirmed by pan-Chlamydia PCR and sequencing analysis. Gene sequences associated with bacterial pathogenesis were also identified, some of which were suspected to originate from horizontal gene transfer. These efforts to construct a database of tick microbes may lead to the ability to predict emerging tick-borne diseases. Furthermore, a comprehensive understanding of tick microbiomes will be useful for understanding tick biology, including vector competency and interactions with pathogens and symbionts.
|