Reference Citation Analysis

For: Wang JT, Rozen S, Shapiro BA, Shasha D, Wang Z, Yin M. New techniques for DNA sequence classification. J Comput Biol 1999;6:209-18. [PMID: 10421523 DOI: 10.1089/cmb.1999.6.209] [Citation(s) in RCA: 22] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022]

Cited by Other Article(s)

İhtiyar MN, Özgür A. Generative language models on nucleotide sequences of human genes. Sci Rep 2024;14:22204. [PMID: 39333252 PMCID: PMC11437190 DOI: 10.1038/s41598-024-72512-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Accepted: 09/09/2024] [Indexed: 09/29/2024]

Abstract

Language models, especially transformer-based ones, have achieved colossal success in natural language processing. To be precise, studies like BERT for natural language understanding and works like GPT-3 for natural language generation are very important. If we consider DNA sequences as a text written with an alphabet of four letters representing the nucleotides, they are similar in structure to natural languages. This similarity has led to the development of discriminative language models such as DNABERT in the field of DNA-related bioinformatics. To our knowledge, however, the generative side of the coin is still largely unexplored. Therefore, we have focused on the development of an autoregressive generative language model such as GPT-3 for DNA sequences. Since working with whole DNA sequences is challenging without extensive computational resources, we decided to conduct our study on a smaller scale and focus on nucleotide sequences of human genes, i.e. unique parts of DNA with specific functions, rather than the whole DNA. This decision has not significantly changed the structure of the problem, as both DNA and genes can be considered as 1D sequences consisting of four different nucleotides without losing much information and without oversimplification. First of all, we systematically studied an almost entirely unexplored problem and observed that recurrent neural networks (RNNs) perform best, while simple techniques such as N-grams are also promising. Another beneficial point was learning how to work with generative models on languages we do not understand, unlike natural languages. The importance of using real-world tasks beyond classical metrics such as perplexity was noted. In addition, we examined whether the data-hungry nature of these models can be altered by selecting a language with minimal vocabulary size, four due to four different types of nucleotides. The reason for reviewing this was that choosing such a language might make the problem easier. However, in this study, we found that this did not change the amount of data required very much.
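
As an illustration of the N-gram baseline and the perplexity metric mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. The function names `train_ngram` and `perplexity`, the `^` start-padding symbol, and the use of add-one (Laplace) smoothing are all assumptions made for this example.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # the four-nucleotide vocabulary described in the abstract


def train_ngram(sequences, n):
    """Count n-grams and their (n-1)-symbol contexts (hypothetical helper)."""
    ngram_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        # Pad the start so the first symbols also have a full-length context.
        padded = "^" * (n - 1) + seq
        for i in range(len(seq)):
            context = padded[i:i + n - 1]
            ngram_counts[context + padded[i + n - 1]] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts


def perplexity(seq, ngram_counts, context_counts, n):
    """Per-symbol perplexity of seq under the add-one-smoothed n-gram model."""
    padded = "^" * (n - 1) + seq
    log_prob = 0.0
    for i in range(len(seq)):
        context = padded[i:i + n - 1]
        symbol = padded[i + n - 1]
        # Add-one smoothing (an assumption here) keeps unseen n-grams nonzero.
        p = (ngram_counts[context + symbol] + 1) / (
            context_counts[context] + len(ALPHABET)
        )
        log_prob += math.log(p)
    return math.exp(-log_prob / len(seq))


if __name__ == "__main__":
    counts = train_ngram(["ACGTACGTACGT"], n=3)
    print(perplexity("ACGTACGT", *counts, n=3))
```

With no training data, every symbol receives probability 1/4, so the perplexity equals the vocabulary size of four. A model that has captured any regularity in the sequences scores below that.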

Bhandari N, Khare S, Walambe R, Kotecha K. Comparison of machine learning and deep learning techniques in promoter prediction across diverse species. PeerJ Comput Sci 2021;7:e365. [PMID: 33817015 PMCID: PMC7959599 DOI: 10.7717/peerj-cs.365] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 09/22/2020] [Accepted: 12/30/2020] [Indexed: 06/12/2023]

Signature Recognition Methods for Identifying Influenza Sequences. Artif Intell Med 2005. [DOI: 10.1007/11527770_67] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]

GeneScout: a data mining system for predicting vertebrate genes in genomic DNA sequences. Inf Sci (N Y) 2004. [DOI: 10.1016/j.ins.2003.03.016] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Indexed: 11/22/2022]

Effective hidden Markov models for detecting splicing junction sites in DNA sequences. Inf Sci (N Y) 2001. [DOI: 10.1016/s0020-0255(01)00160-8] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.2] [Indexed: 11/22/2022]