1
Das B. An implementation of a hybrid method based on machine learning to identify biomarkers in the Covid-19 diagnosis using DNA sequences. Chemometrics and Intelligent Laboratory Systems 2022; 230:104680. [PMID: 36213553 PMCID: PMC9528020 DOI: 10.1016/j.chemolab.2022.104680] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2022] [Revised: 09/20/2022] [Accepted: 09/27/2022] [Indexed: 06/16/2023]
Abstract
Some people are more vulnerable to the coronavirus even though they have no chronic disease and are not in a high-risk age group for Covid-19. Some experts attribute this to the person's immune system, while others think that the patient's genetic history may play a role. Detecting the coronavirus from DNA signals as early as possible is critical for determining the relationship between Covid-19 and genes, because it would reveal how variations in the genes associated with the disease affect its severity. In this study, a novel intelligent computer approach is proposed to identify coronavirus from nucleotide signals for the first time. The proposed method uses a multilayered feature extraction structure that combines an entropy-based mapping technique, the Discrete Wavelet Transform (DWT), a statistical feature extractor, and Singular Value Decomposition (SVD) to extract the most effective features. The ReliefF technique then selects 94 distinctive features. Support vector machine (SVM) and k-nearest neighbor (k-NN) classifiers are used. The method achieved its highest classification accuracy, 98.84%, with the SVM classifier when detecting Covid-19 from DNA signals. The proposed method is ready to be tested on a different database for the diagnosis of Covid-19 using RNA or other signals.
Affiliation(s)
- Bihter Das
- Department of Software Engineering, Technology Faculty, Firat University, 23119, Elazig, Turkey
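The first stages of the feature extraction chain named in the abstract above (entropy-based mapping, DWT, statistical features) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption: the mapping replaces each nucleotide with -p*log2(p) of its relative frequency, the wavelet is a single-level Haar transform, and the function names `entropy_map`, `haar_dwt`, and `summary_features` are hypothetical. SVD, ReliefF, and the classifiers are not shown.

```python
import math
from collections import Counter


def entropy_map(seq):
    """Map each nucleotide to -p*log2(p), where p is its relative
    frequency in seq (an assumed form of entropy-based mapping)."""
    counts = Counter(seq)
    n = len(seq)
    value = {base: -(c / n) * math.log2(c / n) for base, c in counts.items()}
    return [value[base] for base in seq]


def haar_dwt(signal):
    """One level of the Haar DWT. An odd-length signal drops its last sample."""
    scale = 1 / math.sqrt(2)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    approx = [(a + b) * scale for a, b in pairs]
    detail = [(a - b) * scale for a, b in pairs]
    return approx, detail


def summary_features(coeffs):
    """Mean, population standard deviation, and energy of a non-empty list."""
    mean = sum(coeffs) / len(coeffs)
    variance = sum((c - mean) ** 2 for c in coeffs) / len(coeffs)
    return {
        "mean": mean,
        "std": math.sqrt(variance),
        "energy": sum(c * c for c in coeffs),
    }


signal = entropy_map("AACC")          # every base has p = 0.5
approx, detail = haar_dwt(signal)
features = summary_features(approx)
```

In a full pipeline the features from several decomposition levels would be concatenated before feature selection; this sketch stops at one level to keep the idea visible.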
2
Das B. A deep learning model for identification of diabetes type 2 based on nucleotide signals. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07121-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/27/2022]
3
Das B, Toraman S. Deep transfer learning for automated liver cancer gene recognition using spectrogram images of digitized DNA sequences. Biomed Signal Process Control 2022. [DOI: 10.1016/j.bspc.2021.103317] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
4
Alakuş TB, Türkoğlu İ. Kanser Teşhisinde Protein Haritalama Tekniklerinin Başarımlarının Derin Öğrenme Kullanılarak Karşılaştırılması [Comparison of the performance of protein mapping techniques in cancer diagnosis using deep learning]. Fırat Üniversitesi Mühendislik Bilimleri Dergisi 2021; 33:547-565. [DOI: 10.35234/fumbd.881228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
Abstract
Cancer is a heterogeneous disease that comprises many different subtypes and causes a large number of deaths worldwide. Early diagnosis and prognosis of a cancer type have become a necessity in cancer research because they can facilitate the subsequent clinical follow-up of patients. One of the most widely used methods for this purpose is histological examination. However, this method suffers from considerable inter-observer variability, which makes the examination process long and time-consuming. To overcome this disadvantage, researchers have turned to computation-based approaches and use protein-protein interactions, protein interaction networks, and molecular fingerprints to identify cancer-related proteins. Among these methods, various studies have shown that cancerous cells can also be detected from genomic information. It has been observed that specific cancer types can be identified from the sequences of cancer-related genes and that machine-learning-based approaches are effective in this process. In this study, a recurrent neural network architecture, one of the deep learning algorithms, was used to classify human bladder, colon, and prostate cancers according to their protein sequences. The study consists of four stages: data acquisition, digitization of the protein sequences, development of the deep learning model, and comparison of the performance of the protein mapping techniques. The AESNN1, hydrophobicity, integer, Miyazawa energies, and random coding methods were considered for digitizing the protein sequences. At the end of the study, the highest accuracy for bladder cancer, 87.15%, was obtained with the AESNN1 mapping method, while the highest accuracies for colon cancer and prostate cancer, 94.40% and 75.45%, were obtained with the Miyazawa energies and random coding protein mapping methods, respectively.
This study shows that machine learning and protein mapping techniques are effective in identifying cancerous protein sequences.
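Two of the protein mapping families listed in the abstract above, integer coding and hydrophobicity, can be sketched as simple lookup tables. This is an illustration only: the abstract does not say which integer ordering or hydrophobicity scale the authors used, so the alphabetical 1 to 20 ordering and the Kyte-Doolittle scale below are assumptions, and `encode` is a hypothetical helper name.

```python
# The 20 standard amino acids in alphabetical order of their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed integer coding: A -> 1, C -> 2, ..., Y -> 20.
INTEGER_CODE = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

# Kyte-Doolittle hydropathy values (an assumed choice of scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def encode(protein, table):
    """Turn a protein string into a numeric sequence using a lookup table.
    Raises KeyError for residues outside the 20 standard amino acids."""
    return [table[aa] for aa in protein.upper()]


as_integers = encode("MKV", INTEGER_CODE)
as_hydropathy = encode("MKV", KYTE_DOOLITTLE)
```

The same sequence yields very different numeric signals under the two tables, which is why a comparison of mapping techniques is meaningful for a downstream classifier.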
5
Das L, Das JK, Mohapatra S, Nanda S. DNA numerical encoding schemes for exon prediction: a recent history. Nucleosides Nucleotides & Nucleic Acids 2021; 40:985-1017. [PMID: 34455915 DOI: 10.1080/15257770.2021.1966797] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
Bioinformatics in the present day has been firmly established as a regulator in genomics. In recent times, applications of signal processing in exon prediction have gained a lot of attention. Exons carry protein information. Proteins are composed of connected constituents known as amino acids, which characterize their specific function. Converting the nucleotide character string into a numerical sequence is the gateway to analyzing it with signal processing methods. This numeric encoding is the mathematical descriptor of nucleotides and is based on statistical properties of the structure of nucleic acids. Since the type of encoding strongly affects exon detection accuracy, this paper reviews existing encoding (mapping) schemes. The comparative analysis emphasizes the importance of the genetic code setting of amino acids considered for applications related to computational elucidation for exon detection. This work provides information that should be useful for future applications.
Affiliation(s)
- Lopamudra Das
- School of Electronics Engineering, KIIT, Bhubaneswar, India
- J K Das
- School of Electronics Engineering, KIIT, Bhubaneswar, India
- S Mohapatra
- School of Electronics Engineering, KIIT, Bhubaneswar, India
- Sarita Nanda
- School of Electronics Engineering, KIIT, Bhubaneswar, India
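The conversion of a nucleotide string into numerical sequences, which the abstract above calls the gateway to signal processing, can be illustrated with one widely used scheme: the binary indicator (Voss) representation. The abstract does not name the schemes the review compares, so this choice is an assumption, and `voss_encode` is a hypothetical function name.

```python
def voss_encode(seq):
    """Return four binary indicator sequences, one per nucleotide.
    Position i of the sequence for base X is 1 if seq[i] == X, else 0."""
    seq = seq.upper()
    return {base: [1 if s == base else 0 for s in seq] for base in "ACGT"}


indicators = voss_encode("ACGTA")
```

Each of the four sequences can then be passed to a standard tool such as a Fourier transform; characters other than A, C, G, and T simply produce 0 in all four sequences.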
6
M RK, Vaegae NK. Walsh code based numerical mapping method for the identification of protein coding regions in eukaryotes. Biomed Signal Process Control 2020. [DOI: 10.1016/j.bspc.2020.101859] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 10/25/2022]
7
Raman Kumar M, Vaegae NK. A new numerical approach for DNA representation using modified Gabor wavelet transform for the identification of protein coding regions. Biocybern Biomed Eng 2020. [DOI: 10.1016/j.bbe.2020.03.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/24/2022]
8
Alakus TB, Das B, Turkoglu I. DNA encoding with entropy based numerical mapping technique for phylogenetic analysis. 2019 International Artificial Intelligence and Data Processing Symposium (IDAP) 2019. [DOI: 10.1109/idap.2019.8875937] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
9
Li J, Zhang L, Li H, Ping Y, Xu Q, Wang R, Tan R, Wang Z, Liu B, Wang Y. Integrated entropy-based approach for analyzing exons and introns in DNA sequences. BMC Bioinformatics 2019; 20:283. [PMID: 31182012 PMCID: PMC6557737 DOI: 10.1186/s12859-019-2772-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND: Numerous essential algorithms and methods, including entropy-based quantitative methods, have been developed over the last decade to analyze complex DNA sequences. Exons and introns are the most notable components of DNA, and their identification and prediction are a constant focus of state-of-the-art research. RESULTS: In this study, we designed an integrated entropy-based analysis approach, which involves modified topological entropy calculation, the genomic signal processing (GSP) method, and singular value decomposition (SVD), to investigate exons and introns in DNA sequences. We optimized and implemented the topological entropy and the generalized topological entropy to calculate the complexity of DNA sequences, highlighting the characteristics of repetitive sequences. By comparing the digitized entropy values of exons and introns, we observed that they are significantly different. After converting the DNA data to numerical topological entropy values, we applied the SVD method to investigate exon and intron regions on a single gene sequence. Additionally, several genes across five species were used for exon prediction. CONCLUSIONS: Our approach not only helps to explore the complexity of DNA sequences and their functional elements, but also provides an entropy-based GSP method to analyze exon and intron regions. Our work is feasible across different species and extendable to other components in both coding and noncoding regions of DNA sequences.
Affiliation(s)
- Junyi Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Li Zhang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Huinian Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Yuan Ping
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Qingzhe Xu
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- Rongjie Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
- Renjie Tan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
- Zhen Wang
- CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031 China
- Bo Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
- Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055 China
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001 China
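As a rough illustration of the complexity measure mentioned in the abstract above, the sketch below computes a basic finite-sequence topological entropy: take the largest word length n such that 4^n + n - 1 bases are available, count the distinct words of length n in that prefix, and normalize with a base-4 logarithm. This is an assumption about the starting point only; the modified and generalized variants the authors implemented are not described in the abstract and are not reproduced here. The function name `topological_entropy` is hypothetical.

```python
import math


def topological_entropy(seq):
    """Basic topological entropy of a DNA string over the alphabet ACGT.

    Uses the largest n with 4**n + n - 1 <= len(seq), then counts the
    distinct length-n subwords in the first 4**n + n - 1 bases.
    The result lies between 0 (fully repetitive) and 1 (all words occur).
    """
    length = len(seq)
    n = 0
    while 4 ** (n + 1) + n <= length:  # same as 4**(n+1) + (n+1) - 1
        n += 1
    if n == 0:
        raise ValueError("sequence must contain at least 4 bases")
    prefix = seq[: 4 ** n + n - 1]
    subwords = {prefix[i:i + n] for i in range(len(prefix) - n + 1)}
    return math.log(len(subwords), 4) / n


diverse = topological_entropy("ACGT")
repetitive = topological_entropy("A" * 17)
```

Fixing the prefix length to 4^n + n - 1 means exactly 4^n windows are examined, so sequences of different lengths remain comparable on the same 0 to 1 scale.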
10
Skutkova H, Maderankova D, Sedlar K, Jugas R, Vitek M. A degeneration-reducing criterion for optimal digital mapping of genetic codes. Comput Struct Biotechnol J 2019; 17:406-414. [PMID: 30984363 PMCID: PMC6444178 DOI: 10.1016/j.csbj.2019.03.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/17/2018] [Revised: 02/07/2019] [Accepted: 03/15/2019] [Indexed: 01/08/2023] Open
Abstract
Bioinformatics may seem to be a scientific field that primarily processes large string datasets, as nucleotides and amino acids are represented by dedicated characters. On the other hand, many computational tasks in bioinformatics are mathematical problems that can be understood as operations on digits. In fact, many computational tasks are solved this way in the background. One of the most widely used digital representations maps nucleotides and amino acids to the integers 0–3 and 0–20, respectively. The limitation of this mapping appears when the digital signal of nucleotides has to be translated into a digital signal of amino acids, because the genetic code is degenerate. This causes non-monotonicities in the mapping function. Although a map for reducing this undesirable effect has already been proposed, it is defined theoretically and for standard genetic codes only. In this study, we derived a novel optimal criterion for reducing the influence of degeneration by utilizing a large dataset of real sequences with various genetic codes. As a result, we propose a new robust global optimal map suitable for any genetic code as well as specialized optimal maps for particular genetic codes. Highlights: optimization of 1D numerical representation for DNA to protein translation; reducing genetic code degeneracy in numerical representation of DNA sequences; more robust numerical conversion used for genomic-proteomic analysis.
Affiliation(s)
- Helena Skutkova
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech republic
- Denisa Maderankova
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech republic
- Karel Sedlar
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech republic
- Robin Jugas
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech republic
- Martin Vitek
- Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech republic
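The degeneracy problem described in the abstract above can be made concrete with a small sketch. It assumes one particular 0–3 assignment (A=0, C=1, G=2, T=3) and a base-4 codon value, neither of which is taken from the paper, and it uses only a handful of codons from the standard genetic code. The name `codon_value` is hypothetical.

```python
# Assumed nucleotide-to-integer assignment; the abstract only says "0-3".
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}


def codon_value(codon):
    """Read a codon as a three-digit base-4 number (0 to 63)."""
    first, second, third = (BASE_VALUE[base] for base in codon)
    return 16 * first + 4 * second + third


# A few synonymous codon groups from the standard genetic code.
SAMPLE_CODONS = {
    "Lys": ["AAA", "AAG"],
    "Asn": ["AAC", "AAT"],
    "Ser": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
}

values = {
    amino_acid: sorted(codon_value(c) for c in codons)
    for amino_acid, codons in SAMPLE_CODONS.items()
}
```

Under this assignment the codon values 0, 1, 2, 3 alternate between lysine and asparagine, and serine is split between values near 10 and values above 50, so no monotone function can send increasing codon values to a single increasing amino acid value. This is the kind of effect an optimized map tries to reduce.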
11
Spänig S, Heider D. Encodings and models for antimicrobial peptide classification for multi-resistant pathogens. BioData Min 2019; 12:7. [PMID: 30867681 PMCID: PMC6399931 DOI: 10.1186/s13040-019-0196-x] [Citation(s) in RCA: 51] [Impact Index Per Article: 8.5] [Received: 12/21/2018] [Accepted: 02/24/2019] [Indexed: 01/10/2023] Open
Abstract
Antimicrobial peptides (AMPs) are part of the inherent immune system. In fact, they occur in almost all organisms, including plants, animals, and humans. Remarkably, they are also effective against multi-resistant pathogens, with high selectivity. This is especially crucial at a time when society is faced with the major threat of an ever-increasing number of antibiotic-resistant microbes. In addition, AMPs can also exhibit antitumor and antiviral effects; thus, a variety of scientific studies have dealt with the prediction of active peptides in recent years. Due to their potential, even the pharmaceutical industry is keen on discovering and developing novel AMPs. However, AMPs are difficult to verify in vitro; hence, researchers conduct sequence similarity experiments against known, active peptides. Unfortunately, this approach is very time-consuming and limits potential candidates to sequences with a high similarity to known AMPs. Machine learning methods offer the opportunity to explore the huge space of sequence variations in a timely manner. These algorithms have, in principle, paved the way for an automated discovery of AMPs. However, machine learning models require a numerical input; thus, an informative encoding is very important. Unfortunately, developing an appropriate encoding is a major challenge, which has not been entirely solved so far. For this reason, the development of novel amino acid encodings is established as a stand-alone research branch. The present review introduces state-of-the-art encodings of amino acids as well as their properties in sequence- and structure-based aggregation. Moreover, although a well-chosen encoding is essential, performant classifiers are required, which is reflected by a tendency towards specifically designed models in the literature. Furthermore, we introduce these models with a particular focus on encodings derived from support vector machines and deep learning approaches.
Although a strong focus has been set on AMP predictions, not all of the mentioned encodings have been elaborated as part of antimicrobial research studies, but rather as general protein or peptide representations.
Affiliation(s)
- Sebastian Spänig
- Department of Bioinformatics, Faculty of Mathematics and Computer Science, Philipps-University of Marburg, Marburg, Germany
- Dominik Heider
- Department of Bioinformatics, Faculty of Mathematics and Computer Science, Philipps-University of Marburg, Marburg, Germany
12
Mendizabal-Ruiz G, Román-Godínez I, Torres-Ramos S, Salido-Ruiz RA, Vélez-Pérez H, Morales JA. Genomic signal processing for DNA sequence clustering. PeerJ 2018; 6:e4264. [PMID: 29379686 PMCID: PMC5786891 DOI: 10.7717/peerj.4264] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 10/17/2017] [Accepted: 12/24/2017] [Indexed: 11/20/2022] Open
Abstract
Genomic signal processing (GSP) methods, which convert DNA data to numerical values, have recently been proposed; they offer the opportunity to apply existing digital signal processing methods to genomic data. One of the most widely used methods for exploring data is cluster analysis, which refers to the unsupervised classification of patterns in data. In this paper, we propose a novel approach for performing cluster analysis of DNA sequences that is based on GSP methods and the K-means algorithm. We also propose a visualization method that facilitates the inspection and analysis of the results and of possible hidden behaviors. Our results support the feasibility of employing the proposed method to find and easily visualize interesting features of sets of DNA data.
Affiliation(s)
- Israel Román-Godínez
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
- Sulema Torres-Ramos
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
- Ricardo A Salido-Ruiz
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
- Hugo Vélez-Pérez
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
- J Alejandro Morales
- Departamento de Ciencias Computacionales, Universidad de Guadalajara, Guadalajara, Mexico
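The combination of a GSP mapping with K-means that the abstract above describes can be sketched in a few lines. The abstract does not specify the mapping or the features, so this sketch assumes a purine/pyrimidine mapping (A, G to +1; C, T to -1), a naive DFT power spectrum as the feature vector, equal-length sequences, and a deterministic initialization that takes the first k feature vectors as centroids. All function names are hypothetical.

```python
import cmath
import math


def to_signal(seq):
    """Assumed mapping: purines (A, G) -> +1.0, pyrimidines (C, T) -> -1.0."""
    return [1.0 if base in "AG" else -1.0 for base in seq.upper()]


def power_spectrum(signal, bins):
    """Power at DFT frequencies 1..bins, divided by the signal length."""
    n = len(signal)
    spectrum = []
    for k in range(1, bins + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain K-means; centroids start at the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_sequences(seqs, k, bins):
    features = [power_spectrum(to_signal(s), bins) for s in seqs]
    return kmeans(features, k)


labels = cluster_sequences(
    ["ACACACACACAC", "AGAGAGAGAGAG", "GTGTGTGTGTGT", "AAAAAAAAAAAA"],
    k=2,
    bins=6,
)
```

In this toy input the first and third sequences alternate between purine and pyrimidine, which concentrates power at the highest frequency bin, while the other two are constant under the mapping, so the two groups separate cleanly. Real data would need random restarts and sequences of comparable length.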