1
Zhong G, Liu H, Deng L. Ensemble Machine Learning and Predicted Properties Promote Antimicrobial Peptide Identification. Interdiscip Sci 2024; 16:951-965. [PMID: 38972032 DOI: 10.1007/s12539-024-00640-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2024] [Revised: 06/04/2024] [Accepted: 06/07/2024] [Indexed: 07/08/2024]
Abstract
The emergence of antibiotic-resistant microbes creates a pressing demand for novel alternative treatments. One promising alternative is antimicrobial peptides (AMPs), a class of innate immunity mediators within the therapeutic peptide realm. AMPs offer salient advantages such as high specificity, cost-effective synthesis, and reduced toxicity. Although some computational methods have been proposed to identify potential AMPs, aided by the rapid development of artificial intelligence techniques, there is still ample room to improve their performance. This study proposes a predictive framework that ensembles deep learning and statistical learning methods to screen peptides for antimicrobial activity. We integrate multiple LightGBM classifiers and convolutional neural networks that leverage various sequential, structural and physicochemical properties predicted from the residue sequences by diverse machine learning paradigms. Comparative experiments show that our method outperforms other state-of-the-art approaches on an independent test dataset in terms of representative performance measures. We also analyse the discrimination quality under different kinds of attribute information, which reveals that combining multiple features can improve prediction. In addition, a case study illustrates the favorable identification performance. We provide a web application at http://amp.denglab.org for convenient use of our method and make the predictive framework, source code, and datasets publicly accessible at https://github.com/researchprotein/amp.
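As a rough illustration of the ensembling idea in this abstract, the sketch below averages the AMP probabilities returned by several already-trained models (soft voting). The abstract does not say how the LightGBM and CNN outputs are actually combined, so the function name `soft_vote`, the equal default weights, and the 0.5 threshold are all assumptions made for illustration.

```python
def soft_vote(probabilities, weights=None, threshold=0.5):
    """Combine per-model AMP probabilities for one peptide.

    probabilities: one predicted probability per base model (hypothetical inputs).
    weights: optional per-model weights; equal weights are assumed by default.
    Returns the weighted mean probability and the resulting binary call.
    """
    if not probabilities:
        raise ValueError("need at least one model probability")
    if weights is None:
        weights = [1.0] * len(probabilities)
    if len(weights) != len(probabilities):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    score = sum(p * w for p, w in zip(probabilities, weights)) / total
    return score, score >= threshold


# Three hypothetical base models scoring the same peptide.
score, is_amp = soft_vote([0.9, 0.6, 0.3])
```

Averaging probabilities rather than hard labels lets a confident model outweigh two uncertain ones, which is one common reason to prefer soft voting when base models are heterogeneous.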
Affiliation(s)
- Guolun Zhong
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Hui Liu
  - College of Computer and Information Engineering, Nanjing Tech University, Nanjing, 211816, China
- Lei Deng
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
2
Chen F, Ye S, Xu L, Xie R. FTDZOA: An Efficient and Robust FS Method with Multi-Strategy Assistance. Biomimetics (Basel) 2024; 9:632. [PMID: 39451838 PMCID: PMC11505684 DOI: 10.3390/biomimetics9100632] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2024] [Revised: 10/11/2024] [Accepted: 10/15/2024] [Indexed: 10/26/2024] Open
Abstract
Feature selection (FS) is a pivotal technique in big data analytics, aimed at mitigating redundant information within datasets and optimizing computational resource utilization. This study introduces an enhanced zebra optimization algorithm (ZOA), termed FTDZOA, for superior feature dimensionality reduction. To address the challenges of ZOA, such as susceptibility to local optimal feature subsets, limited global search capabilities, and sluggish convergence when tackling FS problems, three strategies are integrated into the original ZOA to bolster its FS performance. Firstly, a fractional order search strategy is incorporated to preserve information from the preceding generations, thereby enhancing ZOA's exploitation capabilities. Secondly, a triple mean point guidance strategy is introduced, amalgamating information from the global optimal point, a random point, and the current point to effectively augment ZOA's exploration prowess. Lastly, the exploration capacity of ZOA is further elevated through the introduction of a differential strategy, which integrates information disparities among different individuals. Subsequently, the FTDZOA-based FS method was applied to solve 23 FS problems spanning low, medium, and high dimensions. A comparative analysis with nine advanced FS methods revealed that FTDZOA achieved higher classification accuracy on over 90% of the datasets and secured a winning rate exceeding 83% in terms of execution time. These findings confirm that FTDZOA is a reliable, high-performance, practical, and robust FS method.
Affiliation(s)
- Fuqiang Chen
  - Department of Artificial Intelligence, Guangzhou Huashang College, Guangzhou 511300, China
- Shitong Ye
  - Department of Artificial Intelligence, Guangzhou Huashang College, Guangzhou 511300, China
- Lijuan Xu
  - Department of Artificial Intelligence, Guangzhou Huashang College, Guangzhou 511300, China
- Rongxiang Xie
  - State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
3
Duong TKC, Tran VL, Nguyen TB, Nguyen TT, Ho NTK, Nguyen TQ. Ensemble learning-based approach for automatic classification of termite mushrooms. Front Genet 2023; 14:1208695. [PMID: 37886685 PMCID: PMC10598762 DOI: 10.3389/fgene.2023.1208695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2023] [Accepted: 09/13/2023] [Indexed: 10/28/2023] Open
Abstract
Termite mushrooms are edible fungi that provide significant economic, nutritional, and medicinal value. However, identifying these mushroom species based on morphology and traditional knowledge is ineffective due to their short development time and seasonal nature. This study proposes a novel method for classifying termite mushroom species. The method applies Gradient Boosting and sequence encoding to an Internal Transcribed Spacer (ITS) gene dataset to build a machine learning model for identifying termite mushroom species. The model is trained using ITS sequences obtained from the National Center for Biotechnology Information (NCBI) and the Barcode of Life Data Systems (BOLD). Ensemble learning techniques are applied to classify termite mushroom species. The proposed model achieves good results on the test dataset, with an accuracy of 0.91 and an average AUCROC value of 0.99. To validate the model, eight ITS sequences collected from termite mushroom samples in An Linh commune, Phu Giao district, Binh Duong province, Vietnam were used as the test data. The resulting species identifications were consistent with predictions from the NCBI BLAST software. This machine-learning model shows promise as an automatic solution for classifying termite mushroom species. It can help researchers better understand the local growth of these termite mushrooms and develop conservation plans for this rare and valuable fungal resource.
Affiliation(s)
- Thi Kim Chi Duong
  - Department of Information Technology, Lac Hong University, Dong Nai Province, Vietnam
  - Faculty of Engineering and Technology, Thu Dau Mot University, Binh Duong Province, Vietnam
- Van Lang Tran
  - HUFLIT Journal of Science, Ho Chi Minh City University of Foreign Languages and Information Technology, Ho Chi Minh City, Vietnam
- The Bao Nguyen
  - Faculty of Engineering and Technology, Thu Dau Mot University, Binh Duong Province, Vietnam
- Thi Thuy Nguyen
  - Faculty of Engineering and Technology, Thu Dau Mot University, Binh Duong Province, Vietnam
- Ngoc Trung Kien Ho
  - Faculty of Engineering and Technology, Thu Dau Mot University, Binh Duong Province, Vietnam
- Thanh Q. Nguyen
  - Department of Railway-Metro Engineering, Ho Chi Minh City University of Transport, Ho Chi Minh City, Vietnam
4
A Robust Feature Construction for Fish Classification Using Grey Wolf Optimizer. CYBERNETICS AND INFORMATION TECHNOLOGIES 2022. [DOI: 10.2478/cait-2022-0045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
The low quality of fish image data collected directly from the habitat degrades the quality of the extracted features. Previous studies tended to be more concerned with finding the best method than with feature quality. This article proposes a new fish classification workflow that combines Contrast-Adaptive Color Correction (NCACC) image enhancement with optimization-based feature construction using the Grey Wolf Optimizer (GWO). This approach improves the image feature extraction results to obtain new and more meaningful features. The article compares GWO-based fish classification with classification based on other optimization methods on the newly generated features. The comparison shows that GWO-based classification had 0.22% lower accuracy than GA-based classification but 1.13% higher accuracy than PSO-based classification. Based on ANOVA tests, the accuracies of GA and GWO were not statistically different, whereas those of GWO and PSO were. On the other hand, GWO-based classification ran 0.61 times faster than GA-based classification and 1.36 minutes faster than PSO-based classification.
5
Riehl K, Riccio C, Miska EA, Hemberg M. TransposonUltimate: software for transposon classification, annotation and detection. Nucleic Acids Res 2022; 50:e64. [PMID: 35234904 PMCID: PMC9226531 DOI: 10.1093/nar/gkac136] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 06/09/2021] [Revised: 02/09/2022] [Accepted: 02/14/2022] [Indexed: 12/17/2022] Open
Abstract
Most genomes harbor a large number of transposons, which play an important role in evolution and gene regulation. They are also of interest to clinicians as they are involved in several diseases, including cancer and neurodegeneration. Although several methods for transposon identification are available, they are often highly specialised towards specific tasks or classes of transposons, and they lack common standards such as a unified taxonomy scheme and output file format. We present TransposonUltimate, a powerful bundle of three modules for transposon classification, annotation, and detection of transposition events. TransposonUltimate comes as a Conda package under the GPL-3.0 licence, is well documented, and is easy to install through https://github.com/DerKevinRiehl/TransposonUltimate. We benchmark the classification module on the large TransposonDB covering 891,051 sequences to demonstrate that it outperforms the best existing solutions. The annotation and detection modules combine sixteen existing software tools, and we illustrate their use by annotating the Caenorhabditis elegans, Rhizophagus irregularis and Oryza sativa subsp. japonica genomes. Finally, we use the detection module to discover 29 554 transposition events in the genomes of 20 wild type strains of C. elegans. Databases, assemblies, annotations and further findings can be downloaded from https://doi.org/10.5281/zenodo.5518085.
Affiliation(s)
- Kevin Riehl
  - Gurdon Institute, University of Cambridge, Cambridge CB2 1QN, UK
- Cristian Riccio
  - Gurdon Institute, University of Cambridge, Cambridge CB2 1QN, UK
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK
- Eric A Miska
  - Gurdon Institute, University of Cambridge, Cambridge CB2 1QN, UK
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK
  - Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK
- Martin Hemberg
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK
  - Evergrande Center for Immunologic Diseases, Harvard Medical School and Brigham and Women’s Hospital, 75 Francis Street, Boston, MA 02215, USA
6
Dee W. LMPred: predicting antimicrobial peptides using pre-trained language models and deep learning. BIOINFORMATICS ADVANCES 2022; 2:vbac021. [PMID: 36699381 PMCID: PMC9710646 DOI: 10.1093/bioadv/vbac021] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 11/29/2021] [Revised: 03/01/2022] [Accepted: 03/29/2022] [Indexed: 01/28/2023]
Abstract
Motivation
Antimicrobial peptides (AMPs) are increasingly being used in the development of new therapeutic drugs in areas such as cancer therapy and hypertension. Additionally, they are seen as an alternative to antibiotics due to the increasing occurrence of bacterial resistance. Wet-laboratory experimental identification, however, is both time-consuming and costly, so in silico models are now commonly used to screen new AMP candidates.
Results
This paper proposes a novel approach for creating model inputs: pre-trained language models produce contextualized embeddings that represent the amino acids within each peptide sequence, and a convolutional neural network is then trained on these embeddings as the classifier. The results were validated on two datasets: one previously used in AMP prediction research, and a larger independent dataset created for this paper. Predictive accuracies of 93.33% and 88.26% were achieved, respectively, outperforming previous state-of-the-art classification models.
Availability and implementation
All code is available at https://github.com/williamdee1/LMPred_AMP_Prediction.
Supplementary information
Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- William Dee
  - Department of Bioinformatics, School of Biological and Behavioural Sciences, Queen Mary University of London, London E1 4NS, UK
7
Hu G, Du B, Wang X, Wei G. An enhanced black widow optimization algorithm for feature selection. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2021.107638] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 02/07/2023]
8
Cheng M, Li Y, Nazarian S, Bogdan P. From rumor to genetic mutation detection with explanations: a GAN approach. Sci Rep 2021; 11:5861. [PMID: 33712675 PMCID: PMC7955089 DOI: 10.1038/s41598-021-84993-1] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 03/05/2020] [Accepted: 02/22/2021] [Indexed: 02/07/2023] Open
Abstract
Social media have emerged as increasingly popular means and environments for information gathering and propagation. This vigorous growth of social media contributed not only to a pandemic (fast-spreading and far-reaching) of rumors and misinformation, but also to an urgent need for text-based rumor detection strategies. To speed up the detection of misinformation, traditional rumor detection methods based on hand-crafted feature selection need to be replaced by automatic artificial intelligence (AI) approaches. AI decision-making systems must provide explanations in order to assure users of their trustworthiness. Inspired by the thriving development of generative adversarial networks (GANs) on text applications, we propose a GAN-based layered model for rumor detection with explanations. To demonstrate the universality of the proposed approach, we show its benefits on a case study of gene classification with mutation detection. Like rumor detection, gene classification can also be formulated as a text-based classification problem. Unlike fake news detection, which needs a previously collected verified news database, our model provides explanations in rumor detection based on tweet-level texts only, without referring to a verified news database. The layered structure of both generative and discriminative models contributes to the outstanding performance. The layered generators produce rumors by intelligently inserting controversial information in non-rumors, and force the layered discriminators to detect detailed glitches and deduce exactly which parts in the sentence are problematic. On average, in the rumor detection task, our proposed model outperforms state-of-the-art baselines on the PHEME dataset by 26.85% in terms of macro-F1. The excellent performance of our model on textual sequences is also demonstrated by the gene mutation case study, on which it achieves a 72.69% macro-F1 score.
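Both headline numbers in this abstract are macro-F1 scores. As a reminder of what that metric computes, the sketch below takes the unweighted mean of the per-class F1 scores, so every class counts equally regardless of how many examples it has. The labels and predictions are made-up toy data, not results from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted examples contributes 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: one rumor is missed and labeled as a non-rumor.
y_true = ["rumor", "rumor", "non-rumor", "non-rumor"]
y_pred = ["rumor", "non-rumor", "non-rumor", "non-rumor"]
value = macro_f1(y_true, y_pred)
```

In this toy case the non-rumor class scores 0.8 and the rumor class scores 2/3, so the macro-F1 is their plain average.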
Affiliation(s)
- Mingxi Cheng
  - Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA, 90007, USA
- Yizhi Li
  - School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, 100876, China
- Shahin Nazarian
  - Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA, 90007, USA
- Paul Bogdan
  - Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA, 90007, USA
9
Moosa S, Amira PA, Boughorbel DS. DASSI: differential architecture search for splice identification from DNA sequences. BioData Min 2021; 14:15. [PMID: 33588916 PMCID: PMC7885202 DOI: 10.1186/s13040-021-00237-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/01/2020] [Accepted: 01/05/2021] [Indexed: 11/28/2022] Open
Abstract
Background
The data explosion caused by unprecedented advancements in the field of genomics is constantly challenging the conventional methods used in the interpretation of the human genome. The demand for robust algorithms over recent years has brought huge success in the field of Deep Learning (DL) in solving many difficult tasks in image, speech and natural language processing by automating the manual process of architecture design. This has been fueled by the development of new DL architectures. Yet genomics poses unique challenges that require customization and development of new DL models.
Methods
We proposed a new model, DASSI, by adapting a differential architecture search method and applying it to the Splice Site (SS) recognition task on DNA sequences to discover new high-performance convolutional architectures in an automated manner. We evaluated the discovered model against state-of-the-art tools to classify true and false SS in Homo sapiens (Human), Arabidopsis thaliana (Plant), Caenorhabditis elegans (Worm) and Drosophila melanogaster (Fly).
Results
Our experimental evaluation demonstrated that the discovered architecture outperformed baseline models and fixed architectures and showed competitive results against state-of-the-art models used in classification of splice sites. The proposed model, DASSI, has a compact architecture and showed very good results on a transfer learning task. The benchmarking experiments of execution time and precision on the architecture search and evaluation process showed better performance on recently available GPUs, making it feasible to adopt architecture search based methods on large datasets.
Conclusions
We proposed the use of a differential architecture search method (DASSI) to perform SS classification on raw DNA sequences, and discovered new neural network models with a low number of tunable parameters and competitive performance compared with manually engineered architectures. We have extensively benchmarked the DASSI model against other state-of-the-art models and assessed its computational efficiency. The results show the high potential of automated architecture search mechanisms for solving various problems in the field of genomics.
Affiliation(s)
- Shabir Moosa
  - Department of Systems Biology, SIDRA Medicine, Doha, 26999, Qatar
  - Dept. of Computer Science and Engineering, Qatar University, Doha, 2713, Qatar
- Prof Abbes Amira
  - Dept. of Computer Science and Engineering, Qatar University, Doha, 2713, Qatar
10
Fu H, Cao Z, Li M, Wang S. ACEP: improving antimicrobial peptides recognition through automatic feature fusion and amino acid embedding. BMC Genomics 2020; 21:597. [PMID: 32859150 PMCID: PMC7455913 DOI: 10.1186/s12864-020-06978-0] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 01/31/2020] [Accepted: 08/11/2020] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND
Antimicrobial resistance is one of our most serious health threats. Antimicrobial peptides (AMPs), effector molecules of the innate immune system, can defend host organisms against microbes, and most have shown a lowered likelihood for bacteria to form resistance compared to many conventional drugs. Thus, AMPs are gaining popularity as a better substitute for antibiotics. To aid researchers in novel AMP discovery, we design computational approaches to screen promising candidates.
RESULTS
In this work, we design a deep learning model that can learn amino acid embedding patterns, automatically extract sequence features, and fuse heterogeneous information. Results show that the proposed model outperforms state-of-the-art methods on recognition of AMPs. By visualizing data in some layers of the model, we overcome the black-box nature of deep learning, explain the working mechanism of the model, and find some important motifs in sequences.
CONCLUSIONS
The ACEP model can capture similarity between amino acids, calculate attention scores for different parts of a peptide sequence in order to spot important parts that significantly contribute to final predictions, and automatically fuse a variety of heterogeneous information or features. For high-throughput AMP recognition, open source software and datasets are made freely available at https://github.com/Fuhaoyi/ACEP.
Affiliation(s)
- Haoyi Fu
  - School of Information Science and Engineering, Yunnan University, Kunming, 650500, China
- Zicheng Cao
  - School of Public Health (Shenzhen), Sun Yat-sen University, Guangzhou, 510006, China
- Mingyuan Li
  - School of Information Science and Engineering, Yunnan University, Kunming, 650500, China
- Shunfang Wang
  - School of Information Science and Engineering, Yunnan University, Kunming, 650500, China
11
Amilpur S, Bhukya R. EDeepSSP: Explainable deep neural networks for exact splice sites prediction. J Bioinform Comput Biol 2020; 18:2050024. [PMID: 32696716 DOI: 10.1142/s0219720020500249] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/18/2022]
Abstract
Splice site prediction is crucial for understanding gene regulation and gene function, and for better genome annotation. Many computational methods exist for recognizing splice sites. Although most of the methods achieve competent performance, their interpretability remains challenging. Moreover, traditional machine learning methods extract features manually, which is a tedious job. To address these challenges, we propose a deep learning-based approach (EDeepSSP) that employs a convolutional neural network (CNN) architecture for automatic feature extraction and effectively predicts splice sites. Our model, EDeepSSP, opens up the opaque nature of CNNs by extracting significant motifs and explains why these motifs are vital for predicting splice sites. In this study, experiments have been conducted on six benchmark acceptor and donor datasets of humans, cress, and fly. The results show that EDeepSSP has outperformed many state-of-the-art approaches. EDeepSSP achieves the highest area under the receiver operating characteristic curve (AUC_ROC) and area under the precision-recall curve (AUC_PR) of 99.32% and 99.26% on human donor datasets, respectively. We also analyze various filter activities, feature activations, and extracted significant motifs responsible for the splice site prediction. Further, we validate the learned motifs of our model against known motifs of the JASPAR splice site database.
Affiliation(s)
- Santhosh Amilpur
  - Computer Science and Engineering, National Institute of Technology Warangal, Warangal, Telangana 506004, India
- Raju Bhukya
  - Computer Science and Engineering, National Institute of Technology Warangal, Warangal, Telangana 506004, India
12
Measuring Performance Metrics of Machine Learning Algorithms for Detecting and Classifying Transposable Elements. Processes (Basel) 2020. [DOI: 10.3390/pr8060638] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 12/13/2022] Open
Abstract
Because of the promising results obtained by machine learning (ML) approaches in several fields, the use of ML to solve problems in bioinformatics is becoming more common every day. In genomics, a current issue is the detection and classification of transposable elements (TEs), because of the tedious tasks involved in bioinformatics methods. Thus, ML was recently evaluated on TE datasets, demonstrating better results than bioinformatics applications. A crucial step for ML approaches is the selection of metrics that measure the realistic performance of algorithms. Each metric has specific characteristics and measures properties that may be different from the predicted results. Although the most commonly used way to compare measures is empirical analysis, a non-result-based methodology has been proposed, called measure invariance properties. These properties are calculated on the basis of whether a given measure changes its value under certain modifications in the confusion matrix, giving comparative parameters independent of the datasets. Measure invariance properties make metrics more or less informative, particularly on unbalanced, monomodal, or multimodal negative class datasets and for real or simulated datasets. Although several studies have applied ML to detect and classify TEs, there are no works evaluating performance metrics in TE tasks. Here, we analyzed 26 different metrics utilized in binary, multiclass, and hierarchical classifications, through bibliographic sources, and their invariance properties. Then, we corroborated our findings utilizing freely available TE datasets and commonly used ML algorithms. Based on our analysis, the most suitable metrics for TE tasks must be stable, even with highly unbalanced datasets, multimodal negative classes, and training datasets with errors or outliers. Based on these parameters, we conclude that the F1-score and the area under the precision-recall curve are the most informative metrics since they are calculated based on other metrics, providing insight into the development of an ML application.
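One invariance property of the kind this abstract describes can be checked directly: the F1-score does not use the true-negative count, so it keeps its value when only the number of true negatives changes, whereas accuracy does not. The sketch below demonstrates this on two made-up confusion matrices; the counts are illustrative and are not taken from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 from confusion-matrix counts; true negatives are not used."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def accuracy(tp, fp, fn, tn):
    """Fraction of all examples that are classified correctly."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Same positives-related counts, very different numbers of true negatives.
balanced = dict(tp=40, fp=10, fn=10, tn=40)
unbalanced = dict(tp=40, fp=10, fn=10, tn=9940)

f1_balanced = f1_score(balanced["tp"], balanced["fp"], balanced["fn"])
f1_unbalanced = f1_score(unbalanced["tp"], unbalanced["fp"], unbalanced["fn"])
acc_balanced = accuracy(**balanced)
acc_unbalanced = accuracy(**unbalanced)
```

Here F1 is 0.8 in both cases, while accuracy rises from 0.8 to 0.998 purely because the negative class grew, which is why accuracy can look deceptively good on highly unbalanced data.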
13
Yan H, Bombarely A, Li S. DeepTE: a computational method for de novo classification of transposons with convolutional neural network. Bioinformatics 2020; 36:4269-4275. [DOI: 10.1093/bioinformatics/btaa519] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 01/09/2020] [Revised: 04/12/2020] [Accepted: 05/12/2020] [Indexed: 01/23/2023] Open
Abstract
Motivation
Transposable elements (TEs) classification is an essential step to decode their roles in genome evolution. With a large number of genomes from non-model species becoming available, accurate and efficient TE classification has emerged as a new challenge in genomic sequence analysis.
Results
We developed a novel tool, DeepTE, which classifies unknown TEs using convolutional neural networks (CNNs). DeepTE transforms sequences into input vectors based on k-mer counts. A tree-structured classification process was used, in which eight models were trained to classify TEs into superfamilies and orders. DeepTE also detects domains inside TEs to correct false classifications. An additional model was trained to distinguish between non-TEs and TEs in plants. Given unclassified TEs of different species, DeepTE can classify TEs into seven orders, which include 15, 24 and 16 superfamilies in plants, metazoans and fungi, respectively. In several benchmarking tests, DeepTE outperformed other existing tools for TE classification. In conclusion, DeepTE successfully leverages CNNs for TE classification, and can be used to precisely classify TEs in newly sequenced eukaryotic genomes.
Availability and implementation
DeepTE is accessible at https://github.com/LiLabAtVT/DeepTE.
Supplementary information
Supplementary data are available at Bioinformatics online.
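The abstract above states that sequences are turned into input vectors based on k-mer counts. A minimal sketch of that kind of encoding follows; the function name `kmer_count_vector`, the choice of k, the A/C/G/T alphabet, and the decision to skip windows containing other characters are assumptions for illustration and may differ from what DeepTE actually does.

```python
from itertools import product


def kmer_count_vector(sequence, k=3, alphabet="ACGT"):
    """Count every overlapping k-mer and return a fixed-length vector.

    The vector has len(alphabet) ** k entries, one per possible k-mer in
    lexicographic order. Windows with characters outside the alphabet
    (for example N) are skipped, which is an assumption of this sketch.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:
            counts[position] += 1
    return kmers, counts


# Toy sequence; the trailing "GN" window is ignored.
kmers, counts = kmer_count_vector("ACGTACGN", k=2)
```

Because the vector length depends only on k and the alphabet, sequences of different lengths map to inputs of the same shape, which is what a fixed-input classifier needs.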
Affiliation(s)
- Haidong Yan
  - School of Plant and Environmental Sciences (SPES), Virginia Tech, Blacksburg, VA 24061, USA
- Aureliano Bombarely
  - School of Plant and Environmental Sciences (SPES), Virginia Tech, Blacksburg, VA 24061, USA
  - Department of Life Sciences, University of Milan, Milan 20122, Italy
- Song Li
  - School of Plant and Environmental Sciences (SPES), Virginia Tech, Blacksburg, VA 24061, USA
  - Graduate Program in Genetics, Bioinformatics and Computational Biology (GBCB), Virginia Tech, Blacksburg, VA 24061, USA
14
Albaradei S, Magana-Mora A, Thafar M, Uludag M, Bajic VB, Gojobori T, Essack M, Jankovic BR. Splice2Deep: An ensemble of deep convolutional neural networks for improved splice site prediction in genomic DNA. Gene 2020; 763S:100035. [PMID: 32550561 PMCID: PMC7285987 DOI: 10.1016/j.gene.2020.100035] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 03/30/2020] [Accepted: 05/06/2020] [Indexed: 12/21/2022]
Abstract
Background
The accurate identification of exon/intron boundaries is critical for the correct annotation of genes with multiple exons. Donor and acceptor splice sites (SS) demarcate these boundaries. Therefore, deriving accurate computational models to predict SS is useful for functional annotation of genes and genomes, and for finding alternative SS associated with different diseases. Although various models have been proposed for the in silico prediction of SS, improving their accuracy is required for reliable annotation. Moreover, models are often derived and tested using the same genome, providing no evidence of broad application, i.e. to other poorly studied genomes.
Results
With this in mind, we developed the Splice2Deep models for SS detection. Each model is an ensemble of deep convolutional neural networks. We evaluated the performance of the models based on the ability to detect SS in Homo sapiens, Oryza sativa japonica, Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis elegans. Results demonstrate that the models efficiently detect SS in other organisms not considered during the training of the models. Compared to the state-of-the-art tools, Splice2Deep models achieved significantly reduced average error rates of 41.97% and 28.51% for acceptor and donor SS, respectively. Moreover, the Splice2Deep cross-organism validation demonstrates that models correctly identify conserved genomic elements, enabling annotation of SS in new genomes by choosing the taxonomically closest model.
Conclusions
The results of our study demonstrated that Splice2Deep achieved both a considerably reduced error rate compared to other state-of-the-art models and the ability to accurately recognize SS in other organisms for which the model was not trained, enabling annotation of poorly studied or newly sequenced genomes. Splice2Deep models are implemented in Python using the Keras API; the models and the data are available at https://github.com/SomayahAlbaradei/Splice_Deep.git.
Collapse
Key Words
- AUC, area under curve
- AcSS, acceptor splice site
- Acc, accuracy
- Bioinformatics
- CNN, convolutional neural network
- CONV, convolutional layers
- DL, deep learning
- DNA, deoxyribonucleic acid
- DT, decision trees
- Deep-learning
- DoSS, donor splice site
- FC, fully connected layer
- ML, machine learning
- NB, naive Bayes
- NN, neural network
- POOL, pooling layer
- Prediction
- RF, random forest
- RNA, ribonucleic acid
- ReLU, rectified linear unit layer
- SS, splice site
- SVM, support vector machine
- Sn, sensitivity
- Sp, specificity
- Splice sites
- Splicing
Affiliation(s)
- Somayah Albaradei
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia; Faculty of Computing and Information Technology, King Abdulaziz University, Saudi Arabia
- Arturo Magana-Mora
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia; Saudi Aramco, EXPEC-ARC, Drilling Technology Team, Dhahran 31311, Saudi Arabia
- Maha Thafar
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia; Faculty of Computers and Information Systems, Taif University, Saudi Arabia
- Mahmut Uludag
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Vladimir B Bajic
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Takashi Gojobori
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia; Biological and Environmental Sciences and Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Magbubah Essack
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Boris R Jankovic
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
|
15
|
Ma J, Gao X. A filter-based feature construction and feature selection approach for classification using Genetic Programming. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.105806] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Indexed: 12/11/2022]
|
16
|
Hammami M, Bechikh S, Louati A, Makhlouf M, Said LB. Feature construction as a bi-level optimization problem. Neural Comput Appl 2020. [DOI: 10.1007/s00521-020-04784-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 11/25/2022]
|
17
|
Orozco-Arias S, Isaza G, Guyot R, Tabares-Soto R. A systematic review of the application of machine learning in the detection and classification of transposable elements. PeerJ 2019; 7:e8311. [PMID: 31976169 PMCID: PMC6967008 DOI: 10.7717/peerj.8311] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 08/05/2019] [Accepted: 11/28/2019] [Indexed: 12/16/2022] Open
Abstract
Background Transposable elements (TEs) constitute the most common repeated sequences in eukaryotic genomes. Recent studies demonstrated their deep impact on species diversity, adaptation to the environment and diseases. Although there are many conventional bioinformatics algorithms for detecting and classifying TEs, none have achieved reliable results on different types of TEs. Machine learning (ML) techniques can automatically extract hidden patterns and novel information from labeled or non-labeled data and have been applied to solving several scientific problems. Methodology We followed the Systematic Literature Review (SLR) process, applying the six stages of the review protocol from it, but added a previous stage, which aims to detect the need for a review. Then search equations were formulated and executed in several literature databases. Relevant publications were scanned and used to extract evidence to answer research questions. Results Several ML approaches have already been tested on other bioinformatics problems with promising results, yet there are few algorithms and architectures available in literature focused specifically on TEs, despite representing the majority of the nuclear DNA of many organisms. Only 35 articles were found and categorized as relevant in TE or related fields. Conclusions ML is a powerful tool that can be used to address many problems. Although ML techniques have been used widely in other biological tasks, their utilization in TE analyses is still limited. Following the SLR, it was possible to notice that the use of ML for TE analyses (detection and classification) is an open problem, and this new field of research is growing in interest.
Affiliation(s)
- Simon Orozco-Arias
- Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia; Department of Systems and Informatics, Universidad de Caldas, Manizales, Caldas, Colombia
- Gustavo Isaza
- Department of Systems and Informatics, Universidad de Caldas, Manizales, Caldas, Colombia
- Romain Guyot
- Institut de Recherche pour le Développement, CIRAD, University of Montpellier, Montpellier, France; Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
- Reinel Tabares-Soto
- Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia
|
18
|
Splice sites detection using chaos game representation and neural network. Genomics 2019; 112:1847-1852. [PMID: 31704313 DOI: 10.1016/j.ygeno.2019.10.018] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 04/18/2018] [Revised: 03/18/2019] [Accepted: 10/29/2019] [Indexed: 11/23/2022]
Abstract
A novel method is proposed to detect the acceptor and donor splice sites using chaos game representation and artificial neural network. In order to achieve high accuracy, inputs to the neural network, or feature vector, shall reflect the true nature of the DNA segments. Therefore it is important to have one-to-one numerical representation, i.e. a feature vector should be able to represent the original data. Chaos game representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a respective position on the plane in a one-to-one manner. Using CGR, a DNA sequence can be mapped to a numerical sequence that reflects the true nature of the original sequence. In this research, we propose to use CGR as feature input to a neural network to detect splice sites on the NN269 dataset. Computational experiments indicate that this approach gives good accuracy while being simpler than other methods in the literature, with only one neural network component. The code and data for our method can be accessed from this link: https://github.com/thoang3/portfolio/tree/SpliceSites_ANN_CGR.
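The CGR mapping described in this abstract can be illustrated with a minimal sketch. The corner assignment (A, C, G, T placed on the unit square) and the starting point (0.5, 0.5) follow a common convention and are assumptions here, not details taken from the paper; the function name `cgr_points` is hypothetical.

```python
from typing import List, Tuple

# Assumed corner layout on the unit square (a common convention, not confirmed by the paper).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq: str) -> List[Tuple[float, float]]:
    """Map a DNA sequence to its chaos game representation trajectory.

    Starting from the centre of the square, each nucleotide moves the
    current point halfway towards that nucleotide's corner.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points
```

Because every step halves the distance to a corner, each point falls in the quadrant of the most recent nucleotide, which is why the mapping is one-to-one in exact arithmetic. The resulting coordinates could then be flattened into a feature vector for a classifier.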
|
19
|
Veltri D, Kamath U, Shehu A. Deep learning improves antimicrobial peptide recognition. Bioinformatics 2019; 34:2740-2747. [PMID: 29590297 PMCID: PMC6084614 DOI: 10.1093/bioinformatics/bty179] [Citation(s) in RCA: 262] [Impact Index Per Article: 43.7] [Received: 12/14/2017] [Accepted: 03/28/2018] [Indexed: 01/09/2023] Open
Abstract
Motivation Bacterial resistance to antibiotics is a growing concern. Antimicrobial peptides (AMPs), natural components of innate immunity, are popular targets for developing new drugs. Machine learning methods are now commonly adopted by wet-laboratory researchers to screen for promising candidates. Results In this work, we utilize deep learning to recognize antimicrobial activity. We propose a neural network model with convolutional and recurrent layers that leverage primary sequence composition. Results show that the proposed model outperforms state-of-the-art classification models on a comprehensive dataset. By utilizing the embedding weights, we also present a reduced-alphabet representation and show that reasonable AMP recognition can be maintained using nine amino acid types. Availability and implementation Models and datasets are made freely available through the Antimicrobial Peptide Scanner vr.2 web server at www.ampscanner.com. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Daniel Veltri
- Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, U.S. National Institutes of Health, Rockville, MD, USA; Medical Science & Computing, LLC, Rockville, MD, USA
- Amarda Shehu
- Department of Computer Science, George Mason University, Fairfax, VA, USA; Department of Bioengineering, George Mason University, Fairfax, VA, USA; School of Systems Biology, George Mason University, Manassas, VA, USA
|
20
|
A hybrid multiple feature construction approach for classification using Genetic Programming. Appl Soft Comput 2019. [DOI: 10.1016/j.asoc.2019.04.039] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Indexed: 11/18/2022]
|
21
|
Meher PK, Sahu TK, Gahoi S, Satpathy S, Rao AR. Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition. Gene 2019; 705:113-126. [PMID: 31009682 DOI: 10.1016/j.gene.2019.04.047] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 12/20/2018] [Revised: 03/27/2019] [Accepted: 04/17/2019] [Indexed: 02/02/2023]
Abstract
Identification of splice sites is imperative for prediction of gene structure. Machine learning-based approaches (MLAs) have been reported to be more successful than rule-based methods for identification of splice sites. However, nucleotide strings must be transformed into numeric features through sequence encoding before they can be used as input to MLAs. In this study, we evaluated the performance of 8 different sequence encoding schemes, i.e., Bayes kernel, density and sparse (DS), distribution of tri-nucleotide and 1st order Markov model (DM), frequency difference distance measure (FDDM), paired-nucleotide frequency difference between true and false sites (FDTF), 1st order Markov model (MM1), combination of both 1st and 2nd order Markov model (MM1 + MM2) and 2nd order Markov model (MM2), for predicting donor and acceptor splice sites using 5 supervised learning methods (ANN, Bagging, Boosting, RF and SVM). The encoding schemes and machine learning methods were first evaluated in 4 species, i.e., A. thaliana, C. elegans, D. melanogaster and H. sapiens, and then performances were validated with another four species, i.e., Ciona intestinalis, Dictyostelium discoideum, Phaeodactylum tricornutum and Trypanosoma brucei. In terms of ROC (receiver-operating-characteristics) and PR (precision-recall) curves, the FDTF encoding approach achieved the highest accuracy, followed by either MM2 or FDDM. Further, SVM was found to achieve the highest accuracy (in terms of ROC and PR curves), followed by RF, across encoding schemes and species. In terms of prediction accuracy across species, the SVM-FDTF combination performed better than the other combinations of classifiers and encoding schemes. Further, splice site prediction accuracies were observed to be higher for species with low intron density. To our knowledge, this is the first comprehensive evaluation of sequence encoding schemes for prediction of splice sites. We have also developed an R-package EncDNA (https://cran.r-project.org/web/packages/EncDNA/index.html) for encoding of splice site motifs with different encoding schemes, which is expected to supplement the existing nucleotide sequence encoding approaches. This study is believed to be useful for computational biologists predicting different functional elements on genomic DNA.
Affiliation(s)
- Prabina Kumar Meher
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
- Tanmaya Kumar Sahu
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
- Shachi Gahoi
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
- Subhrajit Satpathy
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
|
22
|
|
23
|
Meher PK, Sahu TK, Gahoi S, Tomar R, Rao AR. funbarRF: DNA barcode-based fungal species prediction using multiclass Random Forest supervised learning model. BMC Genet 2019; 20:2. [PMID: 30616524 PMCID: PMC6323839 DOI: 10.1186/s12863-018-0710-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 03/13/2018] [Accepted: 12/26/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Identification of unknown fungal species aids the conservation of fungal diversity. As many fungal species cannot be cultured, morphological identification of those species is almost impossible. However, DNA barcoding can be employed to identify such species. For fungal taxonomy prediction, the ITS (internal transcribed spacer) region of rDNA (ribosomal DNA) is used as the barcode. Though the computational prediction of fungal species has become feasible with the availability of a huge volume of barcode sequences in the public domain, prediction of fungal species is challenging due to the high degree of variability among ITS regions within species. RESULTS A Random Forest (RF)-based predictor was built for identification of unknown fungal species. The reference and query sequences were mapped onto numeric features based on gapped base pair compositions, and then used as training and test sets respectively for prediction of fungal species using RF. More than 85% accuracy was found when 4 sequences per species in the reference set were utilized, whereas accuracy stabilized at ~88% when ≥7 sequences per species in the reference set were used for training of the model. The proposed model achieved comparable accuracy when evaluated against existing methods through a cross-validation procedure. The proposed model also outperformed several existing models used for identification of species other than fungi. CONCLUSIONS An online prediction server "funbarRF" is established at http://cabgrid.res.in:8080/funbarrf/ for fungal species identification. In addition, an R-package funbarRF ( https://cran.r-project.org/web/packages/funbarRF/ ) is available for prediction using high-throughput sequence data. This work is expected to supplement future efforts in fungal taxonomy assignment based on DNA barcodes.
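As a rough illustration of what a gapped base pair composition feature might look like, the sketch below counts nucleotide pairs separated by 0 to `max_gap` intervening positions and normalises the counts to frequencies. The exact definition used by funbarRF is not given in the abstract, so the normalisation, the gap range, and the function name `gapped_pair_composition` are all assumptions.

```python
from itertools import product
from typing import List

BASES = "ACGT"
# The 16 ordered nucleotide pairs, in a fixed order so feature positions are stable.
PAIRS = ["".join(p) for p in product(BASES, repeat=2)]


def gapped_pair_composition(seq: str, max_gap: int = 2) -> List[float]:
    """Return 16 * (max_gap + 1) pair frequencies, one block of 16 per gap size.

    For gap g, the pair (seq[i], seq[i + g + 1]) is counted for every valid i.
    Pairs containing symbols outside ACGT (e.g. N) are skipped.
    """
    seq = seq.upper()
    features: List[float] = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        for i in range(len(seq) - gap - 1):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:
                counts[pair] += 1
                total += 1
        # Frequencies within each gap block sum to 1 when any pair was counted.
        features.extend(counts[p] / total if total else 0.0 for p in PAIRS)
    return features
```

Fixed-length vectors of this kind can be fed to any tabular classifier, such as the Random Forest mentioned in the abstract, regardless of the length of the original barcode sequence.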
Affiliation(s)
- Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
- Tanmaya Kumar Sahu
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
- Shachi Gahoi
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
- Ruchi Tomar
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
- Department of Bioinformatics, Janta Vedic College, Baraut, Baghpat, Uttar Pradesh 250611 India
- Atmakuri Ramakrishna Rao
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
|
24
|
Pashaei E, Aydin N. Markovian encoding models in human splice site recognition using SVM. Comput Biol Chem 2018; 73:159-170. [PMID: 29486390 DOI: 10.1016/j.compbiolchem.2018.02.005] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 05/27/2017] [Revised: 02/04/2018] [Accepted: 02/05/2018] [Indexed: 11/26/2022]
Abstract
Splice site recognition is among the most significant and challenging tasks in bioinformatics due to its key role in gene annotation. Effective prediction of splice sites requires nucleotide encoding methods that reveal the characteristics of DNA sequences and provide appropriate features as input to machine learning classifiers. Markovian models are among the most influential encoding methods and are widely used for pattern recognition in biological data. However, a direct performance comparison of these methods in the splice site domain has not yet been carried out. This study compares various Markovian encoding models for splice site prediction utilizing support vector machine, as the most outstanding learning method in the domain, and conducts a new precise evaluation of Markovian approaches that corrects this limitation. Moreover, a novel sequence encoding approach based on a third order Markov model (MM3) is proposed. The experimental results show that the proposed method, namely MM3-SVM, performs significantly better than thirteen of the best-known state-of-the-art algorithms when tested on the HS3D dataset considering several performance criteria. Further, it achieved higher prediction accuracy than several well-known tools like NNsplice, MEM, MM1, WMM, and GeneID, using an independent test set of 50 genes. We also developed MMSVM, a web tool to predict splice sites in any human sequence using the proposed approach. The MMSVM web server can be accessed at https://pashaei.shinyapps.io/mmsvm.
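A position-specific Markov encoding of the general kind discussed here can be sketched as follows. This is not the MM3-SVM implementation: the add-one smoothing, the truncated context near the start of the window, and the names `fit_positional_markov` and `markov_encode` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List

Counts = List[Dict[str, Dict[str, int]]]


def fit_positional_markov(seqs: List[str], order: int = 3) -> Counts:
    """Count, for each position i, how often each base follows each context.

    The context is the `order` bases preceding position i; it is shorter
    for the first `order` positions of the window.
    """
    length = len(seqs[0])
    counts: Counts = [defaultdict(lambda: defaultdict(int)) for _ in range(length)]
    for s in seqs:
        if len(s) != length:
            raise ValueError("all training sequences must have the same length")
        for i in range(length):
            context = s[max(0, i - order):i]
            counts[i][context][s[i]] += 1
    return counts


def markov_encode(seq: str, counts: Counts, order: int = 3,
                  pseudocount: float = 1.0) -> List[float]:
    """Encode a sequence as smoothed conditional probabilities P(base | context)."""
    if len(seq) != len(counts):
        raise ValueError("sequence length does not match the fitted model")
    features = []
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]
        ctx_counts = counts[i].get(context, {})
        total = sum(ctx_counts.values())
        # Additive smoothing over the four nucleotides avoids zero probabilities.
        features.append((ctx_counts.get(base, 0) + pseudocount) / (total + 4 * pseudocount))
    return features
```

In a typical setup the model would be fitted on aligned true splice site windows (and possibly a second model on false sites), and the resulting probability vectors passed to an SVM.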
Affiliation(s)
- Elham Pashaei
- Department of Computer Engineering, Yildiz Technical University, Istanbul, Turkey.
- Nizamettin Aydin
- Department of Computer Engineering, Yildiz Technical University, Istanbul, Turkey.
|
25
|
Veltri D, Kamath U, Shehu A. Improving Recognition of Antimicrobial Peptides and Target Selectivity through Machine Learning and Genetic Programming. IEEE/ACM Trans Comput Biol Bioinform 2017; 14:300-313. [PMID: 28368808 DOI: 10.1109/tcbb.2015.2462364] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Indexed: 06/07/2023]
Abstract
Growing bacterial resistance to antibiotics is spurring research on utilizing naturally-occurring antimicrobial peptides (AMPs) as templates for novel drug design. While experimentalists mainly focus on systematic point mutations to measure the effect on antibacterial activity, the computational community seeks to understand what determines such activity in a machine learning setting. The latter seeks to identify the biological signals or features that govern activity. In this paper, we advance research in this direction through a novel method that constructs and selects complex sequence-based features which capture information about distal patterns within a peptide. Comparative analysis with state-of-the-art methods in AMP recognition reveals our method is not only among the top performers, but it also provides transparent summarizations of antibacterial activity at the sequence level. Moreover, this paper demonstrates for the first time the capability not only to recognize that a peptide is an AMP or not but also to predict its target selectivity based on models of activity against only Gram-positive, only Gram-negative, or both types of bacteria. The work described in this paper is a step forward in computational research seeking to facilitate AMP design or modification in the wet laboratory.
|
26
|
K. K, P. G. L, Rangarajan L, K. AK. Effective Feature Selection for Classification of Promoter Sequences. PLoS One 2016; 11:e0167165. [PMID: 27978541 PMCID: PMC5158321 DOI: 10.1371/journal.pone.0167165] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/13/2016] [Accepted: 11/09/2016] [Indexed: 11/18/2022] Open
Abstract
Exploring novel computational methods in making sense of biological data has not only been a necessity, but also productive. A part of this trend is the search for more efficient in silico methods/tools for analysis of promoters, which are parts of DNA sequences that are involved in regulation of expression of genes into other functional molecules. Promoter regions vary greatly in their function based on the sequence of nucleotides and the arrangement of protein-binding short-regions called motifs. In fact, the regulatory nature of the promoters seems to be largely driven by the selective presence and/or the arrangement of these motifs. Here, we explore computational classification of promoter sequences based on the pattern of motif distributions, as such classification can pave a new way of functional analysis of promoters and to discover the functionally crucial motifs. We make use of Position Specific Motif Matrix (PSMM) features for exploring the possibility of accurately classifying promoter sequences using some of the popular classification techniques. The classification results on the complete feature set are low, perhaps due to the huge number of features. We propose two ways of reducing features. Our test results show improvement in the classification output after the reduction of features. The results also show that decision trees outperform SVM (Support Vector Machine), KNN (K Nearest Neighbor) and ensemble classifier LibD3C, particularly with reduced features. The proposed feature selection methods outperform some of the popular feature transformation methods such as PCA and SVD. Also, the methods proposed are as accurate as MRMR (feature selection method) but much faster than MRMR. Such methods could be useful to categorize new promoters and explore regulatory mechanisms of gene expressions in complex eukaryotic species.
Affiliation(s)
- Kouser K.
- DoS in Computer Science, Mysore, India
- Acharya Kshitish K.
- Institute of Bioinformatics and Applied Biotechnology (IBAB), Biotech Park, Electronic City, Bengaluru (Bangalore), Karnataka state, India
- Shodhaka Life Sciences Pvt. Ltd., IBAB, Biotech Park, Bengaluru (Bangalore), Karnataka state, India
|
27
|
Van Poucke S, Thomeer M, Heath J, Vukicevic M. Are Randomized Controlled Trials the (G)old Standard? From Clinical Intelligence to Prescriptive Analytics. J Med Internet Res 2016; 18:e185. [PMID: 27383622 PMCID: PMC4954919 DOI: 10.2196/jmir.5549] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Received: 01/21/2016] [Revised: 06/01/2016] [Accepted: 06/21/2016] [Indexed: 12/11/2022] Open
Abstract
Despite the accelerating pace of scientific discovery, the current clinical research enterprise does not sufficiently address pressing clinical questions. Given the constraints on clinical trials, for a majority of clinical questions, the only relevant data available to aid in decision making are based on observation and experience. Our purpose here is 3-fold. First, we describe the classic context of medical research guided by Popper's scientific epistemology of "falsificationism." Second, we discuss challenges and shortcomings of randomized controlled trials and present the potential of observational studies based on big data. Third, we cover several obstacles related to the use of observational (retrospective) data in clinical studies. We conclude that randomized controlled trials are not at risk of extinction, but innovations in statistics, machine learning, and big data analytics may generate a completely new ecosystem for exploration and validation.
Affiliation(s)
- Sven Van Poucke
- Department of Anesthesiology, Critical Care, Emergency Medicine, Pain Therapy, Ziekenhuis Oost-Limburg, Genk, Belgium.
|
28
|
Meher PK, Sahu TK, Rao AR, Wahi SD. Identification of donor splice sites using support vector machine: a computational approach based on positional, compositional and dependency features. Algorithms Mol Biol 2016; 11:16. [PMID: 27252772 PMCID: PMC4888255 DOI: 10.1186/s13015-016-0078-4] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 11/28/2015] [Accepted: 05/17/2016] [Indexed: 11/16/2022] Open
Abstract
Background Identification of splice sites is essential for annotation of genes. Though existing approaches have achieved an acceptable level of accuracy, there is still a need for further improvement. Besides, most of the approaches are species-specific, and hence approaches that are compatible across species are required. Results Each splice site sequence was transformed into a numeric vector of length 49, out of which four were positional, four were dependency and 41 were compositional features. Using the transformed vectors as input, prediction was made through a support vector machine. Using a balanced training set, the proposed approach achieved area under ROC curve (AUC-ROC) of 96.05, 96.96, 96.95, 96.24% and area under PR curve (AUC-PR) of 97.64, 97.89, 97.91, 97.90% when tested on human, cattle, fish and worm datasets respectively. On the other hand, AUC-ROC of 97.21, 97.45, 97.41, 98.06% and AUC-PR of 93.24, 93.34, 93.38, 92.29% were obtained when imbalanced training datasets were used. The proposed approach was found comparable with state-of-the-art splice site prediction approaches when compared using the benchmark NN269 dataset and other datasets. Conclusions The proposed approach achieved consistent accuracy across different species and was found comparable with the existing approaches. Thus, we believe that the proposed approach can be used as a complementary method to the existing methods for the prediction of splice sites. A web server named ‘HSplice’ has also been developed based on the proposed approach for easy prediction of 5′ splice sites by users and is freely available at http://cabgrid.res.in:8080/HSplice.
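For readers unfamiliar with the AUC-ROC figures quoted in this abstract, the metric can be computed directly from classifier scores as the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. The sketch below uses this pairwise definition; it is a generic illustration, not the evaluation code used in the study, and the function name `auc_roc` is hypothetical.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Labels are 1 (true site) or 0 (false site). Ties between a positive and
    a negative score count as half a win. Runs in O(P * N) time, which is
    fine for small examples.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because this quantity depends only on how positives rank relative to negatives, it is insensitive to class proportions, whereas AUC-PR is not, which is consistent with the abstract reporting AUC-PR separately for balanced and imbalanced training sets.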
|
29
|
Kulshreshtha S, Chaudhary V, Goswami GK, Mathur N. Computational approaches for predicting mutant protein stability. J Comput Aided Mol Des 2016; 30:401-12. [DOI: 10.1007/s10822-016-9914-3] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.2] [Received: 02/09/2016] [Accepted: 05/02/2016] [Indexed: 11/24/2022]
|
30
|
Katsonis P, Koire A, Wilson SJ, Hsu TK, Lua RC, Wilkins AD, Lichtarge O. Single nucleotide variations: biological impact and theoretical interpretation. Protein Sci 2014; 23:1650-66. [PMID: 25234433 PMCID: PMC4253807 DOI: 10.1002/pro.2552] [Citation(s) in RCA: 88] [Impact Index Per Article: 8.0] [Received: 08/06/2014] [Revised: 09/12/2014] [Accepted: 09/15/2014] [Indexed: 12/27/2022]
Abstract
Genome-wide association studies (GWAS) and whole-exome sequencing (WES) generate massive amounts of genomic variant information, and a major challenge is to identify which variations drive disease or contribute to phenotypic traits. Because the majority of known disease-causing mutations are exonic non-synonymous single nucleotide variations (nsSNVs), most studies focus on whether these nsSNVs affect protein function. Computational studies show that the impact of nsSNVs on protein function reflects sequence homology and structural information and predict the impact through statistical methods, machine learning techniques, or models of protein evolution. Here, we review impact prediction methods and discuss their underlying principles, their advantages and limitations, and how they compare to and complement one another. Finally, we present current applications and future directions for these methods in biological research and medical genetics.
Affiliation(s)
- Panagiotis Katsonis
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
- Amanda Koire
- Department of Structural and Computational Biology and Molecular Biophysics, Houston, Texas
- Stephen Joseph Wilson
- Department of Biochemistry and Molecular Biology, Baylor College of Medicine, Houston, Texas
- Teng-Kuei Hsu
- Department of Biochemistry and Molecular Biology, Baylor College of Medicine, Houston, Texas
- Rhonald C Lua
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
- Angela Dawn Wilkins
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
- Computational and Integrative Biomedical Research Center, Baylor College of Medicine, Houston, Texas
- Olivier Lichtarge
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
- Department of Structural and Computational Biology and Molecular Biophysics, Houston, Texas
- Department of Biochemistry and Molecular Biology, Baylor College of Medicine, Houston, Texas
- Computational and Integrative Biomedical Research Center, Baylor College of Medicine, Houston, Texas
- Department of Pharmacology, Baylor College of Medicine, Houston, Texas
|