1
Liu X, Zhang H, Zeng Y, Zhu X, Zhu L, Fu J. DRANetSplicer: A Splice Site Prediction Model Based on Deep Residual Attention Networks. Genes (Basel) 2024; 15:404. [PMID: 38674339 PMCID: PMC11048956 DOI: 10.3390/genes15040404] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Revised: 03/20/2024] [Accepted: 03/23/2024] [Indexed: 04/28/2024] Open
Abstract
The precise identification of splice sites is essential for unraveling the structure and function of genes, constituting a pivotal step in the gene annotation process. In this study, we developed a novel deep learning model, DRANetSplicer, that integrates residual learning and attention mechanisms for enhanced accuracy in capturing the intricate features of splice sites. We constructed multiple datasets using the most recent versions of genomic data from three different organisms, Oryza sativa japonica, Arabidopsis thaliana and Homo sapiens. This approach allows us to train models with a richer set of high-quality data. DRANetSplicer outperformed benchmark methods on donor and acceptor splice site datasets, achieving average accuracies of 96.57% and 95.82%, respectively, across the three organisms. Comparative analyses with benchmark methods, including SpliceFinder, Splice2Deep, Deep Splicer, EnsembleSplice, and DNABERT, revealed DRANetSplicer's superior predictive performance, with relative reductions in average error rate of at least 4.2% (donor) and 11.6% (acceptor). We used the DRANetSplicer model trained on O. sativa japonica data to predict splice sites in A. thaliana, achieving accuracies of 94.89% and 94.25% for donor and acceptor sites, respectively. These results indicate that DRANetSplicer possesses excellent cross-organism predictive capabilities, with its performance in cross-organism predictions even surpassing that of benchmark methods in non-cross-organism predictions. Cross-organism validation showcased DRANetSplicer's excellence in predicting splice sites across similar organisms, supporting its applicability in gene annotation for understudied organisms. We employed multiple methods to visualize the decision-making process of the model. The visualization results indicate that DRANetSplicer can learn and interpret well-known biological features, further validating its overall performance. Our study systematically examined and confirmed the predictive ability of DRANetSplicer from various levels and perspectives, indicating that its practical application in gene annotation is justified.
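The "relative reduction in average error rate" quoted in this abstract can be made concrete with a little arithmetic. The sketch below assumes the usual definitions (error rate = 100% minus accuracy; relative reduction = the drop in error divided by the baseline error). The abstract does not spell out its formula, and the baseline accuracy used in the example is a hypothetical number chosen for illustration, not a result from the paper.

```python
def error_rate(accuracy_pct: float) -> float:
    """Error rate in percent, assuming error = 100 - accuracy."""
    return 100.0 - accuracy_pct


def relative_error_reduction(model_acc_pct: float, baseline_acc_pct: float) -> float:
    """Relative reduction (in percent) of the baseline error rate achieved by the model."""
    baseline_err = error_rate(baseline_acc_pct)
    model_err = error_rate(model_acc_pct)
    if baseline_err <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline_err - model_err) / baseline_err


# 96.57 is the donor accuracy quoted in the abstract; 96.00 is a HYPOTHETICAL baseline.
reduction = relative_error_reduction(96.57, 96.00)
print(f"{reduction:.2f}% relative error reduction")
```

With these inputs the error rates are 3.43% and 4.00%, so the relative reduction is 0.57 / 4.00, or about 14.25%.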
Affiliation(s)
- Xueyan Liu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Hongyan Zhang
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Ying Zeng
  - School of Computer and Communication, Hunan Institute of Engineering, Xiangtan 411104, China
- Xinghui Zhu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Lei Zhu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Jiahui Fu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
2
Ditz JC, Reuter B, Pfeifer N. Inherently interpretable position-aware convolutional motif kernel networks for biological sequencing data. Sci Rep 2023; 13:17216. [PMID: 37821530 PMCID: PMC10567796 DOI: 10.1038/s41598-023-44175-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Accepted: 10/04/2023] [Indexed: 10/13/2023] Open
Abstract
Artificial neural networks show promising performance in detecting correlations within data that are associated with specific outcomes. However, the black-box nature of such models can hinder the advancement of knowledge in research fields by obscuring the decision process and preventing scientists from fully conceptualizing predicted outcomes. Furthermore, domain experts like healthcare providers need explainable predictions to assess whether a predicted outcome can be trusted in high-stakes scenarios and to help them integrate a model into their own routine. Therefore, interpretable models play a crucial role in the incorporation of machine learning into high-stakes scenarios like healthcare. In this paper we introduce Convolutional Motif Kernel Networks, a neural network architecture that involves learning a feature representation within a subspace of the reproducing kernel Hilbert space of the position-aware motif kernel function. The resulting model makes it possible to directly interpret and evaluate prediction outcomes by providing a biologically and medically meaningful explanation without the need for additional post-hoc analysis. We show that our model is able to learn robustly on small datasets and reaches state-of-the-art performance on relevant healthcare prediction tasks. Our proposed method can be applied to DNA and protein sequences. Furthermore, we show that the proposed method learns biologically meaningful concepts directly from data using an end-to-end learning scheme.
Affiliation(s)
- Jonas C Ditz
  - Methods in Medical Informatics, Department of Computer Science, University of Tübingen, Sand 14, Tübingen, 72076, Germany
- Bernhard Reuter
  - Methods in Medical Informatics, Department of Computer Science, University of Tübingen, Sand 14, Tübingen, 72076, Germany
- Nico Pfeifer
  - Methods in Medical Informatics, Department of Computer Science, University of Tübingen, Sand 14, Tübingen, 72076, Germany
3
Wagner N, Çelik MH, Hölzlwimmer FR, Mertes C, Prokisch H, Yépez VA, Gagneur J. Aberrant splicing prediction across human tissues. Nat Genet 2023; 55:861-870. [PMID: 37142848 DOI: 10.1038/s41588-023-01373-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 04/05/2022] [Accepted: 03/14/2023] [Indexed: 05/06/2023]
Abstract
Aberrant splicing is a major cause of genetic disorders, but its direct detection in transcriptomes is limited to clinically accessible tissues such as skin or body fluids. While DNA-based machine learning models can prioritize rare variants for affecting splicing, their performance in predicting tissue-specific aberrant splicing remains unassessed. Here we generated an aberrant splicing benchmark dataset, spanning over 8.8 million rare variants in 49 human tissues from the Genotype-Tissue Expression (GTEx) dataset. At 20% recall, state-of-the-art DNA-based models achieve a maximum of 12% precision. By mapping and quantifying tissue-specific splice site usage transcriptome-wide and modeling isoform competition, we increased precision threefold at the same recall. Integrating RNA-sequencing data of clinically accessible tissues into our model, AbSplice, brought precision to 60%. These results, replicated in two independent cohorts, substantially contribute to noncoding loss-of-function variant identification and to genetic diagnostics design and analytics.
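The abstract reports precision at a fixed recall of 20%. The sketch below shows one way such a number can be read off a ranked list of predictions. It assumes one common convention (the best precision among all score cut-offs whose recall reaches the target) and ignores tied scores; the benchmark's actual evaluation code may differ, and the scores and labels in the example are invented for illustration.

```python
def precision_at_recall(scores, labels, target_recall):
    """Best precision among rank cut-offs whose recall is >= target_recall.

    scores: predicted scores (higher = more likely positive)
    labels: 1 for a true positive example, 0 otherwise
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank predictions from highest to lowest score (ties are not treated specially).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    best = 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        recall = true_pos / total_pos
        if recall >= target_recall:
            best = max(best, true_pos / k)
    return best


# Invented toy data: five ranked predictions, two of which are true positives.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 0, 0]
print(precision_at_recall(toy_scores, toy_labels, 0.5))
print(precision_at_recall(toy_scores, toy_labels, 1.0))
```

On the toy data, half of the positives are recovered by the top-ranked prediction alone, so precision at 50% recall is 1.0, whereas recovering both positives requires the top three predictions and gives a precision of 2/3.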
Affiliation(s)
- Nils Wagner
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Helmholtz Association - Munich School for Data Science (MUDS), Munich, Germany
- Muhammed H Çelik
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
- Florian R Hölzlwimmer
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Christian Mertes
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Munich Data Science Institute, Technical University of Munich, Garching, Germany
- Holger Prokisch
  - Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany
  - Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany
- Vicente A Yépez
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Julien Gagneur
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Helmholtz Association - Munich School for Data Science (MUDS), Munich, Germany
  - Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany
  - Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany
4
Akpokiro V, Martin T, Oluwadare O. EnsembleSplice: ensemble deep learning model for splice site prediction. BMC Bioinformatics 2022; 23:413. [PMID: 36203144 PMCID: PMC9535948 DOI: 10.1186/s12859-022-04971-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2022] [Accepted: 09/29/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Identifying splice site regions is an important step in the genomic DNA sequencing pipelines of biomedical and pharmaceutical research. Within this research purview, efficient and accurate splice site detection is highly desirable, and a variety of computational models have been developed toward this end. Neural network architectures have recently been shown to outperform classical machine learning approaches for the task of splice site prediction. Despite these advances, there is still considerable potential for improvement, especially regarding model prediction accuracy and error rate. RESULTS: Given these deficits, we propose EnsembleSplice, an ensemble learning architecture that combines four distinct convolutional neural network (CNN) model architectures and outperforms existing splice site detection methods on the evaluation metrics considered, including accuracy and error rate. We trained and tested a variety of ensembles made up of CNNs and DNNs using five-fold cross-validation to identify the model that performed best across the evaluation and diversity metrics. As a result, we developed a diverse and highly effective splice site (SS) detection model, which we evaluated using two genomic Homo sapiens datasets and the Arabidopsis thaliana dataset. For one of the Homo sapiens datasets, EnsembleSplice achieved accuracies of 94.16% for acceptor splice sites and 95.97% for donor splice sites, corresponding to error rates of 5.84% and 4.03%, respectively. CONCLUSIONS: Our five-fold cross-validation ensured that the prediction accuracy of our models is consistent. For reproducibility, all the datasets used, models generated, and results in our work are publicly available in our GitHub repository: https://github.com/OluwadareLab/EnsembleSplice.
Affiliation(s)
- Victor Akpokiro
  - Department of Computer Science, University of Colorado, Colorado Springs, CO, 80918, USA
- Trevor Martin
  - Department of Mathematics, Oberlin College, Oberlin, OH, 44074, USA
- Oluwatosin Oluwadare
  - Department of Computer Science, University of Colorado, Colorado Springs, CO, 80918, USA
5
Scalzitti N, Kress A, Orhand R, Weber T, Moulinier L, Jeannin-Girardon A, Collet P, Poch O, Thompson JD. Spliceator: multi-species splice site prediction using convolutional neural networks. BMC Bioinformatics 2021; 22:561. [PMID: 34814826 PMCID: PMC8609763 DOI: 10.1186/s12859-021-04471-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 07/04/2021] [Accepted: 11/09/2021] [Indexed: 12/14/2022] Open
Abstract
Background: Ab initio prediction of splice sites is an essential step in eukaryotic genome annotation. Recent predictors have exploited Deep Learning algorithms and reliable gene structures from model organisms. However, Deep Learning methods for non-model organisms are lacking. Results: We developed Spliceator to predict splice sites in a wide range of species, including model and non-model organisms. Spliceator uses a convolutional neural network and is trained on carefully validated data from over 100 organisms. We show that Spliceator achieves consistently high accuracy (89–92%) compared to existing methods on independent benchmarks from human, fish, fly, worm, plant and protist organisms. Conclusions: Spliceator is a new Deep Learning method trained on high-quality data, which can be used to predict splice sites in diverse organisms, ranging from human to protists, with consistently high accuracy. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04471-3.
Affiliation(s)
- Nicolas Scalzitti
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Arnaud Kress
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
  - BiGEst-ICube Platform, ICube Laboratory, UMR7357, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Romain Orhand
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Thomas Weber
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Luc Moulinier
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
  - BiGEst-ICube Platform, ICube Laboratory, UMR7357, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Anne Jeannin-Girardon
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Pierre Collet
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Olivier Poch
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
- Julie D Thompson
  - Complex Systems and Translational Bioinformatics (CSTB), ICube Laboratory, UMR7357, University of Strasbourg, 1 rue Eugène Boeckel, 67000, Strasbourg, France
6
Perez-Rodriguez J, de Haro-Garcia A, Garcia-Pedrajas N. Floating Search Methodology for Combining Classification Models for Site Recognition in DNA Sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2471-2482. [PMID: 32078558 DOI: 10.1109/tcbb.2020.2974221] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/10/2023]
Abstract
Recognition of the functional sites of genes, such as translation initiation sites, donor and acceptor splice sites and stop codons, is a relevant part of many current problems in bioinformatics. The best approaches use sophisticated classifiers, such as support vector machines. However, with the rapid accumulation of sequence data, methods for combining many sources of evidence are necessary, as it is unlikely that a single classifier can solve this problem with the best possible performance. A major issue is that the number of possible models to combine is large and the use of all of these models is impractical. In this paper we present a methodology for combining many sources of information to recognize any functional site using "floating search", a powerful heuristic applicable when the cost of evaluating each solution is high. We present experiments on four functional sites in the human genome, which is used as the target genome, and use another 20 species as sources of evidence. The proposed methodology shows significant improvement over state-of-the-art methods. The results show an advantage of the proposed method and also challenge the standard assumption of using only genomes not very close to and not very far from the human genome to improve the recognition of functional sites.
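As a rough illustration of the "floating search" idea, the sketch below implements a simplified sequential floating forward selection: a greedy forward step adds the most helpful candidate, then conditional backward steps drop a member whenever that yields a better subset of the smaller size than any seen so far. This assumes the abstract refers to that family of heuristics; the paper's actual procedure, model pool and scoring function are not given here, and `evaluate`, the candidate names and the weights in the example are hypothetical.

```python
def floating_forward_selection(candidates, evaluate, max_size):
    """Simplified sequential floating forward selection.

    candidates: list of hashable items (e.g. identifiers of models to combine)
    evaluate:   function mapping a list of items to a score (higher is better)
    Returns (best_score, best_subset) over all subset sizes visited.
    """
    selected = []
    best_by_size = {}  # subset size -> (score, subset)
    while len(selected) < max_size:
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        # Forward step: add the candidate that gives the best score.
        best_add = max(remaining, key=lambda c: evaluate(selected + [c]))
        selected = selected + [best_add]
        score = evaluate(selected)
        size = len(selected)
        if size not in best_by_size or score > best_by_size[size][0]:
            best_by_size[size] = (score, list(selected))
        # Floating backward steps: remove a member while that strictly improves
        # on the best subset of the smaller size found so far.
        while len(selected) > 2:
            drop = max(selected, key=lambda c: evaluate([s for s in selected if s != c]))
            reduced = [s for s in selected if s != drop]
            reduced_score = evaluate(reduced)
            if reduced_score > best_by_size[len(reduced)][0]:
                selected = reduced
                best_by_size[len(reduced)] = (reduced_score, list(reduced))
            else:
                break
    return max(best_by_size.values(), key=lambda item: item[0])


# Hypothetical scoring: each model has a weight, and larger ensembles pay a size penalty.
weights = {"a": 3.0, "b": 2.0, "c": 1.0}
toy_evaluate = lambda subset: sum(weights[m] for m in subset) - 0.5 * len(subset) ** 2
print(floating_forward_selection(["a", "b", "c"], toy_evaluate, max_size=3))
```

Each call to `evaluate` stands in for training and validating a combination of classifiers, which is why the heuristic tries to visit few subsets rather than enumerate them all.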
7
Meher PK, Satpathy S. Improved recognition of splice sites in A. thaliana by incorporating secondary structure information into sequence-derived features: a computational study. 3 Biotech 2021; 11:484. [PMID: 34790508 DOI: 10.1007/s13205-021-03036-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/29/2021] [Accepted: 10/18/2021] [Indexed: 10/19/2022] Open
Abstract
Identification of splice sites is an important aspect of gene structure prediction. In most existing splice site prediction studies, machine learning algorithms coupled with sequence-derived features have been successfully employed for splice site recognition. However, splice site identification that incorporates secondary structure information is lacking, particularly in plant species. Thus, in this study we evaluated the effect of structural features on splice site prediction accuracy in Arabidopsis thaliana. Prediction accuracies were evaluated with the sequence-derived features alone as well as with the structural features incorporated into the sequence-derived features, where a support vector machine (SVM) was employed as the prediction algorithm. Both short (40 base pairs) and long (105 base pairs) sequence datasets were considered for evaluation. After incorporating the secondary structure features, improvements in accuracy were observed only for the longer sequence dataset, and the improvement was higher with the sequence-derived features that accounted for nucleotide dependencies. On the other hand, little or no improvement in accuracy was found for the short sequence dataset. The performance of SVM was further compared with that of the LogitBoost, Random Forest (RF), AdaBoost and XGBoost machine learning methods. The prediction accuracies of SVM, AdaBoost and XGBoost were observed to be on par and higher than those of the RF and LogitBoost algorithms. When prediction was performed using all the sequence-derived features along with the structural features, a small improvement in accuracy was found compared with the combination of individual sequence-based features and structural features. To the best of our knowledge, this is the first attempt at computational prediction of splice sites using machine learning methods that incorporates secondary structure information into the sequence-derived features. All the source code is available at https://github.com/meher861982/SSFeature. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s13205-021-03036-8.
8
Das L, Das JK, Mohapatra S, Nanda S. DNA numerical encoding schemes for exon prediction: a recent history. Nucleosides, Nucleotides & Nucleic Acids 2021; 40:985-1017. [PMID: 34455915 DOI: 10.1080/15257770.2021.1966797] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
Bioinformatics is now firmly established as a driving force in genomics. In recent times, applications of signal processing in exon prediction have gained a lot of attention. Exons carry protein-coding information. Proteins are composed of connected constituents known as amino acids, which characterize their specific function. Converting the nucleotide character string into a numerical sequence is the necessary first step before analyzing it with signal processing methods. This numeric encoding is the mathematical descriptor of nucleotides and is based on some statistical properties of the structure of nucleic acids. Since the type of encoding strongly affects exon detection accuracy, this paper reviews existing encoding (mapping) schemes. The comparative analysis emphasizes the importance of the genetic code setting of amino acids in applications related to the computational elucidation of exons. This work provides useful information for future applications.
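To make the idea of numerical encoding concrete, the sketch below shows two simple mappings from a nucleotide string to numbers. The abstract does not name the schemes it reviews; the binary indicator representation (often attributed to Voss) is used here as a widely known example, and the integer mapping is an arbitrary illustrative choice rather than a recommended scheme.

```python
BASES = "ACGT"

# Arbitrary illustrative integer codes; real schemes choose values with care,
# because the numeric ordering imposes structure on the resulting signal.
INTEGER_MAP = {"A": 0, "C": 1, "G": 2, "T": 3}


def indicator_encode(seq):
    """Binary indicator encoding: one 0/1 sequence per base."""
    seq = seq.upper()
    return {base: [1 if ch == base else 0 for ch in seq] for base in BASES}


def integer_encode(seq):
    """Single numeric sequence using the illustrative integer codes above."""
    return [INTEGER_MAP[ch] for ch in seq.upper()]


encoded = indicator_encode("ACGTA")
for base in BASES:
    print(base, encoded[base])
print(integer_encode("ACGTA"))
```

The indicator form yields four parallel signals that can each be passed to a transform such as the discrete Fourier transform, while the integer form yields a single signal whose values depend on the chosen codes.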
Affiliation(s)
- Lopamudra Das
  - School of Electronics Engineering, KIIT, Bhubaneswar, India
- J K Das
  - School of Electronics Engineering, KIIT, Bhubaneswar, India
- S Mohapatra
  - School of Electronics Engineering, KIIT, Bhubaneswar, India
- Sarita Nanda
  - School of Electronics Engineering, KIIT, Bhubaneswar, India
9
Zrimec J, Buric F, Kokina M, Garcia V, Zelezniak A. Learning the Regulatory Code of Gene Expression. Front Mol Biosci 2021; 8:673363. [PMID: 34179082 PMCID: PMC8223075 DOI: 10.3389/fmolb.2021.673363] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/27/2021] [Accepted: 05/24/2021] [Indexed: 11/13/2022] Open
Abstract
Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.
Affiliation(s)
- Jan Zrimec
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Filip Buric
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Mariia Kokina
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
- Victor Garcia
  - School of Life Sciences and Facility Management, Zurich University of Applied Sciences, Wädenswil, Switzerland
- Aleksej Zelezniak
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
  - Science for Life Laboratory, Stockholm, Sweden
10
Dutta A, Singh KK, Anand A. SpliceViNCI: Visualizing the splicing of non-canonical introns through recurrent neural networks. J Bioinform Comput Biol 2021; 19:2150014. [PMID: 34088258 DOI: 10.1142/s0219720021500141] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
Most of the current computational models for splice junction prediction are based on the identification of canonical splice junctions. However, junctions lacking the consensus dimers GT and AG are also observed to undergo splicing. Identification of such splice junctions, called non-canonical splice junctions, is also essential for a comprehensive understanding of the splicing phenomenon. This work focuses on the identification of non-canonical splice junctions through the application of a bidirectional long short-term memory (BLSTM) network. Furthermore, we apply a back-propagation-based (integrated gradient) and a perturbation-based (occlusion) visualization technique to extract the non-canonical splicing features learned by the model. The features obtained are validated against existing knowledge from the literature. Integrated gradient extracts features that comprise contiguous nucleotides, whereas occlusion extracts features that are individual nucleotides distributed across the sequence.
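The perturbation-based (occlusion) technique mentioned in this abstract can be sketched in a few lines: mask one position at a time and record how much the model's score drops. The code below is a minimal illustration under stated assumptions. `toy_score` is a hypothetical stand-in for a trained network, the `N` mask symbol is an arbitrary choice, and the motif it looks for is arbitrary and not a claim about splicing biology.

```python
def occlusion_scores(seq, score_fn, mask="N"):
    """Per-position importance: drop in score when that position is masked."""
    baseline = score_fn(seq)
    importances = []
    for i in range(len(seq)):
        occluded = seq[:i] + mask + seq[i + 1:]
        importances.append(baseline - score_fn(occluded))
    return importances


def toy_score(seq):
    """Hypothetical model: scores 1.0 only if positions 2-3 spell the arbitrary motif 'AT'."""
    return 1.0 if seq[2:4] == "AT" else 0.0


print(occlusion_scores("CCATCC", toy_score))
```

For this toy model, only the two motif positions receive a non-zero importance, since masking any other position leaves the score unchanged.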
Affiliation(s)
- Aparajita Dutta
  - Department of CSE, Indian Institute of Technology, Guwahati, India
- Ashish Anand
  - Department of CSE, Indian Institute of Technology, Guwahati, India
11
Cheng J, Çelik MH, Kundaje A, Gagneur J. MTSplice predicts effects of genetic variants on tissue-specific splicing. Genome Biol 2021; 22:94. [PMID: 33789710 PMCID: PMC8011109 DOI: 10.1186/s13059-021-02273-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 06/06/2020] [Accepted: 01/14/2021] [Indexed: 12/20/2022] Open
Abstract
We develop the free and open-source model Multi-tissue Splicing (MTSplice) to predict the effects of genetic variants on splicing of cassette exons in 56 human tissues. MTSplice combines MMSplice, which models constitutive regulatory sequences, with a new neural network that models tissue-specific regulatory sequences. MTSplice outperforms MMSplice on predicting tissue-specific variations associated with genetic variants in most tissues of the GTEx dataset, with largest improvements on brain tissues. Furthermore, MTSplice predicts that autism-associated de novo mutations are enriched for variants affecting splicing specifically in the brain. We foresee that MTSplice will aid interpreting variants associated with tissue-specific disorders.
Affiliation(s)
- Jun Cheng
  - Department of Informatics, Technical University of Munich, Boltzmannstraße, Garching, 85748, Germany
- Muhammed Hasan Çelik
  - Department of Informatics, Technical University of Munich, Boltzmannstraße, Garching, 85748, Germany
- Anshul Kundaje
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Department of Genetics, Stanford University, Stanford, CA, USA
- Julien Gagneur
  - Department of Informatics, Technical University of Munich, Boltzmannstraße, Garching, 85748, Germany
  - Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany
  - Institute of Human Genetics, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany
12
Moosa S, Amira PA, Boughorbel DS. DASSI: differential architecture search for splice identification from DNA sequences. BioData Min 2021; 14:15. [PMID: 33588916 PMCID: PMC7885202 DOI: 10.1186/s13040-021-00237-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/01/2020] [Accepted: 01/05/2021] [Indexed: 11/28/2022] Open
Abstract
Background: The data explosion caused by unprecedented advancements in the field of genomics is constantly challenging the conventional methods used in the interpretation of the human genome. The demand for robust algorithms over recent years has brought huge success in the field of Deep Learning (DL) in solving many difficult tasks in image, speech and natural language processing by automating the manual process of architecture design. This has been fueled by the development of new DL architectures. Yet genomics poses unique challenges that require customization and development of new DL models. Methods: We proposed a new model, DASSI, by adapting a differential architecture search method and applying it to the Splice Site (SS) recognition task on DNA sequences to discover new high-performance convolutional architectures in an automated manner. We evaluated the discovered model against state-of-the-art tools to classify true and false SS in Homo sapiens (Human), Arabidopsis thaliana (Plant), Caenorhabditis elegans (Worm) and Drosophila melanogaster (Fly). Results: Our experimental evaluation demonstrated that the discovered architecture outperformed baseline models and fixed architectures and showed competitive results against state-of-the-art models used in classification of splice sites. The proposed model, DASSI, has a compact architecture and showed very good results on a transfer learning task. The benchmarking experiments of execution time and precision on the architecture search and evaluation process showed better performance on recently available GPUs, making it feasible to adopt architecture search based methods on large datasets. Conclusions: We proposed the use of a differential architecture search method (DASSI) to perform SS classification on raw DNA sequences, and discovered new neural network models with a low number of tunable parameters and competitive performance compared with manually engineered architectures. We have extensively benchmarked the DASSI model against other state-of-the-art models and assessed its computational efficiency. The results have shown a high potential of using an automated architecture search mechanism for solving various problems in the field of genomics.
Affiliation(s)
- Shabir Moosa
  - Department of Systems Biology, SIDRA Medicine, Doha, 26999, Qatar
  - Dept. of Computer Science and Engineering, Qatar University, Doha, 2713, Qatar
- Prof Abbes Amira
  - Dept. of Computer Science and Engineering, Qatar University, Doha, 2713, Qatar
13
Amilpur S, Bhukya R. EDeepSSP: Explainable deep neural networks for exact splice sites prediction. J Bioinform Comput Biol 2020; 18:2050024. [PMID: 32696716 DOI: 10.1142/s0219720020500249] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/18/2022]
Abstract
Splice site prediction is crucial for understanding gene regulation and gene function and for better genome annotation. Many computational methods exist for recognizing splice sites. Although most of the methods achieve competent performance, their interpretability remains challenging. Moreover, all traditional machine learning methods extract features manually, which is a tedious job. To address these challenges, we propose a deep learning-based approach (EDeepSSP) that employs a convolutional neural network (CNN) architecture for automatic feature extraction and effectively predicts splice sites. Our model, EDeepSSP, exposes the otherwise opaque workings of the CNN by extracting significant motifs and explains why these motifs are vital for predicting splice sites. In this study, experiments were conducted on six benchmark acceptor and donor datasets from humans, cress, and fly. The results show that EDeepSSP outperformed many state-of-the-art approaches. EDeepSSP achieves the highest area under the receiver operating characteristic curve (AUC_ROC) and area under the precision-recall curve (AUC_PR) of 99.32% and 99.26% on human donor datasets, respectively. We also analyze various filter activities, feature activations, and extracted significant motifs responsible for the splice site prediction. Further, we validate the learned motifs of our model against the known motifs in the JASPAR splice site database.
Affiliation(s)
- Santhosh Amilpur
- Computer Science and Engineering, National Institute of Technology Warangal, Warangal, Telangana 506004, India
| | - Raju Bhukya
- Computer Science and Engineering, National Institute of Technology Warangal, Warangal, Telangana 506004, India
| |
|
14
|
Thanapattheerakul T, Engchuan W, Chan JH. Predicting the effect of variants on splicing using Convolutional Neural Networks. PeerJ 2020; 8:e9470. [PMID: 32704450 PMCID: PMC7346860 DOI: 10.7717/peerj.9470] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/25/2019] [Accepted: 06/11/2020] [Indexed: 11/23/2022] Open
Abstract
Mutations that cause an error in the splicing of a messenger RNA (mRNA) can lead to diseases in humans. Various computational models have been developed to recognize the sequence pattern of the splice sites. In recent studies, Convolutional Neural Network (CNN) architectures were shown to outperform other existing models in predicting the splice sites. However, insufficient effort has been put into extending the CNN model to predict the effect of genomic variants on the splicing of mRNAs. This study proposes a framework to elaborate on the utility of CNNs to assess the effect of splice variants on the identification of potential disease-causing variants that disrupt the RNA splicing process. Five models, including three CNN-based and two non-CNN machine learning-based models, were trained and compared using two existing splice site datasets, Genome Wide Human splice sites (GWH) and a dataset provided at the Deep Learning and Artificial Intelligence winter school 2018 (DLAI). The donor sites were also tested on the HSplice tool to evaluate the predictive models. To improve the effectiveness of the predictive models, the two datasets were combined. The CNN model with four convolutional layers showed the best splice site prediction performance, with an AUPRC of 93.4% and 88.8% for donor and acceptor sites, respectively. The effects of variants on splicing were estimated by applying the best model to variant data from the ClinVar database. Based on the estimation, the framework could effectively differentiate pathogenic variants from benign variants (p = 5.9 × 10⁻⁷). These promising results support that the proposed framework could be applied in future genetic studies to identify disease-causing loci involving the splicing mechanism. The datasets and Python scripts used in this study are available on the GitHub repository at https://github.com/smiile8888/rna-splice-sites-recognition.
Affiliation(s)
| | - Worrawat Engchuan
- Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada.,The Centre for Applied Genomics, The Hospital of Sick Children, Toronto, Ontario, Canada
| | - Jonathan H Chan
- School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand.,IC2-DLab, School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand
| |
|
15
|
Mahood EH, Kruse LH, Moghe GD. Machine learning: A powerful tool for gene function prediction in plants. Applications in Plant Sciences 2020; 8:e11376. [PMID: 32765975 PMCID: PMC7394712 DOI: 10.1002/aps3.11376] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Received: 10/01/2019] [Accepted: 03/19/2020] [Indexed: 05/06/2023]
Abstract
Recent advances in sequencing and informatic technologies have led to a deluge of publicly available genomic data. While it is now relatively easy to sequence, assemble, and identify genic regions in diploid plant genomes, functional annotation of these genes is still a challenge. Over the past decade, there has been a steady increase in studies utilizing machine learning algorithms for various aspects of functional prediction, because these algorithms are able to integrate large amounts of heterogeneous data and detect patterns inconspicuous through rule-based approaches. The goal of this review is to introduce experimental plant biologists to machine learning, by describing how it is currently being used in gene function prediction to gain novel biological insights. In this review, we discuss specific applications of machine learning in identifying structural features in sequenced genomes, predicting interactions between different cellular components, and predicting gene function and organismal phenotypes. Finally, we also propose strategies for stimulating functional discovery using machine learning-based approaches in plants.
Affiliation(s)
- Elizabeth H. Mahood
- Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York 14853, USA
| | - Lars H. Kruse
- Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York 14853, USA
| | - Gaurav D. Moghe
- Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York 14853, USA
| |
|
16
|
Payrovnaziri SN, Chen Z, Rengifo-Moreno P, Miller T, Bian J, Chen JH, Liu X, He Z. Explainable artificial intelligence models using real-world electronic health record data: a systematic scoping review. J Am Med Inform Assoc 2020; 27:1173-1185. [PMID: 32417928 PMCID: PMC7647281 DOI: 10.1093/jamia/ocaa053] [Citation(s) in RCA: 87] [Impact Index Per Article: 21.8] [Received: 01/22/2020] [Revised: 04/01/2020] [Accepted: 04/07/2020] [Indexed: 01/08/2023] Open
Abstract
OBJECTIVE To conduct a systematic scoping review of explainable artificial intelligence (XAI) models that use real-world electronic health record data, categorize these techniques according to different biomedical applications, identify gaps of current studies, and suggest future research directions. MATERIALS AND METHODS We searched MEDLINE, IEEE Xplore, and the Association for Computing Machinery (ACM) Digital Library to identify relevant papers published between January 1, 2009 and May 1, 2019. We summarized these studies based on the year of publication, prediction tasks, machine learning algorithm, dataset(s) used to build the models, the scope, category, and evaluation of the XAI methods. We further assessed the reproducibility of the studies in terms of the availability of data and code and discussed open issues and challenges. RESULTS Forty-two articles were included in this review. We reported the research trend and most-studied diseases. We grouped XAI methods into 5 categories: knowledge distillation and rule extraction (N = 13), intrinsically interpretable models (N = 9), data dimensionality reduction (N = 8), attention mechanism (N = 7), and feature interaction and importance (N = 5). DISCUSSION XAI evaluation is an open issue that requires a deeper focus in the case of medical applications. We also discuss the importance of reproducibility of research work in this field, as well as the challenges and opportunities of XAI from 2 medical professionals' point of view. CONCLUSION Based on our review, we found that XAI evaluation in medicine has not been adequately and formally practiced. Reproducibility remains a critical concern. Ample opportunities exist to advance XAI research in medicine.
Affiliation(s)
| | - Zhaoyi Chen
- Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, Florida, USA
| | - Pablo Rengifo-Moreno
- College of Medicine, Florida State University, Tallahassee, Florida, USA
- Tallahassee Memorial Hospital, Tallahassee, Florida, USA
| | - Tim Miller
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia
| | - Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, Florida, USA
| | - Jonathan H Chen
- Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Stanford, California, USA
- Division of Hospital Medicine, Department of Medicine, Stanford University, Stanford, California, USA
| | - Xiuwen Liu
- Department of Computer Science, Florida State University, Tallahassee, Florida, USA
| | - Zhe He
- School of Information, Florida State University, Tallahassee, Florida, USA
| |
|
17
|
Lee D, Zhang J, Liu J, Gerstein M. Epigenome-based splicing prediction using a recurrent neural network. PLoS Comput Biol 2020; 16:e1008006. [PMID: 32584815 PMCID: PMC7343189 DOI: 10.1371/journal.pcbi.1008006] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/25/2019] [Revised: 07/08/2020] [Accepted: 06/01/2020] [Indexed: 12/16/2022] Open
Abstract
Alternative RNA splicing provides an important means to expand metazoan transcriptome diversity. Contrary to what was accepted previously, splicing is now thought to predominantly take place during transcription. Motivated by emerging data showing the physical proximity of the spliceosome to Pol II, we surveyed the effect of epigenetic context on co-transcriptional splicing. In particular, we observed that splicing factors were not necessarily enriched at exon junctions and that most epigenetic signatures had a distinctly asymmetric profile around known splice sites. Given this, we tried to build an interpretable model that mimics the physical layout of splicing regulation where the chromatin context progressively changes as the Pol II moves along the guide DNA. We used a recurrent-neural-network architecture to predict the inclusion of a spliced exon based on adjacent epigenetic signals, and we showed that distinct spatio-temporal features of these signals were key determinants of model outcome, in addition to the actual nucleotide sequence of the guide DNA strand. After the model had been trained and tested (with >80% precision-recall curve metric), we explored the derived weights of the latent factors, finding they highlight the importance of the asymmetric time-direction of chromatin context during transcription.
Affiliation(s)
- Donghoon Lee
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
| | - Jing Zhang
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
| | - Jason Liu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
| | - Mark Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
- Department of Computer Science, Yale University, New Haven, Connecticut, United States of America
- Department of Statistics and Data Science, Yale University, New Haven, Connecticut, United States of America
| |
|
18
|
Albaradei S, Magana-Mora A, Thafar M, Uludag M, Bajic VB, Gojobori T, Essack M, Jankovic BR. Splice2Deep: An ensemble of deep convolutional neural networks for improved splice site prediction in genomic DNA. Gene 2020; 763S:100035. [PMID: 32550561 PMCID: PMC7285987 DOI: 10.1016/j.gene.2020.100035] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 03/30/2020] [Accepted: 05/06/2020] [Indexed: 12/21/2022]
Abstract
Background The accurate identification of exon/intron boundaries is critical for the correct annotation of genes with multiple exons. Donor and acceptor splice sites (SS) demarcate these boundaries. Therefore, deriving accurate computational models to predict SS is useful for the functional annotation of genes and genomes, and for finding alternative SS associated with different diseases. Although various models have been proposed for the in silico prediction of SS, their accuracy must improve for reliable annotation. Moreover, models are often derived and tested using the same genome, providing no evidence of broad application, i.e. to other poorly studied genomes. Results With this in mind, we developed the Splice2Deep models for SS detection. Each model is an ensemble of deep convolutional neural networks. We evaluated the performance of the models based on their ability to detect SS in Homo sapiens, Oryza sativa japonica, Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis elegans. Results demonstrate that the models efficiently detect SS in other organisms not considered during the training of the models. Compared to the state-of-the-art tools, Splice2Deep models achieved significantly reduced average error rates of 41.97% and 28.51% for acceptor and donor SS, respectively. Moreover, the Splice2Deep cross-organism validation demonstrates that models correctly identify conserved genomic elements, enabling annotation of SS in new genomes by choosing the taxonomically closest model. Conclusions The results of our study demonstrated that Splice2Deep achieved a considerably reduced error rate compared to other state-of-the-art models and was able to accurately recognize SS in other organisms for which the model was not trained, enabling annotation of poorly studied or newly sequenced genomes.
Splice2Deep models are implemented in Python using the Keras API; the models and the data are available at https://github.com/SomayahAlbaradei/Splice_Deep.git.
Key Words
- AUC, area under curve
- AcSS, acceptor splice site
- Acc, accuracy
- Bioinformatics
- CNN, convolutional neural network
- CONV, convolutional layers
- DL, deep learning
- DNA, deoxyribonucleic acid
- DT, decision trees
- Deep-learning
- DoSS, donor splice site
- FC, fully connected layer
- ML, machine learning
- NB, naive Bayes
- NN, neural network
- POOL, pooling layer
- Prediction
- RF, random forest
- RNA, ribonucleic acid
- ReLU, rectified linear unit layer
- SS, splice site
- SVM, support vector machine
- Sn, sensitivity
- Sp, specificity
- Splice sites
- Splicing
Affiliation(s)
- Somayah Albaradei
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia.,Faculty of Computing and Information Technology, King Abdulaziz University, Saudi Arabia
| | - Arturo Magana-Mora
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia.,Saudi Aramco, EXPEC-ARC, Drilling Technology Team, Dhahran 31311, Saudi Arabia
| | - Maha Thafar
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia.,Faculty of Computers and Information Systems, Taif University, Saudi Arabia
| | - Mahmut Uludag
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
| | - Vladimir B Bajic
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
| | - Takashi Gojobori
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia.,Biological and Environmental Sciences and Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
| | - Magbubah Essack
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
| | - Boris R Jankovic
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
| |
|
19
|
Yang T, Zhang L, Lin Q, Zhu S, Jin R. High-dimensional model recovery from random sketched data by exploring intrinsic sparsity. Mach Learn 2020. [DOI: 10.1007/s10994-019-05865-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
|
20
|
Van Messem A. Support vector machines: A robust prediction method with applications in bioinformatics. Handbook of Statistics 2020. [DOI: 10.1016/bs.host.2019.08.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 02/02/2023]
|
21
|
Wang R, Wang Z, Wang J, Li S. SpliceFinder: ab initio prediction of splice sites using convolutional neural network. BMC Bioinformatics 2019; 20:652. [PMID: 31881982 PMCID: PMC6933889 DOI: 10.1186/s12859-019-3306-3] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Indexed: 12/11/2022] Open
Abstract
Background Identifying splice sites is a necessary step in analyzing the location and structure of genes. Two dinucleotides, GT and AG, are highly frequent at splice sites, and many other patterns with important biological functions also occur at splice sites. Meanwhile, these dinucleotides also occur frequently in sequences without splice sites, which makes prediction prone to false positives. Most existing tools select all the sequences with the two dimers and then focus on distinguishing the true splice sites from the pseudo ones. Such an approach decreases false positives; however, it misses non-canonical splice sites. Result We have designed SpliceFinder, based on a convolutional neural network (CNN), to predict splice sites. To achieve ab initio prediction, we used human genomic data to train our neural network. An iterative approach is adopted to reconstruct the dataset, which tackles the data imbalance problem and forces the model to learn more features of splice sites. The proposed CNN obtains a classification accuracy of 90.25%, which is 10% higher than the existing algorithms. The method outperforms other existing methods in terms of area under the receiver operating characteristic curve (AUC), recall, precision, and F1 score. Furthermore, SpliceFinder can find the exact position of splice sites on long genomic sequences with a sliding window. Compared with other state-of-the-art splice site prediction tools, SpliceFinder produces about half as many false positives while keeping recall higher than 0.8. SpliceFinder also captures non-canonical splice sites. In addition, SpliceFinder performs well on the genomic sequences of Drosophila melanogaster, Mus musculus, Rattus, and Danio rerio without retraining. Conclusion Based on CNN, we have proposed a new ab initio splice site prediction tool, SpliceFinder, which generates fewer false positives and can detect non-canonical splice sites.
Additionally, SpliceFinder is transferable to other species without retraining. The source code and additional materials are available at https://gitlab.deepomics.org/wangruohan/SpliceFinder.
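The candidate-selection step that the abstract attributes to most existing tools (keep every GT or AG dinucleotide, then classify each candidate) can be sketched as below. This is a minimal illustration in Python, not SpliceFinder's code: the function name `candidate_sites` and the `flank` parameter are hypothetical, and the downstream classifier is omitted.

```python
def candidate_sites(seq, flank=3):
    """Slide over `seq` and collect every canonical GT/AG dinucleotide.

    Returns a list of (position, kind, window) tuples, where `window`
    is the dinucleotide plus `flank` bases on each side. Positions too
    close to either end to have full flanks are skipped.
    """
    seq = seq.upper()
    candidates = []
    # i is the index of the first base of the dinucleotide
    for i in range(flank, len(seq) - flank - 1):
        dimer = seq[i:i + 2]
        if dimer == "GT":
            kind = "donor"
        elif dimer == "AG":
            kind = "acceptor"
        else:
            continue  # non-canonical sites are never proposed here
        candidates.append((i, kind, seq[i - flank:i + 2 + flank]))
    return candidates


if __name__ == "__main__":
    for pos, kind, window in candidate_sites("CCAGGTAAGTCC", flank=2):
        print(pos, kind, window)
```

Every GT or AG in the input becomes a candidate, which illustrates why this strategy produces many pseudo sites and, by construction, cannot report non-canonical sites.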
Affiliation(s)
- Ruohan Wang
- Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China
| | - Zishuai Wang
- Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China
| | - Jianping Wang
- Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China.
| | - Shuaicheng Li
- Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China.
| |
|
22
|
Splice sites detection using chaos game representation and neural network. Genomics 2019; 112:1847-1852. [PMID: 31704313 DOI: 10.1016/j.ygeno.2019.10.018] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 04/18/2018] [Revised: 03/18/2019] [Accepted: 10/29/2019] [Indexed: 11/23/2022]
Abstract
A novel method is proposed to detect the acceptor and donor splice sites using chaos game representation and artificial neural network. In order to achieve high accuracy, inputs to the neural network, or feature vector, shall reflect the true nature of the DNA segments. Therefore it is important to have one-to-one numerical representation, i.e. a feature vector should be able to represent the original data. Chaos game representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a respective position on the plane in a one-to-one manner. Using CGR, a DNA sequence can be mapped to a numerical sequence that reflects the true nature of the original sequence. In this research, we propose to use CGR as feature input to a neural network to detect splice sites on the NN269 dataset. Computational experiments indicate that this approach gives good accuracy while being simpler than other methods in the literature, with only one neural network component. The code and data for our method can be accessed from this link: https://github.com/thoang3/portfolio/tree/SpliceSites_ANN_CGR.
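The iterative CGR mapping described in this abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the corner assignment in `CORNERS`, the starting point (0.5, 0.5), and the function names `cgr_points` and `cgr_decode` are all choices made here for the example.

```python
# Assumed corner layout of the unit square; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}
BASE_AT = {corner: base for base, corner in CORNERS.items()}


def cgr_points(seq, start=(0.5, 0.5)):
    """Map each nucleotide to a point: move halfway from the current
    point toward the corner that belongs to the nucleotide."""
    x, y = start
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]  # raises KeyError for symbols such as N
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_decode(point, length):
    """Recover the sequence from its final CGR point, which shows that
    the mapping is one-to-one (for sequences short enough to stay
    within floating-point precision)."""
    x, y = point
    bases = []
    for _ in range(length):
        cx = 1.0 if x > 0.5 else 0.0
        cy = 1.0 if y > 0.5 else 0.0
        bases.append(BASE_AT[(cx, cy)])
        x, y = 2 * x - cx, 2 * y - cy  # undo the halving step
    return "".join(reversed(bases))


if __name__ == "__main__":
    pts = cgr_points("ACGT")
    print(pts[-1])
    print(cgr_decode(pts[-1], 4))
```

The coordinate list returned by `cgr_points` is the kind of numerical sequence that could serve as a feature vector; how the paper actually arranges these values as network input is not stated in the abstract.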
|
23
|
Meher PK, Sahu TK, Gahoi S, Satpathy S, Rao AR. Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition. Gene 2019; 705:113-126. [PMID: 31009682 DOI: 10.1016/j.gene.2019.04.047] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 12/20/2018] [Revised: 03/27/2019] [Accepted: 04/17/2019] [Indexed: 02/02/2023]
Abstract
Identification of splice sites is imperative for prediction of gene structure. Machine learning-based approaches (MLAs) have been reported to be more successful than rule-based methods for identification of splice sites. However, the nucleotide strings must be transformed into numeric features through sequence encoding before they can be used as input to MLAs. In this study, we evaluated the performance of 8 different sequence encoding schemes, i.e., Bayes kernel, density and sparse (DS), distribution of tri-nucleotide and 1st order Markov model (DM), frequency difference distance measure (FDDM), paired-nucleotide frequency difference between true and false sites (FDTF), 1st order Markov model (MM1), combination of both 1st and 2nd order Markov model (MM1 + MM2) and 2nd order Markov model (MM2), with respect to predicting donor and acceptor splice sites using 5 supervised learning methods (ANN, Bagging, Boosting, RF and SVM). The encoding schemes and machine learning methods were first evaluated in 4 species, i.e., A. thaliana, C. elegans, D. melanogaster and H. sapiens, and then performance was validated with another four species, i.e., Ciona intestinalis, Dictyostelium discoideum, Phaeodactylum tricornutum and Trypanosoma brucei. In terms of ROC (receiver operating characteristic) and PR (precision-recall) curves, the FDTF encoding approach achieved the highest accuracy, followed by either MM2 or FDDM. Further, SVM was found to achieve the highest accuracy (in terms of ROC and PR curves), followed by RF, across encoding schemes and species. In terms of prediction accuracy across species, the SVM-FDTF combination performed better than other combinations of classifiers and encoding schemes. Further, splice site prediction accuracies were higher for species with low intron density. To our limited knowledge, this is the first comprehensive evaluation of sequence encoding schemes for prediction of splice sites.
We have also developed an R package, EncDNA (https://cran.r-project.org/web/packages/EncDNA/index.html), for encoding splice site motifs with different encoding schemes, which is expected to supplement the existing nucleotide sequence encoding approaches. This study is expected to be useful to computational biologists for predicting different functional elements in genomic DNA.
Affiliation(s)
- Prabina Kumar Meher
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
| | - Tanmaya Kumar Sahu
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
| | - Shachi Gahoi
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
| | - Subhrajit Satpathy
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
| | | |
|
24
|
Zeng Y, Yuan H, Yuan Z, Chen Y. A high-performance approach for predicting donor splice sites based on short window size and imbalanced large samples. Biol Direct 2019; 14:6. [PMID: 30975175 PMCID: PMC6460831 DOI: 10.1186/s13062-019-0236-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 12/15/2018] [Accepted: 03/18/2019] [Indexed: 11/10/2022] Open
Abstract
Background Splice site prediction has been a long-standing problem in bioinformatics. Although many computational approaches developed for splice site prediction have achieved satisfactory accuracy, further improvement in predictive accuracy is significant because it contributes to predicting gene structure more accurately. Determining a proper window size before prediction is necessary. An overly long window may introduce irrelevant features, which would reduce predictive accuracy, while a short window with maximum information may perform better in terms of predictive accuracy and time cost. Furthermore, the number of false splice sites following the GT–AG rule far exceeds that of true splice sites, so accurate and rapid prediction of splice sites using imbalanced large samples has always been a challenge. Therefore, based on the short window size and imbalanced large samples, we developed a new computational method named chi-square decision table (χ2-DT) for donor splice site prediction. Results Using a short window size of 11 bp, χ2-DT extracts the improved positional features and compositional features based on the chi-square test, then introduces features one by one based on information gain, and constructs a balanced decision table aimed at implementing imbalanced pattern classification. With a 2000:271,132 (true sites:false sites) training set, χ2-DT achieves the highest independent test accuracy (93.34%) when compared with three classifiers (random forest, artificial neural network, and relaxed variable kernel density estimator) and takes a short computation time (89 s). χ2-DT also exhibits good independent test accuracy (92.40%) when validated with BG-570 mutated sequences with frameshift errors (nucleotide insertions and deletions). Moreover, χ2-DT is compared with the long-window size-based methods and the short-window size-based methods, and is found to perform better than all of them in terms of predictive accuracy.
Conclusions Based on short window size and imbalanced large samples, the proposed method not only achieves higher predictive accuracy than some existing methods, but also has high computational speed and good robustness against nucleotide insertions and deletions. Reviewers This article was reviewed by Ryan McGinty, Ph.D. and Dirk Walther. Electronic supplementary material The online version of this article (10.1186/s13062-019-0236-y) contains supplementary material, which is available to authorized users.
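To make the idea of chi-square-based positional features concrete, the sketch below computes a Pearson chi-square statistic per window position from a 2 × 4 table of nucleotide counts (true sites vs. false sites). It is only an illustration of the underlying test, not the χ2-DT method itself: the function name `chi_square_by_position` is hypothetical, and the paper's improved features, information-gain selection, and decision table are not reproduced.

```python
def chi_square_by_position(true_seqs, false_seqs, alphabet="ACGT"):
    """Return one Pearson chi-square statistic per window position.

    For each position, a 2 x len(alphabet) contingency table counts how
    often each nucleotide occurs in true vs. false site windows. Larger
    values mean the position separates the two classes more strongly.
    All sequences are assumed to be aligned and of equal length.
    """
    length = len(true_seqs[0])
    stats = []
    for pos in range(length):
        rows = [
            [sum(seq[pos] == base for seq in group) for base in alphabet]
            for group in (true_seqs, false_seqs)
        ]
        total = sum(sum(row) for row in rows)
        col_totals = [rows[0][j] + rows[1][j] for j in range(len(alphabet))]
        chi2 = 0.0
        for row in rows:
            row_total = sum(row)
            for j, observed in enumerate(row):
                expected = row_total * col_totals[j] / total
                if expected > 0:  # skip nucleotides absent at this position
                    chi2 += (observed - expected) ** 2 / expected
        stats.append(chi2)
    return stats


if __name__ == "__main__":
    true_windows = ["GT", "GT"]
    false_windows = ["AT", "CT"]
    print(chi_square_by_position(true_windows, false_windows))
```

In this toy input the first position differs completely between the classes while the second is always T, so only the first position receives a non-zero statistic.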
Affiliation(s)
- Ying Zeng
- Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-making, Hunan Agricultural University, Changsha, 410128, Hunan, China.,Orient Science & Technology College, Hunan Agricultural University, Changsha, 410128, Hunan, China
| | - Hongjie Yuan
- Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-making, Hunan Agricultural University, Changsha, 410128, Hunan, China
| | - Zheming Yuan
- Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-making, Hunan Agricultural University, Changsha, 410128, Hunan, China.,Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, 410128, Hunan, China.
| | - Yuan Chen
- Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Hunan Agricultural University, Changsha, 410128, Hunan, China.
| |
|
25
|
Zhang Y, Liu X, MacLeod J, Liu J. Discerning novel splice junctions derived from RNA-seq alignment: a deep learning approach. BMC Genomics 2018; 19:971. [PMID: 30591034 PMCID: PMC6307148 DOI: 10.1186/s12864-018-5350-1] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 11/16/2017] [Accepted: 12/03/2018] [Indexed: 11/10/2022] Open
Abstract
Background Exon splicing is a regulated cellular process in the transcription of protein-coding genes. Technological advancements and cost reductions in RNA sequencing have made quantitative and qualitative assessments of the transcriptome both possible and widely available. RNA-seq provides unprecedented resolution to identify gene structures and resolve the diversity of splicing variants. However, currently available ab initio aligners are vulnerable to spurious alignments due to random sequence matches and sample-reference genome discordance. As a consequence, a significant set of false positive exon junction predictions would be introduced, which will further confuse downstream analyses of splice variant discovery and abundance estimation. Results In this work, we present a deep learning based splice junction sequence classifier, named DeepSplice, which employs convolutional neural networks to classify candidate splice junctions. We show (I) DeepSplice outperforms state-of-the-art methods for splice site classification when applied to the popular benchmark dataset HS3D, (II) DeepSplice shows high accuracy for splice junction classification with GENCODE annotation, and (III) the application of DeepSplice to classify putative splice junctions generated by Rail-RNA alignment of 21,504 human RNA-seq data significantly reduces 43 million candidates into around 3 million highly confident novel splice junctions. Conclusions A model inferred from the sequences of annotated exon junctions that can then classify splice junctions derived from primary RNA-seq data has been implemented. The performance of the model was evaluated and compared through comprehensive benchmarking and testing, indicating a reliable performance and gross usability for classifying novel splice junctions derived from RNA-seq alignment. 
Electronic supplementary material The online version of this article (10.1186/s12864-018-5350-1) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Yi Zhang
- Department of Computer Science, University of Kentucky, Lexington, KY, 40506, USA.
| | - Xinan Liu
- Department of Computer Science, University of Kentucky, Lexington, KY, 40506, USA
| | - James MacLeod
- Department of Veterinary Science, University of Kentucky, Lexington, KY, 40506, USA
| | - Jinze Liu
- Department of Computer Science, University of Kentucky, Lexington, KY, 40506, USA
|
26
|
Zuallaert J, Godin F, Kim M, Soete A, Saeys Y, De Neve W. SpliceRover: interpretable convolutional neural networks for improved splice site prediction. Bioinformatics 2018; 34:4180-4188. [DOI: 10.1093/bioinformatics/bty497] [Citation(s) in RCA: 58] [Impact Index Per Article: 9.7] [Received: 10/23/2017] [Accepted: 06/19/2018] [Indexed: 11/13/2022] Open
Affiliation(s)
- Jasper Zuallaert
- Center for Biotech Data Science, Department of Environmental Technology, Food Technology and Molecular Biotechnology, Ghent University Global Campus, Songdo, Incheon, South Korea
- IDLab, Department for Electronics and Information Systems, Ghent University, Ghent, Belgium
| | - Fréderic Godin
- IDLab, Department for Electronics and Information Systems, Ghent University, Ghent, Belgium
| | - Mijung Kim
- Center for Biotech Data Science, Department of Environmental Technology, Food Technology and Molecular Biotechnology, Ghent University Global Campus, Songdo, Incheon, South Korea
- IDLab, Department for Electronics and Information Systems, Ghent University, Ghent, Belgium
| | - Arne Soete
- Department of Biomedical Molecular Biology, Ghent University, Ghent, Belgium
- Data Mining and Modeling for Biomedicine, VIB Inflammation Research Center, Ghent, Belgium
| | - Yvan Saeys
- Data Mining and Modeling for Biomedicine, VIB Inflammation Research Center, Ghent, Belgium
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
| | - Wesley De Neve
- Center for Biotech Data Science, Department of Environmental Technology, Food Technology and Molecular Biotechnology, Ghent University Global Campus, Songdo, Incheon, South Korea
- IDLab, Department for Electronics and Information Systems, Ghent University, Ghent, Belgium
|
27
|
Ma W, Yang L, Rohs R, Noble WS. DNA sequence+shape kernel enables alignment-free modeling of transcription factor binding. Bioinformatics 2018; 33:3003-3010. [PMID: 28541376 PMCID: PMC5870879 DOI: 10.1093/bioinformatics/btx336] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 11/25/2016] [Accepted: 05/23/2017] [Indexed: 01/07/2023] Open
Abstract
Motivation Transcription factors (TFs) bind to specific DNA sequence motifs. Several lines of evidence suggest that TF-DNA binding is mediated in part by properties of the local DNA shape: the width of the minor groove, the relative orientations of adjacent base pairs, etc. Several methods have been developed to jointly account for DNA sequence and shape properties in predicting TF binding affinity. However, a limitation of these methods is that they typically require a training set of aligned TF binding sites. Results We describe a sequence + shape kernel that leverages DNA sequence and shape information to better understand protein-DNA binding preference and affinity. This kernel extends an existing class of k-mer based sequence kernels, based on the recently described di-mismatch kernel. Using three in vitro benchmark datasets, derived from universal protein binding microarrays (uPBMs), genomic context PBMs (gcPBMs) and SELEX-seq data, we demonstrate that incorporating DNA shape information improves our ability to predict protein-DNA binding affinity. In particular, we observe that (i) the k-spectrum + shape model performs better than the classical k-spectrum kernel, particularly for small k values; (ii) the di-mismatch kernel performs better than the k-mer kernel, for larger k; and (iii) the di-mismatch + shape kernel performs better than the di-mismatch kernel for intermediate k values. Availability and implementation The software is available at https://bitbucket.org/wenxiu/sequence-shape.git. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Wenxiu Ma
- Department of Statistics, University of California Riverside, Riverside, CA 92521, USA
| | - Lin Yang
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Remo Rohs
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - William Stafford Noble
- Department of Genome Sciences, Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA
|
28
|
SpliceVec: Distributed feature representations for splice junction prediction. Comput Biol Chem 2018; 74:434-441. [DOI: 10.1016/j.compbiolchem.2018.03.009] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 03/08/2018] [Accepted: 03/12/2018] [Indexed: 12/12/2022]
|
29
|
Abstract
Accurate splice-site prediction is essential to delineate gene structures from sequence data. Several computational techniques have been applied to create a system to predict canonical splice sites. For classification tasks, deep neural networks (DNNs) have achieved record-breaking results and often outperformed other supervised learning techniques. In this study, a new method of splice-site prediction using DNNs was proposed. The proposed system receives an input sequence and returns an answer as to whether it is a splice site. The length of the input is 140 nucleotides, with the consensus sequence (i.e., "GT" and "AG" for the donor and acceptor sites, respectively) in the middle. Each input sequence is applied to the pretrained DNN model, which determines the probability that the input is a splice site. The model consists of convolutional layers and bidirectional long short-term memory network layers. The pretraining and validation were conducted using the data set tested in previously reported methods. The performance evaluation results showed that the proposed method can outperform the previous methods. In addition, the patterns learned by the DNNs were visualized as position frequency matrices (PFMs). Some of the PFMs were very similar to the consensus sequence. The trained DNN model and brief source code for the prediction system have been uploaded. Further improvement is expected to follow from the further development of DNNs.
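The input format described in this abstract (a 140-nucleotide window with the "GT" or "AG" consensus in the middle) can be illustrated with a short Python sketch. The one-hot layout, the exact centering (69 flanking nucleotides on each side of the dinucleotide) and the function names are assumptions made for illustration only; they are not taken from the authors' code.

```python
# Illustrative sketch: window length follows the abstract (140 nt);
# the centering rule and one-hot layout are assumptions.
WINDOW = 140
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T).

    Bases outside ACGT (for example N) become an all-zero row.
    """
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in seq.upper()]


def candidate_windows(sequence, dinucleotide="GT", window=WINDOW):
    """Yield (position, window_sequence) for every occurrence of the
    consensus dinucleotide that has a full flank on both sides."""
    flank = (window - len(dinucleotide)) // 2  # 69 nt per side for 140 nt
    sequence = sequence.upper()
    start = sequence.find(dinucleotide)
    while start != -1:
        left = start - flank
        right = start + len(dinucleotide) + flank
        if left >= 0 and right <= len(sequence):
            yield start, sequence[left:right]
        start = sequence.find(dinucleotide, start + 1)
```

Each yielded window could then be passed through `one_hot` to obtain a 140 x 4 matrix, a common input shape for convolutional or recurrent sequence models; use `dinucleotide="AG"` for acceptor candidates.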
Affiliation(s)
- Tatsuhiko Naito
- Department of Neurology, Graduate School of Medicine, The University of Tokyo , Tokyo, Japan
|
30
|
Pashaei E, Aydin N. Markovian encoding models in human splice site recognition using SVM. Comput Biol Chem 2018; 73:159-170. [PMID: 29486390 DOI: 10.1016/j.compbiolchem.2018.02.005] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 05/27/2017] [Revised: 02/04/2018] [Accepted: 02/05/2018] [Indexed: 11/26/2022]
Abstract
Splice site recognition is among the most significant and challenging tasks in bioinformatics due to its key role in gene annotation. Effective prediction of splice sites requires nucleotide encoding methods that reveal the characteristics of DNA sequences and provide appropriate features as input to machine learning classifiers. Markovian models are among the most influential encoding methods and are widely used for pattern recognition in biological data. However, a direct performance comparison of these methods in the splice site domain has not yet been made. This study compares various Markovian encoding models for splice site prediction utilizing the support vector machine, the most outstanding learning method in the domain, and conducts a new precise evaluation of Markovian approaches that corrects this limitation. Moreover, a novel sequence encoding approach based on a third-order Markov model (MM3) is proposed. The experimental results show that the proposed method, namely MM3-SVM, performs significantly better than thirteen of the best-known state-of-the-art algorithms when tested on the HS3D dataset, considering several performance criteria. Further, it achieved higher prediction accuracy than several well-known tools such as NNsplice, MEM, MM1, WMM, and GeneID, using an independent test set of 50 genes. We also developed MMSVM, a web tool to predict splice sites in any human sequence using the proposed approach. The MMSVM web server can be accessed at https://pashaei.shinyapps.io/mmsvm.
Affiliation(s)
- Elham Pashaei
- Department of Computer Engineering, Yildiz Technical University, Istanbul, Turkey.
| | - Nizamettin Aydin
- Department of Computer Engineering, Yildiz Technical University, Istanbul, Turkey.
|
31
|
Kim M, Tagkopoulos I. Data integration and predictive modeling methods for multi-omics datasets. Mol Omics 2018; 14:8-25. [DOI: 10.1039/c7mo00051k] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Indexed: 12/24/2022]
Abstract
We provide an overview of opportunities and challenges in multi-omics predictive analytics with particular emphasis on data integration and machine learning methods.
Affiliation(s)
- Minseung Kim
- Department of Computer Science, University of California, Davis, USA
- Genome Center
| | - Ilias Tagkopoulos
- Department of Computer Science, University of California, Davis, USA
- Genome Center
|
32
|
Pashaei E, Yilmaz A, Ozen M, Aydin N. A novel method for splice sites prediction using sequence component and hidden Markov model. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2017; 2016:3076-3079. [PMID: 28268961 DOI: 10.1109/embc.2016.7591379] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 11/07/2022]
Abstract
With the increasing growth of DNA sequence data, there is an urgent demand for new methods that accurately predict genes. The performance of gene detection methods depends mainly on the efficiency of splice site prediction methods. In this paper, a novel method for detecting splice sites is proposed using a new, effective DNA encoding method and the AdaBoost.M1 classifier. Our proposed DNA encoding method is based on multi-scale component (MSC) and first-order Markov model (MM1) features. It has been applied to the HS3D dataset with repeated 10-fold cross-validation. The experimental results indicate that the new method has increased the classification accuracy and outperformed some current methods such as MM1-SVM, Reduced MM1-SVM, SVM-B, LVMM, DM-SVM, DM2-AdaBoost and MSC+Pos(+APR)-SVM.
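As a rough illustration of what a first-order Markov model (MM1) encoding can look like, the Python sketch below estimates position-specific transition probabilities P(s_i | s_{i-1}) from aligned training sequences and encodes a sequence as the vector of those probabilities. This is a generic sketch under stated assumptions (equal-length sequences over A, C, G, T; add-one smoothing; hypothetical function names), not the authors' implementation, and it does not cover the multi-scale component (MSC) part of their encoding.

```python
from collections import defaultdict

ALPHABET = "ACGT"


def fit_mm1(sequences, pseudocount=1.0):
    """Estimate P(next | previous) separately for each adjacent position pair.

    Assumes all sequences have the same length and contain only A, C, G, T.
    The pseudocount (an assumption of this sketch) avoids zero probabilities.
    """
    length = len(sequences[0])
    counts = [defaultdict(float) for _ in range(length - 1)]
    for seq in sequences:
        for i in range(1, length):
            counts[i - 1][(seq[i - 1], seq[i])] += 1.0
    tables = []
    for pos_counts in counts:
        table = {}
        for prev in ALPHABET:
            total = sum(pos_counts[(prev, nxt)] for nxt in ALPHABET)
            total += len(ALPHABET) * pseudocount
            for nxt in ALPHABET:
                table[(prev, nxt)] = (pos_counts[(prev, nxt)] + pseudocount) / total
        tables.append(table)
    return tables


def encode_mm1(seq, tables):
    """Map a sequence to its vector of position-specific transition probabilities."""
    return [tables[i - 1][(seq[i - 1], seq[i])] for i in range(1, len(seq))]
```

The resulting numeric vector (length n-1 for a sequence of length n) could then be supplied to any standard classifier; in practice, separate tables are often fitted on true and false sites, which is again an assumption rather than a detail given in the abstract.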
|
33
|
Zhou G, Zhang F, Liu Y, Sun B. Pathway‑based detection of idiopathic pulmonary fibrosis at an early stage. Mol Med Rep 2017; 15:2023-2028. [PMID: 28260097 PMCID: PMC5364974 DOI: 10.3892/mmr.2017.6274] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2016] [Accepted: 01/06/2017] [Indexed: 12/02/2022] Open
Abstract
Idiopathic pulmonary fibrosis (IPF) is the most common interstitial pneumonia and the most aggressive interstitial lung disease. Usually, IPF is confirmed by the histopathological pattern of typical interstitial pneumonia and requires an integrated multidisciplinary approach from pulmonologists, radiologists and pathologists. However, these diagnoses are performed at an advanced stage of IPF. At present, pathway‑based detection requires investigation, as it can be performed at an early stage of the disease. The aim of the present study was to find an effective method of diagnosing IPF at an early stage. Microarray data for E‑GEOD‑33566 were downloaded from the ArrayExpress database. Human pathways were downloaded from the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database. An individual pathway‑based method to diagnose IPF at an early stage was introduced. Pathway statistics were analyzed with an individualized pathway aberrance score. P‑values were obtained with different methods, including the Wilcoxon test, linear models for microarray data (Limma) test and attract methods, generating three pathway groups. Support vector machines (SVM) were used to identify the best group for diagnosing IPF at an early stage. There were 106 differential pathways in the Wilcoxon‑based KEGG Pathway (n>5) group, 100 in the Limma‑based KEGG Pathway (n>5) group, and seven in the attract‑based KEGG Pathway (n>5) group. The pathway statistics of these differential pathways in the three groups were analyzed with linear SVM. The results demonstrated that the Wilcoxon‑based KEGG Pathway (n>5) group performed best in diagnosing IPF.
Affiliation(s)
- Guojun Zhou
- Department of Emergency, Medical University Hospital of Binzhou, Binzhou, Shandong 256600, P.R. China
| | - Fangxia Zhang
- Department of Cardiology, Medical University Hospital of Binzhou, Binzhou, Shandong 256600, P.R. China
| | - Yufang Liu
- Department of Gynaecology and Obstetrics, Medical University Hospital of Binzhou, Binzhou, Shandong 256600, P.R. China
| | - Bin Sun
- Department of Emergency, Medical University Hospital of Binzhou, Binzhou, Shandong 256600, P.R. China
|
34
|
Vidovic MMC, Kloft M, Müller KR, Görnitz N. ML2Motif-Reliable extraction of discriminative sequence motifs from learning machines. PLoS One 2017; 12:e0174392. [PMID: 28346487 PMCID: PMC5367830 DOI: 10.1371/journal.pone.0174392] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 11/24/2016] [Accepted: 03/08/2017] [Indexed: 01/30/2023] Open
Abstract
High prediction accuracies are not the only objective to consider when solving problems using machine learning. Instead, particular scientific applications require some explanation of the learned prediction function. For computational biology, positional oligomer importance matrices (POIMs) have been successfully applied to explain the decision of support vector machines (SVMs) using weighted-degree (WD) kernels. To extract relevant biological motifs from POIMs, the motifPOIM method has been devised and showed promising results on real-world data. Our contribution in this paper is twofold: as an extension to POIMs, we propose gPOIM, a general measure of feature importance for arbitrary learning machines and feature sets (including, but not limited to, SVMs and CNNs) and devise a sampling strategy for efficient computation. As a second contribution, we derive a convex formulation of motifPOIMs that leads to more reliable motif extraction from gPOIMs. Empirical evaluations confirm the usefulness of our approach on artificially generated data as well as on real-world datasets.
Affiliation(s)
| | - Marius Kloft
- Department of Computer Science, Humboldt University of Berlin, Berlin, Germany
| | - Klaus-Robert Müller
- Machine Learning Group, Technical University of Berlin, Berlin, Germany
- Department of Brain and Cognitive Engineering, Korea University, Anam-dong, Seongbuk-gu, Seoul 136-713, Korea
| | - Nico Görnitz
- Machine Learning Group, Technical University of Berlin, Berlin, Germany
|
35
|
Alt-Splice Gene Predictor Using Multitrack-Clique Analysis: Verification of Statistical Support for Modelling in Genomes of Multicellular Eukaryotes. INFORMATICS 2017. [DOI: 10.3390/informatics4010003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/16/2022] Open
|
36
|
Xu X, Wu A, Zhang X, Su M, Jiang T, Yuan ZM. MetaDP: a comprehensive web server for disease prediction of 16S rRNA metagenomic datasets. BIOPHYSICS REPORTS 2017; 2:106-115. [PMID: 28317014 PMCID: PMC5334392 DOI: 10.1007/s41048-016-0033-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 09/05/2016] [Accepted: 11/08/2016] [Indexed: 01/16/2023] Open
Abstract
High-throughput sequencing-based metagenomics has garnered considerable interest in recent years. Numerous methods and tools have been developed for the analysis of metagenomic data. However, it is still a daunting task to install a large number of tools and complete a complicated analysis, especially for researchers with minimal bioinformatics backgrounds. To address this problem, we constructed an automated software named MetaDP for 16S rRNA sequencing data analysis, including data quality control, operational taxonomic unit clustering, diversity analysis, and disease risk prediction modeling. Furthermore, a support vector machine-based prediction model for intestinal bowel syndrome (IBS) was built by applying MetaDP to microbial 16S sequencing data from 108 children. The success of the IBS prediction model suggests that the platform may also be applied to other diseases related to gut microbes, such as obesity, metabolic syndrome, or intestinal cancer, among others (http://metadp.cn:7001/).
Affiliation(s)
- Xilin Xu
- Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, 410128 China
- Center for Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100005; Suzhou Institute of Systems Medicine, Suzhou, 215123 China
| | - Aiping Wu
- Center for Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100005; Suzhou Institute of Systems Medicine, Suzhou, 215123 China
| | - Xinlei Zhang
- Suzhou Geneworks Technology Company Limited, Suzhou, 215123 China
| | - Mingming Su
- Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, 100005 China
| | - Taijiao Jiang
- Center for Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100005; Suzhou Institute of Systems Medicine, Suzhou, 215123 China
| | - Zhe-Ming Yuan
- Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, 410128 China
|
37
|
|
38
|
Meher PK, Sahu TK, Rao AR, Wahi SD. A computational approach for prediction of donor splice sites with improved accuracy. J Theor Biol 2016; 404:285-294. [PMID: 27302911 DOI: 10.1016/j.jtbi.2016.06.013] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 11/18/2015] [Revised: 04/18/2016] [Accepted: 06/09/2016] [Indexed: 11/24/2022]
Abstract
Identification of splice sites is important due to their key role in predicting the exon-intron structure of protein-coding genes. Though several approaches have been developed for the prediction of splice sites, further improvement in prediction accuracy will help predict gene structure more accurately. This paper presents a computational approach for prediction of donor splice sites with higher accuracy. In this approach, true and false splice sites were first encoded into numeric vectors and then used as input to an artificial neural network (ANN), support vector machine (SVM) and random forest (RF) for prediction. ANN and SVM were found to perform equally well, and better than RF, when tested on the HS3D and NN269 datasets. Further, the performance of ANN, SVM and RF was analyzed using an independent test set of 50 genes, and the prediction accuracy of ANN was found to be higher than that of SVM and RF. All the predictors achieved higher accuracy than existing methods such as NNsplice, MEM, MDD, WMM, MM1, FSPLICE, GeneID and ASSP on the independent test set. We have also developed an online prediction server (PreDOSS), available at http://cabgrid.res.in:8080/predoss, for prediction of donor splice sites using the proposed approach.
Affiliation(s)
- Prabina Kumar Meher
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
| | - Tanmaya Kumar Sahu
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
| | - A R Rao
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
| | - S D Wahi
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
|
39
|
Meher PK, Sahu TK, Rao AR, Wahi SD. Identification of donor splice sites using support vector machine: a computational approach based on positional, compositional and dependency features. Algorithms Mol Biol 2016; 11:16. [PMID: 27252772 PMCID: PMC4888255 DOI: 10.1186/s13015-016-0078-4] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 11/28/2015] [Accepted: 05/17/2016] [Indexed: 11/16/2022] Open
Abstract
Background Identification of splice sites is essential for annotation of genes. Though existing approaches have achieved an acceptable level of accuracy, there is still a need for further improvement. Besides, most of the approaches are species-specific, and hence approaches that are compatible across species are required. Results Each splice site sequence was transformed into a numeric vector of length 49, of which four were positional, four were dependency and 41 were compositional features. Using the transformed vectors as input, prediction was made with a support vector machine. Using a balanced training set, the proposed approach achieved an area under the ROC curve (AUC-ROC) of 96.05, 96.96, 96.95, 96.24 % and an area under the PR curve (AUC-PR) of 97.64, 97.89, 97.91, 97.90 % when tested on human, cattle, fish and worm datasets, respectively. On the other hand, AUC-ROC of 97.21, 97.45, 97.41, 98.06 % and AUC-PR of 93.24, 93.34, 93.38, 92.29 % were obtained when imbalanced training datasets were used. The proposed approach was found to be comparable with state-of-the-art splice site prediction approaches when compared using the benchmark NN269 dataset and other datasets. Conclusions The proposed approach achieved consistent accuracy across different species and was found to be comparable with the existing approaches. Thus, we believe that the proposed approach can be used as a complementary method to the existing methods for the prediction of splice sites. A web server named ‘HSplice’ has also been developed based on the proposed approach for easy prediction of 5′ splice sites by users and is freely available at http://cabgrid.res.in:8080/HSplice.
|
40
|
Pérez-Rodríguez J, García-Pedrajas N. Stepwise approach for combining many sources of evidence for site-recognition in genomic sequences. BMC Bioinformatics 2016; 17:117. [PMID: 26945666 PMCID: PMC4779560 DOI: 10.1186/s12859-016-0968-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 02/19/2015] [Accepted: 02/22/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Recognizing the different functional parts of genes, such as promoters, translation initiation sites, donors, acceptors and stop codons, is a fundamental task of many current studies in Bioinformatics. Currently, the most successful methods use powerful classifiers, such as support vector machines with various string kernels. However, with the rapid evolution of our ability to collect genomic information, it has been shown that combining many sources of evidence is fundamental to the success of any recognition task. With the advent of next-generation sequencing, the number of available genomes is increasing very rapidly. Thus, methods for making use of such large amounts of information are needed. RESULTS In this paper, we present a methodology for combining tens or even hundreds of different classifiers for an improved performance. Our approach can include almost a limitless number of sources of evidence. We can use the evidence for the prediction of sites in a certain species, such as human, or other species as needed. This approach can be used for any of the functional recognition tasks cited above. However, to provide the necessary focus, we have tested our approach in two functional recognition tasks: translation initiation site and stop codon recognition. We have used the entire human genome as a target and another 20 species as sources of evidence and tested our method on five different human chromosomes. The proposed method achieves better accuracy than the best state-of-the-art method both in terms of the geometric mean of the specificity and sensitivity and the area under the receiver operating characteristic and precision recall curves. Furthermore, our approach shows a more principled way for selecting the best genomes to be combined for a given recognition task. 
CONCLUSIONS Our approach has proven to be a powerful tool for improving the performance of functional site recognition, and it is a useful method for combining many sources of evidence for any recognition task in Bioinformatics. The results also show that the common approach of heuristically choosing the species to be used as source of evidence can be improved because the best combinations of genomes for recognition were those not usually selected. Although the experiments were performed for translation initiation site and stop codon recognition, any other recognition task may benefit from our methodology.
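One of the evaluation measures named in this abstract, the geometric mean of specificity and sensitivity, can be computed directly from confusion-matrix counts. The Python sketch below uses the standard definition; the function name and the convention of returning 0.0 when a class is absent are assumptions of this illustration, not details from the paper.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (TP rate) and specificity (TN rate).

    Returns 0.0 for a class with no examples (a convention chosen here).
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)
```

Because the measure is a product of the two per-class rates, a classifier that labels every candidate site as negative scores 0 regardless of how rare true sites are, which is why it is often preferred to plain accuracy on imbalanced site-recognition data.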
Affiliation(s)
- Javier Pérez-Rodríguez
- Department of Computing and Numerical Analysis, University of Córdoba, Córdoba, 14071, Campus de Rabanales, Spain.
| | - Nicolás García-Pedrajas
- Department of Computing and Numerical Analysis, University of Córdoba, Córdoba, 14071, Campus de Rabanales, Spain.
|
41
|
Herndon N, Caragea D. A Study of Domain Adaptation Classifiers Derived From Logistic Regression for the Task of Splice Site Prediction. IEEE Trans Nanobioscience 2016; 15:75-83. [PMID: 26849871 DOI: 10.1109/tnb.2016.2522400] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Indexed: 11/06/2022]
Abstract
Supervised classifiers are highly dependent on abundant labeled training data. Alternatives for addressing the lack of labeled data include: labeling data (but this is costly and time consuming); training classifiers with abundant data from another domain (however, the classification accuracy usually decreases as the distance between domains increases); or complementing the limited labeled data with abundant unlabeled data from the same domain and learning semi-supervised classifiers (but the unlabeled data can mislead the classifier). A better alternative is to use the abundant labeled data from a source domain together with the limited labeled data and, optionally, the unlabeled data from the target domain to train classifiers in a domain adaptation setting. We propose two such classifiers, based on logistic regression, and evaluate them for the task of splice site prediction, a difficult and essential step in gene prediction. Our classifiers achieved high accuracy, with highest areas under the precision-recall curve between 50.83% and 82.61%.
|
42
|
Meher PK, Sahu TK, Rao AR. Prediction of donor splice sites using random forest with a new sequence encoding approach. BioData Min 2016; 9:4. [PMID: 26807151 PMCID: PMC4724119 DOI: 10.1186/s13040-016-0086-4] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 03/02/2015] [Accepted: 01/19/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Detection of splice sites plays a key role in predicting gene structure, and thus the development of efficient analytical methods for splice site prediction is vital. This paper presents a novel sequence encoding approach based on adjacent di-nucleotide dependencies, in which the donor splice site motifs are encoded into numeric vectors. The encoded vectors are then used as input to Random Forest (RF), Support Vector Machine (SVM), Artificial Neural Network (ANN), Bagging, Boosting, Logistic regression, kNN and Naïve Bayes classifiers for prediction of donor splice sites. RESULTS The performance of the proposed approach is evaluated on the donor splice site sequence data of Homo sapiens, collected from the Homo Sapiens Splice Sites Dataset (HS3D). The results showed that RF outperformed all the other classifiers considered. Besides, RF achieved higher prediction accuracy than the existing methods, viz. MEM, MDD, WMM, MM1, NNSplice and SpliceView, when compared using an independent test dataset. CONCLUSION Based on the proposed approach, we have developed an online prediction server (MaLDoSS) to help the biological community in predicting donor splice sites. The server is made freely available at http://cabgrid.res.in:8080/maldoss. Owing to its computational feasibility and high prediction accuracy, the proposed approach is expected to help in predicting eukaryotic gene structure.
Affiliation(s)
- Prabina Kumar Meher
- Division of Statistical Genetics, Indian Agricultural Statistics Research Institute, New Delhi, 110 012 India
| | - Tanmaya Kumar Sahu
- Centre for Agricultural Bioinformatics, Indian Agricultural Statistics Research Institute, New Delhi, 110 012 India
| | - Atmakuri Ramakrishna Rao
- Centre for Agricultural Bioinformatics, Indian Agricultural Statistics Research Institute, New Delhi, 110 012 India
|
43
|
Random Forest in Splice Site Prediction of Human Genome. XIV MEDITERRANEAN CONFERENCE ON MEDICAL AND BIOLOGICAL ENGINEERING AND COMPUTING 2016 2016. [DOI: 10.1007/978-3-319-32703-7_100] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 01/09/2023]
|
44
|
Stanescu A, Caragea D. An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets. BMC SYSTEMS BIOLOGY 2015; 9 Suppl 5:S1. [PMID: 26356316 PMCID: PMC4565116 DOI: 10.1186/1752-0509-9-s5-s1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022]
Abstract
BACKGROUND Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. The process of labeling data can be expensive, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches that can make use of unlabeled data, in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on the problem of predicting splice sites in a genome using semi-supervised learning approaches. This is a challenging problem, due to the highly imbalanced distribution of the data, i.e., small number of splice sites as compared to the number of non-splice sites. To address this challenge, we propose to use ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers. RESULTS Our experiments on five highly imbalanced splice site datasets, with positive to negative ratios of 1-to-99, showed that the ensemble-based semi-supervised approaches represent a good choice, even when the amount of labeled data consists of less than 1% of all training data. In particular, we found that ensembles of co-training and self-training classifiers that dynamically balance the set of labeled instances during the semi-supervised iterations show improvements over the corresponding supervised ensemble baselines. CONCLUSIONS In the presence of limited amounts of labeled data, ensemble-based semi-supervised approaches can successfully leverage the unlabeled data to enhance supervised ensembles learned from highly imbalanced data distributions. Given that such distributions are common for many biological sequence classification problems, our work can be seen as a stepping stone towards more sophisticated ensemble-based approaches to biological sequence annotation in a semi-supervised framework.
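To make the idea of dynamically balancing the labeled set during semi-supervised iterations more concrete, here is a minimal Python sketch of a single self-training loop that adds the same number of pseudo-labeled positives and negatives in each round. It is an assumption-laden toy: the one-dimensional threshold classifier, the function names and the data are hypothetical, and neither the ensembles nor the co-training variant described in the abstract are reproduced.

```python
def train_threshold(labeled):
    """Toy 1-D classifier: threshold halfway between the two class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0

def balanced_self_training(labeled, unlabeled, per_class=1, iterations=3):
    """Self-training that adds equal numbers of positives and negatives per round."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(iterations):
        threshold = train_threshold(labeled)
        # Candidates sorted by confidence (distance from the threshold).
        pos_cands = sorted((x for x in unlabeled if x > threshold), reverse=True)
        neg_cands = sorted(x for x in unlabeled if x <= threshold)
        # Balance: never add more of one class than of the other.
        k = min(per_class, len(pos_cands), len(neg_cands))
        if k == 0:
            break
        for x in pos_cands[:k]:
            labeled.append((x, 1))
            unlabeled.remove(x)
        for x in neg_cands[:k]:
            labeled.append((x, 0))
            unlabeled.remove(x)
    return labeled, unlabeled

# Made-up scores: few labeled examples, mostly negative unlabeled data.
seed = [(0.9, 1), (0.1, 0), (0.2, 0)]
pool = [0.05, 0.15, 0.3, 0.8, 0.95, 0.12, 0.25, 0.07]
final_labeled, remaining = balanced_self_training(seed, pool)
```

Capping the additions at the size of the rarer class is one simple way to keep the majority class from swamping the labeled set as the iterations proceed; the loop stops once no confident positive candidates remain.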
Affiliation(s)
- Ana Stanescu
  - Department of Computing and Information Sciences, Kansas State University, Nichols Hall, Manhattan, KS, 66506, USA
- Doina Caragea
  - Department of Computing and Information Sciences, Kansas State University, Nichols Hall, Manhattan, KS, 66506, USA
|
45
|
Computational Recognition of RNA Splice Sites by Exact Algorithms for the Quadratic Traveling Salesman Problem. Computation 2015. [DOI: 10.3390/computation3020285] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
|
46
|
Abbas MM, Mohie-Eldin MM, EL-Manzalawy Y. Assessing the effects of data selection and representation on the development of reliable E. coli sigma 70 promoter region predictors. PLoS One 2015; 10:e0119721. [PMID: 25803493 PMCID: PMC4372424 DOI: 10.1371/journal.pone.0119721] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 06/12/2014] [Accepted: 01/26/2015] [Indexed: 11/27/2022] Open
Abstract
As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for the annotation of functional elements (e.g., transcriptional regulatory elements) becomes more pressing. Promoters are the key regulatory elements, which recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). The identification of promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors, including: i) training data; ii) data representation; iii) classification algorithms; iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and utilize them to experimentally examine the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that under some combinations of the first three factors, a prediction model might perform very well in cross-validation experiments while performing very poorly on independent test data. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that might be estimated using the cross-validation procedure. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data was obtained. On the other hand, poor prediction models seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best performing sequence-based classifiers outperform the best performing structure-based classifiers in both cross-validation and independent test performance evaluation experiments. Finally, we propose a meta-predictor method combining two top performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ70 promoter prediction methods.
Affiliation(s)
- Mostafa M. Abbas
  - KINDI Center for Computing Research, College of Engineering, Qatar University, Doha, Qatar
- Yasser EL-Manzalawy
  - Systems and Computer Engineering, Al-Azhar University, Cairo, Egypt
  - College of Information Sciences, Penn State University, University Park, United States of America
|
47
|
Peters TJ, Buckley MJ, Statham AL, Pidsley R, Samaras K, Lord RV, Clark SJ, Molloy PL. De novo identification of differentially methylated regions in the human genome. Epigenetics Chromatin 2015; 8:6. [PMID: 25972926 PMCID: PMC4429355 DOI: 10.1186/1756-8935-8-6] [Citation(s) in RCA: 601] [Impact Index Per Article: 66.8] [Received: 09/04/2014] [Accepted: 12/17/2014] [Indexed: 02/07/2023] Open
Abstract
Background The identification and characterisation of differentially methylated regions (DMRs) between phenotypes in the human genome is of prime interest in epigenetics. We present a novel method, DMRcate, that fits replicated methylation measurements from the Illumina HM450K BeadChip (or 450K array) spatially across the genome using a Gaussian kernel. DMRcate identifies and ranks the most differentially methylated regions across the genome based on tunable kernel smoothing of the differential methylation (DM) signal. The method is agnostic to both genomic annotation and local change in the direction of the DM signal, removes the bias incurred from irregularly spaced methylation sites, and assigns significance to each DMR called via comparison to a null model. Results We show that, for both simulated and real data, the predictive performance of DMRcate is superior to those of Bumphunter and Probe Lasso, and commensurate with that of comb-p. For the real data, we validate all array-derived DMRs from the candidate methods on a suite of DMRs derived from whole-genome bisulfite sequencing called from the same DNA samples, using two separate phenotype comparisons. Conclusions The agglomeration of genomically localised individual methylation sites into discrete DMRs is currently best served by a combination of DM-signal smoothing and subsequent threshold specification. The findings also suggest the design of the 450K array shows preference for CpG sites that are more likely to be differentially methylated, but its overall coverage does not adequately reflect the depth and complexity of methylation signatures afforded by sequencing. For the convenience of the research community we have created a user-friendly R software package called DMRcate, downloadable from Bioconductor and compatible with existing preprocessing packages, which allows others to apply the same DMR-finding method on 450K array data. 
Electronic supplementary material The online version of this article (doi:10.1186/1756-8935-8-6) contains supplementary material, which is available to authorized users.
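The combination of kernel smoothing and thresholding described in this abstract can be illustrated with a short Python sketch. It is a generic kernel-weighted average over irregularly spaced sites followed by a simple threshold-and-group step, not DMRcate's actual statistic or significance model; the function names, the bandwidth `sigma`, the `threshold`, the `max_gap` rule and the toy data are all assumptions made for illustration.

```python
import math

def gaussian_smooth(positions, values, sigma):
    """Kernel-weighted average of per-site values at each genomic position."""
    smoothed = []
    for p in positions:
        weights = [math.exp(-0.5 * ((p - q) / sigma) ** 2) for q in positions]
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / sum(weights))
    return smoothed

def call_regions(positions, smoothed, threshold, max_gap):
    """Group consecutive above-threshold sites into (start, end) regions.

    A region is split when neighbouring sites are more than `max_gap`
    bases apart; regions with fewer than two sites are discarded.
    """
    regions, current = [], []
    for p, s in zip(positions, smoothed):
        if s >= threshold and (not current or p - current[-1] <= max_gap):
            current.append(p)
        else:
            if len(current) >= 2:
                regions.append((current[0], current[-1]))
            current = [p] if s >= threshold else []
    if len(current) >= 2:
        regions.append((current[0], current[-1]))
    return regions

# Made-up, sorted site positions and differential-methylation scores.
positions = [100, 150, 220, 5000, 5100, 5150, 5300, 9000]
scores = [0.1, 0.2, 0.1, 2.0, 2.5, 2.2, 1.9, 0.1]
smoothed = gaussian_smooth(positions, scores, sigma=500.0)
regions = call_regions(positions, smoothed, threshold=1.0, max_gap=1000)
```

Because the weights depend on genomic distance rather than on site index, unevenly spaced sites contribute in proportion to how close they are, which is the sense in which smoothing can reduce the bias from irregular spacing.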
Affiliation(s)
- Timothy J Peters
  - CSIRO Digital Productivity Flagship, Riverside Life Sciences Centre, 11 Julius Avenue, North Ryde, New South Wales, 2113 Australia
- Michael J Buckley
  - CSIRO Digital Productivity Flagship, Riverside Life Sciences Centre, 11 Julius Avenue, North Ryde, New South Wales, 2113 Australia
- Aaron L Statham
  - Epigenetics Program, Garvan Institute of Medical Research, Sydney, Australia
- Ruth Pidsley
  - Epigenetics Program, Garvan Institute of Medical Research, Sydney, Australia
- Reginald V Lord
  - School of Medicine, University of Notre Dame, Darlinghurst, New South Wales 2010 Australia
- Susan J Clark
  - Epigenetics Program, Garvan Institute of Medical Research, Sydney, Australia; St Vincent's Clinical School, Faculty of Medicine, University of New South Wales, Darlinghurst, New South Wales 2010 Australia
- Peter L Molloy
  - CSIRO Food and Nutrition Flagship, Riverside Life Sciences Centre, 11 Julius Avenue, Sydney, Australia
|
48
|
Kauppi JP, Kandemir M, Saarinen VM, Hirvenkari L, Parkkonen L, Klami A, Hari R, Kaski S. Towards brain-activity-controlled information retrieval: Decoding image relevance from MEG signals. Neuroimage 2015; 112:288-298. [PMID: 25595505 DOI: 10.1016/j.neuroimage.2014.12.079] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 10/05/2014] [Revised: 11/25/2014] [Accepted: 12/31/2014] [Indexed: 10/24/2022] Open
Abstract
We hypothesize that brain activity can be used to control future information retrieval systems. To this end, we conducted a feasibility study on predicting the relevance of visual objects from brain activity. We analyze both magnetoencephalographic (MEG) and gaze signals from nine subjects who were viewing image collages, a subset of which was relevant to a predetermined task. We report three findings: i) the relevance of an image a subject looks at can be decoded from MEG signals with performance significantly better than chance, ii) fusion of gaze-based and MEG-based classifiers significantly improves the prediction performance compared to using either signal alone, and iii) non-linear classification of the MEG signals using Gaussian process classifiers outperforms linear classification. These findings break new ground for building brain-activity-based interactive image retrieval systems, as well as for systems utilizing feedback both from brain activity and eye movements.
Affiliation(s)
- Jukka-Pekka Kauppi
  - Department of Neuroscience and Biomedical Engineering, Aalto University, Espoo, Finland; Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland.
- Melih Kandemir
  - Heidelberg University HCI/IWR, Heidelberg, Germany; Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Espoo, Finland.
- Veli-Matti Saarinen
  - Department of Neuroscience and Biomedical Engineering, Aalto University, Espoo, Finland.
- Lotta Hirvenkari
  - Department of Neuroscience and Biomedical Engineering, Aalto University, Espoo, Finland.
- Lauri Parkkonen
  - Department of Neuroscience and Biomedical Engineering, Aalto University, Espoo, Finland; MEG Core, Aalto NeuroImaging, Aalto University, Espoo, Finland.
- Arto Klami
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland.
- Riitta Hari
  - Department of Neuroscience and Biomedical Engineering, Aalto University, Espoo, Finland; MEG Core, Aalto NeuroImaging, Aalto University, Espoo, Finland.
- Samuel Kaski
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland; Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Espoo, Finland.
|
49
|
Wang X, Kuwahara H, Gao X. Modeling DNA affinity landscape through two-round support vector regression with weighted degree kernels. BMC Systems Biology 2014; 8 Suppl 5:S5. [PMID: 25605483 PMCID: PMC4305984 DOI: 10.1186/1752-0509-8-s5-s5] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 01/17/2023]
Abstract
BACKGROUND A quantitative understanding of interactions between transcription factors (TFs) and their DNA binding sites is key to the rational design of gene regulatory networks. Recent advances in high-throughput technologies have enabled high-resolution measurements of protein-DNA binding affinity. Importantly, such experiments revealed the complex nature of TF-DNA interactions, whereby the effects of nucleotide changes on the binding affinity were observed to be context dependent. A systematic method to give high-quality estimates of such complex affinity landscapes is, thus, essential to the control of gene expression and the advance of synthetic biology. RESULTS Here, we propose a two-round prediction method that is based on support vector regression (SVR) with weighted degree (WD) kernels. In the first round, a WD kernel with shifts and mismatches is used with SVR to detect the importance of subsequences with different lengths at different positions. The subsequences identified as important in the first round are then fed into a second WD kernel to fit the experimentally measured affinities. To our knowledge, this is the first attempt to increase the accuracy of the affinity prediction by applying two rounds of string kernels and by identifying a small number of crucial k-mers. The proposed method was tested by predicting the binding affinity landscape of Gcn4p in Saccharomyces cerevisiae using datasets from HiTS-FLIP. Our method explicitly identified important subsequences and showed significant performance improvements when compared with other state-of-the-art methods. Based on the identified important subsequences, we discovered two surprisingly stable 10-mers and one sensitive 10-mer which were not reported before. Further tests on four other TFs in S. cerevisiae demonstrated the generality of our method. CONCLUSION In this paper, we proposed a two-round method to quantitatively model the DNA binding affinity landscape. Since the ability to modify genetic parts to fine-tune gene expression rates is crucial to the design of biological systems, such a tool may play an important role in the success of synthetic biology going forward.
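For readers unfamiliar with weighted degree kernels, the following Python sketch computes the basic WD kernel between two equal-length sequences: it counts substrings of length 1 to D that match at the same position and weights each length d by the commonly used coefficient 2(D - d + 1) / (D(D + 1)). This is the plain kernel only. The shifts and mismatches mentioned in the abstract, the SVR step and the two-round procedure are not reproduced, and the function name and example sequences are hypothetical.

```python
def wd_kernel(x, y, max_degree=3):
    """Basic weighted degree kernel for two equal-length sequences.

    For each substring length d = 1..max_degree, count the positions at
    which the length-d substrings of x and y are identical, and weight
    that count by beta_d = 2 * (D - d + 1) / (D * (D + 1)).
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    D = max_degree
    total = 0.0
    for d in range(1, D + 1):
        beta = 2.0 * (D - d + 1) / (D * (D + 1))
        matches = sum(1 for i in range(len(x) - d + 1)
                      if x[i:i + d] == y[i:i + d])
        total += beta * matches
    return total

# Toy comparison of two 4-mers with substring lengths 1 and 2.
k_xy = wd_kernel("ACGT", "ACGA", max_degree=2)
k_xx = wd_kernel("ACGT", "ACGT", max_degree=2)
```

Because matches are only counted at identical positions, the kernel is position-aware, which suits aligned binding-site sequences; a kernel matrix built this way could be supplied to any SVR implementation that accepts precomputed kernels.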
|
50
|
Meher PK, Sahu TK, Rao AR, Wahi SD. A statistical approach for 5' splice site prediction using short sequence motifs and without encoding sequence data. BMC Bioinformatics 2014; 15:362. [PMID: 25420551 PMCID: PMC4702320 DOI: 10.1186/s12859-014-0362-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 01/01/2014] [Accepted: 10/24/2014] [Indexed: 11/17/2022] Open
Abstract
Background Most approaches for splice site prediction are based on machine learning techniques. Although these approaches provide high prediction accuracy, they rely on long sequence windows. Hence, they may not be suitable for predicting novel splice variants from the short sequence reads generated by next generation sequencing technologies. Further, machine learning techniques require numerically encoded data and produce different accuracies with different encoding procedures. Therefore, splice site prediction with short sequence motifs and without encoding sequence data became the motivation for the present study. Results An approach for finding associations among nucleotide bases in the splice site motifs is developed and then used to determine the appropriate window size. In addition, an approach for prediction of donor splice sites using a sum of absolute error criterion is proposed. The proposed approach was compared with commonly used approaches, i.e., Maximum Entropy Modeling (MEM), Maximal Dependency Decomposition (MDD), Weighted Matrix Method (WMM) and Markov Model of first order (MM1), and was found to perform on par with MEM and MDD and better than WMM and MM1 in terms of prediction accuracy. Conclusions The proposed approach can predict donor splice sites with high accuracy using short sequence motifs and hence can be used as a complementary method to the existing approaches. Based on the proposed methodology, a web server was also developed for easy prediction of donor splice sites by users and is available at http://cabgrid.res.in:8080/sspred. Electronic supplementary material The online version of this article (doi:10.1186/s12859-014-0362-6) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Prabina Kumar Meher
  - Division of Statistical Genetics, Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
- Tanmaya Kumar Sahu
  - Centre for Agricultural Bioinformatics, Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
- Atmakuri Ramakrishna Rao
  - Centre for Agricultural Bioinformatics, Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
- Sant Dass Wahi
  - Division of Statistical Genetics, Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
|