1
|
Amilpur S, Bhukya R. EDeepSSP: Explainable deep neural networks for exact splice sites prediction. J Bioinform Comput Biol 2020; 18:2050024. [PMID: 32696716 DOI: 10.1142/s0219720020500249] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/18/2022]
Abstract
Splice site prediction is crucial for understanding gene regulation and gene function, and for better genome annotation. Many computational methods exist for recognizing splice sites. Although most of these methods achieve competent performance, their interpretability remains challenging. Moreover, all traditional machine learning methods extract features manually, which is a tedious job. To address these challenges, we propose a deep learning-based approach (EDeepSSP) that employs a convolutional neural network (CNN) architecture for automatic feature extraction and effectively predicts splice sites. Our model, EDeepSSP, opens up the opaque nature of CNNs by extracting significant motifs and explains why these motifs are vital for predicting splice sites. In this study, experiments have been conducted on six benchmark acceptor and donor datasets from human, cress, and fly. The results show that EDeepSSP outperforms many state-of-the-art approaches. EDeepSSP achieves the highest area under the receiver operating characteristic curve (AUC_ROC) and area under the precision-recall curve (AUC_PR) of 99.32% and 99.26%, respectively, on human donor datasets. We also analyze various filter activities, feature activations, and extracted significant motifs responsible for the splice site prediction. Further, we validate the learned motifs of our model against known motifs in the JASPAR splice site database.
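To make the idea of a convolutional filter acting as a motif detector concrete, here is a minimal pure-Python sketch. It is not the EDeepSSP architecture: the filter `GT_FILTER`, the helper names, and the example sequence are all hypothetical, and the weights are written by hand rather than learned. It only shows how one-hot encoded DNA can be scanned by a single filter and how the highest-scoring window can be read back as a candidate motif.

```python
# Illustrative only: one hand-written filter, not a trained EDeepSSP layer.
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 0/1 values in A, C, G, T order."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def conv1d_scan(encoded, kernel):
    """Slide `kernel` (width x 4 weights) over the sequence without padding."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(len(ALPHABET)):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(total)
    return scores


def best_window(seq, kernel):
    """Return (position, subsequence, score) of the first highest-scoring window."""
    scores = conv1d_scan(one_hot(seq), kernel)
    position = max(range(len(scores)), key=scores.__getitem__)
    return position, seq[position:position + len(kernel)], scores[position]


# Hypothetical filter that rewards the dinucleotide "GT" (assumed for the demo).
GT_FILTER = [
    [0.0, 0.0, 1.0, 0.0],  # position 0 rewards G
    [0.0, 0.0, 0.0, 1.0],  # position 1 rewards T
]

print(best_window("CAGGTAAGT", GT_FILTER))  # (3, 'GT', 2.0)
```

In a trained network the filter weights would be learned from data, and motif extraction would aggregate many high-activation windows rather than a single one.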
Affiliation(s)
- Santhosh Amilpur
- Computer Science and Engineering, National Institute of Technology Warangal, Warangal, Telangana 506004, India
- Raju Bhukya
- Computer Science and Engineering, National Institute of Technology Warangal, Warangal, Telangana 506004, India
|
2
|
Pajares B, Porta J, Porta JM, Sousa CFD, Moreno I, Porta D, Durán G, Vega T, Ortiz I, Muriel C, Alba E, Márquez A. Hereditary breast and ovarian cancer in Andalusian families: a genetic population study. BMC Cancer 2018; 18:647. [PMID: 29884136 PMCID: PMC5994127 DOI: 10.1186/s12885-018-4537-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 06/30/2017] [Accepted: 05/21/2018] [Indexed: 11/24/2022] Open
Abstract
Background The BRCA1/2 mutation profile varies in Spain according to the geographical area studied. The mutational profile of BRCA1/2 in families at risk for hereditary breast and ovarian cancer has not so far been reported in Andalusia (southern Spain). Methods We analysed BRCA1/2 germline mutations in 562 high-risk cases with breast and/or ovarian cancer from Andalusian families from 2010 to 2015. Results Among the 562 cases, 120 (21.4%) carried a germline pathogenic mutation in BRCA1/2; 50 in BRCA1 (41.7%) and 70 in BRCA2 (58.3%). We detected 67 distinct mutations (29 in BRCA1 and 38 in BRCA2), of which 3 in BRCA1 (c.845C>A, c.1222_1223delAC, c.2527delA) and 5 in BRCA2 (c.293T>G, c.5558_5559delGT, c.6034delT, c.6650_6654delAAGAT, c.6652delG) had not been previously described. The most frequent mutations in BRCA1 were c.5078_5080delCTG (10%) and c.5123C>A (10%), and in BRCA2 they were c.9018C>A (14%) and c.5720_5723delCTCT (8%). We identified 5 variants of unknown significance (VUS), all in BRCA2 (c.5836T>C, c.6323G>T, c.9501+3A>T, c.8022_8030delGATAATGGA, c.10186A>C). We detected 76 polymorphisms (31 in BRCA1, 45 in BRCA2) not associated with breast cancer risk. Conclusions This is the first study reporting the mutational profile of BRCA1/2 in Andalusia. We identified 21.4% of patients harbouring BRCA1/2 mutations, 58.3% of them in BRCA2. We also characterized the clinical data, mutational profile, VUS and haplotype profile.
Affiliation(s)
- Bella Pajares
- Clinical Oncology Unit Hospitales Universitarios Regional y Virgen de la Victoria. Instituto de Investigación Biomédica de Málaga (IBIMA), Campus Teatinos s/n. 29010, Malaga, Spain
- Javier Porta
- Genologica, Paseo de la Farola 16, 29016, Malaga, Spain
- Cristina Fernández-de Sousa
- Clinical Oncology Unit Hospitales Universitarios Regional y Virgen de la Victoria. Instituto de Investigación Biomédica de Málaga (IBIMA), Campus Teatinos s/n. 29010, Malaga, Spain
- Ignacio Moreno
- Clinical Oncology Unit Hospitales Universitarios Regional y Virgen de la Victoria. Instituto de Investigación Biomédica de Málaga (IBIMA), Campus Teatinos s/n. 29010, Malaga, Spain
- Daniel Porta
- Genologica, Paseo de la Farola 16, 29016, Malaga, Spain
- Gema Durán
- Clinical Oncology Unit Hospitales Universitarios Regional y Virgen de la Victoria. Instituto de Investigación Biomédica de Málaga (IBIMA), Campus Teatinos s/n. 29010, Malaga, Spain
- Tamara Vega
- Genologica, Paseo de la Farola 16, 29016, Malaga, Spain
- Carolina Muriel
- Clinical Oncology Unit Hospitales Universitarios Regional y Virgen de la Victoria. Instituto de Investigación Biomédica de Málaga (IBIMA), Campus Teatinos s/n. 29010, Malaga, Spain
- Emilio Alba
- Clinical Oncology Unit Hospitales Universitarios Regional y Virgen de la Victoria. Instituto de Investigación Biomédica de Málaga (IBIMA), Campus Teatinos s/n. 29010, Malaga, Spain
- Antonia Márquez
- Clinical Oncology Unit Hospitales Universitarios Regional y Virgen de la Victoria. Instituto de Investigación Biomédica de Málaga (IBIMA), Campus Teatinos s/n. 29010, Malaga, Spain
|
3
|
Kamath U, De Jong K, Shehu A. Effective automated feature construction and selection for classification of biological sequences. PLoS One 2014; 9:e99982. [PMID: 25033270 PMCID: PMC4102475 DOI: 10.1371/journal.pone.0099982] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Received: 09/12/2013] [Accepted: 05/21/2014] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND Many open problems in bioinformatics involve elucidating underlying functional signals in biological sequences. DNA sequences, in particular, are characterized by rich architectures in which functional signals are increasingly found to combine local and distal interactions at the nucleotide level. Problems of interest include detection of regulatory regions, splice sites, exons, hypersensitive sites, and more. These problems naturally lend themselves to formulation as classification problems in machine learning. When classification is based on features extracted from the sequences under investigation, success is critically dependent on the chosen set of features. METHODOLOGY We present an algorithmic framework (EFFECT) for automated detection of functional signals in biological sequences. We focus here on classification problems involving DNA sequences which state-of-the-art work in machine learning shows to be challenging and to involve complex combinations of local and distal features. EFFECT uses a two-stage process to first construct a set of candidate sequence-based features and then select the most effective subset for the classification task at hand. Both stages make heavy use of evolutionary algorithms to efficiently guide the search towards informative features capable of discriminating between sequences that contain a particular functional signal and those that do not. RESULTS To demonstrate its generality, EFFECT is applied to three separate problems of importance in DNA research: the recognition of hypersensitive sites, splice sites, and ALU sites. Comparisons with state-of-the-art algorithms show that the framework is both general and powerful. In addition, a detailed analysis of the constructed features shows that they contain valuable biological information about DNA architecture, allowing biologists and other researchers to directly inspect the features and potentially use the insights obtained to assist wet-laboratory studies on retainment or modification of a specific signal. Code, documentation, and all data for the applications presented here are provided for the community at http://www.cs.gmu.edu/~ashehu/?q=OurTools.
Affiliation(s)
- Uday Kamath
- Computer Science, George Mason University, Fairfax, Virginia, United States of America
- Kenneth De Jong
- Computer Science, George Mason University, Fairfax, Virginia, United States of America
- Krasnow Institute, George Mason University, Fairfax, Virginia, United States of America
- Amarda Shehu
- Computer Science, George Mason University, Fairfax, Virginia, United States of America
- Bioengineering, George Mason University, Fairfax, Virginia, United States of America
- School of Systems Biology, George Mason University, Fairfax, Virginia, United States of America
|
4
|
Two new methods for DNA splice site prediction based on neuro-fuzzy network and clustering. Neural Comput Appl 2013. [DOI: 10.1007/s00521-012-1257-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
|
5
|
Kamath U, Compton J, Islamaj-Doğan R, De Jong KA, Shehu A. An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice site prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2012; 9:1387-1398. [PMID: 22508909 DOI: 10.1109/tcbb.2012.53] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 05/31/2023]
Abstract
Associating functional information with biological sequences remains a challenge for machine learning methods. The performance of these methods often depends on deriving predictive features from the sequences sought to be classified. Feature generation is a difficult problem, as the connection between the sequence features and the sought property is not known a priori. It is often the task of domain experts or exhaustive feature enumeration techniques to generate a few features whose predictive power is then tested in the context of classification. This paper proposes an evolutionary algorithm to effectively explore a large feature space and generate predictive features from sequence data. The effectiveness of the algorithm is demonstrated on an important component of the gene-finding problem, DNA splice site prediction. This application is chosen due to the complexity of the features needed to obtain high classification accuracy and precision. Our results test the effectiveness of the obtained features in the context of classification by Support Vector Machines and show significant improvement in accuracy and precision over state-of-the-art approaches.
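As a rough illustration of evolutionary search over sequence features, the sketch below evolves short k-mer motifs whose presence separates positive from negative sequences. It is a deliberately simplified stand-in, not the algorithm from the paper: the fitness function (difference in occurrence rates), the mutation operator, and all names and parameters (`fitness`, `mutate`, `evolve`, `k`, `pop_size`, `generations`, `seed`) are assumptions made for this example, and no Support Vector Machine is involved.

```python
import random

ALPHABET = "ACGT"


def fitness(motif, positives, negatives):
    """Absolute gap between the fractions of positive and negative sequences containing the motif."""
    pos_rate = sum(motif in seq for seq in positives) / len(positives)
    neg_rate = sum(motif in seq for seq in negatives) / len(negatives)
    return abs(pos_rate - neg_rate)


def mutate(motif, rng):
    """Replace one randomly chosen position with a random nucleotide."""
    index = rng.randrange(len(motif))
    return motif[:index] + rng.choice(ALPHABET) + motif[index + 1:]


def evolve(positives, negatives, k=2, pop_size=8, generations=20, seed=0):
    """Evolve k-mer features with truncation selection, mutation, and elitism."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible

    def score(motif):
        return fitness(motif, positives, negatives)

    population = ["".join(rng.choice(ALPHABET) for _ in range(k))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half unchanged, refill the rest with mutated copies.
        survivors = sorted(population, key=score, reverse=True)[:pop_size // 2]
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    best = max(population, key=score)
    return best, score(best)


# Toy data, invented for the demo: positives contain "GT", negatives do not.
positives = ["AAGGTAAG", "CAGGTGAG", "TTGGTACC"]
negatives = ["AAGCCAAG", "CAGCAGAG", "TTGACACC"]
best_motif, best_score = evolve(positives, negatives)
print(best_motif, best_score)
```

Because survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next. In practice the selected features would then be passed to a classifier and evaluated on held-out data.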
Affiliation(s)
- Uday Kamath
- Department of Computer Science, George Mason University, Ashburn, VA 20147, USA.
|
6
|
Won KJ, Saunders C, Prügel-Bennett A. Evolving Fisher kernels for biological sequence classification. Evolutionary Computation 2012; 21:83-105. [PMID: 22181969 DOI: 10.1162/evco_a_00065] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 05/31/2023]
Abstract
Fisher kernels have been successfully applied to many problems in bioinformatics. However, their success depends on the quality of the generative model upon which they are built. For Fisher kernel techniques to be used on novel problems, a mechanism for creating accurate generative models is required. A novel framework is presented for automatically creating domain-specific generative models that can be used to produce Fisher kernels for support vector machines (SVMs) and other kernel methods. The framework enables the capture of prior knowledge and addresses the issue of domain-specific kernels, both of which are currently lacking in many kernel-based methods. To obtain the generative model, genetic algorithms are used to evolve the structure of hidden Markov models (HMMs). A Fisher kernel is subsequently created from the HMM, and used in conjunction with an SVM, to improve the discriminative power. This paper investigates the effectiveness of the proposed method, named GA-SVM. We show that its performance is comparable to, if not better than, other state-of-the-art methods in classifying secretory protein sequences of malaria. More interestingly, it showed better results than the sequence-similarity-based approach, without the need for additional homologous sequence information in protein enzyme family classification. The experiments clearly demonstrate that GA-SVM is a novel way to find features with good performance from biological sequences that does not require extensive tuning of a complex model.
Affiliation(s)
- K-J Won
- Department of Genetics, Institute for Diabetes, Obesity and Metabolism, University of Pennsylvania, Translational Research Center, 12-111, 3400 Civic Center Blvd., Philadelphia, PA 19104, USA.
|
7
|
Varadwaj P, Purohit N, Arora B. Detection of Splice Sites Using Support Vector Machine. Communications in Computer and Information Science 2009. [DOI: 10.1007/978-3-642-03547-0_47] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 01/21/2023]
|
8
|
Rajapakse JC, Ho LS. Markov encoding for detecting signals in genomic sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2005; 2:131-42. [PMID: 17044178 DOI: 10.1109/tcbb.2005.27] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Indexed: 05/12/2023]
Abstract
We present a technique to encode the inputs to neural networks for the detection of signals in genomic sequences. The encoding is based on lower-order Markov models which incorporate known biological characteristics in genomic sequences. The neural networks then learn intrinsic higher-order dependencies of nucleotides at the signal sites. We demonstrate the efficacy of the Markov encoding method in the detection of three genomic signals, namely, splice sites, transcription start sites, and translation initiation sites.
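One way to picture a Markov-based input encoding is sketched below: a first-order Markov model is estimated from training sequences, and each nucleotide is then replaced by its conditional probability given the preceding nucleotide. The abstract does not spell out the exact encoding, so the model order, the pseudocount smoothing, the uniform value used for the first position, and the function names `fit_first_order` and `markov_encode` are all assumptions for this illustration. Sequences are assumed to contain only uppercase A, C, G, and T.

```python
ALPHABET = "ACGT"


def fit_first_order(sequences, pseudocount=1.0):
    """Estimate P(current | previous) from sequences, with additive smoothing."""
    counts = {prev: {cur: pseudocount for cur in ALPHABET} for prev in ALPHABET}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1.0
    probs = {}
    for prev, row in counts.items():
        total = sum(row.values())
        probs[prev] = {cur: count / total for cur, count in row.items()}
    return probs


def markov_encode(seq, probs):
    """Map each position to P(seq[i] | seq[i-1]); the first position gets a uniform prior."""
    first = 1.0 / len(ALPHABET)
    return [first] + [probs[prev][cur] for prev, cur in zip(seq, seq[1:])]


# Toy training set, invented for the demo.
model = fit_first_order(["GTGT"])
print(markov_encode("GTA", model))
```

The resulting vector of probabilities, rather than a raw one-hot matrix, would then serve as the input to a neural network, which is left to learn the remaining higher-order dependencies.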
Affiliation(s)
- Jagath C Rajapakse
- BioInformatics Research Center, School of Computer Engineering, Nanyang Technological University, Singapore 639798.
|
9
|
Harrington E, Herbrich R, Kivinen J, Platt J, Williamson RC. Online Bayes Point Machines. Advances in Knowledge Discovery and Data Mining 2003. [DOI: 10.1007/3-540-36175-8_24] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 02/11/2023]
|
10
|
Sonnenburg S, Rätsch G, Jagota A, Müller KR. New Methods for Splice Site Recognition. ARTIFICIAL NEURAL NETWORKS — ICANN 2002 2002. [DOI: 10.1007/3-540-46084-5_54] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.8] [Indexed: 01/15/2023]
|