1. Mahmood Aamir K, Bilal M, Ramzan M, Attique Khan M, Nam Y, Kadry S. Classification of Retroviruses Based on Genomic Data Using RVGC. Computers, Materials & Continua 2021; 69:3829-3844. [DOI: 10.32604/cmc.2021.017835] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2021] [Accepted: 04/17/2021] [Indexed: 08/25/2024]
2. Multiple instance learning for sequence data with across bag dependencies. Int J Mach Learn Cyb 2019. [DOI: 10.1007/s13042-019-01021-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
3. Mittal S, Banduni P, Mallikarjuna MG, Rao AR, Jain PA, Dash PK, Thirunavukkarasu N. Structural, Functional, and Evolutionary Characterization of Major Drought Transcription Factors Families in Maize. Front Chem 2018; 6:177. [PMID: 29876347 PMCID: PMC5974147 DOI: 10.3389/fchem.2018.00177] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 08/31/2017] [Accepted: 05/03/2018] [Indexed: 01/22/2023] Open
Abstract
Drought is one of the major threats to maize yield, especially in subtropical production systems. Understanding the genes and regulatory mechanisms of drought tolerance is important to sustain the yield. Transcription factors (TFs) play a major role in gene regulation under drought stress. In the present study, a set of 15 major TF families comprising 1,436 genes was structurally and functionally characterized. The functional annotation indicated that the genes were involved in ABA signaling, ROS scavenging, photosynthesis, stomatal regulation, and sucrose metabolism. Duplication was identified as the primary force in divergence and expansion of TF families. Phylogenetic relationships were established for individual and combined TF families. Phylogenetic analysis clustered the genes into specific and mixed groups. Gene structure analysis revealed that more genes were intron-rich than intron-less. Drought-responsive cis-regulatory elements such as ABREA, ABREB, DRE1, and DRECRTCOREAT have been identified. Expression and interaction analyses identified a leaf-specific bZIP TF, GRMZM2G140355, as a potential contributor toward drought tolerance in maize. A protein-protein interaction network of 269 drought-responsive genes belonging to different TFs has been provided. The information generated on structural and functional characteristics, expression, and interaction of the drought-related TF families will be useful to decipher the drought tolerance mechanisms and to breed drought-tolerant genotypes in maize.
Affiliation(s)
- Shikha Mittal
  - Division of Genetics, ICAR-Indian Agricultural Research Institute, New Delhi, India
- Pooja Banduni
  - Division of Genetics, ICAR-Indian Agricultural Research Institute, New Delhi, India
- Atmakuri R Rao
  - Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- Prashant A Jain
  - Department of Computational Biology & Bioinformatics, J.I.B.B., Sam Higginbottom University of Agriculture, Technology and Sciences, Allahabad, India
- Prasanta K Dash
  - National Research Centre on Plant Biotechnology, New Delhi, India
4. Szalkai B, Grolmusz V. Near perfect protein multi-label classification with deep neural networks. Methods 2018; 132:50-56. [DOI: 10.1016/j.ymeth.2017.06.034] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 03/30/2017] [Revised: 05/09/2017] [Accepted: 06/30/2017] [Indexed: 10/19/2022] Open
5. Wang H, Yan L, Huang H, Ding C. From Protein Sequence to Protein Function via Multi-Label Linear Discriminant Analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:503-513. [PMID: 27429445 DOI: 10.1109/tcbb.2016.2591529] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Indexed: 06/06/2023]
Abstract
Sequence describes the primary structure of a protein, which contains important structural, characteristic, and genetic information and thereby motivates many sequence-based computational approaches to infer protein function. Among them, feature-based approaches attract increased attention because they make predictions from a set of transformed and more biologically meaningful sequence features. However, original features extracted from sequence are usually of high dimensionality and often compromised by irrelevant patterns; therefore, dimension reduction is necessary prior to classification for efficient and effective protein function prediction. A protein usually performs several different functions within an organism, which makes protein function prediction a multi-label classification problem. In machine learning, multi-label classification deals with problems where each object may belong to more than one class. As a well-known feature reduction method, linear discriminant analysis (LDA) has been successfully applied in many practical applications. It is, however, by nature designed for single-label classification, in which each object can belong to exactly one class. Because directly applying LDA in multi-label classification causes ambiguity when computing scatter matrices, we apply a new Multi-label Linear Discriminant Analysis (MLDA) approach to address this problem while preserving the powerful classification capability inherited from classical LDA. We further extend MLDA by l1-normalization to overcome the problem of over-counting data points with multiple labels. In addition, we incorporate biological network data using Laplacian embedding into our method, and assess the reliability of predicted putative functions. Extensive empirical evaluations demonstrate promising results of our methods.
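The over-counting problem mentioned in the abstract above can be illustrated with a small sketch. It assumes one plausible reading of the l1-normalization step: each sample's binary label vector is divided by its number of labels, so a sample with several labels contributes a total weight of 1 across all classes instead of 1 per class. The function names and toy data are hypothetical, and the sketch only computes the weighted class means that would feed the scatter matrices; it is not the authors' full MLDA method.

```python
def l1_normalize_labels(labels):
    """Divide each binary label row by its number of active labels."""
    normalized = []
    for row in labels:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def weighted_class_means(features, weights):
    """Mean feature vector per class, weighting each sample by its label weight.

    Assumes every class has at least one sample with non-zero weight.
    """
    n_classes = len(weights[0])
    n_dims = len(features[0])
    means = []
    for k in range(n_classes):
        mass = sum(w[k] for w in weights)
        mean = [
            sum(w[k] * x[d] for x, w in zip(features, weights)) / mass
            for d in range(n_dims)
        ]
        means.append(mean)
    return means


# Toy data (hypothetical): three proteins, two features, two function labels.
features = [[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]]
labels = [[1, 0], [1, 1], [0, 1]]  # the second protein has both functions

weights = l1_normalize_labels(labels)
means = weighted_class_means(features, weights)
```

In this toy case the doubly labeled protein pulls each class mean with weight 0.5 rather than 1, which is the effect the normalization is meant to achieve.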
6. Iqbal MJ, Faye I, Said AMD, Samir BB. Computational Technique for an Efficient Classification of Protein Sequences With Distance-Based Sequence Encoding Algorithm. Comput Intell 2017. [DOI: 10.1111/coin.12069] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/30/2022]
Affiliation(s)
- Muhammad Javed Iqbal
  - Computer and Information Sciences Department; Universiti Teknologi PETRONAS; Perak Malaysia
- Ibrahima Faye
  - Fundamental and Applied Sciences Department; Universiti Teknologi PETRONAS; Perak Malaysia
- Abas MD Said
  - Computer and Information Sciences Department; Universiti Teknologi PETRONAS; Perak Malaysia
7. Natural vs. random protein sequences: Discovering combinatorics properties on amino acid words. J Theor Biol 2015; 391:13-20. [PMID: 26656109 DOI: 10.1016/j.jtbi.2015.11.022] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 02/02/2015] [Revised: 07/29/2015] [Accepted: 11/23/2015] [Indexed: 01/02/2023]
Abstract
Random mutations and natural selection have driven the evolution of the protein amino acid sequences that we observe in nature today. The question of which is the dominant force in protein evolution still lacks an unambiguous answer. Random mutations tend to randomize protein sequences, whereas selection mechanisms are expected to impose rigid constraints on amino acid sequences in order to ensure correct functionality. Moreover, the space of all possible amino acid sequences is so astonishingly large that a well-tuned amino acid sequence could reasonably be indistinguishable from a random one. To study the possibility of discriminating between random and natural amino acid sequences, we introduce different measures of association between pairs of amino acids in a sequence, and apply them to a dataset of 1047 natural protein sequences and 10,470 random sequences, carefully generated in order to preserve the relative length and amino acid distribution of the natural proteins. We analyze the multidimensional measures with machine learning techniques and show that, to a reasonable extent, natural protein sequences can be differentiated from random ones.
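One simple way to build random sequences that keep the length and amino acid composition of a natural protein is to shuffle its residues. The sketch below takes that approach as an assumption; the abstract does not say how the random sequences were actually generated. The ratio of 10 decoys per natural sequence is inferred from the 1047 and 10,470 counts, and the function name and example sequence are made up.

```python
import random
from collections import Counter


def shuffled_decoys(sequence, n_decoys, seed=0):
    """Return n_decoys random permutations of sequence.

    Shuffling keeps the length and the amino acid counts identical to
    the natural sequence while destroying the residue order.
    """
    rng = random.Random(seed)
    decoys = []
    for _ in range(n_decoys):
        residues = list(sequence)
        rng.shuffle(residues)
        decoys.append("".join(residues))
    return decoys


natural = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up example sequence
decoys = shuffled_decoys(natural, n_decoys=10)

# Every decoy has the same composition as the natural sequence.
same_composition = all(Counter(d) == Counter(natural) for d in decoys)
```

Because composition is held fixed, any classifier that separates the two sets has to rely on residue order, for example on pairwise association measures like those the abstract describes.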
8. Darsey JA, Griffin WO, Joginipelli S, Melapu VK. Architecture and biological applications of artificial neural networks: a tuberculosis perspective. Methods Mol Biol 2015; 1260:269-83. [PMID: 25502388 DOI: 10.1007/978-1-4939-2239-0_17] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Indexed: 02/16/2023]
Abstract
Advancement of science and technology has prompted researchers to develop new intelligent systems that can solve a variety of problems such as pattern recognition, prediction, and optimization. The ability of the human brain to learn in a fashion that tolerates noise and error has attracted many researchers and provided the starting point for the development of artificial neural networks: the intelligent systems. Intelligent systems can adapt to the environment or data and can maximize the chances of success or improve the efficiency of a search. Due to massive parallelism with large numbers of interconnected processors and their ability to learn from the data, neural networks can solve a variety of challenging computational problems. Neural networks have the ability to derive meaning from complicated and imprecise data; they are used to detect patterns and trends that are too complex for humans or other computer systems to notice. Solutions to the toughest problems will not be found through one narrow specialization; therefore we need to combine interdisciplinary approaches to discover the solutions to a variety of problems. Many researchers in different disciplines such as medicine, bioinformatics, molecular biology, and pharmacology have successfully applied artificial neural networks. This chapter helps the reader understand the basics of artificial neural networks, their applications, and methodology; it also outlines the network learning process and architecture. We present a brief outline of the application of neural networks to medical diagnosis, drug discovery, gene identification, and protein structure prediction. We conclude with a summary of the results from our study on tuberculosis data using neural networks, in diagnosing active tuberculosis, and predicting chronic vs. infiltrative forms of tuberculosis.
Affiliation(s)
- Jerry A Darsey
  - Department of Chemistry, University of Arkansas at Little Rock, 2801 S University Ave, Little Rock, AR, 72204, USA
9. Arango-Argoty GA, Jaramillo-Garzón JA, Castellanos-Domínguez G. Feature extraction by statistical contact potentials and wavelet transform for predicting subcellular localizations in gram negative bacterial proteins. J Theor Biol 2015; 364:121-30. [PMID: 25219623 DOI: 10.1016/j.jtbi.2014.08.051] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 08/08/2013] [Revised: 08/27/2014] [Accepted: 08/28/2014] [Indexed: 11/16/2022]
Abstract
Predicting the localization of a protein has become a useful practice for inferring its function. Most of the reported methods to predict subcellular localizations in Gram-negative bacterial proteins make use of standard protein representations that generally do not take into account the distribution of the amino acids and the structural information of the proteins. Here, we propose a protein representation based on the structural information contained in the pairwise statistical contact potentials. The wavelet transform decodes the information contained in the primary structure of the proteins, allowing the identification of patterns along the proteins, which are used to characterize the subcellular localizations. Then, a support vector machine classifier is trained to categorize them. Cellular compartments like periplasm and extracellular medium are difficult to predict, having a high false negative rate. The wavelet-based method achieves an overall high performance while maintaining a low false negative rate, particularly, on "periplasm" and "extracellular medium". Our results suggest the proposed protein characterization is a useful alternative to representing and predicting protein sequences over the classical and cutting edge protein depictions.
Affiliation(s)
- G A Arango-Argoty
  - Signal Processing and Recognition Group, Universidad Nacional de Colombia, s. Manizales, Campus La Nubia, km 7 via al Magdalena, Manizales, Colombia; Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, 3501 Fifth Ave, Pittsburgh, PA 15260, USA
- J A Jaramillo-Garzón
  - Signal Processing and Recognition Group, Universidad Nacional de Colombia, s. Manizales, Campus La Nubia, km 7 via al Magdalena, Manizales, Colombia; Research Center of the Instituto Tecnologico Metropolitano, Calle 73 No 76A-354, Medellín, Colombia
- G Castellanos-Domínguez
  - Signal Processing and Recognition Group, Universidad Nacional de Colombia, s. Manizales, Campus La Nubia, km 7 via al Magdalena, Manizales, Colombia
10. Szilágyi SM, Szilágyi L. A fast hierarchical clustering algorithm for large-scale protein sequence data sets. Comput Biol Med 2014; 48:94-101. [PMID: 24657908 DOI: 10.1016/j.compbiomed.2014.02.016] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 11/13/2013] [Revised: 02/10/2014] [Accepted: 02/25/2014] [Indexed: 10/25/2022]
Abstract
TRIBE-MCL is a Markov clustering algorithm that operates on a graph built from pairwise similarity information of the input data. Edge weights stored in the stochastic similarity matrix are alternately fed to the two main operations, inflation and expansion, and are normalized in each main loop to maintain the probabilistic constraint. In this paper we propose an efficient implementation of the TRIBE-MCL clustering algorithm, suitable for fast and accurate grouping of protein sequences. A modified sparse matrix structure is introduced that can efficiently handle most operations of the main loop. Taking advantage of the symmetry of the similarity matrix, a fast matrix squaring formula is also introduced to facilitate the time-consuming expansion. The proposed algorithm was tested on protein sequence databases like SCOP95. In terms of efficiency, the proposed solution improves execution speed by two orders of magnitude, compared to recently published efficient solutions, reducing the total runtime well below 1 min in the case of the 11,944 proteins of SCOP95. This improvement in computation time is reached without losing anything from the partition quality. Convergence is generally reached in approximately 50 iterations. The efficient execution enabled us to perform a thorough evaluation of classification results and to formulate recommendations regarding the choice of the algorithm's parameter values.
Affiliation(s)
- Sándor M Szilágyi
  - Petru Maior University, Department of Informatics, Str. Nicolae Iorga Nr. 1, 540088 Tîrgu Mureş, Romania
- László Szilágyi
  - Budapest University of Technology and Economics, Department of Control Engineering and Information Technology, Magyar tudósok krt. 2, H-1117 Budapest, Hungary; Sapientia University of Transylvania, Faculty of Technical and Human Sciences, Şoseaua Sighişoarei 1/C, 540485 Tîrgu Mureş, Romania
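The expansion, inflation, and normalization loop described in the abstract for entry 10 can be sketched with a naive dense-matrix version of Markov clustering. This is not the authors' sparse implementation or their fast squaring formula; the inflation exponent `r = 2.0`, the cluster read-out threshold, and the toy similarity matrix are illustrative assumptions. The default of 50 iterations mirrors the convergence figure quoted in the abstract.

```python
def normalize_columns(matrix):
    """Rescale every column so that it sums to 1 (column-stochastic)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(matrix):
    """Expansion: square the matrix."""
    n = len(matrix)
    return [
        [sum(matrix[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def inflate(matrix, r):
    """Inflation: raise each entry to the power r, then renormalize columns."""
    return normalize_columns([[v ** r for v in row] for row in matrix])


def markov_cluster(similarity, r=2.0, iterations=50):
    """Run a fixed number of expansion/inflation rounds and read out clusters."""
    m = normalize_columns(similarity)
    for _ in range(iterations):
        m = inflate(expand(m), r)
    clusters = set()
    for row in m:
        # A row that still carries mass marks the nodes attracted to it.
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Toy similarity matrix (hypothetical): two groups of two sequences each,
# with self-similarity on the diagonal and no similarity across groups.
similarity = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
clusters = markov_cluster(similarity)
```

The dense triple loop in `expand` costs cubic time in the number of sequences, which is the step the abstract's sparse structure and symmetric squaring formula are designed to speed up.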
11.
Abstract
Sequence classification has a broad range of applications such as genomic analysis, information retrieval, health informatics, finance, and anomaly detection. Unlike classification on feature vectors, sequences do not have explicit features. Even with sophisticated feature selection techniques, the dimensionality of potential features may still be very high and the sequential nature of features is difficult to capture. This makes sequence classification a more challenging task than classification on feature vectors. In this paper, we present a brief review of the existing work on sequence classification. We summarize sequence classification in terms of methodologies and application domains. We also provide a review of several extensions of the sequence classification problem, such as early classification on sequences and semi-supervised learning on sequences.
Affiliation(s)
- Jian Pei
  - Simon Fraser University, Burnaby, BC, Canada
12. Saraç ÖS, Atalay V, Cetin-Atalay R. GOPred: GO molecular function prediction by combined classifiers. PLoS One 2010; 5:e12382. [PMID: 20824206 PMCID: PMC2930845 DOI: 10.1371/journal.pone.0012382] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Received: 10/13/2009] [Accepted: 06/22/2010] [Indexed: 11/18/2022] Open
Abstract
Functional protein annotation is an important matter for in vivo and in silico biology. Several computational methods have been proposed that make use of a wide range of features such as motifs, domains, homology, structure and physicochemical properties. There is no single method that performs best in all functional classification problems because information obtained using any of these features depends on the function to be assigned to the protein. In this study, we portray a novel approach that combines different methods to better represent protein function. First, we formulated the function annotation problem as a classification problem defined on 300 different Gene Ontology (GO) terms from molecular function aspect. We presented a method to form positive and negative training examples while taking into account the directed acyclic graph (DAG) structure and evidence codes of GO. We applied three different methods and their combinations. Results show that combining different methods improves prediction accuracy in most cases. The proposed method, GOPred, is available as an online computational annotation tool (http://kinaz.fen.bilkent.edu.tr/gopred).
Affiliation(s)
- Ömer Sinan Saraç
  - Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
- Volkan Atalay
  - Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
- Rengul Cetin-Atalay
  - Department of Molecular Biology and Genetics, Faculty of Science, Bilkent University, Ankara, Turkey
13. Ferreira PG, Azevedo PJ. Evaluating deterministic motif significance measures in protein databases. Algorithms Mol Biol 2007; 2:16. [PMID: 18157916 PMCID: PMC2254621 DOI: 10.1186/1748-7188-2-16] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Received: 05/15/2007] [Accepted: 12/24/2007] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Assessing the outcome of motif mining algorithms is an essential task, as the number of reported motifs can be very large. Significance measures play a central role in automatically ranking those motifs, and therefore alleviating the analysis work. Spotting the most interesting and relevant motifs is then dependent on the choice of the right measures. The combined use of several measures may provide more robust results. However caution has to be taken in order to avoid spurious evaluations. RESULTS From the set of conducted experiments, it was verified that several of the selected significance measures show a very similar behavior in a wide range of situations therefore providing redundant information. Some measures have proved to be more appropriate to rank highly conserved motifs, while others are more appropriate for weakly conserved ones. Support appears as a very important feature to be considered for correct motif ranking. We observed that not all the measures are suitable for situations with poorly balanced class information, like for instance, when positive data is significantly less than negative data. Finally, a visualization scheme was proposed that, when several measures are applied, enables an easy identification of high scoring motifs. CONCLUSION In this work we have surveyed and categorized 14 significance measures for pattern evaluation. Their ability to rank three types of deterministic motifs was evaluated. Measures were applied in different testing conditions, where relations were identified. This study provides some pertinent insights on the choice of the right set of significance measures for the evaluation of deterministic motifs extracted from protein databases.
Affiliation(s)
- Pedro Gabriel Ferreira
  - Department of Informatics, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
- Paulo J Azevedo
  - Department of Informatics, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
14. Sarac OS, Gürsoy-Yüzügüllü O, Cetin-Atalay R, Atalay V. Subsequence-based feature map for protein function classification. Comput Biol Chem 2007; 32:122-30. [PMID: 18243801 DOI: 10.1016/j.compbiolchem.2007.11.004] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 08/09/2007] [Accepted: 11/30/2007] [Indexed: 11/19/2022]
Abstract
Automated classification of proteins is indispensable for further in vivo investigation of excessive number of unknown sequences generated by large scale molecular biology techniques. This study describes a discriminative system based on feature space mapping, called subsequence profile map (SPMap) for functional classification of protein sequences. SPMap takes into account the information coming from the subsequences of a protein. A group of protein sequences that belong to the same level of classification is decomposed into fixed-length subsequences and they are clustered to obtain a representative feature space mapping. Mapping is defined as the distribution of the subsequences of a protein sequence over these clusters. The resulting feature space representation is used to train discriminative classifiers for functional families. The aim of this approach is to incorporate information coming from important subregions that are conserved over a family of proteins while avoiding the difficult task of explicit motif identification. The performance of the method was assessed through tests on various protein classification tasks. Our results showed that SPMap is capable of high accuracy classification in most of these tasks. Furthermore SPMap is fast and scalable enough to handle large datasets.
Affiliation(s)
- Omer Sinan Sarac
  - Department of Computer Engineering, Middle East Technical University, 06531 Ankara, Turkey
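The mapping step described in the abstract for entry 14, where a protein is represented by the distribution of its fixed-length subsequences over a set of clusters, can be sketched as follows. This is a simplified illustration, not the SPMap implementation: the clustering stage is replaced by a given list of cluster representatives, overlapping windows and Hamming distance are assumptions, and all names and data are hypothetical.

```python
def fixed_length_subsequences(sequence, length):
    """All overlapping windows of the given length."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def subsequence_profile(sequence, representatives, length):
    """Fraction of the sequence's windows assigned to each cluster.

    Assumes the sequence is at least `length` residues long.
    """
    counts = [0] * len(representatives)
    windows = fixed_length_subsequences(sequence, length)
    for window in windows:
        nearest = min(
            range(len(representatives)),
            key=lambda k: hamming(window, representatives[k]),
        )
        counts[nearest] += 1
    return [c / len(windows) for c in counts]


representatives = ["AAA", "KKK"]  # hypothetical cluster representatives
profile = subsequence_profile("AAAKK", representatives, length=3)
```

The resulting vector has one entry per cluster regardless of sequence length, which is what allows a standard discriminative classifier to be trained on it.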
15. Exarchos TP, Papaloukas C, Lampros C, Fotiadis DI. Mining sequential patterns for protein fold recognition. J Biomed Inform 2007; 41:165-79. [PMID: 17573243 DOI: 10.1016/j.jbi.2007.05.004] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 11/20/2006] [Revised: 04/06/2007] [Accepted: 05/05/2007] [Indexed: 10/23/2022]
Abstract
Protein data contain discriminative patterns that can be used in many beneficial applications if they are defined correctly. In this work sequential pattern mining (SPM) is utilized for sequence-based fold recognition. Protein classification in terms of fold recognition plays an important role in computational protein analysis, since it can contribute to the determination of the function of a protein whose structure is unknown. Specifically, one of the most efficient SPM algorithms, cSPADE, is employed for the analysis of protein sequence. A classifier uses the extracted sequential patterns to classify proteins in the appropriate fold category. For training and evaluating the proposed method we used the protein sequences from the Protein Data Bank and the annotation of the SCOP database. The method exhibited an overall accuracy of 25% in a classification problem with 36 candidate categories. The classification performance reaches up to 56% when the five most probable protein folds are considered.
Affiliation(s)
- Themis P Exarchos
  - Department of Medical Physics, Medical School, University of Ioannina, GR 45110 Ioannina, Greece
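The two figures reported in the abstract for entry 15 (25% overall accuracy, 56% when the five most probable folds are considered) correspond to top-1 and top-5 accuracy. A minimal sketch of that metric is shown below; the function name, fold labels, and rankings are made up for illustration and are unrelated to the paper's data.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of samples whose true label is among the first k predictions."""
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_predictions, true_labels)
    )
    return hits / len(true_labels)


# Hypothetical fold rankings for four proteins, most probable fold first.
ranked = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "b", "a"],
    ["a", "c", "b"],
]
truth = ["a", "a", "b", "b"]

top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)
```

With 36 candidate categories, uniform random guessing would give about 2.8% top-1 accuracy, which puts the reported 25% in context.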