1. Type II fuzzy set-based data analytics to explore amino acid associations in protein sequences of Swine Influenza Virus. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2019.105856] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022]

2. Gruca A, Sikora M. Data- and expert-driven rule induction and filtering framework for functional interpretation and description of gene sets. J Biomed Semantics 2017; 8:23. [PMID: 28651634 PMCID: PMC5483958 DOI: 10.1186/s13326-017-0129-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 10/31/2016] [Accepted: 05/26/2017] [Indexed: 04/13/2023] Open
Abstract
Background: High-throughput methods in molecular biology have provided researchers with an abundance of experimental data that must be interpreted in order to understand the experimental results. Manual methods of functional gene/protein group interpretation are expensive and time-consuming; therefore, there is a need for new, efficient data mining methods and bioinformatics tools that can support the expert in the functional analysis of experimental results.
Results: In this study, we propose a comprehensive framework for the induction of logical rules in the form of combinations of Gene Ontology (GO) terms for functional interpretation of gene sets. Within the framework, we present four approaches: a fully automated method of rule induction without filtering, a rule induction method with filtering, an expert-driven rule filtering method based on additive utility functions, and an expert-driven rule induction method based on so-called seed or expert terms, that is, GO terms of special interest that should be included in the description. These GO terms usually describe processes or pathways of particular interest that are related to the experiment being performed. During the rule induction and filtering processes, such seed terms are used as a base on which the description is built.
Conclusion: We compare the descriptions obtained with different algorithms of rule induction and filtering and show that a filtering step is required to reduce the number of rules in the output set so that they can be analyzed by a human expert. However, filtering may remove information from the output rule set that is potentially interesting for the expert. Therefore, we present two methods that involve interaction with the expert during the process of rule induction. Both of them are able to reduce the number of rules, but only with the method based on seed terms does each created rule include expert terms in combination with other terms. Further analysis of such combinations may provide new knowledge about biological processes and their combination with other pathways related to the genes described by the rules. A suite of Matlab scripts that provides the functionality of the comprehensive framework for rule induction and filtering presented in this study is available free of charge at: http://rulego.polsl.pl/framework.
Electronic supplementary material: The online version of this article (doi:10.1186/s13326-017-0129-x) contains supplementary material, which is available to authorized users.
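The two expert-driven ideas in this abstract (keeping rules that contain seed terms, and ranking rules with an additive utility function) can be illustrated with a minimal Python sketch. It is not the authors' Matlab implementation: the rule representation, the criterion names `coverage` and `precision`, the weights, and the placeholder term labels are all assumptions made for illustration.

```python
# Hypothetical rule representation: a set of GO-term labels plus quality criteria in [0, 1].
# None of these names come from the paper; they are placeholders for illustration.

def filter_by_seed(rules, seed_terms):
    """Keep only rules whose term set contains at least one expert (seed) term."""
    seeds = set(seed_terms)
    return [r for r in rules if seeds & set(r["terms"])]


def additive_utility(rule, weights):
    """Additive utility: weighted sum of the rule's criterion values."""
    return sum(w * rule[criterion] for criterion, w in weights.items())


def rank_rules(rules, weights, top_n):
    """Return the top_n rules ordered by decreasing additive utility."""
    return sorted(rules, key=lambda r: additive_utility(r, weights), reverse=True)[:top_n]


example_rules = [
    {"terms": {"T1", "T2"}, "coverage": 0.8, "precision": 0.5},
    {"terms": {"T3"}, "coverage": 0.4, "precision": 1.0},
    {"terms": {"T1", "T3"}, "coverage": 0.2, "precision": 0.9},
]
example_weights = {"coverage": 0.5, "precision": 0.5}  # assumed expert preferences

seeded = filter_by_seed(example_rules, {"T1"})
best = rank_rules(example_rules, example_weights, top_n=1)
```

In this sketch the expert's preferences enter only through the weights and the seed set, which mirrors the abstract's point that seed-based selection guarantees every retained rule mentions a term of interest, whereas utility-based ranking does not.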
Affiliation(s)
- Aleksandra Gruca: Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, 44-100, Poland
- Marek Sikora: Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, 44-100, Poland

3. Jain A, Pardasani KR. Fuzzy soft set model for mining amino acid associations in peptide sequences of Mycobacterium tuberculosis complex (MTBC). Journal of Intelligent & Fuzzy Systems 2016. [DOI: 10.3233/ifs-162139] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/15/2022]
Affiliation(s)
- Amita Jain: Department of Computer Application, MANIT, Bhopal, Madhya Pradesh, India
- Kamal Raj Pardasani: Department of Mathematics, Bioinformatics and Computer Applications, MANIT, Bhopal, Madhya Pradesh, India

4. Ghosh A, De RK. Fuzzy Correlated Association Mining: Selecting altered associations among the genes, and some possible marker genes mediating certain cancers. Appl Soft Comput 2016. [DOI: 10.1016/j.asoc.2015.09.057] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Indexed: 12/11/2022]

5. Navarro C, Lopez FJ, Cano C, Garcia-Alcalde F, Blanco A. CisMiner: genome-wide in-silico cis-regulatory module prediction by fuzzy itemset mining. PLoS One 2014; 9:e108065. [PMID: 25268582 PMCID: PMC4182448 DOI: 10.1371/journal.pone.0108065] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 07/03/2014] [Accepted: 08/25/2014] [Indexed: 01/18/2023] Open
Abstract
Eukaryotic gene control regions are known to be spread throughout non-coding DNA sequences, which may appear distant from the gene promoter. Transcription factors are proteins that coordinately bind to these regions at transcription factor binding sites to regulate gene expression. Several tools can detect significant co-occurrences of closely located binding sites (cis-regulatory modules, CRMs). However, these tools present at least one of the following limitations: 1) their scope is limited to promoter or conserved regions of the genome; 2) they cannot identify combinations involving more than two motifs; 3) they require prior information about target motifs. In this work we present CisMiner, a novel methodology to detect putative CRMs by means of a fuzzy itemset mining approach able to operate at genome-wide scale. CisMiner performs a blind search for CRMs without any prior information about target CRMs and without any limit on the number of motifs. CisMiner tackles the combinatorial complexity of genome-wide cis-regulatory module extraction by using a natural representation of motif combinations as itemsets and applying the Top-Down Fuzzy Frequent-Pattern Tree algorithm to identify significant itemsets. Fuzzy technology allows CisMiner to better handle the imprecision and noise inherent to regulatory processes. Results obtained for a set of well-known binding sites in the S. cerevisiae genome show that our method yields highly reliable predictions. Furthermore, CisMiner was also applied to putative in-silico predicted transcription factor binding sites to identify significant combinations in S. cerevisiae and D. melanogaster, showing that our approach can be further applied genome-wide to more complex genomes. CisMiner is freely accessible at: http://genome2.ugr.es/cisminer. CisMiner can be queried for the results presented in this work and can also perform a customized cis-regulatory module prediction on a query set of transcription factor binding sites provided by the user.
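To make the notion of a fuzzy itemset concrete, the following Python sketch computes a fuzzy support value. It assumes one common textbook definition (membership of an itemset in a transaction is the minimum of its items' membership degrees, averaged over transactions); the abstract does not state which definition CisMiner uses, and the data layout and motif names here are hypothetical.

```python
def fuzzy_support(transactions, itemset):
    """Fuzzy support of a non-empty itemset.

    transactions: list of dicts mapping item -> membership degree in [0, 1]
                  (here, a hypothetical score for a motif occurring in a region).
    itemset:      non-empty iterable of items.
    The min operator is an assumed choice of t-norm, not taken from the paper.
    """
    items = list(itemset)
    if not transactions:
        return 0.0
    total = 0.0
    for t in transactions:
        # An item that is absent from the transaction has membership 0.
        total += min(t.get(item, 0.0) for item in items)
    return total / len(transactions)


# Hypothetical genomic regions with graded evidence for two motifs.
regions = [
    {"motifA": 0.9, "motifB": 0.6},
    {"motifA": 0.4},
    {"motifA": 1.0, "motifB": 1.0},
]
pair_support = fuzzy_support(regions, ["motifA", "motifB"])
```

With crisp memberships (only 0 or 1) this reduces to the ordinary support used in classical frequent itemset mining, which is why graded memberships are a natural way to absorb noisy binding-site evidence.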
Affiliation(s)
- Carmen Navarro: Department of Computer Science and AI, University of Granada, Granada, Spain
- Francisco J. Lopez: Andalusian Human Genome Sequencing Centre (CASEGH), Medical Genome Project (MGP), Sevilla, Spain
- Carlos Cano: Department of Computer Science and AI, University of Granada, Granada, Spain
- Armando Blanco: Department of Computer Science and AI, University of Granada, Granada, Spain

6. Liu R, France B, George S, Rallo R, Zhang H, Xia T, Nel AE, Bradley K, Cohen Y. Association rule mining of cellular responses induced by metal and metal oxide nanoparticles. Analyst 2014; 139:943-53. [PMID: 24260774 DOI: 10.1039/c3an01409f] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Indexed: 12/12/2022]
Abstract
Relationships among fourteen different biological responses (including ten signaling pathway activities and four cytotoxicity effects) of murine macrophage (RAW264.7) and bronchial epithelial (BEAS-2B) cells exposed to six metal and metal oxide nanoparticles (NPs) were analyzed using both statistical and data mining approaches. Both the pathway activities and cytotoxicity effects were assessed using high-throughput screening (HTS) over an exposure period of up to 24 h and a concentration range of 0.39-200 mg L⁻¹. HTS data were processed by outlier removal, normalization, and hit-identification (for significantly regulated cellular responses) to arrive at reliable multiparametric bioactivity profiles for the NPs. Association rule mining was then applied to the bioactivity profiles, followed by a pruning process to remove redundant rules. The non-redundant association rules indicated that "significant regulation" of one or more cellular responses implies regulation of other (associated) cellular response types. Pairwise correlation analysis (via Pearson's χ² test) and self-organizing map clustering of the different cellular response types indicated consistency with the identified non-redundant association rules. Furthermore, in order to explore the potential use of association rules as a tool for data-driven hypothesis generation, specific pathway activity experiments were carried out for ZnO NPs. The experimental results confirmed the association rule identified for the p53 pathway and mitochondrial superoxide levels (via MitoSox reagent) and further revealed that blocking of the transcriptional activity of p53 lowered the MitoSox signal. The present approach of using association rule mining for data-driven hypothesis generation has important implications for streamlining multi-parameter HTS assays, improving the understanding of NP toxicity mechanisms, and selection of endpoints for the development of nanomaterial structure-activity relationships.
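The "mine rules, then prune redundant ones" workflow described above can be sketched in Python as follows. This is an illustration only: the thresholds, the restriction to single-item consequents, the redundancy criterion (a rule is dropped when a rule with a strictly smaller antecedent and the same consequent is at least as confident), and the response labels are assumptions, not details reported in the abstract.

```python
from itertools import combinations


def mine_rules(profiles, min_support=0.3, min_confidence=0.8, max_antecedent=2):
    """Mine rules antecedent -> consequent from sets of 'hit' responses.

    Returns {(antecedent frozenset, consequent): (support, confidence)}.
    """
    profiles = [frozenset(p) for p in profiles]
    n = len(profiles)
    items = sorted({i for p in profiles for i in p})

    def count(itemset):
        # Number of profiles that contain every item of the itemset.
        return sum(itemset <= p for p in profiles)

    rules = {}
    for size in range(1, max_antecedent + 1):
        for ante in map(frozenset, combinations(items, size)):
            n_ante = count(ante)
            if n_ante == 0:
                continue
            for cons in items:
                if cons in ante:
                    continue
                n_both = count(ante | {cons})
                support, confidence = n_both / n, n_both / n_ante
                if support >= min_support and confidence >= min_confidence:
                    rules[(ante, cons)] = (support, confidence)
    return rules


def prune_redundant(rules):
    """Drop a rule when a more general rule (proper-subset antecedent, same
    consequent) has confidence at least as high. Assumed criterion."""
    kept = {}
    for (ante, cons), (sup, conf) in rules.items():
        redundant = any(
            c2 == cons and a2 < ante and conf2 >= conf
            for (a2, c2), (_, conf2) in rules.items()
        )
        if not redundant:
            kept[(ante, cons)] = (sup, conf)
    return kept


# Hypothetical hit profiles (labels are placeholders, not the study's data).
hit_profiles = [{"p53", "mitosox"}, {"p53", "mitosox", "nfkb"}, {"p53", "mitosox"}, {"nfkb"}]
all_rules = mine_rules(hit_profiles, min_support=0.25)
compact_rules = prune_redundant(all_rules)
```

On this toy input the two-item antecedents add nothing beyond the single-item rules, so pruning leaves only the rules linking the two placeholder responses in each direction.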
Affiliation(s)
- Rong Liu: Institute of the Environment and Sustainability, University of California, Los Angeles, CA 90095, USA

7. Chen H, Lonardi S, Zheng J. Deciphering histone code of transcriptional regulation in malaria parasites by large-scale data mining. Comput Biol Chem 2014; 50:3-10. [PMID: 24581698 DOI: 10.1016/j.compbiolchem.2014.01.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Accepted: 12/23/2013] [Indexed: 10/25/2022]
Abstract
Histone modifications play a major role in the regulation of gene expression. Accumulated evidence has shown that histone modifications cooperatively mediate biological processes such as transcription. This has led to the 'histone code' hypothesis, which suggests that combinations of different histone modifications correspond to unique chromatin states and have distinct functions. In this paper, we propose a framework based on association rule mining to discover potential regulatory relations between histone modifications and gene expression in Plasmodium falciparum. Our approach outputs rules with statistical significance. Some of the discovered rules are supported by experimental results reported in the literature. Moreover, we have also discovered de novo rules that can guide further research in epigenetic regulation of transcription. Based on our association rules, we build a model to predict gene expression, which outperforms a published Bayesian network model for predicting gene expression from histone modifications. The results of our study reveal mechanisms by which histone modifications regulate transcription at a large scale. Among our findings, the cooperation among histone modifications provides new evidence for the histone code hypothesis. Furthermore, the rules output by our method can be used to predict changes in gene expression.
Affiliation(s)
- Haifen Chen: School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore
- Stefano Lonardi: Department of Computer Science and Engineering, University of California Riverside, 900 University Avenue, Riverside, CA 92521, USA
- Jie Zheng: School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore; Genome Institute of Singapore, A*STAR (Agency for Science, Technology, and Research), Biopolis, Singapore 138672, Singapore

8. Naulaerts S, Meysman P, Bittremieux W, Vu TN, Vanden Berghe W, Goethals B, Laukens K. A primer to frequent itemset mining for bioinformatics. Brief Bioinform 2013; 16:216-31. [PMID: 24162173 PMCID: PMC4364064 DOI: 10.1093/bib/bbt074] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.7] [Indexed: 01/17/2023] Open
Abstract
Over the past two decades, pattern mining techniques have become an integral part of many bioinformatics solutions. Frequent itemset mining is a popular group of pattern mining techniques designed to identify elements that frequently co-occur. An archetypical example is the identification of products that often end up together in the same shopping basket in supermarket transactions. A number of algorithms have been developed to address variations of this computationally non-trivial problem. Frequent itemset mining techniques are able to efficiently capture the characteristics of (complex) data and succinctly summarize them. Owing to these and other interesting properties, these techniques have proven their value in biological data analysis. Nevertheless, information about the bioinformatics applications of these techniques remains scattered. In this primer, we introduce frequent itemset mining and its derived association rules for life scientists. We give an overview of various algorithms and illustrate how they can be used in several real-life bioinformatics application domains. We end with a discussion of the future potential and open challenges for frequent itemset mining in the life sciences.
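The shopping-basket example mentioned in this abstract can be made concrete with a small Python sketch of a level-wise (Apriori-style) search. It is a teaching illustration under assumed inputs, not one of the optimized algorithms the primer surveys; the basket contents and the support threshold are made up.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support.

    Level-wise search: a candidate k-itemset is only counted if every one of
    its (k-1)-subsets was found frequent at the previous level.
    Assumes a non-empty list of transactions.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions containing every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    level = {frozenset([i]) for t in transactions for i in t}
    result = {}
    k = 1
    while level:
        frequent = {}
        for candidate in level:
            s = support(candidate)
            if s >= min_support:
                frequent[candidate] = s
        result.update(frequent)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemset candidates.
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: discard candidates that have an infrequent (k-1)-subset.
        level = {
            c for c in joined
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
    return result


baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
found = frequent_itemsets(baskets, min_support=0.5)
```

In a bioinformatics setting the "baskets" would be, for example, the annotation terms attached to each gene, and the frequent itemsets would be annotations that tend to occur together.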

9. Biomedical application of fuzzy association rules for identifying breast cancer biomarkers. Med Biol Eng Comput 2012; 50:981-90. [DOI: 10.1007/s11517-012-0914-8] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Received: 10/23/2011] [Accepted: 05/03/2012] [Indexed: 01/26/2023]

10. Induction and selection of the most interesting Gene Ontology based multiattribute rules for descriptions of gene groups. Pattern Recognit Lett 2011. [DOI: 10.1016/j.patrec.2010.08.011] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/21/2022]

11. Garcia-Alcalde F, Blanco A, Shepherd AJ. An intuitionistic approach to scoring DNA sequences against transcription factor binding site motifs. BMC Bioinformatics 2010; 11:551. [PMID: 21059262 PMCID: PMC3098096 DOI: 10.1186/1471-2105-11-551] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 04/27/2010] [Accepted: 11/08/2010] [Indexed: 02/04/2023] Open
Abstract
Background: Transcription factors (TFs) control transcription by binding to specific regions of DNA called transcription factor binding sites (TFBSs). The identification of TFBSs is a crucial problem in computational biology and includes the subtask of predicting the location of known TFBS motifs in a given DNA sequence. It has previously been shown that, when scoring matches to known TFBS motifs, interdependencies between positions within a motif should be taken into account. However, this remains a challenging task because sequences similar to those of known TFBSs can occur by chance with a relatively high frequency. Here we present a new method for matching sequences to TFBS motifs based on intuitionistic fuzzy sets (IFS) theory, an approach that has been shown to be particularly appropriate for tackling problems that embody a high degree of uncertainty.
Results: We propose SCintuit, a new scoring method for measuring sequence-motif affinity based on IFS theory. Unlike existing methods that consider dependencies between positions, SCintuit is designed to prevent overestimation of less conserved positions of TFBSs. For a given pair of bases, SCintuit is computed not only as a function of their combined probability of occurrence, but also taking into account the individual importance of each single base at its corresponding position. We used SCintuit to identify known TFBSs in DNA sequences. Our method provides excellent results when dealing with both synthetic and real data, outperforming two existing methods in both sensitivity and specificity in all the experiments we performed.
Conclusions: The results show that SCintuit improves on the prediction quality of existing approaches for TFs without compromising sensitivity. In addition, we show how SCintuit can be successfully applied to real research problems. This study demonstrates the reliability of IFS theory for motif discovery tasks.
Affiliation(s)
- Fernando Garcia-Alcalde: Bioinformatics and Genomics Department, Centro de Investigación Príncipe Felipe, Valencia 46013, Spain

12. Rodríguez-González AY, Martínez-Trinidad JF, Carrasco-Ochoa JA, Ruiz-Shulcloper J. RP-Miner: a relaxed prune algorithm for frequent similar pattern mining. Knowl Inf Syst 2010. [DOI: 10.1007/s10115-010-0309-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 10/19/2022]

13. An L, Obradovic Z, Smith D, Bodenreider O, Megalooikonomou V. Mining Association Rules among Gene Functions in Clusters of Similar Gene Expression Maps. IEEE International Conference on Bioinformatics and Biomedicine Workshops. IEEE International Conference on Bioinformatics and Biomedicine 2009; 2009:254-259. [PMID: 25635265 PMCID: PMC4307020 DOI: 10.1109/bibmw.2009.5332104] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/07/2022]
Abstract
Association rule mining methods have recently been applied to gene expression data analysis to reveal relationships between genes and different conditions and features. However, little effort has focused on detecting the relation between gene expression maps and related gene functions. Here we describe such an approach to mine association rules among gene functions in clusters of similar gene expression maps of the mouse brain. The experimental results show that the detected association rules make sense biologically. By inspecting the obtained clusters and the genes having the gene functions of frequent itemsets, interesting clues were discovered that provide valuable insight to biological scientists. Moreover, the discovered association rules can potentially be used to predict gene functions based on similarity of gene expression maps.
Affiliation(s)
- Li An: Data Engineering Laboratory, Dept. of Computer and Information Sciences, Temple University, PA, USA
- Zoran Obradovic: Center for Information Science and Technology, Temple University, PA, USA
- Desmond Smith: Dept. of Molecular and Medical Pharmacology, David Geffen School of Medicine, UCLA, CA, USA
- Olivier Bodenreider: The Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Washington D.C., USA

14. Alves R, Rodriguez-Baena DS, Aguilar-Ruiz JS. Gene association analysis: a survey of frequent pattern mining from gene expression data. Brief Bioinform 2009; 11:210-24. [PMID: 19815645 DOI: 10.1093/bib/bbp042] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.5] [Indexed: 11/13/2022] Open
Abstract
Establishing an association between variables is always of interest in genomic studies. Generation of DNA microarray gene expression data introduces a variety of data analysis issues not encountered in traditional molecular biology or medicine. Frequent pattern mining (FPM) has been applied successfully to business and scientific data for discovering interesting association patterns, and is becoming a promising strategy in microarray gene expression analysis. We review the most relevant FPM strategies, as well as the main issues that arise when devising efficient and practical methods for gene association analysis (GAA). We observed that, so far, the scalability achieved by efficient methods does not imply biological soundness of the discovered association patterns, and vice versa. Ideally, GAA should employ a balanced mining model that takes into account the best practices of the methods reviewed in this survey. Integrative approaches, in which biological knowledge plays an important role within the mining process, are becoming more reliable.
Affiliation(s)
- Ronnie Alves: Institute of Developmental Biology and Cancer, Centre de Biochimie, Faculte des Sciences, the University of Nice, 06108 Nice cedex 2

15. Antezana E, Kuiper M, Mironov V. Biological knowledge management: the emerging role of the Semantic Web technologies. Brief Bioinform 2009; 10:392-407. [PMID: 19457869 DOI: 10.1093/bib/bbp024] [Citation(s) in RCA: 83] [Impact Index Per Article: 5.5] [Indexed: 11/12/2022] Open
Abstract
New knowledge is produced at a continuously increasing speed, and the list of papers, databases and other knowledge sources that a researcher in the life sciences needs to cope with is actually turning into a problem rather than an asset. The adequate management of knowledge is therefore becoming fundamentally important for life scientists, especially if they work with approaches that thoroughly depend on knowledge integration, such as systems biology. Several initiatives to organize biological knowledge sources into a readily exploitable resourceome are presently being carried out. Ontologies and Semantic Web technologies revolutionize these efforts. Here, we review the benefits, trends, current possibilities, and the potential this holds for the biosciences.
Affiliation(s)
- Erick Antezana: Department of Biology at the Norwegian University of Science and Technology

16. Das S, Roymondal U, Sahoo S. Analyzing gene expression from relative codon usage bias in Yeast genome: a statistical significance and biological relevance. Gene 2009; 443:121-31. [PMID: 19410638 DOI: 10.1016/j.gene.2009.04.022] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Received: 11/22/2008] [Revised: 03/08/2009] [Accepted: 04/20/2009] [Indexed: 11/17/2022]
Abstract
Based on the hypothesis that highly expressed genes are often characterized by strong compositional bias in terms of codon usage, a number of measures are currently in use that quantify codon usage bias in genes and hence provide numerical indices to predict gene expression levels. Following the recent introduction of an expression measure based on the relative codon usage bias score (RCBS), we have explicitly tested the performance of this measure in predicting gene expression levels and illustrate it with an analysis of yeast genomes. In contrast to previous studies, we observe a weak correlation between GC content and RCBS, but a selective pressure on the codon preferences in highly expressed genes. The assertion that the expression of a given gene depends on its RCBS is supported by the data. We further observe a strong correlation between RCBS and protein length, indicating natural selection in favour of shorter genes being expressed at higher levels. We also attempt a statistical analysis to assess the strength of relative codon bias in genes as a guide to their likely expression level, suggesting a decrease of the informational entropy in highly expressed genes.
Affiliation(s)
- Shibsankar Das: Department of Mathematics, Uluberia College, Uluberia, Howrah, W.B., India