1
|
Biomedical text mining and its applications in cancer research. J Biomed Inform 2013; 46:200-11. [DOI: 10.1016/j.jbi.2012.10.007] [Citation(s) in RCA: 159] [Impact Index Per Article: 14.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2012] [Revised: 10/30/2012] [Accepted: 10/30/2012] [Indexed: 11/21/2022]
|
2
|
Abstract
In biological networks of molecular interactions in a cell, network motifs that are biologically relevant are also functionally coherent, or form functional modules. These functionally coherent modules combine in a hierarchical manner into larger, less cohesive subsystems, thus revealing one of the essential design principles of system-level cellular organization and function–hierarchical modularity. Arguably, hierarchical modularity has not been explicitly taken into consideration by most, if not all, functional annotation systems. As a result, the existing methods would often fail to assign a statistically significant functional coherence score to biologically relevant molecular machines. We developed a methodology for hierarchical functional annotation. Given the hierarchical taxonomy of functional concepts (e.g., Gene Ontology) and the association of individual genes or proteins with these concepts (e.g., GO terms), our method will assign a Hierarchical Modularity Score (HMS) to each node in the hierarchy of functional modules; the HMS score and its value measure functional coherence of each module in the hierarchy. While existing methods annotate each module with a set of “enriched” functional terms in a bag of genes, our complementary method provides the hierarchical functional annotation of the modules and their hierarchically organized components. A hierarchical organization of functional modules often comes as a bi-product of cluster analysis of gene expression data or protein interaction data. Otherwise, our method will automatically build such a hierarchy by directly incorporating the functional taxonomy information into the hierarchy search process and by allowing multi-functional genes to be part of more than one component in the hierarchy. In addition, its underlying HMS scoring metric ensures that functional specificity of the terms across different levels of the hierarchical taxonomy is properly treated. We have evaluated our method using Saccharomyces cerevisiae data from KEGG and MIPS databases and several other computationally derived and curated datasets. The code and additional supplemental files can be obtained from http://code.google.com/p/functional-annotation-of-hierarchical-modularity/ (Accessed 2012 March 13).
Collapse
|
3
|
Xu L, Furlotte N, Lin Y, Heinrich K, Berry MW, George EO, Homayouni R. Functional cohesion of gene sets determined by latent semantic indexing of PubMed abstracts. PLoS One 2011; 6:e18851. [PMID: 21533142 PMCID: PMC3077411 DOI: 10.1371/journal.pone.0018851] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2010] [Accepted: 03/21/2011] [Indexed: 12/31/2022] Open
Abstract
High-throughput genomic technologies enable researchers to identify genes that are co-regulated with respect to specific experimental conditions. Numerous statistical approaches have been developed to identify differentially expressed genes. Because each approach can produce distinct gene sets, it is difficult for biologists to determine which statistical approach yields biologically relevant gene sets and is appropriate for their study. To address this issue, we implemented Latent Semantic Indexing (LSI) to determine the functional coherence of gene sets. An LSI model was built using over 1 million Medline abstracts for over 20,000 mouse and human genes annotated in Entrez Gene. The gene-to-gene LSI-derived similarities were used to calculate a literature cohesion p-value (LPv) for a given gene set using a Fisher's exact test. We tested this method against genes in more than 6,000 functional pathways annotated in Gene Ontology (GO) and found that approximately 75% of gene sets in GO biological process category and 90% of the gene sets in GO molecular function and cellular component categories were functionally cohesive (LPv<0.05). These results indicate that the LPv methodology is both robust and accurate. Application of this method to previously published microarray datasets demonstrated that LPv can be helpful in selecting the appropriate feature extraction methods. To enable real-time calculation of LPv for mouse or human gene sets, we developed a web tool called Gene-set Cohesion Analysis Tool (GCAT). GCAT can complement other gene set enrichment approaches by determining the overall functional cohesion of data sets, taking into account both explicit and implicit gene interactions reported in the biomedical literature.
Collapse
Affiliation(s)
- Lijing Xu
- Bioinformatics Program, University of Memphis, Memphis, Tennessee, United States of America
- Department of Mathematical Sciences, University of Memphis, Memphis, Tennessee, United States of America
| | - Nicholas Furlotte
- Bioinformatics Program, University of Memphis, Memphis, Tennessee, United States of America
| | - Yunyue Lin
- Department of Computer Science, University of Memphis, Memphis, Tennessee, United States of America
| | - Kevin Heinrich
- Computable Genomix, Memphis, Tennessee, United States of America
| | - Michael W. Berry
- Department of Electrical and Computer Engineering, University of Tennessee, Knoxville, Tennessee, United States of America
| | - Ebenezer O. George
- Bioinformatics Program, University of Memphis, Memphis, Tennessee, United States of America
- Department of Mathematical Sciences, University of Memphis, Memphis, Tennessee, United States of America
| | - Ramin Homayouni
- Bioinformatics Program, University of Memphis, Memphis, Tennessee, United States of America
- Department of Biological Sciences, University of Memphis, Memphis, Tennessee, United States of America
- * E-mail:
| |
Collapse
|
4
|
Frijters R, van Vugt M, Smeets R, van Schaik R, de Vlieg J, Alkema W. Literature mining for the discovery of hidden connections between drugs, genes and diseases. PLoS Comput Biol 2010; 6. [PMID: 20885778 PMCID: PMC2944780 DOI: 10.1371/journal.pcbi.1000943] [Citation(s) in RCA: 120] [Impact Index Per Article: 8.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2010] [Accepted: 08/26/2010] [Indexed: 01/19/2023] Open
Abstract
The scientific literature represents a rich source for retrieval of knowledge on associations between biomedical concepts such as genes, diseases and cellular processes. A commonly used method to establish relationships between biomedical concepts from literature is co-occurrence. Apart from its use in knowledge retrieval, the co-occurrence method is also well-suited to discover new, hidden relationships between biomedical concepts following a simple ABC-principle, in which A and C have no direct relationship, but are connected via shared B-intermediates. In this paper we describe CoPub Discovery, a tool that mines the literature for new relationships between biomedical concepts. Statistical analysis using ROC curves showed that CoPub Discovery performed well over a wide range of settings and keyword thesauri. We subsequently used CoPub Discovery to search for new relationships between genes, drugs, pathways and diseases. Several of the newly found relationships were validated using independent literature sources. In addition, new predicted relationships between compounds and cell proliferation were validated and confirmed experimentally in an in vitro cell proliferation assay. The results show that CoPub Discovery is able to identify novel associations between genes, drugs, pathways and diseases that have a high probability of being biologically valid. This makes CoPub Discovery a useful tool to unravel the mechanisms behind disease, to find novel drug targets, or to find novel applications for existing drugs. The biomedical literature is an important source of knowledge on the function of genes and on the mechanisms by which these genes regulate cellular processes. Several text mining approaches have been developed to leverage this rich source of information by automatically extracting associations between concepts such as genes, diseases and drugs from a large body of text. Here, we describe a new method that extracts novel, not yet recognized associations between genes, diseases, drugs and cellular processes from the biomedical literature. Our method is built on the assumption that even if two concepts do not have a direct connection in literature, they may be functionally related if they are both connected to an overlapping set of concepts. Using this approach we predicted several novel connections between genes, diseases, drugs and pathways. Our results imply that our method is able to predict novel relationships from literature and, most importantly, that these newly identified relationships are biologically relevant. Our method can aid the drug discovery process where it can be used to find novel drug targets, increase insight in mode of action of a drug or find novel applications for known drugs.
Collapse
Affiliation(s)
- Raoul Frijters
- Computational Drug Discovery (CDD), Nijmegen Centre for Molecular Life Sciences (NCMLS), Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands
| | - Marianne van Vugt
- Department of Immune Therapeutics, Schering-Plough, Oss, The Netherlands
| | - Ruben Smeets
- Department of Immune Therapeutics, Schering-Plough, Oss, The Netherlands
| | - René van Schaik
- Department of Molecular Design & Informatics, Schering-Plough, Oss, The Netherlands
| | - Jacob de Vlieg
- Computational Drug Discovery (CDD), Nijmegen Centre for Molecular Life Sciences (NCMLS), Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands
- Department of Molecular Design & Informatics, Schering-Plough, Oss, The Netherlands
| | - Wynand Alkema
- Department of Molecular Design & Informatics, Schering-Plough, Oss, The Netherlands
- * E-mail:
| |
Collapse
|
5
|
He X, Sarma MS, Ling X, Chee B, Zhai C, Schatz B. Identifying overrepresented concepts in gene lists from literature: a statistical approach based on Poisson mixture model. BMC Bioinformatics 2010; 11:272. [PMID: 20487560 PMCID: PMC2885378 DOI: 10.1186/1471-2105-11-272] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2009] [Accepted: 05/20/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Large-scale genomic studies often identify large gene lists, for example, the genes sharing the same expression patterns. The interpretation of these gene lists is generally achieved by extracting concepts overrepresented in the gene lists. This analysis often depends on manual annotation of genes based on controlled vocabularies, in particular, Gene Ontology (GO). However, the annotation of genes is a labor-intensive process; and the vocabularies are generally incomplete, leaving some important biological domains inadequately covered. RESULTS We propose a statistical method that uses the primary literature, i.e. free-text, as the source to perform overrepresentation analysis. The method is based on a statistical framework of mixture model and addresses the methodological flaws in several existing programs. We implemented this method within a literature mining system, BeeSpace, taking advantage of its analysis environment and added features that facilitate the interactive analysis of gene sets. Through experimentation with several datasets, we showed that our program can effectively summarize the important conceptual themes of large gene sets, even when traditional GO-based analysis does not yield informative results. CONCLUSIONS We conclude that the current work will provide biologists with a tool that effectively complements the existing ones for overrepresentation analysis from genomic experiments. Our program, Genelist Analyzer, is freely available at: http://workerbee.igb.uiuc.edu:8080/BeeSpace/Search.jsp.
Collapse
Affiliation(s)
- Xin He
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | | | | | | | | | | |
Collapse
|
6
|
Wu S, Liu T, Altman RB. Identification of recurring protein structure microenvironments and discovery of novel functional sites around CYS residues. BMC STRUCTURAL BIOLOGY 2010; 10:4. [PMID: 20122268 PMCID: PMC2833161 DOI: 10.1186/1472-6807-10-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/31/2009] [Accepted: 02/02/2010] [Indexed: 11/29/2022]
Abstract
Background The emergence of structural genomics presents significant challenges in the annotation of biologically uncharacterized proteins. Unfortunately, our ability to analyze these proteins is restricted by the limited catalog of known molecular functions and their associated 3D motifs. Results In order to identify novel 3D motifs that may be associated with molecular functions, we employ an unsupervised, two-phase clustering approach that combines k-means and hierarchical clustering with knowledge-informed cluster selection and annotation methods. We applied the approach to approximately 20,000 cysteine-based protein microenvironments (3D regions 7.5 Å in radius) and identified 70 interesting clusters, some of which represent known motifs (e.g. metal binding and phosphatase activity), and some of which are novel, including several zinc binding sites. Detailed annotation results are available online for all 70 clusters at http://feature.stanford.edu/clustering/cys. Conclusions The use of microenvironments instead of backbone geometric criteria enables flexible exploration of protein function space, and detection of recurring motifs that are discontinuous in sequence and diverse in structure. Clustering microenvironments may thus help to functionally characterize novel proteins and better understand the protein structure-function relationship.
Collapse
Affiliation(s)
- Shirley Wu
- 23andMe, 1390 Shorebird Way, Mountain View, CA, USA
| | | | | |
Collapse
|
7
|
Vazquez M, Carmona-Saez P, Nogales-Cadenas R, Chagoyen M, Tirado F, Carazo JM, Pascual-Montano A. SENT: semantic features in text. Nucleic Acids Res 2009; 37:W153-9. [PMID: 19458159 PMCID: PMC2703940 DOI: 10.1093/nar/gkp392] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
We present SENT (semantic features in text), a functional interpretation tool based on literature analysis. SENT uses Non-negative Matrix Factorization to identify topics in the scientific articles related to a collection of genes or their products, and use them to group and summarize these genes. In addition, the application allows users to rank and explore the articles that best relate to the topics found, helping put the analysis results into context. This approach is useful as an exploratory step in the workflow of interpreting and understanding experimental data, shedding some light into the complex underlying biological mechanisms. This tool provides a user-friendly interface via a web site, and a programmatic access via a SOAP web server. SENT is freely accessible at http://sent.dacya.ucm.es.
Collapse
Affiliation(s)
- Miguel Vazquez
- Software Engineering Department, Complutense University and Biocomputing Unit, National Center for Biotechnology, CNB-CSIC, Madrid, Spain
| | | | | | | | | | | | | |
Collapse
|
8
|
Abstract
Clustering is a popular technique commonly used to search for groups of similarly expressed genes using mRNA expression data. There are many different clustering algorithms and the application of each one will usually produce different results. Without additional evaluation, it is difficult to determine which solutions are better.In this chapter we discuss methods to assess algorithms for clustering of gene expression data. In particular, we present a new method that uses two elements: an internal index of validity based on the MDL principle and an external index of validity that measures the consistency with experimental data. Each one is used to suggest an effective set of models, but it is only the combination of both that is capable of pinpointing the best model overall. Our method can be used to compare different clustering algorithms and pick the one that maximizes the correlation with functional links in gene networks while minimizing the error rate. We test our methods on several popular clustering algorithms as well as on clustering algorithms that are specially tailored to deal with noisy data. Finally, we propose methods for assessing the significance of individual clusters and study the correspondence between gene clusters and biochemical pathways.
Collapse
|
9
|
Chagoyen M, Carazo JM, Pascual-Montano A. Assessment of protein set coherence using functional annotations. BMC Bioinformatics 2008; 9:444. [PMID: 18937846 PMCID: PMC2588600 DOI: 10.1186/1471-2105-9-444] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2008] [Accepted: 10/20/2008] [Indexed: 11/23/2022] Open
Abstract
Background Analysis of large-scale experimental datasets frequently produces one or more sets of proteins that are subsequently mined for functional interpretation and validation. To this end, a number of computational methods have been devised that rely on the analysis of functional annotations. Although current methods provide valuable information (e.g. significantly enriched annotations, pairwise functional similarities), they do not specifically measure the degree of homogeneity of a protein set. Results In this work we present a method that scores the degree of functional homogeneity, or coherence, of a set of proteins on the basis of the global similarity of their functional annotations. The method uses statistical hypothesis testing to assess the significance of the set in the context of the functional space of a reference set. As such, it can be used as a first step in the validation of sets expected to be homogeneous prior to further functional interpretation. Conclusion We evaluate our method by analysing known biologically relevant sets as well as random ones. The known relevant sets comprise macromolecular complexes, cellular components and pathways described for Saccharomyces cerevisiae, which are mostly significantly coherent. Finally, we illustrate the usefulness of our approach for validating 'functional modules' obtained from computational analysis of protein-protein interaction networks. Matlab code and supplementary data are available at
Collapse
|
10
|
Aubry M, Monnier A, Chicault C, Galibert MD, Burgun A, Mosser J. Iron-related transcriptomic variations in Caco-2 cells: In silico perspectives. Biochimie 2008; 90:669-78. [DOI: 10.1016/j.biochi.2008.01.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2007] [Accepted: 01/04/2008] [Indexed: 10/22/2022]
|
11
|
Nair R, Rost B. Protein subcellular localization prediction using artificial intelligence technology. Methods Mol Biol 2008; 484:435-63. [PMID: 18592195 DOI: 10.1007/978-1-59745-398-1_27] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/20/2023]
Abstract
Proteins perform many important tasks in living organisms, such as catalysis of biochemical reactions, transport of nutrients, and recognition and transmission of signals. The plethora of aspects of the role of any particular protein is referred to as its "function." One aspect of protein function that has been the target of intensive research by computational biologists is its subcellular localization. Proteins must be localized in the same subcellular compartment to cooperate toward a common physiological function. Aberrant subcellular localization of proteins can result in several diseases, including kidney stones, cancer, and Alzheimer's disease. To date, sequence homology remains the most widely used method for inferring the function of a protein. However, the application of advanced artificial intelligence (AI)-based techniques in recent years has resulted in significant improvements in our ability to predict the subcellular localization of a protein. The prediction accuracy has risen steadily over the years, in large part due to the application of AI-based methods such as hidden Markov models (HMMs), neural networks (NNs), and support vector machines (SVMs), although the availability of larger experimental datasets has also played a role. Automatic methods that mine textual information from the biological literature and molecular biology databases have considerably sped up the process of annotation for proteins for which some information regarding function is available in the literature. State-of-the-art methods based on NNs and HMMs can predict the presence of N-terminal sorting signals extremely accurately. Ab initio methods that predict subcellular localization for any protein sequence using only the native amino acid sequence and features predicted from the native sequence have shown the most remarkable improvements. The prediction accuracy of these methods has increased by over 30% in the past decade. The accuracy of these methods is now on par with high-throughput methods for predicting localization, and they are beginning to play an important role in directing experimental research. In this chapter, we review some of the most important methods for the prediction of subcellular localization.
Collapse
Affiliation(s)
- Rajesh Nair
- CUBIC Department of Biochemistry and Molecular Biophysics and Center for Computational Biology and Bioinformatics, Columbia University, New York, NY, USA
| | | |
Collapse
|
12
|
Erhardt RAA, Schneider R, Blaschke C. Status of text-mining techniques applied to biomedical text. Drug Discov Today 2007; 11:315-25. [PMID: 16580973 DOI: 10.1016/j.drudis.2006.02.011] [Citation(s) in RCA: 82] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2005] [Revised: 02/08/2006] [Accepted: 02/27/2006] [Indexed: 11/16/2022]
Abstract
Scientific progress is increasingly based on knowledge and information. Knowledge is now recognized as the driver of productivity and economic growth, leading to a new focus on the role of information in the decision-making process. Most scientific knowledge is registered in publications and other unstructured representations that make it difficult to use and to integrate the information with other sources (e.g. biological databases). Making a computer understand human language has proven to be a complex achievement, but there are techniques capable of detecting, distinguishing and extracting a limited number of different classes of facts. In the biomedical field, extracting information has specific problems: complex and ever-changing nomenclature (especially genes and proteins) and the limited representation of domain knowledge.
Collapse
|
13
|
Kankainen M, Brader G, Törönen P, Palva ET, Holm L. Identifying functional gene sets from hierarchically clustered expression data: map of abiotic stress regulated genes in Arabidopsis thaliana. Nucleic Acids Res 2006; 34:e124. [PMID: 17003050 PMCID: PMC1636450 DOI: 10.1093/nar/gkl694] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
We present MultiGO, a web-enabled tool for the identification of biologically relevant gene sets from hierarchically clustered gene expression trees (). High-throughput gene expression measuring techniques, such as microarrays, are nowadays often used to monitor the expression of thousands of genes. Since these experiments can produce overwhelming amounts of data, computational methods that assist the data analysis and interpretation are essential. MultiGO is a tool that automatically extracts the biological information for multiple clusters and determines their biological relevance, and hence facilitates the interpretation of the data. Since the entire expression tree is analysed, MultiGO is guaranteed to report all clusters that share a common enriched biological function, as defined by Gene Ontology annotations. The tool also identifies a plausible cluster set, which represents the key biological functions affected by the experiment. The performance is demonstrated by analysing drought-, cold- and abscisic acid-related expression data sets from Arabidopsis thaliana. The analysis not only identified known biological functions, but also brought into focus the less established connections to defense-related gene clusters. Thus, in comparison to analyses of manually selected gene lists, the systematic analysis of every cluster can reveal unexpected biological phenomena and produce much more comprehensive biological insights to the experiment of interest.
Collapse
Affiliation(s)
- Matti Kankainen
- Institute of BiotechnologyPO Box 56 (Viikinkaari 5), FIN-00014, Helsinki, Finland
| | - Günter Brader
- Department of Biological and Environmental Sciences, Division of Genetics, University of HelsinkiPO Box 56 (Viikinkaari 5), FIN-00014, Helsinki, Finland
| | - Petri Törönen
- Institute of BiotechnologyPO Box 56 (Viikinkaari 5), FIN-00014, Helsinki, Finland
| | - E. Tapio Palva
- Department of Biological and Environmental Sciences, Division of Genetics, University of HelsinkiPO Box 56 (Viikinkaari 5), FIN-00014, Helsinki, Finland
| | - Liisa Holm
- Institute of BiotechnologyPO Box 56 (Viikinkaari 5), FIN-00014, Helsinki, Finland
- Department of Biological and Environmental Sciences, Division of Genetics, University of HelsinkiPO Box 56 (Viikinkaari 5), FIN-00014, Helsinki, Finland
- To whom correspondence should be addressed. Tel:+358 9 19159115; Fax:+358 9 19159079;
| |
Collapse
|
14
|
Liu Y, Ciliax BJ, Borges K, Dasigi V, Ram A, Navathe SB, Dingledine R. Comparison of two schemes for automatic keyword extraction from MEDLINE for functional gene clustering. PROCEEDINGS. IEEE COMPUTATIONAL SYSTEMS BIOINFORMATICS CONFERENCE 2006:394-404. [PMID: 16448032 DOI: 10.1109/csb.2004.1332452] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
One of the key challenges of microarray studies is to derive biological insights from the unprecedented quatities of data on gene-expression patterns. Clustering genes by functional keyword association can provide direct information about the nature of the functional links among genes within the derived clusters. However, the quality of the keyword lists extracted from biomedical literature for each gene significantly affects the clustering results. We extracted keywords from MEDLINE that describes the most prominent functions of the genes, and used the resulting weights of the keywords as feature vectors for gene clustering. By analyzing the resulting cluster quality, we compared two keyword weighting schemes: normalized z-score and term frequency-inverse document frequency (TFIDF). The best combination of background comparison set, stop list and stemming algorithm was selected based on precision and recall metrics. In a test set of four known gene groups, a hierarchical algorithm correctly assigned 25 of 26 genes to the appropriate clusters based on keywords extracted by the TDFIDF weighting scheme, but only 23 og 26 with the z-score method. To evaluate the effectiveness of the weighting schemes for keyword extraction for gene clusters from microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle were used as a second test set. Using established measures of cluster quality, the results produced from TFIDF-weighted keywords had higher purity, lower entropy, and higher mutual information than those produced from normalized z-score weighted keywords. The optimized algorithms should be useful for sorting genes from microarray lists into functionally discrete clusters.
Collapse
Affiliation(s)
- Ying Liu
- College of Computing, Georgia Institute of Technology, Atlanta, 30322, USA.
| | | | | | | | | | | | | |
Collapse
|
15
|
Jensen LJ, Saric J, Bork P. Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 2006; 7:119-29. [PMID: 16418747 DOI: 10.1038/nrg1768] [Citation(s) in RCA: 356] [Impact Index Per Article: 19.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
For the average biologist, hands-on literature mining currently means a keyword search in PubMed. However, methods for extracting biomedical facts from the scientific literature have improved considerably, and the associated tools will probably soon be used in many laboratories to automatically annotate and analyse the growing number of system-wide experimental data sets. Owing to the increasing body of text and the open-access policies of many journals, literature mining is also becoming useful for both hypothesis generation and biological discovery. However, the latter will require the integration of literature and high-throughput data, which should encourage close collaborations between biologists and computational linguists.
Collapse
Affiliation(s)
- Lars Juhl Jensen
- European Molecular Biology Laboratory, D-69117 Heidelberg, Germany.
| | | | | |
Collapse
|
16
|
Aubry M, Monnier A, Chicault C, de Tayrac M, Galibert MD, Burgun A, Mosser J. Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets. BMC Bioinformatics 2006; 7:241. [PMID: 16674810 PMCID: PMC1482722 DOI: 10.1186/1471-2105-7-241] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2005] [Accepted: 05/04/2006] [Indexed: 12/18/2022] Open
Abstract
Background Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from semi-automatic annotations made by trained biologists (annotation based on evidence) or text-mining of the published scientific literature (literature profiling). Results We report an original functional annotation method based on a combination of evidence and literature that overcomes the weaknesses and the limitations of each approach. It relies on the Gene Ontology Annotation database (GOA Human) and the PubGene biomedical literature index. We support these annotations with statistically associated GO terms and retrieve associative relations across the three GO hierarchies to emphasise the major pathways involved by a gene cluster. Both annotation methods and associative relations were quantitatively evaluated with a reference set of 7397 genes and a multi-cluster study of 14 clusters. We also validated the biological appropriateness of our hybrid method with the annotation of a single gene (cdc2) and that of a down-regulated cluster of 37 genes identified by a transcriptome study of an in vitro enterocyte differentiation model (CaCo-2 cells). Conclusion The combination of both approaches is more informative than either separate approach: literature mining can enrich an annotation based only on evidence. Text-mining of the literature can also find valuable associated MEDLINE references that confirm the relevance of the annotation. Eventually, GO terms networks can be built with associative relations in order to highlight cooperative and competitive pathways and their connected molecular functions.
Collapse
Affiliation(s)
- Marc Aubry
- CNRS UMR 6061 Génétique et Développement, Université de Rennes 1, Groupe Oncogénomique, IFR140 GFAS, Faculté de Médecine, 2 avenue du Pr, Léon Bernard, CS 34317, 35043 Rennes Cedex, France.
| | | | | | | | | | | | | |
Collapse
|
17
|
Semeiks JR, Rizki A, Bissell MJ, Mian IS. Ensemble attribute profile clustering: discovering and characterizing groups of genes with similar patterns of biological features. BMC Bioinformatics 2006; 7:147. [PMID: 16542449 PMCID: PMC1435935 DOI: 10.1186/1471-2105-7-147] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2005] [Accepted: 03/16/2006] [Indexed: 11/17/2022] Open
Abstract
Background Ensemble attribute profile clustering is a novel, text-based strategy for analyzing a user-defined list of genes and/or proteins. The strategy exploits annotation data present in gene-centered corpora and utilizes ideas from statistical information retrieval to discover and characterize properties shared by subsets of the list. The practical utility of this method is demonstrated by employing it in a retrospective study of two non-overlapping sets of genes defined by a published investigation as markers for normal human breast luminal epithelial cells and myoepithelial cells. Results Each genetic locus was characterized using a finite set of biological properties and represented as a vector of features indicating attributes associated with the locus (a gene attribute profile). In this study, the vector space models for a pre-defined list of genes were constructed from the Gene Ontology (GO) terms and the Conserved Domain Database (CDD) protein domain terms assigned to the loci by the gene-centered corpus LocusLink. This data set of GO- and CDD-based gene attribute profiles, vectors of binary random variables, was used to estimate multiple finite mixture models and each ensuing model utilized to partition the profiles into clusters. The resultant partitionings were combined using a unanimous voting scheme to produce consensus clusters, sets of profiles that co-occured consistently in the same cluster. Attributes that were important in defining the genes assigned to a consensus cluster were identified. The clusters and their attributes were inspected to ascertain the GO and CDD terms most associated with subsets of genes and in conjunction with external knowledge such as chromosomal location, used to gain functional insights into human breast biology. The 52 luminal epithelial cell markers and 89 myoepithelial cell markers are disjoint sets of genes. Ensemble attribute profile clustering-based analysis indicated that both lists contained groups of genes with the functional properties of membrane receptor biology/signal transduction and nucleic acid binding/transcription. A subset of the luminal markers was associated with metabolic and oxidoreductase activities, whereas a subset of myoepithelial markers was associated with protein hydrolase activity. Conclusion Given a set of genes and/or proteins associated with a phenomenon, process or system of interest, ensemble attribute profile clustering provides a simple method for collating and sythesizing the annotation data pertaining to them that are present in text-based, gene-centered corpora. The results provide information about properties common and unique to subsets of the list and hence insights into the biology of the problem under investigation.
Collapse
Affiliation(s)
- JR Semeiks
- Life Sciences Division (MS 977-225A), Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
| | - A Rizki
- Life Sciences Division (MS 977-225A), Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
| | - MJ Bissell
- Life Sciences Division (MS 977-225A), Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
| | - IS Mian
- Life Sciences Division (MS 74-197), Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720-8265, USA
| |
Collapse
|
18
|
Chagoyen M, Carmona-Saez P, Shatkay H, Carazo JM, Pascual-Montano A. Discovering semantic features in the literature: a foundation for building functional associations. BMC Bioinformatics 2006; 7:41. [PMID: 16438716 PMCID: PMC1386711 DOI: 10.1186/1471-2105-7-41] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2005] [Accepted: 01/26/2006] [Indexed: 11/10/2022] Open
Abstract
Background Experimental techniques such as DNA microarray, serial analysis of gene expression (SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins at different levels. As in any other experimental approach, it is necessary to analyze these data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research. Results We present a method to create literature profiles for large sets of genes or proteins based on common semantic features extracted from a corpus of relevant documents. These profiles can be used to establish pair-wise similarities among genes, utilized in gene/protein classification or can be even combined with experimental measurements. Semantic features can be used by researchers to facilitate the understanding of the commonalities indicated by experimental results. Our approach is based on non-negative matrix factorization (NMF), a machine-learning algorithm for data analysis, capable of identifying local patterns that characterize a subset of the data. The literature is thus used to establish putative relationships among subsets of genes or proteins and to provide coherent justification for this clustering into subsets. We demonstrate the utility of the method by applying it to two independent and vastly different sets of genes. Conclusion The presented method can create literature profiles from documents relevant to sets of genes. The representation of genes as additive linear combinations of semantic features allows for the exploration of functional associations as well as for clustering, suggesting a valuable methodology for the validation and interpretation of high-throughput experimental data.
Collapse
Affiliation(s)
- Monica Chagoyen
- Biocomputing Unit, Centro Nacional de Biotecnologia – CSIC, Madrid, Spain
| | - Pedro Carmona-Saez
- Biocomputing Unit, Centro Nacional de Biotecnologia – CSIC, Madrid, Spain
| | - Hagit Shatkay
- School of Computing, Queen's University, Kingston, Ontario, Canada
| | - Jose M Carazo
- Biocomputing Unit, Centro Nacional de Biotecnologia – CSIC, Madrid, Spain
| | | |
Collapse
|
19
|
Curtis RK, Oresic M, Vidal-Puig A. Pathways to the analysis of microarray data. Trends Biotechnol 2005; 23:429-35. [PMID: 15950303 DOI: 10.1016/j.tibtech.2005.05.011] [Citation(s) in RCA: 176] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2005] [Revised: 04/11/2005] [Accepted: 05/24/2005] [Indexed: 10/25/2022]
Abstract
The development of microarray technology allows the simultaneous measurement of the expression of many thousands of genes. The information gained offers an unprecedented opportunity to fully characterize biological processes. However, this challenge will only be successful if new tools for the efficient integration and interpretation of large datasets are available. One of these tools, pathway analysis, involves looking for consistent but subtle changes in gene expression by incorporating either pathway or functional annotations. We review several methods of pathway analysis and compare the performance of three, the binomial distribution, z scores, and gene set enrichment analysis, on two microarray datasets. Pathway analysis is a promising tool to identify the mechanisms that underlie diseases, adaptive physiological compensatory responses and new avenues for investigation.
Collapse
Affiliation(s)
- R Keira Curtis
- University of Cambridge Department of Clinical Biochemistry, Box 232, Addenbrooke's Hospital, Hills Road, Cambridge, UK, CB2 2QR.
| | | | | |
Collapse
|
20
|
Natarajan J, Berrar D, Hack CJ, Dubitzky W. Knowledge discovery in biology and biotechnology texts: a review of techniques, evaluation strategies, and applications. Crit Rev Biotechnol 2005; 25:31-52. [PMID: 15999851 DOI: 10.1080/07388550590935571] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/24/2023]
Abstract
Arguably, the richest source of knowledge (as opposed to fact and data collections) about biology and biotechnology is captured in natural-language documents such as technical reports, conference proceedings and research articles. The automatic exploitation of this rich knowledge base for decision making, hypothesis management (generation and testing) and knowledge discovery constitutes a formidable challenge. Recently, a set of technologies collectively referred to as knowledge discovery in text (KDT) has been advocated as a promising approach to tackle this challenge. KDT comprises three main tasks: information retrieval, information extraction and text mining. These tasks are the focus of much recent scientific research and many algorithms have been developed and applied to documents and text in biology and biotechnology. This article introduces the basic concepts of KDT, provides an overview of some of these efforts in the field of bioscience and biotechnology, and presents a framework of commonly used techniques for evaluating KDT methods, tools and systems.
Collapse
Affiliation(s)
- J Natarajan
- University of Ulster, School of Biomedical Sciences, Bioinformatics Research Group, Coleraine BT52 1SA, Northern Ireland
| | | | | | | |
Collapse
|
21
|
Hakenberg J, Schmeier S, Kowald A, Klipp E, Leser U. Finding kinetic parameters using text mining. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2005; 8:131-52. [PMID: 15268772 DOI: 10.1089/1536231041388366] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
The mathematical modeling and description of complex biological processes has become more and more important over the last years. Systems biology aims at the computational simulation of complex systems, up to whole cell simulations. An essential part focuses on solving a large number of parameterized differential equations. However, measuring those parameters is an expensive task, and finding them in the literature is very laborious. We developed a text mining system that supports researchers in their search for experimentally obtained parameters for kinetic models. Our system classifies full text documents regarding the question whether or not they contain appropriate data using a support vector machine. We evaluated our approach on a manually tagged corpus of 800 documents and found that it outperforms keyword searches in abstracts by a factor of five in terms of precision.
Collapse
Affiliation(s)
- Jörg Hakenberg
- Humboldt-Universität zu Berlin, Department of Computer Science, Berlin, Germany.
| | | | | | | | | |
Collapse
|
22
|
Alako BTF, Veldhoven A, van Baal S, Jelier R, Verhoeven S, Rullmann T, Polman J, Jenster G. CoPub Mapper: mining MEDLINE based on search term co-publication. BMC Bioinformatics 2005; 6:51. [PMID: 15760478 PMCID: PMC1274248 DOI: 10.1186/1471-2105-6-51] [Citation(s) in RCA: 56] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2004] [Accepted: 03/11/2005] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND High throughput microarray analyses result in many differentially expressed genes that are potentially responsible for the biological process of interest. In order to identify biological similarities between genes, publications from MEDLINE were identified in which pairs of gene names and combinations of gene name with specific keywords were co-mentioned. RESULTS MEDLINE search strings for 15,621 known genes and 3,731 keywords were generated and validated. PubMed IDs were retrieved from MEDLINE and relative probability of co-occurrences of all gene-gene and gene-keyword pairs determined. To assess gene clustering according to literature co-publication, 150 genes consisting of 8 sets with known connections (same pathway, same protein complex, or same cellular localization, etc.) were run through the program. Receiver operator characteristics (ROC) analyses showed that most gene sets were clustered much better than expected by random chance. To test grouping of genes from real microarray data, 221 differentially expressed genes from a microarray experiment were analyzed with CoPub Mapper, which resulted in several relevant clusters of genes with biological process and disease keywords. In addition, all genes versus keywords were hierarchical clustered to reveal a complete grouping of published genes based on co-occurrence. CONCLUSION The CoPub Mapper program allows for quick and versatile querying of co-published genes and keywords and can be successfully used to cluster predefined groups of genes and microarray data.
Collapse
Affiliation(s)
- Blaise TF Alako
- Department of Molecular Design & Informatics, Organon NV, P.O. Box 20, 5340 BH Oss, The Netherlands
| | - Antoine Veldhoven
- Department of Urology, Erasmus MC, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
| | - Sjozef van Baal
- Department of Genetics, Erasmus MC, Rotterdam, The Netherlands
| | - Rob Jelier
- Department of Medical Informatics, Erasmus MC, Rotterdam, The Netherlands
| | - Stefan Verhoeven
- Department of Molecular Design & Informatics, Organon NV, P.O. Box 20, 5340 BH Oss, The Netherlands
| | - Ton Rullmann
- Department of Molecular Design & Informatics, Organon NV, P.O. Box 20, 5340 BH Oss, The Netherlands
| | - Jan Polman
- Department of Molecular Design & Informatics, Organon NV, P.O. Box 20, 5340 BH Oss, The Netherlands
| | - Guido Jenster
- Department of Urology, Erasmus MC, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
| |
Collapse
|
23
|
Liu Y, Navathe SB, Civera J, Dasigi V, Ram A, Ciliax BJ, Dingledine R. Text mining biomedical literature for discovering gene-to-gene relationships: a comparative study of algorithms. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2005; 2:62-76. [PMID: 17044165 DOI: 10.1109/tcbb.2005.14] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/12/2023]
Abstract
Partitioning closely related genes into clusters has become an important element of practically all statistical analyses of microarray data. A number of computer algorithms have been developed for this task. Although these algorithms have demonstrated their usefulness for gene clustering, some basic problems remain. This paper describes our work on extracting functional keywords from MEDLINE for a set of genes that are isolated for further study from microarray experiments based on their differential expression patterns. The sharing of functional keywords among genes is used as a basis for clustering in a new approach called BEA-PARTITION in this paper. Functional keywords associated with genes were extracted from MEDLINE abstracts. We modified the Bond Energy Algorithm (BEA), which is widely accepted in psychology and database design but is virtually unknown in bioinformatics, to cluster genes by functional keyword associations. The results showed that BEA-PARTITION and hierarchical clustering algorithm outperformed k-means clustering and self-organizing map by correctly assigning 25 of 26 genes in a test set of four known gene groups. To evaluate the effectiveness of BEA-PARTITION for clustering genes identified by microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle and have been widely studied in the literature were used as a second test set. Using established measures of cluster quality, the results produced by BEA-PARTITION had higher purity, lower entropy, and higher mutual information than those produced by k-means and self-organizing map. Whereas BEA-PARTITION and the hierarchical clustering produced similar quality of clusters, BEA-PARTITION provides clear cluster boundaries compared to the hierarchical clustering. BEA-PARTITION is simple to implement and provides a powerful approach to clustering genes or to any clustering problem where starting matrices are available from experimental observations.
Collapse
Affiliation(s)
- Ying Liu
- College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA 30322, USA.
| | | | | | | | | | | | | |
Collapse
|
24
|
Minagar A, Shapshak P, Duran EM, Kablinger AS, Alexander JS, Kelley RE, Seth R, Kazic T. HIV-associated dementia, Alzheimer's disease, multiple sclerosis, and schizophrenia: gene expression review. J Neurol Sci 2004; 224:3-17. [PMID: 15450765 DOI: 10.1016/j.jns.2004.06.007] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2004] [Revised: 06/15/2004] [Accepted: 06/16/2004] [Indexed: 12/18/2022]
Abstract
RNA and protein gene expression technologies are revolutionizing our view and understanding of human diseases and enable us to analyze the concurrent expression patterns of large numbers of genes. These new technologies allow simultaneous study of thousands of genes and their changes in regulation and modulation patterns in relation to disease state, time, and tissue specificity. This review summarizes the application of this modern technology to four common neurological and psychiatric disorders: HIV-1-associated dementia, Alzheimer's disease, multiple sclerosis, and schizophrenia and is a first comparison of these diseases using this approach.
Collapse
Affiliation(s)
- Alireza Minagar
- Department of Neurology, Louisiana State University School of Medicine, Shreveport 71130, USA.
| | | | | | | | | | | | | | | |
Collapse
|
25
|
Santos C, Eggle D, States DJ. Wnt pathway curation using automated natural language processing: combining statistical methods with partial and full parse for knowledge extraction. Bioinformatics 2004; 21:1653-8. [PMID: 15564295 DOI: 10.1093/bioinformatics/bti165] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Wnt signaling is a very active area of research with highly relevant publications appearing at a rate of more than one per day. Building and maintaining databases describing signal transduction networks is a time-consuming and demanding task that requires careful literature analysis and extensive domain-specific knowledge. For instance, more than 50 factors involved in Wnt signal transduction have been identified as of late 2003. In this work we describe a natural language processing (NLP) system that is able to identify references to biological interaction networks in free text and automatically assembles a protein association and interaction map. RESULTS A 'gold standard' set of names and assertions was derived by manual scanning of the Wnt genes website (http://www.stanford.edu/~rnusse/wntwindow.html) including 53 interactions involved in Wnt signaling. This system was used to analyze a corpus of peer-reviewed articles related to Wnt signaling including 3369 Pubmed and 1230 full text papers. Names for key Wnt-pathway associated proteins and biological entities are identified using a chi-squared analysis of noun phrases over-represented in the Wnt literature as compared to the general signal transduction literature. Interestingly, we identified several instances where generic terms were used on the website when more specific terms occur in the literature, and one typographic error on the Wnt canonical pathway. Using the named entity list and performing an exhaustive assertion extraction of the corpus, 34 of the 53 interactions in the 'gold standard' Wnt signaling set were successfully identified (64% recall). In addition, the automated extraction found several interactions involving key Wnt-related molecules which were missing or different from those in the canonical diagram, and these were confirmed by manual review of the text. These results suggest that a combination of NLP techniques for information extraction can form a useful first-pass tool for assisting human annotation and maintenance of signal pathway databases. AVAILABILITY The pipeline software components are freely available on request to the authors. CONTACT dstates@umich.edu SUPPLEMENTARY INFORMATION http://stateslab.bioinformatics.med.umich.edu/software.html.
Collapse
Affiliation(s)
- Carlos Santos
- Bioinformatics Program, The University of Michigan, Ann Arbor, MI 48109, USA
| | | | | |
Collapse
|
26
|
Herrero J, Vaquerizas JM, Al-Shahrour F, Conde L, Mateos A, Díaz-Uriarte JSR, Dopazo J. New challenges in gene expression data analysis and the extended GEPAS. Nucleic Acids Res 2004; 32:W485-91. [PMID: 15215434 PMCID: PMC441559 DOI: 10.1093/nar/gkh421] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2004] [Revised: 04/07/2004] [Accepted: 04/07/2004] [Indexed: 01/30/2023] Open
Abstract
Since the first papers published in the late nineties, including, for the first time, a comprehensive analysis of microarray data, the number of questions that have been addressed through this technique have both increased and diversified. Initially, interest focussed on genes coexpressing across sets of experimental conditions, implying, essentially, the use of clustering techniques. Recently, however, interest has focussed more on finding genes differentially expressed among distinct classes of experiments, or correlated to diverse clinical outcomes, as well as in building predictors. In addition to this, the availability of accurate genomic data and the recent implementation of CGH arrays has made mapping expression and genomic data on the chromosomes possible. There is also a clear demand for methods that allow the automatic transfer of biological information to the results of microarray experiments. Different initiatives, such as the Gene Ontology (GO) consortium, pathways databases, protein functional motifs, etc., provide curated annotations for genes. Whereas many resources on the web focus mainly on clustering methods, GEPAS has evolved to cope with the aforementioned new challenges that have recently arisen in the field of microarray data analysis. The web-based pipeline for microarray gene expression data, GEPAS, is available at http://gepas.bioinfo.cnio.es.
Collapse
Affiliation(s)
- Javier Herrero
- Bioinformatics Unit, Biotechnology Programme, Centro Nacional de Investigaciones Oncológicas, Melchor Fernández Almagro, 3, E-28029 Madrid, Spain
| | | | | | | | | | | | | |
Collapse
|
27
|
Glenisson P, Coessens B, Van Vooren S, Mathys J, Moreau Y, De Moor B. TXTGate: profiling gene groups with text-based information. Genome Biol 2004; 5:R43. [PMID: 15186494 PMCID: PMC463076 DOI: 10.1186/gb-2004-5-6-r43] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2003] [Revised: 02/03/2004] [Accepted: 04/27/2004] [Indexed: 11/23/2022] Open
Abstract
We implemented a framework called TXTGate that combines literature indices of selected public biological resources in a flexible text-mining system designed towards the analysis of groups of genes. By means of tailored vocabularies, term- as well as gene-centric views are offered on selected textual fields and MEDLINE abstracts used in LocusLink and the Saccharomyces Genome Database. Subclustering and links to external resources allow for in-depth analysis of the resulting term profiles.
Collapse
Affiliation(s)
- Patrick Glenisson
- Departement Elektrotechniek (ESAT), Faculteit Toegepaste Wetenschappen, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Heverlee (Leuven), Belgium
| | - Bert Coessens
- Departement Elektrotechniek (ESAT), Faculteit Toegepaste Wetenschappen, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Heverlee (Leuven), Belgium
| | - Steven Van Vooren
- Departement Elektrotechniek (ESAT), Faculteit Toegepaste Wetenschappen, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Heverlee (Leuven), Belgium
| | - Janick Mathys
- Departement Elektrotechniek (ESAT), Faculteit Toegepaste Wetenschappen, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Heverlee (Leuven), Belgium
| | - Yves Moreau
- Departement Elektrotechniek (ESAT), Faculteit Toegepaste Wetenschappen, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Heverlee (Leuven), Belgium
- Current address: Center for Biological Sequence Analysis, BioCentrum, Danish Technical University, Kemitorvet, DK-2800 Lyngby, Denmark
| | - Bart De Moor
- Departement Elektrotechniek (ESAT), Faculteit Toegepaste Wetenschappen, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Heverlee (Leuven), Belgium
| |
Collapse
|
28
|
Wilkinson DM, Huberman BA. A method for finding communities of related genes. Proc Natl Acad Sci U S A 2004; 101 Suppl 1:5241-8. [PMID: 14757821 PMCID: PMC387302 DOI: 10.1073/pnas.0307740100] [Citation(s) in RCA: 164] [Impact Index Per Article: 8.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
We present a method for creating a network of gene co-occurrences from the literature and partitioning it into communities of related genes. The way in which our method identifies communities makes it likely that the component genes of each community will be related by their function. The method processes a large database of article abstracts, synthesizing information from many sources to shed light on groups of genes that have been shown to interact. It is a tool to be used by researchers in the biomedical sciences to swiftly search for known interactions and to provide insight into unexplored connections. The partitioning procedure is designed to be particularly applicable to large networks in which individual nodes may play a role in more than one community. In this paper, we explain the details of the method, in particular the partitioning process. We also apply the method to produce communities of genes related to colon cancer and show that the results are useful.
Collapse
Affiliation(s)
- Dennis M Wilkinson
- Stanford University and HP Laboratories, 1501 Page Mill Road, Palo Alto, CA 94394, USA
| | | |
Collapse
|
29
|
Karaoz U, Murali TM, Letovsky S, Zheng Y, Ding C, Cantor CR, Kasif S. Whole-genome annotation by using evidence integration in functional-linkage networks. Proc Natl Acad Sci U S A 2004; 101:2888-93. [PMID: 14981259 PMCID: PMC365715 DOI: 10.1073/pnas.0307326101] [Citation(s) in RCA: 240] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
The advent of high-throughput biology has catalyzed a remarkable improvement in our ability to identify new genes. A large fraction of newly discovered genes have an unknown functional role, particularly when they are specific to a particular lineage or organism. These genes, currently labeled "hypothetical," might support important biological cell functions and could potentially serve as targets for medical, diagnostic, or pharmacogenomic studies. An important challenge to the scientific community is to associate these newly predicted genes with a biological function that can be validated by experimental screens. In the absence of sequence or structural homology to known genes, we must rely on advanced biotechnological methods, such as DNA chips and protein-protein interaction screens as well as computational techniques to assign putative functions to these genes. In this article, we propose an effective methodology for combining biological evidence obtained in several high-throughput experimental screens and integrating this evidence in a way that provides consistent functional assignments to hypothetical genes. We use the visualization method of propagation diagrams to illustrate the flow of functional evidence that supports the functional assignments produced by the algorithm. Our results contain a number of predictions and furnish strong evidence that integration of functional information is indeed a promising direction for improving the accuracy and robustness of functional genomics.
Collapse
Affiliation(s)
- Ulas Karaoz
- Bioinformatics Program, Boston University, 48 Cummington Street, Boston, MA 02215, USA
| | | | | | | | | | | | | |
Collapse
|
30
|
Abstract
It is now obvious that the rate-limiting step in high throughput experimentation is neither data acquisition nor analysis, but rather our ability to interpret data on a genome-wide scale. Indeed, the explosion of data sampling capacity combined with increasing publication rates greatly impairs our ability to find meaning in vast collections of data. In order to support data interpretation, bioinformatic tools are needed to identify critical information contained in large bodies of literature. However, extracting knowledge embedded in free text is an arduous task, compounded in the biomedical field by an inconsistent gene nomenclature, domain-specific language and restricted access to full text articles. This paper presents a selection of currently available biomedical literature mining software. These tools rely on statistic and, more recently, semantic analyses (Natural Language Processing) to automatically extract information from the literature. In addition, a literature mining strategy has been developed to explore patterns of term occurrences in abstracts. This method automatically identifies relevant keywords in collections of abstracts, and uses a pattern discovery algorithm to generate a visual interface for exploring functional associations among genes. Term occurrence heatmaps can also be combined with gene expression profiles to provide valuable functional annotations. Furthermore, as demonstrated with tumor cell line literature profiling results, this approach can be applied to a variety of themes beyond genomic data analysis. Altogether, these examples illustrate how literature analysis can be employed to support knowledge discovery in biomedical research.
Collapse
Affiliation(s)
- Damien Chaussabel
- Immunobiology Section, Laboratory of Parasitic Diseases, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, USA.
| |
Collapse
|
31
|
Kazic T, Coe E, Polacco M, Shyu CR. Whither biological database research? OMICS : A JOURNAL OF INTEGRATIVE BIOLOGY 2003; 7:61-5. [PMID: 12831558 DOI: 10.1089/153623103322006625] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
We consider how the landscape of biological databases may evolve in the future, and what research is needed to realize this evolution. We suggest today's dispersal of diverse resources will only increase as the number and size of those resources, driving the need for semantic interoperability even more strongly. Because the complexity of the questions biologists want answered automatically continues to rapidly escalate, we will need to draw upon high-performance computing resources such as the GRID to process complex queries. Finally, we still need data, and our ways of acquiring and curating data must improve by orders of magnitude.
Collapse
Affiliation(s)
- Toni Kazic
- Department of Computer Engineering and Computer Science, University of Missouri, Columbia, Missouri 65201, USA.
| | | | | | | |
Collapse
|
32
|
Raychaudhuri S, Chang JT, Imam F, Altman RB. The computational analysis of scientific literature to define and recognize gene expression clusters. Nucleic Acids Res 2003; 31:4553-60. [PMID: 12888516 PMCID: PMC169898 DOI: 10.1093/nar/gkg636] [Citation(s) in RCA: 40] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
A limitation of many gene expression analytic approaches is that they do not incorporate comprehensive background knowledge about the genes into the analysis. We present a computational method that leverages the peer-reviewed literature in the automatic analysis of gene expression data sets. Including the literature in the analysis of gene expression data offers an opportunity to incorporate functional information about the genes when defining expression clusters. We have created a method that associates gene expression profiles with known biological functions. Our method has two steps. First, we apply hierarchical clustering to the given gene expression data set. Secondly, we use text from abstracts about genes to (i) resolve hierarchical cluster boundaries to optimize the functional coherence of the clusters and (ii) recognize those clusters that are most functionally coherent. In the case where a gene has not been investigated and therefore lacks primary literature, articles about well-studied homologous genes are added as references. We apply our method to two large gene expression data sets with different properties. The first contains measurements for a subset of well-studied Saccharomyces cerevisiae genes with multiple literature references, and the second contains newly discovered genes in Drosophila melanogaster; many have no literature references at all. In both cases, we are able to rapidly define and identify the biologically relevant gene expression profiles without manual intervention. In both cases, we identified novel clusters that were not noted by the original investigators.
Collapse
Affiliation(s)
- Soumya Raychaudhuri
- Department of Genetics and. Department of Biochemistry, Stanford University, Stanford, CA 94305, USA
| | | | | | | |
Collapse
|
33
|
Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci U S A 2003; 100:8348-53. [PMID: 12826619 PMCID: PMC166232 DOI: 10.1073/pnas.0832373100] [Citation(s) in RCA: 410] [Impact Index Per Article: 19.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Genomic sequencing is no longer a novelty, but gene function annotation remains a key challenge in modern biology. A variety of functional genomics experimental techniques are available, from classic methods such as affinity precipitation to advanced high-throughput techniques such as gene expression microarrays. In the future, more disparate methods will be developed, further increasing the need for integrated computational analysis of data generated by these studies. We address this problem with MAGIC (Multisource Association of Genes by Integration of Clusters), a general framework that uses formal Bayesian reasoning to integrate heterogeneous types of high-throughput biological data (such as large-scale two-hybrid screens and multiple microarray analyses) for accurate gene function prediction. The system formally incorporates expert knowledge about relative accuracies of data sources to combine them within a normative framework. MAGIC provides a belief level with its output that allows the user to vary the stringency of predictions. We applied MAGIC to Saccharomyces cerevisiae genetic and physical interactions, microarray, and transcription factor binding sites data and assessed the biological relevance of gene groupings using Gene Ontology annotations produced by the Saccharomyces Genome Database. We found that by creating functional groupings based on heterogeneous data types, MAGIC improved accuracy of the groupings compared with microarray analysis alone. We describe several of the biological gene groupings identified.
Collapse
Affiliation(s)
- Olga G. Troyanskaya
- Department of Genetics and
Saccharomyces Genome Database,
Stanford University School of Medicine, and
Department of Statistics, Stanford University,
Stanford, CA 94305
| | - Kara Dolinski
- Department of Genetics and
Saccharomyces Genome Database,
Stanford University School of Medicine, and
Department of Statistics, Stanford University,
Stanford, CA 94305
| | - Art B. Owen
- Department of Genetics and
Saccharomyces Genome Database,
Stanford University School of Medicine, and
Department of Statistics, Stanford University,
Stanford, CA 94305
| | - Russ B. Altman
- Department of Genetics and
Saccharomyces Genome Database,
Stanford University School of Medicine, and
Department of Statistics, Stanford University,
Stanford, CA 94305
| | - David Botstein
- Department of Genetics and
Saccharomyces Genome Database,
Stanford University School of Medicine, and
Department of Statistics, Stanford University,
Stanford, CA 94305
| |
Collapse
|
34
|
Kim CC, Falkow S. Significance analysis of lexical bias in microarray data. BMC Bioinformatics 2003; 4:12. [PMID: 12697067 PMCID: PMC153504 DOI: 10.1186/1471-2105-4-12] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2003] [Accepted: 04/03/2003] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND Genes that are determined to be significantly differentially regulated in microarray analyses often appear to have functional commonalities, such as being components of the same biochemical pathway. This results in certain words being under- or overrepresented in the list of genes. Distinguishing between biologically meaningful trends and artifacts of annotation and analysis procedures is of the utmost importance, as only true biological trends are of interest for further experimentation. A number of sophisticated methods for identification of significant lexical trends are currently available, but these methods are generally too cumbersome for practical use by most microarray users. RESULTS We have developed a tool, LACK, for calculating the statistical significance of apparent lexical bias in microarray datasets. The frequency of a user-specified list of search terms in a list of genes which are differentially regulated is assessed for statistical significance by comparison to randomly generated datasets. The simplicity of the input files and user interface targets the average microarray user who wishes to have a statistical measure of apparent lexical trends in analyzed datasets without the need for bioinformatics skills. The software is available as Perl source or a Windows executable. CONCLUSION We have used LACK in our laboratory to generate biological hypotheses based on our microarray data. We demonstrate the program's utility using an example in which we confirm significant upregulation of SPI-2 pathogenicity island of Salmonella enterica serovar Typhimurium by the cation chelator dipyridyl.
Collapse
Affiliation(s)
- Charles C Kim
- Microbiology and Immunology, Stanford University Medical Center, Stanford, CA, 94305, USA
| | - Stanley Falkow
- Microbiology and Immunology, Stanford University Medical Center, Stanford, CA, 94305, USA
| |
Collapse
|
35
|
Raychaudhuri S, Altman RB. A literature-based method for assessing the functional coherence of a gene group. Bioinformatics 2003; 19:396-401. [PMID: 12584126 PMCID: PMC2669934 DOI: 10.1093/bioinformatics/btg002] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Many experimental and algorithmic approaches in biology generate groups of genes that need to be examined for related functional properties. For example, gene expression profiles are frequently organized into clusters of genes that may share functional properties. We evaluate a method, neighbor divergence per gene (NDPG), that uses scientific literature to assess whether a group of genes are functionally related. The method requires only a corpus of documents and an index connecting the documents to genes. RESULTS We evaluate NDPG on 2796 functional groups generated by the Gene Ontology consortium in four organisms: mouse, fly, worm and yeast. NDPG finds functional coherence in 96, 92, 82 and 45% of the groups (at 99.9% specificity) in yeast, mouse, fly and worm respectively.
Collapse
Affiliation(s)
- Soumya Raychaudhuri
- Department of Genetics, Stanford Medical Informatics, 251 Campus Drive, MSOB X-215, Stanford University, Stanford, CA 94305-5479, USA
| | | |
Collapse
|
36
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2003; 4:277-84. [PMID: 18629117 PMCID: PMC2447404 DOI: 10.1002/cfg.227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
|