Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Raychaudhuri S, Schütze H, Altman RB. Using text analysis to identify functionally coherent gene groups. Genome Res 2002;12:1582-90. [PMID: 12368251 PMCID: PMC187532 DOI: 10.1101/gr.116402] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.1] [Indexed: 11/24/2022]

Cited by Other Article(s)

Biomedical text mining and its applications in cancer research. J Biomed Inform 2013;46:200-11. [DOI: 10.1016/j.jbi.2012.10.007] [Citation(s) in RCA: 159] [Impact Index Per Article: 14.5] [Received: 03/28/2012] [Revised: 10/30/2012] [Accepted: 10/30/2012] [Indexed: 11/21/2022]

Functional annotation of hierarchical modularity. PLoS One 2012;7:e33744. [PMID: 22496762 PMCID: PMC3319548 DOI: 10.1371/journal.pone.0033744] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 10/12/2011] [Accepted: 02/16/2012] [Indexed: 11/28/2022]

Abstract

In biological networks of molecular interactions in a cell, network motifs that are biologically relevant are also functionally coherent, or form functional modules. These functionally coherent modules combine in a hierarchical manner into larger, less cohesive subsystems, thus revealing one of the essential design principles of system-level cellular organization and function: hierarchical modularity. Arguably, hierarchical modularity has not been explicitly taken into consideration by most, if not all, functional annotation systems. As a result, the existing methods would often fail to assign a statistically significant functional coherence score to biologically relevant molecular machines. We developed a methodology for hierarchical functional annotation. Given the hierarchical taxonomy of functional concepts (e.g., Gene Ontology) and the association of individual genes or proteins with these concepts (e.g., GO terms), our method will assign a Hierarchical Modularity Score (HMS) to each node in the hierarchy of functional modules; the HMS value measures the functional coherence of each module in the hierarchy. While existing methods annotate each module with a set of “enriched” functional terms in a bag of genes, our complementary method provides the hierarchical functional annotation of the modules and their hierarchically organized components. A hierarchical organization of functional modules often comes as a by-product of cluster analysis of gene expression data or protein interaction data. Otherwise, our method will automatically build such a hierarchy by directly incorporating the functional taxonomy information into the hierarchy search process and by allowing multi-functional genes to be part of more than one component in the hierarchy. In addition, its underlying HMS scoring metric ensures that functional specificity of the terms across different levels of the hierarchical taxonomy is properly treated. We have evaluated our method using Saccharomyces cerevisiae data from KEGG and MIPS databases and several other computationally derived and curated datasets. The code and additional supplemental files can be obtained from http://code.google.com/p/functional-annotation-of-hierarchical-modularity/ (Accessed 2012 March 13).

Xu L, Furlotte N, Lin Y, Heinrich K, Berry MW, George EO, Homayouni R. Functional cohesion of gene sets determined by latent semantic indexing of PubMed abstracts. PLoS One 2011;6:e18851. [PMID: 21533142 PMCID: PMC3077411 DOI: 10.1371/journal.pone.0018851] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Received: 11/10/2010] [Accepted: 03/21/2011] [Indexed: 12/31/2022]

Frijters R, van Vugt M, Smeets R, van Schaik R, de Vlieg J, Alkema W. Literature mining for the discovery of hidden connections between drugs, genes and diseases. PLoS Comput Biol 2010;6. [PMID: 20885778 PMCID: PMC2944780 DOI: 10.1371/journal.pcbi.1000943] [Citation(s) in RCA: 120] [Impact Index Per Article: 8.6] [Received: 03/15/2010] [Accepted: 08/26/2010] [Indexed: 01/19/2023]

Abstract

The scientific literature represents a rich source for retrieval of knowledge on associations between biomedical concepts such as genes, diseases and cellular processes. A commonly used method to establish relationships between biomedical concepts from literature is co-occurrence. Apart from its use in knowledge retrieval, the co-occurrence method is also well-suited to discover new, hidden relationships between biomedical concepts following a simple ABC-principle, in which A and C have no direct relationship, but are connected via shared B-intermediates. In this paper we describe CoPub Discovery, a tool that mines the literature for new relationships between biomedical concepts. Statistical analysis using ROC curves showed that CoPub Discovery performed well over a wide range of settings and keyword thesauri. We subsequently used CoPub Discovery to search for new relationships between genes, drugs, pathways and diseases. Several of the newly found relationships were validated using independent literature sources. In addition, new predicted relationships between compounds and cell proliferation were validated and confirmed experimentally in an in vitro cell proliferation assay. The results show that CoPub Discovery is able to identify novel associations between genes, drugs, pathways and diseases that have a high probability of being biologically valid. This makes CoPub Discovery a useful tool to unravel the mechanisms behind disease, to find novel drug targets, or to find novel applications for existing drugs.

The biomedical literature is an important source of knowledge on the function of genes and on the mechanisms by which these genes regulate cellular processes. Several text mining approaches have been developed to leverage this rich source of information by automatically extracting associations between concepts such as genes, diseases and drugs from a large body of text. Here, we describe a new method that extracts novel, not yet recognized associations between genes, diseases, drugs and cellular processes from the biomedical literature. Our method is built on the assumption that even if two concepts do not have a direct connection in literature, they may be functionally related if they are both connected to an overlapping set of concepts. Using this approach we predicted several novel connections between genes, diseases, drugs and pathways. Our results imply that our method is able to predict novel relationships from literature and, most importantly, that these newly identified relationships are biologically relevant. Our method can aid the drug discovery process where it can be used to find novel drug targets, increase insight in mode of action of a drug or find novel applications for known drugs.
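
The ABC-principle described in the abstract above can be illustrated with a minimal sketch. This is not the CoPub Discovery implementation: the function names, the toy corpus, and the concept labels are all hypothetical, and no statistical scoring of co-occurrences is attempted. The sketch only shows how an A-C pair with no direct co-occurrence can be linked through shared B-intermediates.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_pairs(documents):
    """Return the set of concept pairs that co-occur in at least one document."""
    pairs = set()
    for concepts in documents:
        # Sort so each pair is stored in one canonical order.
        for a, b in combinations(sorted(set(concepts)), 2):
            pairs.add((a, b))
    return pairs


def hidden_links(documents):
    """Map each indirectly linked (A, C) pair to its shared B-intermediates."""
    direct = cooccurrence_pairs(documents)
    neighbours = defaultdict(set)
    for a, b in direct:
        neighbours[a].add(b)
        neighbours[b].add(a)

    hidden = {}
    for a, c in combinations(sorted(neighbours), 2):
        if (a, c) in direct:
            continue  # A and C already co-occur, so the link is not hidden
        shared = neighbours[a] & neighbours[c]
        if shared:
            hidden[(a, c)] = shared
    return hidden


# Toy corpus (hypothetical): each document is the set of concepts it mentions.
docs = [
    {"drug_X", "gene_B"},
    {"gene_B", "disease_Y"},
    {"drug_X", "pathway_P"},
]
print(hidden_links(docs))
```

In this toy corpus, drug_X and disease_Y never appear together but both co-occur with gene_B, so the pair is reported as a candidate hidden relationship. A real system would additionally rank such candidates, which this sketch leaves out.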

He X, Sarma MS, Ling X, Chee B, Zhai C, Schatz B. Identifying overrepresented concepts in gene lists from literature: a statistical approach based on Poisson mixture model. BMC Bioinformatics 2010;11:272. [PMID: 20487560 PMCID: PMC2885378 DOI: 10.1186/1471-2105-11-272] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 12/08/2009] [Accepted: 05/20/2010] [Indexed: 11/10/2022]

Wu S, Liu T, Altman RB. Identification of recurring protein structure microenvironments and discovery of novel functional sites around CYS residues. BMC Struct Biol 2010;10:4. [PMID: 20122268 PMCID: PMC2833161 DOI: 10.1186/1472-6807-10-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 07/31/2009] [Accepted: 02/02/2010] [Indexed: 11/29/2022]

Vazquez M, Carmona-Saez P, Nogales-Cadenas R, Chagoyen M, Tirado F, Carazo JM, Pascual-Montano A. SENT: semantic features in text. Nucleic Acids Res 2009;37:W153-9. [PMID: 19458159 PMCID: PMC2703940 DOI: 10.1093/nar/gkp392] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022]

Comparing algorithms for clustering of expression data: how to assess gene clusters. Methods Mol Biol 2009;541:479-509. [PMID: 19381534 DOI: 10.1007/978-1-59745-243-4_21] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 01/06/2023]

Chagoyen M, Carazo JM, Pascual-Montano A. Assessment of protein set coherence using functional annotations. BMC Bioinformatics 2008;9:444. [PMID: 18937846 PMCID: PMC2588600 DOI: 10.1186/1471-2105-9-444] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 07/08/2008] [Accepted: 10/20/2008] [Indexed: 11/23/2022]

Aubry M, Monnier A, Chicault C, Galibert MD, Burgun A, Mosser J. Iron-related transcriptomic variations in Caco-2 cells: In silico perspectives. Biochimie 2008;90:669-78. [DOI: 10.1016/j.biochi.2008.01.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/15/2007] [Accepted: 01/04/2008] [Indexed: 10/22/2022]

Nair R, Rost B. Protein subcellular localization prediction using artificial intelligence technology. Methods Mol Biol 2008;484:435-63. [PMID: 18592195 DOI: 10.1007/978-1-59745-398-1_27] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Indexed: 05/20/2023]

Abstract

Proteins perform many important tasks in living organisms, such as catalysis of biochemical reactions, transport of nutrients, and recognition and transmission of signals. The plethora of aspects of the role of any particular protein is referred to as its "function." One aspect of protein function that has been the target of intensive research by computational biologists is its subcellular localization. Proteins must be localized in the same subcellular compartment to cooperate toward a common physiological function. Aberrant subcellular localization of proteins can result in several diseases, including kidney stones, cancer, and Alzheimer's disease. To date, sequence homology remains the most widely used method for inferring the function of a protein. However, the application of advanced artificial intelligence (AI)-based techniques in recent years has resulted in significant improvements in our ability to predict the subcellular localization of a protein. The prediction accuracy has risen steadily over the years, in large part due to the application of AI-based methods such as hidden Markov models (HMMs), neural networks (NNs), and support vector machines (SVMs), although the availability of larger experimental datasets has also played a role. Automatic methods that mine textual information from the biological literature and molecular biology databases have considerably sped up the process of annotation for proteins for which some information regarding function is available in the literature. State-of-the-art methods based on NNs and HMMs can predict the presence of N-terminal sorting signals extremely accurately. Ab initio methods that predict subcellular localization for any protein sequence using only the native amino acid sequence and features predicted from the native sequence have shown the most remarkable improvements. The prediction accuracy of these methods has increased by over 30% in the past decade. 
The accuracy of these methods is now on par with high-throughput methods for predicting localization, and they are beginning to play an important role in directing experimental research. In this chapter, we review some of the most important methods for the prediction of subcellular localization.

Erhardt RAA, Schneider R, Blaschke C. Status of text-mining techniques applied to biomedical text. Drug Discov Today 2007;11:315-25. [PMID: 16580973 DOI: 10.1016/j.drudis.2006.02.011] [Citation(s) in RCA: 82] [Impact Index Per Article: 4.8] [Received: 10/10/2005] [Revised: 02/08/2006] [Accepted: 02/27/2006] [Indexed: 11/16/2022]

Kankainen M, Brader G, Törönen P, Palva ET, Holm L. Identifying functional gene sets from hierarchically clustered expression data: map of abiotic stress regulated genes in Arabidopsis thaliana. Nucleic Acids Res 2006;34:e124. [PMID: 17003050 PMCID: PMC1636450 DOI: 10.1093/nar/gkl694] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022]

Liu Y, Ciliax BJ, Borges K, Dasigi V, Ram A, Navathe SB, Dingledine R. Comparison of two schemes for automatic keyword extraction from MEDLINE for functional gene clustering. Proc IEEE Comput Syst Bioinform Conf 2006:394-404. [PMID: 16448032 DOI: 10.1109/csb.2004.1332452] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/08/2022]

Jensen LJ, Saric J, Bork P. Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 2006;7:119-29. [PMID: 16418747 DOI: 10.1038/nrg1768] [Citation(s) in RCA: 356] [Impact Index Per Article: 19.8] [Indexed: 11/08/2022]

Aubry M, Monnier A, Chicault C, de Tayrac M, Galibert MD, Burgun A, Mosser J. Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets. BMC Bioinformatics 2006;7:241. [PMID: 16674810 PMCID: PMC1482722 DOI: 10.1186/1471-2105-7-241] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 09/30/2005] [Accepted: 05/04/2006] [Indexed: 12/18/2022]

Abstract

Background

Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from semi-automatic annotations made by trained biologists (annotation based on evidence) or text-mining of the published scientific literature (literature profiling).

Results

We report an original functional annotation method based on a combination of evidence and literature that overcomes the weaknesses and the limitations of each approach. It relies on the Gene Ontology Annotation database (GOA Human) and the PubGene biomedical literature index. We support these annotations with statistically associated GO terms and retrieve associative relations across the three GO hierarchies to emphasise the major pathways involved by a gene cluster. Both annotation methods and associative relations were quantitatively evaluated with a reference set of 7397 genes and a multi-cluster study of 14 clusters. We also validated the biological appropriateness of our hybrid method with the annotation of a single gene (cdc2) and that of a down-regulated cluster of 37 genes identified by a transcriptome study of an in vitro enterocyte differentiation model (CaCo-2 cells).

Conclusion

The combination of both approaches is more informative than either separate approach: literature mining can enrich an annotation based only on evidence. Text-mining of the literature can also find valuable associated MEDLINE references that confirm the relevance of the annotation. Eventually, GO terms networks can be built with associative relations in order to highlight cooperative and competitive pathways and their connected molecular functions.

Semeiks JR, Rizki A, Bissell MJ, Mian IS. Ensemble attribute profile clustering: discovering and characterizing groups of genes with similar patterns of biological features. BMC Bioinformatics 2006;7:147. [PMID: 16542449 PMCID: PMC1435935 DOI: 10.1186/1471-2105-7-147] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 06/28/2005] [Accepted: 03/16/2006] [Indexed: 11/17/2022]

Abstract

Background

Ensemble attribute profile clustering is a novel, text-based strategy for analyzing a user-defined list of genes and/or proteins. The strategy exploits annotation data present in gene-centered corpora and utilizes ideas from statistical information retrieval to discover and characterize properties shared by subsets of the list. The practical utility of this method is demonstrated by employing it in a retrospective study of two non-overlapping sets of genes defined by a published investigation as markers for normal human breast luminal epithelial cells and myoepithelial cells.

Results

Each genetic locus was characterized using a finite set of biological properties and represented as a vector of features indicating attributes associated with the locus (a gene attribute profile). In this study, the vector space models for a pre-defined list of genes were constructed from the Gene Ontology (GO) terms and the Conserved Domain Database (CDD) protein domain terms assigned to the loci by the gene-centered corpus LocusLink. This data set of GO- and CDD-based gene attribute profiles, vectors of binary random variables, was used to estimate multiple finite mixture models and each ensuing model utilized to partition the profiles into clusters. The resultant partitionings were combined using a unanimous voting scheme to produce consensus clusters, sets of profiles that co-occurred consistently in the same cluster. Attributes that were important in defining the genes assigned to a consensus cluster were identified. The clusters and their attributes were inspected to ascertain the GO and CDD terms most associated with subsets of genes and in conjunction with external knowledge such as chromosomal location, used to gain functional insights into human breast biology. The 52 luminal epithelial cell markers and 89 myoepithelial cell markers are disjoint sets of genes. Ensemble attribute profile clustering-based analysis indicated that both lists contained groups of genes with the functional properties of membrane receptor biology/signal transduction and nucleic acid binding/transcription. A subset of the luminal markers was associated with metabolic and oxidoreductase activities, whereas a subset of myoepithelial markers was associated with protein hydrolase activity.

Conclusion

Given a set of genes and/or proteins associated with a phenomenon, process or system of interest, ensemble attribute profile clustering provides a simple method for collating and synthesizing the annotation data pertaining to them that are present in text-based, gene-centered corpora. The results provide information about properties common and unique to subsets of the list and hence insights into the biology of the problem under investigation.
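
The unanimous voting step mentioned in the Results of this abstract can be sketched as follows. This is only an illustration of the voting idea, assuming each partitioning is given as a list of cluster labels indexed by profile. The finite mixture models that produce the partitionings are not implemented, and the function name and toy labels are hypothetical.

```python
from collections import defaultdict


def consensus_clusters(partitionings):
    """Group items that share a cluster in every partitioning (unanimous vote).

    Each partitioning is a list of cluster labels, one label per item.
    Two items land in the same consensus cluster only if they carry the
    same label in all partitionings.
    """
    n_items = len(partitionings[0])
    groups = defaultdict(list)
    for item in range(n_items):
        # The tuple of labels across all partitionings identifies the group.
        signature = tuple(labels[item] for labels in partitionings)
        groups[signature].append(item)
    return sorted(groups.values())


# Toy example (hypothetical): three partitionings of five profiles.
partitionings = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
print(consensus_clusters(partitionings))
```

Here profiles 0 and 1 are grouped together in all three partitionings and therefore form a consensus cluster, while profiles 2, 3 and 4 are separated by at least one partitioning and remain singletons.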

Chagoyen M, Carmona-Saez P, Shatkay H, Carazo JM, Pascual-Montano A. Discovering semantic features in the literature: a foundation for building functional associations. BMC Bioinformatics 2006;7:41. [PMID: 16438716 PMCID: PMC1386711 DOI: 10.1186/1471-2105-7-41] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.3] [Received: 09/01/2005] [Accepted: 01/26/2006] [Indexed: 11/10/2022]

Abstract

Background

Experimental techniques such as DNA microarray, serial analysis of gene expression (SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins at different levels. As in any other experimental approach, it is necessary to analyze these data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research.

Results

We present a method to create literature profiles for large sets of genes or proteins based on common semantic features extracted from a corpus of relevant documents. These profiles can be used to establish pair-wise similarities among genes, utilized in gene/protein classification or can be even combined with experimental measurements. Semantic features can be used by researchers to facilitate the understanding of the commonalities indicated by experimental results. Our approach is based on non-negative matrix factorization (NMF), a machine-learning algorithm for data analysis, capable of identifying local patterns that characterize a subset of the data. The literature is thus used to establish putative relationships among subsets of genes or proteins and to provide coherent justification for this clustering into subsets. We demonstrate the utility of the method by applying it to two independent and vastly different sets of genes.

Conclusion

The presented method can create literature profiles from documents relevant to sets of genes. The representation of genes as additive linear combinations of semantic features allows for the exploration of functional associations as well as for clustering, suggesting a valuable methodology for the validation and interpretation of high-throughput experimental data.

Curtis RK, Oresic M, Vidal-Puig A. Pathways to the analysis of microarray data. Trends Biotechnol 2005;23:429-35. [PMID: 15950303 DOI: 10.1016/j.tibtech.2005.05.011] [Citation(s) in RCA: 176] [Impact Index Per Article: 9.3] [Received: 01/10/2005] [Revised: 04/11/2005] [Accepted: 05/24/2005] [Indexed: 10/25/2022]

Natarajan J, Berrar D, Hack CJ, Dubitzky W. Knowledge discovery in biology and biotechnology texts: a review of techniques, evaluation strategies, and applications. Crit Rev Biotechnol 2005;25:31-52. [PMID: 15999851 DOI: 10.1080/07388550590935571] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Indexed: 01/24/2023]

Hakenberg J, Schmeier S, Kowald A, Klipp E, Leser U. Finding kinetic parameters using text mining. OMICS 2005;8:131-52. [PMID: 15268772 DOI: 10.1089/1536231041388366] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.6] [Indexed: 11/13/2022]

Alako BTF, Veldhoven A, van Baal S, Jelier R, Verhoeven S, Rullmann T, Polman J, Jenster G. CoPub Mapper: mining MEDLINE based on search term co-publication. BMC Bioinformatics 2005;6:51. [PMID: 15760478 PMCID: PMC1274248 DOI: 10.1186/1471-2105-6-51] [Citation(s) in RCA: 56] [Impact Index Per Article: 2.9] [Received: 12/21/2004] [Accepted: 03/11/2005] [Indexed: 11/10/2022]

Liu Y, Navathe SB, Civera J, Dasigi V, Ram A, Ciliax BJ, Dingledine R. Text mining biomedical literature for discovering gene-to-gene relationships: a comparative study of algorithms. IEEE/ACM Trans Comput Biol Bioinform 2005;2:62-76. [PMID: 17044165 DOI: 10.1109/tcbb.2005.14] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 05/12/2023]

Abstract

Partitioning closely related genes into clusters has become an important element of practically all statistical analyses of microarray data. A number of computer algorithms have been developed for this task. Although these algorithms have demonstrated their usefulness for gene clustering, some basic problems remain. This paper describes our work on extracting functional keywords from MEDLINE for a set of genes that are isolated for further study from microarray experiments based on their differential expression patterns. The sharing of functional keywords among genes is used as a basis for clustering in a new approach called BEA-PARTITION in this paper. Functional keywords associated with genes were extracted from MEDLINE abstracts. We modified the Bond Energy Algorithm (BEA), which is widely accepted in psychology and database design but is virtually unknown in bioinformatics, to cluster genes by functional keyword associations. The results showed that BEA-PARTITION and hierarchical clustering algorithm outperformed k-means clustering and self-organizing map by correctly assigning 25 of 26 genes in a test set of four known gene groups. To evaluate the effectiveness of BEA-PARTITION for clustering genes identified by microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle and have been widely studied in the literature were used as a second test set. Using established measures of cluster quality, the results produced by BEA-PARTITION had higher purity, lower entropy, and higher mutual information than those produced by k-means and self-organizing map. Whereas BEA-PARTITION and the hierarchical clustering produced similar quality of clusters, BEA-PARTITION provides clear cluster boundaries compared to the hierarchical clustering. BEA-PARTITION is simple to implement and provides a powerful approach to clustering genes or to any clustering problem where starting matrices are available from experimental observations.
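
The cluster-quality measures named in this abstract (purity and entropy) can be illustrated with a short sketch. The formulas below are the commonly used textbook definitions and are an assumption here, since the abstract does not state the exact variants the authors applied. The function name and toy labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(cluster_labels, true_labels):
    """Compute purity and size-weighted entropy of a clustering.

    purity  = (1/N) * sum over clusters of the size of the majority class
    entropy = sum over clusters of (n_k/N) * H(class distribution in cluster k)
    Higher purity and lower entropy indicate better agreement with the
    reference grouping.
    """
    members = defaultdict(list)
    for cluster, truth in zip(cluster_labels, true_labels):
        members[cluster].append(truth)

    n = len(true_labels)
    purity = 0.0
    entropy = 0.0
    for classes in members.values():
        counts = Counter(classes)
        size = len(classes)
        purity += max(counts.values()) / n
        # Shannon entropy (base 2) of the class mix inside this cluster.
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        entropy += (size / n) * h
    return purity, entropy


# Toy example (hypothetical): six genes, two clusters, two known groups.
clusters = [0, 0, 0, 1, 1, 1]
truth = ["a", "a", "b", "b", "b", "b"]
purity, entropy = purity_and_entropy(clusters, truth)
print(f"purity={purity:.3f}, entropy={entropy:.3f}")
```

In the toy example, one cluster mixes two known groups and the other is homogeneous, so purity falls below 1 and entropy rises above 0.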

Minagar A, Shapshak P, Duran EM, Kablinger AS, Alexander JS, Kelley RE, Seth R, Kazic T. HIV-associated dementia, Alzheimer's disease, multiple sclerosis, and schizophrenia: gene expression review. J Neurol Sci 2004;224:3-17. [PMID: 15450765 DOI: 10.1016/j.jns.2004.06.007] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Received: 02/13/2004] [Revised: 06/15/2004] [Accepted: 06/16/2004] [Indexed: 12/18/2022]

Santos C, Eggle D, States DJ. Wnt pathway curation using automated natural language processing: combining statistical methods with partial and full parse for knowledge extraction. Bioinformatics 2004;21:1653-8. [PMID: 15564295 DOI: 10.1093/bioinformatics/bti165] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.0] [Indexed: 11/14/2022]

Abstract

MOTIVATION

Wnt signaling is a very active area of research with highly relevant publications appearing at a rate of more than one per day. Building and maintaining databases describing signal transduction networks is a time-consuming and demanding task that requires careful literature analysis and extensive domain-specific knowledge. For instance, more than 50 factors involved in Wnt signal transduction have been identified as of late 2003. In this work we describe a natural language processing (NLP) system that is able to identify references to biological interaction networks in free text and automatically assembles a protein association and interaction map.

RESULTS

A 'gold standard' set of names and assertions was derived by manual scanning of the Wnt genes website (http://www.stanford.edu/~rnusse/wntwindow.html) including 53 interactions involved in Wnt signaling. This system was used to analyze a corpus of peer-reviewed articles related to Wnt signaling including 3369 Pubmed and 1230 full text papers. Names for key Wnt-pathway associated proteins and biological entities are identified using a chi-squared analysis of noun phrases over-represented in the Wnt literature as compared to the general signal transduction literature. Interestingly, we identified several instances where generic terms were used on the website when more specific terms occur in the literature, and one typographic error on the Wnt canonical pathway. Using the named entity list and performing an exhaustive assertion extraction of the corpus, 34 of the 53 interactions in the 'gold standard' Wnt signaling set were successfully identified (64% recall). In addition, the automated extraction found several interactions involving key Wnt-related molecules which were missing or different from those in the canonical diagram, and these were confirmed by manual review of the text. These results suggest that a combination of NLP techniques for information extraction can form a useful first-pass tool for assisting human annotation and maintenance of signal pathway databases.

AVAILABILITY

The pipeline software components are freely available on request to the authors.

CONTACT

dstates@umich.edu

SUPPLEMENTARY INFORMATION

http://stateslab.bioinformatics.med.umich.edu/software.html.

Herrero J, Vaquerizas JM, Al-Shahrour F, Conde L, Mateos A, Díaz-Uriarte JSR, Dopazo J. New challenges in gene expression data analysis and the extended GEPAS. Nucleic Acids Res 2004;32:W485-91. [PMID: 15215434 PMCID: PMC441559 DOI: 10.1093/nar/gkh421] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2004] [Revised: 04/07/2004] [Accepted: 04/07/2004] [Indexed: 01/30/2023] Open

Glenisson P, Coessens B, Van Vooren S, Mathys J, Moreau Y, De Moor B. TXTGate: profiling gene groups with text-based information. Genome Biol 2004;5:R43. [PMID: 15186494 PMCID: PMC463076 DOI: 10.1186/gb-2004-5-6-r43] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2003] [Revised: 02/03/2004] [Accepted: 04/27/2004] [Indexed: 11/23/2022] Open

Wilkinson DM, Huberman BA. A method for finding communities of related genes. Proc Natl Acad Sci U S A 2004;101 Suppl 1:5241-8. [PMID: 14757821 PMCID: PMC387302 DOI: 10.1073/pnas.0307740100] [Citation(s) in RCA: 164] [Impact Index Per Article: 8.2] [Indexed: 11/18/2022]

Karaoz U, Murali TM, Letovsky S, Zheng Y, Ding C, Cantor CR, Kasif S. Whole-genome annotation by using evidence integration in functional-linkage networks. Proc Natl Acad Sci U S A 2004;101:2888-93. [PMID: 14981259 PMCID: PMC365715 DOI: 10.1073/pnas.0307326101] [Citation(s) in RCA: 240] [Impact Index Per Article: 12.0] [Indexed: 11/18/2022]

Chaussabel D. Biomedical Literature Mining. Am J Pharmacogenomics 2004;4:383-93. [PMID: 15651899 DOI: 10.2165/00129785-200404060-00005] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Indexed: 11/02/2022]

Kazic T, Coe E, Polacco M, Shyu CR. Whither biological database research? OMICS 2003;7:61-5. [PMID: 12831558 DOI: 10.1089/153623103322006625] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]

Raychaudhuri S, Chang JT, Imam F, Altman RB. The computational analysis of scientific literature to define and recognize gene expression clusters. Nucleic Acids Res 2003;31:4553-60. [PMID: 12888516 PMCID: PMC169898 DOI: 10.1093/nar/gkg636] [Citation(s) in RCA: 40] [Impact Index Per Article: 1.9] [Indexed: 11/12/2022]

Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci U S A 2003;100:8348-53. [PMID: 12826619 PMCID: PMC166232 DOI: 10.1073/pnas.0832373100] [Citation(s) in RCA: 410] [Impact Index Per Article: 19.5] [Indexed: 11/18/2022]

Kim CC, Falkow S. Significance analysis of lexical bias in microarray data. BMC Bioinformatics 2003;4:12. [PMID: 12697067 PMCID: PMC153504 DOI: 10.1186/1471-2105-4-12] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.2] [Received: 01/08/2003] [Accepted: 04/03/2003] [Indexed: 11/17/2022]

Raychaudhuri S, Altman RB. A literature-based method for assessing the functional coherence of a gene group. Bioinformatics 2003;19:396-401. [PMID: 12584126 PMCID: PMC2669934 DOI: 10.1093/bioinformatics/btg002] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.6] [Indexed: 11/12/2022]

Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2003;4:277-84. [PMID: 18629117 PMCID: PMC2447404 DOI: 10.1002/cfg.227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]