1
Westergaard D, Stærfeldt HH, Tønsberg C, Jensen LJ, Brunak S. A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. PLoS Comput Biol 2018; 14:e1005962. [PMID: 29447159 PMCID: PMC5831415 DOI: 10.1371/journal.pcbi.1005962] [Citation(s) in RCA: 67] [Impact Index Per Article: 9.6] [Received: 07/05/2017] [Revised: 02/28/2018] [Accepted: 01/05/2018] [Indexed: 12/21/2022] Open
Abstract
Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.
Affiliation(s)
- David Westergaard
  - Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Lyngby, Denmark
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Hans-Henrik Stærfeldt
  - Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Lyngby, Denmark
- Christian Tønsberg
  - Office for Innovation and Sector Services, Technical Information Center of Denmark, Technical University of Denmark, Lyngby, Denmark
- Lars Juhl Jensen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
  - * E-mail: (LJJ); (SB)
- Søren Brunak
  - Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Lyngby, Denmark
  - * E-mail: (LJJ); (SB)
2
Zerva C, Batista-Navarro R, Day P, Ananiadou S. Using uncertainty to link and rank evidence from biomedical literature for model curation. Bioinformatics 2017; 33:3784-3792. [PMID: 29036627 PMCID: PMC5860317 DOI: 10.1093/bioinformatics/btx466] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 01/24/2017] [Revised: 06/27/2017] [Accepted: 07/21/2017] [Indexed: 11/20/2022] Open
Abstract
MOTIVATION In recent years, there has been great progress in the field of automated curation of biomedical networks and models, aided by text mining methods that provide evidence from literature. Such methods must not only extract snippets of text that relate to model interactions, but also be able to contextualize the evidence and provide additional confidence scores for the interaction in question. Although various approaches calculating confidence scores have focused primarily on the quality of the extracted information, there has been little work on exploring the textual uncertainty conveyed by the author. Despite textual uncertainty being acknowledged in biomedical text mining as an attribute of text mined interactions (events), it is significantly understudied as a means of providing a confidence measure for interactions in pathways or other biomedical models. In this work, we focus on improving identification of textual uncertainty for events and explore how it can be used as an additional measure of confidence for biomedical models. RESULTS We present a novel method for extracting uncertainty from the literature using a hybrid approach that combines rule induction and machine learning. Variations of this hybrid approach are then discussed, alongside their advantages and disadvantages. We use subjective logic theory to combine multiple uncertainty values extracted from different sources for the same interaction. Our approach achieves F-scores of 0.76 and 0.88 based on the BioNLP-ST and Genia-MK corpora, respectively, making considerable improvements over previously published work. Moreover, we evaluate our proposed system on pathways related to two different areas, namely leukemia and melanoma cancer research. AVAILABILITY AND IMPLEMENTATION The leukemia pathway model used is available in Pathway Studio while the Ras model is available via PathwayCommons. 
Online demonstration of the uncertainty extraction system is available for research purposes at http://argo.nactem.ac.uk/test. The related code is available on https://github.com/c-zrv/uncertainty_components.git. Details on the above are available in the Supplementary Material. CONTACT sophia.ananiadou@manchester.ac.uk. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chrysoula Zerva
  - National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
- Riza Batista-Navarro
  - National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
- Philip Day
  - Manchester Institute of Biotechnology, The University of Manchester, Manchester, UK
- Sophia Ananiadou
  - National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
3
Frenkel-Morgenstern M, Gorohovski A, Tagore S, Sekar V, Vazquez M, Valencia A. ChiPPI: a novel method for mapping chimeric protein-protein interactions uncovers selection principles of protein fusion events in cancer. Nucleic Acids Res 2017; 45:7094-7105. [PMID: 28549153 PMCID: PMC5499553 DOI: 10.1093/nar/gkx423] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 10/19/2016] [Accepted: 05/07/2017] [Indexed: 12/20/2022] Open
Abstract
Fusion proteins, comprising peptides deriving from the translation of two parental genes, are produced in cancer by chromosomal aberrations. The expressed fusion protein incorporates domains of both parental proteins. Using a methodology that treats discrete protein domains as binding sites for specific domains of interacting proteins, we have cataloged the protein interaction networks for 11 528 cancer fusions (ChiTaRS-3.1). Here, we present our novel method, chimeric protein–protein interactions (ChiPPI) that uses the domain–domain co-occurrence scores in order to identify preserved interactors of chimeric proteins. Mapping the influence of fusion proteins on cell metabolism and pathways reveals that ChiPPI networks often lose tumor suppressor proteins and gain oncoproteins. Furthermore, fusions often induce novel connections between non-interactors skewing interaction networks and signaling pathways. We compared fusion protein PPI networks in leukemia/lymphoma, sarcoma and solid tumors finding distinct enrichment patterns for each disease type. While certain pathways are enriched in all three diseases (Wnt, Notch and TGF β), there are distinct patterns for leukemia (EGFR signaling, DNA replication and CCKR signaling), for sarcoma (p53 pathway and CCKR signaling) and solid tumors (FGFR and EGFR signaling). Thus, the ChiPPI method represents a comprehensive tool for studying the anomaly of skewed cellular networks produced by fusion proteins in cancer.
Affiliation(s)
- Somnath Tagore
  - Faculty of Medicine, Bar-Ilan-University, Henrietta Szold 8, Safed 1311502, Israel
- Vaishnovi Sekar
  - Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), M.F.Almagro 3, 28029 Madrid, Spain
- Miguel Vazquez
  - Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), M.F.Almagro 3, 28029 Madrid, Spain
- Alfonso Valencia
  - Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), M.F.Almagro 3, 28029 Madrid, Spain
4
Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 129] [Impact Index Per Article: 16.1] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Affiliation(s)
- Martin Krallinger
  - Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
- Obdulia Rabal
  - Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Anália Lourenço
  - ESEI - Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain
  - Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia), Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain
  - CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, Braga 4710-057, Portugal
- Julen Oyarzabal
  - Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Alfonso Valencia
  - Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona, 29-31, Barcelona E-08034, Spain
  - Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/ Baldiri Reixac 10, Barcelona E-08028, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, Barcelona E-08010, Spain
5
Zhao M, Chen Y, Qu D, Qu H. METSP: a maximum-entropy classifier based text mining tool for transporter-substrate identification with semistructured text. Biomed Res Int 2015; 2015:254838. [PMID: 26495291 PMCID: PMC4606149 DOI: 10.1155/2015/254838] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 03/13/2015] [Accepted: 06/21/2015] [Indexed: 01/16/2023]
Abstract
The substrates of a transporter are not only useful for inferring function of the transporter, but also important to discover compound-compound interaction and to reconstruct metabolic pathway. Though plenty of data has been accumulated with the developing of new technologies such as in vitro transporter assays, the search for substrates of transporters is far from complete. In this article, we introduce METSP, a maximum-entropy classifier devoted to retrieve transporter-substrate pairs (TSPs) from semistructured text. Based on the high quality annotation from UniProt, METSP achieves high precision and recall in cross-validation experiments. When METSP is applied to 182,829 human transporter annotation sentences in UniProt, it identifies 3942 sentences with transporter and compound information. Finally, 1547 confidential human TSPs are identified for further manual curation, among which 58.37% pairs with novel substrates not annotated in public transporter databases. METSP is the first efficient tool to extract TSPs from semistructured annotation text in UniProt. This tool can help to determine the precise substrates and drugs of transporters, thus facilitating drug-target prediction, metabolic network reconstruction, and literature classification.
Affiliation(s)
- Min Zhao
  - School of Engineering, Faculty of Science, Health, Education and Engineering, University of the Sunshine Coast, Maroochydore DC, QLD 4558, Australia
- Yanming Chen
  - School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China
- Dacheng Qu
  - School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China
- Hong Qu
  - Center for Bioinformatics, State Key Laboratory of Protein and Plant Gene Research, College of Life Sciences, Peking University, Beijing 100871, China
6
Frenkel-Morgenstern M, Gorohovski A, Vucenovic D, Maestre L, Valencia A. ChiTaRS 2.1--an improved database of the chimeric transcripts and RNA-seq data with novel sense-antisense chimeric RNA transcripts. Nucleic Acids Res 2014; 43:D68-75. [PMID: 25414346 PMCID: PMC4383979 DOI: 10.1093/nar/gku1199] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Indexed: 12/20/2022] Open
Abstract
Chimeric RNAs that comprise two or more different transcripts have been identified in many cancers and among the Expressed Sequence Tags (ESTs) isolated from different organisms; they might represent functional proteins and produce different disease phenotypes. The ChiTaRS 2.1 database of chimeric transcripts and RNA-Seq data (http://chitars.bioinfo.cnio.es/) is the second version of the ChiTaRS database and includes improvements in content and functionality. Chimeras from eight organisms have been collated including novel sense–antisense (SAS) chimeras resulting from the slippage of the sense and anti-sense intragenic regions. The new database version collects more than 29 000 chimeric transcripts and indicates the expression and tissue specificity for 333 entries confirmed by RNA-seq reads mapping the chimeric junction sites. User interface allows for rapid and easy analysis of evolutionary conservation of fusions, literature references and experimental data supporting fusions in different organisms. More than 1428 cancer breakpoints have been automatically collected from public databases and manually verified to identify their correct cross-references, genomic sequences and junction sites. As a result, the ChiTaRS 2.1 collection of chimeras from eight organisms and human cancer breakpoints extends our understanding of the evolution of chimeric transcripts in eukaryotes as well as their functional role in carcinogenic processes.
Affiliation(s)
- Milana Frenkel-Morgenstern
  - Structural Biology and BioComputing Program, Spanish National Cancer Research Centre (CNIO), Madrid 28029, Spain
- Alessandro Gorohovski
  - Structural Biology and BioComputing Program, Spanish National Cancer Research Centre (CNIO), Madrid 28029, Spain
- Dunja Vucenovic
  - Structural Biology and BioComputing Program, Spanish National Cancer Research Centre (CNIO), Madrid 28029, Spain
- Lorena Maestre
  - Monoclonal Antibodies Unit, Spanish National Cancer Research Centre (CNIO), Madrid 28029, Spain
- Alfonso Valencia
  - Structural Biology and BioComputing Program, Spanish National Cancer Research Centre (CNIO), Madrid 28029, Spain
7
A guide for building biological pathways along with two case studies: hair and breast development. Methods 2014; 74:16-35. [PMID: 25449898 DOI: 10.1016/j.ymeth.2014.10.006] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 05/05/2014] [Revised: 08/26/2014] [Accepted: 10/03/2014] [Indexed: 11/23/2022] Open
Abstract
Genomic information is being underlined in the format of biological pathways. Building these biological pathways is an ongoing demand and benefits from methods for extracting information from biomedical literature with the aid of text-mining tools. Here we hopefully guide you in the attempt of building a customized pathway or chart representation of a system. Our manual is based on a group of software designed to look at biointeractions in a set of abstracts retrieved from PubMed. However, they aim to support the work of someone with biological background, who does not need to be an expert on the subject and will play the role of manual curator while designing the representation of the system, the pathway. We therefore illustrate with two challenging case studies: hair and breast development. They were chosen for focusing on recent acquisitions of human evolution. We produced sub-pathways for each study, representing different phases of development. Differently from most charts present in current databases, we present detailed descriptions, which will additionally guide PESCADOR users along the process. The implementation as a web interface makes PESCADOR a unique tool for guiding the user along the biointeractions, which will constitute a novel pathway.
8
Wu C, Schwartz JM, Nenadic G. PathNER: a tool for systematic identification of biological pathway mentions in the literature. BMC Syst Biol 2013; 7 Suppl 3:S2. [PMID: 24555844 PMCID: PMC3852116 DOI: 10.1186/1752-0509-7-s3-s2] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Indexed: 11/10/2022]
Abstract
Background Biological pathways are central to many biomedical studies and are frequently discussed in the literature. Several curated databases have been established to collate the knowledge of molecular processes constituting pathways. Yet, there has been little focus on enabling systematic detection of pathway mentions in the literature. Results We developed a tool, named PathNER (Pathway Named Entity Recognition), for the systematic identification of pathway mentions in the literature. PathNER is based on soft dictionary matching and rules, with the dictionary generated from public pathway databases. The rules utilise general pathway-specific keywords, syntactic information and gene/protein mentions. Detection results from both components are merged. On a gold-standard corpus, PathNER achieved an F1-score of 84%. To illustrate its potential, we applied PathNER on a collection of articles related to Alzheimer's disease to identify associated pathways, highlighting cases that can complement an existing manually curated knowledgebase. Conclusions In contrast to existing text-mining efforts that target the automatic reconstruction of pathway details from molecular interactions mentioned in the literature, PathNER focuses on identifying specific named pathway mentions. These mentions can be used to support large-scale curation and pathway-related systems biology applications, as demonstrated in the example of Alzheimer's disease. PathNER is implemented in Java and made freely available online at http://sourceforge.net/projects/pathner/.
9
Ullah M, Stich S, Häupl T, Eucker J, Sittinger M, Ringe J. Reverse differentiation as a gene filtering tool in genome expression profiling of adipogenesis for fat marker gene selection and their analysis. PLoS One 2013; 8:e69754. [PMID: 23922792 PMCID: PMC3724870 DOI: 10.1371/journal.pone.0069754] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 03/23/2013] [Accepted: 06/11/2013] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND During mesenchymal stem cell (MSC) conversion into adipocytes, the adipogenic cocktail consisting of insulin, dexamethasone, indomethacin and 3-isobutyl-1-methylxanthine not only induces adipogenic-specific but also genes for non-adipogenic processes. Therefore, not all significantly expressed genes represent adipogenic-specific marker genes. So, our aim was to filter only adipogenic-specific out of all expressed genes. We hypothesize that exclusively adipogenic-specific genes change their expression during adipogenesis, and reverse during dedifferentiation. Thus, MSC were adipogenic differentiated and dedifferentiated. RESULTS Adipogenesis and reverse adipogenesis was verified by Oil Red O staining and expression of PPARG and FABP4. Based on GeneChips, 991 genes were differentially expressed during adipogenesis and grouped in 4 clusters. According to bioinformatic analysis the relevance of genes with adipogenic-linked biological annotations, expression sites, molecular functions, signaling pathways and transcription factor binding sites was high in cluster 1, including all prominent adipogenic genes like ADIPOQ, C/EBPA, LPL, PPARG and FABP4, moderate in clusters 2-3, and negligible in cluster 4. During reversed adipogenesis, only 782 expressed genes (clusters 1-3) were reverted, including 597 genes not reported for adipogenesis before. We identified APCDD1, CHI3L1, RARRES1 and SEMA3G as potential adipogenic-specific genes. CONCLUSION The model system of adipogenesis linked to reverse adipogenesis allowed the filtration of 782 adipogenic-specific genes out of total 991 significantly expressed genes. Database analysis of adipogenic-specific biological annotations, transcription factors and signaling pathways further validated and valued our concept, because most of the filtered 782 genes showed affiliation to adipogenesis. 
Based on this approach, the selected and filtered genes would be potentially important for characterization of adipogenesis and monitoring of clinical translation for soft-tissue regeneration. Moreover, we report 4 new marker genes.
Affiliation(s)
- Mujib Ullah
  - Tissue Engineering Laboratory & Berlin-Brandenburg Center for Regenerative Therapies, Department of Rheumatology and Clinical Immunology, Charité-University Medicine Berlin, Berlin, Germany
- Stefan Stich
  - Tissue Engineering Laboratory & Berlin-Brandenburg Center for Regenerative Therapies, Department of Rheumatology and Clinical Immunology, Charité-University Medicine Berlin, Berlin, Germany
- Thomas Häupl
  - Tissue Engineering Laboratory & Berlin-Brandenburg Center for Regenerative Therapies, Department of Rheumatology and Clinical Immunology, Charité-University Medicine Berlin, Berlin, Germany
- Jan Eucker
  - Department of Hematology and Oncology, Charité-University Medicine Berlin, Berlin, Germany
- Michael Sittinger
  - Tissue Engineering Laboratory & Berlin-Brandenburg Center for Regenerative Therapies, Department of Rheumatology and Clinical Immunology, Charité-University Medicine Berlin, Berlin, Germany
- Jochen Ringe
  - Tissue Engineering Laboratory & Berlin-Brandenburg Center for Regenerative Therapies, Department of Rheumatology and Clinical Immunology, Charité-University Medicine Berlin, Berlin, Germany
10
Li C, Liakata M, Rebholz-Schuhmann D. Biological network extraction from scientific literature: state of the art and challenges. Brief Bioinform 2013; 15:856-77. [PMID: 23434632 DOI: 10.1093/bib/bbt006] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Indexed: 11/14/2022] Open
Abstract
Networks of molecular interactions explain complex biological processes, and all known information on molecular events is contained in a number of public repositories including the scientific literature. Metabolic and signalling pathways are often viewed separately, even though both types are composed of interactions involving proteins and other chemical entities. It is necessary to be able to combine data from all available resources to judge the functionality, complexity and completeness of any given network overall, but especially the full integration of relevant information from the scientific literature is still an ongoing and complex task. Currently, the text-mining research community is steadily moving towards processing the full body of the scientific literature by making use of rich linguistic features such as full text parsing, to extract biological interactions. The next step will be to combine these with information from scientific databases to support hypothesis generation for the discovery of new knowledge and the extension of biological networks. The generation of comprehensive networks requires technologies such as entity grounding, coordination resolution and co-reference resolution, which are not fully solved and are required to further improve the quality of results. Here, we analyse the state of the art for the extraction of network information from the scientific literature and the evaluation of extraction methods against reference corpora, discuss challenges involved and identify directions for future research.
11
Czarnecki J, Nobeli I, Smith AM, Shepherd AJ. A text-mining system for extracting metabolic reactions from full-text articles. BMC Bioinformatics 2012; 13:172. [PMID: 22823282 PMCID: PMC3475109 DOI: 10.1186/1471-2105-13-172] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Received: 09/27/2011] [Accepted: 06/30/2012] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND Increasingly biological text mining research is focusing on the extraction of complex relationships relevant to the construction and curation of biological networks and pathways. However, one important category of pathway - metabolic pathways - has been largely neglected. Here we present a relatively simple method for extracting metabolic reaction information from free text that scores different permutations of assigned entities (enzymes and metabolites) within a given sentence based on the presence and location of stemmed keywords. This method extends an approach that has proved effective in the context of the extraction of protein-protein interactions. RESULTS When evaluated on a set of manually-curated metabolic pathways using standard performance criteria, our method performs surprisingly well. Precision and recall rates are comparable to those previously achieved for the well-known protein-protein interaction extraction task. CONCLUSIONS We conclude that automated metabolic pathway construction is more tractable than has often been assumed, and that (as in the case of protein-protein interaction extraction) relatively simple text-mining approaches can prove surprisingly effective. It is hoped that these results will provide an impetus to further research and act as a useful benchmark for judging the performance of more sophisticated methods that are yet to be developed.
Affiliation(s)
- Jan Czarnecki
  - Department of Biological Sciences and Institute of Molecular and Structural Biology, Birkbeck, University of London, Malet Street, London, WC1E 7HX, UK
- Irene Nobeli
  - Department of Biological Sciences and Institute of Molecular and Structural Biology, Birkbeck, University of London, Malet Street, London, WC1E 7HX, UK
- Adrian M Smith
  - Unilever R&D, Colworth Science Park, Sharnbrook, Bedfordshire, MK44 1LG, UK
- Adrian J Shepherd
  - Department of Biological Sciences and Institute of Molecular and Structural Biology, Birkbeck, University of London, Malet Street, London, WC1E 7HX, UK
12
Lechner M, Höhn V, Brauner B, Dunger I, Fobo G, Frishman G, Montrone C, Kastenmüller G, Waegele B, Ruepp A. CIDeR: multifactorial interaction networks in human diseases. Genome Biol 2012; 13:R62. [PMID: 22809392 PMCID: PMC3491383 DOI: 10.1186/gb-2012-13-7-r62] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Received: 02/22/2012] [Accepted: 07/18/2012] [Indexed: 12/12/2022] Open
Abstract
The pathobiology of common diseases is influenced by heterogeneous factors interacting in complex networks. CIDeR (http://mips.helmholtz-muenchen.de/cider/) is a publicly available, manually curated, integrative database of metabolic and neurological disorders. The resource provides structured information on 18,813 experimentally validated interactions between molecules, bioprocesses and environmental factors extracted from the scientific literature. Systematic annotation and interactive graphical representation of disease networks make CIDeR a versatile knowledge base for biologists, analysis of large-scale data and systems biology approaches.
13
Uncovering the molecular machinery of the human spindle--an integration of wet and dry systems biology. PLoS One 2012; 7:e31813. [PMID: 22427808 PMCID: PMC3302876 DOI: 10.1371/journal.pone.0031813] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 07/07/2011] [Accepted: 01/18/2012] [Indexed: 11/19/2022] Open
Abstract
The mitotic spindle is an essential molecular machine involved in cell division, whose composition has been studied extensively by detailed cellular biology, high-throughput proteomics, and RNA interference experiments. However, because of its dynamic organization and complex regulation it is difficult to obtain a complete description of its molecular composition. We have implemented an integrated computational approach to characterize novel human spindle components and have analysed in detail the individual candidates predicted to be spindle proteins, as well as the network of predicted relations connecting known and putative spindle proteins. The subsequent experimental validation of a number of predicted novel proteins confirmed not only their association with the spindle apparatus but also their role in mitosis. We found that 75% of our tested proteins are localizing to the spindle apparatus compared to a success rate of 35% when expert knowledge alone was used. We compare our results to the previously published MitoCheck study and see that our approach does validate some findings by this consortium. Further, we predict so-called "hidden spindle hub", proteins whose network of interactions is still poorly characterised by experimental means and which are thought to influence the functionality of the mitotic spindle on a large scale. Our analyses suggest that we are still far from knowing the complete repertoire of functionally important components of the human spindle network. Combining integrated bio-computational approaches and single gene experimental follow-ups could be key to exploring the still hidden regions of the human spindle system.
|
14
|
Thieu T, Joshi S, Warren S, Korkin D. Literature mining of host–pathogen interactions: comparing feature-based supervised learning and language-based approaches. Bioinformatics 2012; 28:867-75. [DOI: 10.1093/bioinformatics/bts042] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Indexed: 11/15/2022] Open
|
15
|
Hossain MS, Gresock J, Edmonds Y, Helm R, Potts M, Ramakrishnan N. Connecting the dots between PubMed abstracts. PLoS One 2012; 7:e29509. [PMID: 22235301 PMCID: PMC3250456 DOI: 10.1371/journal.pone.0029509] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Received: 07/26/2011] [Accepted: 11/29/2011] [Indexed: 11/23/2022] Open
Abstract
Background There are now a multitude of articles published in a diversity of journals providing information about genes, proteins, pathways, and diseases. Each article investigates subsets of a biological process, but to gain insight into the functioning of a system as a whole, we must integrate information from multiple publications. Particularly, unraveling relationships between extra-cellular inputs and downstream molecular response mechanisms requires integrating conclusions from diverse publications. Methodology We present an automated approach to biological knowledge discovery from PubMed abstracts, suitable for “connecting the dots” across the literature. We describe a storytelling algorithm that, given a start and end publication, typically with little or no overlap in content, identifies a chain of intermediate publications from one to the other, such that neighboring publications have significant content similarity. The quality of discovered stories is measured using local criteria such as the size of supporting neighborhoods for each link and the strength of individual links connecting publications, as well as global metrics of dispersion. To ensure that the story stays coherent as it meanders from one publication to another, we demonstrate the design of novel coherence and overlap filters for use as post-processing steps. Conclusions We demonstrate the application of our storytelling algorithm to three case studies: i) a many-one study exploring relationships between multiple cellular inputs and a molecule responsible for cell-fate decisions, ii) a many-many study exploring the relationships between multiple cytokines and multiple downstream transcription factors, and iii) a one-to-one study to showcase the ability to recover a cancer related association, viz. the Warburg effect, from past literature. The storytelling pipeline helps narrow down a scientist's focus from several hundreds of thousands of relevant documents to only around a hundred stories. 
We argue that our approach can serve as a valuable discovery aid for hypothesis generation and connection exploration in large unstructured biological knowledge bases.
Affiliation(s)
- M Shahriar Hossain
- Department of Computer Science, Virginia Tech, Blacksburg, Virginia, United States of America.
|
16
|
Harmston N, Filsell W, Stumpf MPH. Which species is it? Species-driven gene name disambiguation using random walks over a mixture of adjacency matrices. Bioinformatics 2011; 28:254-60. [PMID: 22135416 DOI: 10.1093/bioinformatics/btr640] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The scientific literature contains a wealth of information about biological systems. Manual curation lacks the scalability to extract this information due to the ever-increasing numbers of papers being published. The development and application of text mining technologies has been proposed as a way of dealing with this problem. However, the inter-species ambiguity of the genomic nomenclature makes mapping of gene mentions identified in text to their corresponding Entrez gene identifiers an extremely difficult task. We propose a novel method, which transforms a MEDLINE record into a mixture of adjacency matrices; by performing a random walk over the resulting graph, we can perform multi-class supervised classification allowing the assignment of taxonomy identifiers to individual gene mentions. The ability to achieve good performance at this task has a direct impact on the performance of normalizing gene mentions to Entrez gene identifiers. Such graph mixtures add flexibility and allow us to generate probabilistic classification schemes that naturally reflect the uncertainties inherent, even in literature-derived data. RESULTS Our method performs well in terms of both micro- and macro-averaged performance, achieving micro-F1 of 0.76 and macro-F1 of 0.36 on the publicly available DECA corpus. Re-curation of the DECA corpus was performed, with our method achieving 0.88 micro-F1 and 0.51 macro-F1. Our method improves over standard classification techniques [such as support vector machines (SVMs)] in a number of ways: flexibility, interpretability and its resistance to the effects of class bias in the training data. Good performance is achieved without the need for computationally expensive parse tree generation or 'bag of words classification'.
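The gap between the micro- and macro-averaged scores reported in this abstract is typical of imbalanced class distributions. The sketch below illustrates the two averages on made-up species labels; it is not the DECA corpus and not the authors' evaluation code, and the function name `f1_scores` is a hypothetical helper introduced for illustration only.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # true class gains a false negative

    def f1(t, false_pos, false_neg):
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical labels: one dominant species and two rare ones,
# with a classifier that always predicts the dominant class.
gold = ["human"] * 8 + ["mouse", "yeast"]
pred = ["human"] * 10
micro, macro = f1_scores(gold, pred)
print(f"micro-F1 = {micro:.2f}, macro-F1 = {macro:.2f}")
```

Because macro-averaging weights every class equally, the two rare classes that are never predicted pull the macro score far below the micro score.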
Affiliation(s)
- Nathan Harmston
- Centre for Bioinformatics, Division of Molecular Biosciences, Imperial College London, London SW7 2AZ, UK
|
17
|
Sandbichler AM, Egg M, Schwerte T, Pelster B. Claudin 28b and F-actin are involved in rainbow trout gill pavement cell tight junction remodeling under osmotic stress. J Exp Biol 2011; 214:1473-87. [PMID: 21490256 DOI: 10.1242/jeb.050062] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Indexed: 02/05/2023]
Abstract
Permeability of rainbow trout gill pavement cells cultured on permeable supports (single seeded inserts) changes upon exposure to freshwater or treatment with cortisol. The molecular components of this change are largely unknown, but tight junctions that regulate the paracellular pathway are prime candidates in this adaptational process. Using differential display polymerase chain reaction we found a set of 17 differentially regulated genes in trout pavement cells that had been exposed to freshwater apically for 24 h. Five genes were related to the cell-cell contact. One of these genes was isolated and identified as encoding claudin 28b, an integral component of the tight junction. Immunohistochemical reactivity to claudin 28b protein was concentrated in a circumferential ring colocalized to the cortical F-actin ring. To study the contribution of this isoform to changes in transepithelial resistance and Phenol Red diffusion under apical hypo- or hyperosmotic exposure we quantified the fluorescence signal of this claudin isoform in immunohistochemical stainings together with the fluorescence of phalloidin-probed F-actin. Upon hypo-osmotic stress claudin 28b fluorescence and epithelial tightness remained stable. Under hyperosmotic stress, the presence of claudin 28b at the junction significantly decreased, and epithelial tightness was severely reduced. Cortical F-actin fluorescence increased upon hypo-osmotic stress, whereas hyperosmotic stress led to a separation of cortical F-actin rings and the number of apical crypt-like pores increased. Addition of cortisol to the basolateral medium attenuated cortical F-actin separation and pore formation during hyperosmotic stress and reduced claudin 28b in junctions except after recovery of cells from exposure to freshwater. Our results showed that short-term salinity stress response in cultured trout gill cells was dependent on a dynamic remodeling of tight junctions, which involves claudin 28b and the supporting F-actin ring.
Affiliation(s)
- Adolf Michael Sandbichler
- Institute of Zoology, and Center for Molecular Biosciences, University of Innsbruck, Technikerstr. 25, 6020 Innsbruck, Austria
|
18
|
Identification, modeling and simulation of key pathways underlying certain cancers. Yi Chuan = Hereditas 2011; 33:809-19. [DOI: 10.3724/sp.j.1005.2011.00809] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
|
19
|
Remmerie N, De Vijlder T, Laukens K, Dang TH, Lemière F, Mertens I, Valkenborg D, Blust R, Witters E. Next generation functional proteomics in non-model plants: A survey on techniques and applications for the analysis of protein complexes and post-translational modifications. Phytochemistry 2011; 72:1192-218. [PMID: 21345472 DOI: 10.1016/j.phytochem.2011.01.003] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Received: 09/06/2010] [Revised: 11/21/2010] [Accepted: 01/03/2011] [Indexed: 05/11/2023]
Abstract
The congruent development of computational technology, bioinformatics and analytical instrumentation makes proteomics ready for the next leap. Present-day state-of-the-art proteomics grew from a descriptive method towards a full stakeholder in systems biology. High-throughput and genome-wide studies are now made at the functional level. These include quantitative aspects, functional aspects with respect to protein interactions as well as post-translational modifications, and advanced computational methods that aid in predicting protein function and mapping these functionalities across the species border. In this review an overview is given of the current status of these aspects in plant studies with special attention to non-genomic model plants.
Affiliation(s)
- Noor Remmerie
- Center for Proteomics, University of Antwerp, Groenenborgerlaan 171, B-2020 Antwerp, Belgium
|
20
|
Bell L, Chowdhary R, Liu JS, Niu X, Zhang J. Integrated bio-entity network: a system for biological knowledge discovery. PLoS One 2011; 6:e21474. [PMID: 21738677 PMCID: PMC3124513 DOI: 10.1371/journal.pone.0021474] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Received: 02/28/2011] [Accepted: 06/01/2011] [Indexed: 01/26/2023] Open
Abstract
A significant part of our biological knowledge is centered on relationships between biological entities (bio-entities) such as proteins, genes, small molecules, pathways, gene ontology (GO) terms and diseases. Accumulated at an increasing speed, the information on bio-entity relationships is archived in different forms at scattered places. Most of such information is buried in scientific literature as unstructured text. Organizing heterogeneous information in a structured form not only facilitates study of biological systems using integrative approaches, but also allows discovery of new knowledge in an automatic and systematic way. In this study, we performed a large scale integration of bio-entity relationship information from both databases containing manually annotated, structured information and automatic information extraction of unstructured text in scientific literature. The relationship information we integrated in this study includes protein–protein interactions, protein/gene regulations, protein–small molecule interactions, protein–GO relationships, protein–pathway relationships, and pathway–disease relationships. The relationship information is organized in a graph data structure, named integrated bio-entity network (IBN), where the vertices are the bio-entities and edges represent their relationships. Under this framework, graph theoretic algorithms can be designed to perform various knowledge discovery tasks. We designed breadth-first search with pruning (BFSP) and most probable path (MPP) algorithms to automatically generate hypotheses—the indirect relationships with high probabilities in the network. We show that IBN can be used to generate plausible hypotheses, which not only help to better understand the complex interactions in biological systems, but also provide guidance for experimental designs.
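One common way to realise a "most probable path" search over such a network is to run Dijkstra's algorithm on negative log-probabilities. The sketch below does this under the assumption that each edge carries an independent probability in (0, 1]. It is an illustration only, not the authors' MPP or BFSP implementation, and the graph, node names, and function name are hypothetical.

```python
import heapq
import math


def most_probable_path(graph, source, target):
    """graph maps node -> {neighbor: probability in (0, 1]}.

    Returns (path_probability, path), or (0.0, []) if target is unreachable.
    """
    best = {source: 0.0}  # lowest accumulated -log(probability) per node
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        cost, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nbr, p in graph.get(node, {}).items():
            new_cost = cost - math.log(p)  # multiplying probabilities = adding -log
            if new_cost < best.get(nbr, math.inf):
                best[nbr] = new_cost
                prev[nbr] = node
                heapq.heappush(heap, (new_cost, nbr))
    if target not in best:
        return 0.0, []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return math.exp(-best[target]), path


# Hypothetical bio-entity network with made-up edge probabilities.
toy_network = {
    "drugX": {"proteinA": 0.9, "proteinB": 0.5},
    "proteinA": {"pathwayP": 0.8},
    "proteinB": {"pathwayP": 0.9},
    "pathwayP": {"diseaseD": 0.7},
}
prob, path = most_probable_path(toy_network, "drugX", "diseaseD")
print(path, round(prob, 3))
```

Taking logarithms turns the product of edge probabilities into a sum of non-negative costs, which is what lets a standard shortest-path search return the highest-probability indirect relationship.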
Affiliation(s)
- Lindsey Bell
- Department of Statistics, Florida State University, Tallahassee, Florida, United States of America
|
21
|
Fortney K, Jurisica I. Integrative computational biology for cancer research. Hum Genet 2011; 130:465-81. [PMID: 21691773 PMCID: PMC3179275 DOI: 10.1007/s00439-011-0983-z] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Received: 02/03/2011] [Accepted: 04/02/2011] [Indexed: 12/21/2022]
Abstract
Over the past two decades, high-throughput (HTP) technologies such as microarrays and mass spectrometry have fundamentally changed clinical cancer research. They have revealed novel molecular markers of cancer subtypes, metastasis, and drug sensitivity and resistance. Some have been translated into the clinic as tools for early disease diagnosis, prognosis, and individualized treatment and response monitoring. Despite these successes, many challenges remain: HTP platforms are often noisy and suffer from false positives and false negatives; optimal analysis and successful validation require complex workflows; and great volumes of data are accumulating at a rapid pace. Here we discuss these challenges, and show how integrative computational biology can help diminish them by creating new software tools, analytical methods, and data standards.
Affiliation(s)
- Kristen Fortney
- Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
|
22
|
|
23
|
Sánchez-Cabo F, Rainer J, Dopazo A, Trajanoski Z, Hackl H. Insights into global mechanisms and disease by gene expression profiling. Methods Mol Biol 2011; 719:269-98. [PMID: 21370089 DOI: 10.1007/978-1-61779-027-0_13] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 01/28/2023]
Abstract
Transcriptomics has played an essential role as proof of concept in the development of experimental and bioinformatics approaches for the generation and analysis of Omics data. We are giving an introduction on how large-scale technologies for gene expression profiling, especially microarrays, have changed the view from studying single molecular events to a systems level view of global mechanisms in a cell, the biological processes, and their pathological mutations. The main platforms available for gene expression profiling (from microarrays to RNA-seq) are presented and the general concepts that need to be taken into account for proper data analysis in order to extract objective and general conclusions from transcriptomics experiments are introduced. We also describe the available main bioinformatics resources used for this purpose.
Affiliation(s)
- Fátima Sánchez-Cabo
- Genomics Unit, Centro Nacional de Investigaciones Cardiovasculares, Madrid, Spain
|
24
|
Mirzarezaee M, Araabi BN, Sadeghi M. Features analysis for identification of date and party hubs in protein interaction network of Saccharomyces Cerevisiae. BMC Syst Biol 2010; 4:172. [PMID: 21167069 PMCID: PMC3018396 DOI: 10.1186/1752-0509-4-172] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 03/17/2010] [Accepted: 12/19/2010] [Indexed: 12/12/2022]
Abstract
BACKGROUND It has been understood that biological networks have modular organizations which are the sources of their observed complexity. Analysis of networks and motifs has shown that two types of hubs, party hubs and date hubs, are responsible for this complexity. Party hubs are local coordinators because of their high co-expressions with their partners, whereas date hubs display low co-expressions and are assumed as global connectors. However there is no mutual agreement on these concepts in related literature with different studies reporting their results on different data sets. We investigated whether there is a relation between the biological features of Saccharomyces Cerevisiae's proteins and their roles as non-hubs, intermediately connected, party hubs, and date hubs. We propose a classifier that separates these four classes. RESULTS We extracted different biological characteristics including amino acid sequences, domain contents, repeated domains, functional categories, biological processes, cellular compartments, disordered regions, and position specific scoring matrix from various sources. Several classifiers are examined and the best feature-sets based on average correct classification rate and correlation coefficients of the results are selected. We show that fusion of five feature-sets including domains, Position Specific Scoring Matrix-400, cellular compartments level one, and composition pairs with two and one gaps provide the best discrimination with an average correct classification rate of 77%. CONCLUSIONS We study a variety of known biological feature-sets of the proteins and show that there is a relation between domains, Position Specific Scoring Matrix-400, cellular compartments level one, composition pairs with two and one gaps of Saccharomyces Cerevisiae's proteins, and their roles in the protein interaction network as non-hubs, intermediately connected, party hubs and date hubs. 
This study also confirms the possibility of predicting non-hubs, party hubs and date hubs based on their biological features with acceptable accuracy. If such a hypothesis is correct for other species as well, similar methods can be applied to predict the roles of proteins in those species.
Affiliation(s)
- Mitra Mirzarezaee
- Department of Computer Engineering, Islamic Azad University, Science and Research Branch, Tehran, Iran
|
25
|
Demir E, Cary MP, Paley S, Fukuda K, Lemer C, Vastrik I, Wu G, D’Eustachio P, Schaefer C, Luciano J, Schacherer F, Martinez-Flores I, Hu Z, Jimenez-Jacinto V, Joshi-Tope G, Kandasamy K, Lopez-Fuentes AC, Mi H, Pichler E, Rodchenkov I, Splendiani A, Tkachev S, Zucker J, Gopinath G, Rajasimha H, Ramakrishnan R, Shah I, Syed M, Anwar N, Babur O, Blinov M, Brauner E, Corwin D, Donaldson S, Gibbons F, Goldberg R, Hornbeck P, Luna A, Murray-Rust P, Neumann E, Reubenacker O, Samwald M, van Iersel M, Wimalaratne S, Allen K, Braun B, Whirl-Carrillo M, Dahlquist K, Finney A, Gillespie M, Glass E, Gong L, Haw R, Honig M, Hubaut O, Kane D, Krupa S, Kutmon M, Leonard J, Marks D, Merberg D, Petri V, Pico A, Ravenscroft D, Ren L, Shah N, Sunshine M, Tang R, Whaley R, Letovksy S, Buetow KH, Rzhetsky A, Schachter V, Sobral BS, Dogrusoz U, McWeeney S, Aladjem M, Birney E, Collado-Vides J, Goto S, Hucka M, Le Novère N, Maltsev N, Pandey A, Thomas P, Wingender E, Karp PD, Sander C, Bader GD. The BioPAX community standard for pathway data sharing. Nat Biotechnol 2010; 28:935-42. [PMID: 20829833 PMCID: PMC3001121 DOI: 10.1038/nbt.1666] [Citation(s) in RCA: 455] [Impact Index Per Article: 30.3] [Indexed: 12/20/2022]
Abstract
Biological Pathway Exchange (BioPAX) is a standard language to represent biological pathways at the molecular and cellular level and to facilitate the exchange of pathway data. The rapid growth of the volume of pathway data has spurred the development of databases and computational tools to aid interpretation; however, use of these data is hampered by the current fragmentation of pathway information across many databases with incompatible formats. BioPAX, which was created through a community process, solves this problem by making pathway data substantially easier to collect, index, interpret and share. BioPAX can represent metabolic and signaling pathways, molecular and genetic interactions and gene regulation networks. Using BioPAX, millions of interactions, organized into thousands of pathways, from many organisms are available from a growing number of databases. This large amount of pathway data in a computable form will support visualization, analysis and biological discovery.
Affiliation(s)
- Emek Demir
- Computational Biology, Memorial Sloan-Kettering Cancer Center, New York NY, USA
- Center for Bioinformatics and Computer Engineering Department, Bilkent University, Ankara, Turkey
- Michael P. Cary
- Computational Biology, Memorial Sloan-Kettering Cancer Center, New York NY, USA
- Ken Fukuda
- Institute for Bioinformatics Research and Development, Japan Science and Technology Agency, Tokyo, Japan
- Imre Vastrik
- European Bioinformatics Institute, Hinxton, Cambridge, UK
- Guanming Wu
- Ontario Institute for Cancer Research, Toronto ON, Canada
- Carl Schaefer
- National Cancer Institute, Center for Biomedical Informatics and Information Technology, Rockville MD, USA
- Irma Martinez-Flores
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- Zhenjun Hu
- Biomolecular Systems Laboratory, Boston University, Boston MA, USA
- Veronica Jimenez-Jacinto
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- Kumaran Kandasamy
- McKusick-Nathans Institute of Genetic Medicine and the Departments of Biological Chemistry, Pathology and Oncology, Johns Hopkins University, Baltimore MD, USA
- Huaiyu Mi
- Artificial Intelligence Center, SRI International, Menlo Park CA, USA
- Igor Rodchenkov
- Donnelly Center for Cellular and Biomolecular Research, Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada
- Andrea Splendiani
- Faculté de Médecine, Université Rennes 1, Rennes, France
- Rothamsted Research, Harpenden, UK
- Gopal Gopinath
- Center for Food Safety and Applied Nutrition, US Food and Drug Administration, Laurel MD, USA
- Harsha Rajasimha
- Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg VA, USA
- Neurobiology, Neurodegeneration and repair laboratory, National Eye Institute, NIH, Bethesda, MD, USA
- Ranjani Ramakrishnan
- Department of Behavioral Neuroscience, Oregon Health & Science University, Portland OR, USA
- Imran Shah
- U.S. Environmental Protection Agency, Durham, NC, USA
- Mustafa Syed
- Mathematics & Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
- Nadia Anwar
- Computational Biology, Memorial Sloan-Kettering Cancer Center, New York NY, USA
- Ozgun Babur
- Computational Biology, Memorial Sloan-Kettering Cancer Center, New York NY, USA
- Center for Bioinformatics and Computer Engineering Department, Bilkent University, Ankara, Turkey
- Michael Blinov
- University of Connecticut Health Center, Farmington, CT, USA
- Erik Brauner
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston MA, USA
- Sylva Donaldson
- Donnelly Center for Cellular and Biomolecular Research, Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada
- Frank Gibbons
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston MA, USA
- Robert Goldberg
- Biotechnology Division, National Institute of Standards and Technology, Gaithersburg MD, USA
- Augustin Luna
- Center for Cancer Research, NCI, NIH, Bethesda MD, USA
- Peter Murray-Rust
- Unilever Centre for Molecular Sciences Informatics, Department of Chemistry, University of Cambridge, Cambridge UK
- Oliver Reubenacker
- Center for Cell Analysis and Modeling, University of Connecticut Health Center, Storrs CT, USA
- Matthias Samwald
- Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland
- Konrad Lorenz Institute for Evolution and Cognition Research, Altenberg, Austria
- Martijn van Iersel
- Department of Bioinformatics, Maastricht University, Maastricht, Netherlands
- Keith Allen
- Syngenta Biotech Inc., Research Triangle Park, North Carolina, USA
- Andrew Finney
- Physiomics PLC, Magdalen Centre, Oxford Science Park, Oxford, UK
- Elizabeth Glass
- Mathematics & Computer Science Division, Argonne National Laboratory, Argonne IL, USA
- Li Gong
- Department of Genetics, Stanford University, Stanford CA, USA
- Robin Haw
- The Ontario Institute for Cancer Research, Toronto, Ontario, Canada
- Shiva Krupa
- Novartis Knowledge Center, Cambridge MA, USA
- Julie Leonard
- Syngenta Biotech Inc., Research Triangle Park, North Carolina, USA
- Debbie Marks
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Victoria Petri
- Human and Molecular Genetics Center, Medical College of Wisconsin, Milwaukee WI, USA
- Alex Pico
- Gladstone Institute of Cardiovascular Disease, San Francisco CA, USA
- Dean Ravenscroft
- Department of Plant Breeding and Genetics, Cornell University, Ithaca, NY, USA
- Liya Ren
- Cold Spring Harbor Laboratory, Cold Spring Harbor NY, USA
- Nigam Shah
- Centre for Biomedical Informatics, School of Medicine, Stanford University, Stanford CA, USA
- Rebecca Tang
- Department of Genetics, Stanford University, Stanford CA, USA
- Ryan Whaley
- Department of Genetics, Stanford University, Stanford CA, USA
- Stan Letovksy
- Computational Sciences, Informatics, Millennium Pharmaceuticals Inc., Cambridge MA, USA
- Kenneth H. Buetow
- Center for Biomedical Informatics and Information Technology, National Cancer Institute, Bethesda MD, USA
- Andrey Rzhetsky
- Institute for Genomics and Systems Biology, The University of Chicago and Argonne National Laboratory, Chicago IL, USA
- Bruno S. Sobral
- Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg VA, USA
- Ugur Dogrusoz
- Center for Bioinformatics and Computer Engineering Department, Bilkent University, Ankara, Turkey
- Shannon McWeeney
- Department of Behavioral Neuroscience, Oregon Health & Science University, Portland OR, USA
- Mirit Aladjem
- Center for Cancer Research, NCI, NIH, Bethesda MD, USA
- Ewan Birney
- European Bioinformatics Institute, Hinxton, Cambridge, UK
- Julio Collado-Vides
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- Susumu Goto
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan
- Michael Hucka
- Biological Network Modeling Center, California Institute of Technology, Pasadena, CA, USA
- Natalia Maltsev
- Mathematics & Computer Science Division, Argonne National Laboratory, Argonne IL, USA
- Akhilesh Pandey
- McKusick-Nathans Institute of Genetic Medicine and the Departments of Biological Chemistry, Pathology and Oncology, Johns Hopkins University, Baltimore MD, USA
- Paul Thomas
- Artificial Intelligence Center, SRI International, Menlo Park CA, USA
- Chris Sander
- Computational Biology, Memorial Sloan-Kettering Cancer Center, New York NY, USA
- Gary D. Bader
- Donnelly Center for Cellular and Biomolecular Research, Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada
|
26
|
A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature. PLoS Comput Biol 2010; 6:e1000837. [PMID: 20617200 PMCID: PMC2895635 DOI: 10.1371/journal.pcbi.1000837] [Citation(s) in RCA: 80] [Impact Index Per Article: 5.3] [Received: 01/15/2010] [Accepted: 05/27/2010] [Indexed: 02/07/2023] Open
Abstract
The most important way of conveying new findings in biomedical research is scientific publication. Extraction of protein-protein interactions (PPIs) reported in scientific publications is one of the core topics of text mining in the life sciences. Recently, a new class of such methods has been proposed - convolution kernels that identify PPIs using deep parses of sentences. However, comparing published results of different PPI extraction methods is impossible due to the use of different evaluation corpora, different evaluation metrics, different tuning procedures, etc. In this paper, we study whether the reported performance metrics are robust across different corpora and learning settings and whether the use of deep parsing actually leads to an increase in extraction quality. Our ultimate goal is to identify the one method that performs best in real-life scenarios, where information extraction is performed on unseen text and not on specifically prepared evaluation data. We performed a comprehensive benchmarking of nine different methods for PPI extraction that use convolution kernels on rich linguistic information. Methods were evaluated on five different public corpora using cross-validation, cross-learning, and cross-corpus evaluation. Our study confirms that kernels using dependency trees generally outperform kernels based on syntax trees. However, our study also shows that only the best kernel methods can compete with a simple rule-based approach when the evaluation prevents information leakage between training and test corpora. Our results further reveal that the F-score of many approaches drops significantly if no corpus-specific parameter optimization is applied and that methods reaching a good AUC score often perform much worse in terms of F-score. We conclude that for most kernels no sensible estimation of PPI extraction performance on new text is possible, given the current heterogeneity in evaluation data. 
Nevertheless, our study shows that three kernels are clearly superior to the other methods.
|
27
|
Garten Y, Altman RB. Pharmspresso: a text mining tool for extraction of pharmacogenomic concepts and relationships from full text. BMC Bioinformatics 2009; 10 Suppl 2:S6. [PMID: 19208194 PMCID: PMC2646239 DOI: 10.1186/1471-2105-10-s2-s6] [Citation(s) in RCA: 62] [Impact Index Per Article: 3.9] [Indexed: 11/25/2022] Open
Abstract
Background Pharmacogenomics studies the relationship between genetic variation and the variation in drug response phenotypes. The field is rapidly gaining importance: it promises drugs targeted to particular subpopulations based on genetic background. The pharmacogenomics literature has expanded rapidly, but is dispersed in many journals. It is challenging, therefore, to identify important associations between drugs and molecular entities – particularly genes and gene variants, and thus these critical connections are often lost. Text mining techniques can allow us to convert the free-style text to a computable, searchable format in which pharmacogenomic concepts (such as genes, drugs, polymorphisms, and diseases) are identified, and important links between these concepts are recorded. Availability of full text articles as input into text mining engines is key, as literature abstracts often do not contain sufficient information to identify these pharmacogenomic associations. Results Thus, building on a tool called Textpresso, we have created the Pharmspresso tool to assist in identifying important pharmacogenomic facts in full text articles. Pharmspresso parses text to find references to human genes, polymorphisms, drugs and diseases and their relationships. It presents these as a series of marked-up text fragments, in which key concepts are visually highlighted. To evaluate Pharmspresso, we used a gold standard of 45 human-curated articles. Pharmspresso identified 78%, 61%, and 74% of target gene, polymorphism, and drug concepts, respectively. Conclusion Pharmspresso is a text analysis tool that extracts pharmacogenomic concepts from the literature automatically and thus captures our current understanding of gene-drug interactions in a computable form. We have made Pharmspresso available at .
Affiliation(s)
- Yael Garten
- Biomedical Informatics Training Program, Stanford University, Stanford, CA, USA.
|
28
|
Hull D, Pettifer SR, Kell DB. Defrosting the digital library: bibliographic tools for the next generation web. PLoS Comput Biol 2008; 4:e1000204. [PMID: 18974831 PMCID: PMC2568856 DOI: 10.1371/journal.pcbi.1000204] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022] Open
Abstract
Many scientists now manage the bulk of their bibliographic information electronically, thereby organizing their publications and citation material from digital libraries. However, a library has been described as "thought in cold storage," and unfortunately many digital libraries can be cold, impersonal, isolated, and inaccessible places. In this Review, we discuss the current chilly state of digital libraries for the computational biologist, including PubMed, IEEE Xplore, the ACM digital library, ISI Web of Knowledge, Scopus, Citeseer, arXiv, DBLP, and Google Scholar. We illustrate the current process of using these libraries with a typical workflow, and highlight problems with managing data and metadata using URIs. We then examine a range of new applications such as Zotero, Mendeley, Mekentosj Papers, MyNCBI, CiteULike, Connotea, and HubMed that exploit the Web to make these digital libraries more personal, sociable, integrated, and accessible places. We conclude with how these applications may begin to help achieve a digital defrost, and discuss some of the issues that will help or hinder this in terms of making libraries on the Web warmer places in the future, becoming resources that are considerably more useful to both humans and machines.
Affiliation(s)
- Duncan Hull
- School of Chemistry, The University of Manchester, Manchester, UK.
|
29
|
Hsing M, Byler KG, Cherkasov A. The use of Gene Ontology terms for predicting highly-connected 'hub' nodes in protein-protein interaction networks. BMC SYSTEMS BIOLOGY 2008; 2:80. [PMID: 18796161 PMCID: PMC2553323 DOI: 10.1186/1752-0509-2-80] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/01/2008] [Accepted: 09/16/2008] [Indexed: 11/10/2022]
Abstract
BACKGROUND Protein-protein interactions mediate a wide range of cellular functions and responses and have been studied rigorously through recent large-scale proteomics experiments and bioinformatics analyses. One of the most important findings of those endeavours was the observation that 'hub' proteins participate in significant numbers of protein interactions and play critical roles in the organization and function of cellular protein interaction networks (PINs) [1,2]. It has also been demonstrated that such hub proteins may constitute an important pool of attractive drug targets. Thus, it is crucial to be able to identify hub proteins based not only on experimental data but also by means of bioinformatics predictions. RESULTS A hub protein classifier has been developed based on the available interaction data and Gene Ontology (GO) annotations for proteins in the Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster and Homo sapiens genomes. In particular, by utilizing the machine learning method of boosting trees we were able to create a predictive bioinformatics tool for the identification of proteins that are likely to play the role of a hub in protein interaction networks. Testing the developed hub classifier on external sets of experimental protein interaction data in Methicillin-resistant Staphylococcus aureus (MRSA) 252 and Caenorhabditis elegans demonstrated that our approach can predict hub proteins with a high degree of accuracy. A practical application of the developed bioinformatics method has been illustrated by the effective protein bait selection for large-scale pull-down experiments that aim to map complete protein-protein interaction networks for several species. CONCLUSION The successful development of an accurate hub classifier demonstrated that highly-connected proteins tend to share certain relevant functional properties reflected in their Gene Ontology annotations. It is anticipated that the developed bioinformatics hub classifier will represent a useful tool for the theoretical prediction of highly-interacting proteins, the study of cellular network organizations, and the identification of prospective drug targets - even in those organisms that currently lack large-scale protein interaction data.
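As a rough illustration of what "hub" means in the abstract above, the following Python sketch labels proteins as hubs purely by counting their distinct interaction partners. The degree cutoff (`min_degree`), the function names, and the toy edge list are assumptions made for illustration; the abstract does not state the cutoff the authors used, and their actual classifier is a boosted-tree model trained on GO annotations, which this sketch does not attempt to reproduce.

```python
from collections import defaultdict


def node_degrees(edges):
    """Count distinct interaction partners per protein in an undirected edge list."""
    partners = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        partners[a].add(b)
        partners[b].add(a)
    return {protein: len(p) for protein, p in partners.items()}


def label_hubs(edges, min_degree):
    """Label each protein True if it has at least min_degree partners (assumed cutoff)."""
    return {p: d >= min_degree for p, d in node_degrees(edges).items()}


# Hypothetical toy network: P1 interacts with four other proteins.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P5", "P1")]
labels = label_hubs(toy_edges, min_degree=3)
```

Labels produced this way could serve as the training target for a classifier of the kind the abstract describes, with GO annotations as input features.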
Affiliation(s)
- Michael Hsing
- Faculty of Graduate Studies, Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, Canada.
|
30
|
Van Ness B, Ramos C, Haznadar M, Hoering A, Haessler J, Crowley J, Jacobus S, Oken M, Rajkumar V, Greipp P, Barlogie B, Durie B, Katz M, Atluri G, Fang G, Gupta R, Steinbach M, Kumar V, Mushlin R, Johnson D, Morgan G. Genomic variation in myeloma: design, content, and initial application of the Bank On A Cure SNP Panel to detect associations with progression-free survival. BMC Med 2008; 6:26. [PMID: 18778477 PMCID: PMC2553089 DOI: 10.1186/1741-7015-6-26] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/29/2008] [Accepted: 09/08/2008] [Indexed: 01/07/2023] Open
Abstract
BACKGROUND We have engaged in an international program designated the Bank On A Cure, which has established DNA banks from multiple cooperative and institutional clinical trials, and a platform for examining the association of genetic variations with disease risk and outcomes in multiple myeloma. We describe the development and content of a novel custom SNP panel that contains 3404 SNPs in 983 genes, representing cellular functions and pathways that may influence disease severity at diagnosis, toxicity, progression or other treatment outcomes. A systematic search of national databases was used to identify non-synonymous coding SNPs and SNPs within transcriptional regulatory regions. To explore SNP associations with PFS we compared SNP profiles of short term (less than 1 year, n = 70) versus long term progression-free survivors (greater than 3 years, n = 73) in two phase III clinical trials. RESULTS Quality controls were established, demonstrating an accurate and robust screening panel for genetic variations, and some initial racial comparisons of allelic variation were done. A variety of analytical approaches, including machine learning tools for data mining and recursive partitioning analyses, demonstrated predictive value of the SNP panel in survival. While the entire SNP panel showed genotype predictive association with PFS, some SNP subsets were identified within drug response, cellular signaling and cell cycle genes. CONCLUSION A targeted gene approach was undertaken to develop an SNP panel that can test for associations with clinical outcomes in myeloma. The initial analysis provided some predictive power, demonstrating that genetic variations in the myeloma patient population may influence PFS.
Affiliation(s)
- Brian Van Ness
- Cancer Center, University of Minnesota, Minneapolis, MN, USA.
|
31
|
Abstract
Abstraction of intracellular biomolecular interactions into networks is useful for data integration and graph analysis. Network analysis tools facilitate predictions of novel functions for proteins, prediction of functional interactions and identification of intracellular modules. These efforts are linked with drug and phenotype data to accelerate drug-target and biomarker discovery. This review highlights the currently available varieties of mammalian biomolecular networks, and surveys methods and tools to construct, compare, integrate, visualise and analyse such networks.
Affiliation(s)
- A Ma'ayan
- Mount Sinai School of Medicine, Department of Pharmacology and Systems Therapeutics, New York, NY 10029-6574, USA.
|
32
|
Gene expression in women conceiving spontaneously over the age of 45 years. Fertil Steril 2008; 89:1641-50. [DOI: 10.1016/j.fertnstert.2007.06.058] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2007] [Revised: 06/04/2007] [Accepted: 06/04/2007] [Indexed: 01/13/2023]
|
33
|
Zheng B, Lu X. Novel metrics for evaluating the functional coherence of protein groups via protein semantic network. Genome Biol 2007; 8:R153. [PMID: 17672896 PMCID: PMC2323239 DOI: 10.1186/gb-2007-8-7-r153] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2007] [Revised: 04/23/2007] [Accepted: 07/31/2007] [Indexed: 11/17/2022] Open
Abstract
We present metrics for assessing the overall functional coherence of a group of proteins based on the associated biomedical literature. A probabilistic topic model is applied to extract biologic concepts from a corpus of protein-related biomedical literature. Bipartite protein semantic networks are constructed, so that the functional coherence of a protein group can be evaluated with metrics that measure the closeness and strength of connectivity of the proteins in the network.
Affiliation(s)
- Bin Zheng
- Department of Biostatistics, Bioinformatics and Epidemiology, 135 Cannon Street, Charleston, South Carolina 29425, USA
- Laboratory for Functional Neurogenomics, Center for Neurologic Diseases, Harvard Medical School and Brigham and Women's Hospital, Landsdowne Street, Cambridge, Massachusetts 02139, USA
| | - Xinghua Lu
- Department of Biostatistics, Bioinformatics and Epidemiology, 135 Cannon Street, Charleston, South Carolina 29425, USA
|
34
|
van Baarlen P, van Esse HP, Siezen RJ, Thomma BPHJ. Challenges in plant cellular pathway reconstruction based on gene expression profiling. TRENDS IN PLANT SCIENCE 2008; 13:44-50. [PMID: 18155635 DOI: 10.1016/j.tplants.2007.11.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/02/2007] [Revised: 10/22/2007] [Accepted: 11/01/2007] [Indexed: 05/06/2023]
Abstract
Microarrays are used to profile transcriptional activity, providing global cell biology insight. Particularly for plants, interpretation of transcriptional profiles is challenging because many genes have unknown functions. Furthermore, many plant gene sequences do not have clear homologs in other model organisms. Fortunately, over the past five years, various tools that assist plant scientists have been developed. Here, we evaluate the currently available in silico tools for reconstruction of cellular (metabolic, biochemical and signal transduction) pathways based on plant gene expression datasets. Furthermore, we show how expression-profile comparison at the level of these various cellular pathways contributes to the postulation of novel hypotheses which, after experimental verification, can provide further insight into decisive elements that have roles in cellular processes.
Affiliation(s)
- Peter van Baarlen
- Nijmegen Centre for Molecular Life Sciences, UMC Radboud University, Geert Grooteplein 26-28, 6525 GA Nijmegen, the Netherlands
|
35
|
BioPP: a tool for web-publication of biological networks. BMC Bioinformatics 2007; 8:168. [PMID: 17519033 PMCID: PMC1885811 DOI: 10.1186/1471-2105-8-168] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2007] [Accepted: 05/22/2007] [Indexed: 11/25/2022] Open
Abstract
Background Cellular processes depend on the function of intracellular molecular networks. The curation of the literature relevant to specific biological pathways is important for many theoretical and experimental research teams and communities. No current tool supports web publication or hosting of user-developed large scale annotated pathway diagrams. Sharing via web publication is needed to allow real-time access to the current literature pathway knowledgebase, either privately within a research team or publicly among the outside research community. Web publication also facilitates team and/or community input into the curation process while allowing centralized control of the curation and validation process. We have developed a new tool to address these needs. Biological Pathway Publisher (BioPP) is a software suite for converting CellDesigner Systems Biology Markup Language (CD-SBML) formatted pathways into a web viewable format. The BioPP suite is available for private use and for depositing knowledgebases into a newly created public repository. Results The BioPP suite is a web-based application that allows pathway knowledgebases stored in CD-SBML to be web published with an easily navigated user interface. The BioPP suite consists of four interrelated elements: a pathway publisher, an upload web-interface, a pathway repository for user-deposited knowledgebases and a pathway navigator. Users have the option to convert their CD-SBML files to HTML for restricted use or to allow their knowledgebase to be web-accessible to the scientific community. All entities in all knowledgebases in the repository are linked to public database entries as well as to a newly created public wiki which provides a discussion forum. Conclusion BioPP tools and the public repository facilitate sharing of pathway knowledgebases and interactive curation for research teams and scientific communities. BioPP suite is accessible at
|
36
|
Abstract
iHOP provides fast, accurate, comprehensive, and up-to-date summary information on more than 80 000 biological molecules by automatically extracting key sentences from millions of PubMed documents. Its intuitive user interface and navigation scheme have made iHOP extremely successful among biologists, counting more than 500 000 visits per month (iHOP access statistics: http://www.ihop-net.org/UniPub/iHOP/info/logs/). Here we describe a public programmatic API that enables the integration of main iHOP functionalities in bioinformatic programs and workflows.
Affiliation(s)
- José M Fernández
- Structural Biology and Biocomputing Program, CNIO and Computer Science and Artificial Intelligence Laboratory, MIT.
|
37
|
Alves R, Sorribas A. In silico pathway reconstruction: Iron-sulfur cluster biogenesis in Saccharomyces cerevisiae. BMC SYSTEMS BIOLOGY 2007; 1:10. [PMID: 17408500 PMCID: PMC1839888 DOI: 10.1186/1752-0509-1-10] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/25/2006] [Accepted: 01/31/2007] [Indexed: 01/17/2023]
Abstract
Background Current advances in genomics, proteomics and other areas of molecular biology make the identification and reconstruction of novel pathways an emerging area of great interest. One such class of pathways is involved in the biogenesis of Iron-Sulfur Clusters (ISC). Results Our goal is the development of a new approach based on the use and combination of mathematical, theoretical and computational methods to identify the topology of a target network. In this approach, mathematical models play a central role in the evaluation of the alternative network structures that arise from literature data-mining, phylogenetic profiling, structural methods, and human curation. As a test case, we reconstruct the topology of the reaction and regulatory network for the mitochondrial ISC biogenesis pathway in S. cerevisiae. Predictions regarding how proteins act in ISC biogenesis are validated by comparison with published experimental results. For example, the predicted role of Arh1 and Yah1 and some of the interactions we predict for Grx5 both match experimental evidence. A putative role for frataxin in directly regulating mitochondrial iron import is discarded from our analysis, which also agrees with published experimental results. Additionally, we propose a number of experiments for testing other predictions and further improving the identification of the network structure. Conclusion We propose and apply an iterative in silico procedure for predictive reconstruction of the network topology of metabolic pathways. The procedure combines structural bioinformatics tools and mathematical modeling techniques that allow the reconstruction of biochemical networks. Using the Iron Sulfur cluster biogenesis in S. cerevisiae as a test case we indicate how this procedure can be used to analyze and validate the network model against experimental results. Critical evaluation of the obtained results through this procedure allows devising new wet lab experiments to confirm its predictions or provide alternative explanations for further improving the models.
Affiliation(s)
- Rui Alves
- Departament de Ciencies Mediques Basiques, Universidad de Lleida, Montserrat Roig 2, 25008 Lleida, Spain
| | - Albert Sorribas
- Departament de Ciencies Mediques Basiques, Universidad de Lleida, Montserrat Roig 2, 25008 Lleida, Spain
|
38
|
Good BM, Kawas EA, Kuo BYL, Wilkinson MD. iHOPerator: user-scripting a personalized bioinformatics Web, starting with the iHOP website. BMC Bioinformatics 2006; 7:534. [PMID: 17173692 PMCID: PMC1764905 DOI: 10.1186/1471-2105-7-534] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2006] [Accepted: 12/15/2006] [Indexed: 11/10/2022] Open
Abstract
Background User-scripts are programs stored in Web browsers that can manipulate the content of websites prior to display in the browser. They provide a novel mechanism by which users can conveniently gain increased control over the content and the display of the information presented to them on the Web. As the Web is the primary medium by which scientists retrieve biological information, any improvements in the mechanisms that govern the utility or accessibility of this information may have profound effects. GreaseMonkey is a Mozilla Firefox extension that facilitates the development and deployment of user-scripts for the Firefox web-browser. We utilize this to enhance the content and the presentation of the iHOP (information Hyperlinked Over Proteins) website. Results The iHOPerator is a GreaseMonkey user-script that augments the gene-centred pages on iHOP by providing a compact, configurable visualization of the defining information for each gene and by enabling additional data, such as biochemical pathway diagrams, to be collected automatically from third party resources and displayed in the same browsing context. Conclusion This open-source script provides an extension to the iHOP website, demonstrating how user-scripts can personalize and enhance the Web browsing experience in a relevant biological setting. The novel, user-driven controls over the content and the display of Web resources made possible by user-scripts, such as the iHOPerator, herald the beginning of a transition from a resource-centric to a user-centric Web experience. We believe that this transition is a necessary step in the development of Web technology that will eventually result in profound improvements in the way life scientists interact with information.
Affiliation(s)
- Benjamin M Good
- The James Hogg iCAPTURE Centre for Cardiovascular and Pulmonary Research, Providence Health Care/University of British Columbia, St. Paul's Hospital, Rm. 166, 1081 Burrard St. Vancouver, British Columbia, V6Z 1Y6, Canada
| | - Edward A Kawas
- The James Hogg iCAPTURE Centre for Cardiovascular and Pulmonary Research, Providence Health Care/University of British Columbia, St. Paul's Hospital, Rm. 166, 1081 Burrard St. Vancouver, British Columbia, V6Z 1Y6, Canada
| | - Byron Yu-Lin Kuo
- The James Hogg iCAPTURE Centre for Cardiovascular and Pulmonary Research, Providence Health Care/University of British Columbia, St. Paul's Hospital, Rm. 166, 1081 Burrard St. Vancouver, British Columbia, V6Z 1Y6, Canada
| | - Mark D Wilkinson
- The James Hogg iCAPTURE Centre for Cardiovascular and Pulmonary Research, Providence Health Care/University of British Columbia, St. Paul's Hospital, Rm. 166, 1081 Burrard St. Vancouver, British Columbia, V6Z 1Y6, Canada
|
39
|
Abstract
BACKGROUND The "inverse" problem is related to the determination of unknown causes on the basis of the observation of their effects. This is the opposite of the corresponding "direct" problem, which relates to the prediction of the effects generated by a complete description of some agencies. The solution of an inverse problem entails the construction of a mathematical model and starts from a set of experimental data. In this respect, inverse problems are often ill-conditioned, as the number of experimental conditions available is often insufficient to unambiguously solve the mathematical model. Several approaches to solving inverse problems are possible, both computational and experimental, some of which are mentioned in this article. In this work, we describe in detail the attempt to solve an inverse problem which arose in the study of an intracellular signaling pathway. RESULTS Using the Genetic Algorithm to find the sub-optimal solution to the optimization problem, we have estimated a set of unknown parameters describing a kinetic model of a signaling pathway in the neuronal cell. The model is composed of mass action ordinary differential equations, where the kinetic parameters describe protein-protein interactions, protein synthesis and degradation. The algorithm has been implemented on a parallel platform. Several potential solutions of the problem have been computed, each solution being a set of model parameters. A sub-set of parameters has been selected on the basis of their small coefficient of variation across the ensemble of solutions. CONCLUSION Despite the lack of sufficiently reliable and homogeneous experimental data, the genetic algorithm approach has allowed us to estimate the approximate value of a number of model parameters in a kinetic model of a signaling pathway: these parameters have been assessed to be relevant for the reproduction of the available experimental data.
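The parameter-selection step mentioned in the abstract above (keeping parameters with a small coefficient of variation across the ensemble of solutions) can be sketched in Python as follows. This is a minimal illustration only: the parameter names, the numeric values, and the `max_cv` threshold are hypothetical, the sample standard deviation is an assumed choice, and the abstract does not say which cutoff or estimator the authors used.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean."""
    mean = statistics.mean(values)
    if mean == 0:
        return float("inf")  # CV is undefined for a zero mean; treat as unstable
    return statistics.stdev(values) / abs(mean)


def stable_parameters(solutions, max_cv):
    """Return {name: cv} for parameters whose CV across solutions is at most max_cv."""
    names = solutions[0].keys()
    cvs = {
        name: coefficient_of_variation([s[name] for s in solutions])
        for name in names
    }
    return {name: cv for name, cv in cvs.items() if cv <= max_cv}


# Hypothetical ensemble: each dict is one solution returned by an optimizer run.
ensemble = [
    {"k_bind": 1.0, "k_deg": 0.10},
    {"k_bind": 1.1, "k_deg": 0.50},
    {"k_bind": 0.9, "k_deg": 0.90},
]
selected = stable_parameters(ensemble, max_cv=0.2)
```

In this toy ensemble `k_bind` varies little between runs and is retained, while `k_deg` varies widely and is dropped, which mirrors the idea that only consistently estimated parameters are considered well determined by the data.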
Affiliation(s)
- Ivan Arisi
- European Brain Research Institute, Via Fosso del Fiorano 64, Roma, Italy
| | - Antonino Cattaneo
- European Brain Research Institute, Via Fosso del Fiorano 64, Roma, Italy
- Lay Line Genomics SpA, S.Raffaele Science Park, Castel Romano, Italy
- International School of Advanced Studies (SISSA/ISAS), Biophysics Dept., Via Beirut 2-4, Trieste, Italy
| | - Vittorio Rosato
- ENEA, Casaccia Research Center, Computing and Modelling Unit, Via Anguillarese 301, S.Maria di Galeria, Italy
- Ylichron Srl, c/o ENEA, Casaccia Research Center, Via Anguillarese 301, S.Maria di Galeria, Italy
|
40
|
Ananiadou S, Kell DB, Tsujii JI. Text mining and its potential applications in systems biology. Trends Biotechnol 2006; 24:571-9. [PMID: 17045684 DOI: 10.1016/j.tibtech.2006.10.002] [Citation(s) in RCA: 164] [Impact Index Per Article: 8.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2006] [Revised: 08/21/2006] [Accepted: 10/02/2006] [Indexed: 11/15/2022]
Abstract
With biomedical literature increasing at a rate of several thousand papers per week, it is impossible to keep abreast of all developments; therefore, automated means to manage the information overload are required. Text mining techniques, which involve the processes of information retrieval, information extraction and data mining, provide a means of solving this. By adding meaning to text, these techniques produce a more structured analysis of textual knowledge than simple word searches, and can provide powerful tools for the production and analysis of systems biology models.
Affiliation(s)
- Sophia Ananiadou
- School of Computer Science, National Centre for Text Mining, The Manchester Interdisciplinary Biocentre, The University of Manchester, 131 Princess Street, Manchester M1 7ND, UK.
|
41
|
Xiang Z, Zheng W, He Y. BBP: Brucella genome annotation with literature mining and curation. BMC Bioinformatics 2006; 7:347. [PMID: 16842628 PMCID: PMC1539029 DOI: 10.1186/1471-2105-7-347] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2006] [Accepted: 07/16/2006] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND Brucella species are Gram-negative, facultative intracellular bacteria that cause brucellosis in humans and animals. Sequences of four Brucella genomes have been published, and various Brucella gene and genome data and analysis resources exist. A web gateway to integrate these resources will greatly facilitate Brucella research. Brucella genome data in current databases is largely derived from computational analysis without the experimental validation typically found in peer-reviewed publications. This is partially due to the lack of a literature mining and curation system able to efficiently incorporate the large amount of literature data into genome annotation. It is further hypothesized that literature-based Brucella gene annotation would increase understanding of complicated Brucella pathogenesis mechanisms. RESULTS The Brucella Bioinformatics Portal (BBP) is developed to integrate existing Brucella genome data and analysis tools with literature mining and curation. The BBP InterBru database and Brucella Genome Browser allow users to search and analyze genes of 4 currently available Brucella genomes and link to more than 20 existing databases and analysis programs. Brucella literature publications in PubMed are extracted and can be searched by a TextPresso-powered natural language processing method, a MeSH browser, a keywords search, and an automatic literature update service. To efficiently annotate Brucella genes using the large amount of literature publications, a literature mining and curation system coined Limix is developed to integrate computational literature mining methods with a PubSearch-powered manual curation and management system. The Limix system is used to quickly find and confirm 107 Brucella gene mutations including 75 genes shown to be essential for Brucella virulence. The 75 genes are further clustered using COG. In addition, 62 Brucella genetic interactions are extracted from literature publications. These results make possible more comprehensive investigation of Brucella pathogenesis. Other BBP features include publication email alert service, Brucella researchers' contact database, and discussion forum. CONCLUSION BBP is a gateway for Brucella researchers to search, analyze, and curate Brucella genome data originating from public databases and literature. Brucella gene mutations and genetic interactions are annotated using Limix, leading to better understanding of Brucella pathogenesis.
Affiliation(s)
- Zuoshuang Xiang
- Unit for Laboratory Animal Medicine, University of Michigan Medical School, Ann Arbor, MI 48109, USA
| | | | - Yongqun He
- Unit for Laboratory Animal Medicine, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Department of Microbiology and Immunology, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Bioinformatics Program, University of Michigan Medical School, Ann Arbor, MI 48109, USA
|
42
|
Hahn U, Valencia A. Semantic Mining in Biomedicine (Introduction to the papers selected from the SMBM 2005 Symposium, Hinxton, U.K., April 2005). Bioinformatics 2006; 22:643-4. [PMID: 16527834 DOI: 10.1093/bioinformatics/btl084] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
|
43
|
Villoslada P, Oksenberg JR. Neuroinformatics in clinical practice: are computers going to help neurological patients and their physicians? FUTURE NEUROLOGY 2006. [DOI: 10.2217/14796708.1.2.159] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
Neuroinformatics is the field that merges the power of computational analysis with neuroscience. It is a discipline that has evolved from the original use of computers for data organization to the current development and application of sophisticated computational tools for large-scale data and image management, analysis and modeling of brain function in health and disease. Neuroinformatics has the potential to be a powerful instrument in the discovery of biological markers of neurological diseases, as well as in the development of new and more effective therapies. Owing to the exponential growth in size and complexity of the information available in the neurosciences, neuroinformatic methods are becoming indispensable in modern neurological research. We predict that, in the near future, they will also be essential at the bedside.
|
44
|
Grant SGN, Marshall MC, Page KL, Cumiskey MA, Armstrong JD. Synapse proteomics of multiprotein complexes: en route from genes to nervous system diseases. Hum Mol Genet 2005; 14 Spec No. 2:R225-34. [PMID: 16150739 DOI: 10.1093/hmg/ddi330] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Proteomic experiments have produced a draft profile of the overall molecular composition of the mammalian neuronal synapse. It appears that synapses have over 1000 protein components and the mapping of their interactions, organization and functions will lead to a global view of the role of synapses in physiology and disease. A major functional subcomponent of the synaptic machinery is a multiprotein complex of glutamate receptors and adhesion proteins with associated adaptor and signalling enzymes, totalling 185 proteins, known as the N-methyl-d-aspartate receptor complex/MAGUK associated signalling complex (NRC/MASC). Here, we review the proteomic studies and functions of NRC/MASC and specifically report on the role of its component genes in human diseases. Using a systematic literature search protocol, we identified reports of mutations or polymorphisms in 47 genes associated with 183 disorders, of which 54 were nervous system disorders. A similar number of genes are important in mouse synaptic plasticity and behaviour, where the NRC/MASC acts as a signalling complex with multiple functions provided by its individual protein components and their interactions. The individual gene mutations suggest not only an important role for the NRC/MASC in human diseases but that these diseases may be functionally connected by their common link to the NRC/MASC. The NRC/MASC is a rich source of genetic variation and provides a platform for understanding relationships of disease phenotype amenable to systematic studies such as the Genes to Cognition research consortium (www.genes2cognition.org) that links human and mouse genetics with proteomic studies.
Affiliation(s)
- Seth G N Grant
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1SA, UK
|