1
Hakala K, Kaewphan S, Bjorne J, Mehryary F, Moen H, Tolvanen M, Salakoski T, Ginter F. Neural Network and Random Forest Models in Protein Function Prediction. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:1772-1781. [PMID: 33306472 DOI: 10.1109/tcbb.2020.3044230] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/12/2023]
Abstract
Over the past decade, the demand for automated protein function prediction has increased due to the volume of newly sequenced proteins. In this paper, we address the function prediction task by developing an ensemble system that automatically assigns Gene Ontology (GO) terms to a given input protein sequence. The system combines the GO predictions made by random forest (RF) and neural network (NN) classifiers. Both RF and NN models rely on features derived from BLAST sequence alignments, taxonomy and protein signature analysis tools. In addition, we report on experiments with an NN model that directly analyzes the amino acid sequence as its sole input, using a convolutional layer. The Swiss-Prot database is used as the training and evaluation data. In the CAFA3 evaluation, which relies on experimental verification of the functional predictions, our submitted ensemble model demonstrates competitive performance, ranking among the top 10 of over 100 submitted systems. In this paper, we evaluate and further improve the CAFA3-submitted system. Our machine learning models, together with the data pre-processing and feature generation tools, are publicly available as open-source software at https://github.com/TurkuNLP/CAFA3.
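The abstract does not say how the RF and NN predictions are combined, so the following is only a minimal sketch of one common option: averaging the per-term scores of the two classifiers and keeping terms above a cut-off. The function name `combine_predictions`, the 0.5 threshold, and the rule that a term missing from one model counts as 0.0 are all assumptions made for illustration, not details taken from the paper or the CAFA3 repository.

```python
from typing import Dict


def combine_predictions(
    rf_scores: Dict[str, float],
    nn_scores: Dict[str, float],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Average per-GO-term scores from two classifiers (hypothetical scheme).

    A term predicted by only one model is treated as scored 0.0 by the other.
    Only terms whose averaged score reaches `threshold` are returned.
    """
    terms = set(rf_scores) | set(nn_scores)
    combined = {
        term: (rf_scores.get(term, 0.0) + nn_scores.get(term, 0.0)) / 2
        for term in terms
    }
    return {term: score for term, score in combined.items() if score >= threshold}


# Illustrative scores only; these are not results from the paper.
rf = {"GO:0005515": 1.0, "GO:0003677": 0.5}
nn = {"GO:0005515": 0.5, "GO:0003677": 0.25, "GO:0016020": 0.75}
selected = combine_predictions(rf, nn)
```

Averaging is only one possible design; weighted voting or a second-level classifier trained on both models' outputs would fit the same description.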
2
Moen H, Hakala K, Peltonen LM, Suhonen H, Ginter F, Salakoski T, Salanterä S. Supporting the use of standardized nursing terminologies with automatic subject heading prediction: a comparison of sentence-level text classification methods. J Am Med Inform Assoc 2021; 27:81-88. [PMID: 31605490 PMCID: PMC6913232 DOI: 10.1093/jamia/ocz150] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/21/2019] [Revised: 07/04/2019] [Accepted: 08/03/2019] [Indexed: 12/19/2022] Open
Abstract
Objective This study focuses on the task of automatically assigning standardized (topical) subject headings to free-text sentences in clinical nursing notes. The underlying motivation is to support nurses when they document patient care by developing a computer system that can assist in incorporating suitable subject headings that reflect the documented topics. Central in this study is performance evaluation of several text classification methods to assess the feasibility of developing such a system. Materials and Methods Seven text classification methods are evaluated using a corpus of approximately 0.5 million nursing notes (5.5 million sentences) with 676 unique headings extracted from a Finnish university hospital. Several of these methods are based on artificial neural networks. Evaluation is first done in an automatic manner for all methods, then a manual error analysis is done on a sample. Results We find that a method based on a bidirectional long short-term memory network performs best with an average recall of 0.5435 when allowed to suggest 1 subject heading per sentence and 0.8954 when allowed to suggest 10 subject headings per sentence. However, other methods achieve comparable results. The manual analysis indicates that the predictions are better than what the automatic evaluation suggests. Conclusions The results indicate that several of the tested methods perform well in suggesting the most appropriate subject headings on sentence level. Thus, we find it feasible to develop a text classification system that can support the use of standardized terminologies and save nurses time and effort on care documentation.
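The recall figures for 1 and 10 suggested headings can be read as recall at k. The sketch below shows how such a score could be computed, assuming each sentence has exactly one gold heading and each method returns a ranked list of suggestions; both are assumptions, since the abstract does not spell out the evaluation protocol. The function name `recall_at_k` and the example headings are hypothetical.

```python
from typing import List, Sequence


def recall_at_k(
    ranked_predictions: Sequence[Sequence[str]],
    gold_labels: Sequence[str],
    k: int,
) -> float:
    """Fraction of sentences whose gold heading is among the top-k suggestions."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold_labels:
        raise ValueError("at least one sentence is required")
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_labels)
        if gold in ranked[:k]
    )
    return hits / len(gold_labels)


# Toy data: three sentences, each with a ranked list of suggested headings.
ranked: List[List[str]] = [
    ["pain", "nutrition"],
    ["sleep", "pain"],
    ["nutrition", "sleep"],
]
gold = ["pain", "pain", "mobility"]
r1 = recall_at_k(ranked, gold, k=1)
r2 = recall_at_k(ranked, gold, k=2)
```

With larger k the score can only stay the same or rise, which matches the reported increase from 1 to 10 suggestions.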
Affiliation(s)
- Hans Moen
  - Department of Future Technologies, University of Turku, Turku, Finland
- Kai Hakala
  - Department of Future Technologies, University of Turku, Turku, Finland
- Henry Suhonen
  - Department of Nursing Science, University of Turku, Turku, Finland; Department of Nursing, Turku University Hospital, Turku, Finland
- Filip Ginter
  - Department of Future Technologies, University of Turku, Turku, Finland
- Tapio Salakoski
  - Department of Future Technologies, University of Turku, Turku, Finland
- Sanna Salanterä
  - Department of Nursing Science, University of Turku, Turku, Finland; Department of Nursing, Turku University Hospital, Turku, Finland
3
Abstract
Background: Syntactic analysis, or parsing, is a key task in natural language processing and a required component for many text mining approaches. In recent years, Universal Dependencies (UD) has emerged as the leading formalism for dependency parsing. While a number of recent tasks centering on UD have substantially advanced the state of the art in multilingual parsing, there has been little study of parsing texts from specialized domains such as biomedicine. Methods: We explore the application of state-of-the-art neural dependency parsing methods to biomedical text using the recently introduced CRAFT-SA shared task dataset. The CRAFT-SA task broadly follows the UD representation and recent UD task conventions, allowing us to fine-tune the UD-compatible Turku Neural Parser and UDify neural parsers to the task. We further evaluate the effect of transfer learning using a broad selection of BERT models, including several models pre-trained specifically for biomedical text processing. Results: We find that recently introduced neural parsing technology is capable of generating highly accurate analyses of biomedical text, substantially improving on the best performance reported in the original CRAFT-SA shared task. We also find that initialization using a deep transfer learning model pre-trained on in-domain texts is key to maximizing the performance of the parsing methods.
Affiliation(s)
- Jenna Kanerva
  - TurkuNLP Group, University of Turku, Turku, Finland
- Filip Ginter
  - TurkuNLP Group, University of Turku, Turku, Finland
4
Moen H, Hakala K, Peltonen LM, Matinolli HM, Suhonen H, Terho K, Danielsson-Ojala R, Valta M, Ginter F, Salakoski T, Salanterä S. Assisting nurses in care documentation: from automated sentence classification to coherent document structures with subject headings. J Biomed Semantics 2020; 11:10. [PMID: 32873340 PMCID: PMC7465411 DOI: 10.1186/s13326-020-00229-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/19/2019] [Accepted: 08/14/2020] [Indexed: 11/10/2022] Open
Abstract
Background Up to 35% of nurses’ working time is spent on care documentation. We describe the evaluation of a system aimed at assisting nurses in documenting patient care and potentially reducing the documentation workload. Our goal is to enable nurses to write or dictate nursing notes in a narrative manner without having to manually structure their text under subject headings. In the current care classification standard used in the targeted hospital, there are more than 500 subject headings to choose from, making it challenging and time consuming for nurses to use. Methods The task of the presented system is to automatically group sentences into paragraphs and assign subject headings. For classification the system relies on a neural network-based text classification model. The nursing notes are initially classified on sentence level. Subsequently coherent paragraphs are constructed from related sentences. Results Based on a manual evaluation conducted by a group of three domain experts, we find that in about 69% of the paragraphs formed by the system the topics of the sentences are coherent and the assigned paragraph headings correctly describe the topics. We also show that the use of a paragraph merging step reduces the number of paragraphs produced by 23% without affecting the performance of the system. Conclusions The study shows that the presented system produces a coherent and logical structure for freely written nursing narratives and has the potential to reduce the time and effort nurses are currently spending on documenting care in hospitals.
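The two-stage idea in the Methods (classify sentences, then build paragraphs, then merge) can be sketched as follows. This is a simplified illustration under stated assumptions: consecutive sentences sharing a predicted heading form one paragraph, and the merging step joins all paragraphs that carry the same heading. The abstract does not describe the actual grouping or merging criteria, and the function names `group_sentences` and `merge_paragraphs` are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

Paragraph = Tuple[str, List[str]]  # (subject heading, sentences)


def group_sentences(sentences: Sequence[str], headings: Sequence[str]) -> List[Paragraph]:
    """Group consecutive sentences that share the same predicted heading."""
    paired = zip(headings, sentences)
    return [
        (heading, [sentence for _, sentence in group])
        for heading, group in groupby(paired, key=lambda pair: pair[0])
    ]


def merge_paragraphs(paragraphs: Sequence[Paragraph]) -> List[Paragraph]:
    """Merge paragraphs with identical headings, keeping first-seen order (assumed rule)."""
    merged = {}
    for heading, sents in paragraphs:
        merged.setdefault(heading, []).extend(sents)
    return list(merged.items())


# Toy note with sentence-level predictions (illustrative only).
sentences = ["a", "b", "c", "d"]
headings = ["Pain", "Pain", "Nutrition", "Pain"]
paragraphs = group_sentences(sentences, headings)
merged = merge_paragraphs(paragraphs)
```

In this toy example merging reduces three paragraphs to two, which mirrors in miniature the reduction in paragraph count that the abstract attributes to the merging step.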
Affiliation(s)
- Hans Moen
  - Department of Future Technologies, University of Turku, Vesilinnantie 5, Turku, 20500, Finland
- Kai Hakala
  - Department of Future Technologies, University of Turku, Vesilinnantie 5, Turku, 20500, Finland; University of Turku Graduate School, University of Turku, Hämeenkatu 4, Turku, 20500, Finland
- Laura-Maria Peltonen
  - Department of Nursing Science, University of Turku, Joukahaisenkatu 3-5, Turku, 20520, Finland
- Hanna-Maria Matinolli
  - Department of Nursing Science, University of Turku, Joukahaisenkatu 3-5, Turku, 20520, Finland
- Henry Suhonen
  - Department of Nursing Science, University of Turku, Joukahaisenkatu 3-5, Turku, 20520, Finland; Turku University Hospital, Kiinamyllynkatu 4-8, Turku, 20521, Finland
- Kirsi Terho
  - Department of Nursing Science, University of Turku, Joukahaisenkatu 3-5, Turku, 20520, Finland; Turku University Hospital, Kiinamyllynkatu 4-8, Turku, 20521, Finland
- Riitta Danielsson-Ojala
  - Department of Nursing Science, University of Turku, Joukahaisenkatu 3-5, Turku, 20520, Finland; Turku University Hospital, Kiinamyllynkatu 4-8, Turku, 20521, Finland
- Maija Valta
  - Turku University Hospital, Kiinamyllynkatu 4-8, Turku, 20521, Finland
- Filip Ginter
  - Department of Future Technologies, University of Turku, Vesilinnantie 5, Turku, 20500, Finland
- Tapio Salakoski
  - Department of Future Technologies, University of Turku, Vesilinnantie 5, Turku, 20500, Finland
- Sanna Salanterä
  - Department of Nursing Science, University of Turku, Joukahaisenkatu 3-5, Turku, 20520, Finland; Turku University Hospital, Kiinamyllynkatu 4-8, Turku, 20521, Finland
5
Sarker A, Belousov M, Friedrichs J, Hakala K, Kiritchenko S, Mehryary F, Han S, Tran T, Rios A, Kavuluru R, de Bruijn B, Ginter F, Mahata D, Mohammad SM, Nenadic G, Gonzalez-Hernandez G. Data and systems for medication-related text classification and concept normalization from Twitter: insights from the Social Media Mining for Health (SMM4H)-2017 shared task. J Am Med Inform Assoc 2019; 25:1274-1283. [PMID: 30272184 PMCID: PMC6188524 DOI: 10.1093/jamia/ocy114] [Citation(s) in RCA: 43] [Impact Index Per Article: 8.6] [Received: 05/08/2018] [Accepted: 08/02/2018] [Indexed: 12/19/2022] Open
Abstract
Objective We executed the Social Media Mining for Health (SMM4H) 2017 shared tasks to enable the community-driven development and large-scale evaluation of automatic text processing methods for the classification and normalization of health-related text from social media. An additional objective was to publicly release manually annotated data. Materials and Methods We organized 3 independent subtasks: automatic classification of self-reports of 1) adverse drug reactions (ADRs) and 2) medication consumption, from medication-mentioning tweets, and 3) normalization of ADR expressions. Training data consisted of 15 717 annotated tweets for (1), 10 260 for (2), and 6650 ADR phrases and identifiers for (3); and exhibited typical properties of social-media-based health-related texts. Systems were evaluated using 9961, 7513, and 2500 instances for the 3 subtasks, respectively. We evaluated performances of classes of methods and ensembles of system combinations following the shared tasks. Results Among 55 system runs, the best system scores for the 3 subtasks were 0.435 (ADR class F1-score) for subtask-1, 0.693 (micro-averaged F1-score over two classes) for subtask-2, and 88.5% (accuracy) for subtask-3. Ensembles of system combinations obtained best scores of 0.476, 0.702, and 88.7%, outperforming individual systems. Discussion Among individual systems, support vector machines and convolutional neural networks showed high performance. Performance gains achieved by ensembles of system combinations suggest that such strategies may be suitable for operational systems relying on difficult text classification tasks (eg, subtask-1). Conclusions Data imbalance and lack of context remain challenges for natural language processing of social media text. Annotated data from the shared task have been made available as reference standards for future studies (http://dx.doi.org/10.17632/rxwfb3tysd.1).
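The subtask metrics are F1 for a single class (subtask-1) and micro-averaged F1 over two classes (subtask-2). The sketch below shows the standard micro-averaged F1 restricted to a chosen set of classes, which reduces to per-class F1 when one class is given. It is an illustration of the metric's definition, not the shared task's official scorer; the function name `micro_f1` and the toy labels are hypothetical.

```python
from typing import Hashable, Sequence, Set


def micro_f1(
    gold: Sequence[Hashable],
    pred: Sequence[Hashable],
    classes: Set[Hashable],
) -> float:
    """Micro-averaged F1 over `classes`: 2*TP / (2*TP + FP + FN).

    True positives, false positives and false negatives are pooled
    across the selected classes before the score is computed.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == p and p in classes)
    fp = sum(1 for g, p in zip(gold, pred) if g != p and p in classes)
    fn = sum(1 for g, p in zip(gold, pred) if g != p and g in classes)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy three-class labels; the score is computed over classes 1 and 2 only.
gold = [1, 1, 2, 3, 3]
pred = [1, 2, 2, 3, 1]
score_two_classes = micro_f1(gold, pred, classes={1, 2})
score_one_class = micro_f1(gold, pred, classes={1})
```

Scoring only the classes of interest keeps a large, easy negative class from dominating the result, which matters for the data imbalance the abstract mentions.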
Affiliation(s)
- Abeed Sarker
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Maksim Belousov
  - School of Computer Science, University of Manchester, Manchester, UK
- Kai Hakala
  - Turku NLP Group, Department of Future Technologies, University of Turku, Turku, Finland; The University of Turku Graduate School, University of Turku, Turku, Finland
- Svetlana Kiritchenko
  - Digital Technologies Research Centre, National Research Council Canada, Ottawa, Canada
- Farrokh Mehryary
  - Turku NLP Group, Department of Future Technologies, University of Turku, Turku, Finland; The University of Turku Graduate School, University of Turku, Turku, Finland
- Sifei Han
  - Department of Computer Science, University of Kentucky, Lexington, Kentucky, USA
- Tung Tran
  - Department of Computer Science, University of Kentucky, Lexington, Kentucky, USA
- Anthony Rios
  - Department of Computer Science, University of Kentucky, Lexington, Kentucky, USA
- Ramakanth Kavuluru
  - Department of Computer Science, University of Kentucky, Lexington, Kentucky, USA; Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, Kentucky, USA
- Berry de Bruijn
  - Digital Technologies Research Centre, National Research Council Canada, Ottawa, Canada
- Filip Ginter
  - Turku NLP Group, Department of Future Technologies, University of Turku, Turku, Finland
- Saif M Mohammad
  - Digital Technologies Research Centre, National Research Council Canada, Ottawa, Canada
- Goran Nenadic
  - School of Computer Science, University of Manchester, Manchester, UK
- Graciela Gonzalez-Hernandez
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
6
Zhou N, Jiang Y, Bergquist TR, Lee AJ, Kacsoh BZ, Crocker AW, Lewis KA, Georghiou G, Nguyen HN, Hamid MN, Davis L, Dogan T, Atalay V, Rifaioglu AS, Dalkıran A, Cetin Atalay R, Zhang C, Hurto RL, Freddolino PL, Zhang Y, Bhat P, Supek F, Fernández JM, Gemovic B, Perovic VR, Davidović RS, Sumonja N, Veljkovic N, Asgari E, Mofrad MRK, Profiti G, Savojardo C, Martelli PL, Casadio R, Boecker F, Schoof H, Kahanda I, Thurlby N, McHardy AC, Renaux A, Saidi R, Gough J, Freitas AA, Antczak M, Fabris F, Wass MN, Hou J, Cheng J, Wang Z, Romero AE, Paccanaro A, Yang H, Goldberg T, Zhao C, Holm L, Törönen P, Medlar AJ, Zosa E, Borukhov I, Novikov I, Wilkins A, Lichtarge O, Chi PH, Tseng WC, Linial M, Rose PW, Dessimoz C, Vidulin V, Dzeroski S, Sillitoe I, Das S, Lees JG, Jones DT, Wan C, Cozzetto D, Fa R, Torres M, Warwick Vesztrocy A, Rodriguez JM, Tress ML, Frasca M, Notaro M, Grossi G, Petrini A, Re M, Valentini G, Mesiti M, Roche DB, Reeb J, Ritchie DW, Aridhi S, Alborzi SZ, Devignes MD, Koo DCE, Bonneau R, Gligorijević V, Barot M, Fang H, Toppo S, Lavezzo E, Falda M, Berselli M, Tosatto SCE, Carraro M, Piovesan D, Ur Rehman H, Mao Q, Zhang S, Vucetic S, Black GS, Jo D, Suh E, Dayton JB, Larsen DJ, Omdahl AR, McGuffin LJ, Brackenridge DA, Babbitt PC, Yunes JM, Fontana P, Zhang F, Zhu S, You R, Zhang Z, Dai S, Yao S, Tian W, Cao R, Chandler C, Amezola M, Johnson D, Chang JM, Liao WH, Liu YW, Pascarelli S, Frank Y, Hoehndorf R, Kulmanov M, Boudellioua I, Politano G, Di Carlo S, Benso A, Hakala K, Ginter F, Mehryary F, Kaewphan S, Björne J, Moen H, Tolvanen MEE, Salakoski T, Kihara D, Jain A, Šmuc T, Altenhoff A, Ben-Hur A, Rost B, Brenner SE, Orengo CA, Jeffery CJ, Bosco G, Hogan DA, Martin MJ, O'Donovan C, Mooney SD, Greene CS, Radivojac P, Friedberg I. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol 2019; 20:244. 
[PMID: 31744546 PMCID: PMC6864930 DOI: 10.1186/s13059-019-1835-8] [Citation(s) in RCA: 166] [Impact Index Per Article: 33.2] [Received: 05/16/2019] [Accepted: 09/24/2019] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. RESULTS Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. CONCLUSION We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.
Affiliation(s)
- Naihui Zhou
  - Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA; Program in Bioinformatics and Computational Biology, Ames, IA, USA
- Yuxiang Jiang
  - Indiana University Bloomington, Bloomington, Indiana, USA
- Timothy R Bergquist
  - Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
- Alexandra J Lee
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Balint Z Kacsoh
  - Geisel School of Medicine at Dartmouth, Hanover, NH, USA; Department of Molecular and Systems Biology, Hanover, NH, USA
- Alex W Crocker
  - Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
- Kimberley A Lewis
  - Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
- George Georghiou
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, United Kingdom
- Huy N Nguyen
  - Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA; Program in Computer Science, Ames, IA, USA
- Md Nafiz Hamid
  - Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA; Program in Bioinformatics and Computational Biology, Ames, IA, USA
- Larry Davis
  - Program in Bioinformatics and Computational Biology, Ames, IA, USA
- Tunca Dogan
  - Department of Computer Engineering, Hacettepe University, Ankara, Turkey; European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
- Volkan Atalay
  - Department of Computer Engineering, Middle East Technical University (METU), Ankara, Turkey
- Ahmet S Rifaioglu
  - Department of Computer Engineering, Middle East Technical University (METU), Ankara, Turkey; Department of Computer Engineering, Iskenderun Technical University, Hatay, Turkey
- Alperen Dalkıran
  - Department of Computer Engineering, Middle East Technical University (METU), Ankara, Turkey
- Rengul Cetin Atalay
  - CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Chengxin Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Rebecca L Hurto
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
- Peter L Freddolino
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
- Yang Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
- Fran Supek
  - Institute for Research in Biomedicine (IRB Barcelona), Barcelona, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
- José M Fernández
  - INB Coordination Unit, Life Sciences Department, Barcelona Supercomputing Center, Barcelona, Catalonia, Spain; (former) INB GN2, Structural and Computational Biology Programme, Spanish National Cancer Research Centre, Barcelona, Catalonia, Spain
- Branislava Gemovic
  - Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
- Vladimir R Perovic
  - Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
- Radoslav S Davidović
  - Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
- Neven Sumonja
  - Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
- Nevena Veljkovic
  - Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
- Ehsaneddin Asgari
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering, University of California Berkeley, Berkeley, CA, USA; Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Berkeley, CA, USA
- Giuseppe Profiti
  - Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy; National Research Council, IBIOM, Bologna, Italy
- Castrense Savojardo
  - Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Pier Luigi Martelli
  - Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Rita Casadio
  - Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Florian Boecker
  - University of Bonn: INRES Crop Bioinformatics, Bonn, North Rhine-Westphalia, Germany
- Heiko Schoof
  - INRES Crop Bioinformatics, University of Bonn, Bonn, Germany
- Indika Kahanda
  - Gianforte School of Computing, Montana State University, Bozeman, Montana, USA
- Natalie Thurlby
  - University of Bristol, Computer Science, Bristol, Bristol, United Kingdom
- Alice C McHardy
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Brunswick, Germany; RESIST, DFG Cluster of Excellence 2155, Brunswick, Germany
- Alexandre Renaux
  - Interuniversity Institute of Bioinformatics in Brussels, Université libre de Bruxelles - Vrije Universiteit Brussel, Brussels, Belgium; Machine Learning Group, Université libre de Bruxelles, Brussels, Belgium; Artificial Intelligence lab, Vrije Universiteit Brussel, Brussels, Belgium
- Rabie Saidi
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
- Julian Gough
  - MRC Laboratory of Molecular Biology, Cambridge, United Kingdom
- Alex A Freitas
  - University of Kent, School of Computing, Canterbury, United Kingdom
- Magdalena Antczak
  - School of Biosciences, University of Kent, Canterbury, Kent, United Kingdom
- Fabio Fabris
  - University of Kent, School of Computing, Canterbury, United Kingdom
- Mark N Wass
  - School of Biosciences, University of Kent, Canterbury, Kent, United Kingdom
- Jie Hou
  - University of Missouri, Computer Science, Columbia, Missouri, USA; Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
- Jianlin Cheng
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
- Zheng Wang
  - University of Miami, Coral Gables, Florida, USA
- Alfonso E Romero
  - Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
- Alberto Paccanaro
  - Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
- Haixuan Yang
  - School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Galway, Ireland; Technical University of Munich, Garching, Germany
- Tatyana Goldberg
  - Department of Informatics, Bioinformatics & Computational Biology-i12, Technische Universitat Munchen, Munich, Germany
- Chenguang Zhao
  - Faculty for Informatics, Garching, Germany; Department for Bioinformatics and Computational Biology, Garching, Germany; School of Computing Sciences and Computer Engineering, Hattiesburg, Mississippi, USA
- Liisa Holm
  - Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Finland, Helsinki, Finland
- Petri Törönen
  - Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Finland, Helsinki, Finland
- Alan J Medlar
  - Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Finland, Helsinki, Finland
- Elaine Zosa
  - Institute of Biotechnology, University of Helsinki, Helsinki, Finland
- Ilya Novikov
  - Baylor College of Medicine, Department of Biochemistry and Molecular Biology, Houston, TX, USA
- Angela Wilkins
  - Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, TX, USA
- Olivier Lichtarge
  - Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, TX, USA
- Po-Han Chi
  - National TsingHua University, Hsinchu, Taiwan
- Wei-Cheng Tseng
  - Department of Electrical Engineering in National Tsing Hua University, Hsinchu City, Taiwan
- Michal Linial
  - The Hebrew University of Jerusalem, Jerusalem, Israel
- Peter W Rose
  - University of California San Diego, San Diego Supercomputer Center, La Jolla, California, USA
- Christophe Dessimoz
  - Department of Computational Biology and Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland; Department of Genetics, Evolution & Environment, and Department of Computer Science, University College London, London, UK; Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Vedrana Vidulin
  - Department of Knowledge Technologies, Jozef Stefan Institute, Ljubljana, Slovenia
- Saso Dzeroski
  - Jozef Stefan Institute, Ljubljana, Slovenia; Jozef Stefan International Postgraduate School, Ljubljana, Slovenia
- Ian Sillitoe
  - Research Department of Structural and Molecular Biology, University College London, London, England
- Sayoni Das
  - Research Department of Structural and Molecular Biology, University College London, London, United Kingdom
- Jonathan Gill Lees
  - Research Department of Structural and Molecular Biology, University College London, London, United Kingdom; Department of Health and Life Sciences, Oxford Brookes University, London, UK
- David T Jones
  - The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom; Department of Genetics, Evolution and Environment, University College London, Gower Street, London, WC1E 6BT, United Kingdom
- Cen Wan
  - Department of Computer Science, University College London, London, United Kingdom; The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom
- Domenico Cozzetto
  - Department of Computer Science, University College London, London, United Kingdom; The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom
- Rui Fa
  - Department of Computer Science, University College London, London, United Kingdom; The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom
- Mateo Torres
  - Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
- Alex Warwick Vesztrocy
  - Department of Genetics, Evolution and Environment, University College London, Gower Street, London, WC1E 6BT, United Kingdom; SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland
- Jose Manuel Rodriguez
  - Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), Madrid, Spain
- Michael L Tress
  - Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Marco Frasca
  - Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
- Marco Notaro
  - Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
- Giuliano Grossi
  - Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
- Alessandro Petrini
  - Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
- Matteo Re
  - Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
- Giorgio Valentini
  - Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
- Marco Mesiti
  - Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy; Institut de Biologie Computationnelle, LIRMM, CNRS-UMR 5506, Universite de Montpellier, Montpellier, France
| | - Daniel B Roche
- Department of Informatics, Bioinformatics and Computational Biology-i12, Technische Universitat Munchen, Munich, Germany
| | - Jonas Reeb
- Department of Informatics, Bioinformatics and Computational Biology-i12, Technische Universitat Munchen, Munich, Germany
| | - David W Ritchie
- University of Lorraine, CNRS, Inria, LORIA, Nancy, 54000, France
| | - Sabeur Aridhi
- University of Lorraine, CNRS, Inria, LORIA, Nancy, 54000, France
| | | | - Marie-Dominique Devignes
- University of Lorraine, CNRS, Inria, LORIA, Nancy, 54000, France; University of Lorraine, Nancy, Lorraine, France; Inria, Nancy, France
| | | | - Richard Bonneau
- NYU Center for Data Science, New York, 10010, NY, USA; Flatiron Institute, CCB, New York, 10010, NY, USA
| | - Vladimir Gligorijević
- Center for Computational Biology (CCB), Flatiron Institute, Simons Foundation, New York, New York, USA
| | - Meet Barot
- Center for Data Science, New York University, New York, 10011, NY, USA
| | - Hai Fang
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
| | - Stefano Toppo
- Department of Molecular Medicine, University of Padova, Padova, Italy
| | - Enrico Lavezzo
- Department of Molecular Medicine, University of Padova, Padova, Italy
| | - Marco Falda
- Department of Biology, University of Padova, Padova, Italy
| | - Michele Berselli
- Department of Molecular Medicine, University of Padova, Padova, Italy
| | - Silvio C E Tosatto
- CNR Institute of Neuroscience, Padova, Italy; Department of Biomedical Sciences, University of Padua, Padova, Italy
| | - Marco Carraro
- Department of Biomedical Sciences, University of Padua, Padova, Italy
| | - Damiano Piovesan
- Department of Biomedical Sciences, University of Padua, Padova, Italy
| | - Hafeez Ur Rehman
- Department of Computer Science, National University of Computer and Emerging Sciences, Peshawar, Khyber Pakhtoonkhwa, Pakistan
| | - Qizhong Mao
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA; University of California, Riverside, Philadelphia, PA, USA
| | - Shanshan Zhang
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Slobodan Vucetic
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Gage S Black
- Department of Biology, Brigham Young University, Provo, UT, USA; Bioinformatics Research Group, Provo, UT, USA
| | - Dane Jo
- Department of Biology, Brigham Young University, Provo, UT, USA; Bioinformatics Research Group, Provo, UT, USA
| | - Erica Suh
- Department of Biology, Brigham Young University, Provo, UT, USA
| | - Jonathan B Dayton
- Department of Biology, Brigham Young University, Provo, UT, USA; Bioinformatics Research Group, Provo, UT, USA
| | - Dallas J Larsen
- Department of Biology, Brigham Young University, Provo, UT, USA; Bioinformatics Research Group, Provo, UT, USA
| | - Ashton R Omdahl
- Department of Biology, Brigham Young University, Provo, UT, USA; Bioinformatics Research Group, Provo, UT, USA
| | - Liam J McGuffin
- School of Biological Sciences, University of Reading, Reading, England, United Kingdom
| | | | - Patricia C Babbitt
- Department of Pharmaceutical Chemistry, San Francisco, CA, USA; Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, 94158, CA, USA
| | - Jeffrey M Yunes
- UC Berkeley - UCSF Graduate Program in Bioengineering, University of California, San Francisco, 94158, CA, USA; Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, 94158, CA, USA
| | - Paolo Fontana
- Research and Innovation Center, Edmund Mach Foundation, San Michele all'Adige, Italy
| | - Feng Zhang
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, Shanghai, China; Department of Biostatistics and Computational Biology, School of Life Sciences, Fudan University, Shanghai, Shanghai, China
| | - Shanfeng Zhu
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China; Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
| | - Ronghui You
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China; Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
| | - Zihan Zhang
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
| | - Suyang Dai
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
| | - Shuwei Yao
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China; Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai, China
| | - Weidong Tian
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, Department of Biostatistics and Computational Biology, School of Life Sciences, Fudan University, Shanghai, Shanghai, China; Department of Pediatrics, Brain Tumor Center, Division of Experimental Hematology and Cancer Biology, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
| | - Renzhi Cao
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
| | - Caleb Chandler
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
| | - Miguel Amezola
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
| | - Devon Johnson
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
| | - Jia-Ming Chang
- Department of Computer Science, National Chengchi University, Taipei, Taiwan
| | - Wen-Hung Liao
- Department of Computer Science, National Chengchi University, Taipei, Taiwan
| | - Yi-Wei Liu
- Department of Computer Science, National Chengchi University, Taipei, Taiwan
| | | | | | - Robert Hoehndorf
- Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia
| | - Maxat Kulmanov
- Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia
| | - Imane Boudellioua
- Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; Computer, Electrical and Mathematical Sciences Engineering Division (CEMSE), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Gianfranco Politano
- Control and Computer Engineering Department, Politecnico di Torino, Torino, TO, Italy
| | - Stefano Di Carlo
- Control and Computer Engineering Department, Politecnico di Torino, Torino, TO, Italy
| | - Alfredo Benso
- Control and Computer Engineering Department, Politecnico di Torino, Torino, TO, Italy
| | - Kai Hakala
- Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland; University of Turku Graduate School (UTUGS), Turku, Finland
| | - Filip Ginter
- Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland; University of Turku, Turku, Finland
| | - Farrokh Mehryary
- Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland; University of Turku Graduate School (UTUGS), Turku, Finland
| | - Suwisa Kaewphan
- Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland; University of Turku Graduate School (UTUGS), Turku, Finland; Turku Centre for Computer Science (TUCS), Turku, Finland
| | - Jari Björne
- Department of Future Technologies, Faculty of Science and Engineering, University of Turku, Turku, FI-20014, Finland; Turku Centre for Computer Science (TUCS), Agora, Vesilinnantie 3, Turku, FI-20500, Finland
| | | | | | - Tapio Salakoski
- Department of Future Technologies, Faculty of Science and Engineering, University of Turku, Turku, FI-20014, Finland; Turku Centre for Computer Science (TUCS), Agora, Vesilinnantie 3, Turku, FI-20500, Finland
| | - Daisuke Kihara
- Department of Biological Sciences, Department of Computer Science, Purdue University, 47907, IN, USA; Department of Pediatrics, University of Cincinnati, Cincinnati, 45229, OH, USA
| | - Aashish Jain
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
| | - Tomislav Šmuc
- Division of Electronics, Rudjer Boskovic Institute, Zagreb, Croatia
| | - Adrian Altenhoff
- Department of Computer Science, ETH Zurich, Zurich, Switzerland; SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Asa Ben-Hur
- Department of Computer Science, Colorado State University, Fort Collins, CO, USA
| | - Burkhard Rost
- Department of Informatics, Bioinformatics & Computational Biology-i12, Technische Universität München, Munich, Germany; Institute for Food and Plant Sciences WZW, Technische Universität München, Freising, Germany
| | | | - Christine A Orengo
- Research Department of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Constance J Jeffery
- Biological Sciences, University of Illinois at Chicago, Chicago, Illinois, USA
| | - Giovanni Bosco
- Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Deborah A Hogan
- Geisel School of Medicine at Dartmouth, Hanover, NH, USA; Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Maria J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, United Kingdom
| | - Claire O'Donovan
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, United Kingdom
| | - Sean D Mooney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA; Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, Pennsylvania, USA
| | - Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA.
| | - Iddo Friedberg
- Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA.
|
7
|
Kreula SM, Kaewphan S, Ginter F, Jones PR. Finding novel relationships with integrated gene-gene association network analysis of Synechocystis sp. PCC 6803 using species-independent text-mining. PeerJ 2018; 6:e4806. [PMID: 29844966 PMCID: PMC5970561 DOI: 10.7717/peerj.4806] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 12/15/2017] [Accepted: 04/30/2018] [Indexed: 01/17/2023] Open
Abstract
The increasing move towards open access full-text scientific literature enhances our ability to utilize advanced text-mining methods to construct information-rich networks that no human will be able to grasp simply from ‘reading the literature’. The utility of text-mining for well-studied species is obvious though the utility for less studied species, or those with no prior track-record at all, is not clear. Here we present a concept for how advanced text-mining can be used to create information-rich networks even for less well studied species and apply it to generate an open-access gene-gene association network resource for Synechocystis sp. PCC 6803, a representative model organism for cyanobacteria and first case-study for the methodology. By merging the text-mining network with networks generated from species-specific experimental data, network integration was used to enhance the accuracy of predicting novel interactions that are biologically relevant. A rule-based algorithm (filter) was constructed in order to automate the search for novel candidate genes with a high degree of likely association to known target genes by (1) ignoring established relationships from the existing literature, as they are already ‘known’, and (2) demanding multiple independent evidences for every novel and potentially relevant relationship. Using selected case studies, we demonstrate the utility of the network resource and filter to (i) discover novel candidate associations between different genes or proteins in the network, and (ii) rapidly evaluate the potential role of any one particular gene or protein. The full network is provided as an open-source resource.
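The two filter rules described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `novel_candidates`, the edge-triple input format, the gene labels and the `min_sources` threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def novel_candidates(edges, known_pairs, target, min_sources=2):
    """Return genes linked to `target` by at least `min_sources` independent
    evidence sources, skipping pairs already established in the literature.

    edges: iterable of (gene_a, gene_b, source) triples (hypothetical format).
    known_pairs: set of frozenset({gene_a, gene_b}) for established relationships.
    """
    support = defaultdict(set)
    for gene_a, gene_b, source in edges:
        if gene_a == gene_b or target not in (gene_a, gene_b):
            continue
        other = gene_b if gene_a == target else gene_a
        if frozenset((target, other)) in known_pairs:
            continue  # rule (1): ignore relationships that are already known
        support[other].add(source)
    # rule (2): keep only candidates backed by multiple independent sources
    return sorted(g for g, sources in support.items() if len(sources) >= min_sources)


# Toy data; gene names and evidence sources are invented for the example.
edges = [
    ("geneA", "geneB", "text_mining"),
    ("geneA", "geneB", "coexpression"),
    ("geneA", "geneC", "text_mining"),
    ("geneD", "geneA", "coexpression"),
    ("geneD", "geneA", "protein_interaction"),
    ("geneA", "geneE", "coexpression"),
]
known_pairs = {frozenset(("geneA", "geneB"))}
print(novel_candidates(edges, known_pairs, "geneA"))  # ['geneD']
```

In this toy run geneB is dropped because the pair is already known, and geneC and geneE are dropped because each has only one supporting source.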
Affiliation(s)
- Sanna M Kreula
- Department of Biochemistry, University of Turku, Turku, Finland; University of Turku Graduate School, University of Turku, Turku, Finland
| | - Suwisa Kaewphan
- University of Turku Graduate School, University of Turku, Turku, Finland; Turku Centre for Computer Science (TUCS), Turku, Finland; Department of Future Technologies, University of Turku, Turku, Finland
| | - Filip Ginter
- Department of Future Technologies, University of Turku, Turku, Finland
| | - Patrik R Jones
- Department of Life Sciences, Imperial College London, London, United Kingdom
|
8
|
Mehryary F, Björne J, Salakoski T, Ginter F. Potent pairing: ensemble of long short-term memory networks and support vector machine for chemical-protein relation extraction. Database (Oxford) 2018; 2018:5255148. [PMID: 30576487 PMCID: PMC6310522 DOI: 10.1093/database/bay120] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 02/26/2018] [Revised: 10/05/2018] [Accepted: 10/07/2018] [Indexed: 12/02/2022]
Abstract
Biomedical researchers regularly discover new interactions between chemical compounds/drugs and genes/proteins, and report them in research literature. Having knowledge about these interactions is crucially important in many research areas such as precision medicine and drug discovery. The BioCreative VI Task 5 (CHEMPROT) challenge promotes the development and evaluation of computer systems that can automatically recognize and extract statements of such interactions from biomedical literature. We participated in this challenge with a Support Vector Machine (SVM) system and a deep learning-based system (ST-ANN), and achieved an F-score of 60.99 for the task. After the shared task, we have significantly improved the performance of the ST-ANN system. Additionally, we have developed a new deep learning-based system (I-ANN) that considerably outperforms the ST-ANN system. Both ST-ANN and I-ANN systems are centered around training an ensemble of artificial neural networks and utilizing different bidirectional Long Short-Term Memory (LSTM) chains for representing the shortest dependency path and/or the full sentence. By combining the predictions of the SVM and the I-ANN systems, we achieved an F-score of 63.10 for the task, improving our previous F-score by 2.11 percentage points. Our systems are fully open-source and publicly available. We highlight that the systems we present in this study are not applicable only to the BioCreative VI Task 5, but can be effortlessly re-trained to extract any types of relations of interest, with no modifications of the source code required, if a manually annotated corpus is provided as training data in a specific file format.
Affiliation(s)
- Farrokh Mehryary
- TurkuNLP group, Department of Future Technologies, University of Turku, Turku, Finland
- University of Turku Graduate School, Turku, Finland
| | - Jari Björne
- TurkuNLP group, Department of Future Technologies, University of Turku, Turku, Finland
- Turku Centre for Computer Science, Turku, Finland
| | - Tapio Salakoski
- TurkuNLP group, Department of Future Technologies, University of Turku, Turku, Finland
- Turku Centre for Computer Science, Turku, Finland
| | - Filip Ginter
- TurkuNLP group, Department of Future Technologies, University of Turku, Turku, Finland
|
9
|
Moen H, Peltonen LM, Koivumäki M, Suhonen H, Salakoski T, Ginter F, Salanterä S. Improving Layman Readability of Clinical Narratives with Unsupervised Synonym Replacement. Stud Health Technol Inform 2018; 247:725-729. [PMID: 29678056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
We report on the development and evaluation of a prototype tool aimed to assist laymen/patients in understanding the content of clinical narratives. The tool relies largely on unsupervised machine learning applied to two large corpora of unlabeled text - a clinical corpus and a general domain corpus. A joint semantic word-space model is created for the purpose of extracting easier to understand alternatives for words considered difficult to understand by laymen. Two domain experts evaluate the tool and inter-rater agreement is calculated. When having the tool suggest ten alternatives to each difficult word, it suggests acceptable lay words for 55.51% of them. This and future manual evaluation will serve to further improve performance, where also supervised machine learning will be used.
Affiliation(s)
- Hans Moen
- Turku NLP Group, Department of Future Technologies, University of Turku, Finland
| | | | | | - Henry Suhonen
- Department of Nursing Science, University of Turku, Finland
| | - Tapio Salakoski
- Turku NLP Group, Department of Future Technologies, University of Turku, Finland
| | - Filip Ginter
- Turku NLP Group, Department of Future Technologies, University of Turku, Finland
|
10
|
Kaewphan S, Hakala K, Miekka N, Salakoski T, Ginter F. Wide-scope biomedical named entity recognition and normalization with CRFs, fuzzy matching and character level modeling. Database (Oxford) 2018; 2018:1-10. [PMID: 30239666 PMCID: PMC6146133 DOI: 10.1093/database/bay096] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 03/01/2018] [Revised: 08/16/2018] [Accepted: 08/17/2018] [Indexed: 11/13/2022]
Abstract
We present a system for automatically identifying a multitude of biomedical entities from the literature. This work is based on our previous efforts in the BioCreative VI: Interactive Bio-ID Assignment shared task in which our system demonstrated state-of-the-art performance with the highest achieved results in named entity recognition. In this paper we describe the original conditional random field-based system used in the shared task as well as experiments conducted since, including better hyperparameter tuning and character level modeling, which led to further performance improvements. For normalizing the mentions into unique identifiers we use fuzzy character n-gram matching. The normalization approach has also been improved with a better abbreviation resolution method and stricter guideline compliance resulting in vastly improved results for various entity types. All tools and models used for both named entity recognition and normalization are publicly available under open license. Database URL: https://github.com/TurkuNLP/BioCreativeVI_BioID_assignment.
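As a rough illustration of fuzzy character n-gram matching for normalization, the Python sketch below scores a mention against lexicon names by Jaccard overlap of padded character trigrams. The abstract does not specify the similarity measure, so Jaccard overlap, the padding scheme, the `threshold` value, the function names and the identifiers `ID:1` and `ID:2` are assumptions rather than details of the published system.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of the lowercased text, padded with '#'."""
    pad = "#" * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def normalize(mention, lexicon, n=3, threshold=0.5):
    """Map a mention to the identifier of the most similar lexicon name.

    lexicon: dict of name -> identifier (hypothetical identifiers here).
    Returns (identifier, score), or (None, score) if no name reaches `threshold`.
    """
    grams = char_ngrams(mention, n)
    best_name, best_score = None, 0.0
    for name in lexicon:
        score = jaccard(grams, char_ngrams(name, n))
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None or best_score < threshold:
        return None, best_score
    return lexicon[best_name], best_score


# Toy lexicon; the identifiers are placeholders, not real database accessions.
lexicon = {"interleukin 6": "ID:1", "insulin": "ID:2"}
identifier, score = normalize("Interleukin-6", lexicon)
print(identifier, round(score, 2))
```

Matching on n-gram sets tolerates small spelling and punctuation differences such as the hyphen versus space above; a threshold keeps unrelated mentions from being forced onto the nearest lexicon entry.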
Affiliation(s)
- Suwisa Kaewphan
- Turku Centre for Computer Science, Turku, Finland
- Department of Future Technologies, University of Turku, Turku, Finland
- University of Turku Graduate School, Turku, Finland
| | - Kai Hakala
- Department of Future Technologies, University of Turku, Turku, Finland
- University of Turku Graduate School, Turku, Finland
| | - Niko Miekka
- Department of Future Technologies, University of Turku, Turku, Finland
| | - Tapio Salakoski
- Turku Centre for Computer Science, Turku, Finland
- University of Turku Graduate School, Turku, Finland
| | - Filip Ginter
- Department of Future Technologies, University of Turku, Turku, Finland
|
11
|
Jiang Y, Oron TR, Clark WT, Bankapur AR, D'Andrea D, Lepore R, Funk CS, Kahanda I, Verspoor KM, Ben-Hur A, Koo DCE, Penfold-Brown D, Shasha D, Youngs N, Bonneau R, Lin A, Sahraeian SME, Martelli PL, Profiti G, Casadio R, Cao R, Zhong Z, Cheng J, Altenhoff A, Skunca N, Dessimoz C, Dogan T, Hakala K, Kaewphan S, Mehryary F, Salakoski T, Ginter F, Fang H, Smithers B, Oates M, Gough J, Törönen P, Koskinen P, Holm L, Chen CT, Hsu WL, Bryson K, Cozzetto D, Minneci F, Jones DT, Chapman S, Bkc D, Khan IK, Kihara D, Ofer D, Rappoport N, Stern A, Cibrian-Uhalte E, Denny P, Foulger RE, Hieta R, Legge D, Lovering RC, Magrane M, Melidoni AN, Mutowo-Meullenet P, Pichler K, Shypitsyna A, Li B, Zakeri P, ElShal S, Tranchevent LC, Das S, Dawson NL, Lee D, Lees JG, Sillitoe I, Bhat P, Nepusz T, Romero AE, Sasidharan R, Yang H, Paccanaro A, Gillis J, Sedeño-Cortés AE, Pavlidis P, Feng S, Cejuela JM, Goldberg T, Hamp T, Richter L, Salamov A, Gabaldon T, Marcet-Houben M, Supek F, Gong Q, Ning W, Zhou Y, Tian W, Falda M, Fontana P, Lavezzo E, Toppo S, Ferrari C, Giollo M, Piovesan D, Tosatto SCE, Del Pozo A, Fernández JM, Maietta P, Valencia A, Tress ML, Benso A, Di Carlo S, Politano G, Savino A, Rehman HU, Re M, Mesiti M, Valentini G, Bargsten JW, van Dijk ADJ, Gemovic B, Glisic S, Perovic V, Veljkovic V, Veljkovic N, Almeida-E-Silva DC, Vencio RZN, Sharan M, Vogel J, Kansakar L, Zhang S, Vucetic S, Wang Z, Sternberg MJE, Wass MN, Huntley RP, Martin MJ, O'Donovan C, Robinson PN, Moreau Y, Tramontano A, Babbitt PC, Brenner SE, Linial M, Orengo CA, Rost B, Greene CS, Mooney SD, Friedberg I, Radivojac P. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol 2016; 17:184. [PMID: 27604469 PMCID: PMC5015320 DOI: 10.1186/s13059-016-1037-6] [Citation(s) in RCA: 252] [Impact Index Per Article: 31.5] [Received: 10/26/2015] [Accepted: 08/04/2016] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. RESULTS: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regards to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. CONCLUSIONS: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.
Affiliation(s)
- Yuxiang Jiang
- Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA
| | | | - Wyatt T Clark
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Asma R Bankapur
- Department of Microbiology, Miami University, Oxford, OH, USA
| | | | | | - Christopher S Funk
- Computational Bioscience Program, University of Colorado School of Medicine, Aurora, CO, USA
| | - Indika Kahanda
- Department of Computer Science, Colorado State University, Fort Collins, CO, USA
| | - Karin M Verspoor
- Department of Computing and Information Systems, University of Melbourne, Parkville, Victoria, Australia
- Health and Biomedical Informatics Centre, University of Melbourne, Parkville, Victoria, Australia
| | - Asa Ben-Hur
- Department of Computer Science, Colorado State University, Fort Collins, CO, USA
| | | | - Duncan Penfold-Brown
- Social Media and Political Participation Lab, New York University, New York, NY, USA
- CY Data Science, New York, NY, USA
| | - Dennis Shasha
- Department of Computer Science, New York University, New York, NY, USA
| | - Noah Youngs
- CY Data Science, New York, NY, USA
- Department of Computer Science, New York University, New York, NY, USA
- Simons Center for Data Analysis, New York, NY, USA
| | - Richard Bonneau
- Department of Computer Science, New York University, New York, NY, USA
- Simons Center for Data Analysis, New York, NY, USA
- Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA
| | - Alexandra Lin
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
| | - Sayed M E Sahraeian
- Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, CA, USA
| | | | - Giuseppe Profiti
- Biocomputing Group, BiGeA, University of Bologna, Bologna, Italy
| | - Rita Casadio
- Biocomputing Group, BiGeA, University of Bologna, Bologna, Italy
| | - Renzhi Cao
- Computer Science Department, University of Missouri, Columbia, MO, USA
| | - Zhaolong Zhong
- Computer Science Department, University of Missouri, Columbia, MO, USA
| | - Jianlin Cheng
- Computer Science Department, University of Missouri, Columbia, MO, USA
| | - Adrian Altenhoff
- ETH Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Nives Skunca
- ETH Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Christophe Dessimoz
- Bioinformatics Group, Department of Computer Science, University College London, London, UK
- University of Lausanne, Lausanne, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Tunca Dogan
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Kai Hakala
- Department of Information Technology, University of Turku, Turku, Finland
- University of Turku Graduate School, University of Turku, Turku, Finland
| | - Suwisa Kaewphan
- Department of Information Technology, University of Turku, Turku, Finland
- University of Turku Graduate School, University of Turku, Turku, Finland
- Turku Centre for Computer Science, Turku, Finland
| | - Farrokh Mehryary
- Department of Information Technology, University of Turku, Turku, Finland
- University of Turku Graduate School, University of Turku, Turku, Finland
| | - Tapio Salakoski
- Department of Information Technology, University of Turku, Turku, Finland
- Turku Centre for Computer Science, Turku, Finland
| | - Filip Ginter
- Department of Information Technology, University of Turku, Turku, Finland
| | - Hai Fang
- University of Bristol, Bristol, UK
| | | | | | | | - Petri Törönen
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland
| | - Patrik Koskinen
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland
| | - Liisa Holm
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland
- Department of Biological and Environmental Sciences, University of Helsinki, Helsinki, Finland
| | - Ching-Tai Chen
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Wen-Lian Hsu
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Kevin Bryson
- Bioinformatics Group, Department of Computer Science, University College London, London, UK
| | - Domenico Cozzetto
- Bioinformatics Group, Department of Computer Science, University College London, London, UK
| | - Federico Minneci
- Bioinformatics Group, Department of Computer Science, University College London, London, UK
| | - David T Jones
- Bioinformatics Group, Department of Computer Science, University College London, London, UK
| | - Samuel Chapman
- Department of Computational Science and Engineering, North Carolina A&T State University, Greensboro, NC, USA
| | - Dukka Bkc
- Department of Computational Science and Engineering, North Carolina A&T State University, Greensboro, NC, USA
| | - Ishita K Khan
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
| | - Daisuke Kihara
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
| | - Dan Ofer
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Nadav Rappoport
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
- School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Amos Stern
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
- School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Elena Cibrian-Uhalte
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Paul Denny
- Centre for Cardiovascular Genetics, Institute of Cardiovascular Science, University College London, London, UK
| | - Rebecca E Foulger
- Centre for Cardiovascular Genetics, Institute of Cardiovascular Science, University College London, London, UK
| | - Reija Hieta
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Duncan Legge
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Ruth C Lovering
- Centre for Cardiovascular Genetics, Institute of Cardiovascular Science, University College London, London, UK
| | - Michele Magrane
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Anna N Melidoni
- Centre for Cardiovascular Genetics, Institute of Cardiovascular Science, University College London, London, UK
| | | | - Klemens Pichler
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Aleksandra Shypitsyna
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Biao Li
- Buck Institute for Research on Aging, Novato, CA, USA
| | - Pooya Zakeri
- Department of Electrical Engineering, STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, Leuven, Belgium
- iMinds Department Medical Information Technologies, Leuven, Belgium
| | - Sarah ElShal
- Department of Electrical Engineering, STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, Leuven, Belgium
- iMinds Department Medical Information Technologies, Leuven, Belgium
| | - Léon-Charles Tranchevent
- Inserm UMR-S1052, CNRS UMR5286, Cancer Research Centre of Lyon, Lyon, France
- Université de Lyon 1, Villeurbanne, France
- Centre Léon Bérard, Lyon, France
| | - Sayoni Das
- Institute of Structural and Molecular Biology, University College London, London, UK
| | - Natalie L Dawson
- Institute of Structural and Molecular Biology, University College London, London, UK
| | - David Lee
- Institute of Structural and Molecular Biology, University College London, London, UK
| | - Jonathan G Lees
- Institute of Structural and Molecular Biology, University College London, London, UK
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, London, UK
| | | | | | - Alfonso E Romero
- Department of Computer Science, Centre for Systems and Synthetic Biology, Royal Holloway University of London, Egham, UK
| | - Rajkumar Sasidharan
- Department of Molecular, Cell and Developmental Biology, University of California at Los Angeles, Los Angeles, CA, USA
| | - Haixuan Yang
- School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland
| | - Alberto Paccanaro
- Department of Computer Science, Centre for Systems and Synthetic Biology, Royal Holloway University of London, Egham, UK
| | - Jesse Gillis
- Stanley Institute for Cognitive Genomics Cold Spring Harbor Laboratory, New York, NY, USA
| | | | - Paul Pavlidis
- Department of Psychiatry and Michael Smith Laboratories, University of British Columbia, Vancouver, Canada
| | - Shou Feng
- Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA
| | - Juan M Cejuela
- Department for Bioinformatics and Computational Biology-I12, Technische Universität München, Garching, Germany
| | - Tatyana Goldberg
- Department for Bioinformatics and Computational Biology-I12, Technische Universität München, Garching, Germany
| | - Tobias Hamp
- Department for Bioinformatics and Computational Biology-I12, Technische Universität München, Garching, Germany
| | - Lothar Richter
- Department for Bioinformatics and Computational Biology-I12, Technische Universität München, Garching, Germany
| | - Asaf Salamov
- DOE Joint Genome Institute, Walnut Creek, CA, USA
| | - Toni Gabaldon
- Bioinformatics and Genomics, Centre for Genomic Regulation, Barcelona, Spain
- Universitat Pompeu Fabra, Barcelona, Spain
- Institució Catalana de Recerca i Estudis Avançats, Barcelona, Spain
| | - Marina Marcet-Houben
- Bioinformatics and Genomics, Centre for Genomic Regulation, Barcelona, Spain
- Universitat Pompeu Fabra, Barcelona, Spain
| | - Fran Supek
- Universitat Pompeu Fabra, Barcelona, Spain
- Division of Electronics, Rudjer Boskovic Institute, Zagreb, Croatia
- EMBL/CRG Systems Biology Research Unit, Centre for Genomic Regulation, Barcelona, Spain
| | - Qingtian Gong
- State Key Laboratory of Genetic Engineering, Collaborative Innovation Center of Genetics and Development, Department of Biostatistics and Computational Biology, School of Life Science, Fudan University, Shanghai, China
- Children's Hospital of Fudan University, Shanghai, China
| | - Wei Ning
- State Key Laboratory of Genetic Engineering, Collaborative Innovation Center of Genetics and Development, Department of Biostatistics and Computational Biology, School of Life Science, Fudan University, Shanghai, China
- Children's Hospital of Fudan University, Shanghai, China
| | - Yuanpeng Zhou
- State Key Laboratory of Genetic Engineering, Collaborative Innovation Center of Genetics and Development, Department of Biostatistics and Computational Biology, School of Life Science, Fudan University, Shanghai, China
- Children's Hospital of Fudan University, Shanghai, China
| | - Weidong Tian
- State Key Laboratory of Genetic Engineering, Collaborative Innovation Center of Genetics and Development, Department of Biostatistics and Computational Biology, School of Life Science, Fudan University, Shanghai, China
- Children's Hospital of Fudan University, Shanghai, China
| | - Marco Falda
- Department of Molecular Medicine, University of Padua, Padua, Italy
| | - Paolo Fontana
- Research and Innovation Center, Edmund Mach Foundation, San Michele all'Adige, Italy
| | - Enrico Lavezzo
- Department of Molecular Medicine, University of Padua, Padua, Italy
| | - Stefano Toppo
- Department of Molecular Medicine, University of Padua, Padua, Italy
| | - Carlo Ferrari
- Department of Information Engineering, University of Padua, Padova, Italy
| | - Manuel Giollo
- Department of Information Engineering, University of Padua, Padova, Italy
- Department of Biomedical Sciences, University of Padua, Padova, Italy
| | - Damiano Piovesan
- Department of Information Engineering, University of Padua, Padova, Italy
| | - Silvio C E Tosatto
- Department of Information Engineering, University of Padua, Padova, Italy
| | - Angela Del Pozo
- Instituto De Genetica Medica y Molecular, Hospital Universitario de La Paz, Madrid, Spain
| | - José M Fernández
- Spanish National Bioinformatics Institute, Spanish National Cancer Research Institute, Madrid, Spain
| | - Paolo Maietta
- Structural and Computational Biology Programme, Spanish National Cancer Research Institute, Madrid, Spain
| | - Alfonso Valencia
- Structural and Computational Biology Programme, Spanish National Cancer Research Institute, Madrid, Spain
| | - Michael L Tress
- Structural and Computational Biology Programme, Spanish National Cancer Research Institute, Madrid, Spain
| | - Alfredo Benso
- Control and Computer Engineering Department, Politecnico di Torino, Torino, Italy
| | - Stefano Di Carlo
- Control and Computer Engineering Department, Politecnico di Torino, Torino, Italy
| | - Gianfranco Politano
- Control and Computer Engineering Department, Politecnico di Torino, Torino, Italy
| | - Alessandro Savino
- Control and Computer Engineering Department, Politecnico di Torino, Torino, Italy
| | - Hafeez Ur Rehman
- National University of Computer & Emerging Sciences, Islamabad, Pakistan
| | - Matteo Re
- Anacleto Lab, Dipartimento di informatica, Università degli Studi di Milano, Milan, Italy
| | - Marco Mesiti
- Anacleto Lab, Dipartimento di informatica, Università degli Studi di Milano, Milan, Italy
| | - Giorgio Valentini
- Anacleto Lab, Dipartimento di informatica, Università degli Studi di Milano, Milan, Italy
| | - Joachim W Bargsten
- Applied Bioinformatics, Bioscience, Wageningen University and Research Centre, Wageningen, Netherlands
| | - Aalt D J van Dijk
- Applied Bioinformatics, Bioscience, Wageningen University and Research Centre, Wageningen, Netherlands
- Biometris, Wageningen University, Wageningen, Netherlands
| | - Branislava Gemovic
- Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinca, University of Belgrade, Belgrade, Serbia
| | - Sanja Glisic
- Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinca, University of Belgrade, Belgrade, Serbia
| | - Vladmir Perovic
- Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinca, University of Belgrade, Belgrade, Serbia
| | - Veljko Veljkovic
- Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinca, University of Belgrade, Belgrade, Serbia
| | - Nevena Veljkovic
- Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinca, University of Belgrade, Belgrade, Serbia
| | | | - Ricardo Z N Vencio
- Department of Computing and Mathematics FFCLRP-USP, University of Sao Paulo, Ribeirao Preto, Brazil
| | - Malvika Sharan
- Institute for Molecular Infection Biology, University of Würzburg, Würzburg, Germany
| | - Jörg Vogel
- Institute for Molecular Infection Biology, University of Würzburg, Würzburg, Germany
| | - Lakesh Kansakar
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Shanshan Zhang
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Slobodan Vucetic
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Zheng Wang
- University of Southern Mississippi, Hattiesburg, MS, USA
| | - Michael J E Sternberg
- Centre for Integrative Systems Biology and Bioinformatics, Department of Life Sciences, Imperial College London, London, UK
| | - Mark N Wass
- School of Biosciences, University of Kent, Canterbury, Kent, UK
| | - Rachael P Huntley
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Maria J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Claire O'Donovan
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Peter N Robinson
- Institut für Medizinische Genetik und Humangenetik, Charité - Universitätsmedizin Berlin, Berlin, Germany
| | - Yves Moreau
- Department of Electrical Engineering ESAT-SCD and IBBT-KU Leuven Future Health Department, Katholieke Universiteit Leuven, Leuven, Belgium
| | | | - Patricia C Babbitt
- California Institute for Quantitative Biosciences, University of California San Francisco, San Francisco, CA, USA
| | - Steven E Brenner
- Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, CA, USA
| | - Michal Linial
- Department of Chemical Biology, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Christine A Orengo
- Institute of Structural and Molecular Biology, University College London, London, UK
| | - Burkhard Rost
- Department for Bioinformatics and Computational Biology-I12, Technische Universität München, Garching, Germany
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Sean D Mooney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
| | - Iddo Friedberg
- Department of Microbiology, Miami University, Oxford, OH, USA
- Department of Computer Science, Miami University, Oxford, OH, USA
| | - Predrag Radivojac
- Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA
| |
|
12
|
Mehryary F, Kaewphan S, Hakala K, Ginter F. Filtering large-scale event collections using a combination of supervised and unsupervised learning for event trigger classification. J Biomed Semantics 2016; 7:27. [PMID: 27175227 PMCID: PMC4864999 DOI: 10.1186/s13326-016-0070-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/28/2015] [Accepted: 05/01/2016] [Indexed: 11/19/2022] Open
Abstract
Background Biomedical event extraction is one of the key tasks in biomedical text mining, supporting various applications such as database curation and hypothesis generation. Several systems, some of which have been applied at a large scale, have been introduced to solve this task. Past studies have shown that the identification of the phrases describing biological processes, also known as trigger detection, is a crucial part of event extraction, and notable overall performance gains can be obtained by solely focusing on this sub-task. In this paper we propose a novel approach for filtering falsely identified triggers from large-scale event databases, thus improving the quality of knowledge extraction. Methods Our method relies on state-of-the-art word embeddings, event statistics gathered from the whole biomedical literature, and both supervised and unsupervised machine learning techniques. We focus on EVEX, an event database covering the whole PubMed and PubMed Central Open Access literature containing more than 40 million extracted events. The most frequent EVEX trigger words are hierarchically clustered, and the resulting cluster tree is pruned to identify words that can never act as triggers regardless of their context. For rarely occurring trigger words we introduce a supervised approach trained on the combination of trigger word classification produced by the unsupervised clustering method and manual annotation. Results The method is evaluated on the official test set of BioNLP Shared Task on Event Extraction. The evaluation shows that the method can be used to improve the performance of the state-of-the-art event extraction systems. This successful effort also translates into removing 1,338,075 potentially incorrect events from EVEX, thus greatly improving the quality of the data. The method is not solely bound to the EVEX resource and can thus be used to improve the quality of any event extraction system or database.
Availability The data and source code for this work are available at: http://bionlp-www.utu.fi/trigger-clustering/.
Affiliation(s)
- Farrokh Mehryary
- Department of Information Technology, University of Turku, Turku, Finland; The University of Turku Graduate School (UTUGS), University of Turku, Turku, Finland
| | - Suwisa Kaewphan
- Department of Information Technology, University of Turku, Turku, Finland; The University of Turku Graduate School (UTUGS), University of Turku, Turku, Finland; Turku Centre for Computer Science (TUCS), Turku, Finland
| | - Kai Hakala
- Department of Information Technology, University of Turku, Turku, Finland; The University of Turku Graduate School (UTUGS), University of Turku, Turku, Finland
| | - Filip Ginter
- Department of Information Technology, University of Turku, Turku, Finland
| |
|
13
|
Kaewphan S, Van Landeghem S, Ohta T, Van de Peer Y, Ginter F, Pyysalo S. Cell line name recognition in support of the identification of synthetic lethality in cancer from text. Bioinformatics 2016; 32:276-82. [PMID: 26428294 PMCID: PMC4708107 DOI: 10.1093/bioinformatics/btv570] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 04/23/2015] [Revised: 09/08/2015] [Accepted: 09/27/2015] [Indexed: 01/28/2023] Open
Abstract
MOTIVATION The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus. RESULTS We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers. AVAILABILITY AND IMPLEMENTATION The manually annotated datasets, the cell line dictionary, derived corpora, NERsuite models and the results of the large-scale run on unannotated texts are available under open licenses at http://turkunlp.github.io/Cell-line-recognition/. CONTACT sukaew@utu.fi.
Affiliation(s)
- Suwisa Kaewphan
- Turku Centre for Computer Science (TUCS), 20520 Turku, Finland, Department of Information Technology, University of Turku, 20014, Finland, University of Turku Graduate School (UTUGS), University of Turku, 20014, Finland
| | - Sofie Van Landeghem
- Department of Plant Systems Biology, VIB, Ghent 9000, Belgium, Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent 9052, Belgium
| | | | - Yves Van de Peer
- Department of Plant Systems Biology, VIB, Ghent 9000, Belgium, Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent 9052, Belgium, Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium, Genomics Research Institute, University of Pretoria, Pretoria, South Africa
| | - Filip Ginter
- Department of Information Technology, University of Turku, 20014, Finland
| | - Sampo Pyysalo
- Department of Information Technology, University of Turku, 20014, Finland, Language Technology Lab (LTL), University of Cambridge, Cambridge CB3 9DA, United Kingdom
| |
|
14
|
Hakala K, Van Landeghem S, Salakoski T, Van de Peer Y, Ginter F. Application of the EVEX resource to event extraction and network construction: Shared Task entry and result analysis. BMC Bioinformatics 2015; 16 Suppl 16:S3. [PMID: 26551766 PMCID: PMC4642107 DOI: 10.1186/1471-2105-16-s16-s3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Modern methods for mining biomolecular interactions from literature typically make predictions based solely on the immediate textual context, in effect a single sentence. No prior work has been published on extending this context to the information automatically gathered from the whole biomedical literature. Thus, our motivation for this study is to explore whether mutually supporting evidence, aggregated across several documents, can be utilized to improve the performance of state-of-the-art event extraction systems. RESULTS In the GE task, our re-ranking approach led to a modest performance increase and resulted in the first rank of the official Shared Task results with 50.97% F-score. Additionally, in this paper we explore and evaluate the usage of distributed vector representations for this challenge. CONCLUSIONS For the GRN task, we were able to produce a gene regulatory network from the EVEX data, warranting the use of such generic large-scale text mining data in network biology settings. A detailed performance and error analysis provides more insight into the relatively low recall rates.
|
15
|
|
16
|
Moen H, Ginter F, Marsi E, Peltonen LM, Salakoski T, Salanterä S. Care episode retrieval: distributional semantic models for information retrieval in the clinical domain. BMC Med Inform Decis Mak 2015; 15 Suppl 2:S2. [PMID: 26099735 PMCID: PMC4474584 DOI: 10.1186/1472-6947-15-s2-s2] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Indexed: 11/10/2022] Open
Abstract
Patients' health-related information is stored in electronic health records (EHRs) by health service providers. These records include sequential documentation of care episodes in the form of clinical notes. EHRs are used throughout the health care sector by professionals, administrators and patients, primarily for clinical purposes, but also for secondary purposes such as decision support and research. The vast amounts of information in EHR systems complicate information management and increase the risk of information overload. Therefore, clinicians and researchers need new tools to manage the information stored in the EHRs. A common use case is, given a (possibly unfinished) care episode, to retrieve the most similar care episodes among the records. This paper presents several methods for information retrieval, focusing on care episode retrieval, based on textual similarity, where similarity is measured through domain-specific modelling of the distributional semantics of words. Models include variants of random indexing and the semantic neural network model word2vec. Two novel methods are introduced that utilize the ICD-10 codes attached to care episodes to better induce domain-specificity in the semantic model. We report on experimental evaluation of care episode retrieval that circumvents the lack of human judgements regarding episode relevance. Results suggest that several of the methods proposed outperform a state-of-the-art search engine (Lucene) on the retrieval task.
|
17
|
Laippala V, Viljanen T, Airola A, Kanerva J, Salanterä S, Salakoski T, Ginter F. Statistical parsing of varieties of clinical Finnish. Artif Intell Med 2014; 61:131-6. [PMID: 24680097 DOI: 10.1016/j.artmed.2014.02.002] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 05/30/2013] [Revised: 02/18/2014] [Accepted: 02/20/2014] [Indexed: 10/25/2022]
Abstract
OBJECTIVES In this paper, we study the development and domain-adaptation of statistical syntactic parsers for three different clinical domains in Finnish. METHODS AND MATERIALS The materials include text from daily nursing notes written by nurses in an intensive care unit, physicians' notes from cardiology patients' health records, and daily nursing notes from cardiology patients' health records. The parsing is performed with the statistical parser of Bohnet (http://code.google.com/p/mate-tools/, accessed: 22 November 2013). RESULTS A parser trained only on general language performs poorly in all clinical subdomains, the labelled attachment score (LAS) ranging from 59.4% to 71.4%, whereas domain data combined with general language gives better results, the LAS varying between 67.2% and 81.7%. However, even a small amount of clinical domain data quickly outperforms this, and clinical data from other domains is also more beneficial (LAS 71.3-80.0%) than general language alone. The best results (LAS 77.4-84.6%) are achieved by using the combination of all the clinical treebanks as training data. CONCLUSIONS In order to develop a good syntactic parser for clinical language variants, a general language resource is not mandatory, while data from clinical fields is. However, in addition to data from the exact same clinical domain, data from other clinical domains is also useful.
Affiliation(s)
- Veronika Laippala
- Department of French Studies, University of Turku, FI-20014, Finland.
| | - Timo Viljanen
- Department of Information Technology, University of Turku, FI-20014, Finland.
| | - Antti Airola
- Department of Information Technology, University of Turku, FI-20014, Finland.
| | - Jenna Kanerva
- Department of Information Technology, University of Turku, FI-20014, Finland.
| | - Sanna Salanterä
- Department of Nursing Science, University of Turku, FI-20014, Finland; The Hospital District of Southwest Finland, PL 52, FI-20521 Turku, Finland.
| | - Tapio Salakoski
- Department of Information Technology, University of Turku, FI-20014, Finland.
| | - Filip Ginter
- Department of Information Technology, University of Turku, FI-20014, Finland.
| |
|
18
|
Haverinen K, Nyblom J, Viljanen T, Laippala V, Kohonen S, Missilä A, Ojala S, Salakoski T, Ginter F. Building the essential resources for Finnish: the Turku Dependency Treebank. Lang Resour Eval 2013. [DOI: 10.1007/s10579-013-9244-1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 10/26/2022]
|
19
|
Abstract
Background We present a system for extracting biomedical events (detailed descriptions of biomolecular interactions) from research articles, developed for the BioNLP'11 Shared Task. Our goal is to develop a system easily adaptable to different event schemes, following the theme of the BioNLP'11 Shared Task: generalization, the extension of event extraction to varied biomedical domains. Our system extends our BioNLP'09 Shared Task winning Turku Event Extraction System, which uses support vector machines to first detect event-defining words, followed by detection of their relationships. Results Our current system successfully predicts events for every domain case introduced in the BioNLP'11 Shared Task, being the only system to participate in all eight tasks and all of their subtasks, with best performance in four tasks. Following the Shared Task, we improve the system on the Infectious Diseases task from 42.57% to 53.87% F-score, bringing performance into line with the similar GENIA Event Extraction and Epigenetics and Post-translational Modifications tasks. We evaluate the machine learning performance of the system by calculating learning curves for all tasks, detecting areas where additional annotated data could be used to improve performance. Finally, we evaluate the use of system output on external articles as additional training data in a form of self-training. Conclusions We show that the updated Turku Event Extraction System can easily be adapted to all presently available event extraction targets, with competitive performance in most tasks. The scope of the performance gains between the 2009 and 2011 BioNLP Shared Tasks indicates event extraction is still a new field requiring more work. We provide several analyses of event extraction methods and performance, highlighting potential future directions for continued development.
Affiliation(s)
- Jari Björne
- Department of Information Technology, University of Turku, Turku Centre for Computer Science (TUCS), Joukahaisenkatu 3-5, 20520 Turku, Finland.
| | | | | |
|
20
|
Björne J, Heimonen J, Ginter F, Airola A, Pahikkala T, Salakoski T. Extracting contextualized complex biological events with rich graph-based feature sets. Comput Intell 2011. [DOI: 10.1111/j.1467-8640.2011.00399.x] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Indexed: 11/27/2022]
|
21
|
Abstract
Motivation: There has recently been a notable shift in biomedical information extraction (IE) from relation models toward the more expressive event model, facilitated by the maturation of basic tools for biomedical text analysis and the availability of manually annotated resources. The event model allows detailed representation of complex natural language statements and can support a number of advanced text mining applications ranging from semantic search to pathway extraction. A recent collaborative evaluation demonstrated the potential of event extraction systems, yet there have so far been no studies of the generalization ability of the systems nor the feasibility of large-scale extraction. Results: This study considers event-based IE at PubMed scale. We introduce a system combining publicly available, state-of-the-art methods for domain parsing, named entity recognition and event extraction, and test the system on a representative 1% sample of all PubMed citations. We present the first evaluation of the generalization performance of event extraction systems to this scale and show that despite its computational complexity, event extraction from the entire PubMed is feasible. We further illustrate the value of the extraction approach through a number of analyses of the extracted information. Availability: The event detection system and extracted data are open source licensed and available at http://bionlp.utu.fi/. Contact: jari.bjorne@utu.fi
Affiliation(s)
- Jari Björne
- Department of Information Technology, University of Turku, Turku, Finland.
| | | | | | | | | |
|
22
|
|
23
|
Laippala V, Ginter F, Pyysalo S, Salakoski T. Towards automated processing of clinical Finnish: Sublanguage analysis and a rule-based parser. Int J Med Inform 2009; 78:e7-12. [DOI: 10.1016/j.ijmedinf.2009.02.005] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 10/31/2008] [Revised: 01/28/2009] [Accepted: 02/10/2009] [Indexed: 10/21/2022]
|
24
|
Airola A, Pyysalo S, Björne J, Pahikkala T, Ginter F, Salakoski T. All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinformatics 2008; 9 Suppl 11:S2. [PMID: 19025688 PMCID: PMC2586751 DOI: 10.1186/1471-2105-9-s11-s2] [Citation(s) in RCA: 112] [Impact Index Per Article: 7.0] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND Automated extraction of protein-protein interactions (PPI) is an important and widely studied task in biomedical text mining. We propose a graph kernel based approach for this task. In contrast to earlier approaches to PPI extraction, the introduced all-paths graph kernel has the capability to make use of full, general dependency graphs representing the sentence structure. RESULTS We evaluate the proposed method on five publicly available PPI corpora, providing the most comprehensive evaluation done for a machine learning based PPI-extraction system. We additionally perform a detailed evaluation of the effects of training and testing on different resources, providing insight into the challenges involved in applying a system beyond the data it was trained on. Our method is shown to achieve state-of-the-art performance with respect to comparable evaluations, with 56.4 F-score and 84.8 AUC on the AImed corpus. CONCLUSION We show that the graph kernel approach performs at a state-of-the-art level in PPI extraction, and note the possible extension to the task of extracting complex interactions. Cross-corpus results provide further insight into how the learning generalizes beyond individual corpora. Further, we identify several pitfalls that can make evaluations of PPI-extraction systems incomparable, or even invalid. These include incorrect cross-validation strategies and problems related to comparing F-score results achieved on different evaluation resources. Recommendations for avoiding these pitfalls are provided.
Affiliation(s)
- Antti Airola
- Turku Centre for Computer Science (TUCS) and the Department of IT, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
- Sampo Pyysalo
- Turku Centre for Computer Science (TUCS) and the Department of IT, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
- Jari Björne
- Turku Centre for Computer Science (TUCS) and the Department of IT, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
- Tapio Pahikkala
- Turku Centre for Computer Science (TUCS) and the Department of IT, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
- Filip Ginter
- Turku Centre for Computer Science (TUCS) and the Department of IT, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
- Tapio Salakoski
- Turku Centre for Computer Science (TUCS) and the Department of IT, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
|
25
|
Abstract
BACKGROUND: Growing interest in the application of natural language processing methods to biomedical text has led to an increasing number of corpora and methods targeting protein-protein interaction (PPI) extraction. However, there is no general consensus regarding PPI annotation and consequently resources are largely incompatible and methods are difficult to evaluate. RESULTS: We present the first comparative evaluation of the diverse PPI corpora, performing quantitative evaluation using two separate information extraction methods as well as detailed statistical and qualitative analyses of their properties. For the evaluation, we unify the corpus PPI annotations to a shared level of information, consisting of undirected, untyped binary interactions of non-static types with no identification of the words specifying the interaction, no negations, and no interaction certainty. We find that the F-score performance of a state-of-the-art PPI extraction method varies on average 19 percentage units and in some cases over 30 percentage units between the different evaluated corpora. The differences stemming from the choice of corpus can thus be substantially larger than differences between the performance of PPI extraction methods, which suggests definite limits on the ability to compare methods evaluated on different resources. We analyse a number of potential sources for these differences and identify factors explaining approximately half of the variance. We further suggest ways in which the difficulty of the PPI extraction tasks codified by different corpora can be determined to advance comparability. Our analysis also identifies points of agreement and disagreement in PPI corpus annotation that are rarely explicitly stated by the authors of the corpora. CONCLUSIONS: Our comparative analysis uncovers key similarities and differences between the diverse PPI corpora, thus taking an important step towards standardization. In the course of this study we have created a major practical contribution in converting the corpora into a shared format. The conversion software is freely available at http://mars.cs.utu.fi/PPICorpora.
Affiliation(s)
- Sampo Pyysalo
- Turku Centre for Computer Science (TUCS) and the Department of IT, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
- Antti Airola
- Turku Centre for Computer Science (TUCS) and the Department of IT, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
- Juho Heimonen
- Turku Centre for Computer Science (TUCS) and the Department of IT, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
- Jari Björne
- Turku Centre for Computer Science (TUCS) and the Department of IT, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
- Filip Ginter
- Turku Centre for Computer Science (TUCS) and the Department of IT, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
- Tapio Salakoski
- Turku Centre for Computer Science (TUCS) and the Department of IT, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
|
26
|
Pyysalo S, Ginter F, Heimonen J, Björne J, Boberg J, Järvinen J, Salakoski T. BioInfer: a corpus for information extraction in the biomedical domain. BMC Bioinformatics 2007; 8:50. [PMID: 17291334 PMCID: PMC1808065 DOI: 10.1186/1471-2105-8-50] [Citation(s) in RCA: 160] [Impact Index Per Article: 9.4] [Received: 11/21/2006] [Accepted: 02/09/2007] [Indexed: 12/22/2022] Open
Abstract
Background: Lately, there has been a great interest in the application of information extraction methods to the biomedical domain, in particular, to the extraction of relationships of genes, proteins, and RNA from scientific publications. The development and evaluation of such methods requires annotated domain corpora. Results: We present BioInfer (Bio Information Extraction Resource), a new public resource providing an annotated corpus of biomedical English. We describe an annotation scheme capturing named entities and their relationships along with a dependency analysis of sentence syntax. We further present ontologies defining the types of entities and relationships annotated in the corpus. Currently, the corpus contains 1100 sentences from abstracts of biomedical research articles annotated for relationships, named entities, as well as syntactic dependencies. Supporting software is provided with the corpus. The corpus is unique in the domain in combining these annotation types for a single set of sentences, and in the level of detail of the relationship annotation. Conclusion: We introduce a corpus targeted at protein, gene, and RNA relationships which serves as a resource for the development of information extraction systems and their components such as parsers and domain analyzers. The corpus will be maintained and further developed with a current version being available at .
Affiliation(s)
- Sampo Pyysalo
- Turku Centre for Computer Science (TUCS), and the Department of IT, University of Turku, Lemminkäisenkatu 14a, 20520 Turku, Finland
- Filip Ginter
- Turku Centre for Computer Science (TUCS), and the Department of IT, University of Turku, Lemminkäisenkatu 14a, 20520 Turku, Finland
- Juho Heimonen
- Turku Centre for Computer Science (TUCS), and the Department of IT, University of Turku, Lemminkäisenkatu 14a, 20520 Turku, Finland
- Jari Björne
- Turku Centre for Computer Science (TUCS), and the Department of IT, University of Turku, Lemminkäisenkatu 14a, 20520 Turku, Finland
- Jorma Boberg
- Turku Centre for Computer Science (TUCS), and the Department of IT, University of Turku, Lemminkäisenkatu 14a, 20520 Turku, Finland
- Jouni Järvinen
- Turku Centre for Computer Science (TUCS), and the Department of IT, University of Turku, Lemminkäisenkatu 14a, 20520 Turku, Finland
- Tapio Salakoski
- Turku Centre for Computer Science (TUCS), and the Department of IT, University of Turku, Lemminkäisenkatu 14a, 20520 Turku, Finland
|
27
|
Pyysalo S, Ginter F, Pahikkala T, Boberg J, Järvinen J, Salakoski T. Evaluation of two dependency parsers on biomedical corpus targeted at protein-protein interactions. Int J Med Inform 2005; 75:430-42. [PMID: 16099201 DOI: 10.1016/j.ijmedinf.2005.06.009] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 04/15/2005] [Accepted: 06/30/2005] [Indexed: 11/18/2022]
Abstract
We present an evaluation of Link Grammar and Connexor Machinese Syntax, two major broad-coverage dependency parsers, on a custom hand-annotated corpus consisting of sentences regarding protein-protein interactions. In the evaluation, we apply the notion of an interaction subgraph, which is the subgraph of a dependency graph expressing a protein-protein interaction. We measure the performance of the parsers for recovery of individual dependencies, fully correct parses, and interaction subgraphs. For Link Grammar, an open system that can be inspected in detail, we further perform a comprehensive failure analysis, report specific causes of error, and suggest potential modifications to the grammar. We find that both parsers perform worse on biomedical English than previously reported on general English. While Connexor Machinese Syntax significantly outperforms Link Grammar, the failure analysis suggests specific ways in which the latter could be modified for better performance in the domain.
Affiliation(s)
- Sampo Pyysalo
- Turku Centre for Computer Science (TUCS), Department of Computer Science, University of Turku, Lemminkäisenkatu 14A, 20520 Turku, Finland.
|
28
|
Pahikkala T, Ginter F, Boberg J, Järvinen J, Salakoski T. Contextual weighting for Support Vector Machines in literature mining: an application to gene versus protein name disambiguation. BMC Bioinformatics 2005; 6:157. [PMID: 15972097 PMCID: PMC1180820 DOI: 10.1186/1471-2105-6-157] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Received: 07/12/2004] [Accepted: 06/22/2005] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND: The ability to distinguish between genes and proteins is essential for understanding biological text. Support Vector Machines (SVMs) have been proven to be very efficient in general data mining tasks. We explore their capability for the gene versus protein name disambiguation task. RESULTS: We incorporated into the conventional SVM a weighting scheme based on distances of context words from the word to be disambiguated. This weighting scheme increased the performance of SVMs by five percentage points giving performance better than 85% as measured by the area under ROC curve and outperformed the Weighted Additive Classifier, which also incorporates the weighting, and the Naive Bayes classifier. CONCLUSION: We show that the performance of SVMs can be improved by the proposed weighting scheme. Furthermore, our results suggest that in this study the increase of the classification performance due to the weighting is greater than that obtained by selecting the underlying classifier or the kernel part of the SVM.
Affiliation(s)
- Tapio Pahikkala
- Department of Information Technology, University of Turku and Turku Centre for Computer Science (TUCS), Lemminkäisenkatu 14 A, 20520 Turku, Finland
- Filip Ginter
- Department of Information Technology, University of Turku and Turku Centre for Computer Science (TUCS), Lemminkäisenkatu 14 A, 20520 Turku, Finland
- Jorma Boberg
- Department of Information Technology, University of Turku and Turku Centre for Computer Science (TUCS), Lemminkäisenkatu 14 A, 20520 Turku, Finland
- Jouni Järvinen
- Department of Information Technology, University of Turku and Turku Centre for Computer Science (TUCS), Lemminkäisenkatu 14 A, 20520 Turku, Finland
- Tapio Salakoski
- Department of Information Technology, University of Turku and Turku Centre for Computer Science (TUCS), Lemminkäisenkatu 14 A, 20520 Turku, Finland
|
29
|
Abstract
Plasma zinc and copper levels and the copper/zinc ratio of 22 intrinsic asthma patients were compared with those of 33 healthy control subjects. Five of the intrinsic asthma patients were aspirin (ASA) intolerant. The plasma zinc content was significantly lower in patients than in control individuals (0.80 ± 0.01 mg/L versus 0.89 ± 0.02 mg/L), while the plasma copper level and copper/zinc ratio were significantly higher in the asthma group than in the control group (1.28 ± 0.03 mg/L and 1.61 ± 0.04 versus 1.06 ± 0.02 mg/L and 1.21 ± 0.02, respectively; mean ± SE). The role of the essential trace elements zinc and copper and of cytokines in the pathogenesis of asthma is discussed.
Affiliation(s)
- J Kadrabová
- Institute of Preventive and Clinical Medicine, Bratislava, Slovak Republic
|