1
Li L, Li H, Ishdorj TO, Zheng C, Su Y. MDNNSyn: A Multi-Modal Deep Learning Framework for Drug Synergy Prediction. IEEE J Biomed Health Inform 2024; 28:6225-6236. [PMID: 38954565 DOI: 10.1109/jbhi.2024.3421916] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2024]
Abstract
Synergistic drug combination prediction based on computational models has been widely studied and applied in the cancer field. However, most models only consider the interactions between drug pairs and specific cell lines, without taking into account the multiple drug-drug and cell line-cell line biological relationships that also largely affect synergistic mechanisms. To this end, we propose a multi-modal deep learning framework, termed MDNNSyn, which applies multi-source information and trains multi-modal features to infer potential synergistic drug combinations. MDNNSyn extracts topology modality features by applying a multi-layer hypergraph neural network to a drug synergy hypergraph and constructs semantic modality features through a similarity strategy. A multi-modal fusion network layer with a gated neural network is then employed for synergy score prediction. MDNNSyn is compared with five classic and state-of-the-art prediction methods on the DrugCombDB and Oncology-Screen datasets. The model achieves area under the curve (AUC) scores of 0.8682 and 0.9013 on the two datasets, an improvement of 3.70% and 2.71% over the second-best model. A case study indicates that MDNNSyn is capable of detecting potential synergistic drug combinations.
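The abstract does not specify how the gated fusion layer combines the topology and semantic features. As a minimal sketch of one common form of gated fusion, and not the MDNNSyn implementation, the following assumes two equal-length feature vectors and a per-dimension sigmoid gate computed from their concatenation. The function name `gated_fusion` and the parameters `gate_weights` and `gate_bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(topology, semantic, gate_weights, gate_bias):
    """Blend two equal-length feature vectors with a per-dimension gate.

    Assumption (not from the paper): gate i is
    sigmoid(gate_weights[i] . [topology; semantic] + gate_bias[i]),
    and the fused value is g * topology[i] + (1 - g) * semantic[i].
    """
    concat = list(topology) + list(semantic)
    fused = []
    for i, (t, s) in enumerate(zip(topology, semantic)):
        z = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(z)
        fused.append(g * t + (1.0 - g) * s)
    return fused


# With all-zero gate parameters every gate is 0.5, so the result is the
# element-wise mean of the two modalities.
zero_weights = [[0.0] * 4 for _ in range(2)]
fused = gated_fusion([1.0, 2.0], [3.0, 4.0], zero_weights, [0.0, 0.0])
```

In a trained model the gate parameters would be learned, letting the network weight one modality more heavily for some feature dimensions than for others.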
2
Minimum Information and Quality Standards for Conducting, Reporting, and Organizing In Vitro Research. Handb Exp Pharmacol 2020; 257:177-196. [PMID: 31628600 DOI: 10.1007/164_2019_284] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 12/12/2022]
Abstract
Insufficient description of experimental practices can contribute to difficulties in reproducing research findings. In response to this, "minimum information" guidelines have been developed for different disciplines. These standards help ensure that the complete experiment is described, including both experimental protocols and data processing methods, allowing a critical evaluation of the whole process and the potential recreation of the work. Selected examples of minimum information checklists with relevance for in vitro research are presented here and are collected by and registered at the MIBBI/FAIRsharing Information Resource portal. In addition, to support integrative research and to allow for comparisons and data sharing across studies, ontologies and vocabularies need to be defined and integrated across areas of in vitro research. As examples, this chapter addresses ontologies for cells and bioassays and discusses their importance for in vitro studies. Finally, specific quality requirements for important in vitro research tools (like chemical probes, antibodies, and cell lines) are suggested, and remaining issues are discussed.
3
Korch C, Varella-Garcia M. Tackling the Human Cell Line and Tissue Misidentification Problem Is Needed for Reproducible Biomedical Research. ACTA ACUST UNITED AC 2018. [DOI: 10.1016/j.yamp.2018.07.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 12/19/2022]
4
Jeong I, Yu N, Jang I, Jun Y, Kim MS, Choi J, Lee B, Lee S. GEMiCCL: mining genotype and expression data of cancer cell lines with elaborate visualization. Database: The Journal of Biological Databases and Curation 2018; 2018:4991663. [PMID: 29726944 PMCID: PMC5932466 DOI: 10.1093/database/bay041] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 12/04/2017] [Accepted: 04/05/2018] [Indexed: 12/21/2022]
Abstract
Database URL: GEMiCCL is available at https://www.kobic.kr/GEMICCL/.
Affiliation(s)
- Inhae Jeong
  - Department of Bio-Information Science, Ewha Womans University, Seoul 03760, Republic of Korea
- Namhee Yu
  - Department of Life Science, Ewha Womans University, Seoul 03760, Republic of Korea
- Insu Jang
  - Korean Research Institute of Bioscience and Biotechnology, Korean Bioinformation Center, Daejeon 34141, Republic of Korea
- Yukyung Jun
  - Ewha Research Center for Systems Biology, Ewha Womans University, Seoul 03760, Republic of Korea
- Min-Seo Kim
  - Korean Research Institute of Bioscience and Biotechnology, Korean Bioinformation Center, Daejeon 34141, Republic of Korea
- Jinhyuk Choi
  - Korean Research Institute of Bioscience and Biotechnology, Korean Bioinformation Center, Daejeon 34141, Republic of Korea
- Byungwook Lee
  - Korean Research Institute of Bioscience and Biotechnology, Korean Bioinformation Center, Daejeon 34141, Republic of Korea
- Sanghyuk Lee
  - Department of Bio-Information Science, Ewha Womans University, Seoul 03760, Republic of Korea
  - Department of Life Science, Ewha Womans University, Seoul 03760, Republic of Korea
  - Ewha Research Center for Systems Biology, Ewha Womans University, Seoul 03760, Republic of Korea
5
Ong E, Sarntivijai S, Jupp S, Parkinson H, He Y. Comparison, alignment, and synchronization of cell line information between CLO and EFO. BMC Bioinformatics 2017; 18:557. [PMID: 29322915 PMCID: PMC5763470 DOI: 10.1186/s12859-017-1979-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022]
Abstract
BACKGROUND The Experimental Factor Ontology (EFO) is an application ontology driven by experimental variables, including cell lines, to organize and describe the diverse experimental variables and data residing in the EMBL-EBI resources. The Cell Line Ontology (CLO) is an OBO community-based ontology that contains information on immortalized cell lines and relevant experimental components. EFO integrates and extends ontologies from the bio-ontology community to drive a number of practical applications. It is desirable that the community shares design patterns and therefore that EFO reuses the cell line representation from CLO. There are, however, challenges to be addressed when developing a common ontology design pattern for representing cell lines in both EFO and CLO. RESULTS In this study, we developed a strategy to compare and map cell line terms between EFO and CLO. We examined Cellosaurus resources for EFO-CLO cross-references. Text labels of cell lines from both ontologies were verified by biological information axiomatized in each source. The study resulted in the identification of 873 EFO-CLO aligned and 344 EFO unique immortalized permanent cell lines. All of these cell lines were updated to CLO and the cell line related information was merged. A design pattern that integrates EFO and CLO was also developed. CONCLUSION Our study compared, aligned, and synchronized the cell line information between CLO and EFO. The final updated CLO will be examined as the candidate ontology to import and replace eligible EFO cell line classes, thereby supporting interoperability in the bio-ontology domain. Our mapping pipeline illustrates the use of ontology in aiding biological data standardization and integration through the biological and semantic content of cell lines.
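The label-comparison step described in this abstract can be illustrated with a small sketch. This is not the authors' pipeline: it assumes a simple normalisation rule (lower-casing and dropping non-alphanumeric characters), and the function names and example labels are hypothetical. The abstract notes that label matches were additionally verified against biological information in each ontology, which this sketch does not attempt.

```python
import re


def normalize(label):
    """Lower-case a label and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


def align_labels(source_labels, target_labels):
    """Split source labels into those matching a target label and the rest.

    Returns (aligned, unique): aligned maps each matched source label to
    the target label it matched; unique lists the unmatched source labels.
    """
    target_index = {normalize(label): label for label in target_labels}
    aligned, unique = {}, []
    for label in source_labels:
        match = target_index.get(normalize(label))
        if match is not None:
            aligned[label] = match
        else:
            unique.append(label)
    return aligned, unique


# Hypothetical labels, chosen only to show the two outcomes.
aligned, unique = align_labels(["HeLa", "MCF-7", "XYZ-1"], ["Hela", "MCF7"])
```

Name-only matching can merge distinct cell lines that share a normalised label, so any real alignment would need a verification step such as checking species or tissue of origin.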
Affiliation(s)
- Edison Ong
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI USA
  - Samples, Phenotypes, and Ontologies Team, European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Hinxton, Cambridge, UK
- Sirarat Sarntivijai
  - Samples, Phenotypes, and Ontologies Team, European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Hinxton, Cambridge, UK
- Simon Jupp
  - Samples, Phenotypes, and Ontologies Team, European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Hinxton, Cambridge, UK
- Helen Parkinson
  - Samples, Phenotypes, and Ontologies Team, European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Hinxton, Cambridge, UK
- Yongqun He
  - Center of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI USA
  - Unit of Laboratory Animal Medicine, University of Michigan, Ann Arbor, MI USA
6
Abstract
BACKGROUND Cell lines and cell types are extensively studied in biomedical research, yielding a significant number of publications each year. Identifying cell lines and cell types precisely in publications is crucial for scientific reproducibility and knowledge integration. There are efforts to standardise cell nomenclature based on ontology development to support the FAIR principles for cell knowledge. However, it is important to analyse the usage of cell nomenclature in publications at a large scale to understand the level of uptake of cell nomenclature in the literature by scientists. In this study, we analyse the usage of cell nomenclature, both in vivo and in vitro, in biomedical literature by using text mining methods and present our results. RESULTS We identified 59% of the cell type classes in the Cell Ontology and 13% of the cell line classes in the Cell Line Ontology in the literature. Our analysis showed that cell line nomenclature is much more ambiguous than cell type nomenclature. However, trends indicate that standardised nomenclature for cell lines and cell types is being increasingly used in publications by scientists. CONCLUSIONS Our findings provide insight into how experimental cells are described in publications, may allow for improved standardisation of cell type and cell line nomenclature, and can be utilised to develop efficient text mining applications on cell types and cell lines. All data generated in this study is available at https://github.com/shenay/CellNomenclatureStudy.
Affiliation(s)
- Şenay Kafkas
  - Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, 4700 KAUST, Thuwal, 23955-6900 Saudi Arabia
- Sirarat Sarntivijai
  - The European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Robert Hoehndorf
  - Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, 4700 KAUST, Thuwal, 23955-6900 Saudi Arabia
7
Reid YA. Best practices for naming, receiving, and managing cells in culture. In Vitro Cell Dev Biol Anim 2017; 53:761-774. [PMID: 28986713 DOI: 10.1007/s11626-017-0199-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 07/24/2017] [Accepted: 09/05/2017] [Indexed: 12/26/2022]
Abstract
One of the first considerations in using an existing cell line or establishing a new cell line is the detailed, proactive planning of all phases of cell line management. It is necessary to have a practitioner who is well trained in cell culture best practices and who has experience in receiving a new cell line into the laboratory, the correct and appropriate use of a cell line name, the preparation of cell banks, microscopic observation of cells in culture, growth optimization, cell counting, and cell subcultivation, as well as detailed protocols on how to expand and store cells. Indeed, the practitioner should manage all cell culture activities by ensuring that the appropriate certified facilities, equipment, and validated supplies and reagents are in place.
Affiliation(s)
- Yvonne A Reid
  - ATCC, 10801 University Blvd., Manassas, VA, 20110, USA.
8
Yu M, Selvaraj SK, Liang-Chu MMY, Aghajani S, Busse M, Yuan J, Lee G, Peale F, Klijn C, Bourgon R, Kaminker JS, Neve RM. A resource for cell line authentication, annotation and quality control. Nature 2015; 520:307-11. [PMID: 25877200 DOI: 10.1038/nature14397] [Citation(s) in RCA: 286] [Impact Index Per Article: 31.8] [Received: 05/27/2014] [Accepted: 03/09/2015] [Indexed: 01/25/2023]
Abstract
Cell line misidentification, contamination and poor annotation affect scientific reproducibility. Here we outline simple measures to detect or avoid cross-contamination, present a framework for cell line annotation linked to short tandem repeat and single nucleotide polymorphism profiles, and provide a catalogue of synonymous cell lines. This resource will enable our community to eradicate the use of misidentified lines and generate credible cell-based data.
Affiliation(s)
- Mamie Yu
  - Department of Discovery Oncology, Genentech Inc., South San Francisco, California 94080, USA
- Suresh K Selvaraj
  - Department of Discovery Oncology, Genentech Inc., South San Francisco, California 94080, USA
- May M Y Liang-Chu
  - Department of Discovery Oncology, Genentech Inc., South San Francisco, California 94080, USA
- Sahar Aghajani
  - Department of Bioinformatics and Computational Biology, Genentech Inc., South San Francisco, California 94080, USA
- Matthew Busse
  - Department of Bioinformatics and Computational Biology, Genentech Inc., South San Francisco, California 94080, USA
- Jean Yuan
  - Department of Bioinformatics and Computational Biology, Genentech Inc., South San Francisco, California 94080, USA
- Genee Lee
  - Department of Discovery Oncology, Genentech Inc., South San Francisco, California 94080, USA
- Franklin Peale
  - Department of Pathology, Genentech Inc., South San Francisco, California 94080, USA
- Christiaan Klijn
  - Department of Bioinformatics and Computational Biology, Genentech Inc., South San Francisco, California 94080, USA
- Richard Bourgon
  - Department of Bioinformatics and Computational Biology, Genentech Inc., South San Francisco, California 94080, USA
- Joshua S Kaminker
  - Department of Bioinformatics and Computational Biology, Genentech Inc., South San Francisco, California 94080, USA
- Richard M Neve
  - Department of Discovery Oncology, Genentech Inc., South San Francisco, California 94080, USA
9
Vempati UD, Przydzial MJ, Chung C, Abeyruwan S, Mir A, Sakurai K, Visser U, Lemmon VP, Schürer SC. Formalization, annotation and analysis of diverse drug and probe screening assay datasets using the BioAssay Ontology (BAO). PLoS One 2012; 7:e49198. [PMID: 23155465 PMCID: PMC3498356 DOI: 10.1371/journal.pone.0049198] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.9] [Received: 04/22/2012] [Accepted: 10/04/2012] [Indexed: 11/30/2022]
Abstract
Huge amounts of high-throughput screening (HTS) data for probe and drug development projects are being generated in the pharmaceutical industry and more recently in the public sector. The resulting experimental datasets are increasingly being disseminated via publicly accessible repositories. However, existing repositories lack sufficient metadata to describe the experiments and are often difficult for non-experts to navigate. The lack of standardized descriptions and semantics of biological assays and screening results hinders targeted data retrieval, integration, aggregation, and analyses across different HTS datasets, for example to infer mechanisms of action of small molecule perturbagens. To address these limitations, we created the BioAssay Ontology (BAO). BAO has been developed with a focus on data integration and analysis, enabling the classification of assays and screening results by concepts that relate to format, assay design, technology, target, and endpoint. Previously, we reported on the higher-level design of BAO and on the semantic querying capabilities offered by the ontology-indexed triple store of HTS data. Here, we report on our detailed design, annotation pipeline, substantially enlarged annotation knowledgebase, and analysis results. We used BAO to annotate assays from the largest public HTS data repository, PubChem, and demonstrate its utility to categorize and analyze diverse HTS results from numerous experiments. BAO is publicly available from the NCBO BioPortal at http://bioportal.bioontology.org/ontologies/1533. BAO provides controlled terminology and uniform scope to report probe and drug discovery screening assays and results. BAO leverages description logic to formalize the domain knowledge and facilitate semantic integration with diverse other resources. As a consequence, BAO offers the potential to infer new knowledge from a corpus of assay results, for example molecular mechanisms of action of perturbagens.
Affiliation(s)
- Uma D. Vempati
  - Center for Computational Science, University of Miami, Miami, Florida, United States of America
- Magdalena J. Przydzial
  - Center for Computational Science, University of Miami, Miami, Florida, United States of America
- Caty Chung
  - Center for Computational Science, University of Miami, Miami, Florida, United States of America
- Saminda Abeyruwan
  - Department of Computer Science, University of Miami, Miami, Florida, United States of America
- Ahsan Mir
  - Center for Computational Science, University of Miami, Miami, Florida, United States of America
- Kunie Sakurai
  - Center for Computational Science, University of Miami, Miami, Florida, United States of America
- Ubbo Visser
  - Department of Computer Science, University of Miami, Miami, Florida, United States of America
- Vance P. Lemmon
  - The Miami Project to Cure Paralysis, Department of Neurological Surgery, University of Miami, Miami, Florida, United States of America
- Stephan C. Schürer
  - Center for Computational Science, University of Miami, Miami, Florida, United States of America
  - Department of Molecular and Cellular Pharmacology, University of Miami, Miami, Florida, United States of America
10
Ganzinger M, He S, Breuhahn K, Knaup P. On the ontology based representation of cell lines. PLoS One 2012; 7:e48584. [PMID: 23144907 PMCID: PMC3492450 DOI: 10.1371/journal.pone.0048584] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 03/23/2012] [Accepted: 09/26/2012] [Indexed: 11/23/2022]
Abstract
Cell lines are frequently used as highly standardized and reproducible in vitro models for biomedical analyses and assays. Cell lines are distributed by cell banks that operate databases describing their products. However, the description of the cell lines' properties is not standardized across different cell banks. Existing cell line-related ontologies mostly focus on the description of the cell lines' names, but do not cover aspects like the origin or optimal growth conditions. The objective of this work is to develop an ontology that allows for a more comprehensive description of cell lines and their metadata, which should cover the data elements provided by cell banks. This will provide the basis for the standardized annotation of cell lines and corresponding assays in biomedical research. In addition, the ontology will be the foundation for automated evaluation of such assays and their respective protocols in the future. To accomplish this, a broad range of cell bank databases as well as existing ontologies were analyzed in a comprehensive manner. We identified existing ontologies capable of covering different aspects of the cell line domain. However, not all data fields derived from the cell banks' databases could be mapped to existing ontologies. As a result, we created a new ontology called cell culture ontology (CCONT), integrating existing ontologies where possible. CCONT provides classes from the areas of cell line identification, origin, cell line properties, propagation and tests performed.
Affiliation(s)
- Matthias Ganzinger
  - Institute of Medical Biometry and Informatics, Heidelberg University, Heidelberg, Germany.
11
Abstract
Integrative Biology (IB) uses experimental or computational quantitative technologies to characterize biological systems at the molecular, cellular, tissue and population levels. IB typically involves the integration of data, knowledge and capabilities across disciplinary boundaries in order to solve complex problems. We identify a series of bioinformatics problems posed by interdisciplinary integration: (i) data integration that interconnects structured data across related biomedical domains; (ii) ontology integration that brings jargon, terminologies and taxonomies from various disciplines into a unified network of ontologies; (iii) knowledge integration that integrates disparate knowledge elements from multiple sources; (iv) service integration that builds applications out of services provided by different vendors. We argue that IB can benefit significantly from the integration solutions enabled by Semantic Web (SW) technologies. The SW enables scientists to share content beyond the boundaries of applications and websites, resulting in a web of data that is meaningful and understandable to any computer. In this review, we provide insight into how SW technologies can be used to build open, standardized and interoperable solutions for interdisciplinary integration on a global basis. We present a rich set of case studies in systems biology, integrative neuroscience, bio-pharmaceutics and translational medicine to highlight the technical features and benefits of SW applications in IB.
Affiliation(s)
- Huajun Chen
  - College of Computer Science, Zhejiang University, Hangzhou, 310027, P.R. China.
12
Harmston N, Filsell W, Stumpf MPH. Which species is it? Species-driven gene name disambiguation using random walks over a mixture of adjacency matrices. Bioinformatics 2011; 28:254-60. [PMID: 22135416 DOI: 10.1093/bioinformatics/btr640] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/14/2022]
Abstract
MOTIVATION The scientific literature contains a wealth of information about biological systems. Manual curation lacks the scalability to extract this information due to the ever-increasing number of papers being published. The development and application of text mining technologies has been proposed as a way of dealing with this problem. However, the inter-species ambiguity of the genomic nomenclature makes mapping of gene mentions identified in text to their corresponding Entrez gene identifiers an extremely difficult task. We propose a novel method, which transforms a MEDLINE record into a mixture of adjacency matrices; by performing a random walk over the resulting graph, we can perform multi-class supervised classification, allowing the assignment of taxonomy identifiers to individual gene mentions. The ability to achieve good performance at this task has a direct impact on the performance of normalizing gene mentions to Entrez gene identifiers. Such graph mixtures add flexibility and allow us to generate probabilistic classification schemes that naturally reflect the uncertainties inherent even in literature-derived data. RESULTS Our method performs well in terms of both micro- and macro-averaged performance, achieving micro-F(1) of 0.76 and macro-F(1) of 0.36 on the publicly available DECA corpus. Re-curation of the DECA corpus was performed, with our method achieving 0.88 micro-F(1) and 0.51 macro-F(1). Our method improves over standard classification techniques [such as support vector machines (SVMs)] in a number of ways: flexibility, interpretability and its resistance to the effects of class bias in the training data. Good performance is achieved without the need for computationally expensive parse tree generation or 'bag of words classification'.
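The abstract names the technique but not its details. As a rough sketch of the general idea, and not the authors' algorithm, the following combines several adjacency matrices into a weighted mixture, row-normalises the result into transition probabilities, and runs a random walk with restart from a seed node. The restart formulation, the mixture weights, and the three-node example (one gene mention and two candidate species) are all assumptions made for illustration.

```python
def mix_adjacency(matrices, weights):
    """Weighted sum of equally sized adjacency matrices."""
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for m, w in zip(matrices, weights)) for j in range(n)]
        for i in range(n)
    ]


def row_normalize(matrix):
    """Turn each row into transition probabilities (all-zero rows stay zero)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def random_walk_with_restart(transition, seed, restart=0.3, steps=100):
    """Iterate p <- (1 - restart) * p * T + restart * seed for a fixed number of steps."""
    n = len(transition)
    p = list(seed)
    for _ in range(steps):
        walked = [sum(p[i] * transition[i][j] for i in range(n)) for j in range(n)]
        p = [(1.0 - restart) * walked[j] + restart * seed[j] for j in range(n)]
    return p


# Hypothetical graph: node 0 is a gene mention, nodes 1 and 2 are species.
# Two relation types (for example abstract-level and sentence-level
# co-occurrence) each contribute one adjacency matrix.
abstract_level = [[0, 2, 1], [2, 0, 0], [1, 0, 0]]
sentence_level = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
mixed = mix_adjacency([abstract_level, sentence_level], [0.5, 0.5])
scores = random_walk_with_restart(row_normalize(mixed), seed=[1.0, 0.0, 0.0])
```

Here the species node with the stronger combined link to the mention receives the higher score, which is the kind of probabilistic ranking the abstract describes.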
Affiliation(s)
- Nathan Harmston
  - Centre for Bioinformatics, Division of Molecular Biosciences, Imperial College London, London SW7 2AZ, UK
13
Athey BD, Cavalcoli JD, Jagadish HV, Omenn GS, Mirel B, Kretzler M, Burant C, Isokpehi RD, DeLisi C. The NIH National Center for Integrative Biomedical Informatics (NCIBI). J Am Med Inform Assoc 2011; 19:166-70. [PMID: 22101971 DOI: 10.1136/amiajnl-2011-000552] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 12/20/2022]
Abstract
The National Center for Integrative and Biomedical Informatics (NCIBI) is one of the eight NCBCs. NCIBI supports information access and data analysis for biomedical researchers, enabling them to build computational and knowledge models of biological systems to address the Driving Biological Problems (DBPs). The NCIBI DBPs have included prostate cancer progression, organ-specific complications of type 1 and 2 diabetes, bipolar disorder, and metabolic analysis of obesity syndrome. Collaborating with these and other partners, NCIBI has developed a series of software tools for exploratory analysis, concept visualization, and literature searches, as well as core database and web services resources. Many of our training and outreach initiatives have been in collaboration with the Research Centers at Minority Institutions (RCMI), integrating NCIBI and RCMI faculty and students, culminating each year in an annual workshop. Our future directions include focusing on the TranSMART data sharing and analysis initiative.
Affiliation(s)
- Brian D Athey
  - Center for Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, Michigan, USA.
14
Lu Z, Kao HY, Wei CH, Huang M, Liu J, Kuo CJ, Hsu CN, Tsai RTH, Dai HJ, Okazaki N, Cho HC, Gerner M, Solt I, Agarwal S, Liu F, Vishnyakova D, Ruch P, Romacker M, Rinaldi F, Bhattacharya S, Srinivasan P, Liu H, Torii M, Matos S, Campos D, Verspoor K, Livingston KM, Wilbur WJ. The gene normalization task in BioCreative III. BMC Bioinformatics 2011; 12 Suppl 8:S2. [PMID: 22151901 PMCID: PMC3269937 DOI: 10.1186/1471-2105-12-s8-s2] [Citation(s) in RCA: 79] [Impact Index Per Article: 6.1] [Indexed: 11/10/2022]
Abstract
BACKGROUND We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. 
By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
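The improvements quoted in this abstract (12.4%, 21.8%, and 26.6%) are consistent with relative gains of the composite system over the best single team on the gold standard. A short check using only the TAP-k scores reported above; the variable and function names are illustrative:

```python
# TAP-k scores on the gold standard, as reported in the abstract.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(new, old):
    """Percentage gain of new over old, relative to old."""
    return 100.0 * (new - old) / old


improvements = {
    k: round(relative_improvement(composite[k], best_team[k]), 1)
    for k in best_team
}
# improvements == {5: 12.4, 10: 21.8, 20: 26.6}
```

The figures are therefore relative percentages rather than absolute differences in TAP-k, which would be only about 0.04 to 0.09.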
Affiliation(s)
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, Maryland 20894, USA
| | - Hung-Yu Kao
- Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C
- Chih-Hsuan Wei
- Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C
- Minlie Huang
- Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
- Jingchen Liu
- Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
- Cheng-Ju Kuo
- Institute of Information Science, Academia Sinica, Taipei 115, Taiwan
- Chun-Nan Hsu
- Institute of Information Science, Academia Sinica, Taipei 115, Taiwan
- Information Science Institute, University of Southern California, Marina del Rey, California, USA
- Richard Tzong-Han Tsai
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan, R.O.C
- Hong-Jie Dai
- Department of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan, R.O.C
- Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C
- Naoaki Okazaki
- Interfaculty Initiative in Information Studies, University of Tokyo, Japan
- Han-Cheol Cho
- Graduate School of Information Science and Technology, University of Tokyo, Japan
- Martin Gerner
- Faculty of Life Sciences, University of Manchester, Manchester, M13 9PT, UK
- Illes Solt
- Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, 1117 Budapest, Hungary
- Shashank Agarwal
- Medical Informatics, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, USA
- Feifan Liu
- Medical Informatics, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, USA
- Dina Vishnyakova
- BiTeM Group, Division of Medical Information Sciences, University of Geneva, Switzerland
- Patrick Ruch
- BiTeM Group, Information Science Department, University of Applied Science, Geneva, Switzerland
- Fabio Rinaldi
- Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland
- Padmini Srinivasan
- Department of Computer Science, The University of Iowa, Iowa City, Iowa 52242, USA
- Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, MN 55905 USA
- Manabu Torii
- Lab of Text Intelligence in Biomedicine, Georgetown University Medical Center, 4000 Reservoir Rd., NW, Washington, DC 20057 USA
- Sergio Matos
- DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- David Campos
- DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Karin Verspoor
- Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
- Kevin M Livingston
- Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
- W John Wilbur
- National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, Maryland 20894, USA
|
15
|
BioAssay Ontology (BAO): a semantic description of bioassays and high-throughput screening results. BMC Bioinformatics 2011; 12:257. [PMID: 21702939 PMCID: PMC3149580 DOI: 10.1186/1471-2105-12-257] [Citation(s) in RCA: 81] [Impact Index Per Article: 6.2] [Received: 01/21/2011] [Accepted: 06/24/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND High-throughput screening (HTS) is one of the main strategies to identify novel entry points for the development of small molecule chemical probes and drugs and is now commonly accessible to public sector research. Large amounts of data generated in HTS campaigns are submitted to public repositories such as PubChem, which is growing at an exponential rate. The diversity and quantity of available HTS assays and screening results pose enormous challenges to organizing, standardizing, integrating, and analyzing the datasets and thus to maximize the scientific and ultimately the public health impact of the huge investments made to implement public sector HTS capabilities. Novel approaches to organize, standardize and access HTS data are required to address these challenges. RESULTS We developed the first ontology to describe HTS experiments and screening results using expressive description logic. The BioAssay Ontology (BAO) serves as a foundation for the standardization of HTS assays and data and as a semantic knowledge model. In this paper we show important examples of formalizing HTS domain knowledge and we point out the advantages of this approach. The ontology is available online at the NCBO bioportal http://bioportal.bioontology.org/ontologies/44531. CONCLUSIONS After a large manual curation effort, we loaded BAO-mapped data triples into a RDF database store and used a reasoner in several case studies to demonstrate the benefits of formalized domain knowledge representation in BAO. The examples illustrate semantic querying capabilities where BAO enables the retrieval of inferred search results that are relevant to a given query, but are not explicitly defined. BAO thus opens new functionality for annotating, querying, and analyzing HTS datasets and the potential for discovering new knowledge by means of inference.
|
16
|
Rinaldi F, Kaljurand K, Sætre R. Terminological resources for text mining over biomedical scientific literature. Artif Intell Med 2011; 52:107-14. [PMID: 21652190 DOI: 10.1016/j.artmed.2011.04.011] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 07/09/2010] [Revised: 04/18/2011] [Accepted: 04/18/2011] [Indexed: 11/30/2022]
Abstract
OBJECTIVE We present a combined terminological resource for text mining over biomedical literature. The purpose of the resource is to allow the detection of mentions of specific biological entities in scientific publications, and their grounding to widely accepted identifiers. This is an essential process, useful in itself, and necessary as an intermediate step for almost every type of complex text mining application. METHODS We discuss some of the properties of the terminology for this domain, in particular the degree of ambiguity, which constitutes a peculiar problem for text mining applications. Without a correct recognition and disambiguation of the domain entities no reliable results can be produced. RESULTS We also discuss an application that makes use of the resulting terminological knowledge base. We annotate an existing corpus of sentences about protein interactions. The annotation consists of a normalization step that matches the terms in our resource with their actual representation in the corpus, and a disambiguation step that resolves the ambiguity of matched terms. CONCLUSION In this paper we present a large terminological resource, compiled through the aggregation of a number of different manually curated sources. We discuss the lexical properties of such resources, specifically the degree of ambiguity of the terms, and we inspect the causes of such ambiguity, in particular for protein names. This information is of vital importance for the implementation of an efficient term normalization and grounding algorithm.
Affiliation(s)
- Fabio Rinaldi
- Institute of Computational Linguistics, University of Zurich, Binzmühlestrasse 14, CH-8050 Zurich, Switzerland.
|
17
|
Ozgür A, Xiang Z, Radev DR, He Y. Mining of vaccine-associated IFN-γ gene interaction networks using the Vaccine Ontology. J Biomed Semantics 2011; 2 Suppl 2:S8. [PMID: 21624163 PMCID: PMC3102897 DOI: 10.1186/2041-1480-2-s2-s8] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.8] [Indexed: 11/13/2022] Open
Abstract
Background: Interferon-gamma (IFN-γ) is vital in vaccine-induced immune defense against bacterial and viral infections and tumor. Our recent study demonstrated the power of a literature-based discovery method in extraction and comparison of the IFN-γ and vaccine-mediated gene interaction networks. The Vaccine Ontology (VO) contains a hierarchy of vaccine names. It is hypothesized that the application of VO will enhance the prediction of IFN-γ and vaccine-mediated gene interaction network. Results: In this study, 186 specific vaccine names listed in the Vaccine Ontology (VO) and their semantic relations were used for possible improved retrieval of the IFN-γ and vaccine associated gene interactions. The application of VO allows discovery of 38 more genes and 60 more interactions. Comparison of different layers of IFN-γ networks and the example BCG vaccine-induced subnetwork led to generation of new hypotheses. By analyzing all discovered genes using centrality metrics, 32 genes were ranked high in the VO-based IFN-γ vaccine network using four centrality scores. Furthermore, 28 specific vaccines were found to be associated with these top 32 genes. These specific vaccine-gene associations were further used to generate a network of vaccine-vaccine associations. The BCG and LVS vaccines are found to be the most central vaccines in the vaccine-vaccine association network. Conclusion: Our results demonstrate that the combined usages of biomedical ontologies and centrality-based literature mining are able to significantly facilitate discovery of gene interaction networks and gene-concept associations. Availability: VO is available at: http://www.violinet.org/vaccineontology; and the SVM edit kernel for gene interaction extraction is available at: http://www.violinet.org/ifngvonet/int_ext_svm.zip
Affiliation(s)
- Arzucan Ozgür
- Unit for Laboratory Animal Medicine, University of Michigan, Ann Arbor, MI 48109, USA.
|
18
|
Harmston N, Filsell W, Stumpf MPH. What the papers say: text mining for genomics and systems biology. Hum Genomics 2010; 5:17-29. [PMID: 21106487 PMCID: PMC3500154 DOI: 10.1186/1479-7364-5-1-17] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.8] [Received: 08/06/2010] [Accepted: 08/06/2010] [Indexed: 12/11/2022] Open
Abstract
Keeping up with the rapidly growing literature has become virtually impossible for most scientists. This can have dire consequences. First, we may waste research time and resources on reinventing the wheel simply because we can no longer maintain a reliable grasp on the published literature. Second, and perhaps more detrimental, judicious (or serendipitous) combination of knowledge from different scientific disciplines, which would require following disparate and distinct research literatures, is rapidly becoming impossible for even the most ardent readers of research publications. Text mining - the automated extraction of information from (electronically) published sources - could potentially fulfil an important role - but only if we know how to harness its strengths and overcome its weaknesses. As we do not expect that the rate at which scientific results are published will decrease, text mining tools are now becoming essential in order to cope with, and derive maximum benefit from, this information explosion. In genomics, this is particularly pressing as more and more rare disease-causing variants are found and need to be understood. Not being conversant with this technology may put scientists and biomedical regulators at a severe disadvantage. In this review, we introduce the basic concepts underlying modern text mining and its applications in genomics and systems biology. We hope that this review will serve three purposes: (i) to provide a timely and useful overview of the current status of this field, including a survey of present challenges; (ii) to enable researchers to decide how and when to apply text mining tools in their own research; and (iii) to highlight how the research communities in genomics and systems biology can help to make text mining from biomedical abstracts and texts more straightforward.
Affiliation(s)
- Nathan Harmston
- Division of Molecular Biosciences, Centre for Bioinformatics, Imperial College London, 303, Wolfson Building, South Kensington Campus, London, SW7 2AZ, UK
- Wendy Filsell
- Unilever R&D, Colworth Science Park, Sharnbrook, Bedford MK44 1LQ, UK
- Michael PH Stumpf
- Division of Molecular Biosciences, Centre for Bioinformatics, Imperial College London, 303, Wolfson Building, South Kensington Campus, London, SW7 2AZ, UK
|
19
|
Shin YC, Shin SY, So I, Kwon D, Jeon JH. TRIP Database: a manually curated database of protein-protein interactions for mammalian TRP channels. Nucleic Acids Res 2010; 39:D356-61. [PMID: 20851834 PMCID: PMC3013757 DOI: 10.1093/nar/gkq814] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Indexed: 01/03/2023] Open
Abstract
Transient receptor potential (TRP) channels are a superfamily of Ca2+-permeable cation channels that translate cellular stimuli into electrochemical signals. Aberrant activity of TRP channels has been implicated in a variety of human diseases, such as neurological disorders, cardiovascular disease and cancer. To facilitate the understanding of the molecular network by which TRP channels are associated with biological and disease processes, we have developed the TRIP (TRansient receptor potential channel-Interacting Protein) Database (http://www.trpchannel.org), a manually curated database that aims to offer comprehensive information on protein–protein interactions (PPIs) of mammalian TRP channels. The TRIP Database was created by systematically curating 277 peer-reviewed literature; the current version documents 490 PPI pairs, 28 TRP channels and 297 cellular proteins. The TRIP Database provides a detailed summary of PPI data that fit into four categories: screening, validation, characterization and functional consequence. Users can find in-depth information specified in the literature on relevant analytical methods and experimental resources, such as gene constructs and cell/tissue types. The TRIP Database has user-friendly web interfaces with helpful features, including a search engine, an interaction map and a function for cross-referencing useful external databases. Our TRIP Database will provide a valuable tool to assist in understanding the molecular regulatory network of TRP channels.
Affiliation(s)
- Young-Cheul Shin
- Department of Physiology, Seoul National University College of Medicine, Seoul 110-799, Korea
|
20
|
Gerner M, Nenadic G, Bergman CM. LINNAEUS: a species name identification system for biomedical literature. BMC Bioinformatics 2010; 11:85. [PMID: 20149233 PMCID: PMC2836304 DOI: 10.1186/1471-2105-11-85] [Citation(s) in RCA: 159] [Impact Index Per Article: 11.4] [Received: 08/28/2009] [Accepted: 02/11/2010] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles. RESULTS In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS performs with 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. Our system successfully solves the problem of disambiguating uncertain species mentions, with 97% of all mentions in PubMed Central full-text documents resolved to unambiguous NCBI taxonomy identifiers. CONCLUSIONS LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at http://linnaeus.sourceforge.net/.
Affiliation(s)
- Martin Gerner
- Faculty of Life Sciences, University of Manchester, Manchester, M13 9PT, UK
- Goran Nenadic
- School of Computer Science, University of Manchester, Manchester, M13 9PL, UK
- Casey M Bergman
- Faculty of Life Sciences, University of Manchester, Manchester, M13 9PT, UK
|
21
|
|