1
|
Vagin AA, Steiner RA, Lebedev AA, Potterton L, McNicholas S, Long F, Murshudov GN. REFMAC5 dictionary: organization of prior chemical knowledge and guidelines for its use. Acta Crystallographica Section D: Biological Crystallography 2004; 60:2184-95. [PMID: 15572771 DOI: 10.1107/s0907444904023510] [Citation(s) in RCA: 1149] [Impact Index Per Article: 54.7] [Received: 04/19/2004] [Accepted: 09/22/2004] [Indexed: 11/11/2022]
Abstract
One of the most important aspects of macromolecular structure refinement is the use of prior chemical knowledge. Bond lengths, bond angles and other chemical properties are used in restrained refinement as subsidiary conditions. This contribution describes the organization and some aspects of the use of the flexible and human/machine-readable dictionary of prior chemical knowledge used by the maximum-likelihood macromolecular-refinement program REFMAC5. The dictionary stores information about monomers which represent the constitutive building blocks of biological macromolecules (amino acids, nucleic acids and saccharides) and about numerous organic/inorganic compounds commonly found in macromolecular crystallography. It also describes the modifications the building blocks undergo as a result of chemical reactions and the links required for polymer formation. More than 2000 monomer entries, 100 modification entries and 200 link entries are currently available. Algorithms and tools for updating and adding new entries to the dictionary have also been developed and are presented here. In many cases, the REFMAC5 dictionary allows entirely automatic generation of restraints within REFMAC5 refinement runs.
|
Research Support, Non-U.S. Gov't |
21 |
1149 |
2
|
Zegers-Hochschild F, Adamson GD, Dyer S, Racowsky C, de Mouzon J, Sokol R, Rienzi L, Sunde A, Schmidt L, Cooke ID, Simpson JL, van der Poel S. The International Glossary on Infertility and Fertility Care, 2017. Fertil Steril 2017; 108:393-406. [PMID: 28760517 DOI: 10.1016/j.fertnstert.2017.06.005] [Citation(s) in RCA: 739] [Impact Index Per Article: 92.4] [Received: 05/18/2017] [Accepted: 06/05/2017] [Indexed: 10/19/2022]
|
|
8 |
739 |
3
|
Michel JB, Shen YK, Aiden AP, Veres A, Gray MK, The Google Books Team, Pickett JP, Hoiberg D, Clancy D, Norvig P, Orwant J, Pinker S, Nowak MA, Aiden EL. Quantitative analysis of culture using millions of digitized books. Science 2011; 331:176-82. [PMID: 21163965 PMCID: PMC3279742 DOI: 10.1126/science.1199644] [Citation(s) in RCA: 562] [Impact Index Per Article: 40.1] [Indexed: 11/02/2022]
Abstract
We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of 'culturomics,' focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.
|
Research Support, N.I.H., Extramural |
14 |
562 |
4
|
Westbrook J, Feng Z, Chen L, Yang H, Berman HM. The Protein Data Bank and structural genomics. Nucleic Acids Res 2003; 31:489-91. [PMID: 12520059 PMCID: PMC165515 DOI: 10.1093/nar/gkg068] [Citation(s) in RCA: 261] [Impact Index Per Article: 11.9] [Indexed: 11/12/2022] Open
Abstract
The Protein Data Bank (PDB; http://www.pdb.org/) continues to be actively involved in various aspects of the informatics of structural genomics projects--developing and maintaining the Target Registration Database (TargetDB), organizing data dictionaries that will define the specification for the exchange and deposition of data with the structural genomics centers and creating software tools to capture data from standard structure determination applications.
|
research-article |
22 |
261 |
5
|
Tuddenham WJ. Glossary of terms for thoracic radiology: recommendations of the Nomenclature Committee of the Fleischner Society. AJR Am J Roentgenol 1984; 143:509-17. [PMID: 6380245 DOI: 10.2214/ajr.143.3.509] [Citation(s) in RCA: 192] [Impact Index Per Article: 4.7] [Indexed: 01/19/2023]
|
Dictionary |
41 |
192 |
6
|
Abstract
The WHO Health Promotion Glossary was written to facilitate understanding, communication and cooperation among those engaged in health promotion at the local, regional, national and global levels. Two editions of the Glossary have been released, the first in 1986 and the second in 1998, and continued revision of the document is necessary to promote consensus regarding meanings and to take account of developments in thinking and practice. In this update, 10 new terms that are to be included in the Glossary are presented. Criteria for the inclusion of terms in the Glossary are that they differentiate health promotion from other health concepts, or have a specific application or meaning when used in relation to health promotion. The terms defined here are: burden of disease; capacity building; evidence-based health promotion; global health; health impact assessment; needs assessment; self-efficacy; social marketing; sustainable health promotion strategies; and wellness. WHO will continue to periodically update the Health Promotion Glossary to ensure its relevance to the international health promotion community.
|
Journal Article |
19 |
185 |
7
|
Neamatullah I, Douglass MM, Lehman LWH, Reisner A, Villarroel M, Long WJ, Szolovits P, Moody GB, Mark RG, Clifford GD. Automated de-identification of free-text medical records. BMC Med Inform Decis Mak 2008; 8:32. [PMID: 18652655 PMCID: PMC2526997 DOI: 10.1186/1472-6947-8-32] [Citation(s) in RCA: 178] [Impact Index Per Article: 10.5] [Received: 12/19/2007] [Accepted: 07/24/2008] [Indexed: 11/29/2022] Open
Abstract
BACKGROUND Text-based patient medical records are a vital resource in medical research. In order to preserve patient confidentiality, however, the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that protected health information (PHI) be removed from medical records before they can be disseminated. Manual de-identification of large medical record databases is prohibitively expensive, time-consuming and prone to error, necessitating automatic methods for large-scale, automated de-identification. METHODS We describe an automated Perl-based de-identification software package that is generally usable on most free-text medical records, e.g., nursing notes, discharge summaries, X-ray reports, etc. The software uses lexical look-up tables, regular expressions, and simple heuristics to locate both HIPAA PHI, and an extended PHI set that includes doctors' names and years of dates. To develop the de-identification approach, we assembled a gold standard corpus of re-identified nursing notes with real PHI replaced by realistic surrogate information. This corpus consists of 2,434 nursing notes containing 334,000 words and a total of 1,779 instances of PHI taken from 163 randomly selected patient records. This gold standard corpus was used to refine the algorithm and measure its sensitivity. To test the algorithm on data not used in its development, we constructed a second test corpus of 1,836 nursing notes containing 296,400 words. The algorithm's false negative rate was evaluated using this test corpus. RESULTS Performance evaluation of the de-identification software on the development corpus yielded an overall recall of 0.967, precision value of 0.749, and fallout value of approximately 0.002. On the test corpus, a total of 90 instances of false negatives were found, or 27 per 100,000 word count, with an estimated recall of 0.943. Only one full date and one age over 89 were missed. No patient names were missed in either corpus. 
CONCLUSION We have developed a pattern-matching de-identification system based on dictionary look-ups, regular expressions, and heuristics. Evaluation based on two different sets of nursing notes collected from a U.S. hospital suggests that, in terms of recall, the software out-performs a single human de-identifier (0.81) and performs at least as well as a consensus of two human de-identifiers (0.94). The system is currently tuned to de-identify PHI in nursing notes and discharge summaries but is sufficiently generalized and can be customized to handle text files of any format. Although the accuracy of the algorithm is high, it is probably insufficient to be used to publicly disseminate medical data. The open-source de-identification software and the gold standard re-identified corpus of medical records have therefore been made available to researchers via the PhysioNet website to encourage improvements in the algorithm.
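The combination of dictionary look-ups, regular expressions, and heuristics described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' Perl software: the name list, the patterns, the replacement tags, and the function name `deidentify` are all hypothetical and chosen only to show the general shape of such a pipeline.

```python
import re

# Hypothetical look-up table of surnames; a real system would load large lexicons.
KNOWN_NAMES = {"smith", "garcia", "lee"}

# Illustrative patterns only (assumptions, not the published software's patterns).
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
WORD_RE = re.compile(r"[A-Za-z]+")


def deidentify(text):
    """Replace dates, phone numbers, and dictionary-listed names with tags."""
    # Regular-expression pass: mask numeric identifiers first.
    text = DATE_RE.sub("[DATE]", text)
    text = PHONE_RE.sub("[PHONE]", text)

    def mask_name(match):
        word = match.group(0)
        # Heuristic: mask only capitalized tokens that appear in the name lexicon.
        if word[0].isupper() and word.lower() in KNOWN_NAMES:
            return "[NAME]"
        return word

    # Dictionary look-up pass over every alphabetic token.
    return WORD_RE.sub(mask_name, text)
```

Calling `deidentify("Pt seen by Dr Smith on 3/14/2008, call 555-123-4567.")` masks the surname, the date, and the phone number while leaving the rest of the note unchanged. A toy lexicon like this one would miss most real names, which is consistent with the abstract's caution about public dissemination.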
|
Evaluation Study |
17 |
178 |
8
|
Hayamizu TF, Mangan M, Corradi JP, Kadin JA, Ringwald M. The Adult Mouse Anatomical Dictionary: a tool for annotating and integrating data. Genome Biol 2005; 6:R29. [PMID: 15774030 PMCID: PMC1088948 DOI: 10.1186/gb-2005-6-3-r29] [Citation(s) in RCA: 120] [Impact Index Per Article: 6.0] [Received: 08/31/2004] [Revised: 11/08/2004] [Accepted: 01/11/2005] [Indexed: 11/10/2022] Open
Abstract
We have developed an ontology to provide standardized nomenclature for anatomical terms in the postnatal mouse. The Adult Mouse Anatomical Dictionary is structured as a directed acyclic graph, and is organized hierarchically both spatially and functionally. The ontology will be used to annotate and integrate different types of data pertinent to anatomy, such as gene expression patterns and phenotype information, which will contribute to an integrated description of biological phenomena in the mouse.
|
Research Support, U.S. Gov't, P.H.S. |
20 |
120 |
9
|
|
Dictionary |
39 |
119 |
10
|
Abstract
This glossary aims to provide readers with some of the key terms that are relevant to a consideration of the relevance of social capital for health, and to introduce some of the debates on the concepts.
|
research-article |
22 |
113 |
11
|
Temkin JM, Gilder MR. Extraction of protein interaction information from unstructured text using a context-free grammar. Bioinformatics 2004; 19:2046-53. [PMID: 14594709 DOI: 10.1093/bioinformatics/btg279] [Citation(s) in RCA: 111] [Impact Index Per Article: 5.3] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION As research into disease pathology and cellular function continues to generate vast amounts of data pertaining to protein, gene and small molecule (PGSM) interactions, there exists a critical need to capture these results in structured formats allowing for computational analysis. Although many efforts have been made to create databases that store this information in computer-readable form, populating these sources largely requires a manual process of interpreting and extracting interaction relationships from the biological research literature. Being able to efficiently and accurately automate the extraction of interactions from unstructured text would greatly improve the content of these databases and provide a method for managing the continued growth of new literature being published. RESULTS In this paper, we describe a system for extracting PGSM interactions from unstructured text. By utilizing a lexical analyzer and context-free grammar (CFG), we demonstrate that efficient parsers can be constructed for extracting these relationships from natural language with high rates of recall and precision. Our results show that this technique achieved a recall rate of 83.5% and a precision rate of 93.1% for recognizing PGSM names and a recall rate of 63.9% and a precision rate of 70.2% for extracting interactions between these entities. In contrast to other published techniques, the use of a CFG significantly reduces the complexities of natural language processing by focusing on domain-specific structure as opposed to analyzing the semantics of a given language. Additionally, our approach provides a level of abstraction for adding new rules for extracting other types of biological relationships beyond PGSM relationships. AVAILABILITY The program and corpus are available by request from the authors.
|
Validation Study |
21 |
111 |
12
|
Pyysalo S, Airola A, Heimonen J, Björne J, Ginter F, Salakoski T. Comparative analysis of five protein-protein interaction corpora. BMC Bioinformatics 2008; 9 Suppl 3:S6. [PMID: 18426551 PMCID: PMC2349296 DOI: 10.1186/1471-2105-9-s3-s6] [Citation(s) in RCA: 91] [Impact Index Per Article: 5.4] [Indexed: 11/11/2022] Open
Abstract
BACKGROUND Growing interest in the application of natural language processing methods to biomedical text has led to an increasing number of corpora and methods targeting protein-protein interaction (PPI) extraction. However, there is no general consensus regarding PPI annotation and consequently resources are largely incompatible and methods are difficult to evaluate. RESULTS We present the first comparative evaluation of the diverse PPI corpora, performing quantitative evaluation using two separate information extraction methods as well as detailed statistical and qualitative analyses of their properties. For the evaluation, we unify the corpus PPI annotations to a shared level of information, consisting of undirected, untyped binary interactions of non-static types with no identification of the words specifying the interaction, no negations, and no interaction certainty. We find that the F-score performance of a state-of-the-art PPI extraction method varies on average 19 percentage units and in some cases over 30 percentage units between the different evaluated corpora. The differences stemming from the choice of corpus can thus be substantially larger than differences between the performance of PPI extraction methods, which suggests definite limits on the ability to compare methods evaluated on different resources. We analyse a number of potential sources for these differences and identify factors explaining approximately half of the variance. We further suggest ways in which the difficulty of the PPI extraction tasks codified by different corpora can be determined to advance comparability. Our analysis also identifies points of agreement and disagreement in PPI corpus annotation that are rarely explicitly stated by the authors of the corpora. CONCLUSIONS Our comparative analysis uncovers key similarities and differences between the diverse PPI corpora, thus taking an important step towards standardization. 
In the course of this study we have created a major practical contribution in converting the corpora into a shared format. The conversion software is freely available at http://mars.cs.utu.fi/PPICorpora.
|
Comparative Study |
17 |
91 |
13
|
Erhardt RAA, Schneider R, Blaschke C. Status of text-mining techniques applied to biomedical text. Drug Discov Today 2007; 11:315-25. [PMID: 16580973 DOI: 10.1016/j.drudis.2006.02.011] [Citation(s) in RCA: 82] [Impact Index Per Article: 4.6] [Received: 10/10/2005] [Revised: 02/08/2006] [Accepted: 02/27/2006] [Indexed: 11/16/2022]
Abstract
Scientific progress is increasingly based on knowledge and information. Knowledge is now recognized as the driver of productivity and economic growth, leading to a new focus on the role of information in the decision-making process. Most scientific knowledge is registered in publications and other unstructured representations that make it difficult to use and to integrate the information with other sources (e.g. biological databases). Making a computer understand human language has proven to be a complex achievement, but there are techniques capable of detecting, distinguishing and extracting a limited number of different classes of facts. In the biomedical field, extracting information has specific problems: complex and ever-changing nomenclature (especially genes and proteins) and the limited representation of domain knowledge.
|
Review |
18 |
82 |
14
|
Wei WQ, Cronin RM, Xu H, Lasko TA, Bastarache L, Denny JC. Development and evaluation of an ensemble resource linking medications to their indications. J Am Med Inform Assoc 2013; 20:954-61. [PMID: 23576672 PMCID: PMC3756263 DOI: 10.1136/amiajnl-2012-001431] [Citation(s) in RCA: 71] [Impact Index Per Article: 5.9] [Received: 10/21/2012] [Revised: 02/25/2013] [Accepted: 03/18/2013] [Indexed: 11/09/2022] Open
Abstract
OBJECTIVE To create a computable MEDication Indication resource (MEDI) to support primary and secondary use of electronic medical records (EMRs). MATERIALS AND METHODS We processed four public medication resources, RxNorm, Side Effect Resource (SIDER) 2, MedlinePlus, and Wikipedia, to create MEDI. We applied natural language processing and ontology relationships to extract indications for prescribable, single-ingredient medication concepts and all ingredient concepts as defined by RxNorm. Indications were coded as Unified Medical Language System (UMLS) concepts and International Classification of Diseases, 9th edition (ICD9) codes. A total of 689 extracted indications were randomly selected for manual review for accuracy using dual-physician review. We identified a subset of medication-indication pairs that optimizes recall while maintaining high precision. RESULTS MEDI contains 3112 medications and 63 343 medication-indication pairs. Wikipedia was the largest resource, with 2608 medications and 34 911 pairs. For each resource, estimated precision and recall, respectively, were 94% and 20% for RxNorm, 75% and 33% for MedlinePlus, 67% and 31% for SIDER 2, and 56% and 51% for Wikipedia. The MEDI high-precision subset (MEDI-HPS) includes indications found within either RxNorm or at least two of the three other resources. MEDI-HPS contains 13 304 unique indication pairs regarding 2136 medications. The mean±SD number of indications for each medication in MEDI-HPS is 6.22 ± 6.09. The estimated precision of MEDI-HPS is 92%. CONCLUSIONS MEDI is a publicly available, computable resource that links medications with their indications as represented by concepts and billing codes. MEDI may benefit clinical EMR applications and reuse of EMR data for research.
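The selection rule stated for MEDI-HPS (keep an indication if it appears in RxNorm, or in at least two of the three other resources) amounts to a small set computation. The sketch below is only an illustration of that rule: the function name `high_precision_subset`, the dictionary layout, and the data shape are assumptions, not part of MEDI itself.

```python
def high_precision_subset(pairs_by_resource):
    """Keep pairs found in RxNorm or in at least two of the other resources.

    pairs_by_resource maps a resource name to a set of
    (medication, indication) tuples; this layout is hypothetical.
    """
    rxnorm = pairs_by_resource.get("RxNorm", set())
    others = [pairs for name, pairs in pairs_by_resource.items() if name != "RxNorm"]

    kept = set(rxnorm)  # everything from RxNorm is kept unconditionally
    for pair in set().union(*others):
        # Count in how many non-RxNorm resources this pair appears.
        if sum(pair in pairs for pairs in others) >= 2:
            kept.add(pair)
    return kept
```

A pair reported only by Wikipedia would therefore be dropped, while a pair reported by both Wikipedia and MedlinePlus would be kept.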
|
Evaluation Study |
12 |
71 |
15
|
Chang JT, Schütze H, Altman RB. Creating an online dictionary of abbreviations from MEDLINE. J Am Med Inform Assoc 2002; 9:612-20. [PMID: 12386112 PMCID: PMC349378 DOI: 10.1197/jamia.m1139] [Citation(s) in RCA: 61] [Impact Index Per Article: 2.7] [Indexed: 11/10/2022] Open
Abstract
OBJECTIVE The growth of the biomedical literature presents special challenges for both human readers and automatic algorithms. One such challenge derives from the common and uncontrolled use of abbreviations in the literature. Each additional abbreviation increases the effective size of the vocabulary for a field. Therefore, to create an automatically generated and maintained lexicon of abbreviations, we have developed an algorithm to match abbreviations in text with their expansions. DESIGN Our method uses a statistical learning algorithm, logistic regression, to score abbreviation expansions based on their resemblance to a training set of human-annotated abbreviations. We applied it to Medstract, a corpus of MEDLINE abstracts in which abbreviations and their expansions have been manually annotated. We then ran the algorithm on all abstracts in MEDLINE, creating a dictionary of biomedical abbreviations. To test the coverage of the database, we used an independently created list of abbreviations from the China Medical Tribune. MEASUREMENTS We measured the recall and precision of the algorithm in identifying abbreviations from the Medstract corpus. We also measured the recall when searching for abbreviations from the China Medical Tribune against the database. RESULTS On the Medstract corpus, our algorithm achieves up to 83% recall at 80% precision. Applying the algorithm to all of MEDLINE yielded a database of 781,632 high-scoring abbreviations. Of all the abbreviations in the list from the China Medical Tribune, 88% were in the database. CONCLUSION We have developed an algorithm to identify abbreviations from text. We are making this available as a public abbreviation server at http://abbreviation.stanford.edu/.
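To make the task of matching abbreviations with their expansions concrete, here is a deliberately simplified Python sketch. It does not reproduce the logistic-regression scoring described in the abstract; instead it swaps in a plain first-letter heuristic for parenthesized abbreviations. The function name `find_abbreviations` and the pattern are hypothetical.

```python
import re

# Assumption: abbreviations appear in parentheses right after their long form.
PAREN_RE = re.compile(r"\(([A-Za-z]{2,10})\)")


def find_abbreviations(text):
    """Return (abbreviation, long form) pairs using a first-letter heuristic."""
    results = []
    for match in PAREN_RE.finditer(text):
        abbr = match.group(1)
        # Take as many preceding words as the abbreviation has letters.
        words = text[:match.start()].split()
        candidate = words[-len(abbr):]
        # Accept only if each word's initial matches the corresponding letter.
        if len(candidate) == len(abbr) and all(
            word[0].lower() == letter.lower()
            for word, letter in zip(candidate, abbr)
        ):
            results.append((abbr, " ".join(candidate)))
    return results
```

This heuristic misses abbreviations whose letters come from inside words or that skip function words, which is the kind of variation a learned scoring model is better placed to handle.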
|
research-article |
23 |
61 |
16
|
Talafous J, Sayre LM, Mieyal JJ, Klopman G. META. 2. A dictionary model of mammalian xenobiotic metabolism. Journal of Chemical Information and Computer Sciences 1994; 34:1326-33. [PMID: 7989398 DOI: 10.1021/ci00022a015] [Citation(s) in RCA: 60] [Impact Index Per Article: 1.9] [Indexed: 01/28/2023]
Abstract
META is a new knowledge-based expert system that provides computer simulation of the biotransformation of chemicals. The program is based on the recognition of key functional groups within the complete chemical structure and therefore can predict the metabolites of new xenobiotics. Here, we describe a comprehensive knowledge base built for the purposes of modeling mammalian metabolism with META methodology.
|
|
31 |
60 |
17
|
Abstract
As the pace of biological research accelerates, biologists are becoming increasingly reliant on computers to manage the information explosion. Biologists communicate their research findings by relying on precise biological terms; these terms then provide indices into the literature and across the growing number of biological databases. This article examines emerging techniques to access biological resources through extraction of entity names and relations among them. Information extraction has been an active area of research in natural language processing and there are promising results for information extraction applied to news stories, e.g., balanced precision and recall in the 93-95% range for identifying person, organization and location names. But these results do not seem to transfer directly to biological names, where results remain in the 75-80% range. Multiple factors may be involved, including absence of shared training and test sets for rigorous measures of progress, lack of annotated training data specific to biological tasks, pervasive ambiguity of terms, frequent introduction of new terms, and a mismatch between evaluation tasks as defined for news and real biological problems. We present evidence from a simple lexical matching exercise that illustrates some specific problems encountered when identifying biological names. We conclude by outlining a research agenda to raise performance of named entity tagging to a level where it can be used to perform tasks of biological importance.
|
|
23 |
60 |
18
|
Abstract
MOTIVATION Abbreviations are an important type of terminology in the biomedical domain. Although several groups have already created databases of biomedical abbreviations, these are either not public, or are not comprehensive, or focus exclusively on acronym-type abbreviations. We have created another abbreviation database, ADAM, which covers commonly used abbreviations and their definitions (or long-forms) within MEDLINE titles and abstracts, including both acronym and non-acronym abbreviations. RESULTS A model of recognizing abbreviations and their long-forms from titles and abstracts of MEDLINE (2006 baseline) was employed. After grouping morphological variants, 59 405 abbreviation/long-form pairs were identified. ADAM shows high precision (97.4%) and includes most of the frequently used abbreviations contained in the Unified Medical Language System (UMLS) Lexicon and the Stanford Abbreviation Database. Conversely, one-third of abbreviations in ADAM are novel insofar as they are not included in either database. About 19% of the novel abbreviations are non-acronym-type and these cover at least seven different types of short-form/long-form pairs. AVAILABILITY A free, public query interface to ADAM is available at http://arrowsmith.psych.uic.edu, and the entire database can be downloaded as a text file.
|
|
19 |
53 |
19
|
|
|
61 |
50 |
20
|
Abstract
MOTIVATION Biological literature contains many abbreviations with one particular sense in each document. However, most abbreviations do not have a unique sense across the literature. Furthermore, many documents do not contain the long forms of the abbreviations. Resolving an abbreviation in a document consists of retrieving its sense in use. Abbreviation resolution improves accuracy of document retrieval engines and of information extraction systems. RESULTS We combine an automatic analysis of Medline abstracts and linguistic methods to build a dictionary of abbreviation/sense pairs. The dictionary is used for the resolution of abbreviations occurring with their long forms. Ambiguous global abbreviations are resolved using support vector machines that have been trained on the context of each instance of the abbreviation/sense pairs, previously extracted for the dictionary set-up. The system disambiguates abbreviations with a precision of 98.9% for a recall of 98.2% (98.5% accuracy). This performance is superior in comparison with previously reported research work. AVAILABILITY The abbreviation resolution module is available at http://www.ebi.ac.uk/Rebholz/software.html.
|
|
20 |
48 |
21
|
Maki WS, McKinley LN, Thompson AG. Semantic distance norms computed from an electronic dictionary (WordNet). Behavior Research Methods, Instruments, & Computers 2004; 36:421-31. [PMID: 15641432 DOI: 10.3758/bf03195590] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.2] [Indexed: 11/08/2022]
Abstract
WordNet, an electronic dictionary (or lexical database), is a valuable resource for computational and cognitive scientists. Recent work on the computing of semantic distances among nodes (synsets) in WordNet has made it possible to build a large database of semantic distances for use in selecting word pairs for psychological research. The database now contains nearly 50,000 pairs of words that have values for semantic distance, associative strength, and similarity based on co-occurrence. Semantic distance was found to correlate weakly with these other measures but to correlate more strongly with another measure of semantic relatedness, featural similarity. Hierarchical clustering analysis suggested that the knowledge structure underlying semantic distance is similar in gross form to that underlying featural similarity. In experiments in which semantic similarity ratings were used, human participants were able to discriminate semantic distance. Thus, semantic distance as derived from WordNet appears distinct from other measures of word pair relatedness and is psychologically functional. This database may be downloaded from www.psychonomic.org/archive/.
|
|
21 |
47 |
22
|
Hoffmann R, Krallinger M, Andres E, Tamames J, Blaschke C, Valencia A. Text mining for metabolic pathways, signaling cascades, and protein networks. Sci Signal 2005; 2005:pe21. [PMID: 15886388 DOI: 10.1126/stke.2832005pe21] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.2] [Indexed: 11/02/2022]
Abstract
The complexity of the information stored in databases and publications on metabolic and signaling pathways, the high throughput of experimental data, and the growing number of publications make it imperative to provide systems to help the researcher navigate through these interrelated information resources. Text-mining methods have started to play a key role in the creation and maintenance of links between the information stored in biological databases and its original sources in the literature. These links will be extremely useful for database updating and curation, especially if a number of technical problems can be solved satisfactorily, including the identification of protein and gene names (entities in general) and the characterization of their types of interactions. The first generation of openly accessible text-mining systems, such as iHOP (Information Hyperlinked over Proteins), provides additional functions to facilitate the reconstruction of protein interaction networks, combine database and text information, and support the scientist in the formulation of novel hypotheses. The next challenge is the generation of comprehensive information regarding the general function of signaling pathways and protein interaction networks.
|
Review |
20 |
44 |
23
|
Hartel FW, de Coronado S, Dionne R, Fragoso G, Golbeck J. Modeling a description logic vocabulary for cancer research. J Biomed Inform 2005; 38:114-29. [PMID: 15797001 DOI: 10.1016/j.jbi.2004.09.001] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.2] [Received: 06/19/2004] [Indexed: 11/23/2022]
Abstract
The National Cancer Institute has developed the NCI Thesaurus, a biomedical vocabulary for cancer research, covering terminology across a wide range of cancer research domains. A major design goal of the NCI Thesaurus is to facilitate translational research. We describe: the features of Ontylog, a description logic used to build NCI Thesaurus; our methodology for enhancing the terminology through collaboration between ontologists and domain experts, and for addressing certain real world challenges arising in modeling the Thesaurus; and finally, we describe the conversion of NCI Thesaurus from Ontylog into Web Ontology Language Lite. Ontylog has proven well suited for constructing big biomedical vocabularies. We have capitalized on the Ontylog constructs Kind and Role in the collaboration process described in this paper to facilitate communication between ontologists and domain experts. The artifacts and processes developed by NCI for collaboration may be useful in other biomedical terminology development efforts.
|
|
20 |
43 |
24
|
Bray JE, Todd AE, Pearl FM, Thornton JM, Orengo CA. The CATH Dictionary of Homologous Superfamilies (DHS): a consensus approach for identifying distant structural homologues. Protein Engineering 2000; 13:153-65. [PMID: 10775657 DOI: 10.1093/protein/13.3.153] [Citation(s) in RCA: 41] [Impact Index Per Article: 1.6] [Indexed: 11/12/2022]
Abstract
A consensus approach has been developed for identifying distant structural homologues. This is based on the CATH Dictionary of Homologous Superfamilies (DHS), a database of validated multiple structural alignments annotated with consensus functional information for evolutionary protein superfamilies (URL: http://www.biochem.ucl.ac.uk/bsm/dhs). Multiple structural alignments have been generated for 362 well-populated superfamilies in the CATH structural domain database and annotated with secondary structure, physicochemical properties, functional sequence patterns and protein-ligand interaction data. Consensus functional information for each superfamily includes descriptions and keywords extracted from SWISS-PROT and the ENZYME database. The Dictionary provides a powerful resource to validate, examine and visualize key structural and functional features of each homologous superfamily. The value of the DHS, for assessing functional variability and identifying distant evolutionary relationships, is illustrated using the pyridoxal-5'-phosphate (PLP) binding aspartate aminotransferase superfamily. The DHS also provides a tool for examining sequence-structure relationships for proteins within each fold group.
|
|
25 |
41 |
25
|
Abstract
A prerequisite for all higher-level information extraction tasks is the identification of unknown names in text. Today, when large corpora can consist of billions of words, it is of utmost importance to develop accurate techniques for the automatic detection, extraction and categorization of named entities in these corpora. Although named entity recognition might be regarded as a solved problem in some domains, it still poses a significant challenge in others. In this work we focus on one of the more difficult tasks, the identification of protein names in text. This task presents several interesting difficulties because of the named entities' variant structural characteristics, their sometimes unclear status as names, the lack of common standards and fixed nomenclatures, and the specifics of the texts in the molecular biology domain in which they appear. We describe how we approached these and other difficulties in the implementation of Yapex, a system for the automatic identification of protein names in text. We also evaluate Yapex under four different notions of correctness and compare its performance to that of another publicly available system for protein name recognition.
|
Comparative Study |
23 |
41 |