1
|
Queirós P, Novikova P, Wilmes P, May P. Unification of functional annotation descriptions using text mining. Biol Chem 2021; 402:983-990. [PMID: 33984880 DOI: 10.1515/hsz-2021-0125] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2021] [Accepted: 05/03/2021] [Indexed: 02/06/2023]
Abstract
A common approach to genome annotation involves the use of homology-based tools for the prediction of the functional role of proteins. The quality of functional annotations is dependent on the reference data used, as such, choosing the appropriate sources is crucial. Unfortunately, no single reference data source can be universally considered the gold standard, thus using multiple references could potentially increase annotation quality and coverage. However, this comes with challenges, particularly due to the introduction of redundant and exclusive annotations. Through text mining it is possible to identify highly similar functional descriptions, thus strengthening the confidence of the final protein functional annotation and providing a redundancy-free output. Here we present UniFunc, a text mining approach that is able to detect similar functional descriptions with high precision. UniFunc was built as a small module and can be independently used or integrated into protein function annotation pipelines. By removing the need to individually analyse and compare annotation results, UniFunc streamlines the complementary use of multiple reference datasets.
Collapse
Affiliation(s)
| | | | - Paul Wilmes
- Systems Ecology, Esch-sur-Alzette, Luxembourg
| | - Patrick May
- Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 4362, Esch-sur-Alzette, Luxembourg
| |
Collapse
|
2
|
Becker TE, Jakobsson E. ResidueFinder: extracting individual residue mentions from protein literature. J Biomed Semantics 2021; 12:14. [PMID: 34289903 PMCID: PMC8293528 DOI: 10.1186/s13326-021-00243-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2019] [Accepted: 05/07/2021] [Indexed: 11/10/2022] Open
Abstract
Background The revolution in molecular biology has shown how protein function and structure are based on specific sequences of amino acids. Thus, an important feature in many papers is the mention of the significance of individual amino acids in the context of the entire sequence of the protein. MutationFinder is a widely used program for finding mentions of specific mutations in texts. We report on augmenting the positive attributes of MutationFinder with a more inclusive regular expression list to create ResidueFinder, which finds mentions of native amino acids as well as mutations. We also consider parameter options for both ResidueFinder and MutationFinder to explore trade-offs between precision, recall, and computational efficiency. We test our methods and software in full text as well as abstracts. Results We find there is much more variety of formats for mentioning residues in the entire text of papers than in abstracts alone. Failure to take these multiple formats into account results in many false negatives in the program. Since MutationFinder, like several other programs, was primarily tested on abstracts, we found it necessary to build an expanded regular expression list to achieve acceptable recall in full text searches. We also discovered a number of artifacts arising from PDF to text conversion, which we wrote elements in the regular expression library to address. Taking into account those factors resulted in high recall on randomly selected primary research articles. We also developed a streamlined regular expression (called “cut”) which enables a several hundredfold speedup in both MutationFinder and ResidueFinder with only a modest compromise of recall. All regular expressions were tested using expanded F-measure statistics, i.e., we compute Fβ for various values of where the larger the value of β the more recall is weighted, the smaller the value of β the more precision is weighted. Conclusions ResidueFinder is a simple, effective, and efficient program for finding individual residue mentions in primary literature starting with text files, implemented in Python, and available in SourceForge.net. The most computationally efficient versions of ResidueFinder could enable creation and maintenance of a database of residue mentions encompassing all articles in PubMed. Supplementary Information The online version contains supplementary material available at 10.1186/s13326-021-00243-3.
Collapse
Affiliation(s)
- Ton E Becker
- Department of Molecular and Integrative Physiology, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Illinois, 61801, Urbana, USA
| | - Eric Jakobsson
- Department of Molecular and Integrative Physiology, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Illinois, 61801, Urbana, USA. .,Department of Biochemistry, Program in Biophysics and Computational Biology, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Illinois, 61801, Urbana, USA.
| |
Collapse
|
3
|
Hamid MN, Friedberg I. Identifying antimicrobial peptides using word embedding with deep recurrent neural networks. Bioinformatics 2019; 35:2009-2016. [PMID: 30418485 PMCID: PMC6581433 DOI: 10.1093/bioinformatics/bty937] [Citation(s) in RCA: 62] [Impact Index Per Article: 12.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2018] [Revised: 08/27/2018] [Accepted: 11/08/2018] [Indexed: 12/11/2022] Open
Abstract
MOTIVATION Antibiotic resistance constitutes a major public health crisis, and finding new sources of antimicrobial drugs is crucial to solving it. Bacteriocins, which are bacterially produced antimicrobial peptide products, are candidates for broadening the available choices of antimicrobials. However, the discovery of new bacteriocins by genomic mining is hampered by their sequences' low complexity and high variance, which frustrates sequence similarity-based searches. RESULTS Here we use word embeddings of protein sequences to represent bacteriocins, and apply a word embedding method that accounts for amino acid order in protein sequences, to predict novel bacteriocins from protein sequences without using sequence similarity. Our method predicts, with a high probability, six yet unknown putative bacteriocins in Lactobacillus. Generalized, the representation of sequences with word embeddings preserving sequence order information can be applied to peptide and protein classification problems for which sequence similarity cannot be used. AVAILABILITY AND IMPLEMENTATION Data and source code for this project are freely available at: https://github.com/nafizh/NeuBI. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Md-Nafiz Hamid
- Interdepartmental program in Bioinformatics and Computational Biology
- Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA
| | - Iddo Friedberg
- Interdepartmental program in Bioinformatics and Computational Biology
- Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA
| |
Collapse
|
4
|
Badal VD, Kundrotas PJ, Vakser IA. Natural language processing in text mining for structural modeling of protein complexes. BMC Bioinformatics 2018; 19:84. [PMID: 29506465 PMCID: PMC5838950 DOI: 10.1186/s12859-018-2079-4] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2017] [Accepted: 02/20/2018] [Indexed: 12/04/2022] Open
Abstract
Background Structural modeling of protein-protein interactions produces a large number of putative configurations of the protein complexes. Identification of the near-native models among them is a serious challenge. Publicly available results of biomedical research may provide constraints on the binding mode, which can be essential for the docking. Our text-mining (TM) tool, which extracts binding site residues from the PubMed abstracts, was successfully applied to protein docking (Badal et al., PLoS Comput Biol, 2015; 11: e1004630). Still, many extracted residues were not relevant to the docking. Results We present an extension of the TM tool, which utilizes natural language processing (NLP) for analyzing the context of the residue occurrence. The procedure was tested using generic and specialized dictionaries. The results showed that the keyword dictionaries designed for identification of protein interactions are not adequate for the TM prediction of the binding mode. However, our dictionary designed to distinguish keywords relevant to the protein binding sites led to considerable improvement in the TM performance. We investigated the utility of several methods of context analysis, based on dissection of the sentence parse trees. The machine learning-based NLP filtered the pool of the mined residues significantly more efficiently than the rule-based NLP. Constraints generated by NLP were tested in docking of unbound proteins from the DOCKGROUND X-ray benchmark set 4. The output of the global low-resolution docking scan was post-processed, separately, by constraints from the basic TM, constraints re-ranked by NLP, and the reference constraints. The quality of a match was assessed by the interface root-mean-square deviation. The results showed significant improvement of the docking output when using the constraints generated by the advanced TM with NLP. Conclusions The basic TM procedure for extracting protein-protein binding site residues from the PubMed abstracts was significantly advanced by the deep parsing (NLP techniques for contextual analysis) in purging of the initial pool of the extracted residues. Benchmarking showed a substantial increase of the docking success rate based on the constraints generated by the advanced TM with NLP. Electronic supplementary material The online version of this article (10.1186/s12859-018-2079-4) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Varsha D Badal
- Center for Computational Biology and Department of Molecular Biosciences, The University of Kansas, Lawrence, Kansas, 66047, USA
| | - Petras J Kundrotas
- Center for Computational Biology and Department of Molecular Biosciences, The University of Kansas, Lawrence, Kansas, 66047, USA.
| | - Ilya A Vakser
- Center for Computational Biology and Department of Molecular Biosciences, The University of Kansas, Lawrence, Kansas, 66047, USA.
| |
Collapse
|
5
|
Taha K. Inferring the Functions of Proteins from the Interrelationships between Functional Categories. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:157-167. [PMID: 27723600 DOI: 10.1109/tcbb.2016.2615608] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
This study proposes a new method to determine the functions of an unannotated protein. The proteins and amino acid residues mentioned in biomedical texts associated with an unannotated protein can be considered as characteristics terms for , which are highly predictive of the potential functions of . Similarly, proteins and amino acid residues mentioned in biomedical texts associated with proteins annotated with a functional category can be considered as characteristics terms of . We introduce in this paper an information extraction system called IFP_IFC that predicts the functions of an unannotated protein by representing and each functional category by a vector of weights. Each weight reflects the degree of association between a characteristic term and (or a characteristic term and ). First, IFP_IFC constructs a network, whose nodes represent the different functional categories, and its edges the interrelationships between the nodes. Then, it determines the functions of by employing random walks with restarts on the mentioned network. The walker is the vector of . Finally, is assigned to the functional categories of the nodes in the network that are visited most by the walker. We evaluated the quality of IFP_IFC by comparing it experimentally with two other systems. Results showed marked improvement.
Collapse
|
6
|
Al-Aamri A, Taha K, Al-Hammadi Y, Maalouf M, Homouz D. Constructing Genetic Networks using Biomedical Literature and Rare Event Classification. Sci Rep 2017; 7:15784. [PMID: 29150626 PMCID: PMC5694017 DOI: 10.1038/s41598-017-16081-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2017] [Accepted: 10/24/2017] [Indexed: 12/16/2022] Open
Abstract
Text mining has become an important tool in bioinformatics research with the massive growth in the biomedical literature over the past decade. Mining the biomedical literature has resulted in an incredible number of computational algorithms that assist many bioinformatics researchers. In this paper, we present a text mining system called Gene Interaction Rare Event Miner (GIREM) that constructs gene-gene-interaction networks for human genome using information extracted from biomedical literature. GIREM identifies functionally related genes based on their co-occurrences in the abstracts of biomedical literature. For a given gene g, GIREM first extracts the set of genes found within the abstracts of biomedical literature associated with g. GIREM aims at enhancing biological text mining approaches by identifying the semantic relationship between each co-occurrence of a pair of genes in abstracts using the syntactic structures of sentences and linguistics theories. It uses a supervised learning algorithm, weighted logistic regression to label pairs of genes to related or un-related classes, and to reflect the population proportion using smaller samples. We evaluated GIREM by comparing it experimentally with other well-known approaches and a protein-protein interactions database. Results showed marked improvement.
Collapse
Affiliation(s)
- Amira Al-Aamri
- Department of Electrical and Computer Engineering, Khalifa University of Science and Technology, P.O. Box 127788, Abu Dhabi, UAE
| | - Kamal Taha
- Department of Electrical and Computer Engineering, Khalifa University of Science and Technology, P.O. Box 127788, Abu Dhabi, UAE
| | - Yousof Al-Hammadi
- Department of Electrical and Computer Engineering, Khalifa University of Science and Technology, P.O. Box 127788, Abu Dhabi, UAE
| | - Maher Maalouf
- Department of Industrial and Systems Engineering, Khalifa University of Science and Technology, P.O. Box 127788, Abu Dhabi, UAE
| | - Dirar Homouz
- Department of Physics, Khalifa University of Science and Technology, P.O. Box 127788, Abu Dhabi, UAE.
| |
Collapse
|
7
|
Taha K, Yoo PD. Predicting the functions of a protein from its ability to associate with other molecules. BMC Bioinformatics 2016; 17:34. [PMID: 26767846 PMCID: PMC4714473 DOI: 10.1186/s12859-016-0882-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2015] [Accepted: 01/05/2016] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND All proteins associate with other molecules. These associated molecules are highly predictive of the potential functions of proteins. The association of a protein and a molecule can be determined from their co-occurrences in biomedical abstracts. Extensive semantically related co-occurrences of a protein's name and a molecule's name in the sentences of biomedical abstracts can be considered as indicative of the association between the protein and the molecule. Dependency parsers extract textual relations from a text by determining the grammatical relations between words in a sentence. They can be used for determining the textual relations between proteins and molecules. Despite their success, they may extract textual relations with low precision. This is because they do not consider the semantic relationships between terms in a sentence (i.e., they consider only the structural relationships between the terms). Moreover, they may not be well suited for complex sentences and for long-distance textual relations. RESULTS We introduce an information extraction system called PPFBM that predicts the functions of unannotated proteins from the molecules that associate with these proteins. PPFBM represents each protein by the other molecules that associate with it in the abstracts referenced in the protein's entries in reliable biological databases. It automatically extracts each co-occurrence of a protein-molecule pair that represents semantic relationship between the pair. Towards this, we present novel semantic rules that identify the semantic relationship between each co-occurrence of a protein-molecule pair using the syntactic structures of sentences and linguistics theories. PPFBM determines the functions of an un-annotated protein p as follows. First, it determines the set S r of annotated proteins that is semantically similar to p by matching the molecules representing p and the annotated proteins. Then, it assigns p the functional category FC if the significance of the frequency of occurrences of S r in abstracts associated with proteins annotated with FC is statistically significantly different than the significance of the frequency of occurrences of S r in abstracts associated with proteins annotated with all other functional categories. We evaluated the quality of PPFBM by comparing it experimentally with two other systems. Results showed marked improvement. CONCLUSIONS The experimental results demonstrated that PPFBM outperforms other systems that predict protein function from the textual information found within biomedical abstracts. This is because these system do not consider the semantic relationships between terms in a sentence (i.e., they consider only the structural relationships between the terms). PPFBM's performance over these system increases steadily as the number of training protein increases. That is, PPFBM's prediction performance becomes more accurate constantly, as the size of training proteins gets larger. This is because every time a new set of test proteins is added to the current set of training proteins. A demo of PPFBM that annotates each input Yeast protein (SGD (Saccharomyces Genome Database). Available at: http://www.yeastgenome.org/download-data/curation) with the functions of Gene Ontology terms is available at: (see Appendix for more details about the demo) http://ecesrvr.kustar.ac.ae:8080/PPFBM/.
Collapse
Affiliation(s)
- Kamal Taha
- Department of Electrical and Computer Engineering, Khalifa University, Abu Dhabi, United Arab Emirates.
| | - Paul D Yoo
- Faculty of Science and Technology, Bournemouth University, Bournemouth, UK.
| |
Collapse
|
8
|
Rsite2: an efficient computational method to predict the functional sites of noncoding RNAs. Sci Rep 2016; 6:19016. [PMID: 26751501 PMCID: PMC4707467 DOI: 10.1038/srep19016] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2015] [Accepted: 12/02/2015] [Indexed: 01/11/2023] Open
Abstract
Noncoding RNAs (ncRNAs) represent a big class of important RNA molecules. Given the large number of ncRNAs, identifying their functional sites is becoming one of the most important topics in the post-genomic era, but available computational methods are limited. For the above purpose, we previously presented a tertiary structure based method, Rsite, which first calculates the distance metrics defined in Methods with the tertiary structure of an ncRNA and then identifies the nucleotides located within the extreme points in the distance curve as the functional sites of the given ncRNA. However, the application of Rsite is largely limited because of limited RNA tertiary structures. Here we present a secondary structure based computational method, Rsite2, based on the observation that the secondary structure based nucleotide distance is strongly positively correlated with that derived from tertiary structure. This makes it reasonable to replace tertiary structure with secondary structure, which is much easier to obtain and process. Moreover, we applied Rsite2 to three ncRNAs (tRNA (Lys), Diels-Alder ribozyme, and RNase P) and a list of human mitochondria transcripts. The results show that Rsite2 works well with nearly equivalent accuracy as Rsite but is much more feasible and efficient. Finally, a web-server, the source codes, and the dataset of Rsite2 are available at http://www.cuialb.cn/rsite2.
Collapse
|
9
|
Weissenbacher D, Tahsin T, Beard R, Figaro M, Rivera R, Scotch M, Gonzalez G. Knowledge-driven geospatial location resolution for phylogeographic models of virus migration. Bioinformatics 2015; 31:i348-56. [PMID: 26072502 PMCID: PMC4542781 DOI: 10.1093/bioinformatics/btv259] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open
Abstract
Summary: Diseases caused by zoonotic viruses (viruses transmittable between humans and animals) are a major threat to public health throughout the world. By studying virus migration and mutation patterns, the field of phylogeography provides a valuable tool for improving their surveillance. A key component in phylogeographic analysis of zoonotic viruses involves identifying the specific locations of relevant viral sequences. This is usually accomplished by querying public databases such as GenBank and examining the geospatial metadata in the record. When sufficient detail is not available, a logical next step is for the researcher to conduct a manual survey of the corresponding published articles. Motivation: In this article, we present a system for detection and disambiguation of locations (toponym resolution) in full-text articles to automate the retrieval of sufficient metadata. Our system has been tested on a manually annotated corpus of journal articles related to phylogeography using integrated heuristics for location disambiguation including a distance heuristic, a population heuristic and a novel heuristic utilizing knowledge obtained from GenBank metadata (i.e. a ‘metadata heuristic’). Results: For detecting and disambiguating locations, our system performed best using the metadata heuristic (0.54 Precision, 0.89 Recall and 0.68 F-score). Precision reaches 0.88 when examining only the disambiguation of location names. Our error analysis showed that a noticeable increase in the accuracy of toponym resolution is possible by improving the geospatial location detection. By improving these fundamental automated tasks, our system can be a useful resource to phylogeographers that rely on geospatial metadata of GenBank sequences. Contact:davy.weissenbacher@asu.edu
Collapse
Affiliation(s)
- Davy Weissenbacher
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA
| | - Tasnia Tahsin
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA
| | - Rachel Beard
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA
| | - Mari Figaro
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA
| | - Robert Rivera
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA
| | - Matthew Scotch
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA
| | - Graciela Gonzalez
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA
| |
Collapse
|
10
|
Badal VD, Kundrotas PJ, Vakser IA. Text Mining for Protein Docking. PLoS Comput Biol 2015; 11:e1004630. [PMID: 26650466 PMCID: PMC4674139 DOI: 10.1371/journal.pcbi.1004630] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2015] [Accepted: 10/29/2015] [Indexed: 11/18/2022] Open
Abstract
The rapidly growing amount of publicly available information from biomedical research is readily accessible on the Internet, providing a powerful resource for predictive biomolecular modeling. The accumulated data on experimentally determined structures transformed structure prediction of proteins and protein complexes. Instead of exploring the enormous search space, predictive tools can simply proceed to the solution based on similarity to the existing, previously determined structures. A similar major paradigm shift is emerging due to the rapidly expanding amount of information, other than experimentally determined structures, which still can be used as constraints in biomolecular structure prediction. Automated text mining has been widely used in recreating protein interaction networks, as well as in detecting small ligand binding sites on protein structures. Combining and expanding these two well-developed areas of research, we applied the text mining to structural modeling of protein-protein complexes (protein docking). Protein docking can be significantly improved when constraints on the docking mode are available. We developed a procedure that retrieves published abstracts on a specific protein-protein interaction and extracts information relevant to docking. The procedure was assessed on protein complexes from Dockground (http://dockground.compbio.ku.edu). The results show that correct information on binding residues can be extracted for about half of the complexes. The amount of irrelevant information was reduced by conceptual analysis of a subset of the retrieved abstracts, based on the bag-of-words (features) approach. Support Vector Machine models were trained and validated on the subset. The remaining abstracts were filtered by the best-performing models, which decreased the irrelevant information for ~ 25% complexes in the dataset. The extracted constraints were incorporated in the docking protocol and tested on the Dockground unbound benchmark set, significantly increasing the docking success rate. Protein interactions are central for many cellular processes. Physical characterization of these interactions is essential for understanding of life processes and applications in biology and medicine. Because of the inherent limitations of experimental techniques and rapid development of computational power and methodology, computer modeling is a tool of choice in many studies. Publicly available information from biomedical research is readily accessible on the Internet, providing a powerful resource for modeling of proteins and protein complexes. A major paradigm shift in modeling of protein complexes is emerging due to the rapidly expanding amount of such information, which can be used as modeling constraints. Text mining has been widely used in recreating networks of protein interactions, as well as in detecting small molecule binding sites on proteins. Combining and expanding these two well-developed areas of research, we applied the text mining to physical modeling of protein complexes (protein docking). Our procedure retrieves published abstracts on a protein-protein interaction and extracts the relevant information. The results show that correct information on binding can be obtained for about half of protein complexes. The extracted constraints were incorporated in a modeling procedure, significantly improving its performance.
Collapse
Affiliation(s)
- Varsha D. Badal
- Center for Computational Biology, The University of Kansas, Lawrence, Kansas, United States of America
| | - Petras J. Kundrotas
- Center for Computational Biology, The University of Kansas, Lawrence, Kansas, United States of America
- * E-mail: (IAV); (PJK)
| | - Ilya A. Vakser
- Center for Computational Biology, The University of Kansas, Lawrence, Kansas, United States of America
- Department of Molecular Biosciences, The University of Kansas, Lawrence, Kansas, United States of America
- * E-mail: (IAV); (PJK)
| |
Collapse
|
11
|
Abstract
The Human Genome Project has provided science with a hugely valuable resource: the blueprints for life; the specification of all of the genes that make up a human. While the genes have all been identified and deciphered, it is proteins that are the workhorses of the human body: they are essential to virtually all cell functions and are the primary mechanism through which biological function is carried out. Hence in order to fully understand what happens at a molecular level in biological organisms, and eventually to enable development of treatments for diseases where some aspect of a biological system goes awry, we must understand the functions of proteins. However, experimental characterization of protein function cannot scale to the vast amount of DNA sequence data now available. Computational protein function prediction has therefore emerged as a problem at the forefront of modern biology (Radivojac et al., Nat Methods 10(13):221-227, 2013).Within the varied approaches to computational protein function prediction that have been explored, there are several that make use of biomedical literature mining. These methods take advantage of information in the published literature to associate specific proteins with specific protein functions. In this chapter, we introduce two main strategies for doing this: association of function terms, represented as Gene Ontology terms (Ashburner et al., Nat Genet 25(1):25-29, 2000), to proteins based on information in published articles, and a paradigm called LEAP-FS (Literature-Enhanced Automated Prediction of Functional Sites) in which literature mining is used to validate the predictions of an orthogonal computational protein function prediction method.
Collapse
|
12
|
Taha K. Extracting Various Classes of Data From Biological Text Using the Concept of Existence Dependency. IEEE J Biomed Health Inform 2015; 19:1918-28. [PMID: 25616086 DOI: 10.1109/jbhi.2015.2392786] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
One of the key goals of biological natural language processing (NLP) is the automatic information extraction from biomedical publications. Most current constituency and dependency parsers overlook the semantic relationships between the constituents comprising a sentence and may not be well suited for capturing complex long-distance dependences. We propose in this paper a hybrid constituency-dependency parser for biological NLP information extraction called EDCC. EDCC aims at enhancing the state of the art of biological text mining by applying novel linguistic computational techniques that overcome the limitations of current constituency and dependency parsers outlined earlier, as follows: 1) it determines the semantic relationship between each pair of constituents in a sentence using novel semantic rules; and 2) it applies a semantic relationship extraction model that extracts information from different structural forms of constituents in sentences. EDCC can be used to extract different types of data from biological texts for purposes such as protein function prediction, genetic network construction, and protein-protein interaction detection. We evaluated the quality of EDCC by comparing it experimentally with six systems. Results showed marked improvement.
Collapse
|
13
|
Ahmed A, Smith RD, Clark JJ, Dunbar JB, Carlson HA. Recent improvements to Binding MOAD: a resource for protein-ligand binding affinities and structures. Nucleic Acids Res 2014; 43:D465-9. [PMID: 25378330 PMCID: PMC4383918 DOI: 10.1093/nar/gku1088] [Citation(s) in RCA: 65] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022] Open
Abstract
For over 10 years, Binding MOAD (Mother of All Databases; http://www.BindingMOAD.org) has been one of the largest resources for high-quality protein-ligand complexes and associated binding affinity data. Binding MOAD has grown at the rate of 1994 complexes per year, on average. Currently, it contains 23,269 complexes and 8156 binding affinities. Our annual updates curate the data using a semi-automated literature search of the references cited within the PDB file, and we have recently upgraded our website and added new features and functionalities to better serve Binding MOAD users. In order to eliminate the legacy application server of the old platform and to accommodate new changes, the website has been completely rewritten in the LAMP (Linux, Apache, MySQL and PHP) environment. The improved user interface incorporates current third-party plugins for better visualization of protein and ligand molecules, and it provides features like sorting, filtering and filtered downloads. In addition to the field-based searching, Binding MOAD now can be searched by structural queries based on the ligand. In order to remove redundancy, Binding MOAD records are clustered in different families based on 90% sequence identity. The new Binding MOAD, with the upgraded platform, features and functionalities, is now equipped to better serve its users.
Collapse
Affiliation(s)
- Aqeel Ahmed
- Department of Medicinal Chemistry, University of Michigan, 428 Church St, Ann Arbor, MI 48109-1065, USA
| | - Richard D Smith
- Department of Medicinal Chemistry, University of Michigan, 428 Church St, Ann Arbor, MI 48109-1065, USA
| | - Jordan J Clark
- Department of Medicinal Chemistry, University of Michigan, 428 Church St, Ann Arbor, MI 48109-1065, USA
| | - James B Dunbar
- Department of Medicinal Chemistry, University of Michigan, 428 Church St, Ann Arbor, MI 48109-1065, USA
| | - Heather A Carlson
- Department of Medicinal Chemistry, University of Michigan, 428 Church St, Ann Arbor, MI 48109-1065, USA
| |
Collapse
|
14
|
Eksi R, Li HD, Menon R, Wen Y, Omenn GS, Kretzler M, Guan Y. Systematically differentiating functions for alternatively spliced isoforms through integrating RNA-seq data. PLoS Comput Biol 2013; 9:e1003314. [PMID: 24244129 PMCID: PMC3820534 DOI: 10.1371/journal.pcbi.1003314] [Citation(s) in RCA: 68] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2013] [Accepted: 09/19/2013] [Indexed: 12/13/2022] Open
Abstract
Integrating large-scale functional genomic data has significantly accelerated our understanding of gene functions. However, no algorithm has been developed to differentiate functions for isoforms of the same gene using high-throughput genomic data. This is because standard supervised learning requires ‘ground-truth’ functional annotations, which are lacking at the isoform level. To address this challenge, we developed a generic framework that interrogates public RNA-seq data at the transcript level to differentiate functions for alternatively spliced isoforms. For a specific function, our algorithm identifies the ‘responsible’ isoform(s) of a gene and generates classifying models at the isoform level instead of at the gene level. Through cross-validation, we demonstrated that our algorithm is effective in assigning functions to genes, especially the ones with multiple isoforms, and robust to gene expression levels and removal of homologous gene pairs. We identified genes in the mouse whose isoforms are predicted to have disparate functionalities and experimentally validated the ‘responsible’ isoforms using data from mammary tissue. With protein structure modeling and experimental evidence, we further validated the predicted isoform functional differences for the genes Cdkn2a and Anxa6. Our generic framework is the first to predict and differentiate functions for alternatively spliced isoforms, instead of genes, using genomic data. It is extendable to any base machine learner and other species with alternatively spliced isoforms, and shifts the current gene-centered function prediction to isoform-level predictions. In mammalian genomes, a single gene can be alternatively spliced into multiple isoforms which greatly increase the functional diversity of the genome. In the human, more than 95% of multi-exon genes undergo alternative splicing. It is hard to computationally differentiate the functions for the splice isoforms of the same gene, because they are almost always annotated with the same functions and share similar sequences. In this paper, we developed a generic framework to identify the ‘responsible’ isoform(s) for each function that the gene carries out, and therefore predict functional assignment on the isoform level instead of on the gene level. Within this generic framework, we implemented and evaluated several related algorithms for isoform function prediction. We tested these algorithms through both computational evaluation and experimental validation of the predicted ‘responsible’ isoform(s) and the predicted disparate functions of the isoforms of Cdkn2a and of Anxa6. Our algorithm represents the first effort to predict and differentiate isoforms through large-scale genomic data integration.
Collapse
Affiliation(s)
- Ridvan Eksi
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Hong-Dong Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Rajasree Menon
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Yuchen Wen
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Gilbert S. Omenn
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Internal Medicine, University of Michigan, Ann Arbor, Michigan, United States of America
- * E-mail: (GSO); (MK); (YG)
| | - Matthias Kretzler
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Internal Medicine, University of Michigan, Ann Arbor, Michigan, United States of America
- * E-mail: (GSO); (MK); (YG)
| | - Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Internal Medicine, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan, United States of America
- * E-mail: (GSO); (MK); (YG)
| |
Collapse
|
15
|
Liu H, Hunter L, Kešelj V, Verspoor K. Approximate subgraph matching-based literature mining for biomedical events and relations. PLoS One 2013; 8:e60954. [PMID: 23613763 PMCID: PMC3629260 DOI: 10.1371/journal.pone.0060954] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2012] [Accepted: 03/04/2013] [Indexed: 11/23/2022] Open
Abstract
The biomedical text mining community has focused on developing techniques to automatically extract important relations between biological components and semantic events involving genes or proteins from literature. In this paper, we propose a novel approach for mining relations and events in the biomedical literature using approximate subgraph matching. Extraction of such knowledge is performed by searching for an approximate subgraph isomorphism between key contextual dependencies and input sentence graphs. Our approach significantly increases the chance of retrieving relations or events encoded within complex dependency contexts by introducing error tolerance into the graph matching process, while maintaining the extraction precision at a high level. When evaluated on practical tasks, it achieves a 51.12% F-score in extracting nine types of biological events on the GE task of the BioNLP-ST 2011 and an 84.22% F-score in detecting protein-residue associations. The performance is comparable to the reported systems across these tasks, and thus demonstrates the generalizability of our proposed approach.
Collapse
Affiliation(s)
- Haibin Liu
- National Center for Biotechnology Information, Bethesda, Maryland, United States of America.
| | | | | | | |
Collapse
|
16
|
Verspoor K, Jimeno Yepes A, Cavedon L, McIntosh T, Herten-Crabb A, Thomas Z, Plazzer JP. Annotating the biomedical literature for the human variome. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2013; 2013:bat019. [PMID: 23584833 PMCID: PMC3676157 DOI: 10.1093/database/bat019] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
This article introduces the Variome Annotation Schema, a schema that
aims to capture the core concepts and relations relevant to cataloguing and interpreting
human genetic variation and its relationship to disease, as described in the published
literature. The schema was inspired by the needs of the database curators of the
International Society for Gastrointestinal Hereditary Tumours (InSiGHT) database, but is
intended to have application to genetic variation information in a range of diseases. The
schema has been applied to a small corpus of full text journal publications on the subject
of inherited colorectal cancer. We show that the inter-annotator agreement on annotation
of this corpus ranges from 0.78 to 0.95 F-score across different entity
types when exact matching is measured, and improves to a minimum F-score
of 0.87 when boundary matching is relaxed. Relations show more variability in agreement,
but several are reliable, with the highest, cohort-has-size, reaching
0.90 F-score. We also explore the relevance of the schema to the InSiGHT
database curation process. The schema and the corpus represent an important new resource
for the development of text mining solutions that address relationships among patient
cohorts, disease and genetic variation, and therefore, we also discuss the role text
mining might play in the curation of information related to the human variome. The corpus
is available at http://opennicta.com/home/health/variome.
Collapse
Affiliation(s)
- Karin Verspoor
- National ICT Australia (NICTA), Victoria Research Laboratory, Level 2, Building 193, The University of Melbourne, Parkville VIC 3010, Australia.
| | | | | | | | | | | | | |
Collapse
|
17
|
Verspoor K, Mackinlay A, Cohn JD, Wall ME. Detection of protein catalytic sites in the biomedical literature. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2013:433-444. [PMID: 23424147 PMCID: PMC3664919] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/01/2023]
Abstract
This paper explores the application of text mining to the problem of detecting protein functional sites in the biomedical literature, and specifically considers the task of identifying catalytic sites in that literature. We provide strong evidence for the need for text mining techniques that address residue-level protein function annotation through an analysis of two corpora in terms of their coverage of curated data sources. We also explore the viability of building a text-based classifier for identifying protein functional sites, identifying the low coverage of curated data sources and the potential ambiguity of information about protein functional sites as challenges that must be addressed. Nevertheless we produce a simple classifier that achieves a reasonable ∼69% F-score on our full text silver corpus on the first attempt to address this classification task. The work has application in computational prediction of the functional significance of protein sites as well as in curation workflows for databases that capture this information.
Collapse
Affiliation(s)
- Karin Verspoor
- National ICT Australia, Victoria Research Lab, Parkville, VIC 3010 Australia
| | - Andrew Mackinlay
- National ICT Australia, Victoria Research Lab, Parkville, VIC 3010 Australia
| | - Judith D. Cohn
- Computer and Computational Sciences Division, Los Alamos National Laboratory, Los Alamos, NM 87545 USA
| | - Michael E. Wall
- Computer and Computational Sciences Division, Los Alamos National Laboratory, Los Alamos, NM 87545 USA
| |
Collapse
|
18
|
Blaschke C, Valencia A. The Functional Genomics Network in the evolution of biological text mining over the past decade. N Biotechnol 2012. [PMID: 23202358 DOI: 10.1016/j.nbt.2012.11.020] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
Abstract
Different programs of The European Science Foundation (ESF) have contributed significantly to connect researchers in Europe and beyond through several initiatives. This support was particularly relevant for the development of the areas related with extracting information from papers (text-mining) because it supported the field in its early phases long before it was recognized by the community. We review the historical development of text mining research and how it was introduced in bioinformatics. Specific applications in (functional) genomics are described like it's integration in genome annotation pipelines and the support to the analysis of high-throughput genomics experimental data, and we highlight the activities of evaluation of methods and benchmarking for which the ESF programme support was instrumental.
Collapse
Affiliation(s)
- Christian Blaschke
- Spanish National Cancer Research Centre, C/Melchor Fernández Almagro, 3, E-28029 Madrid, Spain.
| | | |
Collapse
|
19
|
Ravikumar K, Liu H, Cohn JD, Wall ME, Verspoor K. Literature mining of protein-residue associations with graph rules learned through distant supervision. J Biomed Semantics 2012; 3 Suppl 3:S2. [PMID: 23046792 PMCID: PMC3465209 DOI: 10.1186/2041-1480-3-s3-s2] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND We propose a method for automatic extraction of protein-specific residue mentions from the biomedical literature. The method searches text for mentions of amino acids at specific sequence positions and attempts to correctly associate each mention with a protein also named in the text. The methods presented in this work will enable improved protein functional site extraction from articles, ultimately supporting protein function prediction. Our method made use of linguistic patterns for identifying the amino acid residue mentions in text. Further, we applied an automated graph-based method to learn syntactic patterns corresponding to protein-residue pairs mentioned in the text. We finally present an approach to automated construction of relevant training and test data using the distant supervision model. RESULTS The performance of the method was assessed by extracting protein-residue relations from a new automatically generated test set of sentences containing high confidence examples found using distant supervision. It achieved a F-measure of 0.84 on automatically created silver corpus and 0.79 on a manually annotated gold data set for this task, outperforming previous methods. CONCLUSIONS The primary contributions of this work are to (1) demonstrate the effectiveness of distant supervision for automatic creation of training data for protein-residue relation extraction, substantially reducing the effort and time involved in manual annotation of a data set and (2) show that the graph-based relation extraction approach we used generalizes well to the problem of protein-residue association extraction. This work paves the way towards effective extraction of protein functional residues from the literature.
Collapse
Affiliation(s)
- Ke Ravikumar
- University of Colorado School of Medicine, Aurora, CO 80045, USA.
| | | | | | | | | |
Collapse
|