1
|
Peng Y, Li W, Liu Y. A Hybrid Approach for Biomarker Discovery from Microarray Gene Expression Data for Cancer Classification. Cancer Inform 2017. [DOI: 10.1177/117693510600200024] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Indexed: 11/15/2022] Open
Abstract
Microarrays allow researchers to monitor the gene expression patterns of tens of thousands of genes across a wide range of cellular responses, phenotypes, and conditions. Selecting a small subset of discriminative genes from thousands of genes is important for accurate classification of diseases and phenotypes. Many methods have been proposed to find subsets of genes with maximum relevance and minimum redundancy that can distinguish accurately between samples with different labels. Finding the minimum subset of relevant genes is often referred to as biomarker discovery. Two main approaches, filter and wrapper techniques, have been applied to biomarker discovery. In this paper, we conducted a comparative study of different biomarker discovery methods, including six filter methods and three wrapper methods. We then proposed a hybrid approach, FR-Wrapper, for biomarker discovery. The aim of this approach is to find an optimum balance between the precision of biomarker discovery and the computation cost by taking advantage of both the filter method's efficiency and the wrapper method's high accuracy. Our hybrid approach applies Fisher's ratio, a simple method that is easy to understand and implement, to filter out most of the irrelevant genes, and then employs a wrapper method to reduce the redundancy. The performance of the FR-Wrapper approach is evaluated over four widely used microarray datasets. Analysis of the experimental results reveals that the hybrid approach can achieve the goal of maximum relevance with minimum redundancy.
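The filter stage described in this abstract can be illustrated with a minimal sketch. It assumes one common two-class form of Fisher's ratio, (mean_a - mean_b)^2 / (var_a + var_b); the abstract does not state the exact formula the authors used, and the function names, gene names, and toy values below are hypothetical. The wrapper stage is not shown.

```python
from statistics import mean, pvariance


def fisher_ratio(values_a, values_b):
    """Two-class Fisher's ratio for one gene (assumed form, see lead-in)."""
    spread = pvariance(values_a) + pvariance(values_b)
    gap = (mean(values_a) - mean(values_b)) ** 2
    if spread == 0:
        # Degenerate case: no within-class variation at all.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def top_genes(expression, labels, k):
    """Rank genes by Fisher's ratio and keep the k highest-scoring ones.

    expression: dict mapping gene name -> list of expression values,
                one value per sample.
    labels:     list of 0/1 class labels, aligned with the samples.
    """
    scores = {}
    for gene, values in expression.items():
        class_a = [v for v, y in zip(values, labels) if y == 0]
        class_b = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = fisher_ratio(class_a, class_b)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: three genes measured on six samples.
expression = {
    "geneA": [1.0, 1.2, 0.8, 5.0, 5.2, 4.8],
    "geneB": [2.0, 3.0, 2.5, 2.4, 2.9, 2.1],
    "geneC": [0.5, 0.7, 0.6, 1.5, 1.4, 1.6],
}
labels = [0, 0, 0, 1, 1, 1]

selected = top_genes(expression, labels, k=2)
print(selected)  # ['geneA', 'geneC']
```

In a hybrid scheme of this kind, the genes that survive the filter would then be passed to a more expensive wrapper search, which is where redundancy among the remaining candidates would be reduced.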
Affiliation(s)
- Yanxiong Peng
- Laboratory for Bioinformatics and Medical Informatics, University of Texas at Dallas, Richardson, TX 75083-0688, U.S.A
- Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688, U.S.A
- Wenyuan Li
- Laboratory for Bioinformatics and Medical Informatics, University of Texas at Dallas, Richardson, TX 75083-0688, U.S.A
- Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688, U.S.A
- Ying Liu
- Laboratory for Bioinformatics and Medical Informatics, University of Texas at Dallas, Richardson, TX 75083-0688, U.S.A
- Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688, U.S.A
- Department of Molecular and Cell Biology, University of Texas at Dallas, Richardson, TX 75083-0688, U.S.A
2
|
Lee DG, Shin H. Disease causality extraction based on lexical semantics and document-clause frequency from biomedical literature. BMC Med Inform Decis Mak 2017; 17:53. [PMID: 28539124 PMCID: PMC5444051 DOI: 10.1186/s12911-017-0448-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Recently, research on human disease networks has succeeded and has become an aid in figuring out the relationships between various diseases. In most disease networks, however, the relationship between diseases has been simply represented as an association. This representation makes it difficult to identify prior diseases and their influence on posterior diseases. In this paper, we propose a causal disease network that implements disease causality through text mining of biomedical literature. METHODS To identify the causality between diseases, the proposed method includes two schemes. The first is the lexicon-based causality term strength, which provides the causal strength of a variety of causality terms based on lexicon analysis. The second is the frequency-based causality strength, which determines the direction and strength of causality based on document and clause frequencies in the literature. RESULTS We applied the proposed method to 6,617,833 PubMed articles and chose 195 diseases to construct a causal disease network. From all possible pairs of disease nodes in the network, 1011 causal pairs of 149 diseases were extracted. The resulting network was compared with that of a previous study. In terms of both coverage and quality, the proposed method performed better: it determined 2.7 times more causalities and showed higher correlation with associated diseases than the existing method. CONCLUSIONS The novelty of this research is that the proposed method circumvents the limitations of time and cost involved in testing all possible causalities in biological experiments, and that it is a more advanced text mining technique because it defines the concept of causality term strength.
Affiliation(s)
- Dong-Gi Lee
- Department of Industrial Engineering, Ajou University, 206 Worldcup-ro, Yeongtong-gu, Suwon, 16499, South Korea
- Hyunjung Shin
- Department of Industrial Engineering, Ajou University, 206 Worldcup-ro, Yeongtong-gu, Suwon, 16499, South Korea.
3
|
Jurca G, Addam O, Aksac A, Gao S, Özyer T, Demetrick D, Alhajj R. Integrating text mining, data mining, and network analysis for identifying genetic breast cancer trends. BMC Res Notes 2016; 9:236. [PMID: 27112211 PMCID: PMC4845430 DOI: 10.1186/s13104-016-2023-5] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 09/10/2015] [Accepted: 04/05/2016] [Indexed: 01/08/2023] Open
Abstract
Background Breast cancer is a serious disease which affects many women and may lead to death. It has received considerable attention from the research community. Thus, biomedical researchers aim to find genetic biomarkers indicative of the disease. Novel biomarkers can be elucidated from the existing literature. However, the vast number of scientific publications on breast cancer makes this a daunting task. This paper presents a framework which investigates existing literature data for informative discoveries. It integrates text mining and social network analysis in order to identify new potential biomarkers for breast cancer. Results We utilized PubMed for the testing. We investigated gene–gene interactions, as well as novel interactions such as gene-year, gene-country, and abstract-country, to find out how the discoveries varied over time and how overlapping or diverse the discoveries and the interests of various research groups in different countries are. Conclusions Interesting trends have been identified and discussed, e.g., different genes are highlighted in relationship to different countries, though the various genes were found to share functionality. Some text-analysis-based results have been validated against results from other tools that predict gene–gene relations and gene functions. Electronic supplementary material The online version of this article (doi:10.1186/s13104-016-2023-5) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Gabriela Jurca
- Department of Computer Science, University of Calgary, Calgary, AB, Canada
- Omar Addam
- Department of Computer Science, University of Calgary, Calgary, AB, Canada
- Alper Aksac
- Department of Computer Science, University of Calgary, Calgary, AB, Canada
- Shang Gao
- College of Computer Science and Technology, Jilin University, Changchun, China
- Tansel Özyer
- Department of Computer Engineering, TOBB University, Ankara, Turkey
- Douglas Demetrick
- Departments of Pathology, Oncology and Biochemistry & Molecular Biology, University of Calgary, Calgary, AB, Canada
- Reda Alhajj
- Department of Computer Science, University of Calgary, Calgary, AB, Canada; Department of Computer Science, Global University, Beirut, Lebanon.
4
|
Xie B, Ding Q, Wu D. Text Mining on Big and Complex Biomedical Literature. BIG DATA ANALYTICS IN BIOINFORMATICS AND HEALTHCARE 2015. [DOI: 10.4018/978-1-4666-6611-5.ch002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/15/2022]
Abstract
Driven by rapidly advancing techniques and increasing interest in biology and medicine, about 2,000 to 4,000 references are added daily to MEDLINE, the US national biomedical bibliographic database. Even for a specific research topic, extracting useful and comprehensive information out of the huge literature data pool is challenging. Text mining techniques become extremely useful when dealing with the abundant biomedical information, and they have been applied to various areas in the realm of biomedical research. Instead of providing a brief overview of all text mining techniques and every major biomedical text mining application, this chapter explores in depth the microRNA profiling area and related text mining tools. As an illustrative example, one rule-based text mining system developed by the authors is discussed in detail. This chapter also discusses the challenges and potential research areas in biomedical text mining.
5
|
Garcia EV, Klein JL, Taylor AT. Clinical decision support systems in myocardial perfusion imaging. J Nucl Cardiol 2014; 21:427-39; quiz 440. [PMID: 24482142 DOI: 10.1007/s12350-014-9857-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 11/19/2013] [Accepted: 12/17/2013] [Indexed: 10/25/2022]
Abstract
Diagnostic imaging is becoming more complicated, and physicians are also required to master an ever-expanding knowledge base and take into account an ever-increasing amount of patient-specific clinical information, while the time available to master this knowledge base, assemble the relevant clinical data, and apply it to specific tasks is steadily shrinking. Compounding these problems, there is an ever-increasing number of aging "Baby Boomers" who are becoming patients, coupled with a declining number of cardiac diagnosticians experienced in interpreting these studies. Hence, it is crucial that decision support tools be developed and implemented to assist physicians in interpreting studies at a faster rate and at the highest level of up-to-date expertise. Such tools will minimize subjectivity and intra- and inter-observer variation in image interpretation, help achieve a standardized high level of performance, and reduce healthcare costs. Presently, there are many decision support systems and approaches being developed and implemented to provide greater automation and to further objectify and standardize analysis, display, integration, interpretation, and reporting of myocardial perfusion SPECT and PET studies. This review focuses on these systems and approaches.
Affiliation(s)
- Ernest V Garcia
- Department of Radiology and Imaging Sciences, Emory University, 101 Woodruff Circle, Room 1203, Atlanta, GA, 30322, USA
6
|
7
|
Hsiao MY, Chen CC, Chen JH. Using UMLS to construct a generalized hierarchical concept-based dictionary of brain functions for information extraction from the fMRI literature. J Biomed Inform 2009; 42:912-22. [DOI: 10.1016/j.jbi.2009.04.003] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 05/27/2008] [Revised: 04/14/2009] [Accepted: 04/15/2009] [Indexed: 01/12/2023]
8
|
Inference of gene pathways using mixture Bayesian networks. BMC SYSTEMS BIOLOGY 2009; 3:54. [PMID: 19454027 PMCID: PMC2701418 DOI: 10.1186/1752-0509-3-54] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Received: 10/21/2008] [Accepted: 05/19/2009] [Indexed: 12/13/2022]
Abstract
BACKGROUND Inference of gene networks typically relies on measurements across a wide range of conditions or treatments. Although one network structure is predicted, the relationship between genes could vary across conditions. A comprehensive approach to infer general and condition-dependent gene networks was evaluated. This approach integrated Bayesian network and Gaussian mixture models to describe continuous microarray gene expression measurements, and three gene networks were predicted. RESULTS The first reconstructions of a circadian rhythm pathway in honey bees and an adherens junction pathway in mouse embryos were obtained. In addition, general and condition-specific gene relationships, some unexpected, were detected in these two pathways and in a yeast cell-cycle pathway. The mixture Bayesian network approach identified all (honey bee circadian rhythm and mouse adherens junction pathways) or the vast majority (yeast cell-cycle pathway) of the gene relationships reported in empirical studies. Findings across the three pathways and data sets indicate that the mixture Bayesian network approach is well-suited to infer gene pathways based on microarray data. Furthermore, the interpretation of model estimates provided a broader understanding of the relationships between genes. The mixture models offered a comprehensive description of the relationships among genes in complex biological processes or across a wide range of conditions. The mixture parameter estimates and corresponding odds that the gene network inferred for a sample pertained to each mixture component allowed the uncovering of both general and condition-dependent gene relationships and patterns of expression. CONCLUSION This study demonstrated the two main benefits of learning gene pathways using mixture Bayesian networks. 
First, the identification of the optimal number of mixture components supported by the data offered a robust approach to infer gene relationships and estimate gene expression profiles. Second, the classification of conditions and observations into groups that support particular mixture components helped to uncover gene relationships that are either unique to or common across conditions. Results from the application of mixture Bayesian networks substantially augmented the understanding of gene networks and demonstrated the added value of this methodology to infer gene networks.
9
|
Lourenço A, Carreira R, Carneiro S, Maia P, Glez-Peña D, Fdez-Riverola F, Ferreira EC, Rocha I, Rocha M. @Note: a workbench for biomedical text mining. J Biomed Inform 2009; 42:710-20. [PMID: 19393341 DOI: 10.1016/j.jbi.2009.04.002] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Received: 11/06/2008] [Revised: 02/16/2009] [Accepted: 04/07/2009] [Indexed: 10/20/2022]
Abstract
Biomedical Text Mining (BioTM) is providing valuable approaches to the automated curation of scientific literature. However, most efforts have addressed the benchmarking of new algorithms rather than user operational needs. Bridging the gap between BioTM researchers and biologists' needs is crucial to solve real-world problems and promote further research. We present @Note, a platform for BioTM that aims at the effective translation of the advances between three distinct classes of users: biologists, text miners and software developers. Its main functional contributions are the ability to process abstracts and full texts; an information retrieval module enabling PubMed search and journal crawling; a pre-processing module with PDF-to-text conversion, tokenisation and stopword removal; a semantic annotation schema; a lexicon-based annotator; a user-friendly annotation view that allows annotations to be corrected; and a Text Mining Module supporting dataset preparation and algorithm evaluation. @Note improves interoperability, modularity and flexibility when integrating in-house and open-source third-party components. Its component-based architecture allows the rapid development of new applications, emphasizing the principles of transparency and simplicity of use. Although development is still ongoing, it has already allowed the development of applications that are currently being used.
Affiliation(s)
- Anália Lourenço
- IBB - Institute for Biotechnology and Bioengineering, Centre of Biological Engineering, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal.
10
|
Yang J, Cohen A, Hersh W. Evaluation of a gene information summarization system by users during the analysis process of microarray datasets. BMC Bioinformatics 2009; 10 Suppl 2:S5. [PMID: 19208193 PMCID: PMC2646238 DOI: 10.1186/1471-2105-10-s2-s5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/24/2022] Open
Abstract
Background Summarization of gene information in the literature has the potential to help genomics researchers translate basic research into clinical benefits. Gene expression microarrays have been used to study biomarkers for disease and discover novel types of therapeutics, and the task of finding information in journal articles on sets of genes is common for translational researchers working with microarray data. However, manually searching and scanning the literature references returned from PubMed is a time-consuming task for scientists. We built and evaluated an automatic summarizer of information on genes studied in microarray experiments. The Gene Information Clustering and Summarization System (GICSS) is a system that integrates two related steps of the microarray data analysis process: functional gene clustering and gene information gathering. The system evaluation was conducted while genomic researchers were analyzing their own experimental microarray datasets. Results The clusters generated by GICSS were validated by scientists during their microarray analysis process. In addition, presenting sentences in the abstract provided significantly more important information to the users than just showing the title in the default PubMed format. Conclusion The evaluation results suggest that GICSS can be useful for researchers in the genomics area. In addition, the hybrid evaluation method, partway between intrinsic and extrinsic system evaluation, may enable researchers to gauge the true usefulness of the tool for the scientists in their natural analysis workflow and also elicit suggestions for future enhancements. Availability GICSS can be accessed online at:
Affiliation(s)
- Jianji Yang
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, Oregon 97239, USA.
11
|
Hull D, Pettifer SR, Kell DB. Defrosting the digital library: bibliographic tools for the next generation web. PLoS Comput Biol 2008; 4:e1000204. [PMID: 18974831 PMCID: PMC2568856 DOI: 10.1371/journal.pcbi.1000204] [Citation(s) in RCA: 92] [Impact Index Per Article: 5.8] [Indexed: 12/02/2022] Open
Abstract
Many scientists now manage the bulk of their bibliographic information electronically, thereby organizing their publications and citation material from digital libraries. However, a library has been described as "thought in cold storage," and unfortunately many digital libraries can be cold, impersonal, isolated, and inaccessible places. In this Review, we discuss the current chilly state of digital libraries for the computational biologist, including PubMed, IEEE Xplore, the ACM digital library, ISI Web of Knowledge, Scopus, Citeseer, arXiv, DBLP, and Google Scholar. We illustrate the current process of using these libraries with a typical workflow, and highlight problems with managing data and metadata using URIs. We then examine a range of new applications such as Zotero, Mendeley, Mekentosj Papers, MyNCBI, CiteULike, Connotea, and HubMed that exploit the Web to make these digital libraries more personal, sociable, integrated, and accessible places. We conclude with how these applications may begin to help achieve a digital defrost, and discuss some of the issues that will help or hinder this in terms of making libraries on the Web warmer places in the future, becoming resources that are considerably more useful to both humans and machines.
Affiliation(s)
- Duncan Hull
- School of Chemistry, The University of Manchester, Manchester, UK.
12
|
Watanabe RLA, Morett E, Vallejo EE. Inferring modules of functionally interacting proteins using the Bond Energy Algorithm. BMC Bioinformatics 2008; 9:285. [PMID: 18559112 PMCID: PMC2474619 DOI: 10.1186/1471-2105-9-285] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 02/28/2008] [Accepted: 06/17/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Non-homology based methods such as phylogenetic profiles are effective for predicting functional relationships between proteins with no considerable sequence or structure similarity. Those methods rely heavily on traditional similarity metrics defined on pairs of phylogenetic patterns. Proteins do not exclusively interact in pairs, as the final biological function of a protein in the cellular context is often held by a group of proteins. In order to accurately infer modules of functionally interacting proteins, the consideration of not only direct but also indirect relationships is required. In this paper, we used the Bond Energy Algorithm (BEA) to predict functionally related groups of proteins. With BEA we create clusters of phylogenetic profiles based on the associations of the surrounding elements of the analyzed data, using a metric that considers linked relationships among elements in the data set. RESULTS Using phylogenetic profiles obtained from the Cluster of Orthologous Groups of Proteins (COG) database, we conducted a series of clustering experiments using BEA to predict (upper level) relationships between profiles. We evaluated our results by comparing them with COG's functional categories and, further, with the experimentally determined functional relationships between proteins provided by the DIP and ECOCYC databases. Our results demonstrate that BEA is capable of predicting meaningful modules of functionally related proteins. BEA outperforms traditionally used clustering methods, such as k-means and hierarchical clustering, by predicting functional relationships between proteins with higher accuracy. CONCLUSION This study shows that the linked relationships of phylogenetic profiles obtained by BEA are useful for detecting functional associations between profiles and extending functional modules not found by traditional methods.
BEA is capable of detecting relationships among phylogenetic patterns by linking them through a common element shared in a group. Additionally, we discuss how the proposed method may become more powerful if other criteria to classify different levels of protein functional interactions, such as gene neighborhood or protein fusion information, are provided.
Affiliation(s)
- Ryosuke L A Watanabe
- ITESM Campus Estado de México, Carretera Lago de Guadalupe km 3,5, Atizapán de Zaragoza, 52926, México.
13
|
Natarajan J, Ganapathy J. Functional gene clustering via gene annotation sentences, MeSH and GO keywords from biomedical literature. Bioinformation 2007; 2:185-93. [PMID: 18305827 PMCID: PMC2241933 DOI: 10.6026/97320630002185] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 11/18/2007] [Accepted: 12/30/2007] [Indexed: 11/23/2022] Open
Abstract
Gene function annotation remains a key challenge in modern biology. This is especially true for high-throughput techniques such as gene expression experiments. Vital information about genes is available electronically from biomedical literature in the form of full texts and abstracts. In addition, various publicly available databases (such as GenBank, Gene Ontology and Entrez) provide access to gene-related information at different levels of biological organization, granularity and data format. This information is being used to assess and interpret the results from high-throughput experiments. To improve keyword extraction for annotational clustering and other types of analyses, we have developed a novel text mining approach, which is based on keywords identified at the level of gene annotation sentences (in particular sentences characterizing biological function) instead of entire abstracts. Further, to improve the expressiveness and usefulness of gene annotation terms, we investigated the combination of sentence-level keywords with terms from the Medical Subject Headings (MeSH) and Gene Ontology (GO) resources. We find that sentence-level keywords combined with MeSH terms outperform the typical 'baseline' set-up (term frequencies at the level of abstracts) by a significant margin, whereas the addition of GO terms improves matters only marginally. We validated our approach on a manually annotated corpus of 200 abstracts generated on the basis of 2 cancer categories and 10 genes per category. We applied the method in the context of three sets of differentially expressed genes obtained from pediatric brain tumor samples. This analysis suggests novel interpretations of discovered gene expression patterns.
Affiliation(s)
- Jeyakumar Natarajan
- Centre of Excellence in Bioinformatics, School of Biotechnology, Madurai Kamaraj University, Madurai 625021, India.
14
|
Zhou X, Liu B, Wu Z, Feng Y. Integrative mining of traditional Chinese medicine literature and MEDLINE for functional gene networks. Artif Intell Med 2007; 41:87-104. [PMID: 17804209 DOI: 10.1016/j.artmed.2007.07.007] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.9] [Received: 12/01/2006] [Revised: 07/24/2007] [Accepted: 07/24/2007] [Indexed: 01/17/2023]
Abstract
OBJECTIVE The amount of biomedical data in different disciplines is growing at an exponential rate. Integrating these significant knowledge sources to generate novel hypotheses for systems biology research is difficult. Traditional Chinese medicine (TCM) is a completely different discipline, and is a complementary knowledge system to modern biomedical science. This paper uses a significant TCM bibliographic literature database in China, together with MEDLINE, to help discover novel gene functional knowledge. MATERIALS AND METHODS We present an integrative mining approach to uncover the functional gene relationships from MEDLINE and TCM bibliographic literature. This paper introduces TCM literature (about 50,000 records) as one knowledge source for constructing literature-based gene networks. We use the TCM diagnosis, TCM syndrome, to automatically congregate the related genes. The syndrome-gene relationships are discovered based on the syndrome-disease relationships extracted from TCM literature and the disease-gene relationships in MEDLINE. Based on the bubble-bootstrapping and relation weight computing methods, we have developed a prototype system called MeDisco/3S, which has named entity and relation extraction, and online analytical processing (OLAP) capabilities, to perform the integrative mining process. RESULTS We obtained about 200,000 syndrome-gene relations, which could help generate syndrome-based gene networks and help analyze the functional knowledge of genes from a syndrome perspective. We take the gene network of Kidney-Yang Deficiency syndrome (KYD syndrome) and the functional analysis of some genes, such as CRH (corticotropin releasing hormone), PTH (parathyroid hormone), PRL (prolactin), BRCA1 (breast cancer 1, early onset) and BRCA2 (breast cancer 2, early onset), to demonstrate the preliminary results.
The underlying hypothesis is that the related genes of the same syndrome will have some biological functional relationships, and will constitute a functional network. CONCLUSION This paper presents an approach to integrate TCM literature and modern biomedical data to discover novel gene networks and functional knowledge of genes. The preliminary results show that novel gene functional knowledge and gene networks, which are worthy of further investigation, could be generated by integrating the two complementary biomedical data sources. Integrative mining of TCM and modern life science literature is a promising research field.
Affiliation(s)
- Xuezhong Zhou
- China Academy of Chinese Medical Sciences, Beijing 100700, China.
15
|
Lin Y, Li W, Chen K, Liu Y. A document clustering and ranking system for exploring MEDLINE citations. J Am Med Inform Assoc 2007; 14:651-61. [PMID: 17600104 PMCID: PMC1975797 DOI: 10.1197/jamia.m2215] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.4] [Received: 07/17/2006] [Accepted: 05/20/2007] [Indexed: 11/10/2022] Open
Abstract
OBJECTIVE A major problem faced in biomedical informatics involves how best to present information retrieval results. When a single query retrieves many results, simply showing them as a long list often provides a poor overview. With a goal of presenting users with reduced sets of relevant citations, this study developed an approach that retrieved and organized MEDLINE citations into different topical groups and prioritized important citations in each group. DESIGN A text mining system framework for automatic document clustering and ranking organized MEDLINE citations following simple PubMed queries. The system grouped the retrieved citations, ranked the citations in each cluster, and generated a set of keywords and MeSH terms to describe the common theme of each cluster. MEASUREMENTS Several possible ranking functions were compared, including citation count per year (CCPY), citation count (CC), and journal impact factor (JIF). We evaluated this framework by identifying as "important" those articles selected by the Surgical Oncology Society. RESULTS Our results showed that CCPY outperforms CC and JIF, i.e., CCPY ranked important articles better than the others did. Furthermore, our text clustering and knowledge extraction strategy grouped the retrieval results into informative clusters, as revealed by the keywords and MeSH terms extracted from the documents in each cluster. CONCLUSIONS The text mining system studied effectively integrated text clustering, text summarization, and text ranking and organized MEDLINE retrieval results into different topical groups.
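To make the difference between the CC and CCPY ranking functions concrete, here is a small sketch. It assumes CCPY is the citation count divided by the number of years since publication (with a floor of one year); the abstract does not give the exact definition, and the article records and reference year below are invented for illustration.

```python
def ccpy(citations, pub_year, current_year):
    """Citation count per year (assumed definition, see lead-in)."""
    years_in_print = max(current_year - pub_year, 1)  # avoid dividing by zero
    return citations / years_in_print


def rank_by_cc(articles):
    """Rank articles by raw citation count, highest first."""
    return sorted(articles, key=lambda a: a["citations"], reverse=True)


def rank_by_ccpy(articles, current_year):
    """Rank articles by citation count per year, highest first."""
    return sorted(
        articles,
        key=lambda a: ccpy(a["citations"], a["year"], current_year),
        reverse=True,
    )


# Hypothetical articles within one cluster of retrieved citations.
articles = [
    {"id": "a", "citations": 120, "year": 1995},
    {"id": "b", "citations": 60, "year": 2003},
    {"id": "c", "citations": 10, "year": 2005},
]

cc_order = [a["id"] for a in rank_by_cc(articles)]
ccpy_order = [a["id"] for a in rank_by_ccpy(articles, current_year=2006)]
print(cc_order)    # ['a', 'b', 'c']
print(ccpy_order)  # ['b', 'a', 'c']
```

The toy data show why the two functions can disagree: an older article accumulates a high raw count, while a recent article with fewer total citations can still have the higher rate per year.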
Collapse
Affiliation(s)
- Yongjing Lin
- Laboratory for Bioinformatics and Medical Informatics, Department of Computer Science, University of Texas at Dallas, Richardson, TX
| | - Wenyuan Li
- Laboratory for Bioinformatics and Medical Informatics, Department of Computer Science, University of Texas at Dallas, Richardson, TX
| | | | - Ying Liu
- Laboratory for Bioinformatics and Medical Informatics, Department of Computer Science, University of Texas at Dallas, Richardson, TX
- Department of Molecular and Cell Biology, University of Texas at Dallas, Richardson, TX
|
16
|
Polavarapu N, Navathe SB, Ramnarayanan R, ul Haque A, Sahay S, Liu Y. Investigation into biomedical literature classification using support vector machines. Proceedings. IEEE Computational Systems Bioinformatics Conference 2007:366-74. [PMID: 16447994 DOI: 10.1109/csb.2005.36] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Indexed: 11/09/2022]
Abstract
Searching for a specific topic in the PubMed database, one of the most important information resources for the scientific community, presents a big challenge to users. The researcher typically formulates Boolean queries and then scans the retrieved records for relevance, which is very time-consuming and error-prone. We applied Support Vector Machines (SVM) for automatic retrieval of PubMed articles related to human genome epidemiological research at the CDC (Centers for Disease Control and Prevention). In this paper, we discuss various investigations into biomedical literature classification and analyze the effect of the choice of keywords, training sets, kernel functions, and parameters for the SVM technique. We report on these factors to show that SVM is a viable technique for automatic classification of biomedical literature into topics of interest such as epidemiology, cancer, and birth defects. In all our experiments, we achieved high values of PPV, sensitivity, and specificity.
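The three evaluation measures named above follow directly from the confusion-matrix counts of a binary classifier. The sketch below shows the standard definitions; the function name and the example counts are illustrative assumptions and are not taken from the paper.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute PPV, sensitivity, and specificity from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives, and
    false negatives. A metric with a zero denominator is returned as NaN.
    """
    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else float("nan")

    return {
        # Of the articles predicted relevant, the fraction that truly are.
        "ppv": ratio(tp, tp + fp),
        # Of the truly relevant articles, the fraction that were retrieved.
        "sensitivity": ratio(tp, tp + fn),
        # Of the truly irrelevant articles, the fraction correctly rejected.
        "specificity": ratio(tn, tn + fp),
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

For the made-up counts above, PPV is 40/50, sensitivity is 40/45, and specificity is 45/55.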
Affiliation(s)
- Nalini Polavarapu
- School of Biology, Georgia Institute of Technology, Atlanta, GA 30332, USA.
|
17
|
Thomas PD, Mi H, Lewis S. Ontology annotation: mapping genomic regions to biological function. Curr Opin Chem Biol 2007; 11:4-11. [PMID: 17208035 DOI: 10.1016/j.cbpa.2006.11.039] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.0] [Received: 11/06/2006] [Accepted: 11/29/2006] [Indexed: 10/23/2022]
Abstract
With numerous whole genomes now in hand, and experimental data about genes and biological pathways on the increase, a systems approach to biological research is becoming essential. Ontologies provide a formal representation of knowledge that is amenable to computational as well as human analysis, an obvious underpinning of systems biology. Mapping function to gene products in the genome consists of two somewhat intertwined enterprises: ontology building and ontology annotation. Ontology building is the formal representation of a domain of knowledge; ontology annotation is the association of specific genomic regions (which we refer to simply as 'genes', including genes and their regulatory elements and products such as proteins and functional RNAs) with parts of the ontology. We consider two complementary representations of gene function: the Gene Ontology (GO) and pathway ontologies. GO represents function from the gene's-eye view, in relation to a large and growing context of biological knowledge at all levels. Pathway ontologies represent function from the point of view of biochemical reactions and interactions, which are ordered into networks and causal cascades. The more mature GO provides an example of ontology annotation: how conclusions from the scientific literature and from evolutionary relationships are converted into formal statements about gene function. Annotations are made using a variety of different types of evidence, which can be used to estimate the relative reliability of different annotations.
Affiliation(s)
- Paul D Thomas
- Evolutionary Systems Biology Group, Artificial Intelligence Center, SRI International, Menlo Park, CA 94025, USA.
|
18
|
Pospisil P, Iyer LK, Adelstein SJ, Kassis AI. A combined approach to data mining of textual and structured data to identify cancer-related targets. BMC Bioinformatics 2006; 7:354. [PMID: 16857057 PMCID: PMC1555615 DOI: 10.1186/1471-2105-7-354] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.2] [Received: 02/10/2006] [Accepted: 07/20/2006] [Indexed: 11/24/2022] Open
Abstract
Background: We present an effective, rapid, systematic data mining approach for identifying genes or proteins related to a particular interest. A selected combination of programs exploring PubMed abstracts, universal gene/protein databases (UniProt, InterPro, NCBI Entrez), and state-of-the-art pathway knowledge bases (LSGraph and Ingenuity Pathway Analysis) was assembled to distinguish enzymes with hydrolytic activities that are expressed in the extracellular space of cancer cells. Proteins were identified with respect to six types of cancer occurring in the prostate, breast, lung, colon, ovary, and pancreas. Results: The data mining method identified previously undetected targets. Our combined strategy applied to each cancer type identified a minimum of 375 proteins expressed within the extracellular space and/or attached to the plasma membrane. The method led to the recognition of human cancer-related hydrolases (on average, ~35 per cancer type), among which were prostatic acid phosphatase, prostate-specific antigen, and sulfatase 1. Conclusion: The combined data mining of several databases overcame many of the limitations of querying a single database and enabled the facile identification of gene products. In the case of cancer-related targets, it produced a list of putative extracellular, hydrolytic enzymes that merit additional study as candidates for cancer radioimaging and radiotherapy. The proposed data mining strategy is of a general nature and can be applied to other biological databases for understanding biological functions and diseases.
Affiliation(s)
- Pavel Pospisil
- Harvard Medical School, Department of Radiology, 200 Longwood Avenue, Boston, Massachusetts, USA
- Lakshmanan K Iyer
- Bauer Center for Genomics Research, Harvard University, 7 Divinity Avenue, Cambridge, Massachusetts, USA
- S James Adelstein
- Harvard Medical School, Department of Radiology, 200 Longwood Avenue, Boston, Massachusetts, USA
- Amin I Kassis
- Harvard Medical School, Department of Radiology, 200 Longwood Avenue, Boston, Massachusetts, USA
|