1
Ye Z, Tafti AP, He KY, Wang K, He MM. SparkText: Biomedical Text Mining on Big Data Framework. PLoS One 2016; 11:e0162721. [PMID: 27685652 PMCID: PMC5042555 DOI: 10.1371/journal.pone.0162721] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 03/30/2016] [Accepted: 08/26/2016] [Indexed: 11/18/2022]
Abstract
Background: Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment.
Results: In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes.
Conclusions: This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research.
Affiliation(s)
- Zhan Ye
- Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, Marshfield, WI, 54449, United States of America
- Ahmad P Tafti
- Center for Human Genetics, Marshfield Clinic Research Foundation, Marshfield, WI, 54449, United States of America; Department of Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI, 53211, United States of America
- Karen Y He
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, 44106, United States of America
- Kai Wang
- Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, CA, 90089, United States of America; Department of Psychiatry, University of Southern California, Los Angeles, CA, 90089, United States of America
- Max M He
- Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, Marshfield, WI, 54449, United States of America; Center for Human Genetics, Marshfield Clinic Research Foundation, Marshfield, WI, 54449, United States of America; Computation and Informatics in Biology and Medicine, University of Wisconsin-Madison, Madison, WI, 53706, United States of America
2
Rossi LMG, Escobar-Gutierrez A, Rahal P. Advanced molecular surveillance of hepatitis C virus. Viruses 2015; 7:1153-88. [PMID: 25781918 PMCID: PMC4379565 DOI: 10.3390/v7031153] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 12/17/2014] [Revised: 02/05/2015] [Accepted: 02/20/2015] [Indexed: 12/12/2022]
Abstract
Hepatitis C virus (HCV) infection is an important public health problem worldwide. HCV exploits complex molecular mechanisms, which result in a high degree of intrahost genetic heterogeneity. This high degree of variability represents a challenge for the accurate establishment of genetic relatedness between cases and complicates the identification of sources of infection. Tracking HCV infections is crucial for the elucidation of routes of transmission in a variety of settings. Therefore, implementation of HCV advanced molecular surveillance (AMS) is essential for disease control. Accounting for virulence is also important for HCV AMS and both viral and host factors contribute to the disease outcome. Therefore, HCV AMS requires the incorporation of host factors as an integral component of the algorithms used to monitor disease occurrence. Importantly, implementation of comprehensive global databases and data mining are also needed for the proper study of the mechanisms responsible for HCV transmission. Here, we review molecular aspects associated with HCV transmission, as well as the most recent technological advances used for virus and host characterization. Additionally, the cornerstone discoveries that have defined the pathway for viral characterization are presented and the importance of implementing advanced HCV molecular surveillance is highlighted.
Affiliation(s)
- Livia Maria Gonçalves Rossi
- Department of Biology, Institute of Bioscience, Language and Exact Science, Sao Paulo State University, Sao Jose do Rio Preto, SP 15054-000, Brazil.
- Paula Rahal
- Department of Biology, Institute of Bioscience, Language and Exact Science, Sao Paulo State University, Sao Jose do Rio Preto, SP 15054-000, Brazil.
3
Baye TM, Wilke RA. Mapping genes that predict treatment outcome in admixed populations. The Pharmacogenomics Journal 2010; 10:465-77. [PMID: 20921971 PMCID: PMC2991422 DOI: 10.1038/tpj.2010.71] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Received: 04/09/2010] [Revised: 07/07/2010] [Accepted: 08/05/2010] [Indexed: 01/19/2023]
Abstract
There is great interest in characterizing the genetic architecture underlying drug response. For many drugs, gene-based dosing models explain a considerable amount of the overall variation in treatment outcome. As such, prescription drug labels are increasingly being modified to contain pharmacogenetic information. Genetic data must, however, be interpreted within the context of relevant clinical covariates. Even the most predictive models improve with the addition of data related to biogeographical ancestry. The current review explores analytical strategies that leverage population structure to more fully characterize genetic determinants of outcome in large clinical practice-based cohorts. The success of this approach will depend upon several key factors: (1) the availability of outcome data from groups of admixed individuals (that is, populations recombined over multiple generations), (2) a measurable difference in treatment outcome (that is, efficacy and toxicity end points), and (3) a measurable difference in allele frequency between the ancestral populations.
Affiliation(s)
- T M Baye
- Division of Asthma Research, Department of Pediatrics, Cincinnati Children's Hospital Medical Center, University of Cincinnati, Cincinnati, OH 45229-3039, USA.