26
|
Botvinik-Nezer R, Holzmeister F, Camerer CF, Dreber A, Huber J, Johannesson M, Kirchler M, Iwanir R, Mumford JA, Adcock RA, Avesani P, Baczkowski BM, Bajracharya A, Bakst L, Ball S, Barilari M, Bault N, Beaton D, Beitner J, Benoit RG, Berkers RMWJ, Bhanji JP, Biswal BB, Bobadilla-Suarez S, Bortolini T, Bottenhorn KL, Bowring A, Braem S, Brooks HR, Brudner EG, Calderon CB, Camilleri JA, Castrellon JJ, Cecchetti L, Cieslik EC, Cole ZJ, Collignon O, Cox RW, Cunningham WA, Czoschke S, Dadi K, Davis CP, Luca AD, Delgado MR, Demetriou L, Dennison JB, Di X, Dickie EW, Dobryakova E, Donnat CL, Dukart J, Duncan NW, Durnez J, Eed A, Eickhoff SB, Erhart A, Fontanesi L, Fricke GM, Fu S, Galván A, Gau R, Genon S, Glatard T, Glerean E, Goeman JJ, Golowin SAE, González-García C, Gorgolewski KJ, Grady CL, Green MA, Guassi Moreira JF, Guest O, Hakimi S, Hamilton JP, Hancock R, Handjaras G, Harry BB, Hawco C, Herholz P, Herman G, Heunis S, Hoffstaedter F, Hogeveen J, Holmes S, Hu CP, Huettel SA, Hughes ME, Iacovella V, Iordan AD, Isager PM, Isik AI, Jahn A, Johnson MR, Johnstone T, Joseph MJE, Juliano AC, Kable JW, Kassinopoulos M, Koba C, Kong XZ, Koscik TR, Kucukboyaci NE, Kuhl BA, Kupek S, Laird AR, Lamm C, Langner R, Lauharatanahirun N, Lee H, Lee S, Leemans A, Leo A, Lesage E, Li F, Li MYC, Lim PC, Lintz EN, Liphardt SW, Losecaat Vermeer AB, Love BC, Mack ML, Malpica N, Marins T, Maumet C, McDonald K, McGuire JT, Melero H, Méndez Leal AS, Meyer B, Meyer KN, Mihai G, Mitsis GD, Moll J, Nielson DM, Nilsonne G, Notter MP, Olivetti E, Onicas AI, Papale P, Patil KR, Peelle JE, Pérez A, Pischedda D, Poline JB, Prystauka Y, Ray S, Reuter-Lorenz PA, Reynolds RC, Ricciardi E, Rieck JR, Rodriguez-Thompson AM, Romyn A, Salo T, Samanez-Larkin GR, Sanz-Morales E, Schlichting ML, Schultz DH, Shen Q, Sheridan MA, Silvers JA, Skagerlund K, Smith A, Smith DV, Sokol-Hessner P, Steinkamp SR, Tashjian SM, Thirion B, Thorp JN, Tinghög G, Tisdall L, Tompson SH, Toro-Serey C, Torre Tresols JJ, Tozzi L, Truong V, Turella L, van 't Veer AE, Verguts T, Vettel JM, Vijayarajah S, Vo K, Wall MB, Weeda WD, Weis S, White DJ, Wisniewski D, Xifra-Porxas A, Yearling EA, Yoon S, Yuan R, Yuen KSL, Zhang L, Zhang X, Zosky JE, Nichols TE, Poldrack RA, Schonberg T. Variability in the analysis of a single neuroimaging dataset by many teams. Nature 2020; 582:84-88. [PMID: 32483374 PMCID: PMC7771346 DOI: 10.1038/s41586-020-2314-9] [Citation(s) in RCA: 439] [Impact Index Per Article: 109.8] [Received: 11/14/2019] [Accepted: 04/07/2020] [Indexed: 01/13/2023]
Abstract
Data analysis workflows in many scientific domains have become increasingly complex and flexible. Here we assess the effect of this flexibility on the results of functional magnetic resonance imaging by asking 70 independent teams to analyse the same dataset, testing the same 9 ex-ante hypotheses [1]. The flexibility of analytical approaches is exemplified by the fact that no two teams chose identical workflows to analyse the data. This flexibility resulted in sizeable variation in the results of hypothesis tests, even for teams whose statistical maps were highly correlated at intermediate stages of the analysis pipeline. Variation in reported results was related to several aspects of analysis methodology. Notably, a meta-analytical approach that aggregated information across teams yielded a significant consensus in activated regions. Furthermore, prediction markets of researchers in the field revealed an overestimation of the likelihood of significant findings, even by researchers with direct knowledge of the dataset [2-5]. Our findings show that analytical flexibility can have substantial effects on scientific conclusions, and identify factors that may be related to variability in the analysis of functional magnetic resonance imaging. The results emphasize the importance of validating and sharing complex analysis workflows, and demonstrate the need for performing and reporting multiple analyses of the same data. Potential approaches that could be used to mitigate issues related to analytical variability are discussed.
|
27
|
Elias D, Campaña H, Poletta F, Heisecke S, Gili J, Ratowiecki J, Gimenez L, Pawluk M, Santos MR, Cosentino V, Uranga R, Rittler M, Lopez Camelo J. A graph theory approach to analyze birth defect associations. PLoS One 2020; 15:e0233529. [PMID: 32442191 PMCID: PMC7244144 DOI: 10.1371/journal.pone.0233529] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/28/2019] [Accepted: 05/06/2020] [Indexed: 01/11/2023]
Abstract
Birth defects are prenatal morphological or functional anomalies. Associations among them are studied to identify their etiopathogenesis. The graph theory methods allow analyzing relationships among a complete set of anomalies. A graph consists of nodes which represent the entities (birth defects in the present work), and edges that join nodes indicating the relationships among them. The aim of the present study was to validate the graph theory methods to study birth defect associations. All birth defects monitoring records from the Estudio Colaborativo Latino Americano de Malformaciones Congénitas gathered between 1967 and 2017 were used. From around 5 million live and stillborn infants, 170,430 had one or more birth defects. Volume-adjusted Chi-Square was used to determine the association strength between two birth defects and to weight the graph edges. The complete birth defect graph showed a Log-Normal degree distribution and its characteristics differed from random, scale-free and small-world graphs. The graph comprised 118 nodes and 550 edges. Birth defects with the highest centrality values were nonspecific codes such as Other upper limb anomalies. After partition, the graph yielded 12 groups; most of them were recognizable and included conditions such as VATER and OEIS associations, and Patau syndrome. Our findings validate the graph theory methods to study birth defect associations. This method may contribute to identify underlying etiopathogeneses as well as to improve coding systems.
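As a rough illustration of the graph construction this abstract describes (nodes are birth defects, edges are weighted by association strength), the following is a minimal sketch under stated assumptions, not the authors' method. It uses the plain Pearson chi-square on a 2x2 table rather than the volume-adjusted statistic named in the abstract, and the function names, the 3.84 cut-off, and the toy records are all hypothetical.

```python
from itertools import combinations


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def build_association_graph(records, threshold=3.84):
    """Return {(defect_x, defect_y): chi2} for positively associated pairs.

    records: list of sets, one set of defect codes per infant (hypothetical layout).
    threshold: assumed cut-off (3.84 is the 5% critical value for 1 degree of freedom).
    """
    defects = sorted(set().union(*records))
    edges = {}
    for x, y in combinations(defects, 2):
        a = sum(1 for r in records if x in r and y in r)      # both defects
        b = sum(1 for r in records if x in r and y not in r)  # only x
        c = sum(1 for r in records if x not in r and y in r)  # only y
        d = len(records) - a - b - c                          # neither
        chi2 = chi_square_2x2(a, b, c, d)
        # Keep only pairs that co-occur more often than expected by chance.
        if chi2 > threshold and a * d > b * c:
            edges[(x, y)] = chi2
    return edges


def degree(edges):
    """Count how many edges touch each node (unweighted degree)."""
    deg = {}
    for x, y in edges:
        deg[x] = deg.get(x, 0) + 1
        deg[y] = deg.get(y, 0) + 1
    return deg


# Toy data: defects "A" and "B" tend to co-occur, "C" occurs alone.
records = [{"A", "B"}] * 8 + [{"A"}] * 2 + [{"B"}] * 2 + [{"C"}] * 8
edges = build_association_graph(records)
print(edges)
print(degree(edges))
```

In this toy example only the A-B pair survives as an edge; the pairs involving "C" have a large chi-square but a negative association, so the sketch drops them. Centrality and partitioning, as reported in the abstract, would be computed on the resulting weighted graph.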
|
28
|
Abstract
With increasing demand for training in data science, extracurricular or "ad hoc" education efforts have emerged to help individuals acquire relevant skills and expertise. Although extracurricular efforts already exist for many computationally intensive disciplines, their support of data science education has significantly helped in coping with the speed of innovation in data science practice and formal curricula. While the proliferation of ad hoc efforts is an indication of their popularity, less has been documented about the needs that they are designed to meet, the limitations that they face, and practical suggestions for holding successful efforts. To holistically understand the role of different ad hoc formats for data science, we surveyed organizers of ad hoc data science education efforts to understand how organizers perceived the events to have gone, including areas of strength and areas requiring growth. We also gathered recommendations from these past events for future organizers. Our results suggest that the perceived benefits of ad hoc efforts go beyond developing technical skills and may provide continued benefit in conjunction with formal curricula, which warrants further investigation. As increasing numbers of researchers from computational fields with a history of complex data become involved with ad hoc efforts to share their skills, the lessons learned that we extract from the surveys will provide concrete suggestions for the practitioner-leaders interested in creating, improving, and sustaining future efforts.
|
29
|
McDonough CW, Breitenstein MK, Shahin M, Empey PE, Freimuth RR, Li L, Liebman M, Tuteja S. Translational Informatics Connects Real-World Information to Knowledge in an Increasingly Data-Driven World. Clin Pharmacol Ther 2020; 107:738-741. [PMID: 31837229 PMCID: PMC7678684 DOI: 10.1002/cpt.1719] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/20/2019] [Accepted: 11/01/2019] [Indexed: 11/07/2022]
|
30
|
Abstract
Background. Accurate diagnosis of patients' preferences is central to shared decision making. Missing from clinical practice is an approach that links pretreatment preferences and patient-reported outcomes. Objective. We propose a Bayesian collaborative filtering (CF) algorithm that combines pretreatment preferences and patient-reported outcomes to provide treatment recommendations. Design. We present the methodological details of a Bayesian CF algorithm designed to accomplish 3 tasks: 1) eliciting patient preferences using conjoint analysis surveys, 2) clustering patients into preference phenotypes, and 3) making treatment recommendations based on the posttreatment satisfaction of like-minded patients. We conduct a series of simulation studies to test the algorithm and to compare it to a 2-stage approach. Results. The Bayesian CF algorithm and the 2-stage approach performed similarly when there was extensive overlap between preference phenotypes. When the treatment was moderately associated with satisfaction, both methods made accurate recommendations. The kappa estimates measuring agreement between the true and predicted recommendations were 0.70 (95% confidence interval = 0.52-0.88) and 0.73 (0.56-0.90) under the Bayesian CF and 2-stage approaches, respectively. The 2-stage approach failed to converge in settings in which clusters were well separated, whereas the Bayesian CF algorithm produced acceptable results, with kappas of 0.73 (0.56-0.90) and 0.83 (0.69-0.97) for scenarios with moderate and large treatment effects, respectively. Limitations. Our approach assumes that the patient population is composed of distinct preference phenotypes, that there is an association between treatment and outcomes, and that treatment effects vary across phenotypes. Findings are also limited to simulated data. Conclusion. The Bayesian CF algorithm is feasible, provides accurate cluster treatment recommendations, and outperforms 2-stage estimation when clusters are well separated. As such, the approach serves as a roadmap for incorporating predictive analytics into shared decision making.
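The agreement statistic reported in this abstract is a kappa between true and predicted recommendations. As a hedged illustration of how such a figure is obtained, here is a minimal sketch of Cohen's kappa for two label sequences. The function name and the toy labels are hypothetical and are not taken from the study; the sketch does not compute a confidence interval.

```python
from collections import Counter


def cohen_kappa(truth, predicted):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(truth)
    # Observed agreement: fraction of positions where the labels match.
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    # Expected agreement if both label sets were assigned independently.
    truth_counts = Counter(truth)
    pred_counts = Counter(predicted)
    expected = sum(truth_counts[k] * pred_counts[k] for k in truth_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1 - expected)


# Toy example: 10 patients, 2 possible recommendations, 8 of 10 match.
truth = ["A"] * 5 + ["B"] * 5
predicted = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "A"]
kappa = cohen_kappa(truth, predicted)
print(round(kappa, 2))
```

With 80% observed agreement and 50% agreement expected by chance, the toy data give a kappa of about 0.6, which shows why kappa is lower than raw accuracy.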
|
31
|
Xu H, Li J, Jiang X, Chen Q. Electronic Health Records for Drug Repurposing: Current Status, Challenges, and Future Directions. Clin Pharmacol Ther 2020; 107:712-714. [PMID: 32012237 PMCID: PMC10815929 DOI: 10.1002/cpt.1769] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/23/2019] [Accepted: 01/06/2020] [Indexed: 12/20/2022]
Abstract
It is well recognized that the global pharmaceutical industry now faces challenges such as high costs and low productivity when developing new drugs (e.g., it is estimated that the average cost for developing a new drug ranges from US $2 billion to $3 billion, with the total time to bring it to the market being about 13–15 years) [1]. Therefore, drug repurposing (also called drug repositioning/reprofiling), which finds new indications for existing drugs, has received great attention in the past decade. Drug repurposing can reduce drug development time while improving success rates, because the toxicity profiles of existing drugs are already known. Studies have shown that new applications for repurposed drugs have nearly a 30% success rate for US Food and Drug Administration (FDA) approval, whereas traditional new drug applications have a < 10% approval rate.
|
32
|
Polasek TM, Rostami-Hodjegan A. Virtual Twins: Understanding the Data Required for Model-Informed Precision Dosing. Clin Pharmacol Ther 2020; 107:742-745. [PMID: 32056199 DOI: 10.1002/cpt.1778] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 11/20/2019] [Accepted: 01/13/2020] [Indexed: 12/16/2022]
|
33
|
Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, Gouil Q. Opportunities and challenges in long-read sequencing data analysis. Genome Biol 2020; 21:30. [PMID: 32033565 PMCID: PMC7006217 DOI: 10.1186/s13059-020-1935-5] [Citation(s) in RCA: 685] [Impact Index Per Article: 171.3] [Received: 08/28/2019] [Accepted: 01/15/2020] [Indexed: 12/11/2022]
Abstract
Long-read technologies are overcoming early limitations in accuracy and throughput, broadening their application domains in genomics. Dedicated analysis tools that take into account the characteristics of long-read data are thus required, but the fast pace of development of such tools can be overwhelming. To assist in the design and analysis of long-read sequencing projects, we review the current landscape of available tools and present an online interactive database, long-read-tools.org, to facilitate their browsing. We further focus on the principles of error correction, base modification detection, and long-read transcriptomics analysis and highlight the challenges that remain.
|
34
|
Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, Vallejos CA, Campbell KR, Beerenwinkel N, Mahfouz A, Pinello L, Skums P, Stamatakis A, Attolini CSO, Aparicio S, Baaijens J, Balvert M, Barbanson BD, Cappuccio A, Corleone G, Dutilh BE, Florescu M, Guryev V, Holmer R, Jahn K, Lobo TJ, Keizer EM, Khatri I, Kielbasa SM, Korbel JO, Kozlov AM, Kuo TH, Lelieveldt BP, Mandoiu II, Marioni JC, Marschall T, Mölder F, Niknejad A, Rączkowska A, Reinders M, Ridder JD, Saliba AE, Somarakis A, Stegle O, Theis FJ, Yang H, Zelikovsky A, McHardy AC, Raphael BJ, Shah SP, Schönhuth A. Eleven grand challenges in single-cell data science. Genome Biol 2020; 21:31. [PMID: 32033589 PMCID: PMC7007675 DOI: 10.1186/s13059-020-1926-6] [Citation(s) in RCA: 545] [Impact Index Per Article: 136.3] [Received: 08/02/2019] [Accepted: 01/02/2020] [Indexed: 02/08/2023]
Abstract
The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands, or even millions, of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.
|
35
|
Tran DT, Bhaskara A, Kuberan B, Might M. A graph-based algorithm for RNA-seq data normalization. PLoS One 2020; 15:e0227760. [PMID: 31978105 PMCID: PMC6980396 DOI: 10.1371/journal.pone.0227760] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/15/2019] [Accepted: 12/28/2019] [Indexed: 12/16/2022]
Abstract
The use of RNA-sequencing has garnered much attention in recent years for characterizing and understanding various biological systems. However, it remains a major challenge to gain insights from a large number of RNA-seq experiments collectively, due to the normalization problem. Normalization has been challenging due to an inherent circularity, requiring that RNA-seq data be normalized before any pattern of differential (or non-differential) expression can be ascertained; meanwhile, the prior knowledge of non-differential transcripts is crucial to the normalization process. Some methods have successfully overcome this problem by the assumption that most transcripts are not differentially expressed. However, when RNA-seq profiles become more abundant and heterogeneous, this assumption fails to hold, leading to erroneous normalization. We present a normalization procedure that does not rely on this assumption, nor prior knowledge about the reference transcripts. This algorithm is based on a graph constructed from intrinsic correlations among RNA-seq transcripts and seeks to identify a set of densely connected vertices as references. Application of this algorithm on our synthesized validation data showed that it could recover the reference transcripts with high precision, thus resulting in high-quality normalization. On a realistic data set from the ENCODE project, this algorithm gave good results and could finish in a reasonable time. These preliminary results imply that we may be able to break the long persisting circularity problem in RNA-seq normalization.
|
36
|
Pittard WS, Villaveces CK, Li S. A Bioinformatics Primer to Data Science, with Examples for Metabolomics. Methods Mol Biol 2020; 2104:245-263. [PMID: 31953822 DOI: 10.1007/978-1-0716-0239-3_14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/07/2023]
Abstract
With the increasing importance of big data in biomedicine, skills in data science are a foundation for individual career development and for the progress of science. This chapter is a practical guide to working with high-throughput biomedical data. It covers how to understand and set up the computing environment, how to start a research project with proper and effective data management, and how to perform common bioinformatics tasks such as data wrangling, quality control, statistical analysis, and visualization, with examples on metabolomics data. Concepts and tools related to coding and scripting are discussed. Version control, knitr and Jupyter notebooks are important to project management, collaboration, and research reproducibility. Overall, this chapter describes a core set of skills for working in bioinformatics, and can serve as a reference text for a graduate-level course that interfaces with data science.
|
37
|
|
38
|
Olatosi B, Zhang J, Weissman S, Hu J, Haider MR, Li X. Using big data analytics to improve HIV medical care utilisation in South Carolina: A study protocol. BMJ Open 2019; 9:e027688. [PMID: 31326931 PMCID: PMC6661700 DOI: 10.1136/bmjopen-2018-027688] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 11/07/2018] [Revised: 03/28/2019] [Accepted: 06/04/2019] [Indexed: 12/23/2022]
Abstract
INTRODUCTION Linkage and retention in HIV medical care remain problematic in the USA. Extensive health utilisation data collection through electronic health records (EHR) and claims data represent new opportunities for scientific discovery. Big data science (BDS) is a powerful tool for investigating HIV care utilisation patterns. The South Carolina (SC) office of Revenue and Fiscal Affairs (RFA) data warehouse captures individual-level longitudinal health utilisation data for persons living with HIV (PLWH). The data warehouse includes EHR, claims and data from private institutions, housing, prisons, mental health, Medicare, Medicaid, State Health Plan and the department of health and human services. The purpose of this study is to describe the process for creating a comprehensive database of all SC PLWH, and plans for using BDS to explore, identify, characterise and explain new predictors of missed opportunities for HIV medical care utilisation. METHODS AND ANALYSIS This project will create person-level profiles guided by the Gelberg-Andersen Behavioral Model and describe new patterns of HIV care utilisation. The population for the comprehensive database comes from statewide HIV surveillance data (2005-2016) for all SC PLWH (N≈18000). Surveillance data are available from the state health department's enhanced HIV/AIDS Reporting System (e-HARS). Additional data pulls for the e-HARS population will include Ryan White HIV/AIDS Program Service Reports, Health Sciences SC data and Area Health Resource Files. These data will be linked to the RFA data and serve as sources for traditional and vulnerable domain Gelberg-Andersen Behavioral Model variables. The project will use BDS techniques such as machine learning to identify new predictors of HIV care utilisation behaviour among PLWH, and 'missed opportunities' for re-engaging them in care.
ETHICS AND DISSEMINATION The study team applied for data from different sources and submitted individual Institutional Review Board (IRB) applications to the University of South Carolina (USC) IRB and other local authorities/agencies/state departments. This study was approved by the USC IRB (#Pro00068124) in 2017. To protect the identity of the persons living with HIV (PLWH), researchers will only receive linked deidentified data from the RFA. Study findings will be disseminated at local community forums, community advisory group meetings, meetings with our state agencies, local partners and other key stakeholders (including PLWH, policy-makers and healthcare providers), presentations at academic conferences and through publication in peer-reviewed articles. Data security and patient confidentiality are the bedrock of this study. Extensive data agreements ensuring data security and patient confidentiality for the deidentified linked data have been established and are stringently adhered to. The RFA is authorised to collect and merge data from these different sources and to ensure the privacy of all PLWH. The legislatively mandated SC data oversight council reviewed the proposed process stringently before approving it. Researchers will get only the encrypted deidentified dataset to prevent any breach of privacy in the data transfer, management and analysis processes. In addition, established secure data governance rules, data encryption and encrypted predictive techniques will be deployed. In addition to the data anonymisation as a part of privacy-preserving analytics, encryption schemes that protect running prediction algorithms on encrypted data will also be deployed. Best practices and lessons learnt about the complex processes involved in negotiating and navigating multiple data sharing agreements between different entities are being documented for dissemination.
|
39
|
Grzegorczyk M, Aderhold A, Husmeier D. Overview and Evaluation of Recent Methods for Statistical Inference of Gene Regulatory Networks from Time Series Data. Methods Mol Biol 2019; 1883:49-94. [PMID: 30547396 DOI: 10.1007/978-1-4939-8882-2_3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 02/14/2023]
Abstract
A challenging problem in systems biology is the reconstruction of gene regulatory networks from postgenomic data. A variety of reverse engineering methods from machine learning and computational statistics have been proposed in the literature. However, deciding on the best method to adopt for a particular application or data set can be confusing. The present chapter provides a broad overview of state-of-the-art methods with an emphasis on conceptual understanding rather than a deluge of mathematical details, and the pros and cons of the various approaches are discussed. Guidance on practical applications, with pointers to publicly available software implementations, is included. The chapter concludes with a comprehensive comparative benchmark study on simulated data and a real-world application taken from current plant systems biology.
|
40
|
Kampe C, Reid G, Jones P, S C, S S, Vogel KM. Bringing the National Security Agency into the Classroom: Ethical Reflections on Academia-Intelligence Agency Partnerships. Sci Eng Ethics 2019; 25:869-898. [PMID: 29318451 DOI: 10.1007/s11948-017-9938-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2017] [Accepted: 06/21/2017] [Indexed: 06/07/2023]
Abstract
Academia-intelligence agency collaborations are on the rise for a variety of reasons. These can take many forms, one of which is in the classroom, using students to stand in for intelligence analysts. Classrooms, however, are ethically complex spaces, with students considered vulnerable populations, and become even more complex when layering multiple goals, activities, tools, and stakeholders over those traditionally present. This does not necessarily mean one must shy away from academia-intelligence agency partnerships in classrooms, but that these must be conducted carefully and reflexively. This paper hopes to contribute to this conversation by describing one purposeful classroom encounter that occurred between a professor, students, and intelligence practitioners in the fall of 2015 at North Carolina State University: an experiment conducted as part of a graduate-level political science class that involved students working with a prototype analytic technology, a type of participatory sensing/self-tracking device, developed by the National Security Agency. This experiment opened up the following questions that this paper will explore: What social, ethical, and pedagogical considerations arise with the deployment of a prototype intelligence technology in the college classroom, and how can they be addressed? How can academia-intelligence agency collaboration in the classroom be conducted in ways that provide benefits to all parties, while minimizing disruptions and negative consequences? This paper will discuss the experimental findings in the context of ethical perspectives involved in values in design and participatory/self-tracking data practices, and discuss lessons learned for the ethics of future academia-intelligence agency partnerships in the classroom.
|
41
|
Musa A, Tripathi S, Dehmer M, Yli-Harja O, Kauffman SA, Emmert-Streib F. Systems Pharmacogenomic Landscape of Drug Similarities from LINCS data: Drug Association Networks. Sci Rep 2019; 9:7849. [PMID: 31127155 PMCID: PMC6534546 DOI: 10.1038/s41598-019-44291-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 10/12/2018] [Accepted: 05/08/2019] [Indexed: 02/01/2023]
Abstract
Modern research in the biomedical sciences is data-driven, utilizing high-throughput technologies to generate big genomic data. The Library of Integrated Network-based Cellular Signatures (LINCS) is an example of a large-scale genomic data repository, providing hundreds of thousands of high-dimensional gene expression measurements for thousands of drugs and dozens of cell lines. However, the remaining challenge is how to use these data effectively for pharmacogenomics. In this paper, we use LINCS data to construct drug association networks (DANs) representing the relationships between drugs. By using the Anatomical Therapeutic Chemical (ATC) classification of drugs, we demonstrate that the DANs represent a systems pharmacogenomic landscape of drugs that meaningfully summarizes the entire LINCS repository on a genomic scale. Here we identify the modules of the DANs as therapeutic attractors of the ATC drug classes.
|
42
|
Rivière E, Quinton A, Dehail P. [Analysis of the discrimination of the final marks after the first computerized national ranking exam in Medicine in June 2016 in France]. Rev Med Interne 2019; 40:286-290. [PMID: 30902508 DOI: 10.1016/j.revmed.2018.10.386] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 05/10/2018] [Revised: 10/07/2018] [Accepted: 10/18/2018] [Indexed: 11/18/2022]
Abstract
INTRODUCTION The first computerised national ranking exam (cNRE) in Medicine was introduced in June 2016 for 8214 students. It consisted of 18 progressive clinical cases (PCCs) with multiple choice questions (MCQs), 120 independent MCQs and 2 scientific articles to critique. A lack of mark discrimination had motivated the cNRE reform. We aimed to assess the discrimination of the final marks after this first cNRE. METHODS A national Excel® file gathering overall statistics and marks was transmitted to the medical faculties after the cNRE. The mean points deviation between two papers and the percentage of points ranking 75% of students allowed us to analyse the discrimination of marks. RESULTS The national sigmoid distribution curve of the marks is superimposable on that of the previous NRE in 2015. In PCCs, 72% of students were ranked within 1090 points out of 7560 (14%). In independent MCQs, 73% of students were ranked within 434 points out of 2160 (20%). In critical analysis of articles, 75% of students were ranked within 225 points out of 1080 (21%). The above percentages of students are on the plateau of each discrimination curve for PCCs, independent MCQs and critical analysis of scientific articles. CONCLUSION The cNRE reduced the number of equally ranked students compared to 2015, with a mean deviation between two papers of 0.28 in 2016 vs 0.04 in 2015. Despite the new format introduced by the cNRE, 75% of students are still ranked within a low proportion of points that is equivalent to the previous NRE in 2015 (between 15 and 20% of points).
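The percentages quoted in the RESULTS of this abstract follow directly from the point spans and totals it gives. The short sketch below reproduces that arithmetic; the variable and function names are hypothetical, and only the numbers come from the abstract.

```python
# (exam section, point span containing roughly 75% of students, total points available)
sections = [
    ("PCCs", 1090, 7560),
    ("Independent MCQs", 434, 2160),
    ("Critical analysis of articles", 225, 1080),
]


def share_of_points(span, total):
    """Percentage of the available points covered by the given span."""
    return 100 * span / total


# Round to whole percentages, as the abstract reports them.
shares = {name: round(share_of_points(span, total)) for name, span, total in sections}
for name, pct in shares.items():
    print(f"{name}: {pct}% of points")
```

The smaller the share, the more students are packed into a narrow band of marks, which is what the abstract means by low discrimination.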
|
43
|
Alghamdi SM, Sundberg BA, Sundberg JP, Schofield PN, Hoehndorf R. Quantitative evaluation of ontology design patterns for combining pathology and anatomy ontologies. Sci Rep 2019; 9:4025. [PMID: 30858527 PMCID: PMC6411989 DOI: 10.1038/s41598-019-40368-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 07/31/2018] [Accepted: 02/14/2019] [Indexed: 12/28/2022]
Abstract
Data are increasingly annotated with multiple ontologies to capture rich information about the features of the subject under investigation. Analysis may be performed over each ontology separately, but recently there has been a move to combine multiple ontologies to provide more powerful analytical possibilities. However, it is often not clear how to combine ontologies or how to assess or evaluate the potential design patterns available. Here we use a large and well-characterized dataset of anatomic pathology descriptions from a major study of aging mice. We show how different design patterns based on the MPATH and MA ontologies provide orthogonal axes of analysis, and perform differently in over-representation and semantic similarity applications. We discuss how such a data-driven approach might be used generally to generate and evaluate ontology design patterns.
|
44
|
Abstract
Data science can be incorporated into every stage of a scientific study. Here we describe how data science can be used to generate hypotheses, to design experiments, to perform experiments, and to analyse data. We also present our vision for how data science techniques will be an integral part of the laboratory of the future.
|
45
|
Huang H, Tang H, Huang J, Chen B, Liu R, Tang RS, Lu Y, Yang P. Special Issue: Selected Papers of the Inaugural DahShu Data Science Symposium: Computational Precision Health (CPH 2017). J Comput Biol 2019; 24:635-636. [PMID: 28657834 DOI: 10.1089/cmb.2017.29007.hh] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
|
46
|
Lapidus M. Not All Library Analytics are Created Equal: LibAnswers to the Rescue! Med Ref Serv Q 2019; 38:41-55. [PMID: 30942679 DOI: 10.1080/02763869.2019.1548892] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2018] [Revised: 11/06/2018] [Accepted: 11/07/2018] [Indexed: 06/09/2023]
Abstract
The reasons for implementing, and the advantages of switching to, the Reference Analytics system, a part of the Springshare LibAnswers platform, for collecting reference statistics at a three-campus university library are described. The benefits of using this web-based product are highlighted through a comparison with the previously used analytical tools and the annual statistical data. Transitioning to Reference Analytics allowed librarians to take advantage of such features as seamless access to reference transactions, easy customization, cross-tabulation, and data visualization, proving beneficial for overall library reference services.
|
47
|
Abstract
Gene regulatory networks are powerful abstractions of biological systems. Since the advent of high-throughput measurement technologies in biology in the late 1990s, reconstructing the structure of such networks has been a central computational problem in systems biology. While the problem is certainly not solved in its entirety, considerable progress has been made in the last two decades, with mature tools now available. This chapter aims to provide an introduction to the basic concepts underpinning network inference tools, attempting a categorization which highlights commonalities and relative strengths. While the chapter is meant to be self-contained, the material presented should provide a useful background to the later, more specialized chapters of this book.
|
48
|
Levin-Schwartz Y, Calhoun VD, Adalı T. A method to compare the discriminatory power of data-driven methods: Application to ICA and IVA. J Neurosci Methods 2019; 311:267-276. [PMID: 30389489 PMCID: PMC6258321 DOI: 10.1016/j.jneumeth.2018.10.008] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/19/2018] [Revised: 08/24/2018] [Accepted: 10/08/2018] [Indexed: 11/20/2022]
Abstract
BACKGROUND The widespread application of data-driven factorization-based methods, such as independent component analysis (ICA), to functional magnetic resonance imaging data facilitates the study of neural function and how it is disrupted by psychiatric disorders such as schizophrenia. While the increasing number of these methods motivates a comparison of their relative performance, such a comparison is difficult to perform on real fMRI data, since the ground truth is, relatively, unknown and the alignment of factors across different methods is impractical and imprecise. NEW METHOD We present a novel method, global difference maps (GDMs), to compare the results of different fMRI analysis techniques on real fMRI data, quantify their relative performances, and highlight the differences between the decompositions visually. COMPARISON WITH EXISTING METHODS We apply this method to compare the performances of two different factorization-based methods, ICA and its multiset extension independent vector analysis (IVA), for the analysis of fMRI data from 109 patients with schizophrenia and 138 healthy controls during the performance of three tasks. RESULTS Through this application of GDMs, we find that IVA can determine regions that are more discriminatory between patients and controls than ICA, though IVA is less effective at emphasizing regions found in only a subset of the tasks. CONCLUSIONS These results demonstrate that GDMs are an effective way to compare the performances of different factorization-based methods as well as regression-based analyses.
|
49
|
Stevens SLR, Kuzak M, Martinez C, Moser A, Bleeker P, Galland M. Building a local community of practice in scientific programming for life scientists. PLoS Biol 2018; 16:e2005561. [PMID: 30485260 PMCID: PMC6287879 DOI: 10.1371/journal.pbio.2005561] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Revised: 12/10/2018] [Indexed: 11/18/2022]
Abstract
In this paper, we describe why and how to build a local community of practice in scientific programming for life scientists who use computers and programming in their research. A community of practice is a small group of scientists who meet regularly to help each other and promote good practices in scientific programming. While most life scientists are well trained in the laboratory to conduct experiments, good practices with (big) data sets and their analysis are often missing. We propose a model on how to build such a community of practice at a local academic institution, present two real-life examples, and introduce challenges and implemented solutions. We believe that the current data deluge that life scientists face can benefit from the implementation of these small communities. Good practices spread among experimental scientists will foster open, transparent, and sound scientific results beneficial to society.
|
50
|
Cohen MC, Guetta CD, Jiao K, Provost F. Data-Driven Investment Strategies for Peer-to-Peer Lending: A Case Study for Teaching Data Science. BIG DATA 2018; 6:191-213. [PMID: 30283728 PMCID: PMC6154448 DOI: 10.1089/big.2018.0092] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/08/2023]
Abstract
We develop a number of data-driven investment strategies that demonstrate how machine learning and data analytics can be used to guide investments in peer-to-peer loans. We detail the process starting with the acquisition of (real) data from a peer-to-peer lending platform all the way to the development and evaluation of investment strategies based on a variety of approaches. We focus heavily on how to apply and evaluate the data science methods, and resulting strategies, in a real-world business setting. The material presented in this article can be used by instructors who teach data science courses at the undergraduate or graduate levels. Importantly, we go beyond just evaluating predictive performance of models, to assess how well the strategies would actually perform, using real, publicly available data. Our treatment is comprehensive and ranges from qualitative to technical, but is also modular, which gives instructors the flexibility to focus on specific parts of the case, depending on the topics they want to cover. The learning concepts include the following: data cleaning and ingestion, classification/probability estimation modeling, regression modeling, analytical engineering, calibration curves, data leakage, evaluation of model performance, basic portfolio optimization, evaluation of investment strategies, and using Python for data science.
|