1
|
Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, Gouil Q. Opportunities and challenges in long-read sequencing data analysis. Genome Biol 2020; 21:30. [PMID: 32033565 PMCID: PMC7006217 DOI: 10.1186/s13059-020-1935-5] [Citation(s) in RCA: 919] [Impact Index Per Article: 183.8] [Received: 08/28/2019] [Accepted: 01/15/2020] [Indexed: 12/11/2022] Open
Abstract
Long-read technologies are overcoming early limitations in accuracy and throughput, broadening their application domains in genomics. Dedicated analysis tools that take into account the characteristics of long-read data are thus required, but the fast pace of development of such tools can be overwhelming. To assist in the design and analysis of long-read sequencing projects, we review the current landscape of available tools and present an online interactive database, long-read-tools.org, to facilitate their browsing. We further focus on the principles of error correction, base modification detection, and long-read transcriptomics analysis and highlight the challenges that remain.
|
Review |
5 |
919 |
2
|
Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, Vallejos CA, Campbell KR, Beerenwinkel N, Mahfouz A, Pinello L, Skums P, Stamatakis A, Attolini CSO, Aparicio S, Baaijens J, Balvert M, Barbanson BD, Cappuccio A, Corleone G, Dutilh BE, Florescu M, Guryev V, Holmer R, Jahn K, Lobo TJ, Keizer EM, Khatri I, Kielbasa SM, Korbel JO, Kozlov AM, Kuo TH, Lelieveldt BP, Mandoiu II, Marioni JC, Marschall T, Mölder F, Niknejad A, Rączkowska A, Reinders M, Ridder JD, Saliba AE, Somarakis A, Stegle O, Theis FJ, Yang H, Zelikovsky A, McHardy AC, Raphael BJ, Shah SP, Schönhuth A. Eleven grand challenges in single-cell data science. Genome Biol 2020; 21:31. [PMID: 32033589 PMCID: PMC7007675 DOI: 10.1186/s13059-020-1926-6] [Citation(s) in RCA: 690] [Impact Index Per Article: 138.0] [Received: 08/02/2019] [Accepted: 01/02/2020] [Indexed: 02/08/2023] Open
Abstract
The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands, or even millions, of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.
|
Research Support, N.I.H., Extramural |
5 |
690 |
3
|
Botvinik-Nezer R, Holzmeister F, Camerer CF, Dreber A, Huber J, Johannesson M, Kirchler M, Iwanir R, Mumford JA, Adcock RA, Avesani P, Baczkowski BM, Bajracharya A, Bakst L, Ball S, Barilari M, Bault N, Beaton D, Beitner J, Benoit RG, Berkers RMWJ, Bhanji JP, Biswal BB, Bobadilla-Suarez S, Bortolini T, Bottenhorn KL, Bowring A, Braem S, Brooks HR, Brudner EG, Calderon CB, Camilleri JA, Castrellon JJ, Cecchetti L, Cieslik EC, Cole ZJ, Collignon O, Cox RW, Cunningham WA, Czoschke S, Dadi K, Davis CP, Luca AD, Delgado MR, Demetriou L, Dennison JB, Di X, Dickie EW, Dobryakova E, Donnat CL, Dukart J, Duncan NW, Durnez J, Eed A, Eickhoff SB, Erhart A, Fontanesi L, Fricke GM, Fu S, Galván A, Gau R, Genon S, Glatard T, Glerean E, Goeman JJ, Golowin SAE, González-García C, Gorgolewski KJ, Grady CL, Green MA, Guassi Moreira JF, Guest O, Hakimi S, Hamilton JP, Hancock R, Handjaras G, Harry BB, Hawco C, Herholz P, Herman G, Heunis S, Hoffstaedter F, Hogeveen J, Holmes S, Hu CP, Huettel SA, Hughes ME, Iacovella V, Iordan AD, Isager PM, Isik AI, Jahn A, Johnson MR, Johnstone T, Joseph MJE, Juliano AC, Kable JW, Kassinopoulos M, Koba C, Kong XZ, Koscik TR, Kucukboyaci NE, Kuhl BA, Kupek S, Laird AR, Lamm C, Langner R, Lauharatanahirun N, Lee H, Lee S, Leemans A, Leo A, Lesage E, Li F, Li MYC, Lim PC, Lintz EN, Liphardt SW, Losecaat Vermeer AB, Love BC, Mack ML, Malpica N, Marins T, Maumet C, McDonald K, McGuire JT, Melero H, Méndez Leal AS, Meyer B, Meyer KN, Mihai G, Mitsis GD, Moll J, Nielson DM, Nilsonne G, Notter MP, Olivetti E, Onicas AI, Papale P, Patil KR, Peelle JE, Pérez A, Pischedda D, Poline JB, Prystauka Y, Ray S, Reuter-Lorenz PA, Reynolds RC, Ricciardi E, Rieck JR, Rodriguez-Thompson AM, Romyn A, Salo T, Samanez-Larkin GR, Sanz-Morales E, Schlichting ML, Schultz DH, Shen Q, Sheridan MA, Silvers JA, Skagerlund K, Smith A, Smith DV, Sokol-Hessner P, Steinkamp SR, Tashjian SM, Thirion B, Thorp JN, Tinghög G, Tisdall L, Tompson SH, Toro-Serey C, Torre Tresols JJ, Tozzi L, Truong V, Turella L, van 't Veer AE, Verguts T, Vettel JM, Vijayarajah S, Vo K, Wall MB, Weeda WD, Weis S, White DJ, Wisniewski D, Xifra-Porxas A, Yearling EA, Yoon S, Yuan R, Yuen KSL, Zhang L, Zhang X, Zosky JE, Nichols TE, Poldrack RA, Schonberg T. Variability in the analysis of a single neuroimaging dataset by many teams. Nature 2020; 582:84-88. [PMID: 32483374 PMCID: PMC7771346 DOI: 10.1038/s41586-020-2314-9] [Citation(s) in RCA: 522] [Impact Index Per Article: 104.4] [Received: 11/14/2019] [Accepted: 04/07/2020] [Indexed: 01/13/2023]
Abstract
Data analysis workflows in many scientific domains have become increasingly complex and flexible. Here we assess the effect of this flexibility on the results of functional magnetic resonance imaging by asking 70 independent teams to analyse the same dataset, testing the same 9 ex-ante hypotheses [1]. The flexibility of analytical approaches is exemplified by the fact that no two teams chose identical workflows to analyse the data. This flexibility resulted in sizeable variation in the results of hypothesis tests, even for teams whose statistical maps were highly correlated at intermediate stages of the analysis pipeline. Variation in reported results was related to several aspects of analysis methodology. Notably, a meta-analytical approach that aggregated information across teams yielded a significant consensus in activated regions. Furthermore, prediction markets of researchers in the field revealed an overestimation of the likelihood of significant findings, even by researchers with direct knowledge of the dataset [2-5]. Our findings show that analytical flexibility can have substantial effects on scientific conclusions, and identify factors that may be related to variability in the analysis of functional magnetic resonance imaging. The results emphasize the importance of validating and sharing complex analysis workflows, and demonstrate the need for performing and reporting multiple analyses of the same data. Potential approaches that could be used to mitigate issues related to analytical variability are discussed.
|
Research Support, N.I.H., Extramural |
5 |
522 |
4
|
Yang Y, Liu X, Shen C, Lin Y, Yang P, Qiao L. In silico spectral libraries by deep learning facilitate data-independent acquisition proteomics. Nat Commun 2020; 11:146. [PMID: 31919359 PMCID: PMC6952453 DOI: 10.1038/s41467-019-13866-z] [Citation(s) in RCA: 122] [Impact Index Per Article: 24.4] [Received: 06/16/2019] [Accepted: 12/04/2019] [Indexed: 11/12/2022] Open
Abstract
Data-independent acquisition (DIA) is an emerging technology for quantitative proteomic analysis of large cohorts of samples. However, sample-specific spectral libraries built by data-dependent acquisition (DDA) experiments are required prior to DIA analysis, which is time-consuming and limits the identification/quantification by DIA to the peptides identified by DDA. Herein, we propose DeepDIA, a deep learning-based approach to generate in silico spectral libraries for DIA analysis. We demonstrate that the quality of in silico libraries predicted by instrument-specific models using DeepDIA is comparable to that of experimental libraries, and outperforms libraries generated by global models. With peptide detectability prediction, in silico libraries can be built directly from protein sequence databases. We further illustrate that DeepDIA can break through the limitation of DDA on peptide/protein detection, and enhance DIA analysis on human serum samples compared to the state-of-the-art protocol using a DDA library. We expect this work to expand the toolbox for DIA proteomics.
|
research-article |
5 |
122 |
5
|
Abstract
Data, including information generated from them by processing and analysis, are an asset with measurable value. The assets that biological research funding produces are the data generated, the information derived from these data, and, ultimately, the discoveries and knowledge these lead to. From the time when Henry Oldenburg published the first scientific journal in 1665 (Proceedings of the Royal Society) to the founding of the United States National Library of Medicine in 1879 to the present, there has been a sustained drive to improve how researchers can record and discover what is known. Researchers’ experimental work builds upon years and (collectively) billions of dollars’ worth of earlier work. Today, researchers are generating data at ever-faster rates because of advances in instrumentation and technology, coupled with decreases in production costs. Unfortunately, the ability of researchers to manage and disseminate their results has not kept pace, so their work cannot achieve its maximal impact. Strides have recently been made, but more awareness is needed of the essential role that biological data resources, including biocuration, play in maintaining and linking this ever-growing flood of data and information. The aim of this paper is to describe the nature of data as an asset, the role biocurators play in increasing its value, and consistent, practical means to measure effectiveness that can guide planning and justify costs in biological research information resources’ development and management.
|
Journal Article |
7 |
58 |
6
|
Straw I, Callison-Burch C. Artificial Intelligence in mental health and the biases of language based models. PLoS One 2020; 15:e0240376. [PMID: 33332380 PMCID: PMC7745984 DOI: 10.1371/journal.pone.0240376] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Received: 03/01/2020] [Accepted: 09/07/2020] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND The rapid integration of Artificial Intelligence (AI) into the healthcare field has occurred with little communication between computer scientists and doctors. The impact of AI on health outcomes and inequalities calls for health professionals and data scientists to make a collaborative effort to ensure historic health disparities are not encoded into the future. We present a study that evaluates bias in existing Natural Language Processing (NLP) models used in psychiatry and discuss how these biases may widen health inequalities. Our approach systematically evaluates each stage of model development to explore how biases arise from a clinical, data science and linguistic perspective. DESIGN/METHODS A literature review of the uses of NLP in mental health was carried out across multiple disciplinary databases with defined Mesh terms and keywords. Our primary analysis evaluated biases within 'GloVe' and 'Word2Vec' word embeddings. Euclidean distances were measured to assess relationships between psychiatric terms and demographic labels, and vector similarity functions were used to solve analogy questions relating to mental health. RESULTS Our primary analysis of mental health terminology in GloVe and Word2Vec embeddings demonstrated significant biases with respect to religion, race, gender, nationality, sexuality and age. Our literature review returned 52 papers, of which none addressed all the areas of possible bias that we identify in model development. In addition, only one article existed on more than one research database, demonstrating the isolation of research within disciplinary silos and inhibiting cross-disciplinary collaboration or communication. CONCLUSION Our findings are relevant to professionals who wish to minimize the health inequalities that may arise as a result of AI and data-driven algorithms. We offer primary research identifying biases within these technologies and provide recommendations for avoiding these harms in the future.
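The two measurements named in the methods, Euclidean distance between word vectors and analogy solving by vector similarity, can be sketched as follows. This is a minimal illustration, not the authors' code: the three-dimensional vectors in `EMBEDDINGS` are hypothetical toy values (real GloVe or Word2Vec vectors have hundreds of dimensions and are loaded from pretrained files), and the vector-offset analogy rule with cosine similarity is a common convention assumed here rather than confirmed by the abstract.

```python
import math

# Hypothetical toy vectors for illustration only; not taken from GloVe or Word2Vec.
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}


def euclidean(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def solve_analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' with the vector-offset rule (b - a + c)."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Exclude the three query words so the answer is a new word.
    candidates = [word for word in embeddings if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(target, embeddings[word]))


if __name__ == "__main__":
    print(euclidean(EMBEDDINGS["man"], EMBEDDINGS["woman"]))
    print(solve_analogy("man", "king", "woman", EMBEDDINGS))
```

In a bias audit of the kind described, the same two functions would be applied to pairs of psychiatric terms and demographic labels, and a systematic difference in distance between demographic groups would be read as a sign of encoded bias.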
|
Review |
5 |
51 |
7
|
Aerts HJWL. Data Science in Radiology: A Path Forward. Clin Cancer Res 2018; 24:532-534. [PMID: 29097379 PMCID: PMC5810958 DOI: 10.1158/1078-0432.ccr-17-2804] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Received: 09/25/2017] [Revised: 10/20/2017] [Accepted: 10/31/2017] [Indexed: 11/16/2022]
Abstract
Artificial intelligence (AI), especially deep learning, has the potential to fundamentally alter clinical radiology. AI algorithms, which excel in quantifying complex patterns in data, have shown remarkable progress in applications ranging from self-driving cars to speech recognition. The AI application within radiology, known as radiomics, can provide detailed quantifications of the radiographic characteristics of underlying tissues. This information can be used throughout the clinical care path to improve diagnosis and treatment planning, as well as assess treatment response. This tremendous potential for clinical translation has led to a vast increase in the number of research studies being conducted in the field, a number that is expected to rise sharply in the future. Many studies have reported robust and meaningful findings; however, a growing number also suffer from flawed experimental or analytic designs. Such errors could not only result in invalid discoveries, but also may lead others to perpetuate similar flaws in their own work. This perspective article aims to increase awareness of the issue, identify potential reasons why this is happening, and provide a path forward. Clin Cancer Res; 24(3); 532-4. ©2017 AACR.
|
Letter |
7 |
42 |
8
|
Abstract
Gene regulatory networks are powerful abstractions of biological systems. Since the advent of high-throughput measurement technologies in biology in the late 1990s, reconstructing the structure of such networks has been a central computational problem in systems biology. While the problem is certainly not solved in its entirety, considerable progress has been made in the last two decades, with mature tools now available. This chapter aims to provide an introduction to the basic concepts underpinning network inference tools, attempting a categorization which highlights commonalities and relative strengths. While the chapter is meant to be self-contained, the material presented should provide a useful background to the later, more specialized chapters of this book.
|
Introductory Journal Article |
6 |
40 |
9
|
Abstract
Overfitting is one of the critical problems in developing models by machine learning. With machine learning becoming an essential technology in computational biology, we must include training about overfitting in all courses that introduce this technology to students and practitioners. We here propose a hands-on training for overfitting that is suitable for introductory level courses and can be carried out on its own or embedded within any data science course. We use workflow-based design of machine learning pipelines, experimentation-based teaching, and hands-on approach that focuses on concepts rather than underlying mathematics. We here detail the data analysis workflows we use in training and motivate them from the viewpoint of teaching goals. Our proposed approach relies on Orange, an open-source data science toolbox that combines data visualization and machine learning, and that is tailored for education in machine learning and explorative data analysis. Every teacher strives for an a-ha moment, a sudden revelation by the student who gained a fundamental insight she will always remember. In the past years, authors of this paper have been tailoring their courses in machine learning to include material that could lead students to such discoveries. We aim to expose machine learning to practitioners–not only computer scientists but also molecular biologists and students of biomedicine, that is, the end-users of bioinformatics’ computational approaches. In this article, we lay out a course that aims to teach about overfitting, one of the key concepts in machine learning that needs to be understood, mastered, and avoided in data science applications. We propose a hands-on approach that uses an open-source workflow-based data science toolbox that combines data visualization and machine learning. In the proposed training about overfitting, we first deceive the students, then expose the problem, and finally challenge them to find the solution. 
In the paper, we present three lessons in overfitting and associated data analysis workflows and motivate the use of introduced computation methods by relating them to concepts conveyed by instructors.
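The kind of deception the abstract alludes to can be reproduced in a few lines. The sketch below is an assumption-laden illustration and does not use Orange or the authors' workflows: it fits a 1-nearest-neighbour classifier (chosen here only because it memorises its training data) to pure noise, so any apparent skill on the training set is overfitting by construction. All names and sizes are hypothetical.

```python
import random


def nearest_neighbour_predict(train, point):
    """Return the label of the training row closest to `point` (1-NN)."""
    closest = min(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], point)),
    )
    return closest[1]


def accuracy(train, data):
    """Fraction of rows in `data` whose label the 1-NN model predicts correctly."""
    hits = sum(nearest_neighbour_predict(train, x) == y for x, y in data)
    return hits / len(data)


def make_noise(n_rows, rng, n_features=5):
    """Random features with random binary labels: there is nothing to learn."""
    return [
        ([rng.random() for _ in range(n_features)], rng.randint(0, 1))
        for _ in range(n_rows)
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_noise(200, rng)
test = make_noise(200, rng)

train_acc = accuracy(train, train)  # scored on the data the model memorised
test_acc = accuracy(train, test)    # scored on data the model has never seen

print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

Scoring on the training rows looks perfect because each row is its own nearest neighbour; scoring on held-out rows falls back to roughly chance, which is the gap a lesson on overfitting is meant to expose.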
|
Research Support, Non-U.S. Gov't |
4 |
38 |
10
|
Olatosi B, Zhang J, Weissman S, Hu J, Haider MR, Li X. Using big data analytics to improve HIV medical care utilisation in South Carolina: A study protocol. BMJ Open 2019; 9:e027688. [PMID: 31326931 PMCID: PMC6661700 DOI: 10.1136/bmjopen-2018-027688] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Received: 11/07/2018] [Revised: 03/28/2019] [Accepted: 06/04/2019] [Indexed: 12/23/2022] Open
Abstract
INTRODUCTION Linkage and retention in HIV medical care remains problematic in the USA. Extensive health utilisation data collection through electronic health records (EHR) and claims data represent new opportunities for scientific discovery. Big data science (BDS) is a powerful tool for investigating HIV care utilisation patterns. The South Carolina (SC) office of Revenue and Fiscal Affairs (RFA) data warehouse captures individual-level longitudinal health utilisation data for persons living with HIV (PLWH). The data warehouse includes EHR, claims and data from private institutions, housing, prisons, mental health, Medicare, Medicaid, State Health Plan and the department of health and human services. The purpose of this study is to describe the process for creating a comprehensive database of all SC PLWH, and plans for using BDS to explore, identify, characterise and explain new predictors of missed opportunities for HIV medical care utilisation. METHODS AND ANALYSIS This project will create person-level profiles guided by the Gelberg-Andersen Behavioral Model and describe new patterns of HIV care utilisation. The population for the comprehensive database comes from statewide HIV surveillance data (2005-2016) for all SC PLWH (N≈18000). Surveillance data are available from the state health department's enhanced HIV/AIDS Reporting System (e-HARS). Additional data pulls for the e-HARS population will include Ryan White HIV/AIDS Program Service Reports, Health Sciences SC data and Area Health Resource Files. These data will be linked to the RFA data and serve as sources for traditional and vulnerable domain Gelberg-Andersen Behavioral Model variables. The project will use BDS techniques such as machine learning to identify new predictors of HIV care utilisation behaviour among PLWH, and 'missed opportunities' for re-engaging them back into care. 
ETHICS AND DISSEMINATION The study team applied for data from different sources and submitted individual Institutional Review Board (IRB) applications to the University of South Carolina (USC) IRB and other local authorities/agencies/state departments. This study was approved by the USC IRB (#Pro00068124) in 2017. To protect the identity of the persons living with HIV (PLWH), researchers will only receive linked deidentified data from the RFA. Study findings will be disseminated at local community forums, community advisory group meetings, meetings with our state agencies, local partners and other key stakeholders (including PLWH, policy-makers and healthcare providers), presentations at academic conferences and through publication in peer-reviewed articles. Data security and patient confidentiality are the bedrock of this study. Extensive data agreements ensuring data security and patient confidentiality for the deidentified linked data have been established and are stringently adhered to. The RFA is authorised to collect and merge data from these different sources and to ensure the privacy of all PLWH. The legislatively mandated SC data oversight council reviewed the proposed process stringently before approving it. Researchers will get only the encrypted deidentified dataset to prevent any breach of privacy in the data transfer, management and analysis processes. In addition, established secure data governance rules, data encryption and encrypted predictive techniques will be deployed. In addition to the data anonymisation as a part of privacy-preserving analytics, encryption schemes that protect running prediction algorithms on encrypted data will also be deployed. Best practices and lessons learnt about the complex processes involved in negotiating and navigating multiple data sharing agreements between different entities are being documented for dissemination.
|
Research Support, N.I.H., Extramural |
6 |
37 |
11
|
Pluchino A, Biondo AE, Giuffrida N, Inturri G, Latora V, Le Moli R, Rapisarda A, Russo G, Zappalà C. A novel methodology for epidemic risk assessment of COVID-19 outbreak. Sci Rep 2021; 11:5304. [PMID: 33674627 PMCID: PMC7935987 DOI: 10.1038/s41598-021-82310-4] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 05/06/2020] [Accepted: 01/19/2021] [Indexed: 12/24/2022] Open
Abstract
We propose a novel data-driven framework for assessing the a-priori epidemic risk of a geographical area and for identifying high-risk areas within a country. Our risk index is evaluated as a function of three different components: the hazard of the disease, the exposure of the area and the vulnerability of its inhabitants. As an application, we discuss the case of COVID-19 outbreak in Italy. We characterize each of the twenty Italian regions by using available historical data on air pollution, human mobility, winter temperature, housing concentration, health care density, population size and age. We find that the epidemic risk is higher in some of the Northern regions than in Central and Southern Italy. The corresponding risk index shows correlations with the available official data on the number of infected individuals, patients in intensive care and deceased patients, and can help explain why regions such as Lombardia, Emilia-Romagna, Piemonte and Veneto have suffered much more than the rest of the country. Although the COVID-19 outbreak started in both North (Lombardia) and Central Italy (Lazio) almost at the same time, when the first cases were officially certified at the beginning of 2020, the disease has spread faster and with heavier consequences in regions with higher epidemic risk. Our framework can be extended and tested on other epidemic data, such as those on seasonal flu, and applied to other countries. We also present a policy model connected with our methodology, which might help policy-makers to take informed decisions.
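A risk index built from three components can be illustrated with a short sketch. The abstract says only that risk is "a function of" hazard, exposure and vulnerability; the multiplicative form and the min-max normalisation used below are assumptions borrowed from conventional risk assessment, and the region names and indicator values are hypothetical placeholders, not data from the study.

```python
def min_max_normalise(values):
    """Rescale a list of numbers to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0] * len(values)  # no spread, so no region stands out
    return [(v - low) / (high - low) for v in values]


def risk_index(hazard, exposure, vulnerability):
    """Assumed multiplicative combination: risk vanishes if any component is zero."""
    return hazard * exposure * vulnerability


# Hypothetical raw indicators for three made-up regions.
regions = ["A", "B", "C"]
hazard = min_max_normalise([10.0, 20.0, 30.0])
exposure = min_max_normalise([5.0, 15.0, 10.0])
vulnerability = min_max_normalise([2.0, 4.0, 6.0])

scores = {
    name: risk_index(h, e, v)
    for name, h, e, v in zip(regions, hazard, exposure, vulnerability)
}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for name, score in ranked:
    print(f"region {name}: risk {score:.2f}")
```

A product makes the components non-substitutable, so a region with no exposure scores zero however vulnerable its population; a weighted sum would behave differently, and which form is appropriate is a modelling choice.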
|
research-article |
4 |
36 |
12
|
Polasek TM, Rostami-Hodjegan A. Virtual Twins: Understanding the Data Required for Model-Informed Precision Dosing. Clin Pharmacol Ther 2020; 107:742-745. [PMID: 32056199 DOI: 10.1002/cpt.1778] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 11/20/2019] [Accepted: 01/13/2020] [Indexed: 12/16/2022]
|
Journal Article |
5 |
30 |
13
|
Zhan C, Tse CK, Lai Z, Hao T, Su J. Prediction of COVID-19 spreading profiles in South Korea, Italy and Iran by data-driven coding. PLoS One 2020; 15:e0234763. [PMID: 32628673 PMCID: PMC7337285 DOI: 10.1371/journal.pone.0234763] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 03/09/2020] [Accepted: 06/02/2020] [Indexed: 11/18/2022] Open
Abstract
This work applies a data-driven coding method for prediction of the COVID-19 spreading profile in any given population that shows an initial phase of epidemic progression. Based on the historical data collected for COVID-19 spreading in 367 cities in China and the set of parameters of the augmented Susceptible-Exposed-Infected-Removed (SEIR) model obtained for each city, a set of profile codes representing a variety of transmission mechanisms and contact topologies is formed. By comparing the data of an early outbreak of a given population with the complete set of historical profiles, the best fit profiles are selected and the corresponding sets of profile codes are used for prediction of the future progression of the epidemic in that population. Application of the method to the data collected for South Korea, Italy and Iran shows that peaks of infection cases are expected to occur before mid April, the end of March and the end of May 2020, and that the percentage of population infected in each city or region will be less than 0.01%, 0.5% and 0.5%, for South Korea, Italy and Iran, respectively.
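For readers unfamiliar with the compartmental model underlying the profile codes, a plain SEIR simulation can be sketched as below. This is the textbook four-compartment model integrated with a simple Euler step, not the augmented SEIR variant the authors fit; the parameter values are hypothetical and chosen only to produce a visible epidemic curve.

```python
def simulate_seir(population, beta, sigma, gamma, initial_infected, days, dt=0.1):
    """Integrate the basic SEIR equations and return one (S, E, I, R) tuple per day.

    beta:  transmission rate per day
    sigma: rate at which exposed people become infectious (1 / incubation period)
    gamma: removal rate (1 / infectious period)
    """
    s = float(population - initial_infected)
    e = 0.0
    i = float(initial_infected)
    r = 0.0
    history = [(s, e, i, r)]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / population * dt
            new_infectious = sigma * e * dt
            new_removed = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_removed
            r += new_removed
        history.append((s, e, i, r))
    return history


# Hypothetical parameters: basic reproduction number beta / gamma = 5.
history = simulate_seir(
    population=1_000_000,
    beta=0.5,
    sigma=0.2,
    gamma=0.1,
    initial_infected=10,
    days=200,
)
peak_day = max(range(len(history)), key=lambda day: history[day][2])
print(f"infections peak on day {peak_day}")
```

In a profile-matching scheme like the one described, each historical city would contribute one fitted parameter set, and the early data of a new population would be compared against the curves those sets generate to pick the best-fitting profiles.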
|
research-article |
5 |
27 |
14
|
Bahmani A, Alavi A, Buergel T, Upadhyayula S, Wang Q, Ananthakrishnan SK, Alavi A, Celis D, Gillespie D, Young G, Xing Z, Nguyen MHH, Haque A, Mathur A, Payne J, Mazaheri G, Li JK, Kotipalli P, Liao L, Bhasin R, Cha K, Rolnik B, Celli A, Dagan-Rosenfeld O, Higgs E, Zhou W, Berry CL, Van Winkle KG, Contrepois K, Ray U, Bettinger K, Datta S, Li X, Snyder MP. A scalable, secure, and interoperable platform for deep data-driven health management. Nat Commun 2021; 12:5757. [PMID: 34599181 PMCID: PMC8486823 DOI: 10.1038/s41467-021-26040-1] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 08/24/2020] [Accepted: 08/23/2021] [Indexed: 11/08/2022] Open
Abstract
The large amount of biomedical data derived from wearable sensors, electronic health records, and molecular profiling (e.g., genomics data) is rapidly transforming our healthcare systems. The increasing scale and scope of biomedical data not only is generating enormous opportunities for improving health outcomes but also raises new challenges ranging from data acquisition and storage to data analysis and utilization. To meet these challenges, we developed the Personal Health Dashboard (PHD), which utilizes state-of-the-art security and scalability technologies to provide an end-to-end solution for big biomedical data analytics. The PHD platform is an open-source software framework that can be easily configured and deployed to any big data health project to store, organize, and process complex biomedical data sets, support real-time data analysis at both the individual level and the cohort level, and ensure participant privacy at every step. In addition to presenting the system, we illustrate the use of the PHD framework for large-scale applications in emerging multi-omics disease studies, such as collecting and visualization of diverse data types (wearable, clinical, omics) at a personal level, investigation of insulin resistance, and an infrastructure for the detection of presymptomatic COVID-19.
|
research-article |
4 |
20 |
15
|
Vaca Jacome AS, Peckner R, Shulman N, Krug K, DeRuff KC, Officer A, Christianson KE, MacLean B, MacCoss MJ, Carr SA, Jaffe JD. Avant-garde: an automated data-driven DIA data curation tool. Nat Methods 2020; 17:1237-1244. [PMID: 33199889 PMCID: PMC7723322 DOI: 10.1038/s41592-020-00986-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 02/18/2019] [Accepted: 09/25/2020] [Indexed: 12/03/2022]
Abstract
Several challenges remain in data-independent acquisition (DIA) data analysis, such as to confidently identify peptides, define integration boundaries, remove interferences, and control false discovery rates. In practice, a visual inspection of the signals is still required, which is impractical with large datasets. We present Avant-garde as a tool to refine DIA (and parallel reaction monitoring) data. Avant-garde uses a novel data-driven scoring strategy: signals are refined by learning from the dataset itself, using all measurements in all samples to achieve the best optimization. We evaluate the performance of Avant-garde using benchmark DIA datasets and show that it can determine the quantitative suitability of a peptide peak, and reach the same levels of selectivity, accuracy, and reproducibility as manual validation. Avant-garde is complementary to existing DIA analysis engines and aims to establish a strong foundation for subsequent analysis of quantitative mass spectrometry data.
|
Research Support, N.I.H., Extramural |
5 |
18 |
16
|
Wallach JD, Zhang AD, Skydel JJ, Bartlett VL, Dhruva SS, Shah ND, Ross JS. Feasibility of Using Real-world Data to Emulate Postapproval Confirmatory Clinical Trials of Therapeutic Agents Granted US Food and Drug Administration Accelerated Approval. JAMA Netw Open 2021; 4:e2133667. [PMID: 34751763 PMCID: PMC8579227 DOI: 10.1001/jamanetworkopen.2021.33667] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Indexed: 11/14/2022] Open
Abstract
This cross-sectional study examines the feasibility of using real-world data, such as billing, claims, and electronic health records, to emulate US Food and Drug Administration–required confirmatory clinical trials for the 50 new therapeutic agents that received accelerated approval between 2009 and 2018.
|
Evaluation Study |
4 |
16 |
17
|
Clements HD, Flynn AR, Nicholls BT, Grosheva D, Lefave SJ, Merriman MT, Hyster TK, Sigman MS. Using Data Science for Mechanistic Insights and Selectivity Predictions in a Non-Natural Biocatalytic Reaction. J Am Chem Soc 2023; 145:17656-17664. [PMID: 37530568 PMCID: PMC10602048 DOI: 10.1021/jacs.3c03639] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Indexed: 08/03/2023]
Abstract
The study of non-natural biocatalytic transformations relies heavily on empirical methods, such as directed evolution, for identifying improved variants. Although exceptionally effective, this approach provides limited insight into the molecular mechanisms behind the transformations and necessitates multiple protein engineering campaigns for new reactants. To address this limitation, we disclose a strategy to explore the biocatalytic reaction space and garner insight into the molecular mechanisms driving enzymatic transformations. Specifically, we explored the selectivity of an "ene"-reductase, GluER-T36A, to create a data-driven toolset that explores reaction space and rationalizes the observed and predicted selectivities of substrate/mutant combinations. The resultant statistical models related structural features of the enzyme and substrate to selectivity and were used to effectively predict selectivity in reactions with out-of-sample substrates and mutants. Our approach provided a deeper understanding of enantioinduction by GluER-T36A and holds the potential to enhance the virtual screening of enzyme mutants.
|
Research Support, N.I.H., Extramural |
2 |
14 |
18
|
Galea S, Abdalla SM, Sturchio JL. Social determinants of health, data science, and decision-making: Forging a transdisciplinary synthesis. PLoS Med 2020; 17:e1003174. [PMID: 32525875 PMCID: PMC7289342 DOI: 10.1371/journal.pmed.1003174] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Indexed: 11/18/2022] Open
Abstract
Sandro Galea and co-authors discuss a forthcoming Collection on data science and social determinants of health.
|
other |
5 |
14 |
19
|
Stevens SLR, Kuzak M, Martinez C, Moser A, Bleeker P, Galland M. Building a local community of practice in scientific programming for life scientists. PLoS Biol 2018; 16:e2005561. [PMID: 30485260 PMCID: PMC6287879 DOI: 10.1371/journal.pbio.2005561] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Revised: 12/10/2018] [Indexed: 11/18/2022] Open
Abstract
In this paper, we describe why and how to build a local community of practice in scientific programming for life scientists who use computers and programming in their research. A community of practice is a small group of scientists who meet regularly to help each other and promote good practices in scientific programming. While most life scientists are well trained in the laboratory to conduct experiments, good practices with (big) data sets and their analysis are often missing. We propose a model on how to build such a community of practice at a local academic institution, present two real-life examples, and introduce challenges and implemented solutions. We believe that the current data deluge that life scientists face can benefit from the implementation of these small communities. Good practices spread among experimental scientists will foster open, transparent, and sound scientific results beneficial to society.
|
other |
7 |
13 |
20
|
Xu H, Li J, Jiang X, Chen Q. Electronic Health Records for Drug Repurposing: Current Status, Challenges, and Future Directions. Clin Pharmacol Ther 2020; 107:712-714. [PMID: 32012237 PMCID: PMC10815929 DOI: 10.1002/cpt.1769] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 09/23/2019] [Accepted: 01/06/2020] [Indexed: 12/20/2022]
Abstract
It is well recognized that the global pharmaceutical industry now faces challenges such as high costs and low productivity when developing new drugs (e.g., it is estimated that the average cost for developing a new drug ranges from US $2 billion to $3 billion with the total time to bring it to the market being about 13–15 years) [1]. Therefore, drug repurposing (also called drug repositioning/reprofiling), which finds new indications for existing drugs, has received great attention in the past decade. Drug repurposing can reduce drug development time, while improving success rates because the toxicity profiles of existing drugs are already known. Studies have shown that new applications for repurposed drugs have nearly a 30% success rate for US Food and Drug Administration (FDA) approval, whereas traditional new drug applications have a < 10% approval rate.
|
Research Support, N.I.H., Extramural |
5 |
12 |
21
|
Martinez-Soto CE, Cucić S, Lin JT, Kirst S, Mahmoud ES, Khursigara CM, Anany H. PHIDA: A High Throughput Turbidimetric Data Analytic Tool to Compare Host Range Profiles of Bacteriophages Isolated Using Different Enrichment Methods. Viruses 2021; 13:2120. [PMID: 34834927 PMCID: PMC8623551 DOI: 10.3390/v13112120] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/18/2021] [Revised: 10/08/2021] [Accepted: 10/12/2021] [Indexed: 02/07/2023] Open
Abstract
Bacteriophages are viruses that infect bacteria and are present in niches where bacteria thrive. In recent years, the suggested application areas of lytic bacteriophage have been expanded to include therapy, biocontrol, detection, sanitation, and remediation. However, phage application is constrained by the phage's host range, that is, the range of bacterial hosts sensitive to the phage and the degree of infection. Even though phage isolation and enrichment techniques are straightforward protocols, the correlation between the enrichment technique and host range profile has not been evaluated. Agar-based methods such as spotting assay and efficiency of plaquing (EOP) are the most used methods to determine the phage host range. These methods, aside from being labor intensive, can lead to subjective and incomplete results as they rely on qualitative observations of the lysis/plaques, do not reflect the lytic activity in liquid culture, and can overestimate the host range. In this study, phages against three bacterial genera were isolated using three different enrichment methods. Host range profiles of the isolated phages were quantitatively determined using a high throughput turbidimetric protocol and the data were analyzed with an accessible analytic tool "PHIDA". Using this tool, the host ranges of 9 Listeria, 14 Salmonella, and 20 Pseudomonas phages isolated with different enrichment methods were quantitatively compared. A high variability in the host range index (HRi) ranging from 0.86-0.63, 0.07-0.24, and 0.00-0.67 for Listeria, Salmonella, and Pseudomonas phages, respectively, was observed. Overall, no direct correlation was found between the phage host range breadth and the enrichment method in any of the three target bacterial genera. The high throughput method and analytics tool developed in this study can be easily adapted to any phage study and can provide a consensus for phage host range determination.
|
Comparative Study |
4 |
12 |
22
|
Musa A, Tripathi S, Dehmer M, Yli-Harja O, Kauffman SA, Emmert-Streib F. Systems Pharmacogenomic Landscape of Drug Similarities from LINCS data: Drug Association Networks. Sci Rep 2019; 9:7849. [PMID: 31127155 PMCID: PMC6534546 DOI: 10.1038/s41598-019-44291-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 10/12/2018] [Accepted: 05/08/2019] [Indexed: 02/01/2023] Open
Abstract
Modern research in the biomedical sciences is data-driven, utilizing high-throughput technologies to generate big genomic data. The Library of Integrated Network-based Cellular Signatures (LINCS) is an example of a large-scale genomic data repository providing hundreds of thousands of high-dimensional gene expression measurements for thousands of drugs and dozens of cell lines. However, the remaining challenge is how to use these data effectively for pharmacogenomics. In this paper, we use LINCS data to construct drug association networks (DANs) representing the relationships between drugs. By using the Anatomical Therapeutic Chemical (ATC) classification of drugs we demonstrate that the DANs represent a systems pharmacogenomic landscape of drugs summarizing the entire LINCS repository on a genomic scale meaningfully. Here we identify the modules of the DANs as therapeutic attractors of the ATC drug classes.
|
research-article |
6 |
12 |
23
|
Heneghan JA, Walker SB, Fawcett A, Bennett TD, Dziorny AC, Sanchez-Pinto LN, Farris RW, Winter MC, Badke C, Martin B, Brown SR, McCrory MC, Ness-Cochinwala M, Rogerson C, Baloglu O, Harwayne-Gidansky I, Hudkins MR, Kamaleswaran R, Gangadharan S, Tripathi S, Mendonca EA, Markovitz BP, Mayampurath A, Spaeder MC. The Pediatric Data Science and Analytics Subgroup of the Pediatric Acute Lung Injury and Sepsis Investigators Network: Use of Supervised Machine Learning Applications in Pediatric Critical Care Medicine Research. Pediatr Crit Care Med 2024; 25:364-374. [PMID: 38059732 PMCID: PMC10994770 DOI: 10.1097/pcc.0000000000003425] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Indexed: 12/08/2023]
Abstract
OBJECTIVE: Perform a scoping review of supervised machine learning in pediatric critical care to identify published applications, methodologies, and implementation frequency to inform best practices for the development, validation, and reporting of predictive models in pediatric critical care. DESIGN: Scoping review and expert opinion. SETTING: We queried CINAHL Plus with Full Text (EBSCO), Cochrane Library (Wiley), Embase (Elsevier), Ovid Medline, and PubMed for articles published between 2000 and 2022 related to machine learning concepts and pediatric critical illness. Articles were excluded if the majority of patients were adults or neonates, if unsupervised machine learning was the primary methodology, or if information related to the development, validation, and/or implementation of the model was not reported. Article selection and data extraction were performed using dual review in the Covidence tool, with discrepancies resolved by consensus. SUBJECTS: Articles reporting on the development, validation, or implementation of supervised machine learning models in the field of pediatric critical care medicine. INTERVENTIONS: None. MEASUREMENTS AND MAIN RESULTS: Of 5075 identified studies, 141 articles were included. Studies were primarily (57%) performed at a single site. The majority took place in the United States (70%). Most were retrospective observational cohort studies. More than three-quarters of the articles were published between 2018 and 2022. The most common algorithms included logistic regression and random forest. Predicted events were most commonly death, transfer to ICU, and sepsis. Only 14% of articles reported external validation, and only a single model was implemented at publication. Reporting of validation methods, performance assessments, and implementation varied widely. Follow-up with authors suggests that implementation remains uncommon after model publication. CONCLUSIONS: Publication of supervised machine learning models to address clinical challenges in pediatric critical care medicine has increased dramatically in the last 5 years. While these approaches have the potential to benefit children with critical illness, the literature demonstrates incomplete reporting, absence of external validation, and infrequent clinical implementation.
|
Review |
1 |
11 |
24
|
Choi B, Shim G, Jeong B, Jo S. Data-driven analysis using multiple self-report questionnaires to identify college students at high risk of depressive disorder. Sci Rep 2020; 10:7867. [PMID: 32398788 PMCID: PMC7217968 DOI: 10.1038/s41598-020-64709-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 10/29/2019] [Accepted: 04/21/2020] [Indexed: 02/01/2023] Open
Abstract
Depression diagnosis is one of the most important issues in psychiatry. Depression is a complicated mental illness that varies in symptoms and requires patient cooperation. In the present study, we demonstrated a novel data-driven attempt to diagnose depressive disorder based on clinical questionnaires. It includes deep learning, multi-modal representation, and interpretability to overcome the limitations of the data-driven approach in clinical application. We implemented a shared representation model between three different questionnaire forms to represent questionnaire responses in the same latent space. Based on this, we proposed two data-driven diagnostic methods: unsupervised and semi-supervised. We compared them with a cut-off screening method, which is a traditional diagnostic method for depression. The unsupervised method considered more items, relative to the screening method, but showed lower performance because it maximized the difference between groups. In contrast, the semi-supervised method adjusted for bias using information from the screening method and showed higher performance. In addition, we provided the interpretation of diagnosis and statistical analysis of information using local interpretable model-agnostic explanations and ordinal logistic regression. The proposed data-driven framework demonstrated the feasibility of analyzing depressed patients with items directly or indirectly related to depression.
|
research-article |
5 |
9 |
25
|
Abstract
Despite a newfound wealth of data and information, the healthcare sector is lacking in actionable knowledge. This is largely because healthcare data, though plentiful, tends to be inherently complex and fragmented. Health data analytics, with an emphasis on predictive analytics, is emerging as a transformative tool that can enable more proactive and preventative treatment options. This review considers the ways in which predictive analytics has been applied in the for-profit business sector to generate well-timed and accurate predictions of key outcomes, with a focus on key features that may be applicable to healthcare-specific applications. Published medical research presenting assessments of predictive analytics technology in medical applications is reviewed, with particular emphasis on how hospitals have integrated predictive analytics into their day-to-day healthcare services to improve quality of care. This review also highlights the numerous challenges of implementing predictive analytics in healthcare settings and concludes with a discussion of current efforts to implement healthcare data analytics in a developing country, Saudi Arabia.
|
Review |
7 |
9 |