1. Mascher M, Jayakodi M, Shim H, Stein N. Promises and challenges of crop translational genomics. Nature 2024. [PMID: 39313530] [DOI: 10.1038/s41586-024-07713-5]
Abstract
Crop translational genomics applies breeding techniques based on genomic datasets to improve crops. Technological breakthroughs in the past ten years have made it possible to sequence the genomes of increasing numbers of crop varieties and have assisted in the genetic dissection of crop performance. However, translating research findings to breeding applications remains challenging. Here we review recent progress and future prospects for crop translational genomics in bringing results from the laboratory to the field. Genetic mapping, genomic selection and sequence-assisted characterization and deployment of plant genetic resources utilize rapid genotyping of large populations. These approaches have all had an impact on breeding for qualitative traits, where single genes with large phenotypic effects exert their influence. Characterization of the complex genetic architectures that underlie quantitative traits such as yield and flowering time, especially in newly domesticated crops, will require further basic research, including research into regulation and interactions of genes and the integration of genomic approaches and high-throughput phenotyping, before targeted interventions can be designed. Future priorities for translation include supporting genomics-assisted breeding in low-income countries and adaptation of crops to changing environments.
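As a concrete illustration of the genomic selection mentioned above, here is a minimal sketch (not from the paper; all data and parameter choices are synthetic) that trains a ridge-regression genomic prediction model on SNP dosages and ranks untested candidates by predicted genetic merit, in the spirit of RR-BLUP.

```python
# Minimal genomic-selection sketch (illustrative, not the paper's method):
# ridge regression over SNP markers, RR-BLUP style. Data are synthetic.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_lines, n_snps = 500, 2000                                   # training population, marker count
X = rng.integers(0, 3, size=(n_lines, n_snps)).astype(float)  # 0/1/2 allele dosages
true_effects = rng.normal(0, 0.05, n_snps)                    # small polygenic effects
y = X @ true_effects + rng.normal(0, 1.0, n_lines)            # phenotype = genetics + noise

model = Ridge(alpha=100.0)  # shrinkage handles p >> n
print("CV predictive ability (r2):",
      cross_val_score(model, X, y, cv=5, scoring="r2").mean())

# Rank untested selection candidates by predicted genetic merit.
candidates = rng.integers(0, 3, size=(100, n_snps)).astype(float)
model.fit(X, y)
top10 = np.argsort(model.predict(candidates))[::-1][:10]
print("Top-10 candidate indices:", top10)
```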
Affiliation(s)
- Martin Mascher
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany.
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany.
- Murukarthick Jayakodi
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany
- Hyeonah Shim
- Department of Agriculture, Forestry and Bioresources, Plant Genomics and Breeding Institute, Research Institute of Agriculture and Life Sciences, College of Agriculture and Life Sciences, Seoul National University, Seoul, Korea
- Nils Stein
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany.
- Martin Luther University Halle-Wittenberg, Halle, Germany.
2. Poger D, Yen L, Braet F. Big data in contemporary electron microscopy: challenges and opportunities in data transfer, compute and management. Histochem Cell Biol 2023; 160:169-192. [PMID: 37052655] [PMCID: PMC10492738] [DOI: 10.1007/s00418-023-02191-8]
Abstract
The second decade of the twenty-first century witnessed a new challenge in the handling of microscopy data. Big data, data deluge, large data, data compliance, data analytics, data integrity, data interoperability, data retention and data lifecycle are terms that have introduced themselves to the electron microscopy sciences. This is largely attributable to the rapid development of new microscopy hardware. As a result, digital image files averaging one terabyte per acquisition session are no longer uncommon, especially in the field of cryogenic electron microscopy. This brings numerous challenges in data transfer, compute and management. In this review, we discuss in detail the current state of international knowledge on big data in contemporary electron microscopy and how big data can be transferred, computed and managed efficiently and sustainably. Workflows, solutions, approaches and suggestions are provided, drawing on the latest experiences in Australia. Finally, important principles such as data integrity, data lifetime and the FAIR and CARE principles are considered.
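One recurring safeguard in transfer workflows of the kind discussed here is end-to-end checksum verification. The sketch below is a generic illustration, not the authors' workflow: it streams a file in fixed-size chunks so terabyte-scale acquisitions can be hashed without exhausting memory, and both ends of a transfer can compare digests.

```python
# Generic data-integrity sketch (not the authors' workflow): stream a large
# file in chunks and compute a SHA-256 digest for before/after comparison.
import hashlib

def sha256_of_file(path: str, chunk_size: int = 64 * 1024 * 1024) -> str:
    """Hash a file in 64 MiB chunks to keep memory use flat for TB-scale files."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Usage: run on both the sending and receiving systems and compare.
# assert sha256_of_file("tomogram.mrc") == expected_checksum  # hypothetical file name
```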
Affiliation(s)
- David Poger
- Microscopy Australia, The University of Sydney, Sydney, NSW, 2006, Australia.
- Lisa Yen
- Microscopy Australia, The University of Sydney, Sydney, NSW, 2006, Australia
- Filip Braet
- Australian Centre for Microscopy and Microanalysis, The University of Sydney, Sydney, NSW, 2006, Australia
- School of Medical Sciences (Molecular and Cellular Biomedicine), The University of Sydney, Sydney, NSW, 2006, Australia
3. Grossman RL. Ten lessons for data sharing with a data commons. Sci Data 2023; 10:120. [PMID: 36878917] [PMCID: PMC9988927] [DOI: 10.1038/s41597-023-02029-x]
Affiliation(s)
- Robert L Grossman
- University of Chicago, Center for Translational Data Science, Chicago, IL, 60615, USA.
4.
Abstract
A large body of evidence has emerged in the past decade supporting a role for the gut microbiome in the regulation of blood pressure. The field has moved from association to causation in the last 5 years, with studies that have used germ-free animals, antibiotic treatments and direct supplementation with microbial metabolites. The gut microbiome can regulate blood pressure through several mechanisms, including through gut dysbiosis-induced changes in microbiome-associated gene pathways in the host. Microbiota-derived metabolites are either beneficial (for example, short-chain fatty acids and indole-3-lactic acid) or detrimental (for example, trimethylamine N-oxide), and can activate several downstream signalling pathways via G protein-coupled receptors or through direct immune cell activation. Moreover, dysbiosis-associated breakdown of the gut epithelial barrier can elicit systemic inflammation and disrupt intestinal mechanotransduction. These alterations activate mechanisms that are traditionally associated with blood pressure regulation, such as the renin-angiotensin-aldosterone system, the autonomic nervous system, and the immune system. Several methodological and technological challenges remain in gut microbiome research, and the solutions involve minimizing confounding factors, establishing causality and acting globally to improve sample diversity. New clinical trials, precision microbiome medicine and computational methods such as Mendelian randomization have the potential to enable leveraging of the microbiome for translational applications to lower blood pressure.
5. Khan N, Thelwall M, Kousha K. Data sharing and reuse practices: disciplinary differences and improvements needed. Online Information Review 2023. [DOI: 10.1108/oir-08-2021-0423]
Abstract
Purpose
This study investigates differences and commonalities in data production, sharing and reuse across the widest range of disciplines yet and identifies types of improvements needed to promote data sharing and reuse.
Design/methodology/approach
The first authors of randomly selected publications from 2018 to 2019 in 20 Scopus disciplines were surveyed for their beliefs and experiences about data sharing and reuse.
Findings
From the 3,257 survey responses, data sharing and reuse are still increasing but not ubiquitous in any subject area and are more common among experienced researchers. Researchers with previous data reuse experience were more likely to share data than others. Types of data produced and systematic online data sharing varied substantially between subject areas. Although the use of institutional and journal-supported repositories for sharing data is increasing, personal websites are still frequently used. Combining multiple existing datasets to answer new research questions was the most common use. Proper documentation, openness and information on the usability of data continue to be important when searching for existing datasets. However, researchers in most disciplines struggled to find datasets to reuse. Researchers' feedback suggested 23 recommendations to promote data sharing and reuse, including improved data access and usability, formal data citations, new search features and cultural and policy-related disciplinary changes to increase awareness and acceptance.
Originality/value
This study is the first to explore data sharing and reuse practices across the full range of academic discipline types. It expands and updates previous data sharing surveys and suggests new areas of improvement in terms of policy, guidance and training programs.
Peer review
The peer review history for this article is available at: https://publons.com/publon/10.1108/OIR-08-2021-0423.
6. Ruan E, Nemeth E, Moffitt R, Sandoval L, Machiela MJ, Freedman ND, Huang WY, Wong W, Chen KL, Park B, Jiang K, Hicks B, Liu J, Russ D, Minasian L, Pinsky P, Chanock SJ, Garcia-Closas M, Almeida JS. PLCOjs, a FAIR GWAS web SDK for the NCI Prostate, Lung, Colorectal and Ovarian Cancer Genetic Atlas project. Bioinformatics 2022; 38:4434-4436. [PMID: 35900159] [PMCID: PMC9890300] [DOI: 10.1093/bioinformatics/btac531]
Abstract
Motivation
The Division of Cancer Epidemiology and Genetics (DCEG) and the Division of Cancer Prevention (DCP) at the National Cancer Institute (NCI) have recently generated genome-wide association study (GWAS) data for multiple traits in the Prostate, Lung, Colorectal, and Ovarian (PLCO) Genomic Atlas project. The GWAS included 110,000 participants. The dissemination of the genetic association data through a data portal called GWAS Explorer, in a manner that addresses the modern expectations of FAIR reusability by data scientists and engineers, is the main motivation for the development of the open-source JavaScript software development kit (SDK) reported here.
Results
The PLCO GWAS Explorer resource relies on a public stateless HTTP application programming interface (API) deployed as the sole backend service for both the landing page's web application and third-party analytical workflows. The core PLCOjs SDK is mapped to each of the API methods, and also to each of the reference graphic visualizations in the GWAS Explorer; a few additional visualization methods extend it. As is the norm with web SDKs, no download or installation is needed, and modularization supports targeted code injection for web applications, reactive notebooks (Observable) and node-based web services.
Availability and implementation
Code at https://github.com/episphere/plco; project page at https://episphere.github.io/plco.
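To suggest what third-party consumption of such a stateless HTTP API can look like from a Python workflow, here is a hedged sketch; the base URL, endpoint path and parameter names below are hypothetical placeholders, and the real SDK methods and API routes are documented on the project page above.

```python
# Sketch of consuming a stateless GWAS HTTP API from an analytical workflow.
# The base URL, endpoint and parameter names are hypothetical placeholders;
# see https://episphere.github.io/plco for the actual API.
import requests

BASE = "https://plco-api.example.org"  # placeholder host, not the real service

def query_associations(phenotype: str, chromosome: int, p_max: float = 5e-8):
    """Fetch genome-wide-significant hits for one trait (illustrative only)."""
    resp = requests.get(
        f"{BASE}/summary",  # hypothetical endpoint
        params={"phenotype": phenotype, "chromosome": chromosome, "p_max": p_max},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # stateless JSON response, reusable by any client
```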
Affiliation(s)
- Eric Ruan
- Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY 11794, USA
- Erika Nemeth
- Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY 11794, USA
- Richard Moffitt
- Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY 11794, USA
- Lorena Sandoval
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute, Rockville, MD 20850, USA
- Mitchell J Machiela
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute, Rockville, MD 20850, USA
- Neal D Freedman
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute, Rockville, MD 20850, USA
- Wen-Yi Huang
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute, Rockville, MD 20850, USA
- Wendy Wong
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute, Rockville, MD 20850, USA
- Kai-Ling Chen
- Center for Biomedical Informatics and Information Technology (CBIIT), National Cancer Institute, Rockville, MD 20850, USA
- Brian Park
- Center for Biomedical Informatics and Information Technology (CBIIT), National Cancer Institute, Rockville, MD 20850, USA
- Kevin Jiang
- Center for Biomedical Informatics and Information Technology (CBIIT), National Cancer Institute, Rockville, MD 20850, USA
- Belynda Hicks
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute, Rockville, MD 20850, USA
- Jia Liu
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute, Rockville, MD 20850, USA
- Daniel Russ
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute, Rockville, MD 20850, USA
- Lori Minasian
- Division of Cancer Prevention, National Cancer Institute, Rockville, MD 20850, USA
- Paul Pinsky
- Division of Cancer Prevention, National Cancer Institute, Rockville, MD 20850, USA
- Stephen J Chanock
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute, Rockville, MD 20850, USA
- Montserrat Garcia-Closas
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute, Rockville, MD 20850, USA
8. Basu B, Gowtham N, Xiao Y, Kalidindi SR, Leong KW. Biomaterialomics: Data science-driven pathways to develop fourth-generation biomaterials. Acta Biomater 2022; 143:1-25. [PMID: 35202854] [DOI: 10.1016/j.actbio.2022.02.027]
Abstract
Conventional approaches to developing biomaterials and implants require intuitive tailoring of manufacturing protocols and biocompatibility assessment, leading to long development cycles and high costs. To meet existing and unmet clinical needs, it is critical to accelerate the production of implantable biomaterials, implants and biomedical devices. Building on the Materials Genome Initiative, we define the concept of 'biomaterialomics' as the integration of multi-omics data and high-dimensional analysis with artificial intelligence (AI) tools throughout the entire pipeline of biomaterials development. This data science-driven approach is envisioned to bring together on a single platform the computational tools, databases, experimental methods, machine learning and advanced manufacturing (e.g., 3D printing) needed to develop fourth-generation biomaterials and implants, whose clinical performance will be predicted using 'digital twins'. In analysing the key elements of the 'biomaterialomics' concept, significant emphasis is placed on effectively utilizing high-throughput biocompatibility data together with multiscale physics-based models, e-platforms/online databases of clinical studies, and data science approaches, including metadata management, AI/machine learning (ML) algorithms and uncertainty predictions. Such an integrated formulation will allow cross-disciplinary approaches to establish processing-structure-property (PSP) linkages. A few published studies from the lead author's research group serve as representative examples to illustrate the formulation and relevance of 'biomaterialomics' approaches for three emerging research themes: patient-specific implants, additive manufacturing and bioelectronic medicine. The increased adoption of AI/ML tools in biomaterials science, along with the training of the next generation of researchers in data science, is strongly recommended.
Statement of significance
This leading opinion review paper emphasizes the need to integrate the concepts and algorithms of data science with biomaterials science. It also emphasizes the need to establish a mathematically rigorous, cross-disciplinary framework that allows systematic quantitative exploration and curation of the critical biomaterials knowledge needed to drive innovation efforts objectively within a suitable uncertainty quantification framework, as embodied in the 'biomaterialomics' concept, which integrates multi-omics data and high-dimensional analysis with artificial intelligence (AI) tools such as machine learning. The formulation of this approach is demonstrated for patient-specific implants, additive manufacturing and bioelectronic medicine.
9. Precision dentistry—what it is, where it fails (yet), and how to get there. Clin Oral Investig 2022; 26:3395-3403. [PMID: 35284954] [PMCID: PMC8918420] [DOI: 10.1007/s00784-022-04420-1]
Abstract
Objectives
Dentistry is stuck between the one-size-fits-all approach towards diagnostics and therapy employed for a century and the era of stratified medicine. The present review presents the concept of precision dentistry, i.e., the next step beyond stratification into risk groups, and lays out where we stand, but also what challenges lie ahead for precision dentistry to come true.
Material and methods
Narrative literature review.
Results
Current approaches for enabling more precise diagnostics and therapies focus on stratification of individuals using clinical or social risk factors or indicators. Most research in dentistry does not focus on predictions (the key for precision dentistry) but on associations. We critically discuss why both approaches (a focus on a limited number of risk factors or indicators, and a focus on associations) are insufficient, and elaborate on what we think may allow the status quo to be overcome.
Conclusions
Leveraging more diverse and broad data stemming from routine or unusual sources via advanced data analytics, and rigorously testing the resulting prediction models, may allow further steps towards more precise oral and dental care.
Clinical significance
Precision dentistry refers to tailoring diagnostics and therapy to an individual; it builds on modelling, prediction making and rigorous testing. Most studies in the dental domain focus on showing associations and do not attempt to make predictions. Moreover, the datasets used are narrow and usually collected purposively following clinical reasoning. Opening routine data silos and involving uncommon data sources to harvest broad data, and leveraging them using advanced analytics, could facilitate precision dentistry.
10. A review on method entities in the academic literature: extraction, evaluation, and application. Scientometrics 2022. [DOI: 10.1007/s11192-022-04332-7]
11. Bosco FA. Accumulating Knowledge in the Organizational Sciences. Annual Review of Organizational Psychology and Organizational Behavior 2022. [DOI: 10.1146/annurev-orgpsych-012420-090657]
Abstract
In some fields, research findings are rigorously curated in a common language and made available to enable future use and large-scale, robust insights. Organizational researchers have begun such efforts [e.g., metaBUS (http://metabus.org/)] but are far from the efficient, comprehensive curation seen in areas such as cognitive neuroscience or genetics. This review provides a sample of insights from research curation efforts in organizational research, psychology, and beyond—insights not possible with even large-scale, substantive meta-analyses. Efforts are classified as either science-of-science research or large-scale, substantive research. The various methods used for information extraction (e.g., from PDF files) and classification (e.g., using consensus ontologies) are reviewed. The review concludes with a series of recommendations for developing and leveraging the available corpus of organizational research to speed scientific progress.
Affiliation(s)
- Frank A. Bosco
- Department of Management and Entrepreneurship, School of Business, Virginia Commonwealth University, Richmond, Virginia, USA
12. Denecker T, Lelandais G. Omics Analyses: How to Navigate Through a Constant Data Deluge. Methods Mol Biol 2022; 2477:457-471. [PMID: 35524132] [DOI: 10.1007/978-1-0716-2257-5_25]
Abstract
Omics data are very valuable for researchers in biology, but the work required to develop solid expertise in their analysis contrasts with the rapidity with which omics technologies evolve. Data accumulate in public databases, and despite significant advances in bioinformatics software to integrate them, data analysis remains a burden for those who perform experiments. Beyond the issue of dealing with a very large number of results, we believe that working with omics data requires a change in the way scientific problems are solved. In this chapter, we explain the pitfalls and share the tips we found during our functional genomics projects in yeasts. Our main lesson is that, while applying a protocol does not guarantee a successful project, following simple rules can help researchers become strategic and intentional, thus avoiding an endless drift into an ocean of possibilities.
Affiliation(s)
- Thomas Denecker
- CNRS, Institut Français de Bioinformatique, IFB-core, UMS 3601, Évry, France
- Gaëlle Lelandais
- Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CEA, CNRS, Gif-sur-Yvette, France
13. Sharma C, Sinha R, Johnson K. Practical and comprehensive formalisms for modelling contemporary graph query languages. Information Systems 2021. [DOI: 10.1016/j.is.2021.101816]
14. Bicycle Mobility Data: Current Use and Future Potential. An International Survey of Domain Professionals. Data 2021. [DOI: 10.3390/data6110121]
Abstract
Active mobility, especially cycling, is an essential building block for sustainable urban mobility. Public and private stakeholders are striving to improve conditions for cycling and subsequently increase its modal share. Data are regarded as key for different measures to become efficient and targeted. There is extensive evidence for an increasing amount of mobility data, the availability of new data sources and potential usage scenarios for such data. However, little is known about the current use of these data in policy making, planning and related fields. To the best of our knowledge, the degree to which professionals in the broader field of cycling promotion benefit from the increasing amount of cycling-related data has not yet been investigated. We therefore conducted a multi-lingual online survey among domain professionals and acquired data on their perspectives on current data availability, use and suitability, as well as the potential they see for the use of cycling data in the future. In total, we received 325 complete responses from 32 countries, with the vast majority of the 241 valid responses originating from Germany, Austria and Italy. Key findings are: 84% of domain professionals attribute high importance to data, and 89% state that they currently cannot, or can only partly, solve their tasks with the data available to them. The results emphasize the need to make more and better-suited data available to professionals in cycling-related positions, in both the private and public sectors.
15. Ahmed AE, Allen JM, Bhat T, Burra P, Fliege CE, Hart SN, Heldenbrand JR, Hudson ME, Istanto DD, Kalmbach MT, Kapraun GD, Kendig KI, Kendzior MC, Klee EW, Mattson N, Ross CA, Sharif SM, Venkatakrishnan R, Fadlelmola FM, Mainzer LS. Design considerations for workflow management systems use in production genomics research and the clinic. Sci Rep 2021; 11:21680. [PMID: 34737383] [PMCID: PMC8569008] [DOI: 10.1038/s41598-021-99288-8]
Abstract
The changing landscape of genomics research and clinical practice has created a need for computational pipelines capable of efficiently orchestrating complex analysis stages while handling large volumes of data across heterogeneous computational environments. Workflow Management Systems (WfMSs) are the software components employed to fill this gap. This work provides an approach to, and a systematic evaluation of, key features of popular bioinformatics WfMSs in use today: Nextflow, CWL and WDL, along with some of their executors, as well as Swift/T, a workflow manager commonly used in high-scale physics applications. We employed two use cases: a variant-calling genomic pipeline and a scalability-testing framework, each run locally, on an HPC cluster and in the cloud. This allowed the four WfMSs to be evaluated in terms of language expressiveness, modularity, scalability, robustness, reproducibility, interoperability and ease of development, along with adoption and usage in research labs and healthcare settings. This article tries to answer the question: which WfMS should be chosen for a given bioinformatics application, regardless of analysis type? The choice of a given WfMS is a function of both its intrinsic language and engine features. Within bioinformatics, where analysts are a mix of dry- and wet-lab scientists, the choice is also governed by collaborations and adoption within large consortia, and by the technical support provided by the WfMS team/community. As the community and its needs continue to evolve along with computational infrastructure, WfMSs will also evolve, especially those with permissive licenses that allow commercial use. In much the same way as the dataflow paradigm and containerization are now well understood to be very useful in bioinformatics applications, we will continue to see innovations in tools and utilities for other purposes, such as big data technologies, interoperability and provenance.
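The dataflow paradigm mentioned in the closing sentence can be illustrated with a toy Python sketch; it is not one of the evaluated WfMSs, and the stage functions are stand-ins, but it shows the fan-out, dependency-driven execution that Nextflow, CWL/WDL engines and Swift/T generalize with containers, caching, retries and cluster or cloud schedulers.

```python
# Toy dataflow sketch (illustrative; not one of the evaluated WfMSs).
# Each stage runs as soon as its inputs exist -- the dependency-driven
# model that production WfMSs generalize across HPC and cloud backends.
from concurrent.futures import ThreadPoolExecutor

def align(sample: str) -> str:          # stand-in for a real aligner step
    return f"{sample}.bam"

def call_variants(bam: str) -> str:     # stand-in for a variant-calling step
    return f"{bam}.vcf"

samples = ["s1", "s2", "s3"]
with ThreadPoolExecutor() as pool:
    bams = list(pool.map(align, samples))       # fan-out over samples
    vcfs = list(pool.map(call_variants, bams))  # downstream stage waits on inputs
print(vcfs)  # ['s1.bam.vcf', 's2.bam.vcf', 's3.bam.vcf']
```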
Affiliation(s)
- Azza E Ahmed
- Faculty of Science, Center for Bioinformatics and Systems Biology, University of Khartoum, 11111, Khartoum, Sudan.
- Department of Electrical and Electronic Engineering, Faculty of Engineering, University of Khartoum, 11111, Khartoum, Sudan.
- Bernoulli Institute, University of Groningen, 9747 AG, Groningen, The Netherlands.
- Joshua M Allen
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Tajesvi Bhat
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Department of Molecular and Cellular Biology, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Prakruthi Burra
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
- Christina E Fliege
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Steven N Hart
- Department of Quantitative Health Sciences, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, 55905, USA
- Jacob R Heldenbrand
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Matthew E Hudson
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Dave Deandre Istanto
- Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Michael T Kalmbach
- Department of Information Technology, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, 55905, USA
- Gregory D Kapraun
- Department of Information Technology, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, 55905, USA
- Katherine I Kendig
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Matthew Charles Kendzior
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Department of Information Technology, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, 55905, USA
- Eric W Klee
- Department of Quantitative Health Sciences, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, 55905, USA
- Nate Mattson
- Department of Information Technology, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, 55905, USA
- Christian A Ross
- Laboratory Pathology and Extramural Applications, Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, 55905, USA
- Sami M Sharif
- Department of Electrical and Electronic Engineering, Faculty of Engineering, University of Khartoum, 11111, Khartoum, Sudan
- Ramshankar Venkatakrishnan
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Faisal M Fadlelmola
- Faculty of Science, Center for Bioinformatics and Systems Biology, University of Khartoum, 11111, Khartoum, Sudan
- Liudmila S Mainzer
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
16. Wang C, Yu F, Liu Y, Li X, Chen J, Thiyagalingam J, Sepe A. Deploying the Big Data Science Center at the Shanghai Synchrotron Radiation Facility: the first superfacility platform in China. Machine Learning: Science and Technology 2021. [DOI: 10.1088/2632-2153/abe193]
Abstract
With recent technological advances, large-scale experimental facilities generate huge datasets, into the petabyte range, every year, thereby creating the Big Data deluge effect. Data management, including the collection, management, and curation of these large datasets, is a significantly intensive precursor step in relation to the data analysis that underpins scientific investigations. The rise of artificial intelligence (AI), machine learning (ML), and robotic automation has changed the landscape for experimental facilities, producing a paradigm shift in how different datasets are leveraged for improved intelligence, operation, and data analysis. Therefore, such facilities, known as superfacilities, which fully enable user science while addressing the challenges of the Big Data deluge, are critical for the scientific community. In this work, we discuss the process of setting up the Big Data Science Center within the Shanghai Synchrotron Radiation Facility (SSRF), China’s first superfacility. We provide details of our initiatives for enabling user science at SSRF, with particular consideration given to recent developments in AI, ML, and robotic automation.
17.
Abstract
Data are a key resource for modern societies and are expected to improve the quality, accessibility, affordability, safety and equity of health care. Dental care and research are currently transforming into what we term data dentistry, with 3 main applications: 1) medical data analysis uses deep learning, allowing one to master unprecedented amounts of data (language, speech, imagery) and put them to productive use. 2) Data-enriched clinical care integrates data from the individual (e.g., demographic, social, clinical and omics data, consumer data), setting (e.g., geospatial, environmental, provider-related data), and systems levels (payer or regulatory data to characterize the input, throughput, output and outcomes of health care) to provide a comprehensive and continuous real-time assessment of biologic perturbations, individual behaviors and context. Such care may contribute to a deeper understanding of health and disease and to more precise, personalized, predictive and preventive care. 3) Data for research include open research data and data sharing, allowing one to appraise, benchmark, pool, replicate and reuse data. Concerns about, and confidence in, data-driven applications, stakeholders' and systems' capabilities, and the lack of data standardization and harmonization currently limit the development and implementation of data dentistry. Aspects of bias and data-user interaction require attention. Action items for the dental community circle around increasing data availability, refinement and usage; demonstrating the safety, value and usefulness of applications; educating the dental workforce and consumers; providing performant and standardized infrastructure and processes; and incentivizing and adopting open data and data sharing.
Affiliation(s)
- F Schwendicke
- Department of Oral Diagnostics, Digital Health and Health Services Research, Charité-Universitätsmedizin Berlin, Berlin, Germany
- J Krois
- Department of Oral Diagnostics, Digital Health and Health Services Research, Charité-Universitätsmedizin Berlin, Berlin, Germany
18. An Ontology-Driven Personalized Faceted Search for Exploring Knowledge Bases of Capsicum. Future Internet 2021. [DOI: 10.3390/fi13070172]
Abstract
Capsicum is a genus of flowering plants in the Solanaceae family whose members are well known to have high economic value. Capsicum fruits, popularly known as peppers or chili, are widely used worldwide, serving as a spice and as a raw material for many products such as sauce, food coloring and medicine. For many years, scientists have studied this plant to optimize its production, and a tremendous amount of knowledge has been obtained and shared, as reflected in multiple knowledge-based systems, databases and information systems. One approach to knowledge-sharing is the adoption of a common ontology to eliminate discrepancies in how knowledge is understood. Unfortunately, most knowledge-sharing solutions are intended for scientists who are familiar with the subject. On the other hand, there are groups of potential users who could benefit from such systems but have minimal knowledge of the subject. For these non-expert users, finding relevant information in a less familiar knowledge base would be daunting. Moreover, users have various degrees of understanding of the available content in the knowledge base, and this understanding discrepancy raises a personalization problem. In this paper, we introduce a solution to overcome this challenge. First, we developed an ontology to facilitate knowledge-sharing about Capsicum with non-expert users. Second, we developed a personalized faceted search algorithm that provides multiple structured ways to explore the knowledge base. The algorithm addresses the personalization problem by identifying each user's degree of understanding of the subject. In this way, non-expert users can explore a knowledge base of Capsicum efficiently. Our solution characterizes users into four groups; accordingly, our faceted search algorithm defines four types of matching mechanisms, including three ranking mechanisms at the core of our solution. To evaluate the proposed method, we measured the degree of predictability of the produced list of facets. Our findings indicated that the proposed matching mechanisms can tolerate various query types, and that a high degree of predictability can be achieved by combining multiple ranking mechanisms. Furthermore, this demonstrates that our approach has high potential to contribute to biodiversity science in general, where many knowledge-based systems have been developed with limited access for users outside of the domain.
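The sketch below gives a generic flavor of personalized facet ranking; it does not reproduce the paper's four matching mechanisms or three ranking mechanisms, and the records and user-familiarity profile are invented for illustration.

```python
# Generic faceted-search sketch (illustrative only; not the paper's algorithm).
# Candidate facets are derived from ontology-term annotations on the results,
# then ranked by frequency weighted with a per-user familiarity score.
from collections import Counter

items = [  # toy knowledge-base records annotated with hypothetical ontology terms
    {"name": "Capsicum annuum",   "terms": {"fruit:pungent", "use:spice"}},
    {"name": "Capsicum chinense", "terms": {"fruit:pungent", "use:sauce"}},
    {"name": "Capsicum baccatum", "terms": {"fruit:mild",    "use:medicine"}},
]

def ranked_facets(results, familiarity):
    """Score facets by (frequency in results) x (user familiarity in [0, 1])."""
    counts = Counter(t for item in results for t in item["terms"])
    return sorted(counts, key=lambda t: counts[t] * familiarity.get(t, 0.5),
                  reverse=True)

novice = {"use:spice": 1.0, "fruit:pungent": 0.9}  # hypothetical user profile
print(ranked_facets(items, novice))  # familiar facets surface first
```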
19. Geneviève LD, Martani A, Perneger T, Wangmo T, Elger BS. Systemic Fairness for Sharing Health Data: Perspectives From Swiss Stakeholders. Front Public Health 2021; 9:669463. [PMID: 34026719] [PMCID: PMC8131670] [DOI: 10.3389/fpubh.2021.669463]
Abstract
Introduction: Health research is gradually embracing a more collectivist approach, fueled by a new movement of open science, data sharing and collaborative partnerships. However, systemic contradictions hinder the sharing of health data and such collectivist endeavors. This qualitative study therefore explores these systemic barriers to the fair sharing of health data from the perspectives of Swiss stakeholders.
Methods: Purposive and snowball sampling were used to recruit 48 experts active in the Swiss healthcare domain, drawn from the research/policy-making field or holding a senior position in a health data enterprise (e.g., a health register, hospital IT data infrastructure or a national health data initiative). Semi-structured interviews were conducted, audio-recorded and transcribed verbatim, with identifying information removed to guarantee the anonymity of participants. A theoretical thematic analysis was then carried out to identify themes and subthemes related to the topic of systemic fairness for sharing health data.
Results: Two themes related to the topic of systemic fairness for sharing health data were identified, namely (i) the hypercompetitive environment and (ii) the legal uncertainty blocking data sharing. The hypercompetitive environment theme was further divided into two subthemes: (i) systemic contradictions to fair data sharing and (ii) the need for fair systemic attribution mechanisms.
Discussion: From the perspectives of Swiss stakeholders, hypercompetition in the Swiss academic system is hindering the sharing of health data for secondary research purposes, with the downside effect of pushing researchers toward individualism for career opportunities, thereby opposing the data sharing movement. In addition, there was a perceived sense of legal uncertainty arising from the legislation governing the sharing of health data, which places unreasonable burdens on individual researchers, who are often unequipped to deal with such facets of their data sharing activities.
Affiliation(s)
- Andrea Martani
- Institute for Biomedical Ethics, University of Basel, Basel, Switzerland
- Thomas Perneger
- Division of Clinical Epidemiology, Geneva University Hospitals and University of Geneva, Geneva, Switzerland
- Tenzin Wangmo
- Institute for Biomedical Ethics, University of Basel, Basel, Switzerland
- Bernice Simone Elger
- Institute for Biomedical Ethics, University of Basel, Basel, Switzerland.
- University Center of Legal Medicine, University of Geneva, Geneva, Switzerland
20. Cretu MT, Pérez-Ríos J. Predicting second virial coefficients of organic and inorganic compounds using Gaussian process regression. Phys Chem Chem Phys 2021; 23:2891-2898. [PMID: 33475124] [DOI: 10.1039/d0cp05509c]
Abstract
We show that by using intuitive and accessible molecular features it is possible to predict the temperature-dependent second virial coefficient of organic and inorganic compounds with Gaussian process regression. In particular, we built a low-dimensional feature representation based on intrinsic molecular properties, topology and physical properties relevant to the characterization of molecule-molecule interactions. The featurization was used to predict second virial coefficients in the interpolative regime with a relative error ⪅1%, and to extrapolate the prediction to temperatures outside the training range for each compound in the dataset with a relative error of 2.1%. Additionally, the model's predictive abilities were extended to organic molecules unseen in the training process, yielding a prediction with a relative error of 2.7%. Test molecules must be well represented in the training set by instances of their families, which are high in variety. The method generally performs better than several semi-empirical procedures employed to predict this quantity. Therefore, apart from being robust, the present Gaussian process regression model is extensible to a variety of organic and inorganic compounds.
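A minimal sketch of this kind of model is shown below, using scikit-learn's Gaussian process regressor on synthetic data; the two-dimensional featurization is a hypothetical stand-in for the authors' descriptors, which combine molecular properties, topology and physical properties.

```python
# Minimal GP-regression sketch with synthetic data; the two features are a
# hypothetical stand-in for the paper's molecular featurization.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
# Features: [temperature proxy, size/interaction-strength descriptor]
X = rng.uniform([0.5, 2.0], [5.0, 6.0], size=(200, 2))
# Synthetic target shaped like a second virial coefficient: strongly negative
# at low temperature, rising toward a positive plateau at high temperature.
B = 1.0 - (X[:, 1] / X[:, 0]) ** 1.5 + rng.normal(0, 0.05, 200)

kernel = 1.0 * RBF(length_scale=[1.0, 1.0]) + WhiteKernel(noise_level=1e-2)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, B)

# GP regression returns an uncertainty estimate alongside the prediction.
B_pred, B_std = gpr.predict([[2.5, 4.0]], return_std=True)
print(f"B = {B_pred[0]:.3f} +/- {B_std[0]:.3f}")
```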
Affiliation(s)
- Miruna T Cretu
- Department of Chemistry, Imperial College London, London SW7 2AZ, UK and Fritz-Haber-Institut der Max-Planck-Gesellschaft, Faradayweg 4-6, 14195 Berlin, Germany
- Jesús Pérez-Ríos
- Fritz-Haber-Institut der Max-Planck-Gesellschaft, Faradayweg 4-6, 14195 Berlin, Germany
21. Lee SH. Classification of Healthy People and PD Patients Using Takagi–Sugeno Fuzzy Model-Based Instance Selection and Wavelet Transforms. J Mech Med Biol 2020. [DOI: 10.1142/s0219519420400394]
Abstract
In this study, a new instance selection method combining the neural network with weighted fuzzy memberships (NEWFM) and the Takagi–Sugeno (T–S) fuzzy model was proposed to improve the accuracy of classifying healthy people and Parkinson's disease (PD) patients. To evaluate the proposed instance selection method, foot pressure data were collected from healthy people and PD patients as experimental data. Wavelet transforms (WTs) were used in a preprocessing step to remove noise from the foot pressure data. The proposed instance selection method is an algorithm that selects instances using both weighted mean defuzzification (WMD) in the T–S fuzzy model and the confidence interval of a normal distribution used in statistics. To demonstrate the benefit of instance selection, classification accuracy was compared before and after it was applied: 77.33% before and 78.19% after, an improvement of 0.86 percentage points. Further, McNemar's test was employed to assess the difference in classification accuracy before and after instance selection; the resulting significance probability was smaller than 0.05, reaffirming that classification accuracy was better with instance selection than without it. NEWFM includes the bounded sum of weighted fuzzy memberships (BSWFMs), which can graphically show the distinct differences in characteristics between healthy people and PD patients. This study thus proposes a new technique by which NEWFM can detect PD patients from foot pressure data using BSWFMs embedded in devices or systems.
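The sketch below illustrates the two preprocessing ideas named here on synthetic traces: wavelet-transform denoising followed by instance selection based on a normal-distribution confidence interval. The T–S weighted mean defuzzification step is simplified to a plain mean, so this is an outline of the idea rather than the NEWFM pipeline itself.

```python
# Outline of the preprocessing ideas above, on synthetic data (not the
# NEWFM pipeline): wavelet denoising, then keep only instances whose
# score falls inside a 95% normal confidence interval. The T-S weighted
# mean defuzzification (WMD) is simplified here to a plain mean.
import numpy as np
import pywt

rng = np.random.default_rng(2)
signals = rng.normal(0, 1, size=(100, 256)) + 0.3  # toy foot-pressure traces

def wt_denoise(x, wavelet="db4", level=3, thresh=0.5):
    coeffs = pywt.wavedec(x, wavelet, level=level)
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)

denoised = np.array([wt_denoise(s)[:256] for s in signals])
scores = denoised.mean(axis=1)      # simplified stand-in for the WMD output
mu, sd = scores.mean(), scores.std()
keep = (scores > mu - 1.96 * sd) & (scores < mu + 1.96 * sd)  # 95% CI
print(f"kept {keep.sum()} of {len(scores)} instances")
```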
Affiliation(s)
- Sang-Hong Lee
- Department of Computer Science & Engineering, Anyang University, Anyang-si, Republic of Korea
22. Hirschheim R. The attack on understanding: How big data and theory have led us astray: A comment on Gary Smith’s Data Mining Fool’s Gold. Journal of Information Technology 2020. [DOI: 10.1177/0268396220967677]
23. Compson ZG, McClenaghan B, Singer GAC, Fahner NA, Hajibabaei M. Metabarcoding From Microbes to Mammals: Comprehensive Bioassessment on a Global Scale. Front Ecol Evol 2020. [DOI: 10.3389/fevo.2020.581835]
Abstract
Global biodiversity loss is unprecedented, and threats to existing biodiversity are growing. Given pervasive global change, a major challenge facing resource managers is a lack of scalable tools to rapidly and consistently measure Earth's biodiversity. Environmental genomic tools provide some hope in the face of this crisis, and DNA metabarcoding, in particular, is a powerful approach for biodiversity assessment at large spatial scales. However, metabarcoding studies are variable in their taxonomic, temporal, or spatial scope, investigating individual species, specific taxonomic groups, or targeted communities at local or regional scales. With the advent of modern, ultra-high throughput sequencing platforms, conducting deep sequencing metabarcoding surveys with multiple DNA markers will enhance the breadth of biodiversity coverage, enabling comprehensive, rapid bioassessment of all the organisms in a sample. Here, we report on a systematic literature review of 1,563 articles published about DNA metabarcoding and summarize how this approach is rapidly revolutionizing global bioassessment efforts. Specifically, we quantify the stakeholders using DNA metabarcoding, the dominant applications of this technology, and the taxonomic groups assessed in these studies. We show that while DNA metabarcoding has reached global coverage, few studies deliver on its promise of near-comprehensive biodiversity assessment. We then outline how DNA metabarcoding can help us move toward real-time, global bioassessment, illustrating how different stakeholders could benefit from DNA metabarcoding. Next, we address barriers to widespread adoption of DNA metabarcoding, highlighting the need for standardized sampling protocols, experts and computational resources to handle the deluge of genomic data, and standardized, open-source bioinformatic pipelines. Finally, we explore how technological and scientific advances will realize the promise of total biodiversity assessment in a sample—from microbes to mammals—and unlock the rich information genomics exposes, opening new possibilities for merging whole-system DNA metabarcoding with (1) abundance and biomass quantification, (2) advanced modeling, such as species occupancy models, to improve species detection, (3) population genetics, (4) phylogenetics, and (5) food web and functional gene analysis. While many challenges need to be addressed to facilitate widespread adoption of environmental genomic approaches, concurrent scientific and technological advances will usher in methods to supplement existing bioassessment tools reliant on morphological and abiotic data. This expanded toolbox will help ensure that the best tool is used for the job and enable exciting integrative techniques that capitalize on multiple tools. Collectively, these new approaches will aid in addressing the global biodiversity crisis we now face.
24. Balazka D, Rodighiero D. Big Data and the Little Big Bang: An Epistemological (R)evolution. Front Big Data 2020; 3:31. [PMID: 33693404] [PMCID: PMC7931920] [DOI: 10.3389/fdata.2020.00031]
Abstract
Starting from an analysis of frequently employed definitions of big data, it will be argued that, to overcome the intrinsic weaknesses of big data, it is more appropriate to define the object in relational terms. The excessive emphasis on volume and technological aspects of big data, derived from their current definitions, combined with neglected epistemological issues gave birth to an objectivistic rhetoric surrounding big data as implicitly neutral, omni-comprehensive, and theory-free. This rhetoric contradicts the empirical reality that embraces big data: (1) data collection is not neutral nor objective; (2) exhaustivity is a mathematical limit; and (3) interpretation and knowledge production remain both theoretically informed and subjective. Addressing these issues, big data will be interpreted as a methodological revolution carried over by evolutionary processes in technology and epistemology. By distinguishing between forms of nominal and actual access, we claim that big data promoted a new digital divide changing stakeholders, gatekeepers, and the basic rules of knowledge discovery by radically shaping the power dynamics involved in the processes of production and analysis of data.
Affiliation(s)
- Dominik Balazka
- Center for Information and Communication Technology (FBK-ICT) and Center for Religious Studies (FBK-ISR), Fondazione Bruno Kessler, Trento, Italy
- Dario Rodighiero
- Comparative Media Studies/Writing, Massachusetts Institute of Technology, Cambridge, MA, United States.
- Berkman Klein Center for Internet & Society, Harvard University, Cambridge, MA, United States
25. Shelley-Egan C, Gjefsen MD, Nydal R. Consolidating RRI and Open Science: understanding the potential for transformative change. Life Sciences, Society and Policy 2020; 16:7. [PMID: 32869131] [PMCID: PMC7460767] [DOI: 10.1186/s40504-020-00103-5]
Abstract
In European research and innovation policy, Responsible Research and Innovation (RRI) and Open Science (OS) encompass two co-existing sets of ambitions concerning systemic change in the practice of research and innovation. This paper is an exploratory attempt to uncover synergies and differences between RRI and OS, by interrogating what motivates their respective transformative agendas. We offer two storylines that account for the specific contexts and dynamics from which RRI and OS have emerged, which in turn offer entrance points to further unpacking what 'opening up' to society means with respect to the transformative change agendas that are implicit in the two agendas. We compare differences regarding the 'how' of opening up in light of the 'why' to explore common areas of emphasis in both OS and RRI. We argue that while both agendas align with mission-oriented narratives around grand societal challenges, OS tends to emphasize efficiency and technical optimisation over RRI's emphasis on normative concerns and democracy deficits, and that the two agendas thus contrast in their relative legitimate emphasis on doable outcomes versus desirable outcomes. In our conclusion, we reflect on the future outlook for RRI and OS' co-existence and uptake, and on what their respective ambitions for transformation might mean for science-society scholars and scholarship.
Affiliation(s)
- Clare Shelley-Egan
- Work Research Institute, OsloMet – Oslo Metropolitan University, Postboks 4 St. Olavs plass, 0130 Oslo, Norway
- Mads Dahl Gjefsen
- Work Research Institute, OsloMet – Oslo Metropolitan University, Postboks 4 St. Olavs plass, 0130 Oslo, Norway
- Rune Nydal
- Programme for Applied Ethics, Department of Philosophy and Religious Studies, NTNU – Norwegian University of Science and Technology, NO-7491 Trondheim, Norway
26.
Abstract
Cloud computing is a mature technology that has already shown benefits for a wide range of academic research domains that, in turn, utilize a wide range of application design models. In this paper, we discuss the use of cloud computing as a tool to improve the range of resources available for climate science, presenting the evaluation of two different climate models. Each was customized in a different way to run in public cloud computing environments (hereafter cloud computing) provided by three different public vendors: Amazon, Google and Microsoft. The adaptations and procedures necessary to run the models in these environments are described. The computational performance and cost of each model within this new type of environment are discussed, and an assessment is given in qualitative terms. Finally, we discuss how cloud computing can be used for geoscientific modelling, including issues related to the allocation of resources by funding bodies. We also discuss problems related to computing security, reliability and scientific reproducibility.
27.
Abstract
In the context of atomic data computations for astrophysical applications, we review four different types of databases we have implemented for data dissemination: a database for nebular modeling; TIPTOPbase; OPserver; and AtomPy. The database for nebular plasmas is briefly discussed as a study case of a successful project. TOPbase and the OPserver were developed during the Opacity Project, an international consortium concerned with the revision of astrophysical opacities, while TIPbase was part of the Iron Project to calculate radiative transition probabilities and electron impact excitation collision strengths for iron-group ions. AtomPy is a prototype for an open, distributed data-assessment environment to engage both producers and users. We discuss design strategies and implementation issues that may help in the undertaking of present and future scientific database projects.
28. Weber T, Kranzlmüller D, Fromm M, Tavares de Sousa N. Using supervised learning to classify metadata of research data by field of study. Quantitative Science Studies 2020. [DOI: 10.1162/qss_a_00049]
Abstract
Many interesting use cases of research data classifiers presuppose that a research data item can be mapped to more than one field of study, but reproducible evaluations of such classification mechanisms are lacking. This paper closes this gap: it describes the creation of a training and evaluation set comprised of labeled metadata, evaluates several supervised classification approaches, and comments on their application in scientometric research. The metadata were retrieved from the DataCite index of research data, preprocessed, and compiled into a set of 613,585 records. According to our experiments with 20 general fields of study, multilayer perceptron models perform best, followed by long short-term memory models. The models can be used in scientometric research, for example to analyze interdisciplinary trends in digital scholarly output or to characterize growth patterns of research data, stratified by field of study. Our findings allow us to estimate the errors incurred in applying the models. The best-performing models and the data used for their training are available for reuse.
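A minimal sketch of the classification task is given below with scikit-learn; the titles and field-of-study labels are toy stand-ins, not the 613,585 DataCite records, and a small multilayer perceptron replaces the tuned models the authors evaluate.

```python
# Toy sketch of multi-label field-of-study classification of metadata;
# the corpus and label set are invented stand-ins for the DataCite records.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

titles = [
    "Soil microbiome sequencing in alpine meadows",
    "Survey data on voting behavior in urban districts",
    "X-ray diffraction patterns of perovskite thin films",
    "Genome assembly of a drought-tolerant barley line",
]
labels = [["Biology"], ["Social Sciences"], ["Materials Science"],
          ["Biology", "Agriculture"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # one binary column per field of study

# MLPClassifier supports multi-label targets given a binary indicator matrix.
clf = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
)
clf.fit(titles, Y)
pred = clf.predict(["RNA-seq of wheat under heat stress"])
print(mlb.inverse_transform(pred))  # predicted field(s) of study
```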
Affiliation(s)
- Tobias Weber
- Munich Network Management Team, Leibniz Supercomputing Centre (Germany)
- Dieter Kranzlmüller
- Munich Network Management Team, Ludwig-Maximilians-Universität München (Germany)
- Michael Fromm
- Database Systems Group, Ludwig-Maximilians-Universität München (Germany)
29
Mahynski NA, Hatch HW, Witman M, Sheen DA, Errington JR, Shen VK. Flat-histogram extrapolation as a useful tool in the age of big data. MOLECULAR SIMULATION 2020. [DOI: 10.1080/08927022.2020.1747617] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
Affiliation(s)
- Nathan A. Mahynski
- Chemical Sciences Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Harold W. Hatch
- Chemical Sciences Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
- David A. Sheen
- Chemical Sciences Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Jeffrey R. Errington
- Department of Chemical and Biological Engineering, University at Buffalo, The State University of New York, Buffalo, NY, USA
- Vincent K. Shen
- Chemical Sciences Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
30
Compson ZG, Monk WA, Hayden B, Bush A, O'Malley Z, Hajibabaei M, Porter TM, Wright MTG, Baker CJO, Al Manir MS, Curry RA, Baird DJ. Network-Based Biomonitoring: Exploring Freshwater Food Webs With Stable Isotope Analysis and DNA Metabarcoding. Front Ecol Evol 2019. [DOI: 10.3389/fevo.2019.00395] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open
31
Li Q, Wang P, Sun Y, Zhang Y, Chen C. Data-driven decision making in graduate students’ research topic selection. ASLIB J INFORM MANAG 2019. [DOI: 10.1108/ajim-01-2019-0019] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
Purpose
With the advent of the intelligent environment, graduate students, as novice researchers, face digital challenges in their research topic selection (RTS). The purpose of this paper is to explore their cognitive processes during data-driven decision making (DDDM) in RTS, with a view to developing technical and instructional strategies that facilitate their research tasks.
Design/methodology/approach
This study develops a theoretical model that considers data-driven RTS as a second-order factor comprising both rational and experiential modes. Additionally, data literacy and visual data presentation were proposed as an antecedent and a consequence of data-driven RTS, respectively. The proposed model was examined by structural equation modeling based on a sample of 931 graduate students.
Findings
The results indicate that data-driven RTS is a second-order factor that positively affects the level of support of visual data presentation and that data literacy has a positive impact on DDDM in RTS. Furthermore, data literacy indirectly affects the level of support of visual data presentation.
Practical implications
These findings provide support for developers of knowledge discovery systems, data scientists, universities and libraries on the optimization of data visualization and data literacy instruction that conform to students’ cognitive styles to inform RTS.
Originality/value
This paper reveals the cognitive mechanisms underlying the effects of data literacy and data-driven RTS, under rational and experiential modes, on the level of support of tabular or graphical presentations. It provides insights into the match between visualization formats and cognitive modes.
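A minimal sketch of the kind of structural model the paper estimates can be written in semopy's lavaan-style syntax. The variable names below are invented stand-ins for the paper's constructs, the second-order structure is simplified to observed indicators, and the random data merely makes the snippet self-contained.

```python
# Sketch of a structural equation model in semopy (lavaan-style syntax).
# Variable names are invented stand-ins for the paper's constructs; the
# authors' measurement items and software may differ.
import numpy as np
import pandas as pd
from semopy import Model

# DDRTS as a factor over rational/experiential modes; hypothesized paths:
# data_literacy -> DDRTS -> support for visual data presentation.
desc = """
DDRTS =~ rational + experiential
DDRTS ~ data_literacy
visual_support ~ DDRTS
"""

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(200, 4)),
    columns=["rational", "experiential", "data_literacy", "visual_support"],
)

model = Model(desc)
model.fit(df)
print(model.inspect())   # path estimates, standard errors, p-values
```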
32
Data science from a library and information science perspective. DATA TECHNOLOGIES AND APPLICATIONS 2019. [DOI: 10.1108/dta-05-2019-0076] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/06/2023]
Abstract
Purpose
Data science is a relatively new field that has gained considerable attention in recent years. It requires a wide range of knowledge and skills from different disciplines, including mathematics and statistics, computer science and information science. The purpose of this paper is to present the results of a study that explored the field of data science from the library and information science (LIS) perspective.
Design/methodology/approach
Research publications on data science were analyzed on the basis of papers indexed in the Web of Science database. The following research questions were proposed: What are the main tendencies in publication years, document types, countries of origin, source titles, authors, author affiliations and the most cited articles related to data science in the field of LIS? What are the main themes discussed in these publications from the LIS perspective?
Findings
The highest contribution to data science comes from the computer science research community; the contribution of the information science and library science community is quite small, although the number of articles has increased continuously since 2015. The main document types are journal articles, followed by conference proceedings and editorial material. The top three journals publishing data science papers from the LIS perspective are the Journal of the American Medical Informatics Association, the International Journal of Information Management and the Journal of the Association for Information Science and Technology. The top five publishing countries are the USA, China, England, Australia and India. The most cited article has 112 citations. The analysis revealed that the data science field is quite interdisciplinary by nature: in addition to LIS, the papers belonged to several other research areas. The reviewed articles fell into six broad categories: data science education and training; knowledge and skills of the data professional; the role of libraries and librarians in the data science movement; tools, techniques and applications of data science; data science from the knowledge management perspective; and data science from the perspective of health sciences.
Research limitations/implications
This study analyzed only research papers indexed in the Web of Science database and therefore covers only part of the scientific literature published in the field of LIS. In addition, only publications with the term "data science" in the topic field of the Web of Science database were analyzed. Several relevant studies are therefore not discussed in this paper, either because they are not reflected in the Web of Science database or because they relate to other keywords such as "e-science," "e-research," "data service," "data curation" or "research data management."
Originality/value
The field of data science had not previously been explored through a bibliometric analysis of publications from the LIS perspective. This paper helps to better understand the field of data science and its perspectives for information professionals.
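The descriptive counts such a study reports (publications per year, document types, top source titles) reduce to simple aggregations over an exported record set. A sketch with pandas follows, assuming a hypothetical tab-delimited Web of Science export; PY (publication year), DT (document type) and SO (source title) are standard WoS field tags.

```python
# Sketch: descriptive bibliometrics over a Web of Science tab-delimited
# export. The file is hypothetical; PY, DT and SO are standard WoS tags.
import pandas as pd

df = pd.read_csv("wos_data_science_lis.txt", sep="\t", index_col=False)

by_year = df["PY"].value_counts().sort_index()     # publications per year
by_type = df["DT"].value_counts()                  # articles, proceedings...
top_journals = df["SO"].value_counts().head(3)     # top source titles

print(by_year, by_type, top_journals, sep="\n\n")
```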
33
Molecular Inverse Comorbidity between Alzheimer's Disease and Lung Cancer: New Insights from Matrix Factorization. Int J Mol Sci 2019; 20:ijms20133114. [PMID: 31247897 PMCID: PMC6650839 DOI: 10.3390/ijms20133114] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2019] [Revised: 06/13/2019] [Accepted: 06/18/2019] [Indexed: 12/23/2022] Open
Abstract
Matrix factorization (MF) is an established paradigm for large-scale biological data analysis with tremendous potential in computational biology. Here, we challenge MF to depict the molecular bases of epidemiologically described disease–disease (DD) relationships. As a use case, we focus on the inverse comorbidity association between Alzheimer's disease (AD) and lung cancer (LC), described as a lower-than-expected probability of developing LC in AD patients. To this day, the molecular mechanisms underlying DD relationships remain poorly explained, and their better characterization might offer unprecedented clinical opportunities. To this end, we extend our previously designed MF-based framework for the molecular characterization of DD relationships. Considering AD–LC inverse comorbidity as a case study, we highlight multiple molecular mechanisms, among which we confirm the involvement of processes related to the immune system and mitochondrial metabolism. We then distinguish mechanisms specific to LC from those shared with other cancers through a pan-cancer analysis. Additionally, new candidate molecular players, such as estrogen receptor (ER), cadherin 1 (CDH1) and histone deacetylase (HDAC), are pinpointed as factors that might underlie the inverse relationship, opening the way to new investigations. Finally, some lung cancer subtype-specific factors are also detected, suggesting heterogeneity across patients in the context of inverse comorbidity.
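The paper's own MF framework is not reproduced here, but the generic step it builds on, factorizing an expression matrix into components whose gene weights can then be compared across the AD and LC datasets, can be sketched with scikit-learn's FastICA as a stand-in. The matrix and component count below are toy placeholders.

```python
# Generic stand-in for the MF step: extract independent components from a
# samples x genes expression matrix with FastICA. Random data and the
# component count are placeholders; the authors' framework differs.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 60))        # 200 samples x 60 genes (toy scale)

ica = FastICA(n_components=5, random_state=0)
S = ica.fit_transform(X)              # sample activities per component
W = ica.components_                   # gene weights per component (5 x 60)

# Genes with extreme weights in a component are candidate mechanism members;
# matching components across AD and LC runs suggests shared mechanisms.
top_genes = np.argsort(np.abs(W[0]))[::-1][:10]
print(top_genes)
```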
34
Sachs J, Page R, Baskauf SJ, Pender J, Lujan-Toro B, Macklin J, Compson Z. Training and hackathon on building biodiversity knowledge graphs. RESEARCH IDEAS AND OUTCOMES 2019. [DOI: 10.3897/rio.5.e36152] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Knowledge graphs have the potential to unite disconnected digitized biodiversity data, and there are a number of efforts underway to build biodiversity knowledge graphs. More generally, the recent popularity of knowledge graphs, driven in part by the advent and success of the Google Knowledge Graph, has breathed life into the ongoing development of semantic web infrastructure and prototypes in the biodiversity informatics community. We describe a one-week training event and hackathon that focused on applying three specific knowledge graph technologies (the Neptune graph database, Metaphactory, and Wikidata) to a diverse set of biodiversity use cases.
We give an overview of the training, the projects that were advanced throughout the week, and the critical discussions that emerged. We believe that the main barriers to the adoption of biodiversity knowledge graphs are a lack of understanding of knowledge graphs and a lack of adoption of shared unique identifiers. Furthermore, we believe an important advance in the outlook for knowledge graph development is the emergence of Wikidata as an identifier broker and as a scoping tool. To lower the current barriers to biodiversity knowledge graph development, we recommend continued discussions at workshops and conferences, which we expect to increase awareness and adoption of knowledge graph technologies.
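Wikidata's role as an identifier broker can be illustrated with a few lines of SPARQL: the query below resolves a scientific name to its Wikidata identifier via the taxon-name property (P225). It is an illustrative query, not one of the hackathon projects.

```python
# Sketch: using Wikidata as an identifier broker. Resolve a scientific
# taxon name (property P225) to a Wikidata item via SPARQL.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
endpoint.setQuery("""
SELECT ?taxon ?taxonLabel WHERE {
  ?taxon wdt:P225 "Salmo salar".
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["taxon"]["value"], row["taxonLabel"]["value"])
```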
35
Hu Y, Niemeyer CM. From DNA Nanotechnology to Material Systems Engineering. ADVANCED MATERIALS (DEERFIELD BEACH, FLA.) 2019; 31:e1806294. [PMID: 30767279 DOI: 10.1002/adma.201806294] [Citation(s) in RCA: 99] [Impact Index Per Article: 19.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/28/2018] [Revised: 11/29/2018] [Indexed: 05/25/2023]
Abstract
In the past 35 years, DNA nanotechnology has grown into a highly innovative and vibrant field of research at the interface of chemistry, materials science, biotechnology, and nanotechnology. Herein, a short summary is given of the state of research in various subdisciplines of DNA nanotechnology, ranging from pure "structural DNA nanotechnology" through protein-DNA assemblies, nanoparticle-based DNA materials, and DNA polymers to DNA surface technology. The survey shows that these subdisciplines are growing ever closer together and suggests that this integration is essential to initiating the next phase of development. With the increasing implementation of machine-based approaches in microfluidics, robotics, and data-driven science, DNA-material systems will emerge that could be suitable for applications in sensor technology, photonics, as interfaces between technical systems and living organisms, or for biomimetic fabrication processes.
Affiliation(s)
- Yong Hu
- Karlsruhe Institute of Technology (KIT), Institute for Biological Interfaces (IBG 1), Hermann-von-Helmholtz-Platz 1, D-76344, Eggenstein-Leopoldshafen, Germany
- Christof M Niemeyer
- Karlsruhe Institute of Technology (KIT), Institute for Biological Interfaces (IBG 1), Hermann-von-Helmholtz-Platz 1, D-76344, Eggenstein-Leopoldshafen, Germany
36
The Sciences Underlying Smart Sustainable Urbanism: Unprecedented Paradigmatic and Scholarly Shifts in Light of Big Data Science and Analytics. SMART CITIES 2019. [DOI: 10.3390/smartcities2020013] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
As a new area of science and technology (S&T), big data science and analytics embodies an unprecedentedly transformative power—which is manifested not only in the form of revolutionizing science and transforming knowledge, but also in advancing social practices, catalyzing major shifts, and fostering societal transitions. Of particular relevance, it is instigating a massive change in the way both smart cities and sustainable cities are understood, studied, planned, operated, and managed to improve and maintain sustainability in the face of expanding urbanization. This relates to what has been dubbed data-driven smart sustainable urbanism, an emerging approach that is based on a computational understanding of city systems that reduces urban life to logical and algorithmic rules and procedures, as well as employs a new scientific method based on data-intensive science, while also harnessing urban big data to provide a more holistic and integrated view and synoptic intelligence of the city. This paper examines the unprecedented paradigmatic and scholarly shifts that the sciences underlying smart sustainable urbanism are undergoing in light of big data science and analytics and the underlying enabling technologies, as well as discusses how these shifts intertwine with and affect one another in the context of sustainability. I argue that data-intensive science, as a new epistemological shift, is fundamentally changing the scientific and practical foundations of urban sustainability. In specific terms, the new urban science—as underpinned by sustainability science and urban sustainability—is increasingly making cities more sustainable, resilient, efficient, and livable by rendering them more measurable, knowable, and tractable in terms of their operational functioning, management, planning, design, and development.
37
Olmeda-Gómez C, Romá-Mateo C, Ovalle-Perandones MA. Overview of trends in global epigenetic research (2009–2017). Scientometrics 2019. [DOI: 10.1007/s11192-019-03095-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022]
38
Fadlelmola FM, Panji S, Ahmed AE, Ghouila A, Akurugu WA, Domelevo Entfellner JB, Souiai O, Mulder N. Ten simple rules for organizing a webinar series. PLoS Comput Biol 2019; 15:e1006671. [PMID: 30933972 PMCID: PMC6443143 DOI: 10.1371/journal.pcbi.1006671] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023] Open
Affiliation(s)
- Faisal M. Fadlelmola
- Centre for Bioinformatics and Systems Biology, Faculty of Science, University of Khartoum, Khartoum, Sudan
- Sumir Panji
- Computational Biology Division, Department of Integrative Biomedical Sciences, University of Cape Town, Cape Town, South Africa
- Azza E. Ahmed
- Centre for Bioinformatics and Systems Biology, Faculty of Science, University of Khartoum, Khartoum, Sudan
- Department of Electrical and Electronic Engineering, Faculty of Engineering, University of Khartoum, Khartoum, Sudan
- Amel Ghouila
- Laboratory of Transmission, Control and Immunobiology of Infections (LTCII), Institut Pasteur de Tunis, Tunis-Belvédère, Tunisia
- Wisdom A. Akurugu
- Noguchi Memorial Institute for Medical Research, University of Ghana, Legon, Accra, Ghana
- Jean-Baka Domelevo Entfellner
- South African MRC Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Bellville 7535, Cape Town, South Africa
- Oussema Souiai
- Laboratory of BioInformatics, Biomathematics and bioStatistics, Institut Pasteur de Tunis, Tunis, Tunisia
- Institut Supérieur des Technologies Médicales, Université Tunis El Manar, Tunis, Tunisia
- Nicola Mulder
- Computational Biology Division, Department of Integrative Biomedical Sciences, University of Cape Town, Cape Town, South Africa
39
Mongiardino Koch N. The phylogenomic revolution and its conceptual innovations: a text mining approach. ORG DIVERS EVOL 2019. [DOI: 10.1007/s13127-019-00397-0] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/24/2023]
40
Almeida JS, Hajagos J, Saltz J, Saltz M. Serverless OpenHealth at data commons scale-traversing the 20 million patient records of New York's SPARCS dataset in real-time. PeerJ 2019; 7:e6230. [PMID: 30671301 PMCID: PMC6338105 DOI: 10.7717/peerj.6230] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2018] [Accepted: 12/07/2018] [Indexed: 12/05/2022] Open
Abstract
In a previous report, we explored the serverless OpenHealth approach to the Web as a Global Compute space. That approach relies on the modern browser full stack and, in particular, its configuration for application assembly by code injection. The opportunity, and need, to expand this approach has since increased markedly, reflecting a wider adoption of Open Data policies by Public Health Agencies. Here, we describe how the serverless scaling challenge can be met through an isomorphic mapping between the remote data-layer API and a local (client-side, in-browser) operator. This solution is validated with an accompanying interactive web application (bit.ly/loadsparcs) capable of real-time traversal of New York's 20 million patient records of the Statewide Planning and Research Cooperative System (SPARCS), and is compared with alternative approaches. The results obtained strengthen the argument that the FAIR reproducibility needed for Population Science applications in the age of P4 Medicine is particularly well served by the Web platform.
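The data-layer side of this approach rests on open HTTP APIs. A minimal sketch in Python pages through a SPARCS release on health.data.ny.gov using Socrata's standard $limit/$offset parameters; the dataset identifier is a placeholder (each SPARCS release has its own ID), and the original work performs this traversal client-side in the browser rather than in Python.

```python
# Sketch of the data-layer idea: page through an open SPARCS release via
# its Socrata SODA API with plain HTTP, no server-side code of our own.
# The dataset identifier below is a placeholder.
import requests

BASE = "https://health.data.ny.gov/resource/xxxx-xxxx.json"  # placeholder ID

def fetch_pages(page_size=50000):
    """Yield successive pages of records, Socrata-style ($limit/$offset)."""
    offset = 0
    while True:
        rows = requests.get(
            BASE, params={"$limit": page_size, "$offset": offset}, timeout=60
        ).json()
        if not rows:
            return
        yield rows
        offset += page_size

for page in fetch_pages():
    print(len(page))   # aggregate in-stream instead of materializing all rows
    break
```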
Affiliation(s)
- Jonas S Almeida
- Biomedical Informatics, State University of New York at Stony Brook, Stony Brook, NY, United States of America
- Janos Hajagos
- Biomedical Informatics, State University of New York at Stony Brook, Stony Brook, NY, United States of America
- Joel Saltz
- Biomedical Informatics, State University of New York at Stony Brook, Stony Brook, NY, United States of America
- Mary Saltz
- Radiology, State University of New York at Stony Brook, Stony Brook, NY, United States of America
41
Implementing in the VAMDC the New Paradigms for Data Citation from the Research Data Alliance. DATA SCIENCE JOURNAL 2019. [DOI: 10.5334/dsj-2019-004] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
42
Stein-O'Brien GL, Arora R, Culhane AC, Favorov AV, Garmire LX, Greene CS, Goff LA, Li Y, Ngom A, Ochs MF, Xu Y, Fertig EJ. Enter the Matrix: Factorization Uncovers Knowledge from Omics. Trends Genet 2018; 34:790-805. [PMID: 30143323 PMCID: PMC6309559 DOI: 10.1016/j.tig.2018.07.003] [Citation(s) in RCA: 111] [Impact Index Per Article: 18.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2018] [Revised: 06/01/2018] [Accepted: 07/16/2018] [Indexed: 12/20/2022]
Abstract
Omics data contain signals from the molecular, physical, and kinetic inter- and intracellular interactions that control biological systems. Matrix factorization (MF) techniques can reveal low-dimensional structure from high-dimensional data that reflect these interactions. These techniques can uncover new biological knowledge from diverse high-throughput omics data in applications ranging from pathway discovery to timecourse analysis. We review exemplary applications of MF for systems-level analyses, discuss appropriate applications of these methods and their limitations, and focus on the analysis of results to facilitate optimal biological interpretation. The inference of biologically relevant features with MF enables discovery from high-throughput data beyond the limits of current biological knowledge, answering questions from high-dimensional data that we have not yet thought to ask.
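The decomposition the review describes, an omics matrix split into an amplitude matrix of gene weights and a pattern matrix of sample-level factor activity, can be sketched with scikit-learn's NMF; the random non-negative matrix below is a stand-in for real omics data.

```python
# Toy sketch of the MF decomposition discussed in the review: an omics
# matrix X (genes x samples) factored into an amplitude matrix A (gene
# weights per factor) and a pattern matrix P (factor activity per sample).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((500, 40))             # 500 genes x 40 samples (toy data)

model = NMF(n_components=4, init="nndsvd", random_state=0)
A = model.fit_transform(X)            # 500 x 4: gene loadings ("amplitude")
P = model.components_                 # 4 x 40: sample patterns over factors

# Interpretation step: the top-loading genes of a factor suggest a pathway,
# and its row of P shows where (e.g., which timepoints) it is active.
print(np.argsort(A[:, 0])[::-1][:10])
```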
Affiliation(s)
- Genevieve L Stein-O'Brien
- Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD, USA; Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, USA; McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
- Raman Arora
- Department of Computer Science, Institute for Data Intensive Engineering and Science, Johns Hopkins University, Baltimore, MD, USA
- Aedin C Culhane
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA; Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA, USA
- Alexander V Favorov
- Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD, USA; Vavilov Institute of General Genetics, Moscow, Russia
- Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, PA, USA; Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, PA, USA
- Loyal A Goff
- Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, USA; McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
- Yifeng Li
- Digital Technologies Research Centre, National Research Council of Canada, Ottawa, ON, Canada
- Alioune Ngom
- School of Computer Science, University of Windsor, Windsor, ON, Canada
- Michael F Ochs
- Department of Mathematics and Statistics, The College of New Jersey, Ewing, NJ, USA
- Yanxun Xu
- Department of Applied Mathematics and Statistics, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
- Elana J Fertig
- Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD, USA
43
Meyer A, Zverinski D, Pfahringer B, Kempfert J, Kuehne T, Sündermann SH, Stamm C, Hofmann T, Falk V, Eickhoff C. Machine learning for real-time prediction of complications in critical care: a retrospective study. THE LANCET RESPIRATORY MEDICINE 2018; 6:905-914. [PMID: 30274956 DOI: 10.1016/s2213-2600(18)30300-x] [Citation(s) in RCA: 158] [Impact Index Per Article: 26.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Subscribe] [Scholar Register] [Received: 01/30/2018] [Revised: 06/29/2018] [Accepted: 07/09/2018] [Indexed: 12/31/2022]
Abstract
Background
The large amount of clinical signals in intensive care units can easily overwhelm health-care personnel and can lead to treatment delays, suboptimal care, or clinical errors. The aim of this study was to apply deep machine learning methods to predict severe complications during critical care in real time after cardiothoracic surgery.
Methods
We used deep learning methods (recurrent neural networks) to predict several severe complications (mortality, renal failure with a need for renal replacement therapy, and postoperative bleeding leading to operative revision) in post cardiosurgical care in real time. Adult patients who underwent major open heart surgery from Jan 1, 2000, to Dec 31, 2016, in a German tertiary care centre for cardiovascular diseases formed the main derivation dataset. We measured the accuracy and timeliness of the deep learning model's forecasts and compared predictive quality to that of established standard-of-care clinical reference tools (clinical rule for postoperative bleeding, Simplified Acute Physiology Score II for mortality, and the Kidney Disease: Improving Global Outcomes staging criteria for acute renal failure) using positive predictive value (PPV), negative predictive value, sensitivity, specificity, area under the curve (AUC), and the F1 measure (which computes a harmonic mean of sensitivity and PPV). Results were externally retrospectively validated with 5898 cases from the published MIMIC-III dataset.
Findings
Of 47 559 intensive care admissions (corresponding to 42 007 patients), we included 11 492 (corresponding to 9269 patients). The deep learning models yielded accurate predictions with the following PPV and sensitivity scores: PPV 0·90 and sensitivity 0·85 for mortality, 0·87 and 0·94 for renal failure, and 0·84 and 0·74 for bleeding. The predictions significantly outperformed the standard clinical reference tools, improving the absolute complication prediction AUC by 0·29 (95% CI 0·23-0·35) for bleeding, by 0·24 (0·19-0·29) for mortality, and by 0·24 (0·13-0·35) for renal failure (p<0·0001 for all three analyses). The deep learning methods showed accurate predictions immediately after patient admission to the intensive care unit. We also observed an increase in performance in our validation cohort when the machine learning approach was tested against clinical reference tools, with absolute improvements in AUC of 0·09 (95% CI 0·03-0·15; p=0·0026) for bleeding, of 0·18 (0·07-0·29; p=0·0013) for mortality, and of 0·25 (0·18-0·32; p<0·0001) for renal failure.
Interpretation
The observed improvements in prediction for all three investigated clinical outcomes have the potential to improve critical care. These findings are noteworthy in that they use routinely collected clinical data exclusively, without the need for any manual processing. The deep machine learning method showed AUC scores that significantly surpass those of clinical reference tools, especially soon after admission. Taken together, these properties are encouraging for prospective deployment in critical care settings to direct the staff's attention towards patients who are most at risk.
Funding
No specific funding.
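A minimal sketch of the modelling idea, a recurrent network consuming a stream of clinical measurements and emitting a risk estimate at every timestep, is given below in PyTorch. The dimensions, toy data and single-layer GRU are assumptions, not the authors' architecture.

```python
# Minimal sketch (not the authors' architecture): a recurrent network over
# a stream of vital-sign snapshots, emitting a complication risk at every
# timestep. Dimensions and data are toy values.
import torch
import torch.nn as nn

class RiskRNN(nn.Module):
    def __init__(self, n_signals=52, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_signals, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, time, n_signals)
        h, _ = self.rnn(x)
        return torch.sigmoid(self.head(h)).squeeze(-1)  # risk per timestep

model = RiskRNN()
x = torch.randn(8, 120, 52)                # 8 stays, 120 timesteps each
y = (torch.rand(8, 120) > 0.9).float()     # toy complication labels
loss = nn.functional.binary_cross_entropy(model(x), y)
loss.backward()
print(float(loss))
```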
Affiliation(s)
- Alexander Meyer
- Department of Cardiothoracic and Vascular Surgery, Deutsches Herzzentrum Berlin, Berlin, Germany; DZHK (German Centre for Cardiovascular Research), Partner Site Berlin, Berlin, Germany; Berlin Institute of Health, Berlin, Germany
- Dina Zverinski
- Department of Cardiothoracic and Vascular Surgery, Deutsches Herzzentrum Berlin, Berlin, Germany; Department of Computer Science, ETH Zurich, Zurich, Switzerland
- Boris Pfahringer
- Department of Cardiothoracic and Vascular Surgery, Deutsches Herzzentrum Berlin, Berlin, Germany; Berlin Institute of Health, Berlin, Germany
- Jörg Kempfert
- Department of Cardiothoracic and Vascular Surgery, Deutsches Herzzentrum Berlin, Berlin, Germany
- Titus Kuehne
- Institute of Imaging Science and Computational Modelling, Charité - Universitätsmedizin Berlin, Berlin, Germany
- Simon H Sündermann
- Department of Cardiothoracic and Vascular Surgery, Deutsches Herzzentrum Berlin, Berlin, Germany; Department of Cardiovascular Surgery, Charité - Universitätsmedizin Berlin, Berlin, Germany
- Christof Stamm
- Department of Cardiothoracic and Vascular Surgery, Deutsches Herzzentrum Berlin, Berlin, Germany; Berlin Center for Regenerative Therapies, Charité - Universitätsmedizin Berlin, Berlin, Germany; DZHK (German Centre for Cardiovascular Research), Partner Site Berlin, Berlin, Germany
- Thomas Hofmann
- Department of Computer Science, ETH Zurich, Zurich, Switzerland
- Volkmar Falk
- Department of Cardiothoracic and Vascular Surgery, Deutsches Herzzentrum Berlin, Berlin, Germany; Department of Cardiovascular Surgery, Charité - Universitätsmedizin Berlin, Berlin, Germany; DZHK (German Centre for Cardiovascular Research), Partner Site Berlin, Berlin, Germany
- Carsten Eickhoff
- Department of Computer Science, ETH Zurich, Zurich, Switzerland; Center for Biomedical Informatics, Brown University, Providence, RI, USA
44
Mura C, Draizen EJ, Bourne PE. Structural biology meets data science: does anything change? Curr Opin Struct Biol 2018; 52:95-102. [PMID: 30267935 DOI: 10.1016/j.sbi.2018.09.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2018] [Revised: 08/31/2018] [Accepted: 09/07/2018] [Indexed: 01/22/2023]
Abstract
Data science has emerged from the proliferation of digital data, coupled with advances in algorithms, software and hardware (e.g., GPU computing). Innovations in structural biology have been driven by similar factors, spurring us to ask: can these two fields impact one another in deep and hitherto unforeseen ways? We posit that the answer is yes. New biological knowledge lies in the relationships between sequence, structure, function and disease, all of which play out on the stage of evolution, and data science enables us to elucidate these relationships at scale. Here, we consider the above question from the five key pillars of data science: acquisition, engineering, analytics, visualization and policy, with an emphasis on machine learning as the premier analytics approach.
Affiliation(s)
- Cameron Mura
- Department of Biomedical Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Eli J Draizen
- Department of Biomedical Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Philip E Bourne
- Department of Biomedical Engineering, University of Virginia, Charlottesville, VA 22908, USA; Data Science Institute, University of Virginia, Charlottesville, VA 22904, USA
45
Two-step wavelet-based estimation for Gaussian mixed fractional processes. STATISTICAL INFERENCE FOR STOCHASTIC PROCESSES 2018. [DOI: 10.1007/s11203-018-9190-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
46
A visual analytics with evidential inference for big data: case study of chemical vapor deposition in solar company. GRANULAR COMPUTING 2018. [DOI: 10.1007/s41066-018-0116-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
47
A Proposed Solution and Future Direction for Blockchain-Based Heterogeneous Medicare Data in Cloud Environment. J Med Syst 2018; 42:156. [PMID: 29987560 DOI: 10.1007/s10916-018-1007-5] [Citation(s) in RCA: 74] [Impact Index Per Article: 12.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2018] [Accepted: 06/26/2018] [Indexed: 10/28/2022]
Abstract
Healthcare data are an important asset and a rich source of healthcare intelligence. Medical databases, if created properly, will be large, complex, heterogeneous and time-varying. The main challenge nowadays is to store and process these data efficiently so that they can benefit humans. Heterogeneity in the healthcare sector, in the form of medical data, is also considered one of the biggest challenges for researchers; sometimes, these data are referred to as large-scale data or big data. Blockchain technology and the Cloud environment have proved their usability separately, and the two technologies can be combined to enhance exciting applications in the healthcare industry. Blockchain is a highly secure and decentralized networking platform of multiple computers called nodes. It is changing the way medical information is stored and shared: it makes the work easier, safeguards the security and accuracy of the data, and reduces the cost of maintenance. A Blockchain-based platform is proposed that can be used for storing and managing electronic medical records in a Cloud environment.
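The tamper-evidence that motivates such proposals comes from hash-linking: each block stores the hash of its predecessor, so altering any stored record invalidates every later block. The toy chain below illustrates only that property; consensus, key management and the Cloud storage layer of the proposed platform are out of scope.

```python
# Toy hash-linked chain illustrating the core blockchain property:
# each block embeds the hash of its predecessor, so tampering with any
# stored record breaks every later link. Not a real deployment.
import hashlib
import json
import time

def make_block(record, prev_hash):
    block = {"time": time.time(), "record": record, "prev": prev_hash}
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode()
    ).hexdigest()
    return block

genesis = make_block({"note": "chain start"}, prev_hash="0" * 64)
block1 = make_block({"patient": "anon-17", "event": "ECG stored"}, genesis["hash"])

def verify(chain):
    """Recompute every hash and check each link to its predecessor."""
    for prev, cur in zip(chain, chain[1:]):
        body = {k: cur[k] for k in ("time", "record", "prev")}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if cur["prev"] != prev["hash"] or cur["hash"] != digest:
            return False
    return True

print(verify([genesis, block1]))   # True until any stored record is altered
```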
48
Founds S. Systems biology for nursing in the era of big data and precision health. Nurs Outlook 2018; 66:283-292. [DOI: 10.1016/j.outlook.2017.11.006] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2017] [Revised: 11/25/2017] [Accepted: 11/26/2017] [Indexed: 12/19/2022]
49
Abstract
Purpose
In Digging into Data 3 (DID3) (2014-2016), ten funders from four countries (the USA, Canada, the UK, and the Netherlands) granted $5.1 million to 14 project teams to pursue data-intensive, interdisciplinary, and international digital humanities (DH) research. The purpose of this paper is to employ the DID3 projects as a case study to explore the following research question: what roles do librarians and archivists take on in data-intensive, interdisciplinary, and international DH projects?
Design/methodology/approach
Participation was secured from 53 persons representing eleven projects. The study was conducted in the naturalistic paradigm. It is a qualitative case study involving snowball sampling, semi-structured interviews, and grounded analysis.
Findings
Librarians or archivists were officially involved in 3 of the 11 projects (27.3 percent). Perhaps more importantly, information professionals played vital unofficial roles in these projects, namely as consultants, liaisons and technical support. Information and library science (ILS) expertise helped DID3 researchers with issues such as visualization, rights management, and user testing. DID3 participants also suggested ways in which librarians and archivists might further support DH projects, concentrating on three key areas: curation, outreach, and ILS education. Finally, six directions for future research are suggested.
Originality/value
Much untapped potential exists for librarians and archivists to collaborate with DH scholars; a gap exists between researcher awareness and information professionals’ capacity.
50
Biscarini F, Cozzi P, Orozco-Ter Wengel P. Lessons learnt on the analysis of large sequence data in animal genomics. Anim Genet 2018; 49:147-158. [PMID: 29624711 DOI: 10.1111/age.12655] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 02/11/2018] [Indexed: 11/28/2022]
Abstract
The 'omics revolution has made a large amount of sequence data available to researchers and the industry. This has had a profound impact on the field of bioinformatics, stimulating unprecedented advancements in the discipline. This is usually looked at from the perspective of human 'omics, in particular human genomics, but plant and animal genomics have also been deeply influenced by next-generation sequencing technologies, with several genomics applications now popular among researchers and the breeding industry. Genomics tends to generate huge amounts of data, and genomic sequence data account for an increasing proportion of big data in biological sciences, due largely to decreasing sequencing and genotyping costs and to large-scale sequencing and resequencing projects. The analysis of big data poses a challenge to scientists, as data gathering currently takes place at a faster pace than data processing and analysis, and the associated computational burden is increasingly taxing, making even simple manipulation, visualization and transfer of data cumbersome operations. The time consumed by processing and analysing huge data sets may come at the expense of data quality assessment and critical interpretation. Additionally, when analysing lots of data, something is likely to go awry (the software may crash or stop), and it can be very frustrating to track down the error. We herein review the most relevant issues related to tackling these challenges and problems, from the perspective of animal genomics, and provide researchers who lack extensive computing experience with guidelines that will help when processing large genomic data sets.
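One of the practical lessons above, processing large sequence files in a streaming fashion rather than loading them into memory, can be sketched in a few lines of Python for a gzipped FASTQ; the file name is a placeholder.

```python
# Sketch of the streaming discipline recommended for large sequence data:
# walk a gzipped FASTQ record-by-record, keeping running tallies instead
# of loading the whole file. The file name is a placeholder.
import gzip
from itertools import islice

def fastq_records(path):
    """Yield (header, sequence, quality) triples from a gzipped FASTQ."""
    with gzip.open(path, "rt") as fh:
        while True:
            chunk = list(islice(fh, 4))    # FASTQ = 4 lines per record
            if len(chunk) < 4:
                return
            header, seq, _, qual = (line.rstrip("\n") for line in chunk)
            yield header, seq, qual

n_reads, n_bases = 0, 0
for _, seq, _ in fastq_records("sample_R1.fastq.gz"):
    n_reads += 1
    n_bases += len(seq)

print(n_reads, n_bases)
```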
Affiliation(s)
- F Biscarini
- CNR-IBBA, Via Bassini 15, 20133, Milan, Italy; School of Medicine, Cardiff University, Heath Park, CF14 4XN, Cardiff, UK
- P Cozzi
- CNR-IBBA, Via Bassini 15, 20133, Milan, Italy; Department of Bioinformatics and Biostatistics, PTP Science Park, Via Einstein, 26900, Lodi, Italy
- P Orozco-Ter Wengel
- School of Biosciences, Cardiff University, Museum Avenue, CF10 3AX, Cardiff, UK