1. Allers S, O’Connell KA, Carlson T, Belardo D, King BL. Reusable tutorials for using cloud-based computing environments for the analysis of bacterial gene expression data from bulk RNA sequencing. Brief Bioinform 2024; 25:bbae301. [PMID: 38997128 PMCID: PMC11245317 DOI: 10.1093/bib/bbae301] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Revised: 05/29/2024] [Accepted: 06/07/2024] [Indexed: 07/14/2024] Open
Abstract
This manuscript describes the development of a resource module that is part of a learning platform named "NIGMS Sandbox for Cloud-based Learning" https://github.com/NIGMS/NIGMS-Sandbox. The overall genesis of the Sandbox is described in the editorial NIGMS Sandbox at the beginning of this Supplement. This module delivers learning materials on RNA sequencing (RNAseq) data analysis in an interactive format that uses appropriate cloud resources for data access and analyses. Biomedical research is increasingly data-driven and dependent upon data management and analysis methods that facilitate rigorous, robust, and reproducible research. Cloud-based computing resources provide opportunities to broaden the application of bioinformatics and data science in research. Two obstacles for researchers, particularly those at small institutions, are: (i) access to bioinformatics analysis environments tailored to their research; and (ii) training in how to use cloud-based computing resources. We developed five reusable tutorials for bulk RNAseq data analysis to address these obstacles. Using Jupyter notebooks run on the Google Cloud Platform, the tutorials guide the user through a workflow featuring an RNAseq dataset from a study of prophage-altered drug resistance in Mycobacterium chelonae. The first tutorial uses a subset of the data so users can learn analysis steps rapidly, and the second uses the entire dataset. Next, a tutorial demonstrates how to analyze the read count data to generate lists of differentially expressed genes using R/DESeq2. Additional tutorials generate read counts using the Snakemake workflow manager and Nextflow with Google Batch. All tutorials are open-source and can be used as templates for other analyses.
Affiliation(s)
- Steven Allers
  - Department of Molecular and Biomedical Sciences, University of Maine, 5735 Hitchner Hall, Orono, ME 04469, United States
- Kyle A O’Connell
  - Center for Information Technology, National Institutes of Health, 6555 Rock Spring Dr, Bethesda, MD 20817, United States
  - Health Data and AI, Deloitte Consulting LLP, 1919 N. Lynn St, Arlington, VA 22203, United States
- Thad Carlson
  - Center for Information Technology, National Institutes of Health, 6555 Rock Spring Dr, Bethesda, MD 20817, United States
  - Health Data and AI, Deloitte Consulting LLP, 1919 N. Lynn St, Arlington, VA 22203, United States
- David Belardo
  - Google Cloud, Google, 1900 Reston Metro Plaza, Reston, VA 20190, United States
- Benjamin L King
  - Department of Molecular and Biomedical Sciences, University of Maine, 5735 Hitchner Hall, Orono, ME 04469, United States
  - Maine Institutional Development Award Network of Biomedical Research Excellence (INBRE) Data Science Core, MDI Biological Laboratory, 159 Old Bar Harbor Rd, Bar Harbor, ME 04609, United States
  - Graduate School of Biomedical Science and Engineering, University of Maine, 5775 Stodder Hall, Orono, ME 04469, United States
2. Sachdeva S, Bhatia S, Al Harrasi A, Shah YA, Anwer K, Philip AK, Shah SFA, Khan A, Ahsan Halim S. Unraveling the role of cloud computing in health care system and biomedical sciences. Heliyon 2024; 10:e29044. [PMID: 38601602 PMCID: PMC11004887 DOI: 10.1016/j.heliyon.2024.e29044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2023] [Revised: 03/24/2024] [Accepted: 03/28/2024] [Indexed: 04/12/2024] Open
Abstract
Cloud computing has emerged as a transformative force in healthcare and biomedical sciences, offering scalable, on-demand resources for managing vast amounts of data. This review explores the integration of cloud computing within these fields, highlighting its pivotal role in enhancing data management, security, and accessibility. We examine the application of cloud computing in various healthcare domains, including electronic medical records, telemedicine, and personalized patient care, as well as its impact on bioinformatics research, particularly in genomics, proteomics, and metabolomics. The review also addresses the challenges and ethical considerations associated with cloud-based healthcare solutions, such as data privacy and cybersecurity. By providing a comprehensive overview, we aim to assist readers in understanding the significance of cloud computing in modern medical applications and its potential to revolutionize both patient care and biomedical research.
Affiliation(s)
- Saurabh Bhatia
  - Natural & Medical Sciences Research Center, University of Nizwa, P.O. Box 33, 616 Birkat Al Mauz, Nizwa, Oman
  - School of Health Science, University of Petroleum and Energy Studies, Prem Nagar, Dehradun, Uttarakhand, 248007, India
- Ahmed Al Harrasi
  - Natural & Medical Sciences Research Center, University of Nizwa, P.O. Box 33, 616 Birkat Al Mauz, Nizwa, Oman
- Yasir Abbas Shah
  - Natural & Medical Sciences Research Center, University of Nizwa, P.O. Box 33, 616 Birkat Al Mauz, Nizwa, Oman
- Khalid Anwer
  - Department of Pharmaceutics, College of Pharmacy, Prince Sattam Bin Abdulaziz University, Al-Kharj, 11942, Saudi Arabia
- Anil K. Philip
  - School of Pharmacy, University of Nizwa, Birkat Al Mouz, Nizwa, 616, Oman
- Syed Faisal Abbas Shah
  - Faculty of Computer Science & Information Technology, Virtual University of Pakistan, Lahore, 54000, Pakistan
- Ajmal Khan
  - Natural & Medical Sciences Research Center, University of Nizwa, P.O. Box 33, 616 Birkat Al Mauz, Nizwa, Oman
- Sobia Ahsan Halim
  - Natural & Medical Sciences Research Center, University of Nizwa, P.O. Box 33, 616 Birkat Al Mauz, Nizwa, Oman
3. Huuki-Myers LA, Montgomery KD, Kwon SH, Cinquemani S, Eagles NJ, Gonzalez-Padilla D, Maden SK, Kleinman JE, Hyde TM, Hicks SC, Maynard KR, Collado-Torres L. Benchmark of cellular deconvolution methods using a multi-assay reference dataset from postmortem human prefrontal cortex. bioRxiv 2024:2024.02.09.579665. [PMID: 38405805 PMCID: PMC10888823 DOI: 10.1101/2024.02.09.579665] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2024]
Abstract
Background Cellular deconvolution of bulk RNA-sequencing (RNA-seq) data using single cell or nuclei RNA-seq (sc/snRNA-seq) reference data is an important strategy for estimating cell type composition in heterogeneous tissues, such as human brain. Computational methods for deconvolution have been developed and benchmarked against simulated data, pseudobulked sc/snRNA-seq data, or immunohistochemistry reference data. A major limitation in developing improved deconvolution algorithms has been the lack of integrated datasets with orthogonal measurements of gene expression and estimates of cell type proportions on the same tissue sample. Deconvolution algorithm performance has not yet been evaluated across different RNA extraction methods (cytosolic, nuclear, or whole cell RNA), different library preparation types (mRNA enrichment vs. ribosomal RNA depletion), or with matched single cell reference datasets. Results A rich multi-assay dataset was generated in postmortem human dorsolateral prefrontal cortex (DLPFC) from 22 tissue blocks. Assays included spatially-resolved transcriptomics, snRNA-seq, bulk RNA-seq (across six library/extraction RNA-seq combinations), and RNAScope/Immunofluorescence (RNAScope/IF) for six broad cell types. The Mean Ratio method, implemented in the DeconvoBuddies R package, was developed for selecting cell type marker genes. Six computational deconvolution algorithms were evaluated in DLPFC and predicted cell type proportions were compared to orthogonal RNAScope/IF measurements. Conclusions Bisque and hspe were the most accurate methods and were robust to differences in RNA library types and extractions. This multi-assay dataset showed that cell size differences, marker genes differentially quantified across RNA libraries, and cell composition variability in reference snRNA-seq impact the accuracy of current deconvolution methods.
Affiliation(s)
- Louise A. Huuki-Myers
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Kelsey D. Montgomery
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Sang Ho Kwon
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
  - The Solomon H. Snyder Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Sophia Cinquemani
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Nicholas J. Eagles
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Sean K. Maden
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
- Joel E. Kleinman
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Thomas M. Hyde
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
  - Department of Neurology, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Stephanie C. Hicks
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, 21205, USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21205, USA
  - Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, MD, 21218, USA
- Kristen R. Maynard
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
  - The Solomon H. Snyder Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Leonardo Collado-Torres
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, 21205, USA
4. Razi A, Lo CC, Wang S, Leek JT, Hansen KD. Genotype prediction of 336,463 samples from public expression data. bioRxiv 2024:2023.10.21.562237. [PMID: 38559266 PMCID: PMC10979922 DOI: 10.1101/2023.10.21.562237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Tens of thousands of RNA-sequencing experiments comprising hundreds of thousands of individual samples have now been performed. These data represent a broad range of experimental conditions, sequencing technologies, and hypotheses under study. The Recount project has aggregated and uniformly processed hundreds of thousands of publicly available RNA-seq samples. Most of these samples only include RNA expression measurements; genotype data for these same samples would enable a wide range of analyses including variant prioritization, eQTL analysis, and studies of allele-specific expression. Here, we developed a statistical model based on the existing reference and alternative read counts from the RNA-seq experiments available through Recount3 to predict genotypes at autosomal biallelic loci in coding regions. We demonstrate the accuracy of our model using large-scale studies that measured both gene expression and genotype genome-wide. We show that our predictive model is highly accurate with 99.5% overall accuracy, 99.6% major allele accuracy, and 90.4% minor allele accuracy. Our model is robust to tissue and study effects, provided the coverage is high enough. We applied this model to genotype all the samples in Recount3 and provide the largest ready-to-use expression repository containing genotype information. We illustrate that the predicted genotype from RNA-seq data is sufficient to unravel the underlying population structure of samples in Recount3 using Principal Component Analysis.
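To make the idea of calling genotypes from reference and alternative read counts concrete, the following is a minimal sketch and not the authors' model: it assumes a plain binomial likelihood per genotype with a hypothetical sequencing error rate, and picks the genotype with the highest likelihood. The function names, the genotype labels, and the `error_rate` default are all assumptions made for illustration.

```python
from math import comb


def genotype_likelihoods(ref_count, alt_count, error_rate=0.01):
    """Binomial likelihood of the observed read counts under each genotype.

    Assumption: the expected fraction of alternative reads is error_rate for
    homozygous reference (0/0), 0.5 for heterozygous (0/1), and
    1 - error_rate for homozygous alternative (1/1).
    """
    total = ref_count + alt_count
    expected_alt_fraction = {
        "0/0": error_rate,
        "0/1": 0.5,
        "1/1": 1.0 - error_rate,
    }
    return {
        genotype: comb(total, alt_count) * p ** alt_count * (1.0 - p) ** ref_count
        for genotype, p in expected_alt_fraction.items()
    }


def call_genotype(ref_count, alt_count, error_rate=0.01):
    """Return the genotype label with the highest likelihood."""
    likelihoods = genotype_likelihoods(ref_count, alt_count, error_rate)
    return max(likelihoods, key=likelihoods.get)


if __name__ == "__main__":
    # Hypothetical read counts at three loci: (reference reads, alternative reads).
    for ref, alt in [(20, 0), (10, 10), (0, 20)]:
        print(ref, alt, call_genotype(ref, alt))
```

A sketch like this ignores effects the abstract alludes to, such as coverage requirements and tissue or study effects, which is one reason a trained statistical model would be preferred in practice.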
Affiliation(s)
- Afrooz Razi
  - Department of Genetic Medicine, Johns Hopkins University School of Medicine
- Christopher C. Lo
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
- Siruo Wang
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
- Jeffrey T. Leek
  - Biostatistics Program, Division of Public Health Sciences, Fred Hutchinson Cancer Center
- Kasper D. Hansen
  - Department of Genetic Medicine, Johns Hopkins University School of Medicine
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
  - Department of Biomedical Engineering, Johns Hopkins University School of Medicine
5. Schiebelhut LM, Guillaume AS, Kuhn A, Schweizer RM, Armstrong EE, Beaumont MA, Byrne M, Cosart T, Hand BK, Howard L, Mussmann SM, Narum SR, Rasteiro R, Rivera-Colón AG, Saarman N, Sethuraman A, Taylor HR, Thomas GWC, Wellenreuther M, Luikart G. Genomics and conservation: Guidance from training to analyses and applications. Mol Ecol Resour 2024; 24:e13893. [PMID: 37966259 DOI: 10.1111/1755-0998.13893] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2022] [Revised: 10/25/2023] [Accepted: 10/30/2023] [Indexed: 11/16/2023]
Abstract
Environmental change is intensifying the biodiversity crisis and threatening species across the tree of life. Conservation genomics can help inform conservation actions and slow biodiversity loss. However, more training, appropriate use of novel genomic methods and communication with managers are needed. Here, we review practical guidance to improve applied conservation genomics. We share insights aimed at ensuring effectiveness of conservation actions around three themes: (1) improving pedagogy and training in conservation genomics including for online global audiences, (2) conducting rigorous population genomic analyses properly considering theory, marker types and data interpretation and (3) facilitating communication and collaboration between managers and researchers. We aim to update students and professionals and expand their conservation toolkit with genomic principles and recent approaches for conserving and managing biodiversity. The biodiversity crisis is a global problem and, as such, requires international involvement, training, collaboration and frequent reviews of the literature and workshops as we do here.
Affiliation(s)
- Lauren M Schiebelhut
  - Life and Environmental Sciences, University of California, Merced, California, USA
- Annie S Guillaume
  - Geospatial Molecular Epidemiology group (GEOME), Laboratory for Biological Geochemistry (LGB), École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Arianna Kuhn
  - Department of Biological Sciences, University of Lethbridge, Lethbridge, Alberta, Canada
  - Virginia Museum of Natural History, Martinsville, Virginia, USA
- Rena M Schweizer
  - Division of Biological Sciences, University of Montana, Missoula, Montana, USA
- Mark A Beaumont
  - School of Biological Sciences, University of Bristol, Bristol, UK
- Margaret Byrne
  - Department of Biodiversity, Conservation and Attractions, Biodiversity and Conservation Science, Perth, Western Australia, Australia
- Ted Cosart
  - Flathead Lake Biology Station, University of Montana, Missoula, Montana, USA
- Brian K Hand
  - Flathead Lake Biological Station, University of Montana, Polson, Montana, USA
- Leif Howard
  - Flathead Lake Biology Station, University of Montana, Missoula, Montana, USA
- Steven M Mussmann
  - Southwestern Native Aquatic Resources and Recovery Center, U.S. Fish & Wildlife Service, Dexter, New Mexico, USA
- Shawn R Narum
  - Hagerman Genetics Lab, University of Idaho, Hagerman, Idaho, USA
- Rita Rasteiro
  - MRC Integrative Epidemiology Unit, University of Bristol, Bristol, UK
- Angel G Rivera-Colón
  - Department of Evolution, Ecology, and Behavior, University of Illinois at Urbana-Champaign, Champaign, Illinois, USA
- Norah Saarman
  - Department of Biology and Ecology Center, Utah State University, Logan, Utah, USA
- Arun Sethuraman
  - Department of Biology, San Diego State University, San Diego, California, USA
- Helen R Taylor
  - Royal Zoological Society of Scotland, Edinburgh, Scotland
- Gregg W C Thomas
  - Informatics Group, Harvard University, Cambridge, Massachusetts, USA
- Maren Wellenreuther
  - Plant and Food Research, Nelson, New Zealand
  - University of Auckland, Auckland, New Zealand
- Gordon Luikart
  - Division of Biological Sciences, University of Montana, Missoula, Montana, USA
  - Flathead Lake Biology Station, University of Montana, Missoula, Montana, USA
6. Choon YW, Choon YF, Nasarudin NA, Al Jasmi F, Remli MA, Alkayali MH, Mohamad MS. Artificial intelligence and database for NGS-based diagnosis in rare disease. Front Genet 2024; 14:1258083. [PMID: 38371307 PMCID: PMC10870236 DOI: 10.3389/fgene.2023.1258083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Accepted: 11/24/2023] [Indexed: 02/20/2024] Open
Abstract
Rare diseases (RDs) are complex genetic diseases that, by a conservative estimate, affect 300 million people worldwide. Recent Next-Generation Sequencing (NGS) studies are unraveling the underlying genetic heterogeneity of this group of diseases. NGS-based methods used in RD studies have improved the diagnosis and management of RDs. Concomitantly, a suite of bioinformatics tools has been developed to sort through big data generated by NGS to understand RDs better. However, there are concerns regarding the lack of consistency among different methods, primarily linked to factors such as the lack of uniformity in input and output formats, the absence of a standardized measure for predictive accuracy, and the regularity of updates to the annotation database. Today, artificial intelligence (AI), particularly deep learning, is widely used in a variety of biological contexts, changing the healthcare system. AI has demonstrated promising capabilities in boosting variant calling precision, refining variant prediction, and enhancing the user-friendliness of electronic health record (EHR) systems in NGS-based diagnostics. This paper reviews the state of the art of AI in NGS-based genetics, and its future directions and challenges. It also compares several rare disease databases.
Affiliation(s)
- Yee Wen Choon
  - Institute for Artificial Intelligence and Big Data, Universiti Malaysia Kelantan, Kota Bharu, Kelantan, Malaysia
  - Faculty of Data Science and Informatics, Universiti Malaysia Kelantan, Kota Bharu, Kelantan, Malaysia
- Yee Fan Choon
  - Faculty of Dentistry, Lincoln University College, Petaling Jaya, Selangor, Malaysia
- Nurul Athirah Nasarudin
  - Health Data Science Lab, Department of Genetics and Genomics, College of Medicine and Health Sciences, United Arab Emirates University, Al Ain, United Arab Emirates
- Fatma Al Jasmi
  - Health Data Science Lab, Department of Genetics and Genomics, College of Medicine and Health Sciences, United Arab Emirates University, Al Ain, United Arab Emirates
- Muhamad Akmal Remli
  - Institute for Artificial Intelligence and Big Data, Universiti Malaysia Kelantan, Kota Bharu, Kelantan, Malaysia
  - Faculty of Data Science and Informatics, Universiti Malaysia Kelantan, Kota Bharu, Kelantan, Malaysia
- Mohd Saberi Mohamad
  - Health Data Science Lab, Department of Genetics and Genomics, College of Medicine and Health Sciences, United Arab Emirates University, Al Ain, United Arab Emirates
7. Kumar P, Paul RK, Roy HS, Yeasin M, Ajit, Paul AK. Big Data Analysis in Computational Biology and Bioinformatics. Methods Mol Biol 2024; 2719:181-197. [PMID: 37803119 DOI: 10.1007/978-1-0716-3461-5_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/08/2023]
Abstract
Advancements in high-throughput technologies, genomics, transcriptomics, and metabolomics play an important role in obtaining biological information about living organisms. The field of computational biology and bioinformatics has experienced significant growth with the advent of high-throughput sequencing technologies and other high-throughput techniques. The resulting large amounts of data present both opportunities and challenges for data analysis. Big data analysis has become essential for extracting meaningful insights from the massive amount of data. In this chapter, we provide an overview of the current status of big data analysis in computational biology and bioinformatics. We discuss the various aspects of big data analysis, including data acquisition, storage, processing, and analysis. We also highlight some of the challenges and opportunities of big data analysis in this area of research. Despite the challenges, big data analysis presents significant opportunities like development of efficient and fast computing algorithms for advancing our understanding of biological processes, identifying novel biomarkers for breeding research and developments, predicting disease, and identifying potential drug targets for drug development programs.
Affiliation(s)
- Prakash Kumar
  - ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, India
- Ranjit Kumar Paul
  - ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, India
- Himadri Shekhar Roy
  - ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, India
- Md Yeasin
  - ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, India
- Ajit
  - ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, India
- Amrit Kumar Paul
  - ICAR-Indian Agricultural Statistics Research Institute, Pusa, New Delhi, India
8. Maden SK, Kwon SH, Huuki-Myers LA, Collado-Torres L, Hicks SC, Maynard KR. Challenges and opportunities to computationally deconvolve heterogeneous tissue with varying cell sizes using single-cell RNA-sequencing datasets. Genome Biol 2023; 24:288. [PMID: 38098055 PMCID: PMC10722720 DOI: 10.1186/s13059-023-03123-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2023] [Accepted: 11/24/2023] [Indexed: 12/17/2023] Open
Abstract
Deconvolution of cell mixtures in "bulk" transcriptomic samples from homogenate human tissue is important for understanding disease pathologies. However, several experimental and computational challenges impede transcriptomics-based deconvolution approaches using single-cell/nucleus RNA-seq reference atlases. Cells from the brain and blood have substantially different sizes, total mRNA, and transcriptional activities, and existing approaches may quantify total mRNA instead of cell type proportions. Further, standards are lacking for the use of cell reference atlases and integrative analyses of single-cell and spatial transcriptomics data. We discuss how to approach these key challenges with orthogonal "gold standard" datasets for evaluating deconvolution methods.
Affiliation(s)
- Sean K Maden
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Sang Ho Kwon
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, USA
  - The Solomon H. Snyder Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, USA
- Louise A Huuki-Myers
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, USA
- Leonardo Collado-Torres
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, USA
- Stephanie C Hicks
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
  - Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, MD, USA
- Kristen R Maynard
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, USA
  - The Solomon H. Snyder Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, USA
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD, USA
9. Post AR, Ho N, Rasmussen E, Post I, Cho A, Hofer J, Maness AT, Parnell T, Nix DA. Hypermedia-based software architecture enables Test-Driven Development. JAMIA Open 2023; 6:ooad089. [PMID: 37860604 PMCID: PMC10582517 DOI: 10.1093/jamiaopen/ooad089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Revised: 08/12/2023] [Accepted: 10/04/2023] [Indexed: 10/21/2023] Open
Abstract
Objectives Using agile software development practices, develop and evaluate an architecture and implementation for reliable and user-friendly self-service management of bioinformatic data stored in the cloud. Materials and methods Comprehensive Oncology Research Environment (CORE) Browser is a new open-source web application for cancer researchers to manage sequencing data organized in a flexible format in Amazon Simple Storage Service (S3) buckets. It has a microservices- and hypermedia-based architecture, which we integrated with Test-Driven Development (TDD), the iterative writing of computable specifications for how software should work prior to development. Relying on repeating patterns found in hypermedia-based architectures, we hypothesized that hypermedia would permit developing test "templates" that can be parameterized and executed for each microservice, maximizing code coverage while minimizing effort. Results After one-and-a-half years of development, the CORE Browser backend had 121 test templates and 875 custom tests that were parameterized and executed 3031 times, providing 78% code coverage. Discussion Architecting to permit test reuse through a hypermedia approach was a key success factor for our testing efforts. CORE Browser's application of hypermedia and TDD illustrates one way to integrate software engineering methods into data-intensive networked applications. Separating bioinformatic data management from analysis distinguishes this platform from others in bioinformatics and may provide stable data management while permitting analysis methods to advance more rapidly. Conclusion Software engineering practices are underutilized in informatics. Similar informatics projects will more likely succeed through application of good architecture and automated testing. Our approach is broadly applicable to data management tools involving cloud data storage.
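The notion of a test "template" that is parameterized and executed once per microservice can be sketched as follows. This is an illustrative sketch only and not code from CORE Browser: the service names, the paths, the `fake_get` stand-in for an HTTP request, and the shape of the hypermedia document are all hypothetical. It relies on the assumption that every collection resource exposes a `self` link, which is the kind of repeating pattern the abstract describes.

```python
import unittest

# Hypothetical microservices and the collection path each one exposes.
SERVICES = {
    "projects": "/projects",
    "samples": "/samples",
    "files": "/files",
}


def fake_get(path):
    """Stand-in for an HTTP GET that returns a hypermedia-style document."""
    return {"href": path, "links": [{"rel": "self", "href": path}], "items": []}


def make_collection_test(path):
    """Test template: build a TestCase class bound to one service's path."""

    class CollectionTest(unittest.TestCase):
        def test_self_link_matches_path(self):
            document = fake_get(path)
            self_links = [
                link["href"] for link in document["links"] if link["rel"] == "self"
            ]
            self.assertEqual(self_links, [path])

    return CollectionTest


def build_suite():
    """Instantiate the template once per service and collect the tests."""
    suite = unittest.TestSuite()
    loader = unittest.TestLoader()
    for name, path in SERVICES.items():
        case = make_collection_test(path)
        case.__name__ = f"{name.title()}CollectionTest"
        suite.addTests(loader.loadTestsFromTestCase(case))
    return suite


if __name__ == "__main__":
    unittest.TextTestRunner(verbosity=2).run(build_suite())
```

Adding a service to the `SERVICES` mapping adds one more execution of the same template without writing a new test, which is the reuse the abstract attributes to the hypermedia approach.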
Affiliation(s)
- Andrew R Post
  - Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, United States
  - Department of Biomedical Informatics, University of Utah, Salt Lake City, UT 84112, United States
- Nancy Ho
  - Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, United States
- Erik Rasmussen
  - Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, United States
- Ivan Post
  - Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, United States
- Aika Cho
  - Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, United States
- John Hofer
  - Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, United States
- Arthur T Maness
  - Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, United States
- Timothy Parnell
  - Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, United States
- David A Nix
  - Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, United States
10. Plaza DF, Zerebinski J, Broumou I, Lautenbach MJ, Ngasala B, Sundling C, Färnert A. A genomic platform for surveillance and antigen discovery in Plasmodium spp. using long-read amplicon sequencing. Cell Reports Methods 2023; 3:100574. [PMID: 37751696 PMCID: PMC10545912 DOI: 10.1016/j.crmeth.2023.100574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2022] [Revised: 06/18/2023] [Accepted: 08/07/2023] [Indexed: 09/28/2023]
Abstract
Many vaccine candidate proteins in the malaria parasite Plasmodium falciparum are under strong immunological pressure and confer antigenic diversity. We present a sequencing and data analysis platform for the genomic surveillance of the insertion or deletion (indel)-rich antigens merozoite surface protein 1 (MSP1), MSP2, glutamate-rich protein (GLURP), and CSP from P. falciparum using long-read circular consensus sequencing (CCS) in multiclonal malaria isolates. Our platform uses 40 PCR primers per gene to asymmetrically barcode and identify multiclonal infections in pools of up to 384 samples. With msp2, we validated the method using 235 mock infections combining 10 synthetic variants at different concentrations and infection complexities. We applied this strategy to P. falciparum isolates from a longitudinal cohort in Tanzania. Finally, we constructed an analysis pipeline that streamlines the processing and interpretation of epidemiological and antigenic diversity data from demultiplexed FASTQ files. This platform can be easily adapted to other polymorphic antigens of interest in Plasmodium or any other human pathogen.
Collapse
Affiliation(s)
- David Fernando Plaza
- Division of Infectious Diseases, Department of Medicine Solna and Center for Molecular Medicine, Karolinska Institutet, 17177 Stockholm, Sweden; Department of Infectious Diseases, Karolinska University Hospital, 17176 Stockholm, Sweden.
| | - Julia Zerebinski
- Division of Infectious Diseases, Department of Medicine Solna and Center for Molecular Medicine, Karolinska Institutet, 17177 Stockholm, Sweden; Department of Infectious Diseases, Karolinska University Hospital, 17176 Stockholm, Sweden
| | - Ioanna Broumou
- Division of Infectious Diseases, Department of Medicine Solna and Center for Molecular Medicine, Karolinska Institutet, 17177 Stockholm, Sweden; Department of Infectious Diseases, Karolinska University Hospital, 17176 Stockholm, Sweden
| | - Maximilian Julius Lautenbach
- Division of Infectious Diseases, Department of Medicine Solna and Center for Molecular Medicine, Karolinska Institutet, 17177 Stockholm, Sweden; Department of Infectious Diseases, Karolinska University Hospital, 17176 Stockholm, Sweden
| | - Billy Ngasala
- Muhimbili University of Health and Allied Sciences, Dar es Salaam 57RF+V8, Tanzania
| | - Christopher Sundling
- Division of Infectious Diseases, Department of Medicine Solna and Center for Molecular Medicine, Karolinska Institutet, 17177 Stockholm, Sweden; Department of Infectious Diseases, Karolinska University Hospital, 17176 Stockholm, Sweden
| | - Anna Färnert
- Division of Infectious Diseases, Department of Medicine Solna and Center for Molecular Medicine, Karolinska Institutet, 17177 Stockholm, Sweden; Department of Infectious Diseases, Karolinska University Hospital, 17176 Stockholm, Sweden
| |
|
11
|
Lim HGM, Fann YC, Lee YCG. COWID: an efficient cloud-based genomics workflow for scalable identification of SARS-COV-2. Brief Bioinform 2023; 24:bbad280. [PMID: 37738400 PMCID: PMC10516370 DOI: 10.1093/bib/bbad280] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2022] [Revised: 07/15/2023] [Accepted: 07/19/2023] [Indexed: 09/24/2023] Open
Abstract
Implementing a specific cloud resource to analyze extensive genomic data on severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) poses a challenge when resources are limited. To overcome this, we repurposed a cloud platform initially designed for use in research on cancer genomics (https://cgc.sbgenomics.com) to enable its use in research on SARS-CoV-2 to build Cloud Workflow for Viral and Variant Identification (COWID). COWID is a workflow based on the Common Workflow Language that realizes the full potential of sequencing technology for use in reliable SARS-CoV-2 identification and leverages cloud computing to achieve efficient parallelization. COWID outperformed other contemporary methods for identification by offering scalable identification and reliable variant findings with no false-positive results. COWID typically processed each sample of raw sequencing data within 5 min at a cost of only US$0.01. The COWID source code is publicly available (https://github.com/hendrick0403/COWID) and can be accessed on any computer with Internet access. COWID is designed to be user-friendly; it can be implemented without prior programming knowledge. Therefore, COWID is a time-efficient tool that can be used during a pandemic.
Affiliation(s)
- Hendrick Gao-Min Lim
- Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan 11031
- Department of Medical Research, Tzu Chi Hospital Indonesia, Pantai Indah Kapuk, Greater Jakarta, Indonesia 14470
| | - Yang C Fann
- IT and Bioinformatics Program, Division of Intramural, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, Maryland, USA 20892
| | - Yuan-Chii Gladys Lee
- Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan 11031
| |
|
12
|
Deflaux N, Selvaraj MS, Condon HR, Mayo K, Haidermota S, Basford MA, Lunt C, Philippakis AA, Roden DM, Denny JC, Musick A, Collins R, Allen N, Effingham M, Glazer D, Natarajan P, Bick AG. Demonstrating paths for unlocking the value of cloud genomics through cross cohort analysis. Nat Commun 2023; 14:5419. [PMID: 37669985 PMCID: PMC10480504 DOI: 10.1038/s41467-023-41185-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 12/05/2022] [Accepted: 08/24/2023] [Indexed: 09/07/2023] Open
Abstract
Recently, large scale genomic projects such as All of Us and the UK Biobank have introduced a new research paradigm where data are stored centrally in cloud-based Trusted Research Environments (TREs). To characterize the advantages and drawbacks of different TRE attributes in facilitating cross-cohort analysis, we conduct a Genome-Wide Association Study of standard lipid measures using two approaches: meta-analysis and pooled analysis. Comparison of full summary data from both approaches with an external study shows strong correlation of known loci with lipid levels (R² ~ 83-97%). Importantly, 90 variants meet the significance threshold only in the meta-analysis and 64 variants are significant only in pooled analysis, with approximately 20% of variants in each of those groups being most prevalent in non-European, non-Asian ancestry individuals. These findings have important implications, as technical and policy choices lead to cross-cohort analyses generating similar, but not identical results, particularly for non-European ancestral populations.
Affiliation(s)
| | - Margaret Sunitha Selvaraj
- Program in Medical and Population Genetics and the Cardiovascular Disease Initiative, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Department of Medicine, Harvard Medical School, Boston, MA, USA
- Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Henry Robert Condon
- Division of Genetic Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Kelsey Mayo
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Sara Haidermota
- Program in Medical and Population Genetics and the Cardiovascular Disease Initiative, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Division of Cardiology, Massachusetts General Hospital, Boston, MA, USA
| | - Melissa A Basford
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Chris Lunt
- All of Us Research Program, National Institutes of Health, Bethesda, MD, USA
| | | | - Dan M Roden
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- Department of Pharmacology, Vanderbilt University Medical Center, Nashville, TN, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Joshua C Denny
- All of Us Research Program, National Institutes of Health, Bethesda, MD, USA
| | - Anjene Musick
- All of Us Research Program, National Institutes of Health, Bethesda, MD, USA
| | - Rory Collins
- Nuffield Department of Population Health, University of Oxford, Oxford, Oxfordshire, UK
- UK Biobank, Cheadle, Stockport, UK
| | - Naomi Allen
- Nuffield Department of Population Health, University of Oxford, Oxford, Oxfordshire, UK
- UK Biobank, Cheadle, Stockport, UK
| | | | | | - Pradeep Natarajan
- Program in Medical and Population Genetics and the Cardiovascular Disease Initiative, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Department of Medicine, Harvard Medical School, Boston, MA, USA
- Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Division of Cardiology, Massachusetts General Hospital, Boston, MA, USA
| | - Alexander G Bick
- Division of Genetic Medicine, Vanderbilt University Medical Center, Nashville, TN, USA.
| |
|
13
|
Allen C, Meinl R, Paez JS, Searle BC, Just S, Pino LK, Fondrie WE. nf-encyclopedia: A Cloud-Ready Pipeline for Chromatogram Library Data-Independent Acquisition Proteomics Workflows. J Proteome Res 2023; 22:2743-2749. [PMID: 37417926 DOI: 10.1021/acs.jproteome.2c00613] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/08/2023]
Abstract
Data-independent acquisition (DIA) mass spectrometry methods provide systematic and comprehensive quantification of the proteome; yet, relatively few open-source tools are available to analyze DIA proteomics experiments. Fewer still are tools that can leverage gas phase fractionated (GPF) chromatogram libraries to enhance the detection and quantification of peptides in these experiments. Here, we present nf-encyclopedia, an open-source Nextflow pipeline that connects three open-source tools, MSConvert, EncyclopeDIA, and MSstats, to analyze DIA proteomics experiments with or without chromatogram libraries. We demonstrate that nf-encyclopedia is reproducible when run on either a cloud platform or a local workstation and provides robust peptide and protein quantification. Additionally, we found that MSstats enhances protein-level quantitative performance over EncyclopeDIA alone. Finally, we benchmarked the ability of nf-encyclopedia to scale to large experiments in the cloud by leveraging the parallelization of compute resources. The nf-encyclopedia pipeline is available under a permissive Apache 2.0 license; run it on your desktop, cluster, or in the cloud: https://github.com/TalusBio/nf-encyclopedia.
Affiliation(s)
- Carolyn Allen
- Talus Bioscience, Seattle, Washington 98122, United States
| | - Rico Meinl
- Talus Bioscience, Seattle, Washington 98122, United States
| | | | - Brian C Searle
- Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio 43210, United States
- Pelotonia Institute for Immuno-Oncology, The Ohio State University Comprehensive Cancer Center, Columbus, Ohio 43210, United States
- Proteome Software, Inc., Portland, Oregon 97219, United States
| | - Seth Just
- Proteome Software, Inc., Portland, Oregon 97219, United States
| | - Lindsay K Pino
- Talus Bioscience, Seattle, Washington 98122, United States
| | | |
|
14
|
Nguyen T, Bian X, Roberson D, Khanna R, Chen Q, Yan C, Beck R, Worman Z, Meerzaman D. Multi-omics Pathways Workflow (MOPAW): An Automated Multi-omics Workflow on the Cancer Genomics Cloud. Cancer Inform 2023; 22:11769351231180992. [PMID: 37342652 PMCID: PMC10278438 DOI: 10.1177/11769351231180992] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2023] [Accepted: 05/22/2023] [Indexed: 06/23/2023] Open
Abstract
Introduction In the era of big data, gene-set pathway analyses derived from multi-omics are exceptionally powerful. When preparing and analyzing high-dimensional multi-omics data, the installation process and programming skills required to use existing tools can be challenging. This is especially the case for those who are not familiar with coding. In addition, implementation with high performance computing solutions is required to run these tools efficiently. Methods We introduce an automatic multi-omics pathway workflow, a point and click graphical user interface to Multivariate Single Sample Gene Set Analysis (MOGSA), hosted on the Cancer Genomics Cloud by Seven Bridges Genomics. This workflow leverages the combination of different tools to perform data preparation for each given data type, dimensionality reduction, and MOGSA pathway analysis. The omics data include copy number alteration, transcriptomics, proteomics, and phosphoproteomics data. We have also provided an additional workflow to help with downloading data from The Cancer Genome Atlas and Clinical Proteomic Tumor Analysis Consortium and preprocessing these data to be used for this multi-omics pathway workflow. Results The main outputs of this workflow are the distinct pathways for subgroups of interest provided by users, which are displayed in heatmaps if identified. In addition, graphs and tables are provided to users for review. Conclusion Multi-omics Pathway Workflow requires no coding experience. Users can bring their own data or download and preprocess public datasets from The Cancer Genome Atlas and Clinical Proteomic Tumor Analysis Consortium using our additional workflow based on the samples of interest. Distinct overactivated or deactivated pathways for groups of interest can be found. This useful information is important in effective therapeutic targeting.
Affiliation(s)
- Trinh Nguyen
- The Computational Genomics and Bioinformatics Branch, Center for Biomedical Informatics and Information Technology, National Cancer Institute, Rockville, MD, USA
| | - Xiaopeng Bian
- The Computational Genomics and Bioinformatics Branch, Center for Biomedical Informatics and Information Technology, National Cancer Institute, Rockville, MD, USA
| | | | - Rakesh Khanna
- The Computational Genomics and Bioinformatics Branch, Center for Biomedical Informatics and Information Technology, National Cancer Institute, Rockville, MD, USA
| | - Qingrong Chen
- The Computational Genomics and Bioinformatics Branch, Center for Biomedical Informatics and Information Technology, National Cancer Institute, Rockville, MD, USA
| | - Chunhua Yan
- The Computational Genomics and Bioinformatics Branch, Center for Biomedical Informatics and Information Technology, National Cancer Institute, Rockville, MD, USA
| | | | | | - Daoud Meerzaman
- The Computational Genomics and Bioinformatics Branch, Center for Biomedical Informatics and Information Technology, National Cancer Institute, Rockville, MD, USA
| |
|
15
|
O'Connell KA, Yosufzai ZB, Campbell RA, Lobb CJ, Engelken HT, Gorrell LM, Carlson TB, Catana JJ, Mikdadi D, Bonazzi VR, Klenk JA. Accelerating genomic workflows using NVIDIA Parabricks. BMC Bioinformatics 2023; 24:221. [PMID: 37259021 PMCID: PMC10230726 DOI: 10.1186/s12859-023-05292-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/20/2022] [Accepted: 04/15/2023] [Indexed: 06/02/2023] Open
Abstract
BACKGROUND As genome sequencing becomes better integrated into scientific research, government policy, and personalized medicine, the primary challenge for researchers is shifting from generating raw data to analyzing these vast datasets. Although much work has been done to reduce compute times using various configurations of traditional CPU computing infrastructures, Graphics Processing Units (GPUs) offer opportunities to accelerate genomic workflows by orders of magnitude. Here we benchmark one GPU-accelerated software suite called NVIDIA Parabricks on Amazon Web Services (AWS), Google Cloud Platform (GCP), and an NVIDIA DGX cluster. We benchmarked six variant calling pipelines, including two germline callers (HaplotypeCaller and DeepVariant) and four somatic callers (Mutect2, Muse, LoFreq, SomaticSniper). RESULTS We achieved up to 65× acceleration with germline variant callers, bringing HaplotypeCaller runtimes down from 36 h to 33 min on AWS, 35 min on GCP, and 24 min on the NVIDIA DGX. Somatic callers exhibited more variation across the number of GPUs and computing platforms. On cloud platforms, GPU-accelerated germline callers resulted in cost savings compared with CPU runs, whereas some somatic callers were more expensive than CPU runs because their GPU acceleration was not sufficient to overcome the increased GPU cost. CONCLUSIONS Germline variant callers scaled well with the number of GPUs across platforms, whereas somatic variant callers exhibited more variation in the number of GPUs with the fastest runtimes, suggesting that, at least with the version of Parabricks used here, these workflows are less GPU optimized and require benchmarking on the platform of choice before being deployed at production scales. Our study demonstrates that GPUs can be used to greatly accelerate genomic workflows, bringing urgent societal advances in biosurveillance and personalized medicine within closer reach.
Affiliation(s)
- Kyle A O'Connell
- Health Data and AI, Deloitte Consulting LLP, VA, 22009, Arlington, USA
| | | | - Ross A Campbell
- Health Data and AI, Deloitte Consulting LLP, VA, 22009, Arlington, USA
| | - Collin J Lobb
- Health Data and AI, Deloitte Consulting LLP, VA, 22009, Arlington, USA
| | - Haley T Engelken
- Health Data and AI, Deloitte Consulting LLP, VA, 22009, Arlington, USA
| | - Laura M Gorrell
- Health Data and AI, Deloitte Consulting LLP, VA, 22009, Arlington, USA
| | - Thad B Carlson
- Cloud Managed Services, Deloitte Consulting LLP, Detroit, MI, 48226, USA
| | - Josh J Catana
- Health Data and AI, Deloitte Consulting LLP, VA, 22009, Arlington, USA
| | - Dina Mikdadi
- Health Data and AI, Deloitte Consulting LLP, VA, 22009, Arlington, USA
| | - Vivien R Bonazzi
- Health Data and AI, Deloitte Consulting LLP, VA, 22009, Arlington, USA.
| | - Juergen A Klenk
- Health Data and AI, Deloitte Consulting LLP, VA, 22009, Arlington, USA.
| |
|
16
|
Berger B, Yu YW. Navigating bottlenecks and trade-offs in genomic data analysis. Nat Rev Genet 2023; 24:235-250. [PMID: 36476810 DOI: 10.1038/s41576-022-00551-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 10/27/2022] [Indexed: 12/12/2022]
Abstract
Genome sequencing and analysis allow researchers to decode the functional information hidden in DNA sequences as well as to study cell to cell variation within a cell population. Traditionally, the primary bottleneck in genomic analysis pipelines has been the sequencing itself, which has been much more expensive than the computational analyses that follow. However, an important consequence of the continued drive to expand the throughput of sequencing platforms at lower cost is that often the analytical pipelines are struggling to keep up with the sheer amount of raw data produced. Computational cost and efficiency have thus become of ever increasing importance. Recent methodological advances, such as data sketching, accelerators and domain-specific libraries/languages, promise to address these modern computational challenges. However, despite being more efficient, these innovations come with a new set of trade-offs, both expected, such as accuracy versus memory and expense versus time, and more subtle, including the human expertise needed to use non-standard programming interfaces and set up complex infrastructure. In this Review, we discuss how to navigate these new methodological advances and their trade-offs.
Affiliation(s)
- Bonnie Berger
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.
| | - Yun William Yu
- Department of Computer and Mathematical Sciences, University of Toronto Scarborough, Toronto, Ontario, Canada
- Tri-Campus Department of Mathematics, University of Toronto, Toronto, Ontario, Canada
| |
|
17
|
Li T, Li Y, Shangguan H, Bian J, Luo R, Tian Y, Li Z, Nie X, Cui L. BarleyExpDB: an integrative gene expression database for barley. BMC Plant Biology 2023; 23:170. [PMID: 37003963 PMCID: PMC10064564 DOI: 10.1186/s12870-023-04193-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2022] [Accepted: 03/27/2023] [Indexed: 06/19/2023]
Abstract
BACKGROUND RNA-sequencing (RNA-seq) has been widely used to study the dynamic expression patterns of transcribed genes, which can lead to new biological insights. However, processing and analyzing these huge amounts of histological data remains a great challenge for wet labs and field researchers who lack bioinformatics experience and computational resources. RESULTS We present BarleyExpDB, an easy-to-operate, free, and web-accessible database that integrates transcriptional profiles of barley at different growth and developmental stages, tissues, and stress conditions, as well as differential expression of mutants and populations to build a platform for barley expression and visualization. The expression of a gene of interest can be easily queried by searching by known gene ID or sequence similarity. Expression data can be displayed as a heat map, along with functional descriptions as well as Gene Ontology, Kyoto Encyclopedia of Genes and Genomes, Proteins Families Database, and Simple Modular Architecture Research Tool annotations. CONCLUSIONS BarleyExpDB will serve as a valuable resource for the barley research community to leverage the vast publicly available RNA-seq datasets for functional genomics research and crop molecular breeding.
Affiliation(s)
- Tingting Li
- College of Bioscience and Engineering, Jiangxi Agricultural University, Nanchang, 330045 Jiangxi China
- State Key Laboratory of Crop Stress Biology in Arid Areas and College of Agronomy, Northwest A&F University, Yangling, 712100 Shaanxi China
| | - Yihan Li
- College of Bioscience and Engineering, Jiangxi Agricultural University, Nanchang, 330045 Jiangxi China
| | - Hongbin Shangguan
- College of Bioscience and Engineering, Jiangxi Agricultural University, Nanchang, 330045 Jiangxi China
| | - Jianxin Bian
- Peking University Institute of Advanced Agricultural Sciences, Weifang, 261325 Shandong China
| | - Ruihan Luo
- College of Bioscience and Engineering, Jiangxi Agricultural University, Nanchang, 330045 Jiangxi China
| | - Yuan Tian
- Xintai Urban and Rural Development Group Co., Ltd, Taian, 271200 Shandong China
| | - Zhimin Li
- College of Bioscience and Engineering, Jiangxi Agricultural University, Nanchang, 330045 Jiangxi China
| | - Xiaojun Nie
- State Key Laboratory of Crop Stress Biology in Arid Areas and College of Agronomy, Northwest A&F University, Yangling, 712100 Shaanxi China
| | - Licao Cui
- College of Bioscience and Engineering, Jiangxi Agricultural University, Nanchang, 330045 Jiangxi China
| |
|
18
|
Camacho C, Boratyn GM, Joukov V, Vera Alvarez R, Madden TL. ElasticBLAST: accelerating sequence search via cloud computing. BMC Bioinformatics 2023; 24:117. [PMID: 36967390 PMCID: PMC10040096 DOI: 10.1186/s12859-023-05245-9] [Citation(s) in RCA: 20] [Impact Index Per Article: 20.0] [Received: 01/04/2023] [Accepted: 03/21/2023] [Indexed: 03/28/2023] Open
Abstract
BACKGROUND Biomedical researchers use alignments produced by BLAST (Basic Local Alignment Search Tool) to categorize their query sequences. Producing such alignments is an essential bioinformatics task that is well suited for the cloud. The cloud can perform many calculations quickly as well as store and access large volumes of data. Bioinformaticians can also use it to collaborate with other researchers, sharing their results, datasets and even their pipelines on a common platform. RESULTS We present ElasticBLAST, a cloud native application to perform BLAST alignments in the cloud. ElasticBLAST can handle anywhere from a few to many thousands of queries and run the searches on thousands of virtual CPUs (if desired), deleting resources when it is done. It uses cloud native tools for orchestration and can request discounted instances, lowering cloud costs for users. It is supported on Amazon Web Services and Google Cloud Platform. It can search BLAST databases that are user provided or from the National Center for Biotechnology Information. CONCLUSION We show that ElasticBLAST is a useful application that can efficiently perform BLAST searches for the user in the cloud, demonstrating that with two examples. At the same time, it hides much of the complexity of working in the cloud, lowering the threshold to move work to the cloud.
Affiliation(s)
- Christiam Camacho
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894 USA
| | - Grzegorz M. Boratyn
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894 USA
| | - Victor Joukov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894 USA
| | - Roberto Vera Alvarez
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894 USA
| | - Thomas L. Madden
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894 USA
| |
|
19
|
Alvarellos M, Sheppard HE, Knarston I, Davison C, Raine N, Seeger T, Prieto Barja P, Chatzou Dunford M. Democratizing clinical-genomic data: How federated platforms can promote benefits sharing in genomics. Front Genet 2023; 13:1045450. [PMID: 36704354 PMCID: PMC9871385 DOI: 10.3389/fgene.2022.1045450] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2022] [Accepted: 12/19/2022] [Indexed: 01/12/2023] Open
Abstract
Since the first sequencing of the human genome, associated sequencing costs have dramatically lowered, leading to an explosion of genomic data. This valuable data should in theory be of huge benefit to the global community, although unfortunately the benefits of these advances have not been widely distributed. Much of today's clinical-genomic data is siloed and inaccessible in adherence with strict governance and privacy policies, with more than 97% of hospital data going unused, according to one reference. Despite these challenges, there are promising efforts to make clinical-genomic data accessible and useful without compromising security. Specifically, federated data platforms are emerging as key resources to facilitate secure data sharing without having to physically move the data outside of its organizational or jurisdictional boundaries. In this perspective, we summarize the overarching progress in establishing federated data platforms, and highlight critical considerations on how they should be managed to ensure patient and public trust. These platforms are enabling global collaboration and improving representation of underrepresented groups, since sequencing efforts have not prioritized diverse population representation until recently. Federated data platforms, when combined with advances in no-code technology, can be accessible to the diverse end-users that make up the genomics workforce, and we discuss potential strategies to develop sustainable business models so that the platforms can continue to enable research long term. Although these platforms must be carefully managed to ensure appropriate and ethical use, they are democratizing access to, and insights from, clinical-genomic data that will progress research and enable impactful therapeutic findings.
|
20
|
Krumm N. Organizational and Technical Security Considerations for Laboratory Cloud Computing. J Appl Lab Med 2023; 8:180-193. [PMID: 36610429 DOI: 10.1093/jalm/jfac118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2022] [Accepted: 10/25/2022] [Indexed: 01/09/2023]
Abstract
BACKGROUND Clinical and anatomical pathology services are increasingly utilizing cloud information technology (IT) solutions to meet growing requirements for storage, computation, and other IT services. Cloud IT solutions are often considered on the promise of low cost of entry, durability and reliability, scalability, and features that are typically out of reach for small- or mid-sized IT organizations. However, use of cloud-based IT infrastructure also brings additional security and privacy risks to organizations, as unfamiliarity, public networks, and complex feature sets contribute to an increased surface area for attacks. CONTENT In this best-practices guide, we aim to help both managers and IT professionals in healthcare environments understand the requirements and risks when using cloud-based IT infrastructure within the laboratory environment. We describe how technical, operational, and organizational best practices can help mitigate security, privacy, and other risks associated with the use of cloud infrastructure; furthermore, we identify how these best practices fit into healthcare regulatory frameworks. Among organizational best practices, we identify the need for specific hiring requirements, relationships with parent IT groups, mechanisms for reviewing and auditing security practices, and sound practices for onboarding and offboarding employees. Then, we highlight selected specific operational security, account security, and auditing/logging best practices. Finally, we describe how individual cloud technologies have specific resource-level security features. SUMMARY We emphasize that laboratory directors, managers, and IT professionals must ensure that the fundamental organizational and process-based requirements are addressed first, to establish the groundwork for technical security solutions and successful implementation of cloud infrastructure.
Affiliation(s)
- Niklas Krumm
- Division of Informatics, Department of Laboratory Medicine and Pathology, University of Washington, Seattle, WA
| |
|
21
|
Camacho C, Boratyn GM, Joukov V, Alvarez RV, Madden TL. ElasticBLAST: Accelerating Sequence Search via Cloud Computing. bioRxiv: The Preprint Server for Biology 2023:2023.01.04.522777. [PMID: 36789435 PMCID: PMC9928022 DOI: 10.1101/2023.01.04.522777] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/06/2023]
Abstract
Background Biomedical researchers use alignments produced by BLAST (Basic Local Alignment Search Tool) to categorize their query sequences. Producing such alignments is an essential bioinformatics task that is well suited for the cloud. The cloud can perform many calculations quickly as well as store and access large volumes of data. Bioinformaticians can also use it to collaborate with other researchers, sharing their results, datasets and even their pipelines on a common platform. Results We present ElasticBLAST, a cloud native application to perform BLAST alignments in the cloud. ElasticBLAST can handle anywhere from a few to many thousands of queries and run the searches on thousands of virtual CPUs (if desired), deleting resources when it is done. It uses cloud native tools for orchestration and can request discounted instances, lowering cloud costs for users. It is supported on Amazon Web Services and Google Cloud Platform. It can search BLAST databases that are user provided or from the National Center for Biotechnology Information. Conclusion We show that ElasticBLAST is a useful application that can efficiently perform BLAST searches for the user in the cloud, demonstrating that with two examples. At the same time, it hides much of the complexity of working in the cloud, lowering the threshold to move work to the cloud.
Affiliation(s)
- Christiam Camacho
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA
| | - Grzegorz M. Boratyn
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA
| | - Victor Joukov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA
| | - Roberto Vera Alvarez
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA
|
22
|
Building cloud computing environments for genome analysis in Japan. Hum Genome Var 2022; 9:46. [PMID: 36517473 PMCID: PMC9751107 DOI: 10.1038/s41439-022-00223-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2022] [Revised: 10/29/2022] [Accepted: 11/07/2022] [Indexed: 12/23/2022] Open
Abstract
This review article describes the current status of data archiving and computational infrastructure in the field of genomic medicine, focusing primarily on the situation in Japan. I begin by introducing the status of supercomputer operations in Japan, where a high-performance computing infrastructure (HPCI) is operated to meet the diverse computational needs of science in general. Since this HPCI consists of supercomputers of various architectures located across the nation connected via a high-speed network, including supercomputers specialized in genome science, the status of its response to the explosive increase in genomic data, including the International Nucleotide Sequence Database Collaboration (INSDC) data archive, is explored. Separately, since it is clear that the use of commercial cloud computing environments needs to be promoted, both in light of the rapid increase in computing demands and to support international data sharing and international data analysis projects, I explain how the Japanese government has established a series of guidelines for the use of cloud computing based on its cybersecurity strategy and has begun to build a government cloud for government agencies. I will also carefully consider several other issues of user concern. Finally, I will show how Japan's major cloud computing infrastructure is currently evolving toward a multicloud and hybrid cloud configuration.
|
23
|
Vuong P, Wise MJ, Whiteley AS, Kaur P. Ten simple rules for investigating (meta)genomic data from environmental ecosystems. PLoS Comput Biol 2022; 18:e1010675. [PMID: 36480496 PMCID: PMC9731419 DOI: 10.1371/journal.pcbi.1010675] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/13/2022] Open
Affiliation(s)
- Paton Vuong
- UWA School of Agriculture & Environment, University of Western Australia, Perth, Australia
| | - Michael J. Wise
- School of Physics, Mathematics and Computing, University of Western Australia, Perth, Australia
- The Marshall Centre of Infectious Diseases, School of Biological Sciences, The University of Western Australia, Perth, Australia
| | - Andrew S. Whiteley
- Centre for Environment & Life Sciences, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Floreat, Australia
| | - Parwinder Kaur
- UWA School of Agriculture & Environment, University of Western Australia, Perth, Australia
- * E-mail:
|
24
|
Truong L, Ayora F, D’Orsogna L, Martinez P, De Santis D. Nanopore sequencing data analysis using Microsoft Azure cloud computing service. PLoS One 2022; 17:e0278609. [PMID: 36459531 PMCID: PMC9718390 DOI: 10.1371/journal.pone.0278609] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2022] [Accepted: 11/20/2022] [Indexed: 12/04/2022] Open
Abstract
Genetic information provides insights into the exome, genome, epigenetics and structural organisation of the organism. Given the enormous amount of genetic information, scientists are able to perform mammoth tasks to improve the standard of health care such as determining genetic influences on outcome of allogeneic transplantation. Cloud based computing has increasingly become a key choice for many scientists, engineers and institutions as it offers on-demand network access and users can conveniently rent rather than buy all required computing resources. With the positive advancements of cloud computing and nanopore sequencing data output, we were motivated to develop an automated and scalable analysis pipeline utilizing cloud infrastructure in Microsoft Azure to accelerate HLA genotyping service and improve the efficiency of the workflow at lower cost. In this study, we describe (i) the selection process for suitable virtual machine sizes for computing resources to balance between the best performance versus cost effectiveness; (ii) the building of Docker containers to include all tools in the cloud computational environment; (iii) the comparison of HLA genotype concordance between the in-house manual method and the automated cloud-based pipeline to assess data accuracy. In conclusion, the Microsoft Azure cloud based data analysis pipeline was shown to meet all the key imperatives for performance, cost, usability, simplicity and accuracy. Importantly, the pipeline allows for the on-going maintenance and testing of version changes before implementation. This pipeline is suitable for the data analysis from MinION sequencing platform and could be adopted for other data analysis application processes.
Affiliation(s)
- Linh Truong
- Department of Clinical Immunology, PathWest, Perth, Australia
- UWA Medical School, University of Western Australia, Perth, Australia
- * E-mail:
| | - Felipe Ayora
- Research and Advanced Computing, BizData, Wellington, New Zealand
| | - Lloyd D’Orsogna
- Department of Clinical Immunology, PathWest, Perth, Australia
- UWA Medical School, University of Western Australia, Perth, Australia
| | - Patricia Martinez
- Department of Clinical Immunology, PathWest, Perth, Australia
- UWA Medical School, University of Western Australia, Perth, Australia
| | - Dianne De Santis
- Department of Clinical Immunology, PathWest, Perth, Australia
- UWA Medical School, University of Western Australia, Perth, Australia
|
25
|
Rehn J, Mayoh C, Heatley SL, McClure BJ, Eadie LN, Schutz C, Yeung DT, Cowley MJ, Breen J, White DL. RaScALL: Rapid (Ra) screening (Sc) of RNA-seq data for prognostically significant genomic alterations in acute lymphoblastic leukaemia (ALL). PLoS Genet 2022; 18:e1010300. [PMID: 36251721 PMCID: PMC9612819 DOI: 10.1371/journal.pgen.1010300] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 06/21/2022] [Revised: 10/27/2022] [Accepted: 09/22/2022] [Indexed: 12/05/2022] Open
Abstract
RNA-sequencing (RNA-seq) efforts in acute lymphoblastic leukaemia (ALL) have identified numerous prognostically significant genomic alterations which can guide diagnostic risk stratification and treatment choices when detected early. However, integrating RNA-seq in a clinical setting requires rapid detection and accurate reporting of clinically relevant alterations. Here we present RaScALL, an implementation of the k-mer based variant detection tool km, capable of identifying more than 100 prognostically significant lesions observed in ALL, including gene fusions, single nucleotide variants and focal gene deletions. We compared genomic alterations detected by RaScALL and those reported by alignment-based de novo variant detection tools in a study cohort of 180 Australian patient samples. Results were validated using 100 patient samples from a published North American cohort. RaScALL demonstrated a high degree of accuracy for reporting subtype defining genomic alterations. Gene fusions, including difficult to detect fusions involving EPOR and DUX4, were accurately identified in 98% of reported cases in the study cohort (n = 164) and 95% of samples (n = 63) in the validation cohort. Pathogenic sequence variants were correctly identified in 75% of tested samples, including all cases involving subtype defining variants PAX5 p.P80R (n = 12) and IKZF1 p.N159Y (n = 4). Intragenic IKZF1 deletions resulting in aberrant transcript isoforms were also detectable with 98% accuracy. Importantly, the median analysis time for detection of all targeted alterations averaged 22 minutes per sample, significantly shorter than standard alignment-based approaches. The application of RaScALL enables rapid identification and reporting of previously identified genomic alterations of known clinical relevance.
Affiliation(s)
- Jacqueline Rehn
- Blood Cancer Program, Precision Cancer Medicine Theme, South Australian Health & Medical Research Institute (SAHMRI), Adelaide, South Australia, Australia
- Faculty of Health and Medical Science, University of Adelaide, Adelaide, South Australia, Australia
| | - Chelsea Mayoh
- Children’s Cancer Institute, Kensington, New South Wales, Australia
- School of Clinical Medicine, UNSW Sydney, Sydney, New South Wales, Australia
| | - Susan L Heatley
- Blood Cancer Program, Precision Cancer Medicine Theme, South Australian Health & Medical Research Institute (SAHMRI), Adelaide, South Australia, Australia
- Faculty of Health and Medical Science, University of Adelaide, Adelaide, South Australia, Australia
- Australian and New Zealand Children’s Oncology Group (ANZCHOG), Clayton, Victoria, Australia
| | - Barbara J McClure
- Blood Cancer Program, Precision Cancer Medicine Theme, South Australian Health & Medical Research Institute (SAHMRI), Adelaide, South Australia, Australia
- Faculty of Health and Medical Science, University of Adelaide, Adelaide, South Australia, Australia
| | - Laura N Eadie
- Blood Cancer Program, Precision Cancer Medicine Theme, South Australian Health & Medical Research Institute (SAHMRI), Adelaide, South Australia, Australia
- Faculty of Health and Medical Science, University of Adelaide, Adelaide, South Australia, Australia
| | - Caitlin Schutz
- Blood Cancer Program, Precision Cancer Medicine Theme, South Australian Health & Medical Research Institute (SAHMRI), Adelaide, South Australia, Australia
| | - David T Yeung
- Blood Cancer Program, Precision Cancer Medicine Theme, South Australian Health & Medical Research Institute (SAHMRI), Adelaide, South Australia, Australia
- Faculty of Health and Medical Science, University of Adelaide, Adelaide, South Australia, Australia
- Department of Haematology, Royal Adelaide Hospital and SA Pathology, Adelaide, South Australia, Australia
| | - Mark J Cowley
- Children’s Cancer Institute, Kensington, New South Wales, Australia
- School of Clinical Medicine, UNSW Sydney, Sydney, New South Wales, Australia
| | - James Breen
- Black Ochre Data Labs, Telethon Kids Institute, Adelaide, South Australia, Australia
- Australian National University, Canberra, Australian Capital Territory, Australia
- * E-mail:
| | - Deborah L White
- Blood Cancer Program, Precision Cancer Medicine Theme, South Australian Health & Medical Research Institute (SAHMRI), Adelaide, South Australia, Australia
- Faculty of Health and Medical Science, University of Adelaide, Adelaide, South Australia, Australia
- Australian and New Zealand Children’s Oncology Group (ANZCHOG), Clayton, Victoria, Australia
- Australian Genomics Health Alliance (AGHA), Parkville, Victoria, Australia
- Faculty of Sciences, University of Adelaide, Adelaide, South Australia, Australia
|
26
|
Rahimzadeh V, Peng G, Cho M. A mixed-methods protocol to develop and validate a stewardship maturity matrix for human genomic data in the cloud. Front Genet 2022; 13:876869. [DOI: 10.3389/fgene.2022.876869] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2022] [Accepted: 09/28/2022] [Indexed: 11/13/2022] Open
Abstract
This article describes a mixed-methods protocol to develop and test the implementation of a stewardship maturity matrix (SMM) for repositories which govern access to human genomic data in the cloud. It is anticipated that the cloud will host most human genomic and related health datasets generated as part of publicly funded research in the coming years. However, repository managers lack practical tools for identifying what stewardship outcomes matter most to key stakeholders as well as how to track progress on their stewardship goals over time. In this article we describe a protocol that combines Delphi survey methods with SMM modeling first introduced in the earth and planetary sciences to develop a stewardship impact assessment tool for repositories that manage access to human genomic data. We discuss the strengths and limitations of this mixed-methods design and offer points to consider for wrangling both quantitative and qualitative data to enhance rigor and representativeness. We conclude with how the empirical methods bridged in this protocol have potential to improve evaluation of data stewardship systems and better align them with diverse stakeholder values in genomic data science.
|
27
|
Data Provenance for Cloud Forensic Investigations, Security, Challenges, Solutions and Future Perspectives: a survey. Journal of King Saud University - Computer and Information Sciences 2022. [DOI: 10.1016/j.jksuci.2022.10.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
|
28
|
Yi J, Zhang H, Mao J, Chen Y, Zhong H, Wang Y. Review on the COVID-19 pandemic prevention and control system based on AI. Engineering Applications of Artificial Intelligence 2022; 114:105184. [PMID: 35846728 PMCID: PMC9271459 DOI: 10.1016/j.engappai.2022.105184] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/24/2021] [Revised: 06/28/2022] [Accepted: 07/04/2022] [Indexed: 05/05/2023]
Abstract
As a new technology, artificial intelligence (AI) has recently received increasing attention from researchers and has been successfully applied to many domains. The outbreak of the COVID-19 pandemic has not only put people's lives in jeopardy but has also interrupted social activities and stifled economic growth. Artificial intelligence, as a cutting-edge field of science, is critical in the fight against the pandemic. To respond scientifically to major emergencies like COVID-19, this article reviews the use of artificial intelligence in combating the pandemic from three perspectives: COVID-19 big data, intelligent devices and systems, and intelligent robots. This article's primary contributions are twofold: (1) we summarize the applications of AI in the pandemic, including virus spread prediction, patient diagnosis, vaccine development, excluding potential virus carriers, telemedicine services, economic recovery, material distribution, disinfection, and health care; (2) we identify the challenges faced during AI-based pandemic prevention, including multidimensional data, sub-intelligent algorithms, and a lack of systematization, and discuss corresponding solutions, such as 5G, cloud computing, and unsupervised learning algorithms. This article systematically surveys the applications and challenges of AI technology during the pandemic, which is of great significance for promoting the development of AI technology and can serve as a new reference for future emergencies.
Affiliation(s)
- Junfei Yi
- College of Electrical and Information Engineering, Hunan University, Changsha, 410006, Hunan, China
| | - Hui Zhang
- College of Robotics, Hunan University, Changsha, 410006, Hunan, China
| | - Jianxu Mao
- College of Electrical and Information Engineering, Hunan University, Changsha, 410006, Hunan, China
| | - Yurong Chen
- College of Electrical and Information Engineering, Hunan University, Changsha, 410006, Hunan, China
| | - Hang Zhong
- College of Electrical and Information Engineering, Hunan University, Changsha, 410006, Hunan, China
| | - Yaonan Wang
- College of Electrical and Information Engineering, Hunan University, Changsha, 410006, Hunan, China
|
29
|
Borowiec ML, Dikow RB, Frandsen PB, McKeeken A, Valentini G, White AE. Deep learning as a tool for ecology and evolution. Methods Ecol Evol 2022. [DOI: 10.1111/2041-210x.13901] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 11/29/2022]
Affiliation(s)
- Marek L. Borowiec
- Entomology, Plant Pathology and Nematology, University of Idaho, Moscow, ID, USA
- Institute for Bioinformatics and Evolutionary Studies (IBEST), University of Idaho, Moscow, ID, USA
| | - Rebecca B. Dikow
- Data Science Lab, Office of the Chief Information Officer, Smithsonian Institution, Washington, DC, USA
| | - Paul B. Frandsen
- Data Science Lab, Office of the Chief Information Officer, Smithsonian Institution, Washington, DC, USA
- Department of Plant and Wildlife Sciences, Brigham Young University, Provo, UT, USA
| | - Alexander McKeeken
- Entomology, Plant Pathology and Nematology, University of Idaho, Moscow, ID, USA
| | | | - Alexander E. White
- Data Science Lab, Office of the Chief Information Officer, Smithsonian Institution, Washington, DC, USA
- Department of Botany, National Museum of Natural History, Smithsonian Institution, Washington, DC, USA
|
30
|
Opportunities and challenges for the use of common controls in sequencing studies. Nat Rev Genet 2022; 23:665-679. [PMID: 35581355 DOI: 10.1038/s41576-022-00487-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Accepted: 03/22/2022] [Indexed: 01/02/2023]
Abstract
Genome-wide association studies using large-scale genome and exome sequencing data have become increasingly valuable in identifying associations between genetic variants and disease, transforming basic research and translational medicine. However, this progress has not been equally shared across all people and conditions, in part due to limited resources. Leveraging publicly available sequencing data as external common controls, rather than sequencing new controls for every study, can better allocate resources by augmenting control sample sizes or providing controls where none existed. However, common control studies must be carefully planned and executed as even small differences in sample ascertainment and processing can result in substantial bias. Here, we discuss challenges and opportunities for the robust use of common controls in high-throughput sequencing studies, including study design, quality control and statistical approaches. Thoughtful generation and use of large and valuable genetic sequencing data sets will enable investigation of a broader and more representative set of conditions, environments and genetic ancestries than otherwise possible.
|
31
|
Muenzen KD, Amendola LM, Kauffman TL, Mittendorf KF, Bensen JT, Chen F, Green R, Powell BC, Kvale M, Angelo F, Farnan L, Fullerton SM, Robinson JO, Li T, Murali P, Lawlor JM, Ou J, Hindorff LA, Jarvik GP, Crosslin DR. Lessons learned and recommendations for data coordination in collaborative research: The CSER consortium experience. HGG Advances 2022; 3:100120. [PMID: 35707062 PMCID: PMC9190054 DOI: 10.1016/j.xhgg.2022.100120] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/18/2022] [Accepted: 05/16/2022] [Indexed: 11/18/2022] Open
Abstract
Integrating data across heterogeneous research environments is a key challenge in multi-site, collaborative research projects. While it is important to allow for natural variation in data collection protocols across research sites, it is also important to achieve interoperability between datasets in order to reap the full benefits of collaborative work. However, there are few standards to guide the data coordination process from project conception to completion. In this paper, we describe the experiences of the Clinical Sequence Evidence-Generating Research (CSER) consortium Data Coordinating Center (DCC), which coordinated harmonized survey and genomic sequencing data from seven clinical research sites from 2020 to 2022. Using input from multiple consortium working groups and from CSER leadership, we first identify 14 lessons learned from CSER in the categories of communication, harmonization, informatics, compliance, and analytics. We then distill these lessons learned into 11 recommendations for future research consortia in the areas of planning, communication, informatics, and analytics. We recommend that planning and budgeting for data coordination activities occur as early as possible during consortium conceptualization and development to minimize downstream complications. We also find that clear, reciprocal, and continuous communication between consortium stakeholders and the DCC is equally important to maintaining a secure and centralized informatics ecosystem for pooling data. Finally, we discuss the importance of actively interrogating current approaches to data governance, particularly for research studies that straddle the research-clinical divide.
|
32
|
Thelwall M, Maflahi N. Research Co-authorship 1900–2020: Continuous, universal, and ongoing expansion. Quantitative Science Studies 2022. [DOI: 10.1162/qss_a_00188] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2022] Open
Abstract
Research co-authorship is useful to combine different skillsets, especially for applied problems. Whilst it has increased over the last century, it is unclear whether this increase is universal across academic fields and which fields co-author the most and least. In response, this article assesses changes in the rate of journal article co-authorship 1900–2020 for all 27 Scopus broad fields and all 332 Scopus narrow fields. Whilst all broad fields have experienced reasonably continuous growth in co-authorship, in 2020 there were substantial disciplinary differences, from Arts and Humanities (1.3 authors) to Immunology and Microbiology (6 authors). All 332 Scopus narrow fields also experienced an increase in the average number of authors. Immunology and Classics are extreme Scopus narrow fields, as exemplified by 9.6 authors per Journal for ImmunoTherapy of Cancer article, whilst 93% of Trends in Classics articles were solo in 2020. The reason for this large difference seems to be the need for multiple complementary methods in Immunology, making it fundamentally a team science. Finally, the reasonably steady and universal increases in academic coauthorship over 121 years show no sign of slowing, suggesting that ever expanding teams are a central part of current professional science.
|
33
|
Field MA. Bioinformatic Challenges Detecting Genetic Variation in Precision Medicine Programs. Front Med (Lausanne) 2022; 9:806696. [PMID: 35463004 PMCID: PMC9024231 DOI: 10.3389/fmed.2022.806696] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/01/2021] [Accepted: 03/21/2022] [Indexed: 11/13/2022] Open
Abstract
Precision medicine programs to identify clinically relevant genetic variation have been revolutionized by access to increasingly affordable high-throughput sequencing technologies. A decade of continual drops in per-base sequencing costs means it is now feasible to sequence an individual patient genome and interrogate all classes of genetic variation for < $1,000 USD. However, while advances in these technologies have greatly simplified the ability to obtain patient sequence information, the timely analysis and interpretation of variant information remains a challenge for the rollout of large-scale precision medicine programs. This review will examine the challenges and potential solutions that exist in identifying predictive genetic biomarkers and pharmacogenetic variants in a patient and discuss the larger bioinformatic challenges likely to emerge in the future. It will examine how both software and hardware development are aiming to overcome issues in short read mapping, variant detection and variant interpretation. It will discuss the current state of the art for genetic disease and the remaining challenges to overcome for complex disease. Success across all types of disease will require novel statistical models and software in order to ensure precision medicine programs realize their full potential now and into the future.
Affiliation(s)
- Matt A. Field
- Centre for Tropical Bioinformatics and Molecular Biology, College of Public Health, Medical and Veterinary Science, James Cook University, Cairns, QLD, Australia
- Immunogenomics Lab, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
- Menzies School of Health Research, Charles Darwin University, Darwin, NT, Australia
- *Correspondence: Matt A. Field
|
34
|
Wu F, Liu YZ, Ling B. MTD: a unique pipeline for host and meta-transcriptome joint and integrative analyses of RNA-seq data. Brief Bioinform 2022; 23:6563416. [PMID: 35380623 PMCID: PMC9116375 DOI: 10.1093/bib/bbac111] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2021] [Revised: 02/22/2022] [Accepted: 03/06/2022] [Indexed: 11/13/2022] Open
Abstract
Ribonucleic acid (RNA)-seq data contain not only host transcriptomes but also nonhost information that comprises transcripts from active microbiota in the host cells. Therefore, joint and integrative analyses of both host and meta-transcriptome can reveal gene expression of the microbial community in a given sample as well as the correlative and interactive dynamics of the host response to the microbiome. However, there are no convenient tools that can systematically analyze host-microbiota interactions by simultaneously quantifying the host and meta-transcriptome in the same sample at the tissue and the single-cell level. This poses a challenge for interested researchers with limited expertise in bioinformatics. Here, we developed a software pipeline that can comprehensively and synergistically analyze and correlate the host and meta-transcriptome in a single sample using bulk and single-cell RNA-seq data. This pipeline, named meta-transcriptome detector (MTD), can extensively identify and quantify the microbiome, including viruses, bacteria, protozoa, fungi, plasmids and vectors, in the host cells and correlate the microbiome with the host transcriptome. MTD is easy to install and run, involving only a few lines of simple commands. It offers researchers unique genomic insights into host responses to microorganisms.
Affiliation(s)
- Fei Wu
- Host-Pathogen Interaction Program, Texas Biomedical Research Institute, 8715 W Military Dr, San Antonio, TX 78227, USA
- Tulane Center for Aging, Tulane University School of Medicine, New Orleans, LA 70112, USA
| | - Yao-Zhong Liu
- Tulane University School of Public Health and Tropical Medicine, New Orleans, LA 70112, USA
| | - Binhua Ling
- Host-Pathogen Interaction Program, Texas Biomedical Research Institute, 8715 W Military Dr, San Antonio, TX 78227, USA
|
35
|
Schuler BA, Nelson ET, Koziura M, Cogan JD, Hamid R, Phillips JA. Lessons learned: next-generation sequencing applied to undiagnosed genetic diseases. J Clin Invest 2022; 132:e154942. [PMID: 35362483 PMCID: PMC8970663 DOI: 10.1172/jci154942] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Indexed: 12/12/2022] Open
Abstract
Rare genetic disorders, when considered together, are relatively common. Despite advancements in genetics and genomics technologies as well as increased understanding of genomic function and dysfunction, many genetic diseases continue to be difficult to diagnose. The goal of this Review is to increase non-genetics providers' familiarity with genetic testing strategies. As genetic testing is increasingly used in primary care, many subspecialty clinics, and various inpatient settings, it is important that non-genetics providers have a fundamental understanding of the strengths and weaknesses of various genetic testing strategies and develop the ability to interpret genetic testing results. We provide background on commonly used genetic testing approaches, give examples of phenotypes in which the various genetic testing approaches are used, describe types of genetic and genomic variations, cover challenges in variant identification, provide examples in which next-generation sequencing (NGS) failed to uncover the variant responsible for a disease, and discuss opportunities for continued improvement in the clinical application of NGS. As genetic testing increasingly becomes a part of all areas of medicine, familiarity with genetic testing approaches and result interpretation is vital to decrease the burden of undiagnosed disease.
Affiliation(s)
- Bryce A. Schuler
- Division of Medical Genetics and Genomics and Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Erica T. Nelson
- Division of Medical Genetics and Genomics and Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Mary Koziura
- Division of Medical Genetics and Genomics and Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Joy D. Cogan
- Division of Medical Genetics and Genomics and Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Rizwan Hamid
- Division of Medical Genetics and Genomics and Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - John A. Phillips
- Division of Medical Genetics and Genomics and Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
|
36
|
Stephan T, Burgess SM, Cheng H, Danko CG, Gill CA, Jarvis ED, Koepfli KP, Koltes JE, Lyons E, Ronald P, Ryder OA, Schriml LM, Soltis P, VandeWoude S, Zhou H, Ostrander EA, Karlsson EK. Darwinian genomics and diversity in the tree of life. Proc Natl Acad Sci U S A 2022; 119:e2115644119. [PMID: 35042807 PMCID: PMC8795533 DOI: 10.1073/pnas.2115644119] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 12/21/2022] Open
Abstract
Genomics encompasses the entire tree of life, both extinct and extant, and the evolutionary processes that shape this diversity. To date, genomic research has focused on humans, a small number of agricultural species, and established laboratory models. Fewer than 18,000 of ∼2,000,000 eukaryotic species (<1%) have a representative genome sequence in GenBank, and only a fraction of these have ancillary information on genome structure, genetic variation, gene expression, epigenetic modifications, and population diversity. This imbalance reflects a perception that human studies are paramount in disease research. Yet understanding how genomes work, and how genetic variation shapes phenotypes, requires a broad view that embraces the vast diversity of life. We have the technology to collect massive and exquisitely detailed datasets about the world, but expertise is siloed into distinct fields. A new approach, integrating comparative genomics with cell and evolutionary biology, ecology, archaeology, anthropology, and conservation biology, is essential for understanding and protecting ourselves and our world. Here, we describe potential for scientific discovery when comparative genomics works in close collaboration with a broad range of fields as well as the technical, scientific, and social constraints that must be addressed.
Affiliation(s)
- Taylorlyn Stephan
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20817
| | - Shawn M Burgess
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20817
| | - Hans Cheng
- Avian Disease and Oncology Laboratory, Agricultural Research Service, US Department of Agriculture, East Lansing, MI 48823
| | - Charles G Danko
- Department of Biomedical Sciences, Baker Institute for Animal Health, Cornell University, Ithaca, NY 14850
| | - Clare A Gill
- Department of Animal Science, Texas A&M University, College Station, TX 77843
| | - Erich D Jarvis
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY 10065
- HHMI, Chevy Chase, MD 20815
| | - Klaus-Peter Koepfli
- Smithsonian-Mason School of Conservation, George Mason University, Front Royal, VA 22630
- Smithsonian Conservation Biology Institute, National Zoological Park, Washington, DC 20008
| | - James E Koltes
- Department of Animal Science, Iowa State University, Ames, IA 50011
| | - Eric Lyons
- School of Plant Sciences, BIO5 Institute, University of Arizona, Tucson, AZ 85721
| | - Pamela Ronald
- Department of Plant Pathology, University of California, Davis, CA 95616
- The Genome Center, University of California, Davis, CA 95616
- The Innovative Genomics Institute, University of California, Berkeley, CA 94720
- Grass Genetics, Joint Bioenergy Institute, Emeryville, CA 94608
| | - Oliver A Ryder
- San Diego Zoo Wildlife Alliance, Escondido, CA 92027
- Department of Evolution, Behavior, and Ecology, University of California San Diego, La Jolla, CA 92093
| | - Lynn M Schriml
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201
| | - Pamela Soltis
- Florida Museum of Natural History, University of Florida, Gainesville, FL 32611
| | - Sue VandeWoude
- Department of Micro-, Immuno-, and Pathology, Colorado State University, Fort Collins, CO 80532
| | - Huaijun Zhou
- Department of Animal Science, University of California, Davis, CA 95616
| | - Elaine A Ostrander
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20817
| | - Elinor K Karlsson
- Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA 01655;
- Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, MA 01655
- Broad Institute of MIT and Harvard, Cambridge, MA 02142
|
37
|
Schatz MC, Philippakis AA, Afgan E, Banks E, Carey VJ, Carroll RJ, Culotti A, Ellrott K, Goecks J, Grossman RL, Hall IM, Hansen KD, Lawson J, Leek JT, Luria AO, Mosher S, Morgan M, Nekrutenko A, O’Connor BD, Osborn K, Paten B, Patterson C, Tan FJ, Taylor CO, Vessio J, Waldron L, Wang T, Wuichet K. Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space. CELL GENOMICS 2022; 2:100085. [PMID: 35199087 PMCID: PMC8863334 DOI: 10.1016/j.xgen.2021.100085] [Citation(s) in RCA: 47] [Impact Index Per Article: 23.5] [Indexed: 12/14/2022]
Abstract
The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL; https://anvilproject.org) was developed to address a widespread community need for a unified computing environment for genomics data storage, management, and analysis. In this perspective, we present AnVIL, describe its ecosystem and interoperability with other platforms, and highlight how this platform and associated initiatives contribute to improved genomic data sharing efforts. The AnVIL is a federated cloud platform designed to manage and store genomics and related data, enable population-scale analysis, and facilitate collaboration through the sharing of data, code, and analysis results. By inverting the traditional model of data sharing, the AnVIL eliminates the need for data movement while also adding security measures for active threat detection and monitoring and provides scalable, shared computing resources for any researcher. We describe the core data management and analysis components of the AnVIL, which currently consists of Terra, Gen3, Galaxy, RStudio/Bioconductor, Dockstore, and Jupyter, and describe several flagship genomics datasets available within the AnVIL. We continue to extend and innovate the AnVIL ecosystem by implementing new capabilities, including mechanisms for interoperability and responsible data sharing, while streamlining access management. The AnVIL opens many new opportunities for analysis, collaboration, and data sharing that are needed to drive research and to make discoveries through the joint analysis of hundreds of thousands to millions of genomes along with associated clinical and molecular data types.
Affiliation(s)
- Michael C. Schatz
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA; Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Corresponding author
| | | | - Enis Afgan
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Eric Banks
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | | | - Robert J. Carroll
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Alessandro Culotti
- Broad Institute of MIT and Harvard, Cambridge, MA, USA; Center for Translational Data Science, University of Chicago, Chicago, IL, USA
| | - Kyle Ellrott
- Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA
| | - Jeremy Goecks
- Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA
| | - Robert L. Grossman
- Center for Translational Data Science, University of Chicago, Chicago, IL, USA
| | - Ira M. Hall
- Yale School of Medicine, Yale University, New Haven, CT, USA
| | - Kasper D. Hansen
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
| | | | - Jeffrey T. Leek
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
| | | | - Stephen Mosher
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Martin Morgan
- Department of Biostatistics and Bioinformatics, Roswell Park Comprehensive Cancer Center, Buffalo, NY, USA
| | - Anton Nekrutenko
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, State College, PA, USA
| | | | - Kevin Osborn
- UC Santa Cruz Genomics Institute, UC Santa Cruz, Santa Cruz, CA, USA
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, UC Santa Cruz, Santa Cruz, CA, USA
| | | | - Frederick J. Tan
- Department of Embryology, Carnegie Institution, Baltimore, MD, USA
| | - Casey Overby Taylor
- Departments of Medicine and Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Jennifer Vessio
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Levi Waldron
- Department of Epidemiology and Biostatistics, City University of New York Graduate School of Public Health and Health Policy, New York, NY, USA
| | - Ting Wang
- Department of Genetics, Washington University of St. Louis, St. Louis, MO, USA
| | - Kristin Wuichet
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
|
38
|
Horgan D, Curigliano G, Rieß O, Hofman P, Büttner R, Conte P, Cufer T, Gallagher WM, Georges N, Kerr K, Penault-Llorca F, Mastris K, Pinto C, Van Meerbeeck J, Munzone E, Thomas M, Ujupan S, Vainer GW, Velthaus JL, André F. Identifying the Steps Required to Effectively Implement Next-Generation Sequencing in Oncology at a National Level in Europe. J Pers Med 2022; 12:72. [PMID: 35055387 PMCID: PMC8780351 DOI: 10.3390/jpm12010072] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 12/08/2021] [Revised: 12/16/2021] [Accepted: 12/29/2021] [Indexed: 02/07/2023] Open
Abstract
Next-generation sequencing (NGS) may enable more focused and highly personalized cancer treatment, with the National Comprehensive Cancer Network and European Society for Medical Oncology guidelines now recommending NGS for daily clinical practice for several tumor types. However, NGS implementation, and therefore patient access, varies across Europe; a multi-stakeholder collaboration is needed to establish the conditions required to improve this discrepancy. In that regard, we set up European Alliance for Personalised Medicine (EAPM)-led expert panels during the first half of 2021, including key stakeholders from across 10 European countries covering medical, economic, patient, industry, and governmental expertise. We describe the outcomes of these panels in order to define and explore the necessary conditions for NGS implementation into routine clinical care to enable patient access, identify specific challenges in achieving them, and make short- and long-term recommendations. The main challenges identified relate to the demand for NGS tests (governance, clinical standardization, and awareness and education) and supply of tests (equitable reimbursement, infrastructure for conducting and validating tests, and testing access driven by evidence generation). Recommendations made to resolve each of these challenges should aid multi-stakeholder collaboration between national and European initiatives, to complement, support, and mutually reinforce efforts to improve patient care.
Affiliation(s)
- Denis Horgan
- European Alliance for Personalised Medicine, Avenue de l’Armee/Legerlaan 10, 1040 Brussels, Belgium
| | - Giuseppe Curigliano
- European Institute of Oncology, IRCCS, Via Giuseppe Ripamonti, 435, 20141 Milan, Italy; (G.C.); (E.M.)
- Department of Oncology and Hemato-Oncology, University of Milan, Via Festa del Perdono, 7, 20122 Milan, Italy
| | - Olaf Rieß
- Institute of Medical Genetics and Applied Genomics, University of Tuebingen, Calwerstrasse 7, 72070 Tuebingen, Germany;
| | - Paul Hofman
- Laboratory of Clinical and Experimental Pathology, University of Côte d’Azur, FHU OncoAge, Biobank BB-0033-00025, Pasteur Hospital, 30 Avenue de la voie Romaine, CEDEX 01, 06001 Nice, France;
| | - Reinhard Büttner
- Institute for Pathology, University Hospital Cologne, Kerpener Str. 62, 50937 Cologne, Germany;
| | - Pierfranco Conte
- The Veneto Institute of Oncology, IRCCS, Via Gattamelata, 64, 35128 Padua, Italy;
- Department of Surgical, Oncological and Gastroenterological Sciences, University of Padua, Via Giustiniani, 2, 35124 Padua, Italy
| | - Tanja Cufer
- Medical Faculty, University of Ljubljana, Vrazov trg 2, 1000 Ljubljana, Slovenia;
| | - William M. Gallagher
- School of Biomolecular and Biomedical Science, University College Dublin, Belfield, D04 V1W8 Dublin, Ireland;
| | - Nadia Georges
- Exact Sciences, Quai du Seujet 10, 1201 Geneva, Switzerland;
| | - Keith Kerr
- School of Medicine and Dentistry, University of Aberdeen, Foresterhill, Aberdeen AB25 2ZD, UK;
| | - Frédérique Penault-Llorca
- Centre Jean Perrin, 58, Rue Montalembert, CEDEX 01, 63011 Clermont-Ferrand, France;
- Department of Pathology, University of Clermont Auvergne, INSERM U1240, 49 bd François Mitterrand, CS 60032, 63001 Clermont-Ferrand, France
| | - Ken Mastris
- Europa Uomo, Leopoldstraat 34, 2000 Antwerp, Belgium;
| | - Carla Pinto
- AstraZeneca, Rua Humberto Madeira 7, 1800 Oeiras, Portugal;
| | - Jan Van Meerbeeck
- Antwerp University Hospital, University of Antwerp, Wijlrijkstraat 10, 2650 Edegem, Belgium;
| | - Elisabetta Munzone
- European Institute of Oncology, IRCCS, Via Giuseppe Ripamonti, 435, 20141 Milan, Italy; (G.C.); (E.M.)
| | - Marlene Thomas
- F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070 Basel, Switzerland;
| | - Sonia Ujupan
- Eli Lilly and Company, Rue du Marquis 1, Markiesstraat, 1000 Brussels, Belgium;
| | - Gilad W. Vainer
- Department of Pathology, Hadassah Hebrew-University Medical Center, Hebrew University of Jerusalem, Kalman Ya’akov Man St, Jerusalem 91905, Israel;
| | - Janna-Lisa Velthaus
- University Medical Center Hamburg-Eppendorf, Martinistraße 52, 20251 Hamburg, Germany;
| | - Fabrice André
- Institut Gustave Roussy, 114 Rue Edouard Vaillant, 94805 Villejuif, France;
|
39
|
Erfannia L, Alipour J. How does cloud computing improve cancer information management? A systematic review. INFORMATICS IN MEDICINE UNLOCKED 2022. [DOI: 10.1016/j.imu.2022.101095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022] Open
|
40
|
A Survey of Swarm Intelligence Based Load Balancing Techniques in Cloud Computing Environment. ELECTRONICS 2021. [DOI: 10.3390/electronics10212718] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022]
Abstract
Cloud computing offers flexible, interactive, and observable access to shared resources on the Internet. It frees users from the requirements of managing computing on their hardware. It enables users not only to store their data and computing on the Internet but also to access them whenever and wherever required. The frequent use of smart devices has helped cloud computing to realize the need for its rapid growth. As more users are adapting to the cloud environment, the focus has been placed on load balancing. Load balancing allocates tasks or resources to different devices. In cloud computing, load balancing has played a major role in the efficient usage of resources for the highest performance. This requirement results in the development of algorithms that can optimally assign resources while managing load and improving quality of service (QoS). This paper provides a survey of load balancing algorithms inspired by swarm intelligence (SI). The algorithms considered in the discussion are Genetic Algorithm, BAT Algorithm, Ant Colony, Grey Wolf, Artificial Bee Colony, Particle Swarm, Whale, Social Spider, Dragonfly, and Raven roosting Optimization. An analysis of the main objectives, area of applications, and targeted issues of each algorithm (with advancements) is presented. In addition, performance analysis has been performed based on average response time, data center processing time, and other quality parameters.
|
41
|
Yu H, Ogbeyemi A, Lin W, He J, Sun W, Zhang W. A semantic model for enterprise application integration in the era of data explosion and globalisation. ENTERP INF SYST-UK 2021. [DOI: 10.1080/17517575.2021.1989495] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/20/2022]
Affiliation(s)
- H.Y. Yu
- School of Mechanical Engineering, Donghua University, Shanghai, China
| | - Akinola Ogbeyemi
- Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, Canada
| | - W.J. Lin
- Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, Canada
| | - Jingyi He
- Faculty of Nursing, University of Alberta, Edmonton, Canada
| | - Wei Sun
- Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, Canada
| | - W.J. Zhang
- Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, Canada
|
42
|
Lim HGM, Hsiao SH, Lee YCG. Orchestrating an Optimized Next-Generation Sequencing-Based Cloud Workflow for Robust Viral Identification during Pandemics. BIOLOGY 2021; 10:biology10101023. [PMID: 34681121 PMCID: PMC8533344 DOI: 10.3390/biology10101023] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/20/2021] [Revised: 09/24/2021] [Accepted: 10/06/2021] [Indexed: 10/24/2022]
Abstract
Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has recently become a novel pandemic event following the swine flu that occurred in 2009, which was caused by the influenza A virus (H1N1 subtype). The accurate identification of the huge number of samples during a pandemic still remains a challenge. In this study, we integrate two technologies, next-generation sequencing and cloud computing, into an optimized workflow version that uses a specific identification algorithm on the designated cloud platform. We use 182 samples (92 for COVID-19 and 90 for swine flu) with short-read sequencing data from two open-access datasets to represent each pandemic and evaluate our workflow performance based on an index specifically created for SARS-CoV-2 or H1N1. Results show that our workflow could differentiate cases between the two pandemics with a higher accuracy depending on the index used, especially when the index that exclusively represented each dataset was used. Our workflow substantially outperforms the original complete identification workflow available on the same platform in terms of time and cost by preserving essential tools internally. Our workflow can serve as a powerful tool for the robust identification of cases and, thus, aid in controlling the current and future pandemics.
Affiliation(s)
- Hendrick Gao-Min Lim
- Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 11031, Taiwan;
| | - Shih-Hsin Hsiao
- Division of Pulmonary Medicine, Department of Internal Medicine, School of Medicine, College of Medicine, Taipei Medical University, Taipei 11031, Taiwan;
- Division of Pulmonary Medicine, Department of Internal Medicine, Taipei Medical University Hospital, Taipei 11031, Taiwan
| | - Yuan-Chii Gladys Lee
- Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 11031, Taiwan;
- Correspondence:
|
43
|
Grzesik P, Augustyn DR, Wyciślik Ł, Mrozek D. Serverless computing in omics data analysis and integration. Brief Bioinform 2021; 23:6367629. [PMID: 34505137 PMCID: PMC8499876 DOI: 10.1093/bib/bbab349] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 07/27/2021] [Revised: 06/28/2021] [Accepted: 08/06/2021] [Indexed: 11/30/2022] Open
Abstract
A comprehensive analysis of omics data can require vast computational resources and access to varied data sources that must be integrated into complex, multi-step analysis pipelines. Execution of many such analyses can be accelerated by applying the cloud computing paradigm, which provides scalable resources for storing data of different types and parallelizing data analysis computations. Moreover, these resources can be reused for different multi-omics analysis scenarios. Traditionally, developers are required to manage a cloud platform’s underlying infrastructure, configuration, maintenance and capacity planning. The serverless computing paradigm simplifies these operations by automatically allocating and maintaining both servers and virtual machines, as required for analysis tasks. This paradigm offers highly parallel execution and high scalability without manual management of the underlying infrastructure, freeing developers to focus on operational logic. This paper reviews serverless solutions in bioinformatics and evaluates their usage in omics data analysis and integration. We start by reviewing the application of the cloud computing model to a multi-omics data analysis and exposing some shortcomings of the early approaches. We then introduce the serverless computing paradigm and show its applicability for performing an integrative analysis of multiple omics data sources in the context of the COVID-19 pandemic.
Affiliation(s)
- Piotr Grzesik
- Silesian University of Technology, Department of Applied Informatics, Gliwice 44-100, Poland
| | - Dariusz R Augustyn
- Silesian University of Technology, Department of Applied Informatics, Gliwice 44-100, Poland
| | - Łukasz Wyciślik
- Silesian University of Technology, Department of Applied Informatics, Gliwice 44-100, Poland
| | - Dariusz Mrozek
- Corresponding author: Dariusz Mrozek, Department of Applied Informatics, Silesian University of Technology, Gliwice 44-100, Poland. E-mail:
|
44
|
Zafeiropoulos H, Gioti A, Ninidakis S, Potirakis A, Paragkamian S, Angelova N, Antoniou A, Danis T, Kaitetzidou E, Kasapidis P, Kristoffersen JB, Papadogiannis V, Pavloudi C, Ha QV, Lagnel J, Pattakos N, Perantinos G, Sidirokastritis D, Vavilis P, Kotoulas G, Manousaki T, Sarropoulou E, Tsigenopoulos CS, Arvanitidis C, Magoulas A, Pafilis E. 0s and 1s in marine molecular research: a regional HPC perspective. Gigascience 2021; 10:6353916. [PMID: 34405237 PMCID: PMC8371273 DOI: 10.1093/gigascience/giab053] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 04/07/2021] [Revised: 07/07/2021] [Accepted: 07/20/2021] [Indexed: 01/23/2023] Open
Abstract
High-performance computing (HPC) systems have become indispensable for modern marine research, providing support to an increasing number and diversity of users. Pairing with the impetus offered by high-throughput methods to key areas such as non-model organism studies, their operation continuously evolves to meet the corresponding computational challenges. Here, we present a Tier 2 (regional) HPC facility, operating for over a decade at the Institute of Marine Biology, Biotechnology, and Aquaculture of the Hellenic Centre for Marine Research in Greece. Strategic choices made in design and upgrades aimed to strike a balance between depth (the need for a few high-memory nodes) and breadth (a number of slimmer nodes), as dictated by the idiosyncrasy of the supported research. Qualitative computational requirement analysis of the latter revealed the diversity of marine fields, methods, and approaches adopted to translate data into knowledge. In addition, hardware and software architectures, usage statistics, policy, and user management aspects of the facility are presented. Drawing upon the last decade's experience from the different levels of operation of the Institute of Marine Biology, Biotechnology, and Aquaculture HPC facility, a number of lessons are presented; these have contributed to the facility's future directions in light of emerging distribution technologies (e.g., containers) and Research Infrastructure evolution. In combination with detailed knowledge of the facility usage and its upcoming upgrade, future collaborations in marine research and beyond are envisioned.
Affiliation(s)
- Haris Zafeiropoulos
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece; Department of Biology, University of Crete, Voutes University Campus, P.O. Box 2208, 70013, Heraklion, Crete, Greece
| | - Anastasia Gioti
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Stelios Ninidakis
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Antonis Potirakis
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Savvas Paragkamian
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece; Department of Biology, University of Crete, Voutes University Campus, P.O. Box 2208, 70013, Heraklion, Crete, Greece
| | - Nelina Angelova
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Aglaia Antoniou
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Theodoros Danis
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece; School of Medicine, University of Crete, Voutes University Campus, 70013 Heraklion, Crete, Greece
| | - Eliza Kaitetzidou
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Panagiotis Kasapidis
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Jon Bent Kristoffersen
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Vasileios Papadogiannis
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Christina Pavloudi
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Quoc Viet Ha
- Bull SAS, Rue du Gros Caillou, 78340 Les Clayes-sous-Bois, France
| | - Jacques Lagnel
- Institut National de Recherche pour l'Agriculture, l'Alimentation et l'Environnement, UR1052, Génétique et Amélioration des Fruits et Légumes, 67 Allée des Chênes, Centre de Recherche Provence-Alpes-Côte d'Azur, Domaine Saint Maurice, CS60094, 84143 Montfavet Cedex, France
| | - Nikos Pattakos
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Giorgos Perantinos
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Dimitris Sidirokastritis
- Hellenic Centre for Marine Research, Network Operation Center, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Panagiotis Vavilis
- Hellenic Centre for Marine Research, Network Operation Center, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Georgios Kotoulas
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Tereza Manousaki
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Elena Sarropoulou
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Costas S Tsigenopoulos
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Christos Arvanitidis
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece; LifeWatch European Research Infrastructure Consortium, Sector II-III Plaza de España, 41071, Seville, Spain
| | - Antonios Magoulas
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
| | - Evangelos Pafilis
- Hellenic Centre for Marine Research, Institute of Marine Biology, Biotechnology and Aquaculture, Former U.S. Base of Gournes, P.O. Box 2214, 71003, Heraklion, Crete, Greece
|
45
|
Yan G, Liu X, Xiao S, Xin W, Xu W, Li Y, Huang T, Qin J, Xie L, Ma J, Zhang Z, Huang L. An imputed whole-genome sequence-based GWAS approach pinpoints causal mutations for complex traits in a specific swine population. SCIENCE CHINA-LIFE SCIENCES 2021; 65:781-794. [PMID: 34387836 DOI: 10.1007/s11427-020-1960-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/17/2021] [Accepted: 05/19/2021] [Indexed: 01/08/2023]
Abstract
Sequencing-based genome-wide association studies (GWAS) have facilitated the identification of causal associations between genetic variants and traits in diverse species. However, it is cost-prohibitive for the majority of research groups to sequence a large number of samples. Here, we carried out genotype imputation to increase the density of single nucleotide polymorphisms in a large-scale Swine F2 population using a reference panel including 117 individuals, followed by a series of GWAS analyses. The imputation accuracies reached 0.89 and 0.86 for allelic concordance and correlation, respectively. A quantitative trait nucleotide (QTN) affecting the chest vertebrate was detected directly, while the investigation of another QTN affecting the residual glucose failed due to the presence of similar haplotypes carrying wild-type and mutant alleles in the reference panel used in this study. A high imputation accuracy was confirmed by Sanger sequencing technology for the most significant loci. Two candidate genes, CPNE5 and MYH3, affecting meat-related traits were proposed. Collectively, we illustrated four scenarios in imputation-based GWAS that may be encountered by researchers, and our results will provide an extensive reference for future genotype imputation-based GWAS analyses.
Affiliation(s)
- Guorong Yan
- State Key Laboratory for Pig Genetic Improvement and Production Technology, Jiangxi Agricultural University, Nanchang, 330045, China
- Institute of Photomedicine, Shanghai Skin Disease Hospital, School of Medicine, Tongji University, Shanghai, 200092, China
- Xianxian Liu
- State Key Laboratory for Pig Genetic Improvement and Production Technology, Jiangxi Agricultural University, Nanchang, 330045, China
- Shijun Xiao
- State Key Laboratory for Pig Genetic Improvement and Production Technology, Jiangxi Agricultural University, Nanchang, 330045, China
- Wenshui Xin
- State Key Laboratory for Pig Genetic Improvement and Production Technology, Jiangxi Agricultural University, Nanchang, 330045, China
- Wenwu Xu
- State Key Laboratory for Pig Genetic Improvement and Production Technology, Jiangxi Agricultural University, Nanchang, 330045, China
- Yiping Li
- State Key Laboratory for Pig Genetic Improvement and Production Technology, Jiangxi Agricultural University, Nanchang, 330045, China
- Tao Huang
- State Key Laboratory for Pig Genetic Improvement and Production Technology, Jiangxi Agricultural University, Nanchang, 330045, China
- Jiangtao Qin
- State Key Laboratory for Pig Genetic Improvement and Production Technology, Jiangxi Agricultural University, Nanchang, 330045, China
- Lei Xie
- State Key Laboratory for Pig Genetic Improvement and Production Technology, Jiangxi Agricultural University, Nanchang, 330045, China
- Junwu Ma
- State Key Laboratory for Pig Genetic Improvement and Production Technology, Jiangxi Agricultural University, Nanchang, 330045, China
- Zhiyan Zhang
- State Key Laboratory for Pig Genetic Improvement and Production Technology, Jiangxi Agricultural University, Nanchang, 330045, China
- Lusheng Huang
- State Key Laboratory for Pig Genetic Improvement and Production Technology, Jiangxi Agricultural University, Nanchang, 330045, China
|
46
|
Goonasekera N, Mahmoud A, Chilton J, Afgan E. GalaxyCloudRunner: enhancing scalable computing for Galaxy. Bioinformatics 2021; 37:1763-1765. [PMID: 33104194 DOI: 10.1093/bioinformatics/btaa860] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2020] [Revised: 08/18/2020] [Accepted: 10/11/2020] [Indexed: 11/13/2022] Open
Abstract
SUMMARY: The existence of more than 100 public Galaxy servers with service quotas is indicative of the need for an increased availability of compute resources for Galaxy to use. The GalaxyCloudRunner enables a Galaxy server to easily expand its available compute capacity by sending user jobs to cloud resources. User jobs are routed to the acquired resources based on a set of configurable rules and the resources can be dynamically acquired from any of four popular cloud providers (AWS, Azure, GCP or OpenStack) in an automated fashion. AVAILABILITY AND IMPLEMENTATION: GalaxyCloudRunner is implemented in Python and leverages Docker containers. The source code is MIT licensed and available at https://github.com/cloudve/galaxycloudrunner. The documentation is available at http://gcr.cloudve.org/.
Affiliation(s)
- Nuwan Goonasekera
- Melbourne Bioinformatics, Faculty of Medicine, Dentistry & Health Sciences, University of Melbourne, Melbourne, VIC 3010, Australia
- Alexandru Mahmoud
- Department of Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- John Chilton
- Department of Biochemistry and Molecular Biology, Penn State University, State College, PA 16801, USA
- Enis Afgan
- Department of Biology, Johns Hopkins University, Baltimore, MD 21218, USA
|
47
|
Yuen D, Cabansay L, Duncan A, Luu G, Hogue G, Overbeck C, Perez N, Shands W, Steinberg D, Reid C, Olunwa N, Hansen R, Sheets E, O’Farrell A, Cullion K, O’Connor B, Paten B, Stein L. The Dockstore: enhancing a community platform for sharing reproducible and accessible computational protocols. Nucleic Acids Res 2021; 49:W624-W632. [PMID: 33978761 PMCID: PMC8218198 DOI: 10.1093/nar/gkab346] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 03/02/2021] [Revised: 04/01/2021] [Accepted: 04/26/2021] [Indexed: 11/24/2022] Open
Abstract
Dockstore (https://dockstore.org/) is an open source platform for publishing, sharing, and finding bioinformatics tools and workflows. The platform has facilitated large-scale biomedical research collaborations by using cloud technologies to increase the Findability, Accessibility, Interoperability and Reusability (FAIR) of computational resources, thereby promoting the reproducibility of complex bioinformatics analyses. Dockstore supports a variety of source repositories, analysis frameworks, and language technologies to provide a seamless publishing platform for authors to create a centralized catalogue of scientific software. The ready-to-use packaging of hundreds of tools and workflows, combined with the implementation of interoperability standards, enables users to launch analyses across multiple environments. Dockstore is widely used: more than twenty-five high-profile organizations share analysis collections through the platform in a variety of workflow languages, including the Broad Institute's GATK best practice and COVID-19 workflows (WDL), nf-core workflows (Nextflow), the Intergalactic Workflow Commission tools (Galaxy), and workflows from Seven Bridges (CWL), to highlight just a few. Here we describe the improvements made over the last four years, including the expansion of system integrations supporting authors, the addition of collaboration features and analysis platform integrations supporting users, and other enhancements that improve the overall scientific reproducibility of Dockstore content.
Affiliation(s)
- Denis Yuen
- Adaptive Oncology, Ontario Institute for Cancer Research, Toronto, Ontario M5V 3S1, Canada
- Louise Cabansay
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Andrew Duncan
- Adaptive Oncology, Ontario Institute for Cancer Research, Toronto, Ontario M5V 3S1, Canada
- Gary Luu
- Adaptive Oncology, Ontario Institute for Cancer Research, Toronto, Ontario M5V 3S1, Canada
- Gregory Hogue
- Adaptive Oncology, Ontario Institute for Cancer Research, Toronto, Ontario M5V 3S1, Canada
- Charles Overbeck
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Natalie Perez
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Walt Shands
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- David Steinberg
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Chaz Reid
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Nneka Olunwa
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Richard Hansen
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Elizabeth Sheets
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Ash O’Farrell
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Kim Cullion
- Adaptive Oncology, Ontario Institute for Cancer Research, Toronto, Ontario M5V 3S1, Canada
- Benedict Paten
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95060, USA
- Lincoln Stein
- Adaptive Oncology, Ontario Institute for Cancer Research, Toronto, Ontario M5V 3S1, Canada
|
48
|
Sun B, Yeh J. Onco-fertility and personalized testing for potential for loss of ovarian reserve in patients undergoing chemotherapy: proposed next steps for development of genetic testing to predict changes in ovarian reserve. FERTILITY RESEARCH AND PRACTICE 2021; 7:13. [PMID: 34193292 PMCID: PMC8244159 DOI: 10.1186/s40738-021-00105-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 01/22/2021] [Accepted: 05/19/2021] [Indexed: 12/29/2022]
Abstract
Women of reproductive age undergoing chemotherapy face the risk of irreversible ovarian insufficiency. Current methods of ovarian reserve testing do not accurately predict future reproductive potential for patients undergoing chemotherapy. Genetic markers that more accurately predict the reproductive potential of each patient undergoing chemotherapy would be critical tools for evidence-based fertility preservation counselling. To assess possible approaches to developing personalized genetic testing for these patients, we review current literature regarding mechanisms of ovarian damage due to chemotherapy and genetic variants associated with both the damage mechanisms and primary ovarian insufficiency. The medical literature points to a number of genetic variants associated with mechanisms of ovarian damage and primary ovarian insufficiency. Those variants that appear at a higher frequency, with known pathways, may be considered as potential genetic markers for predictive ovarian reserve testing. We propose developing personalized testing of the potential for loss of ovarian function for patients with cancer, prior to chemotherapy treatment. There are advantages to using genetic markers complementary to the current ovarian reserve markers of AMH, antral follicle count and day 3 FSH as predictors of preservation of fertility after chemotherapy. Genetic markers will help identify upstream pathways leading to high risk of ovarian failure not detected by present clinical markers. Their predictive value is mechanism-based and will encourage research towards understanding the multiple pathways contributing to ovarian failure after chemotherapy.
Affiliation(s)
- Bei Sun
- Sackler School of Medicine, New York State/American Program of Tel Aviv University, Tel Aviv University, Ramat Aviv 69978, Tel Aviv, Israel
- John Yeh
- Division of Reproductive Endocrinology and Infertility, Department of Obstetrics & Gynecology, University of Massachusetts Medical School, UMass Memorial Medical Center, 119 Belmont Street, Worcester, MA, 01605, USA
|
49
|
Zayas-Cabán T, Chaney KJ, Rogers CC, Denny JC, White PJ. Meeting the challenge: Health information technology's essential role in achieving precision medicine. J Am Med Inform Assoc 2021; 28:1345-1352. [PMID: 33749793 PMCID: PMC8263078 DOI: 10.1093/jamia/ocab032] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/12/2020] [Accepted: 02/09/2021] [Indexed: 12/20/2022] Open
Abstract
Precision medicine can revolutionize health care by tailoring treatments to individual patient needs. Advancing precision medicine requires evidence development through research that combines needed data, including clinical data, at an unprecedented scale. Widespread adoption of health information technology (IT) has made digital clinical data broadly available. These data and information systems must evolve to support precision medicine research and delivery. Specifically, relevant health IT data, infrastructure, clinical integration, and policy needs must be addressed. This article outlines those needs and describes work the Office of the National Coordinator for Health Information Technology is leading to improve health IT through pilot projects and standards and policy development. The Office of the National Coordinator for Health Information Technology will build on these efforts and continue to coordinate with other key stakeholders to achieve the vision of precision medicine. Advancement of precision medicine will require ongoing, collaborative health IT policy and technical initiatives that advance discovery and transform healthcare delivery.
Affiliation(s)
- Teresa Zayas-Cabán
- Office of the National Coordinator for Health Information Technology, U.S. Department of Health and Human Services, Washington, DC, USA
- Kevin J Chaney
- Office of the National Coordinator for Health Information Technology, U.S. Department of Health and Human Services, Washington, DC, USA
- Courtney C Rogers
- Department of Engineering Systems and Environment, University of Virginia, Charlottesville, Virginia, USA
- Joshua C Denny
- All of Us Research Program, National Institutes of Health, Bethesda, Maryland, USA
- P. Jon White
- Veterans Affairs Salt Lake City Health Care System, Salt Lake City, Utah, USA
- Department of Internal Medicine, University of Utah, Salt Lake City, Utah, USA
|
50
|
Abstract
Brain scientists are now capable of collecting more data in a single experiment than researchers a generation ago might have collected over an entire career. Indeed, the brain itself seems to thirst for more and more data. Such digital information not only comprises individual studies but is also increasingly shared and made openly available for secondary, confirmatory, and/or combined analyses. Numerous web resources now exist containing data across spatiotemporal scales. Data processing workflow technologies running via cloud-enabled computing infrastructures allow for large-scale processing. Such a move toward greater openness is fundamentally changing how brain science results are communicated and linked to available raw data and processed results. Ethical, professional, and motivational issues challenge the whole-scale commitment to data-driven neuroscience. Nevertheless, fueled by government investments into primary brain data collection coupled with increased sharing and community pressure challenging the dominant publishing model, large-scale brain and data science is here to stay.
Affiliation(s)
- John Darrell Van Horn
- Department of Psychology, University of Virginia, Charlottesville, Virginia, USA
- School of Data Science, University of Virginia, Charlottesville, Virginia, USA
|