1
Abueg LAL, Afgan E, Allart O, Awan AH, Bacon WA, Baker D, Bassetti M, Batut B, Bernt M, Blankenberg D, Bombarely A, Bretaudeau A, Bromhead CJ, Burke ML, Capon PK, Čech M, Chavero-Díez M, Chilton JM, Collins TJ, Coppens F, Coraor N, Cuccuru G, Cumbo F, Davis J, De Geest PF, de Koning W, Demko M, DeSanto A, Begines JMD, Doyle MA, Droesbeke B, Erxleben-Eggenhofer A, Föll MC, Formenti G, Fouilloux A, Gangazhe R, Genthon T, Goecks J, Beltran ANG, Goonasekera NA, Goué N, Griffin TJ, Grüning BA, Guerler A, Gundersen S, Gustafsson OJR, Hall C, Harrop TW, Hecht H, Heidari A, Heisner T, Heyl F, Hiltemann S, Hotz HR, Hyde CJ, Jagtap PD, Jakiela J, Johnson JE, Joshi J, Jossé M, Jum’ah K, Kalaš M, Kamieniecka K, Kayikcioglu T, Konkol M, Kostrykin L, Kucher N, Kumar A, Kuntz M, Lariviere D, Lazarus R, Bras YL, Corguillé GL, Lee J, Leo S, Liborio L, Libouban R, Tabernero DL, Lopez-Delisle L, Los LS, Mahmoud A, Makunin I, Marin P, Mehta S, Mok W, Moreno PA, Morier-Genoud F, Mosher S, Müller T, Nasr E, Nekrutenko A, Nelson TM, Oba AJ, Ostrovsky A, Polunina PV, Poterlowicz K, Price EJ, Price GR, Rasche H, Raubenolt B, Royaux C, Sargent L, Savage MT, Savchenko V, Savchenko D, Schatz MC, Seguineau P, Serrano-Solano B, Soranzo N, Srikakulam SK, Suderman K, Syme AE, Tangaro MA, Tedds JA, Tekman M, Cheng (Mike) Thang W, Thanki AS, Uhl M, van den Beek M, Varshney D, Vessio J, Videm P, Von Kuster G, Watson GR, Whitaker-Allen N, Winter U, Wolstencroft M, Zambelli F, Zierep P, Zoabi R. The Galaxy platform for accessible, reproducible, and collaborative data analyses: 2024 update. Nucleic Acids Res 2024; 52:W83-W94. [PMID: 38769056 PMCID: PMC11223835 DOI: 10.1093/nar/gkae410] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2024] [Revised: 04/18/2024] [Accepted: 05/02/2024] [Indexed: 05/22/2024] Open
Abstract
Galaxy (https://galaxyproject.org) is deployed globally, predominantly through free-to-use services, supporting user-driven research that broadens in scope each year. Users are attracted to public Galaxy services by platform stability, tool and reference dataset diversity, training, support and integration, which enables complex, reproducible, shareable data analysis. Applying the principles of user experience design (UXD) has driven improvements in accessibility, tool discoverability through Galaxy Labs/subdomains, and a redesigned Galaxy ToolShed. Galaxy tool capabilities are progressing in two strategic directions: integrating general purpose graphical processing units (GPGPU) access for cutting-edge methods, and licensed tool support. Engagement with global research consortia is being increased by developing more workflows in Galaxy and by resourcing the public Galaxy services to run them. The Galaxy Training Network (GTN) portfolio has grown in both size and accessibility, through learning paths and direct integration with Galaxy tools that feature in training courses. Code development continues in line with the Galaxy Project roadmap, with improvements to job scheduling and the user interface. Environmental impact assessment is also helping engage users and developers, reminding them of their role in sustainability, by displaying estimated CO2 emissions generated by each Galaxy job.
2
Page ML, Aguzzoli Heberle B, Brandon JA, Wadsworth ME, Gordon LA, Nations KA, Ebbert MTW. Surveying the landscape of RNA isoform diversity and expression across 9 GTEx tissues using long-read sequencing data. bioRxiv 2024:2024.02.13.579945. [PMID: 38405825 PMCID: PMC10888753 DOI: 10.1101/2024.02.13.579945] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2024]
Abstract
Even though alternative RNA splicing was discovered nearly 50 years ago (1977), we still understand very little about most isoforms arising from a single gene, including in which tissues they are expressed and whether their functions differ. Human gene annotations suggest remarkable transcriptional complexity, with approximately 252,798 distinct RNA isoform annotations from 62,710 gene bodies (Ensembl v109; 2023), emphasizing the need to understand their biological effects. For example, 256 gene bodies have ≥50 annotated isoforms and 30 have ≥100, where one protein-coding gene (MAPK10) even has 192 distinct RNA isoform annotations. Whether such isoform diversity results from biological redundancy or spurious alternative splicing (i.e., noise), or whether individual isoforms have specialized functions (even if subtle) remains a mystery for most genes. Recent studies by Aguzzoli-Heberle et al., Leung et al., and Glinos et al. demonstrated that long-read RNAseq enables improved RNA isoform quantification for essentially any tissue, cell type, or biological condition (e.g., disease, development, aging, etc.), making it possible to better assess individual isoform expression and function. While each study provided important discoveries related to RNA isoform diversity, deeper exploration is needed. We sought to quantify and characterize real isoform usage across tissues (compared to annotations). We used long-read RNAseq data from 58 GTEx samples across nine tissues (three brain, two heart, muscle, lung, liver, and cultured fibroblasts) generated by Glinos et al. and found considerable isoform diversity within and across tissues. Cerebellar hemisphere was the most transcriptionally complex tissue (22,522 distinct isoforms; 3,726 unique); liver was least diverse (12,435 distinct isoforms; 1,039 unique). We highlight gene clusters exhibiting high tissue-specific isoform diversity per tissue (e.g., TPM1 expresses 19 isoforms in the heart's atrial appendage). We also validated 447 of the 700 new isoforms discovered by Aguzzoli-Heberle et al. and found that 88 were expressed in all nine tissues, while 58 were specific to a single tissue. This study represents a broad survey of the RNA isoform landscape, demonstrating isoform diversity across nine tissues and emphasizing the need to better understand how individual isoforms from a single gene body contribute to human health and disease.
Affiliation(s)
- Madeline L. Page
- Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY
- Division of Biomedical Informatics, Department of Internal Medicine, College of Medicine, University of Kentucky, Lexington, KY
- Department of Neuroscience, College of Medicine, University of Kentucky, Lexington, KY
- Bernardo Aguzzoli Heberle
- Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY
- Division of Biomedical Informatics, Department of Internal Medicine, College of Medicine, University of Kentucky, Lexington, KY
- Department of Neuroscience, College of Medicine, University of Kentucky, Lexington, KY
- J. Anthony Brandon
- Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY
- Division of Biomedical Informatics, Department of Internal Medicine, College of Medicine, University of Kentucky, Lexington, KY
- Department of Neuroscience, College of Medicine, University of Kentucky, Lexington, KY
- Mark E. Wadsworth
- Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY
- Division of Biomedical Informatics, Department of Internal Medicine, College of Medicine, University of Kentucky, Lexington, KY
- Department of Neuroscience, College of Medicine, University of Kentucky, Lexington, KY
- Lacey A. Gordon
- Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY
- Division of Biomedical Informatics, Department of Internal Medicine, College of Medicine, University of Kentucky, Lexington, KY
- Department of Neuroscience, College of Medicine, University of Kentucky, Lexington, KY
- Kayla A. Nations
- Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY
- Division of Biomedical Informatics, Department of Internal Medicine, College of Medicine, University of Kentucky, Lexington, KY
- Department of Neuroscience, College of Medicine, University of Kentucky, Lexington, KY
- Mark T. W. Ebbert
- Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY
- Division of Biomedical Informatics, Department of Internal Medicine, College of Medicine, University of Kentucky, Lexington, KY
- Department of Neuroscience, College of Medicine, University of Kentucky, Lexington, KY
3
Allers S, O’Connell KA, Carlson T, Belardo D, King BL. Reusable tutorials for using cloud-based computing environments for the analysis of bacterial gene expression data from bulk RNA sequencing. Brief Bioinform 2024; 25:bbae301. [PMID: 38997128 PMCID: PMC11245317 DOI: 10.1093/bib/bbae301] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Revised: 05/29/2024] [Accepted: 06/07/2024] [Indexed: 07/14/2024] Open
Abstract
This manuscript describes the development of a resource module that is part of a learning platform named "NIGMS Sandbox for Cloud-based Learning" (https://github.com/NIGMS/NIGMS-Sandbox). The overall genesis of the Sandbox is described in the editorial NIGMS Sandbox at the beginning of this Supplement. This module delivers learning materials on RNA sequencing (RNAseq) data analysis in an interactive format that uses appropriate cloud resources for data access and analyses. Biomedical research is increasingly data-driven, and dependent upon data management and analysis methods that facilitate rigorous, robust, and reproducible research. Cloud-based computing resources provide opportunities to broaden the application of bioinformatics and data science in research. Two obstacles for researchers, particularly those at small institutions, are: (i) access to bioinformatics analysis environments tailored to their research; and (ii) training in how to use Cloud-based computing resources. We developed five reusable tutorials for bulk RNAseq data analysis to address these obstacles. Using Jupyter notebooks run on the Google Cloud Platform, the tutorials guide the user through a workflow featuring an RNAseq dataset from a study of prophage-altered drug resistance in Mycobacterium chelonae. The first tutorial uses a subset of the data so users can learn analysis steps rapidly, and the second uses the entire dataset. Next, a tutorial demonstrates how to analyze the read count data to generate lists of differentially expressed genes using R/DESeq2. Additional tutorials generate read counts using the Snakemake workflow manager and Nextflow with Google Batch. All tutorials are open-source and can be used as templates for other analyses.
Affiliation(s)
- Steven Allers
- Department of Molecular and Biomedical Sciences, University of Maine, 5735 Hitchner Hall, Orono, ME 04469, United States
- Kyle A O’Connell
- Center for Information Technology, National Institutes of Health, 6555 Rock Spring Dr, Bethesda, MD 20817, United States
- Health Data and AI, Deloitte Consulting LLP, 1919 N. Lynn St, Arlington, VA 22203, United States
- Thad Carlson
- Center for Information Technology, National Institutes of Health, 6555 Rock Spring Dr, Bethesda, MD 20817, United States
- Health Data and AI, Deloitte Consulting LLP, 1919 N. Lynn St, Arlington, VA 22203, United States
- David Belardo
- Google Cloud, Google, 1900 Reston Metro Plaza, Reston, VA 20190, United States
- Benjamin L King
- Department of Molecular and Biomedical Sciences, University of Maine, 5735 Hitchner Hall, Orono, ME 04469, United States
- Maine Institutional Development Award Network of Biomedical Research Excellence (INBRE) Data Science Core, MDI Biological Laboratory, 159 Old Bar Harbor Rd, Bar Harbor, ME 04609, United States
- Graduate School of Biomedical Science and Engineering, University of Maine, 5775 Stodder Hall, Orono, ME 04469, United States
4
Oh S, Gravel-Pucillo K, Ramos M, Davis S, Carey V, Morgan M, Waldron L. AnVILWorkflow: A runnable workflow package for Cloud-implemented bioinformatics analysis pipelines. Research Square 2024:rs.3.rs-4370115. [PMID: 38798429 PMCID: PMC11118690 DOI: 10.21203/rs.3.rs-4370115/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]
Abstract
Advancements in sequencing technologies and the development of new data collection methods produce large volumes of biological data. The Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) provides a cloud-based platform for democratizing access to large-scale genomics data and analysis tools. However, utilizing the full capabilities of AnVIL can be challenging for researchers without extensive bioinformatics expertise, especially for executing complex workflows. Here we present the AnVILWorkflow R package, which enables the convenient execution of bioinformatics workflows hosted on AnVIL directly from an R environment. AnVILWorkflow simplifies the setup of the cloud computing environment, input data formatting, workflow submission, and retrieval of results through intuitive functions. We demonstrate the utility of AnVILWorkflow for three use cases: bulk RNA-seq analysis with Salmon, metagenomics analysis with bioBakery, and digital pathology image processing with PathML. The key features of AnVILWorkflow include user-friendly browsing of available data and workflows, seamless integration of R and non-R tools within a reproducible analysis pipeline, and accessibility to scalable computing resources without direct management overhead. While some limitations exist around workflow customization, AnVILWorkflow lowers the barrier to taking advantage of AnVIL's resources, especially for exploratory analyses or bulk processing with established workflows. This empowers a broader community of researchers to leverage the latest genomics tools and datasets using familiar R syntax. This package is distributed through the Bioconductor project (https://bioconductor.org/packages/AnVILWorkflow), and the source code is available through GitHub (https://github.com/shbrief/AnVILWorkflow).
Affiliation(s)
- Sehyun Oh
- City University of New York School of Public Health
- Marcel Ramos
- City University of New York School of Public Health
- Sean Davis
- University of Colorado Anschutz School of Medicine
- Levi Waldron
- City University of New York School of Public Health
5
Kim E, Davidsen T, Davis-Dusenbery BN, Baumann A, Maggio A, Chen Z, Meerzaman D, Casas-Silva E, Pot D, Pihl T, Otridge J, Shalley E, Barnholtz-Sloan JS, Kerlavage AR. NCI Cancer Research Data Commons: Lessons Learned and Future State. Cancer Res 2024; 84:1404-1409. [PMID: 38488510 PMCID: PMC11063686 DOI: 10.1158/0008-5472.can-23-2730] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2023] [Revised: 01/19/2024] [Accepted: 03/05/2024] [Indexed: 05/03/2024]
Abstract
More than ever, scientific progress in cancer research hinges on our ability to combine datasets and extract meaningful interpretations to better understand diseases and ultimately inform the development of better treatments and diagnostic tools. To enable the successful sharing and use of big data, the NCI developed the Cancer Research Data Commons (CRDC), providing access to a large, comprehensive, and expanding collection of cancer data. The CRDC is a cloud-based data science infrastructure that eliminates the need for researchers to download and store large-scale datasets by allowing them to perform analysis where data reside. Over the past 10 years, the CRDC has made significant progress in providing access to data and tools along with training and outreach to support the cancer research community. In this review, we provide an overview of the history and the impact of the CRDC to date, lessons learned, and future plans to further promote data sharing, accessibility, interoperability, and reuse. See related articles by Brady et al., p. 1384, Wang et al., p. 1388, and Pot et al., p. 1396.
Affiliation(s)
- Erika Kim
- Center for Biomedical Informatics and Information Technology, NCI, Rockville, Maryland
- Tanja Davidsen
- Center for Biomedical Informatics and Information Technology, NCI, Rockville, Maryland
- Zhaoyi Chen
- Center for Biomedical Informatics and Information Technology, NCI, Rockville, Maryland
- NIH, Bethesda, Maryland
- Daoud Meerzaman
- Center for Biomedical Informatics and Information Technology, NCI, Rockville, Maryland
- Esmeralda Casas-Silva
- Center for Biomedical Informatics and Information Technology, NCI, Rockville, Maryland
- David Pot
- General Dynamics Information Technology, Falls Church, Virginia
- Todd Pihl
- Frederick National Laboratory for Cancer Research, Frederick, Maryland
- John Otridge
- Frederick National Laboratory for Cancer Research, Frederick, Maryland
- Eve Shalley
- Essex, an Emmes Company, Rockville, Maryland
- Jill S. Barnholtz-Sloan
- Center for Biomedical Informatics and Information Technology, NCI, Rockville, Maryland
- Division of Cancer Epidemiology and Genetics, NCI, Rockville, Maryland
- Anthony R. Kerlavage
- Center for Biomedical Informatics and Information Technology, NCI, Rockville, Maryland
6
Willsey HR, Seaby EG, Godwin A, Ennis S, Guille M, Grainger RM. Modelling human genetic disorders in Xenopus tropicalis. Dis Model Mech 2024; 17:dmm050754. [PMID: 38832520 PMCID: PMC11179720 DOI: 10.1242/dmm.050754] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2024] Open
Abstract
Recent progress in human disease genetics is leading to rapid advances in understanding pathobiological mechanisms. However, the sheer number of risk-conveying genetic variants being identified demands in vivo model systems that are amenable to functional analyses at scale. Here we provide a practical guide for using the diploid frog species Xenopus tropicalis to study many genes and variants to uncover conserved mechanisms of pathobiology relevant to human disease. We discuss key considerations in modelling human genetic disorders: genetic architecture, conservation, phenotyping strategy and rigour, as well as more complex topics, such as penetrance, expressivity, sex differences and current challenges in the field. As the patient-driven gene discovery field expands significantly, the cost-effective, rapid and higher-throughput nature of Xenopus makes it an essential member of the model organism armamentarium for understanding gene function in development and in relation to disease.
Affiliation(s)
- Helen Rankin Willsey
- Department of Psychiatry and Behavioral Sciences, Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA 94158, USA
- Chan Zuckerberg Biohub - San Francisco, San Francisco, CA 94518, USA
- Eleanor G Seaby
- Genomic Informatics Group, Faculty of Medicine, University of Southampton, Southampton SO16 6YD, UK
- Annie Godwin
- European Xenopus Resource Centre (EXRC), School of Biological Sciences, University of Portsmouth, Portsmouth PO1 2DY, UK
- Sarah Ennis
- Genomic Informatics Group, Faculty of Medicine, University of Southampton, Southampton SO16 6YD, UK
- Matthew Guille
- European Xenopus Resource Centre (EXRC), School of Biological Sciences, University of Portsmouth, Portsmouth PO1 2DY, UK
- Robert M Grainger
- Department of Biology, University of Virginia, Charlottesville, VA 22904, USA
7
Jentsch M, Schneider-Lunitz V, Taron U, Braun M, Ishaque N, Wagener H, Conrad C, Twardziok S. Creating cloud platforms for supporting FAIR data management in biomedical research projects. F1000Res 2024; 13:8. [PMID: 38779317 PMCID: PMC11109697 DOI: 10.12688/f1000research.140624.3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 04/25/2024] [Indexed: 05/25/2024] Open
Abstract
Biomedical research projects are becoming increasingly complex and require technological solutions that support all phases of the data lifecycle and application of the FAIR principles. At the Berlin Institute of Health (BIH), we have developed and established a flexible and cost-effective approach to building customized cloud platforms for supporting research projects. The approach is based on a microservice architecture and on the management of a portfolio of supported services. On this basis, we created and maintained cloud platforms for several international research projects. In this article, we present our approach and argue that building customized cloud platforms can offer multiple advantages over using multi-project platforms. Our approach is transferable to other research environments and can be easily adapted by other projects and other service providers.
Affiliation(s)
- Marcel Jentsch
- Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Center of Digital Health, Berlin, 10117, Germany
- Valentin Schneider-Lunitz
- Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Center of Digital Health, Berlin, 10117, Germany
- Ulrike Taron
- Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Center of Digital Health, Berlin, 10117, Germany
- Martin Braun
- Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Center of Digital Health, Berlin, 10117, Germany
- Naveed Ishaque
- Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Center of Digital Health, Berlin, 10117, Germany
- Harald Wagener
- Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Center of Digital Health, Berlin, 10117, Germany
- Christian Conrad
- Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Center of Digital Health, Berlin, 10117, Germany
- Sven Twardziok
- Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Center of Digital Health, Berlin, 10117, Germany
8
Duyzend MH, Cacheiro P, Jacobsen JO, Giordano J, Brand H, Wapner RJ, Talkowski ME, Robinson PN, Smedley D. Improving prenatal diagnosis through standards and aggregation. Prenat Diagn 2024; 44:454-464. [PMID: 38242839 PMCID: PMC11006584 DOI: 10.1002/pd.6522] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Revised: 12/17/2023] [Accepted: 12/22/2023] [Indexed: 01/21/2024]
Abstract
Advances in sequencing and imaging technologies enable enhanced assessment in the prenatal space, with a goal to diagnose and predict the natural history of disease, to direct targeted therapies, and to implement clinical management, including transfer of care, election of supportive care, and selection of surgical interventions. The current lack of standardization and aggregation stymies variant interpretation and gene discovery, which hinders the provision of prenatal precision medicine, leaving clinicians and patients without an accurate diagnosis. With large amounts of data generated, it is imperative to establish standards for data collection, processing, and aggregation. Aggregated and homogeneously processed genetic and phenotypic data permit dissection of the genomic architecture of prenatal presentations of disease and provide a dataset on which data analysis algorithms can be tuned to the prenatal space. Here we discuss the importance of generating aggregate data sets and how the prenatal space is driving the development of interoperable standards and phenotype-driven tools.
Affiliation(s)
- Michael H. Duyzend
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Program in Medical and Population Genetics, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Division of Genetics and Genomics, Department of Pediatrics, Boston Children’s Hospital and Harvard Medical School, Boston, MA, USA
- Pilar Cacheiro
- William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London EC1M 6BQ, UK
- Julius O.B. Jacobsen
- William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London EC1M 6BQ, UK
- Jessica Giordano
- Department of Obstetrics & Gynecology, Columbia University Medical Center, New York, NY, USA
- Harrison Brand
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Program in Medical and Population Genetics, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Neurology, Harvard Medical School, Boston, MA, USA
- Ronald J. Wapner
- Department of Obstetrics & Gynecology, Columbia University Medical Center, New York, NY, USA
- Michael E. Talkowski
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Program in Medical and Population Genetics, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Neurology, Harvard Medical School, Boston, MA, USA
- Program in Biological and Biomedical Sciences, Division of Medical Sciences, Harvard Medical School, Boston, MA, USA
- Program in Bioinformatics and Integrative Genomics, Division of Medical Sciences, Harvard Medical School, Boston, MA, USA
- Peter N. Robinson
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Institute for Systems Genomics, University of Connecticut, Farmington, CT 06032, USA
- Damian Smedley
- William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London EC1M 6BQ, UK
9
Baek J, Lawson J, Rahimzadeh V. Investigating the Roles and Responsibilities of Institutional Signing Officials After Data Sharing Policy Reform for Federally Funded Research in the United States: National Survey. JMIR Form Res 2024; 8:e49822. [PMID: 38506894 PMCID: PMC10993121 DOI: 10.2196/49822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2023] [Revised: 01/04/2024] [Accepted: 01/07/2024] [Indexed: 03/21/2024] Open
Abstract
BACKGROUND: New federal policies along with rapid growth in data generation, storage, and analysis tools are together driving scientific data sharing in the United States. At the same time, triangulating human research data from diverse sources can also create situations where data are used for future research in ways that individuals and communities may consider objectionable. Institutional gatekeepers, namely, signing officials (SOs), are therefore at the helm of compliant management and sharing of human data for research. Of those with data governance responsibilities, SOs most often serve as signatories for investigators who deposit, access, and share research data between institutions. Although SOs play important leadership roles in compliant data sharing, we know surprisingly little about their scope of work, roles, and oversight responsibilities.
OBJECTIVE: The purpose of this study was to describe existing institutional policies and practices of US SOs who manage human genomic data access, as well as how these may change in the wake of new Data Management and Sharing requirements for National Institutes of Health-funded research in the United States.
METHODS: We administered an anonymous survey to institutional SOs recruited from biomedical research institutions across the United States. Survey items probed where data generated from extramurally funded research are deposited, how researchers outside the institution access these data, and what happens to these data after extramural funding ends.
RESULTS: In total, 56 institutional SOs participated in the survey. We found that SOs frequently approve duplicate data deposits and impose stricter access controls when data use limitations are unclear or unspecified. In addition, 21% (n=12) of SOs knew where data from federally funded projects are deposited after project funding sunsets. As a consequence, most investigators deposit their scientific data into "a National Institutes of Health-funded repository" to meet the Data Management and Sharing requirements but also within the "institution's own repository" or a third-party repository.
CONCLUSIONS: Our findings inform 5 policy recommendations and best practices for US SOs to improve coordination and develop comprehensive and consistent data governance policies that balance the need for scientific progress with effective human data protections.
Affiliation(s)
- Vasiliki Rahimzadeh
- Center for Medical Ethics and Health Policy, Baylor College of Medicine, Houston, TX, United States
10
Serizay J, Matthey-Doret C, Bignaud A, Baudry L, Koszul R. Orchestrating chromosome conformation capture analysis with Bioconductor. Nat Commun 2024; 15:1072. [PMID: 38316789 PMCID: PMC10844600 DOI: 10.1038/s41467-024-44761-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2023] [Accepted: 12/28/2023] [Indexed: 02/07/2024] Open
Abstract
Genome-wide chromatin conformation capture assays provide formidable insights into the spatial organization of genomes. However, due to the complexity of the data structure, their integration in multi-omics workflows remains challenging. We present data structures, computational methods and visualization tools available in Bioconductor to investigate Hi-C, micro-C and other 3C-related data in R. An online book (https://bioconductor.org/books/OHCA/) further provides prospective end users with a number of workflows to process, import, analyze and visualize any type of chromosome conformation capture data.
Affiliation(s)
- Jacques Serizay
- Institut Pasteur, CNRS UMR3525, Université Paris Cité, Unité Régulation Spatiale des Génomes, Paris, France
- Cyril Matthey-Doret
- Institut Pasteur, CNRS UMR3525, Université Paris Cité, Unité Régulation Spatiale des Génomes, Paris, France
- Sorbonne Université, Collège Doctoral, Paris, France
- Swiss Data Science Center, École Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland
- Amaury Bignaud
- Institut Pasteur, CNRS UMR3525, Université Paris Cité, Unité Régulation Spatiale des Génomes, Paris, France
- Sorbonne Université, Collège Doctoral, Paris, France
- Lyam Baudry
- Institut Pasteur, CNRS UMR3525, Université Paris Cité, Unité Régulation Spatiale des Génomes, Paris, France
- Sorbonne Université, Collège Doctoral, Paris, France
- Université de Lausanne, Center for Integrative Genomics, Quartier Sorge, 1015, Lausanne, Switzerland
- Romain Koszul
- Institut Pasteur, CNRS UMR3525, Université Paris Cité, Unité Régulation Spatiale des Génomes, Paris, France
11
Gaddis N, Fortriede J, Guo M, Bardes EE, Kouril M, Tabar S, Burns K, Ardini-Poleske ME, Loos S, Schnell D, Jin K, Iyer B, Du Y, Huo BX, Bhattacharjee A, Korte J, Munshi R, Smith V, Herbst A, Kitzmiller JA, Clair GC, Carson JP, Adkins J, Morrisey EE, Pryhuber GS, Misra R, Whitsett JA, Sun X, Heathorn T, Paten B, Prasath VBS, Xu Y, Tickle T, Aronow BJ, Salomonis N. LungMAP Portal Ecosystem: Systems-level Exploration of the Lung. Am J Respir Cell Mol Biol 2024; 70:129-139. [PMID: 36413377 PMCID: PMC10848697 DOI: 10.1165/rcmb.2022-0165oc] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 04/16/2022] [Accepted: 11/21/2022] [Indexed: 11/23/2022] Open
Abstract
An improved understanding of the human lung necessitates advanced systems models informed by an ever-increasing repertoire of molecular omics, cellular imaging, and pathological datasets. To centralize and standardize information across broad lung research efforts, we expanded the LungMAP.net website into a new gateway portal. This portal connects a broad spectrum of research networks, bulk and single-cell multiomics data, and a diverse collection of image data that span mammalian lung development and disease. The data are standardized across species and technologies using harmonized data and metadata models that leverage recent advances, including those from the Human Cell Atlas, diverse ontologies, and the LungMAP CellCards initiative. To cultivate future discoveries, we have aggregated a diverse collection of single-cell atlases for multiple species (human, rhesus, and mouse) to enable consistent queries across technologies, cohorts, age, disease, and drug treatment. These atlases are provided as independent and integrated queryable datasets, with an emphasis on dynamic visualization, figure generation, reanalysis, cell-type curation, and automated reference-based classification of user-provided single-cell genomics datasets (Azimuth). As this resource grows, we intend to increase the breadth of available interactive interfaces, supported data types, data portals and datasets from LungMAP, and external research efforts.
Affiliation(s)
- Nathan Gaddis
- RTI International, Research Triangle Park, North Carolina
| | - Joshua Fortriede
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
| | - Minzhe Guo
- Division of Pulmonary Biology, The Perinatal Institute, and
| | - Eric E. Bardes
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
| | - Michal Kouril
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
| | - Scott Tabar
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
| | - Kevin Burns
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
| | | | - Stephanie Loos
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
| | - Daniel Schnell
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
| | - Kang Jin
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
| | - Balaji Iyer
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
- Department of Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, Ohio
| | - Yina Du
- Division of Pulmonary Biology, The Perinatal Institute, and
| | - Bing-Xing Huo
- Data Sciences Platform, The Broad Institute of MIT and Harvard, Cambridge, Massachusetts
| | - Anukana Bhattacharjee
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
| | - Jeff Korte
- Data Sciences Platform, The Broad Institute of MIT and Harvard, Cambridge, Massachusetts
| | - Ruchi Munshi
- Data Sciences Platform, The Broad Institute of MIT and Harvard, Cambridge, Massachusetts
| | - Victoria Smith
- Data Sciences Platform, The Broad Institute of MIT and Harvard, Cambridge, Massachusetts
| | - Andrew Herbst
- Data Sciences Platform, The Broad Institute of MIT and Harvard, Cambridge, Massachusetts
| | | | - Geremy C. Clair
- Biological Science Division, Pacific Northwest National Laboratory, Richland, Washington
| | - James P. Carson
- Texas Advanced Computing Center, University of Texas at Austin, Austin, Texas
| | - Joshua Adkins
- Biological Science Division, Pacific Northwest National Laboratory, Richland, Washington
| | - Edward E. Morrisey
- Department of Medicine and
- Penn-CHOP Lung Biology Institute, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
| | - Gloria S. Pryhuber
- Department of Pediatrics, University of Rochester Medical Center, Rochester, New York
| | - Ravi Misra
- Department of Pediatrics, University of Rochester Medical Center, Rochester, New York
| | - Jeffrey A. Whitsett
- Division of Pulmonary Biology, The Perinatal Institute, and
- Department of Pediatrics, University of Cincinnati School of Medicine, Cincinnati, Ohio
| | - Xin Sun
- Department of Pediatrics and
- Department of Biological Sciences, University of California, San Diego, San Diego, California; and
| | - Trevor Heathorn
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California
| | - V. B. Surya Prasath
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
- Department of Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, Ohio
- Department of Pediatrics, University of Cincinnati School of Medicine, Cincinnati, Ohio
| | - Yan Xu
- Division of Pulmonary Biology, The Perinatal Institute, and
- Department of Pediatrics, University of Cincinnati School of Medicine, Cincinnati, Ohio
| | - Tim Tickle
- Data Sciences Platform, The Broad Institute of MIT and Harvard, Cambridge, Massachusetts
| | - Bruce J. Aronow
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
- Department of Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, Ohio
- Department of Pediatrics, University of Cincinnati School of Medicine, Cincinnati, Ohio
| | - Nathan Salomonis
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio
- Department of Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, Ohio
- Department of Pediatrics, University of Cincinnati School of Medicine, Cincinnati, Ohio
| |
|
12
|
Yang S, Gaietto K, Chen W. Mapping a New Course to Understand Lung Biology Mechanisms: LungMAP.net. Am J Respir Cell Mol Biol 2024; 70:91-93. [PMID: 38109690 PMCID: PMC10848696 DOI: 10.1165/rcmb.2023-0439ed] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2023] [Accepted: 12/18/2023] [Indexed: 12/20/2023] Open
Affiliation(s)
- Sheng Yang
- Department of Biostatistics Nanjing Medical University Nanjing, Jiangsu, China
| | - Kristina Gaietto
- Department of Pediatrics University of Pittsburgh School of Medicine Pittsburgh, Pennsylvania
| | - Wei Chen
- Department of Pediatrics University of Pittsburgh School of Medicine Pittsburgh, Pennsylvania
| |
|
13
|
Mahmoud M, Huang Y, Garimella K, Audano PA, Wan W, Prasad N, Handsaker RE, Hall S, Pionzio A, Schatz MC, Talkowski ME, Eichler EE, Levy SE, Sedlazeck FJ. Utility of long-read sequencing for All of Us. Nat Commun 2024; 15:837. [PMID: 38281971 PMCID: PMC10822842 DOI: 10.1038/s41467-024-44804-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2022] [Accepted: 01/03/2024] [Indexed: 01/30/2024] Open
Abstract
The All of Us (AoU) initiative aims to sequence the genomes of over one million Americans from diverse ethnic backgrounds to improve personalized medical care. In a recent technical pilot, we compare the performance of traditional short-read sequencing with long-read sequencing in a small cohort of samples from the HapMap project and two AoU control samples representing eight datasets. Our analysis reveals substantial differences in the ability of these technologies to accurately sequence complex medically relevant genes, particularly in terms of gene coverage and pathogenic variant identification. We also consider the advantages and challenges of using low coverage sequencing to increase sample numbers in large cohort analysis. Our results show that HiFi reads produce the most accurate results for both small and large variants. Further, we present a cloud-based pipeline to optimize SNV, indel and SV calling at scale for long-reads analysis. These results lead to widespread improvements across AoU.
Affiliation(s)
- M Mahmoud
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | - Y Huang
- Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA, 02141, USA
| | - K Garimella
- Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA, 02141, USA
| | - P A Audano
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
| | - W Wan
- Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA, 02141, USA
| | - N Prasad
- Discovery Life Sciences, Huntsville, AL, 35806, USA
| | - R E Handsaker
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02141, USA
| | - S Hall
- Discovery Life Sciences, Huntsville, AL, 35806, USA
| | - A Pionzio
- Discovery Life Sciences, Huntsville, AL, 35806, USA
| | - M C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - M E Talkowski
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02141, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - E E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - S E Levy
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, 35806, USA
| | - F J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA.
- Department of Computer Science, Rice University, Houston, TX, USA.
| |
|
14
|
Viljoen N, Burt F, Weyer J. Coding-complete genome of human alphaherpesvirus 1 isolated from a case of fulminant hepatitis. Microbiol Resour Announc 2023; 12:e0035523. [PMID: 37747240 PMCID: PMC10586135 DOI: 10.1128/mra.00355-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2023] [Accepted: 08/03/2023] [Indexed: 09/26/2023] Open
Abstract
We report the coding-complete genome sequence of human alphaherpesvirus 1 (HHV1) isolated from a previously healthy 64-year-old male with fulminant hepatitis, a rare presentation of a common viral agent. The sequence is highly similar to previously described HHV1 sequences. Additional sequence data for fulminant hepatitis cases are required.
Affiliation(s)
- Natalie Viljoen
- Center for Emerging Zoonotic and Parasitic Diseases, National Institute for Communicable Disease of the National Health Laboratory Service, Sandringham, South Africa
- Division of Virology, University of the Free State, Bloemfontein, South Africa
- Centre for Viral Zoonoses, Department of Medical Virology, University of Pretoria, Pretoria, South Africa
| | - Felicity Burt
- Division of Virology, University of the Free State, Bloemfontein, South Africa
- Division of Virology, National Health Laboratory Service, Universitas, Bloemfontein, South Africa
| | - Jacqueline Weyer
- Center for Emerging Zoonotic and Parasitic Diseases, National Institute for Communicable Disease of the National Health Laboratory Service, Sandringham, South Africa
- Centre for Viral Zoonoses, Department of Medical Virology, University of Pretoria, Pretoria, South Africa
- Department of Microbiology and Infectious Diseases, Faculty of Health Sciences, University of Witwatersrand, Johannesburg, South Africa
| |
|
15
|
Kolmogorov M, Billingsley KJ, Mastoras M, Meredith M, Monlong J, Lorig-Roach R, Asri M, Alvarez Jerez P, Malik L, Dewan R, Reed X, Genner RM, Daida K, Behera S, Shafin K, Pesout T, Prabakaran J, Carnevali P, Yang J, Rhie A, Scholz SW, Traynor BJ, Miga KH, Jain M, Timp W, Phillippy AM, Chaisson M, Sedlazeck FJ, Blauwendraat C, Paten B. Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation. Nat Methods 2023; 20:1483-1492. [PMID: 37710018 PMCID: PMC11222905 DOI: 10.1038/s41592-023-01993-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 01/28/2023] [Accepted: 08/04/2023] [Indexed: 09/16/2023]
Abstract
Long-read sequencing technologies substantially overcome the limitations of short-reads but have not been considered as a feasible replacement for population-scale projects, being a combination of too expensive, not scalable enough or too error-prone. Here we develop an efficient and scalable wet lab and computational protocol, Napu, for Oxford Nanopore Technologies long-read sequencing that seeks to address those limitations. We applied our protocol to cell lines and brain tissue samples as part of a pilot project for the National Institutes of Health Center for Alzheimer's and Related Dementias. Using a single PromethION flow cell, we can detect single nucleotide polymorphisms with F1-score comparable to Illumina short-read sequencing. Small indel calling remains difficult within homopolymers and tandem repeats, but achieves good concordance to Illumina indel calls elsewhere. Further, we can discover structural variants with F1-score on par with state-of-the-art de novo assembly methods. Our protocol phases small and structural variants at megabase scales and produces highly accurate, haplotype-specific methylation calls.
Affiliation(s)
- Mikhail Kolmogorov
- Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA.
| | - Kimberley J Billingsley
- Center for Alzheimer's and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA.
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA.
| | - Mira Mastoras
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | | | - Jean Monlong
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | | | - Mobin Asri
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | - Pilar Alvarez Jerez
- Center for Alzheimer's and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
| | - Laksh Malik
- Center for Alzheimer's and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
| | - Ramita Dewan
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
| | - Xylena Reed
- Center for Alzheimer's and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
| | - Rylee M Genner
- Center for Alzheimer's and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
| | - Kensuke Daida
- Center for Alzheimer's and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
| | - Sairam Behera
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | | | - Trevor Pesout
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | - Jeshuwin Prabakaran
- Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, USA
| | | | - Jianzhi Yang
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Sonja W Scholz
- Neurodegenerative Diseases Research Unit, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA
- Department of Neurology, Johns Hopkins University Medical Center, Baltimore, MD, USA
| | - Bryan J Traynor
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
- Department of Neurology, Johns Hopkins University Medical Center, Baltimore, MD, USA
| | - Karen H Miga
- UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
| | - Miten Jain
- Department of Bioengineering, Northeastern University, Boston, MA, USA
- Department of Physics, Northeastern University, Boston, MA, USA
| | - Winston Timp
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Mark Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Department of Computer Science, Rice University, Houston, TX, USA
| | - Cornelis Blauwendraat
- Center for Alzheimer's and Related Dementias, National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA.
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA.
| | | |
|
16
|
Pinto BJ, O’Connor B, Schatz MC, Zarate S, Wilson MA. Concerning the eXclusion in human genomics: the choice of sex chromosome representation in the human genome drastically affects the number of identified variants. G3 (BETHESDA, MD.) 2023; 13:jkad169. [PMID: 37497639 PMCID: PMC10542555 DOI: 10.1093/g3journal/jkad169] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/23/2023] [Revised: 06/28/2023] [Accepted: 07/05/2023] [Indexed: 07/28/2023]
Abstract
Over the past 30 years, a community of scientists has pieced together every base pair of the human reference genome from telomere to telomere. Interestingly, most human genomics studies omit more than 5% of the genome from their analyses. Under "normal" circumstances, omitting any chromosome(s) from an analysis of the human genome would be a cause for concern, with the exception being sex chromosomes. Sex chromosomes in eutherians share an evolutionary origin as an ancestral pair of autosomes. In humans, they share 3 regions of high-sequence identity (∼98-100%), which, along with the unique transmission patterns of the sex chromosomes, introduce technical artifacts in genomic analyses. However, the human X chromosome bears numerous important genes, including more "immune response" genes than any other chromosome, which makes its exclusion irresponsible when sex differences across human diseases are widespread. To better characterize the possible effect of the inclusion/exclusion of the X chromosome on variants called, we conducted a pilot study on the Terra cloud platform to replicate a subset of standard genomic practices using both the CHM13 reference genome and the sex chromosome complement-aware reference genome. We compared the quality of variant calling, expression quantification, and allele-specific expression using these 2 reference genome versions across 50 human samples from the Genotype-Tissue Expression consortium annotated as females. We found that after correction, the whole X chromosome (100%) can generate reliable variant calls, allowing for the inclusion of the whole genome in human genomics analyses as a departure from the status quo of omitting the sex chromosomes from empirical and clinical genomics studies.
Affiliation(s)
- Brendan J Pinto
- School of Life Sciences, Arizona State University, Tempe, AZ 85282, USA
- Center for Evolution and Medicine, Arizona State University, Tempe, AZ 85282, USA
- Department of Zoology, Milwaukee Public Museum, Milwaukee, WI 53233, USA
| | | | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Samantha Zarate
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Melissa A Wilson
- School of Life Sciences, Arizona State University, Tempe, AZ 85282, USA
- Center for Evolution and Medicine, Arizona State University, Tempe, AZ 85282, USA
- The Biodesign Center for Mechanisms of Evolution, Arizona State University, Tempe, AZ 85282, USA
| |
|
17
|
Bornstein K, Gryan G, Chang ES, Marchler-Bauer A, Schneider VA. The NIH Comparative Genomics Resource: addressing the promises and challenges of comparative genomics on human health. BMC Genomics 2023; 24:575. [PMID: 37759191 PMCID: PMC10523801 DOI: 10.1186/s12864-023-09643-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 05/20/2023] [Accepted: 08/31/2023] [Indexed: 09/29/2023] Open
Abstract
Comparative genomics is the comparison of genetic information within and across organisms to understand the evolution, structure, and function of genes, proteins, and non-coding regions (Sivashankari and Shanmughavel, Bioinformation 1:376-8, 2007). Advances in sequencing technology and assembly algorithms have resulted in the ability to sequence large genomes and provided a wealth of data that are being used in comparative genomic analyses. Comparative analysis can be leveraged to systematically explore and evaluate the biological relationships and evolution between species, aid in understanding the structure and function of genes, and gain a better understanding of disease and potential drug targets. As our knowledge of genetics expands, comparative genomics can help identify emerging model organisms among a broader span of the tree of life, positively impacting human health. This impact includes, but is not limited to, zoonotic disease research, therapeutics development, microbiome research, xenotransplantation, oncology, and toxicology. Despite advancements in comparative genomics, new challenges have arisen around the quantity, quality assurance, annotation, and interoperability of genomic data and metadata. New tools and approaches are required to meet these challenges and fulfill the needs of researchers. This paper focuses on how the National Institutes of Health (NIH) Comparative Genomics Resource (CGR) can address both the opportunities for comparative genomics to further impact human health and confront an increasingly complex set of challenges facing researchers.
Affiliation(s)
| | - Gary Gryan
- The MITRE Corporation, 7525 Colshire Dr, McLean, VA, USA
| | - E Sally Chang
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
| | - Aron Marchler-Bauer
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
| | - Valerie A Schneider
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.
| |
|
18
|
Deflaux N, Selvaraj MS, Condon HR, Mayo K, Haidermota S, Basford MA, Lunt C, Philippakis AA, Roden DM, Denny JC, Musick A, Collins R, Allen N, Effingham M, Glazer D, Natarajan P, Bick AG. Demonstrating paths for unlocking the value of cloud genomics through cross cohort analysis. Nat Commun 2023; 14:5419. [PMID: 37669985 PMCID: PMC10480504 DOI: 10.1038/s41467-023-41185-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 12/05/2022] [Accepted: 08/24/2023] [Indexed: 09/07/2023] Open
Abstract
Recently, large scale genomic projects such as All of Us and the UK Biobank have introduced a new research paradigm where data are stored centrally in cloud-based Trusted Research Environments (TREs). To characterize the advantages and drawbacks of different TRE attributes in facilitating cross-cohort analysis, we conduct a Genome-Wide Association Study of standard lipid measures using two approaches: meta-analysis and pooled analysis. Comparison of full summary data from both approaches with an external study shows strong correlation of known loci with lipid levels (R² ~ 83-97%). Importantly, 90 variants meet the significance threshold only in the meta-analysis and 64 variants are significant only in pooled analysis, with approximately 20% of variants in each of those groups being most prevalent in non-European, non-Asian ancestry individuals. These findings have important implications, as technical and policy choices lead to cross-cohort analyses generating similar, but not identical results, particularly for non-European ancestral populations.
Affiliation(s)
| | - Margaret Sunitha Selvaraj
- Program in Medical and Population Genetics and the Cardiovascular Disease Initiative, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Department of Medicine, Harvard Medical School, Boston, MA, USA
- Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Henry Robert Condon
- Division of Genetic Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Kelsey Mayo
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Sara Haidermota
- Program in Medical and Population Genetics and the Cardiovascular Disease Initiative, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Division of Cardiology, Massachusetts General Hospital, Boston, MA, USA
| | - Melissa A Basford
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Chris Lunt
- All of Us Research Program, National Institutes of Health, Bethesda, MD, USA
| | | | - Dan M Roden
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- Department of Pharmacology, Vanderbilt University Medical Center, Nashville, TN, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Joshua C Denny
- All of Us Research Program, National Institutes of Health, Bethesda, MD, USA
| | - Anjene Musick
- All of Us Research Program, National Institutes of Health, Bethesda, MD, USA
| | - Rory Collins
- Nuffield Department of Population Health, University of Oxford, Oxford, Oxfordshire, UK
- UK Biobank, Cheadle, Stockport, UK
| | - Naomi Allen
- Nuffield Department of Population Health, University of Oxford, Oxford, Oxfordshire, UK
- UK Biobank, Cheadle, Stockport, UK
| | | | | | - Pradeep Natarajan
- Program in Medical and Population Genetics and the Cardiovascular Disease Initiative, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Department of Medicine, Harvard Medical School, Boston, MA, USA
- Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Division of Cardiology, Massachusetts General Hospital, Boston, MA, USA
| | - Alexander G Bick
- Division of Genetic Medicine, Vanderbilt University Medical Center, Nashville, TN, USA.
| |
|
19
|
Rhie A, Nurk S, Cechova M, Hoyt SJ, Taylor DJ, Altemose N, Hook PW, Koren S, Rautiainen M, Alexandrov IA, Allen J, Asri M, Bzikadze AV, Chen NC, Chin CS, Diekhans M, Flicek P, Formenti G, Fungtammasan A, Garcia Giron C, Garrison E, Gershman A, Gerton JL, Grady PGS, Guarracino A, Haggerty L, Halabian R, Hansen NF, Harris R, Hartley GA, Harvey WT, Haukness M, Heinz J, Hourlier T, Hubley RM, Hunt SE, Hwang S, Jain M, Kesharwani RK, Lewis AP, Li H, Logsdon GA, Lucas JK, Makalowski W, Markovic C, Martin FJ, Mc Cartney AM, McCoy RC, McDaniel J, McNulty BM, Medvedev P, Mikheenko A, Munson KM, Murphy TD, Olsen HE, Olson ND, Paulin LF, Porubsky D, Potapova T, Ryabov F, Salzberg SL, Sauria MEG, Sedlazeck FJ, Shafin K, Shepelev VA, Shumate A, Storer JM, Surapaneni L, Taravella Oill AM, Thibaud-Nissen F, Timp W, Tomaszkiewicz M, Vollger MR, Walenz BP, Watwood AC, Weissensteiner MH, Wenger AM, Wilson MA, Zarate S, Zhu Y, Zook JM, Eichler EE, O'Neill RJ, Schatz MC, Miga KH, Makova KD, Phillippy AM. The complete sequence of a human Y chromosome. Nature 2023; 621:344-354. [PMID: 37612512 PMCID: PMC10752217 DOI: 10.1038/s41586-023-06457-y] [Citation(s) in RCA: 67] [Impact Index Per Article: 67.0] [Received: 12/02/2022] [Accepted: 07/19/2023] [Indexed: 08/25/2023]
Abstract
The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications [1-3]. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished [4,5]. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, showing the complete ampliconic structures of gene families TSPY, DAZ and RBMY; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region. We have combined T2T-Y with a previous assembly of the CHM13 genome [4] and mapped available population variation, clinical variants and functional genomics data to produce a complete and comprehensive reference sequence for all 24 human chromosomes.
Affiliation(s)
- Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Sergey Nurk
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Oxford Nanopore Technologies Inc., Oxford, UK
| | - Monika Cechova
- Faculty of Informatics, Masaryk University, Brno, Czech Republic
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Savannah J Hoyt
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - Dylan J Taylor
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Nicolas Altemose
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
| | - Paul W Hook
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Mikko Rautiainen
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Ivan A Alexandrov
- Federal Research Center of Biotechnology of the Russian Academy of Sciences, Moscow, Russia
- Center for Algorithmic Biotechnology, Saint Petersburg State University, St Petersburg, Russia
- Department of Anatomy and Anthropology and Department of Human Molecular Genetics and Biochemistry, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv-Yafo, Israel
| | - Jamie Allen
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Mobin Asri
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Andrey V Bzikadze
- Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, CA, USA
| | - Nae-Chyun Chen
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Chen-Shan Chin
- GeneDX Holdings Corp, Stamford, CT, USA
- Foundation of Biological Data Science, Belmont, CA, USA
| | - Mark Diekhans
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Department of Genetics, University of Cambridge, Cambridge, UK
| | | | | | - Carlos Garcia Giron
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Ariel Gershman
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Jennifer L Gerton
- Stowers Institute for Medical Research, Kansas City, MO, USA
- University of Kansas Medical Center, Kansas City, MO, USA
| | - Patrick G S Grady
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Genomics Research Centre, Human Technopole, Milan, Italy
| | - Leanne Haggerty
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Reza Halabian
- Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
| | - Nancy F Hansen
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Cancer Genetics and Comparative Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Robert Harris
- Department of Biology, Pennsylvania State University, University Park, PA, USA
| | - Gabrielle A Hartley
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Marina Haukness
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Jakob Heinz
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Thibaut Hourlier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | | | - Sarah E Hunt
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Stephen Hwang
- XDBio Program, Johns Hopkins University, Baltimore, MD, USA
| | - Miten Jain
- Department of Bioengineering, Department of Physics, Northeastern University, Boston, MA, USA
| | - Rupesh K Kesharwani
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
| | - Alexandra P Lewis
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Glennis A Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Julian K Lucas
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Wojciech Makalowski
- Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
| | - Christopher Markovic
- Genome Technology Access Center at the McDonnell Genome Institute, Washington University, St. Louis, MO, USA
| | - Fergal J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Ann M Mc Cartney
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Rajiv C McCoy
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Jennifer McDaniel
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Brandy M McNulty
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Paul Medvedev
- Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, USA
- Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA, USA
- Center for Computational Biology and Bioinformatics, Pennsylvania State University, University Park, PA, USA
| | - Alla Mikheenko
- Center for Algorithmic Biotechnology, Saint Petersburg State University, St Petersburg, Russia
- UCL Queen Square Institute of Neurology, UCL, London, UK
| | - Katherine M Munson
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Terence D Murphy
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Hugh E Olsen
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Nathan D Olson
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Luis F Paulin
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Tamara Potapova
- Stowers Institute for Medical Research, Kansas City, MO, USA
| | - Fedor Ryabov
- Masters Program in National Research University Higher School of Economics, Moscow, Russia
| | - Steven L Salzberg
- Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD, USA
| | | | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
- Department of Computer Science, Rice University, Houston, TX, USA
| | | | | | - Alaina Shumate
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | | | - Likhitha Surapaneni
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Angela M Taravella Oill
- Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Winston Timp
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Marta Tomaszkiewicz
- Department of Biology, Pennsylvania State University, University Park, PA, USA
- Department of Biomedical Engineering, Pennsylvania State University, State College, PA, USA
| | - Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Brian P Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Allison C Watwood
- Department of Biology, Pennsylvania State University, University Park, PA, USA
| | | | | | - Melissa A Wilson
- Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
| | - Samantha Zarate
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Yiming Zhu
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
| | - Justin M Zook
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Investigator, Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - Rachel J O'Neill
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
- Department of Genetics and Genome Sciences, UConn Health, Farmington, CT, USA
| | - Michael C Schatz
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Karen H Miga
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Kateryna D Makova
- Department of Biology, Pennsylvania State University, University Park, PA, USA
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
| |
|
20
|
Mahungu AC, Steyn E, Floudiotis N, Wilson LA, Vandrovcova J, Reilly MM, Record CJ, Benatar M, Wu G, Raga S, Wilmshurst JM, Naidu K, Hanna M, Nel M, Heckmann JM. The mutational profile in a South African cohort with inherited neuropathies and spastic paraplegia. Front Neurol 2023; 14:1239725. [PMID: 37712079 PMCID: PMC10497947 DOI: 10.3389/fneur.2023.1239725] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/13/2023] [Accepted: 08/02/2023] [Indexed: 09/16/2023] Open
Abstract
Introduction: Limited diagnostics are available for inherited neuromuscular diseases (NMD) in South Africa and (excluding muscle disease) are mainly aimed at the most frequent genes underlying genetic neuropathy (GN) and spastic ataxias in Europeans. In this study, we used next-generation sequencing to screen 61 probands with GN, hereditary spastic paraplegia (HSP), and spastic ataxias for a genetic diagnosis. Methods: After identifying four GN probands with PMP22 duplication and one spastic ataxia proband with SCA1, the remaining probands underwent whole exome (n = 26) or genome sequencing (n = 30). The curation of coding/splice region variants using gene panels was guided by allele frequencies from internal African-ancestry control genomes (n = 537) and the Clinical Genome Resource's Sequence Variant Interpretation guidelines. Results: Of 32 GN probands, 50% had African-genetic ancestry, and 44% were solved: PMP22 (n = 4); MFN2 (n = 3); one each of MORC2, ATP1A1, ADPRHL2, GJB1, GAN, MPZ, and ATM. Of 29 HSP probands (six with predominant ataxia), 66% had African-genetic ancestry, and 48% were solved: SPG11 (n = 3); KIF1A (n = 2); and one each of SPAST, ATL1, SPG7, PCYT2, PSEN1, ATXN1, ALDH18A1, CYP7B1, and RFT1. Structural variants in SPAST, SPG11, SPG7, MFN2, MPZ, KIF5A, and GJB1 were excluded by computational prediction and manual visualisation. Discussion: In this preliminary cohort screening panel of disease genes using WES/WGS data, we solved ~50% of cases, which is similar to diagnostic yields reported for global cohorts. However, the mutational profile among South Africans with GN and HSP differs substantially from that in the Global North.
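The diagnostic yields quoted in this abstract can be re-derived from the per-gene counts it lists. The following is a small arithmetic check, not the authors' analysis; the variable names are illustrative, and the combined yield is computed here from the abstract's numbers rather than quoted from the paper.

```python
# Solved probands per gene, as listed in the abstract.
gn_solved = {"PMP22": 4, "MFN2": 3, "MORC2": 1, "ATP1A1": 1, "ADPRHL2": 1,
             "GJB1": 1, "GAN": 1, "MPZ": 1, "ATM": 1}
hsp_solved = {"SPG11": 3, "KIF1A": 2, "SPAST": 1, "ATL1": 1, "SPG7": 1,
              "PCYT2": 1, "PSEN1": 1, "ATXN1": 1, "ALDH18A1": 1,
              "CYP7B1": 1, "RFT1": 1}

# Cohort sizes stated in the abstract.
gn_total, hsp_total = 32, 29


def yield_percent(solved: int, total: int) -> int:
    """Diagnostic yield as a whole-number percentage."""
    return round(100 * solved / total)


gn_yield = yield_percent(sum(gn_solved.values()), gn_total)      # 14 of 32
hsp_yield = yield_percent(sum(hsp_solved.values()), hsp_total)   # 14 of 29
combined_yield = yield_percent(
    sum(gn_solved.values()) + sum(hsp_solved.values()),
    gn_total + hsp_total,
)                                                                # 28 of 61

print(gn_yield, hsp_yield, combined_yield)  # 44 48 46
```

The per-group figures match the reported 44% and 48%, and the combined figure of about 46% is consistent with the "~50%" given in the Discussion.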
Affiliation(s)
- Amokelani C. Mahungu
- Neurology Research Group, Division of Neurology, Department of Medicine, University of Cape Town, Cape Town, South Africa
- Neuroscience Institute, University of Cape Town, Cape Town, South Africa
| | - Elizabeth Steyn
- Neurology Research Group, Division of Neurology, Department of Medicine, University of Cape Town, Cape Town, South Africa
| | - Niki Floudiotis
- Neurology Research Group, Division of Neurology, Department of Medicine, University of Cape Town, Cape Town, South Africa
| | - Lindsay A. Wilson
- Department of Neuromuscular Diseases, Queen Square Institute of Neurology, University College London, London, United Kingdom
| | - Jana Vandrovcova
- Department of Neuromuscular Diseases, Queen Square Institute of Neurology, University College London, London, United Kingdom
| | - Mary M. Reilly
- Department of Neuromuscular Disease, Queen Square UCL Institute of Neurology and the National Hospital of Neurology and Neurosurgery, London, United Kingdom
| | - Christopher J. Record
- Department of Neuromuscular Disease, Queen Square UCL Institute of Neurology and the National Hospital of Neurology and Neurosurgery, London, United Kingdom
| | - Michael Benatar
- Department of Neurology, University of Miami Miller School of Medicine, Miami, FL, United States
| | - Gang Wu
- Center for Applied Bioinformatics, St. Jude Children's Research Hospital, Memphis, TN, United States
| | - Sharika Raga
- Neuroscience Institute, University of Cape Town, Cape Town, South Africa
- Division of Paediatric Neurology, Department of Paediatrics and Child Health, Red Cross War Memorial Children's Hospital, University of Cape Town, Cape Town, South Africa
| | - Jo M. Wilmshurst
- Neuroscience Institute, University of Cape Town, Cape Town, South Africa
- Division of Paediatric Neurology, Department of Paediatrics and Child Health, Red Cross War Memorial Children's Hospital, University of Cape Town, Cape Town, South Africa
| | - Kireshnee Naidu
- Neurology Research Group, Division of Neurology, Department of Medicine, University of Cape Town, Cape Town, South Africa
| | - Michael Hanna
- Department of Neuromuscular Diseases, Queen Square Institute of Neurology, University College London, London, United Kingdom
- NHS Highly Specialised Service for Rare Mitochondrial Disorders, Queen Square Centre for Neuromuscular Diseases, The National Hospital for Neurology and Neurosurgery, London, United Kingdom
| | - Melissa Nel
- Neurology Research Group, Division of Neurology, Department of Medicine, University of Cape Town, Cape Town, South Africa
- Neuroscience Institute, University of Cape Town, Cape Town, South Africa
| | - Jeannine M. Heckmann
- Neurology Research Group, Division of Neurology, Department of Medicine, University of Cape Town, Cape Town, South Africa
- Neuroscience Institute, University of Cape Town, Cape Town, South Africa
| |
|
21
|
Casaletto J, Bernier A, McDougall R, Cline MS. Federated Analysis for Privacy-Preserving Data Sharing: A Technical and Legal Primer. Annu Rev Genomics Hum Genet 2023; 24:347-368. [PMID: 37253596 PMCID: PMC10846631 DOI: 10.1146/annurev-genom-110122-084756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
Continued advances in precision medicine rely on the widespread sharing of data that relate human genetic variation to disease. However, data sharing is severely limited by legal, regulatory, and ethical restrictions that safeguard patient privacy. Federated analysis addresses this problem by transferring the code to the data, providing the technical and legal capability to analyze the data within their secure home environment rather than transferring the data to another institution for analysis. This allows researchers to gain new insights from data that cannot be moved, while respecting patient privacy and the data stewards' legal obligations. Because federated analysis is a technical solution to the legal challenges inherent in data sharing, the technology and policy implications must be evaluated together. Here, we summarize the technical approaches to federated analysis and provide a legal analysis of their policy implications.
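The "code goes to the data" idea described in this abstract can be illustrated with a minimal sketch. It is not taken from the review: the function names, the `SiteSummary` type, and the toy genotype lists are all hypothetical, and a real deployment would add authentication, auditing, and disclosure controls on what each site returns.

```python
from dataclasses import dataclass


@dataclass
class SiteSummary:
    """Aggregate counts that a site agrees to release (hypothetical schema)."""
    carriers: int
    total: int


def local_analysis(genotypes: list[int]) -> SiteSummary:
    """Runs inside a site's secure environment and returns aggregates only.

    Each genotype is the number of alternate alleles (0, 1, or 2) for one person.
    """
    carriers = sum(1 for g in genotypes if g > 0)
    return SiteSummary(carriers=carriers, total=len(genotypes))


def combine(summaries: list[SiteSummary]) -> float:
    """Runs at the coordinating centre, which never sees row-level data."""
    carriers = sum(s.carriers for s in summaries)
    total = sum(s.total for s in summaries)
    return carriers / total if total else 0.0


# Toy per-site data; in a federated setup these lists never leave their sites.
site_a = [0, 1, 0, 0, 2]
site_b = [0, 0, 1]

summaries = [local_analysis(site_a), local_analysis(site_b)]
pooled_carrier_fraction = combine(summaries)
print(pooled_carrier_fraction)  # 0.375
```

Only the two `SiteSummary` objects cross institutional boundaries here, which is the property the abstract attributes to federated analysis.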
Affiliation(s)
- James Casaletto
- Genomics Institute, University of California, Santa Cruz, California, USA
| | - Alexander Bernier
- Centre of Genomics and Policy, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
| | - Robyn McDougall
- Centre of Genomics and Policy, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
| | - Melissa S Cline
- Genomics Institute, University of California, Santa Cruz, California, USA
| |
|
22
|
Mayo KR, Basford MA, Carroll RJ, Dillon M, Fullen H, Leung J, Master H, Rura S, Sulieman L, Kennedy N, Banks E, Bernick D, Gauchan A, Lichtenstein L, Mapes BM, Marginean K, Nyemba SL, Ramirez A, Rotundo C, Wolfe K, Xia W, Azuine RE, Cronin RM, Denny JC, Kho A, Lunt C, Malin B, Natarajan K, Wilkins CH, Xu H, Hripcsak G, Roden DM, Philippakis AA, Glazer D, Harris PA. The All of Us Data and Research Center: Creating a Secure, Scalable, and Sustainable Ecosystem for Biomedical Research. Annu Rev Biomed Data Sci 2023; 6:443-464. [PMID: 37561600 PMCID: PMC11157478 DOI: 10.1146/annurev-biodatasci-122120-104825] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 08/12/2023]
Abstract
The All of Us Research Program's Data and Research Center (DRC) was established to help acquire, curate, and provide access to one of the world's largest and most diverse datasets for precision medicine research. Already, over 500,000 participants are enrolled in All of Us, 80% of whom are underrepresented in biomedical research, and data are being analyzed by a community of over 2,300 researchers. The DRC created this thriving data ecosystem by collaborating with engaged participants, innovative program partners, and empowered researchers. In this review, we first describe how the DRC is organized to meet the needs of this broad group of stakeholders. We then outline guiding principles, common challenges, and innovative approaches used to build the All of Us data ecosystem. Finally, we share lessons learned to help others navigate important decisions and trade-offs in building a modern biomedical data platform.
Affiliation(s)
- Kelsey R Mayo
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Melissa A Basford
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Robert J Carroll
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Moira Dillon
- Verily Life Sciences, South San Francisco, California, USA
| | - Heather Fullen
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Jesse Leung
- Verily Life Sciences, South San Francisco, California, USA
| | - Hiral Master
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Shimon Rura
- Verily Life Sciences, South San Francisco, California, USA
| | - Lina Sulieman
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Nan Kennedy
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Eric Banks
- Data Sciences Platform, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
| | - David Bernick
- Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
| | - Asmita Gauchan
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Lee Lichtenstein
- Data Sciences Platform, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
| | - Brandy M Mapes
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Kayla Marginean
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Steve L Nyemba
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Andrea Ramirez
- The All of Us Research Program, National Institutes of Health, Bethesda, Maryland, USA
| | - Charissa Rotundo
- Vanderbilt University Medical Center Enterprise Cybersecurity, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Keri Wolfe
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Weiyi Xia
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Romuladus E Azuine
- The All of Us Research Program, National Institutes of Health, Bethesda, Maryland, USA
| | - Robert M Cronin
- Department of Internal Medicine, The Ohio State University, Columbus, Ohio, USA
| | - Joshua C Denny
- The All of Us Research Program, National Institutes of Health, Bethesda, Maryland, USA
| | - Abel Kho
- Department of Medicine and Institute for Augmented Intelligence in Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
| | - Christopher Lunt
- The All of Us Research Program, National Institutes of Health, Bethesda, Maryland, USA
| | - Bradley Malin
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Karthik Natarajan
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| | - Consuelo H Wilkins
- Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Hua Xu
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, Connecticut, USA
| | - George Hripcsak
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| | - Dan M Roden
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Department of Pharmacology, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | | | - David Glazer
- Verily Life Sciences, South San Francisco, California, USA
| | - Paul A Harris
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| |
|
23
|
Felker SA, Lawlor JMJ, Hiatt SM, Thompson ML, Latner DR, Finnila CR, Bowling KM, Bonnstetter ZT, Bonini KE, Kelly NR, Kelley WV, Hurst ACE, Rashid S, Kelly MA, Nakouzi G, Hendon LG, Bebin EM, Kenny EE, Cooper GM. Poison exon annotations improve the yield of clinically relevant variants in genomic diagnostic testing. Genet Med 2023; 25:100884. [PMID: 37161864 PMCID: PMC10524927 DOI: 10.1016/j.gim.2023.100884] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 12/22/2022] [Revised: 05/01/2023] [Accepted: 05/03/2023] [Indexed: 05/11/2023] Open
Abstract
PURPOSE: Neurodevelopmental disorders (NDDs) often result from rare genetic variation, but genomic testing yield for NDDs remains below 50%, suggesting that clinically relevant variants may be missed by standard analyses. Here, we analyze "poison exons" (PEs), which are evolutionarily conserved alternative exons often absent from standard gene annotations. Variants that alter PE inclusion can lead to loss of function and may be highly penetrant contributors to disease. METHODS: We curated published RNA sequencing data from developing mouse cortex to define 1937 conserved PE regions potentially relevant to NDDs, and we analyzed variants found by genome sequencing in multiple NDD cohorts. RESULTS: Across 2999 probands, we found 6 novel clinically relevant variants in PE regions. Five of these variants are in genes that are part of the sodium voltage-gated channel alpha subunit family (SCN1A, SCN2A, and SCN8A), which is associated with epilepsies. One variant is in SNRPB, associated with cerebrocostomandibular syndrome. These variants have moderate to high computational impact assessments, are absent from population variant databases, and are in genes with gene-phenotype associations consistent with each proband's reported features. CONCLUSION: With a minimal increase in variant analysis burden (average of 0.77 variants per proband), annotation of PEs can improve diagnostic yield for NDDs and likely other congenital conditions.
Affiliation(s)
| | | | - Susan M Hiatt
- HudsonAlpha Institute for Biotechnology, Huntsville, AL
| | | | | | | | | | | | - Katherine E Bonini
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY
| | - Nicole R Kelly
- Division of Pediatric Genetic Medicine, Department of Pediatrics, Children's Hospital at Montefiore/Montefiore Medical Center/Albert Einstein College of Medicine, Bronx, NY
| | | | | | | | | | | | | | - E Martina Bebin
- Department of Neurology, University of Alabama at Birmingham, Birmingham, AL
| | - Eimear E Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY; Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY
| | | |
|
24
|
The Impact of Genomic Variation on Function (IGVF) Consortium. ARXIV 2023:arXiv:2307.13708v1. [PMID: 37547663 PMCID: PMC10402186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/08/2023]
Abstract
Our genomes influence nearly every aspect of human biology from molecular and cellular functions to phenotypes in health and disease. Human genetics studies have now associated hundreds of thousands of differences in our DNA sequence ("genomic variation") with disease risk and other phenotypes, many of which could reveal novel mechanisms of human biology and uncover the basis of genetic predispositions to diseases, thereby guiding the development of new diagnostics and therapeutics. Yet, understanding how genomic variation alters genome function to influence phenotype has proven challenging. To unlock these insights, we need a systematic and comprehensive catalog of genome function and the molecular and cellular effects of genomic variants. Toward this goal, the Impact of Genomic Variation on Function (IGVF) Consortium will combine approaches in single-cell mapping, genomic perturbations, and predictive modeling to investigate the relationships among genomic variation, genome function, and phenotypes. Through systematic comparisons and benchmarking of experimental and computational methods, we aim to create maps across hundreds of cell types and states describing how coding variants alter protein activity, how noncoding variants change the regulation of gene expression, and how both coding and noncoding variants may connect through gene regulatory and protein interaction networks. These experimental data, computational predictions, and accompanying standards and pipelines will be integrated into an open resource that will catalyze community efforts to explore genome function and the impact of genetic variation on human biology and disease across populations.
|
25
|
Hitz BC, Lee JW, Jolanki O, Kagda MS, Graham K, Sud P, Gabdank I, Strattan JS, Sloan CA, Dreszer T, Rowe LD, Podduturi NR, Malladi VS, Chan ET, Davidson JM, Ho M, Miyasato S, Simison M, Tanaka F, Luo Y, Whaling I, Hong EL, Lee BT, Sandstrom R, Rynes E, Nelson J, Nishida A, Ingersoll A, Buckley M, Frerker M, Kim DS, Boley N, Trout D, Dobin A, Rahmanian S, Wyman D, Balderrama-Gutierrez G, Reese F, Durand NC, Dudchenko O, Weisz D, Rao SSP, Blackburn A, Gkountaroulis D, Sadr M, Olshansky M, Eliaz Y, Nguyen D, Bochkov I, Shamim MS, Mahajan R, Aiden E, Gingeras T, Heath S, Hirst M, Kent WJ, Kundaje A, Mortazavi A, Wold B, Cherry JM. The ENCODE Uniform Analysis Pipelines. RESEARCH SQUARE 2023:rs.3.rs-3111932. [PMID: 37503119 PMCID: PMC10371165 DOI: 10.21203/rs.3.rs-3111932/v1] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 07/29/2023]
Abstract
The Encyclopedia of DNA Elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human genome. The current database comprises more than 19000 functional genomics experiments across more than 1000 cell lines and tissues, using a wide array of experimental techniques to study the chromatin structure and the regulatory and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All experimental data, metadata, and associated computational analyses created by the ENCODE consortium are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. The ENCODE project has engineered and distributed uniform processing pipelines in order to promote data provenance and reproducibility as well as allow interoperability between genomic resources and other consortia. All data files, reference genome versions, software versions, and parameters used by the pipelines are captured and available via the ENCODE Portal. The pipeline code, developed using Docker and Workflow Description Language (WDL; https://openwdl.org/), is publicly available on GitHub, with images available on Docker Hub (https://hub.docker.com), enabling access to a diverse range of biomedical researchers. ENCODE pipelines maintained and used by the DCC can be installed to run on personal computers, local HPC clusters, or in cloud computing environments via Cromwell. Access to the pipelines and data via the cloud gives small labs the ability to use the data or software without access to institutional compute clusters. Standardization of the computational methodologies for analysis and quality control leads to comparable results from different ENCODE collections, a prerequisite for successful integrative analyses.
Affiliation(s)
- Benjamin C Hitz
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Jin-Wook Lee
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Otto Jolanki
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Meenakshi S Kagda
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Keenan Graham
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Paul Sud
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Idan Gabdank
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- J Seth Strattan
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Cricket A Sloan
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Timothy Dreszer
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Laurence D Rowe
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Nikhil R Podduturi
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Venkat S Malladi
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Esther T Chan
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Jean M Davidson
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Marcus Ho
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Stuart Miyasato
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Matt Simison
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Forrest Tanaka
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Yunhai Luo
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Ian Whaling
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Eurie L Hong
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Brian T Lee
  - Genomics Institute, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Richard Sandstrom
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Eric Rynes
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Jemma Nelson
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Andrew Nishida
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Alyssa Ingersoll
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Michael Buckley
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Mark Frerker
  - Altius Institute for Biomedical Sciences, 2211 Elliott Avenue, 6th Floor, Seattle, WA 98121, USA
- Daniel S Kim
  - Department of Genetics, Department of Computer Science, Stanford University, 240 Pasteur Drive, Palo Alto, CA 94304, USA
- Nathan Boley
  - Department of Genetics, Department of Computer Science, Stanford University, 240 Pasteur Drive, Palo Alto, CA 94304, USA
- Diane Trout
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
- Alex Dobin
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Sorena Rahmanian
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697, USA
- Dana Wyman
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697, USA
- Fairlie Reese
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697, USA
- Neva C Durand
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
  - Department of Computer Science, Rice University, Houston, TX 77030, USA
- Olga Dudchenko
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- David Weisz
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Suhas S P Rao
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Department of Medicine, University of California San Francisco, San Francisco, CA 94143, USA
- Alyssa Blackburn
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
- Dimos Gkountaroulis
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
- Mahdi Sadr
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Moshe Olshansky
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Yossi Eliaz
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Dat Nguyen
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Ivan Bochkov
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Muhammad Saad Shamim
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
  - Department of Bioengineering, Rice University, Houston, TX 77030, USA
  - Medical Scientist Training Program, Baylor College of Medicine, Houston, TX 77030, USA
- Ragini Mahajan
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
  - Department of BioSciences, Rice University, Houston, TX 77005, USA
- Erez Aiden
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - The Center for Genome Architecture, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
  - Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA
- Tom Gingeras
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Simon Heath
  - CNAG-CRG, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology (BIST), Barcelona, Spain. Universitat Pompeu Fabra, Barcelona, Spain
- Martin Hirst
  - Michael Smith Laboratories, University of British Columbia, British Columbia, Canada
- W James Kent
  - Genomics Institute, School of Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Anshul Kundaje
  - Department of Genetics, Department of Computer Science, Stanford University, 240 Pasteur Drive, Palo Alto, CA 94304, USA
- Ali Mortazavi
  - Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697, USA
- Barbara Wold
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
- J Michael Cherry
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
26
|
Dahlquist JM, Nelson SC, Fullerton SM. Cloud-based biomedical data storage and analysis for genomic research: Landscape analysis of data governance in emerging NIH-supported platforms. HGG Advances 2023; 4:100196. [PMID: 37181330 PMCID: PMC10173774 DOI: 10.1016/j.xhgg.2023.100196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2023] [Accepted: 04/07/2023] [Indexed: 05/16/2023]
Abstract
The storage, sharing, and analysis of genomic data poses technical and logistical challenges that have precipitated the development of cloud-based computing platforms designed to facilitate collaboration and maximize the scientific utility of data. To understand cloud platforms' policies and procedures and the implications for different stakeholder groups, in summer 2021, we reviewed publicly available documents (N = 94) sourced from platform websites, scientific literature, and lay media for five NIH-funded cloud platforms (the All of Us Research Hub, NHGRI AnVIL, NHLBI BioData Catalyst, NCI Genomic Data Commons, and the Kids First Data Resource Center) and a pre-existing data sharing mechanism, dbGaP. Platform policies were compared across seven categories of data governance: data submission, data ingestion, user authentication and authorization, data security, data access, auditing, and sanctions. Our analysis finds similarities across the platforms, including reliance on a formal data ingestion process, multiple tiers of data access with varying user authentication and/or authorization requirements, platform and user data security measures, and auditing for inappropriate data use. Platforms differ in how data tiers are organized, as well as the specifics of user authentication and authorization across access tiers. Our analysis maps elements of data governance across emerging NIH-funded cloud platforms and as such provides a key resource for stakeholders seeking to understand and utilize data access and analysis options across platforms and to surface aspects of governance that may require harmonization to achieve the desired interoperability.
Affiliation(s)
- Jacklyn M. Dahlquist
  - Department of Bioethics and Humanities, University of Washington School of Medicine, Seattle, WA 98195, USA
- Sarah C. Nelson
  - Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
  - Corresponding author
- Stephanie M. Fullerton
  - Department of Bioethics and Humanities, University of Washington School of Medicine, Seattle, WA 98195, USA
  - Corresponding author
27
|
Hall JL, Honeycutt S, Gonzalez N, O’Donnell-Luria A, Taylor CO, Stevens L, Philippakis AA, Schatz MC. National Human Genome Research Institute Genomic Data Science Analysis, Visualization, and Informatics Lab-Space: Reaching out to Clinicians. Circulation. Genomic and Precision Medicine 2023; 16:275-276. [PMID: 37013830 PMCID: PMC10619961 DOI: 10.1161/circgen.122.003936] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/05/2023]
Affiliation(s)
- Jennifer L. Hall
  - Data Science and Analytics, American Heart Association, Dallas, TX
- Sally Honeycutt
  - Data Science and Analytics, American Heart Association, Dallas, TX
- Nicole Gonzalez
  - Data Science and Analytics, American Heart Association, Dallas, TX
- Laura Stevens
  - Data Science and Analytics, American Heart Association, Dallas, TX
28
|
Ahalt S, Avillach P, Boyles R, Bradford K, Cox S, Davis-Dusenbery B, Grossman RL, Krishnamurthy A, Manning A, Paten B, Philippakis A, Borecki I, Chen SH, Kaltman J, Ladwa S, Schwartz C, Thomson A, Davis S, Leaf A, Lyons J, Sheets E, Bis JC, Conomos M, Culotti A, Desain T, Digiovanna J, Domazet M, Gogarten S, Gutierrez-Sacristan A, Harris T, Heavner B, Jain D, O'Connor B, Osborn K, Pillion D, Pleiness J, Rice K, Rupp G, Serret-Larmande A, Smith A, Stedman JP, Stilp A, Barsanti T, Cheadle J, Erdmann C, Farlow B, Gartland-Gray A, Hayes J, Hiles H, Kerr P, Lenhardt C, Madden T, Mieczkowska JO, Miller A, Patton P, Rathbun M, Suber S, Asare J. Building a collaborative cloud platform to accelerate heart, lung, blood, and sleep research. J Am Med Inform Assoc 2023:7165700. [PMID: 37192819 DOI: 10.1093/jamia/ocad048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2022] [Revised: 02/20/2023] [Accepted: 03/24/2023] [Indexed: 05/18/2023]
Abstract
Research increasingly relies on interrogating large-scale data resources. The NIH National Heart, Lung, and Blood Institute developed the NHLBI BioData Catalyst® (BDC), a community-driven ecosystem where researchers, including bench and clinical scientists, statisticians, and algorithm developers, find, access, share, store, and compute on large-scale datasets. This ecosystem provides secure, cloud-based workspaces, user authentication and authorization, search, tools and workflows, applications, and new innovative features to address community needs, including exploratory data analysis, genomic and imaging tools, tools for reproducibility, and improved interoperability with other NIH data science platforms. BDC offers straightforward access to large-scale datasets and computational resources that support precision medicine for heart, lung, blood, and sleep conditions, leveraging separately developed and managed platforms to maximize flexibility based on researcher needs, expertise, and backgrounds. Through the NHLBI BioData Catalyst Fellows Program, BDC facilitates scientific discoveries and technological advances. BDC also facilitated accelerated research on the coronavirus disease-2019 (COVID-19) pandemic.
Affiliation(s)
- Stan Ahalt
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Paul Avillach
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Kira Bradford
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
  - RTI International, Triangle Park, North Carolina, USA
- Steven Cox
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Ashok Krishnamurthy
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Alisa Manning
  - The Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Benedict Paten
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California, USA
- Ingrid Borecki
  - Independent Consultant, BioData Catalyst Steering Committee Chair, St. Louis, Missouri, USA
- Shu Hui Chen
  - National Heart, Lung, and Blood Institute, NIH, Bethesda, Maryland, USA
- Jon Kaltman
  - National Heart, Lung, and Blood Institute, NIH, Bethesda, Maryland, USA
- Alastair Thomson
  - National Heart, Lung, and Blood Institute, NIH, Bethesda, Maryland, USA
- Sarah Davis
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Jessica Lyons
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Elizabeth Sheets
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California, USA
- Joshua C Bis
  - Department of Medicine, University of Washington, Seattle, Washington, USA
- Matthew Conomos
  - Department of Biostatistics, University of Washington, Seattle, Washington, USA
- Thomas Desain
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Stephanie Gogarten
  - Department of Biostatistics, University of Washington, Seattle, Washington, USA
- Tim Harris
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California, USA
- Ben Heavner
  - Department of Biostatistics, University of Washington, Seattle, Washington, USA
- Deepti Jain
  - Department of Biostatistics, University of Washington, Seattle, Washington, USA
- Kevin Osborn
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, California, USA
- Danielle Pillion
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Jacob Pleiness
  - Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan, USA
- Ken Rice
  - Department of Biostatistics, University of Washington, Seattle, Washington, USA
- Arnaud Serret-Larmande
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Albert Smith
  - Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan, USA
- Jason P Stedman
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Adrienne Stilp
  - Department of Biostatistics, University of Washington, Seattle, Washington, USA
- John Cheadle
  - RTI International, Triangle Park, North Carolina, USA
- Christopher Erdmann
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Brandy Farlow
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Julie Hayes
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Hannah Hiles
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Paul Kerr
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Chris Lenhardt
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Tom Madden
  - RTI International, Triangle Park, North Carolina, USA
- Joanna O Mieczkowska
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Amanda Miller
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Patrick Patton
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Stephanie Suber
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Joe Asare
  - RTI International, Triangle Park, North Carolina, USA
29
|
Licata L, Via A, Turina P, Babbi G, Benevenuta S, Carta C, Casadio R, Cicconardi A, Facchiano A, Fariselli P, Giordano D, Isidori F, Marabotti A, Martelli PL, Pascarella S, Pinelli M, Pippucci T, Russo R, Savojardo C, Scafuri B, Valeriani L, Capriotti E. Resources and tools for rare disease variant interpretation. Front Mol Biosci 2023; 10:1169109. [PMID: 37234922 PMCID: PMC10206239 DOI: 10.3389/fmolb.2023.1169109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/18/2023] [Accepted: 04/25/2023] [Indexed: 05/28/2023]
Abstract
Collectively, rare genetic disorders affect a substantial portion of the world's population. In most cases, those affected face difficulties in receiving a clinical diagnosis and genetic characterization. The understanding of the molecular mechanisms of these diseases and the development of therapeutic treatments for patients are also challenging. However, the application of recent advancements in genome sequencing/analysis technologies and computer-aided tools for predicting phenotype-genotype associations can bring significant benefits to this field. In this review, we highlight the most relevant online resources and computational tools for genome interpretation that can enhance the diagnosis, clinical management, and development of treatments for rare disorders. Our focus is on resources for interpreting single nucleotide variants. Additionally, we present use cases for interpreting genetic variants in clinical settings and review the limitations of these results and prediction tools. Finally, we have compiled a curated set of core resources and tools for analyzing rare disease genomes. Such resources and tools can be utilized to develop standardized protocols that will enhance the accuracy and effectiveness of rare disease diagnosis.
Affiliation(s)
- Luana Licata
  - Department of Biology, University of Rome Tor Vergata, Roma, Italy
- Allegra Via
  - Department of Biochemical Sciences “A. Rossi Fanelli”, University of Rome “La Sapienza”, Roma, Italy
- Paola Turina
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Giulia Babbi
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Claudio Carta
  - National Centre for Rare Diseases, Istituto Superiore di Sanità, Roma, Italy
- Rita Casadio
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Andrea Cicconardi
  - Department of Physics, University of Genova, Genova, Italy
  - Istituto Italiano di Tecnologia (IIT), Genova, Italy
- Angelo Facchiano
  - National Research Council, Institute of Food Science, Avellino, Italy
- Piero Fariselli
  - Department of Medical Sciences, University of Torino, Torino, Italy
- Deborah Giordano
  - National Research Council, Institute of Food Science, Avellino, Italy
- Federica Isidori
  - Medical Genetics Unit, IRCCS Azienda Ospedaliero-Universitaria di Bologna, Bologna, Italy
- Anna Marabotti
  - Department of Chemistry and Biology “A. Zambelli”, University of Salerno, Fisciano, SA, Italy
- Pier Luigi Martelli
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Stefano Pascarella
  - Department of Biochemical Sciences “A. Rossi Fanelli”, University of Rome “La Sapienza”, Roma, Italy
- Michele Pinelli
  - Department of Molecular Medicine and Medical Biotechnology, University of Naples Federico II, Napoli, Italy
- Tommaso Pippucci
  - Medical Genetics Unit, IRCCS Azienda Ospedaliero-Universitaria di Bologna, Bologna, Italy
- Roberta Russo
  - Department of Molecular Medicine and Medical Biotechnology, University of Naples Federico II, Napoli, Italy
  - CEINGE Biotecnologie Avanzate Franco Salvatore, Napoli, Italy
- Castrense Savojardo
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Bernardina Scafuri
  - Department of Chemistry and Biology “A. Zambelli”, University of Salerno, Fisciano, SA, Italy
- Emidio Capriotti
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
30
|
Liao WW, Asri M, Ebler J, Doerr D, Haukness M, Hickey G, Lu S, Lucas JK, Monlong J, Abel HJ, Buonaiuto S, Chang XH, Cheng H, Chu J, Colonna V, Eizenga JM, Feng X, Fischer C, Fulton RS, Garg S, Groza C, Guarracino A, Harvey WT, Heumos S, Howe K, Jain M, Lu TY, Markello C, Martin FJ, Mitchell MW, Munson KM, Mwaniki MN, Novak AM, Olsen HE, Pesout T, Porubsky D, Prins P, Sibbesen JA, Sirén J, Tomlinson C, Villani F, Vollger MR, Antonacci-Fulton LL, Baid G, Baker CA, Belyaeva A, Billis K, Carroll A, Chang PC, Cody S, Cook DE, Cook-Deegan RM, Cornejo OE, Diekhans M, Ebert P, Fairley S, Fedrigo O, Felsenfeld AL, Formenti G, Frankish A, Gao Y, Garrison NA, Giron CG, Green RE, Haggerty L, Hoekzema K, Hourlier T, Ji HP, Kenny EE, Koenig BA, Kolesnikov A, Korbel JO, Kordosky J, Koren S, Lee H, Lewis AP, Magalhães H, Marco-Sola S, Marijon P, McCartney A, McDaniel J, Mountcastle J, Nattestad M, Nurk S, Olson ND, Popejoy AB, Puiu D, Rautiainen M, Regier AA, Rhie A, Sacco S, Sanders AD, Schneider VA, Schultz BI, Shafin K, Smith MW, Sofia HJ, Abou Tayoun AN, Thibaud-Nissen F, Tricomi FF, Wagner J, Walenz B, Wood JMD, Zimin AV, Bourque G, Chaisson MJP, Flicek P, Phillippy AM, Zook JM, Eichler EE, Haussler D, Wang T, Jarvis ED, Miga KH, Garrison E, Marschall T, Hall IM, Li H, Paten B. A draft human pangenome reference. Nature 2023; 617:312-324. [PMID: 37165242 PMCID: PMC10172123 DOI: 10.1038/s41586-023-05896-x] [Citation(s) in RCA: 204] [Impact Index Per Article: 204.0] [Received: 07/09/2022] [Accepted: 02/28/2023] [Indexed: 05/12/2023]
Abstract
Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals1. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.
Affiliation(s)
- Wen-Wei Liao
  - Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
  - Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
  - Division of Biology and Biomedical Sciences, Washington University School of Medicine, St. Louis, MO, USA
- Mobin Asri
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Jana Ebler
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
- Daniel Doerr
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
- Marina Haukness
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Glenn Hickey
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Shuangjia Lu
  - Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
  - Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
- Julian K Lucas
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Jean Monlong
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Haley J Abel
  - Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, USA
- Silvia Buonaiuto
  - Institute of Genetics and Biophysics, National Research Council, Naples, Italy
- Xian H Chang
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Haoyu Cheng
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Justin Chu
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Vincenza Colonna
  - Institute of Genetics and Biophysics, National Research Council, Naples, Italy
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Jordan M Eizenga
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Xiaowen Feng
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Christian Fischer
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Robert S Fulton
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
  - Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
- Shilpa Garg
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Copenhagen, Denmark
- Cristian Groza
  - Quantitative Life Sciences, McGill University, Montréal, Québec, Canada
- Andrea Guarracino
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
  - Genomics Research Centre, Human Technopole, Milan, Italy
- William T Harvey
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Simon Heumos
  - Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany
  - Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen, Germany
- Kerstin Howe
  - Tree of Life, Wellcome Sanger Institute, Hinxton, Cambridge, UK
- Miten Jain
  - Northeastern University, Boston, MA, USA
- Tsung-Yu Lu
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Charles Markello
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Fergal J Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Katherine M Munson
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Adam M Novak
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Hugh E Olsen
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Trevor Pesout
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- David Porubsky
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Pjotr Prins
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Jonas A Sibbesen
  - Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
- Jouni Sirén
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Chad Tomlinson
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Flavia Villani
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Mitchell R Vollger
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
  - Division of Medical Genetics, University of Washington School of Medicine, Seattle, WA, USA
- Carl A Baker
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Konstantinos Billis
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Sarah Cody
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Robert M Cook-Deegan
  - Barrett and O'Connor Washington Center, Arizona State University, Washington, DC, USA
- Omar E Cornejo
  - Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, CA, USA
- Mark Diekhans
  - Genomics Institute, University of California, Santa Cruz, CA, USA
- Peter Ebert
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
  - Core Unit Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Susan Fairley
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Olivier Fedrigo
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Adam L Felsenfeld
  - National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
- Giulio Formenti
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Adam Frankish
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Yan Gao
  - Center for Computational and Genomic Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Nanibaa' A Garrison
  - Institute for Society and Genetics, College of Letters and Science, University of California, Los Angeles, CA, USA
  - Institute for Precision Health, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
  - Division of General Internal Medicine and Health Services Research, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
- Carlos Garcia Giron
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Richard E Green
- Department of Biomolecular Engineering, University of California, Santa Cruz, CA, USA
- Dovetail Genomics, Scotts Valley, CA, USA
| | - Leanne Haggerty
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Kendra Hoekzema
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Thibaut Hourlier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Hanlee P Ji
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Eimear E Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Barbara A Koenig
- Program in Bioethics and Institute for Human Genetics, University of California, San Francisco, CA, USA
| | | | - Jan O Korbel
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Jennifer Kordosky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - HoJoon Lee
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Alexandra P Lewis
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Hugo Magalhães
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Santiago Marco-Sola
- Computer Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain
- Departament d'Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona, Spain
| | - Pierre Marijon
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Ann McCartney
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Jennifer McDaniel
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | | | | | - Sergey Nurk
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Nathan D Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Alice B Popejoy
- Department of Public Health Sciences, University of California, Davis, CA, USA
| | - Daniela Puiu
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Mikko Rautiainen
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Allison A Regier
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Samuel Sacco
- Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, CA, USA
| | - Ashley D Sanders
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
| | - Valerie A Schneider
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Baergen I Schultz
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | | | - Michael W Smith
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | - Heidi J Sofia
- National Institutes of Health (NIH)-National Human Genome Research Institute, Bethesda, MD, USA
| | - Ahmad N Abou Tayoun
- Al Jalila Genomics Center of Excellence, Al Jalila Children's Specialty Hospital, Dubai, UAE
- Center for Genomic Discovery, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, UAE
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Francesca Floriana Tricomi
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Brian Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | - Aleksey V Zimin
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Guillaume Bourque
- Department of Human Genetics, McGill University, Montréal, Québec, Canada
- Canadian Center for Computational Genomics, McGill University, Montréal, Québec, Canada
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
| | - Mark J P Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - David Haussler
- Genomics Institute, University of California, Santa Cruz, CA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - Ting Wang
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO, USA
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
| | - Erich D Jarvis
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
| | - Karen H Miga
- Genomics Institute, University of California, Santa Cruz, CA, USA
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA.
| | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany.
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany.
| | - Ira M Hall
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA.
- Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA.
| | - Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
| | - Benedict Paten
- Genomics Institute, University of California, Santa Cruz, CA, USA.
| |
|
31
|
Berger B, Yu YW. Navigating bottlenecks and trade-offs in genomic data analysis. Nat Rev Genet 2023; 24:235-250. [PMID: 36476810 DOI: 10.1038/s41576-022-00551-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 10/27/2022] [Indexed: 12/12/2022]
Abstract
Genome sequencing and analysis allow researchers to decode the functional information hidden in DNA sequences as well as to study cell to cell variation within a cell population. Traditionally, the primary bottleneck in genomic analysis pipelines has been the sequencing itself, which has been much more expensive than the computational analyses that follow. However, an important consequence of the continued drive to expand the throughput of sequencing platforms at lower cost is that often the analytical pipelines are struggling to keep up with the sheer amount of raw data produced. Computational cost and efficiency have thus become of ever increasing importance. Recent methodological advances, such as data sketching, accelerators and domain-specific libraries/languages, promise to address these modern computational challenges. However, despite being more efficient, these innovations come with a new set of trade-offs, both expected, such as accuracy versus memory and expense versus time, and more subtle, including the human expertise needed to use non-standard programming interfaces and set up complex infrastructure. In this Review, we discuss how to navigate these new methodological advances and their trade-offs.
Affiliation(s)
- Bonnie Berger
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.
| | - Yun William Yu
- Department of Computer and Mathematical Sciences, University of Toronto Scarborough, Toronto, Ontario, Canada
- Tri-Campus Department of Mathematics, University of Toronto, Toronto, Ontario, Canada
| |
|
32
|
Ahmed M, Kim HJ, Kim DR. Maximizing the utility of public data. Front Genet 2023; 14:1106631. [PMID: 37065493 PMCID: PMC10102460 DOI: 10.3389/fgene.2023.1106631] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2022] [Accepted: 03/21/2023] [Indexed: 04/03/2023] Open
Abstract
The human genome project galvanized the scientific community around an ambitious goal. Upon completion, the project delivered several discoveries, and a new era of research commenced. More importantly, novel technologies and analysis methods materialized during the project period. The cost reduction allowed many more labs to generate high-throughput datasets. The project also served as a model for other extensive collaborations that generated large datasets. These datasets were made public and continue to accumulate in repositories. As a result, the scientific community should consider how these data can be utilized effectively for the purposes of research and the public good. A dataset can be re-analyzed, curated, or integrated with other forms of data to enhance its utility. We highlight three important areas to achieve this goal in this brief perspective. We also emphasize the critical requirements for these strategies to be successful. We draw on our own experience and others in using publicly available datasets to support, develop, and extend our research interest. Finally, we underline the beneficiaries and discuss some risks involved in data reuse.
Affiliation(s)
- Mahmoud Ahmed
- Department of Biochemistry and Convergence Medical Sciences, Institute of Health Sciences, College of Medicine, Gyeongsang National University, Jinju, Republic of Korea
| | - Hyun Joon Kim
- Department of Anatomy and Convergence Medical Sciences, Institute of Health Sciences, College of Medicine, Gyeongsang National University, Jinju, Republic of Korea
| | - Deok Ryong Kim
- Department of Biochemistry and Convergence Medical Sciences, Institute of Health Sciences, College of Medicine, Gyeongsang National University, Jinju, Republic of Korea
- *Correspondence: Deok Ryong Kim,
| |
|
33
|
Grossman RL. Ten lessons for data sharing with a data commons. Sci Data 2023; 10:120. [PMID: 36878917 PMCID: PMC9988927 DOI: 10.1038/s41597-023-02029-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 08/10/2022] [Accepted: 02/17/2023] [Indexed: 03/08/2023] Open
Affiliation(s)
- Robert L Grossman
- University of Chicago, Center for Translational Data Science, Chicago, IL, 60615, USA.
| |
|
34
|
Kirsche M, Prabhu G, Sherman R, Ni B, Battle A, Aganezov S, Schatz MC. Jasmine and Iris: population-scale structural variant comparison and analysis. Nat Methods 2023; 20:408-417. [PMID: 36658279 PMCID: PMC10006329 DOI: 10.1038/s41592-022-01753-3] [Citation(s) in RCA: 30] [Impact Index Per Article: 30.0] [Received: 05/27/2021] [Accepted: 12/15/2022] [Indexed: 01/21/2023]
Abstract
The availability of long reads is revolutionizing studies of structural variants (SVs). However, because SVs vary across individuals and are discovered through imprecise read technologies and methods, they can be difficult to compare. Addressing this, we present Jasmine and Iris (https://github.com/mkirsche/Jasmine/), for fast and accurate SV refinement, comparison and population analysis. Using an SV proximity graph, Jasmine outperforms six widely used comparison methods, including reducing the rate of Mendelian discordance in trio datasets by more than fivefold, and reveals a set of high-confidence de novo SVs confirmed by multiple technologies. We also present a unified callset of 122,813 SVs and 82,379 indels from 31 samples of diverse ancestry sequenced with long reads. We genotype these variants in 1,317 samples from the 1000 Genomes Project and the Genotype-Tissue Expression project with DNA and RNA-sequencing data and assess their widespread impact on gene expression, including within medically relevant genes.
Affiliation(s)
- Melanie Kirsche
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Gautam Prabhu
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Rachel Sherman
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Bohan Ni
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Alexis Battle
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Sergey Aganezov
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA.
| |
|
35
|
Pinto BJ, O’Connor B, Schatz MC, Zarate S, Wilson MA. Concerning the eXclusion in human genomics: The choice of sex chromosome representation in the human genome drastically affects number of identified variants. bioRxiv 2023:2023.02.22.529542. [PMID: 36865318 PMCID: PMC9980147 DOI: 10.1101/2023.02.22.529542] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/25/2023]
Abstract
Over the past 30 years, a community of scientists have pieced together every base pair of the human reference genome from telomere-to-telomere. Interestingly, most human genomics studies omit more than 5% of the genome from their analyses. Under 'normal' circumstances, omitting any chromosome(s) from analysis of the human genome would be reason for concern, the exception being the sex chromosomes. Sex chromosomes in eutherians share an evolutionary origin as an ancestral pair of autosomes. In humans, they share three regions of high sequence identity (~98-100%), which, along with the unique transmission patterns of the sex chromosomes, introduce technical artifacts into genomic analyses. However, the human X chromosome bears numerous important genes, including more "immune response" genes than any other chromosome, which makes its exclusion irresponsible when sex differences across human diseases are widespread. To better characterize the effect that including/excluding the X chromosome may have on variants called, we conducted a pilot study on the Terra cloud platform to replicate a subset of standard genomic practices using both the CHM13 reference genome and sex chromosome complement-aware (SCC-aware) reference genome. We compared quality of variant calling, expression quantification, and allele-specific expression using these two reference genome versions across 50 human samples from the Genotype-Tissue-Expression consortium annotated as females. We found that after correction, the whole X chromosome (100%) can generate reliable variant calls, allowing for the inclusion of the whole genome in human genomics analyses as a departure from the status quo of omitting the sex chromosomes from empirical and clinical genomics studies.
Affiliation(s)
- Brendan J. Pinto
- School of Life Sciences, Arizona State University, Tempe AZ 85282 USA
- Center for Evolution and Medicine, Arizona State University, Tempe AZ 85282 USA
- Department of Zoology, Milwaukee Public Museum, Milwaukee, WI 53233 USA
| | | | - Michael C. Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA
| | - Samantha Zarate
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA
| | - Melissa A. Wilson
- School of Life Sciences, Arizona State University, Tempe AZ 85282 USA
- Center for Evolution and Medicine, Arizona State University, Tempe AZ 85282 USA
- The Biodesign Center for Mechanisms of Evolution, Arizona State University, Tempe AZ 85282 USA
| |
|
36
|
Felker SA, Lawlor JMJ, Hiatt SM, Thompson ML, Latner DR, Finnila CR, Bowling KM, Bonnstetter ZT, Bonini KE, Kelly NR, Kelley WV, Hurst ACE, Kelly MA, Nakouzi G, Hendon LG, Bebin EM, Kenny EE, Cooper GM. Poison exon annotations improve the yield of clinically relevant variants in genomic diagnostic testing. bioRxiv 2023:2023.01.12.523654. [PMID: 36711854 PMCID: PMC9882217 DOI: 10.1101/2023.01.12.523654] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/15/2023]
Abstract
Purpose: Neurodevelopmental disorders (NDDs) often result from rare genetic variation, but genomic testing yield for NDDs remains around 50%, suggesting some clinically relevant rare variants may be missed by standard analyses. Here we analyze "poison exons" (PEs) which, while often absent from standard gene annotations, are alternative exons whose inclusion results in a premature termination codon. Variants that alter PE inclusion can lead to loss-of-function and may be highly penetrant contributors to disease. Methods: We curated published RNA-seq data from developing mouse cortex to define 1,937 PE regions conserved between humans and mice and potentially relevant to NDDs. We then analyzed variants found by genome sequencing in multiple NDD cohorts. Results: Across 2,999 probands, we found six clinically relevant variants in PE regions that were previously overlooked. Five of these variants are in genes that are part of the sodium voltage-gated channel alpha subunit family (SCN1A, SCN2A, and SCN8A), associated with epilepsies. One variant is in SNRPB, associated with Cerebrocostomandibular Syndrome. These variants have moderate to high computational impact assessments, are absent from population variant databases, and were observed in probands with features consistent with those reported for the associated gene. Conclusion: With only a minimal increase in variant analysis burden (most probands had zero or one candidate PE variants in a known NDD gene, with an average of 0.77 per proband), annotation of PEs can improve diagnostic yield for NDDs and likely other congenital conditions.
Affiliation(s)
| | - James MJ Lawlor
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA 35806
| | - Susan M Hiatt
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA 35806
| | | | - Donald R Latner
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA 35806
| | | | - Kevin M Bowling
- Washington University School of Medicine, Saint Louis, MO, USA 63110
| | | | - Katherine E Bonini
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA 10029
| | - Nicole R Kelly
- Department of Pediatrics, Division of Pediatric Genetic Medicine, Children’s Hospital at Montefiore/Montefiore Medical Center/Albert Einstein College of Medicine, Bronx, NY, USA 10467
| | - Whitley V Kelley
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA 35806
| | - Anna CE Hurst
- University of Alabama at Birmingham, Birmingham, AL, USA 35294
| | | | | | - Laura G Hendon
- University of Mississippi Medical Center, Jackson, MS, 39216
| | - E Martina Bebin
- Department of Neurology, University of Alabama at Birmingham, Birmingham, AL, USA 35294
| | - Eimear E Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA 10029
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA 10029
- Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA 10029
| | - Gregory M Cooper
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA 35806
| |
|
37
|
Zhou H, Arapoglou T, Li X, Li Z, Zheng X, Moore J, Asok A, Kumar S, Blue E, Buyske S, Cox N, Felsenfeld A, Gerstein M, Kenny E, Li B, Matise T, Philippakis A, Rehm HL, Sofia HJ, Snyder G, Weng Z, Neale B, Sunyaev S, Lin X. FAVOR: functional annotation of variants online resource and annotator for variation across the human genome. Nucleic Acids Res 2023; 51:D1300-D1311. [PMID: 36350676 PMCID: PMC9825437 DOI: 10.1093/nar/gkac966] [Citation(s) in RCA: 34] [Impact Index Per Article: 34.0] [Received: 08/15/2022] [Revised: 09/25/2022] [Accepted: 10/14/2022] [Indexed: 11/11/2022] Open
Abstract
Large biobank-scale whole genome sequencing (WGS) studies are rapidly identifying a multitude of coding and non-coding variants. They provide an unprecedented resource for illuminating the genetic basis of human diseases. Variant functional annotations play a critical role in WGS analysis, result interpretation, and prioritization of disease- or trait-associated causal variants. Existing functional annotation databases have limited scope to perform online queries and functionally annotate the genotype data of large biobank-scale WGS studies. We develop the Functional Annotation of Variants Online Resources (FAVOR) to meet these pressing needs. FAVOR provides a comprehensive multi-faceted variant functional annotation online portal that summarizes and visualizes findings of all possible nine billion single nucleotide variants (SNVs) across the genome. It allows for rapid variant-, gene- and region-level queries of variant functional annotations. FAVOR integrates variant functional information from multiple sources to describe the functional characteristics of variants and facilitates prioritizing plausible causal variants influencing human phenotypes. Furthermore, we provide a scalable annotation tool, FAVORannotator, to functionally annotate large-scale WGS studies and efficiently store the genotype and their variant functional annotation data in a single file using the annotated Genomic Data Structure (aGDS) format, making downstream analysis more convenient. FAVOR and FAVORannotator are available at https://favor.genohub.org.
Affiliation(s)
- Hufeng Zhou
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Theodore Arapoglou
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Xihao Li
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Zilin Li
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, IN, USA
| | - Xiuwen Zheng
- Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
| | - Jill Moore
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
| | | | - Sushant Kumar
- Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
- Princess Margaret Cancer Centre, Toronto, ON, Canada
| | - Elizabeth E Blue
- Division of Medical Genetics, University of Washington, Seattle, WA, USA
- Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
| | - Steven Buyske
- Department of Statistics, Rutgers, The State University of New Jersey, Piscataway, NJ, USA
| | - Nancy Cox
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
| | | | - Mark Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Eimear Kenny
- Department of Genetics and Genomic Science, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Bingshan Li
- Department of Molecular Physiology and Biophysics, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Tara Matise
- Department of Genetics, Rutgers, The State University of New Jersey, Piscataway, NJ, USA
| | - Anthony Philippakis
- Data Science Platform, Broad Institute of Harvard and MIT, Cambridge, MA, USA
| | - Heidi L Rehm
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Heidi J Sofia
- National Human Genome Research Institute, Bethesda, MD, USA
| | - Grace Snyder
- National Human Genome Research Institute, Bethesda, MD, USA
| | | | - Zhiping Weng
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
| | - Benjamin Neale
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
| | - Shamil R Sunyaev
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Xihong Lin
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Department of Statistics, Harvard University, Cambridge, MA, USA
| |
|
38
|
Rasche H, Hyde C, Davis J, Gladman S, Coraor N, Bretaudeau A, Cuccuru G, Bacon W, Serrano-Solano B, Hillman-Jackson J, Hiltemann S, Zhou M, Grüning B, Stubbs A. Training Infrastructure as a Service. Gigascience 2022; 12:giad048. [PMID: 37395629 PMCID: PMC10316688 DOI: 10.1093/gigascience/giad048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Revised: 05/31/2023] [Accepted: 06/08/2023] [Indexed: 07/04/2023] Open
Abstract
BACKGROUND: Hands-on training, whether in bioinformatics or other domains, often requires significant technical resources and knowledge to set up and run. Instructors must have access to powerful compute infrastructure that can support resource-intensive jobs running efficiently. Often this is achieved using a private server where there is no contention for the queue. However, this places a significant prerequisite knowledge or labor barrier for instructors, who must spend time coordinating deployment and management of compute resources. Furthermore, with the increase of virtual and hybrid teaching, where learners are located in separate physical locations, it is difficult to track student progress as efficiently as during in-person courses. FINDINGS: Originally developed by Galaxy Europe and the Gallantries project, together with the Galaxy community, we have created Training Infrastructure-as-a-Service (TIaaS), aimed at providing user-friendly training infrastructure to the global training community. TIaaS provides dedicated training resources for Galaxy-based courses and events. Event organizers register their course, after which trainees are transparently placed in a private queue on the compute infrastructure, which ensures jobs complete quickly, even when the main queue is experiencing high wait times. A built-in dashboard allows instructors to monitor student progress. CONCLUSIONS: TIaaS provides a significant improvement for instructors and learners, as well as infrastructure administrators. The instructor dashboard makes remote events not only possible but also easy. Students experience continuity of learning, as all training happens on Galaxy, which they can continue to use after the event. In the past 60 months, 504 training events with over 24,000 learners have used this infrastructure for Galaxy training.
Affiliation(s)
- Helena Rasche
- Department of Pathology and Clinical Bioinformatics, Erasmus Medical Center, Dr. Molewaterplein 40, 3015 GD, Rotterdam, the Netherlands
- School of Life Sciences and Technology, Avans University of Applied Sciences, Lovensdijkstraat 63, 4818 AJ Breda, the Netherlands
| | - Cameron Hyde
- Queensland Cyber Infrastructure Foundation Ltd., The University of Queensland, St. Lucia, QLD 4072, Australia
- University of the Sunshine Coast, Maroochydore, QLD 4558, Australia
| | - John Davis
- Department of Biology, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Simon Gladman
- Melbourne Bioinformatics, The University of Melbourne, Melbourne, VIC 3051, Australia
| | - Nate Coraor
- School of Life, Health & Chemical Sciences, The Open University, Milton Keynes MK7 6AA, UK
| | - Anthony Bretaudeau
- IGEPP, INRAE, Institut Agro, University of Rennes, 35000 Rennes, France
- GenOuest Core Facility, University of Rennes, Inria, CNRS, IRISA, 35000 Rennes, France
| | - Gianmauro Cuccuru
- Bioinformatics Group, Department of Computer Science, University of Freiburg, 79110 Freiburg im Breisgau, Germany
| | - Wendi Bacon
- School of Life, Health & Chemical Sciences, The Open University, Milton Keynes MK7 6AA, UK
| | - Beatriz Serrano-Solano
- Euro-Bioimaging ERIC Bio-Hub, EMBL, 69117 Heidelberg, Germany
- Department of Biochemistry and Molecular Biology, Eberly College of Science, The Pennsylvania State University, State College, PA 16802, USA
| | | | - Saskia Hiltemann
- Department of Pathology and Clinical Bioinformatics, Erasmus Medical Center, Dr. Molewaterplein 40, 3015 GD, Rotterdam, the Netherlands
| | - Miaomiao Zhou
- School of Life Sciences and Technology, Avans University of Applied Sciences, Lovensdijkstraat 63, 4818 AJ Breda, the Netherlands
| | - Björn Grüning
- Bioinformatics Group, Department of Computer Science, University of Freiburg, 79110 Freiburg im Breisgau, Germany
| | - Andrew Stubbs
- Department of Pathology and Clinical Bioinformatics, Erasmus Medical Center, Dr. Molewaterplein 40, 3015 GD, Rotterdam, the Netherlands
| |
|
39
|
Dolin RH, Heale BSE, Alterovitz G, Gupta R, Aronson J, Boxwala A, Gothi SR, Haines D, Hermann A, Hongsermeier T, Husami A, Jones J, Naeymi-Rad F, Rapchak B, Ravishankar C, Shalaby J, Terry M, Xie N, Zhang P, Chamala S. Introducing HL7 FHIR Genomics Operations: a developer-friendly approach to genomics-EHR integration. J Am Med Inform Assoc 2022; 30:485-493. [PMID: 36548217 PMCID: PMC9933060 DOI: 10.1093/jamia/ocac246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2022] [Revised: 11/16/2022] [Accepted: 12/02/2022] [Indexed: 12/24/2022] Open
Abstract
OBJECTIVE Enabling clinicians to formulate individualized clinical management strategies from the sea of molecular data remains a fundamentally important but daunting task. Here, we describe efforts towards a new paradigm in genomics-electronic health record (EHR) integration, using a standardized suite of FHIR Genomics Operations that encapsulates the complexity of molecular data so that precision medicine solution developers can focus on building applications. MATERIALS AND METHODS FHIR Genomics Operations essentially "wrap" a genomics data repository, presenting a uniform interface to applications. More importantly, operations encapsulate the complexity of data within a repository and normalize redundant data representations, which is particularly relevant in genomics, where a tremendous amount of raw data exists in often-complex non-FHIR formats. RESULTS Fifteen FHIR Genomics Operations have been developed, designed to support a wide range of clinical scenarios, such as variant discovery; clinical trial matching; hereditary condition and pharmacogenomic screening; and variant reanalysis. Operations are being matured through the HL7 balloting process, connectathons, pilots, and the HL7 FHIR Accelerator program. DISCUSSION Next-generation sequencing can identify thousands to millions of variants, whose clinical significance can change over time as our knowledge evolves. To manage such a large volume of dynamic and complex data, new models of genomics-EHR integration are needed. Qualitative observations to date suggest that freeing application developers from the need to understand the nuances of genomic data, and instead basing applications on standardized APIs, can not only accelerate integration but also dramatically expand the applications of Omic data in driving precision care at scale for all.
Affiliation(s)
- Robert H Dolin
- Corresponding Author: Robert H. Dolin, MD, Elimu Informatics, 1709 Julian Ct, El Cerrito, CA 94530, USA
| | | | - Gil Alterovitz
- Brigham and Women’s Hospital, Boston, Massachusetts, USA; Harvard/MIT Division of Health Sciences and Technology, Harvard Medical School, Boston, Massachusetts, USA
| | - Rohan Gupta
- Shri Mata Vaishno Devi University, Katra, Jammu and Kashmir, India
| | | | | | - Shaileshbhai R Gothi
- Department of Pathology, Immunology and Laboratory Medicine, University of Florida, Gainesville, Florida, USA
| | - David Haines
- Leap of Faith Technologies, Libertyville, Illinois, USA
| | - Arthur Hermann
- Department of Health IT Strategy & Policy, Kaiser Permanente, Pasadena, California, USA
| | | | - Ammar Husami
- Division of Human Genetics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, USA; Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, USA
| | - James Jones
- Computational Health Informatics Program, Boston Children’s Hospital, Boston, Massachusetts, USA
| | | | | | | | | | - May Terry
- MITRE Corporation, McLean, Virginia, USA
| | - Ning Xie
- Biomedical Cybernetics Laboratory, Department of Medicine, Brigham and Women’s Hospital, Boston, Massachusetts, USA
| | - Powell Zhang
- Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
| | - Srikar Chamala
- Keck School of Medicine, Department of Pathology, University of Southern California, Los Angeles, California, USA; Department of Pathology and Laboratory Medicine, Children's Hospital Los Angeles, Los Angeles, California, USA
| |
|
40
|
Sheffield NC, Bonazzi VR, Bourne PE, Burdett T, Clark T, Grossman RL, Spjuth O, Yates AD. From biomedical cloud platforms to microservices: next steps in FAIR data and analysis. Sci Data 2022; 9:553. [PMID: 36075919 PMCID: PMC9458632 DOI: 10.1038/s41597-022-01619-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2022] [Accepted: 08/08/2022] [Indexed: 11/29/2022] Open
Affiliation(s)
- Nathan C Sheffield
- Center for Public Health Genomics, School of Medicine, University of Virginia, 22908, Charlottesville, VA, USA.
- School of Data Science, University of Virginia, 22904, Charlottesville, VA, USA.
- Department of Biomedical Engineering, School of Medicine, University of Virginia, 22904, Charlottesville, VA, USA.
- Department of Public Health Sciences, School of Medicine, University of Virginia, 22908, Charlottesville, VA, USA.
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, 22908, Charlottesville, VA, USA.
| | | | - Philip E Bourne
- School of Data Science, University of Virginia, 22904, Charlottesville, VA, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, 22904, Charlottesville, VA, USA
| | - Tony Burdett
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Timothy Clark
- School of Data Science, University of Virginia, 22904, Charlottesville, VA, USA
- Department of Public Health Sciences, School of Medicine, University of Virginia, 22908, Charlottesville, VA, USA
| | - Robert L Grossman
- Center for Translational Data Science, University of Chicago, Chicago, IL, 60615, USA
| | - Ola Spjuth
- Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, 75124, Uppsala, Sweden
| | - Andrew D Yates
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| |
|
41
|
Seaby EG, Smedley D, Taylor Tavares AL, Brittain H, van Jaarsveld RH, Baralle D, Rehm HL, O'Donnell-Luria A, Ennis S. A gene-to-patient approach uplifts novel disease gene discovery and identifies 18 putative novel disease genes. Genet Med 2022; 24:1697-1707. [PMID: 35532742 DOI: 10.1016/j.gim.2022.04.019] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/22/2022] [Revised: 04/14/2022] [Accepted: 04/14/2022] [Indexed: 12/14/2022] Open
Abstract
PURPOSE Exome and genome sequencing have drastically accelerated novel disease gene discoveries. However, discovery is still hindered by myriad variants of uncertain significance found in genes of undetermined biological function. This necessitates intensive functional experiments on genes of equal predicted causality, leading to a major bottleneck. METHODS We apply the loss-of-function observed/expected upper-bound fraction metric of intolerance to gene inactivation to curate a list of predicted haploinsufficient disease genes. Using data from the 100,000 Genomes Project, we adopt a gene-to-patient approach that matches de novo loss-of-function variants in constrained genes to patients with rare disease. Through large-scale aggregation of data, we reduce excess analytical noise currently hindering novel discoveries. RESULTS Results from 13,949 trios revealed 643 rare, de novo predicted loss-of-function events filtered from 1044 loss-of-function observed/expected upper-bound fraction-constrained genes. A total of 168 variants occurred within 126 genes without a known disease-gene relationship. Of these, 27 genes had >1 kindred affected, and for 18 of these genes, multiple kindreds had overlapping phenotypes. Two years after initial analysis, 11 of 18 (61%) of these genes have been independently published as novel disease gene discoveries. CONCLUSION Using large cohorts and adopting gene-based approaches can rapidly and objectively accelerate dominantly inherited novel gene discovery by targeting the most appropriate genes for functional validation.
Affiliation(s)
- Eleanor G Seaby
- Human Development and Health, Faculty of Medicine, University of Southampton, Southampton, United Kingdom; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Boston, MA; Center for Genomic Medicine, Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA; Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA.
| | - Damian Smedley
- Genomics England, Dawson Hall, Charterhouse Square, London, EC1M 6BQ, United Kingdom
| | | | - Helen Brittain
- Genomics England, Dawson Hall, Charterhouse Square, London, EC1M 6BQ, United Kingdom
| | | | - Diana Baralle
- Human Development and Health, Faculty of Medicine, University of Southampton, Southampton, United Kingdom
| | - Heidi L Rehm
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Boston, MA; Center for Genomic Medicine, Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA
| | - Anne O'Donnell-Luria
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Boston, MA; Center for Genomic Medicine, Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA; Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA
| | - Sarah Ennis
- Human Development and Health, Faculty of Medicine, University of Southampton, Southampton, United Kingdom
| | | |
|
42
|
Diversifying the genomic data science research community. Genome Res 2022; 32:gr.276496.121. [PMID: 35858750 PMCID: PMC9341509 DOI: 10.1101/gr.276496.121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2022] [Accepted: 06/02/2022] [Indexed: 11/25/2022]
Abstract
Over the past 20 years, the explosion of genomic data collection and the cloud computing revolution have made computational and data science research accessible to anyone with a web browser and an internet connection. However, students at institutions with limited resources have received relatively little exposure to curricula or professional development opportunities that lead to careers in genomic data science. To broaden participation in genomics research, the scientific community needs to support these programs in local education and research at underserved institutions (UIs). These include community colleges, historically Black colleges and universities, Hispanic-serving institutions, and tribal colleges and universities that support ethnically, racially, and socioeconomically underrepresented students in the United States. We have formed the Genomic Data Science Community Network to support students, faculty, and their networks to identify opportunities and broaden access to genomic data science. These opportunities include expanding access to infrastructure and data, providing UI faculty development opportunities, strengthening collaborations among faculty, recognizing UI teaching and research excellence, fostering student awareness, developing modular and open-source resources, expanding course-based undergraduate research experiences (CUREs), building curriculum, supporting student professional development and research, and removing financial barriers through funding programs and collaborator support.
|
43
|
Wiley K, Findley L, Goldrich M, Rakhra-Burris TK, Stevens A, Williams P, Bult CJ, Chisholm R, Deverka P, Ginsburg GS, Green ED, Jarvik G, Mensah GA, Ramos E, Relling MV, Roden DM, Rowley R, Alterovitz G, Aronson S, Bastarache L, Cimino JJ, Crowgey EL, Del Fiol G, Freimuth RR, Hoffman MA, Jeff J, Johnson K, Kawamoto K, Madhavan S, Mendonca EA, Ohno-Machado L, Pratap S, Taylor CO, Ritchie MD, Walton N, Weng C, Zayas-Cabán T, Manolio TA, Williams MS. A research agenda to support the development and implementation of genomics-based clinical informatics tools and resources. J Am Med Inform Assoc 2022; 29:1342-1349. [PMID: 35485600 PMCID: PMC9277642 DOI: 10.1093/jamia/ocac057] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/17/2021] [Revised: 02/22/2022] [Accepted: 04/08/2022] [Indexed: 11/12/2022] Open
Abstract
OBJECTIVE The Genomic Medicine Working Group of the National Advisory Council for Human Genome Research virtually hosted its 13th genomic medicine meeting titled "Developing a Clinical Genomic Informatics Research Agenda". The meeting's goal was to articulate a research strategy to develop Genomics-based Clinical Informatics Tools and Resources (GCIT) to improve the detection, treatment, and reporting of genetic disorders in clinical settings. MATERIALS AND METHODS Experts from government agencies, the private sector, and academia in genomic medicine and clinical informatics were invited to address the meeting's goals. Invitees were also asked to complete a survey to assess important considerations needed to develop a genomic-based clinical informatics research strategy. RESULTS Outcomes from the meeting included identifying short-term research needs, such as designing and implementing standards-based interfaces between laboratory information systems and electronic health records, as well as long-term projects, such as identifying and addressing barriers related to the establishment and implementation of genomic data exchange systems that, in turn, the research community could help address. DISCUSSION Discussions centered on identifying gaps and barriers that impede the use of GCIT in genomic medicine. Emergent themes from the meeting included developing an implementation science framework, defining a value proposition for all stakeholders, fostering engagement with patients and partners to develop applications under patient control, promoting the use of relevant clinical workflows in research, and lowering related barriers to regulatory processes. Another key theme was recognizing pervasive biases in data and information systems, algorithms, access, value, and knowledge repositories and identifying ways to resolve them.
Affiliation(s)
- Ken Wiley
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
| | - Laura Findley
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
| | - Madison Goldrich
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
| | - Tejinder K Rakhra-Burris
- Department of Medicine, Center for Applied Genomics & Precision Medicine, Duke University, Durham, North Carolina, USA
| | - Ana Stevens
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
| | - Pamela Williams
- Department of Medicine, Center for Applied Genomics & Precision Medicine, Duke University, Durham, North Carolina, USA
| | | | - Rex Chisholm
- Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
| | - Patricia Deverka
- Center for Translational and Policy Research in Precision Medicine, University of California at San Francisco, San Francisco, California, USA
| | - Geoffrey S Ginsburg
- All of Us Research Program, National Institutes of Health, Bethesda, Maryland, USA
| | - Eric D Green
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
| | - Gail Jarvik
- Division of Medical Genetics, University of Washington, Seattle, Washington, USA
| | - George A Mensah
- National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, USA
| | - Erin Ramos
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
| | - Mary V Relling
- Pharmacy and Pharmaceutical Sciences, St. Jude Children's Research Hospital, Memphis, Tennessee, USA
| | - Dan M Roden
- Departments of Medicine, Pharmacology, and Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Robb Rowley
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
| | - Gil Alterovitz
- Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
| | - Samuel Aronson
- Mass General Brigham, Research Information Sciences and Computing, Somerville, Massachusetts, USA
| | - Lisa Bastarache
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - James J Cimino
- Heersink School of Medicine, University of Alabama at Birmingham, Birmingham, Alabama, USA
| | | | - Guilherme Del Fiol
- Department of Biomedical Informatics, University of Utah School of Medicine, Salt Lake City, Utah, USA
| | - Robert R Freimuth
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, Minnesota, USA
| | - Mark A Hoffman
- School of Medicine, Children's Mercy Hospital Kansas City, University of Missouri Kansas City, Lees Summit, Missouri, USA
| | | | - Kevin Johnson
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Kensaku Kawamoto
- Department of Biomedical Informatics, University of Utah School of Medicine, Salt Lake City, Utah, USA
| | - Subha Madhavan
- Innovation Center for Biomedical Informatics, Georgetown University, Washington, District of Columbia, USA
| | - Eneida A Mendonca
- Regenstrief Institute, Inc., Indianapolis, Indiana, USA; Department of Pediatrics, Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, Indiana, USA
| | - Lucila Ohno-Machado
- Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA
| | - Siddharth Pratap
- Bioinformatics Core, Meharry Medical College, Nashville, Tennessee, USA
| | | | - Marylyn D Ritchie
- Department of Genetics, Perelman School of Medicine, Institute for Biomedical Informatics, Penn Center for Precision Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Nephi Walton
- Intermountain Precision Genomics, Intermountain Healthcare, St George, Utah, USA
| | - Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
| | - Teresa Zayas-Cabán
- National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
| | - Teri A Manolio
- National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
| | - Marc S Williams
- Geisinger, Genomic Medicine Institute, Danville, Pennsylvania, USA
| |
|
44
|
Afgan E, Nekrutenko A, Grüning BA, Blankenberg D, Goecks J, Schatz MC, Ostrovsky AE, Mahmoud A, Lonie AJ, Syme A, Fouilloux A, Bretaudeau A, Nekrutenko A, Kumar A, Eschenlauer AC, DeSanto AD, Guerler A, Serrano-Solano B, Batut B, Grüning BA, Langhorst BW, Carr B, Raubenolt BA, Hyde CJ, Bromhead CJ, Barnett CB, Royaux C, Gallardo C, Blankenberg D, Fornika DJ, Baker D, Bouvier D, Clements D, de Lima Morais DA, Tabernero DL, Lariviere D, Nasr E, Afgan E, Zambelli F, Heyl F, Psomopoulos F, Coppens F, Price GR, Cuccuru G, Corguillé GL, Von Kuster G, Akbulut GG, Rasche H, Hotz HR, Eguinoa I, Makunin I, Ranawaka IJ, Taylor JP, Joshi J, Hillman-Jackson J, Goecks J, Chilton JM, Kamali K, Suderman K, Poterlowicz K, Yvan LB, Lopez-Delisle L, Sargent L, Bassetti ME, Tangaro MA, van den Beek M, Čech M, Bernt M, Fahrner M, Tekman M, Föll MC, Schatz MC, Crusoe MR, Roncoroni M, Kucher N, Coraor N, Stoler N, Rhodes N, Soranzo N, Pinter N, Goonasekera NA, Moreno PA, Videm P, Melanie P, Mandreoli P, Jagtap PD, Gu Q, Weber RJM, Lazarus R, Vorderman RHP, Hiltemann S, Golitsynskiy S, Garg S, Bray SA, Gladman SL, Leo S, Mehta SP, Griffin TJ, Jalili V, Yves V, Wen V, Nagampalli VK, Bacon WA, de Koning W, Maier W, Briggs PJ. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update. Nucleic Acids Res 2022; 50:W345-W351. [PMID: 35446428 PMCID: PMC9252830 DOI: 10.1093/nar/gkac247] [Citation(s) in RCA: 279] [Impact Index Per Article: 139.5] [Received: 02/03/2022] [Revised: 03/17/2022] [Accepted: 03/30/2022] [Indexed: 01/19/2023] Open
Abstract
Galaxy is a mature, browser accessible workbench for scientific computing. It enables scientists to share, analyze and visualize their own data, with minimal technical impediments. A thriving global community continues to use, maintain and contribute to the project, with support from multiple national infrastructure providers that enable freely accessible analysis and training services. The Galaxy Training Network supports free, self-directed, virtual training with >230 integrated tutorials. Project engagement metrics have continued to grow over the last 2 years, including source code contributions, publications, software packages wrapped as tools, registered users and their daily analysis jobs, and new independent specialized servers. Key Galaxy technical developments include an improved user interface for launching large-scale analyses with many files, interactive tools for exploratory data analysis, and a complete suite of machine learning tools. Important scientific developments enabled by Galaxy include Vertebrate Genome Project (VGP) assembly workflows and global SARS-CoV-2 collaborations.
|
45
|
Rahimzadeh V, Lawson J, Rushton G, Dove ES. Leveraging Algorithms to Improve Decision-Making Workflows for Genomic Data Access and Management. Biopreserv Biobank 2022; 20:429-435. [PMID: 35772014 PMCID: PMC9603251 DOI: 10.1089/bio.2022.0042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
Studies on the ethics of automating clinical or research decision making using artificial intelligence and other algorithmic tools abound. Less attention has been paid, however, to the scope for, and ethics of, automating decision making within regulatory apparatuses governing the access, use, and exchange of data involving humans for research. In this article, we map how the binary logic flows and real-time capabilities of automated decision support (ADS) systems may be leveraged to accelerate one rate-limiting step in scientific discovery: data access management. We contend that improved auditability, consistency, and efficiency of the data access request process using ADS systems have the potential to yield fairer outcomes in requests for data largely sourced from biospecimens and biobanked samples. This procedural justice rationale reinforces a broader set of participant and data subject rights that data access committees (DACs) indirectly protect. DACs protect the rights of citizens to benefit from science by bringing researchers closer to the data they need to advance that science. DACs also protect the informational dignities of individuals and communities by ensuring the data being accessed are used in ways consistent with participant values. We discuss the development of the Global Alliance for Genomics and Health Data Use Ontology standard as a test case of ADS for genomic data access management specifically, and we synthesize relevant ethical, legal, and social challenges to its implementation in practice. We conclude with an agenda of future research needed to thoughtfully advance strategies for computational governance that endeavor to instill public trust in, and maximize the scientific value of, health-related human data across data types, environments, and user communities.
Affiliation(s)
- Vasiliki Rahimzadeh
- Stanford Center for Biomedical Ethics, Stanford University, Stanford, California, USA
| | - Jonathan Lawson
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
| | - Greg Rushton
- Stanford Center for Biomedical Ethics, Stanford University, Stanford, California, USA
| | - Edward S Dove
- School of Law, University of Edinburgh, Edinburgh, United Kingdom
| |
|
46
|
Boycott KM, Azzariti DR, Hamosh A, Rehm HL. Seven years since the launch of the Matchmaker Exchange: The evolution of genomic matchmaking. Hum Mutat 2022; 43:659-667. [PMID: 35537081 PMCID: PMC9133175 DOI: 10.1002/humu.24373] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 03/15/2022] [Accepted: 03/22/2022] [Indexed: 11/09/2022]
Abstract
The Matchmaker Exchange (MME) was launched in 2015 to provide a robust mechanism to discover novel disease-gene relationships. It operates as a federated network connecting databases holding relevant data using a common application programming interface, where two or more users are looking for a match for the same gene (two-sided matchmaking). Seven years from its launch, it is clear that the MME is making outstanding contributions to understanding the morbid anatomy of the genome. The number of unique genes present across the MME has steadily increased over time; there are currently >13,520 unique genes (~68% of all protein-coding genes) connected across the MME's eight genomic matchmaking nodes, GeneMatcher, DECIPHER, PhenomeCentral, MyGene2, seqr, Initiative on Rare and Undiagnosed Disease, PatientMatcher, and the RD-Connect Genome-Phenome Analysis Platform. The collective data set accessible across the MME currently includes more than 120,000 cases from over 12,000 contributors in 98 countries. The discovery of potential new disease-gene relationships is happening daily and international collaborative teams are moving these advances forward to publication, now numbering well over 500. Expansion of data sharing into routine clinical practice by clinicians, genetic counselors, and clinical laboratories has ensured access to discovery for even more individuals with undiagnosed rare genetic diseases. Tens of thousands of patients and their family members have been directly or indirectly impacted by the discoveries facilitated by two-sided genomic matchmaking. MME supports further connections to the literature (PubCaseFinder) and to human and model organism resources (Monarch Initiative) and scientists (ModelMatcher). Efforts are now underway to explore additional approaches to matchmaking at the gene or variant level where there is only one querier (one-sided matchmaking). 
Genomic matchmaking has proven its utility over the past 7 years and will continue to facilitate discoveries in the years to come.
Affiliation(s)
- Kym M. Boycott
- Children’s Hospital of Eastern Ontario Research Institute, University of Ottawa, Ottawa, Ontario, Canada
| | - Danielle R. Azzariti
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
| | - Ada Hamosh
- McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
| | - Heidi L. Rehm
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA
| |
|
47
|
Opportunities and challenges for the use of common controls in sequencing studies. Nat Rev Genet 2022; 23:665-679. [PMID: 35581355 DOI: 10.1038/s41576-022-00487-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Accepted: 03/22/2022] [Indexed: 01/02/2023]
Abstract
Genome-wide association studies using large-scale genome and exome sequencing data have become increasingly valuable in identifying associations between genetic variants and disease, transforming basic research and translational medicine. However, this progress has not been equally shared across all people and conditions, in part due to limited resources. Leveraging publicly available sequencing data as external common controls, rather than sequencing new controls for every study, can better allocate resources by augmenting control sample sizes or providing controls where none existed. However, common control studies must be carefully planned and executed as even small differences in sample ascertainment and processing can result in substantial bias. Here, we discuss challenges and opportunities for the robust use of common controls in high-throughput sequencing studies, including study design, quality control and statistical approaches. Thoughtful generation and use of large and valuable genetic sequencing data sets will enable investigation of a broader and more representative set of conditions, environments and genetic ancestries than otherwise possible.
|
48
|
Muenzen KD, Amendola LM, Kauffman TL, Mittendorf KF, Bensen JT, Chen F, Green R, Powell BC, Kvale M, Angelo F, Farnan L, Fullerton SM, Robinson JO, Li T, Murali P, Lawlor JM, Ou J, Hindorff LA, Jarvik GP, Crosslin DR. Lessons learned and recommendations for data coordination in collaborative research: The CSER consortium experience. HGG ADVANCES 2022; 3:100120. [PMID: 35707062 PMCID: PMC9190054 DOI: 10.1016/j.xhgg.2022.100120] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/18/2022] [Accepted: 05/16/2022] [Indexed: 11/18/2022] Open
Abstract
Integrating data across heterogeneous research environments is a key challenge in multi-site, collaborative research projects. While it is important to allow for natural variation in data collection protocols across research sites, it is also important to achieve interoperability between datasets in order to reap the full benefits of collaborative work. However, there are few standards to guide the data coordination process from project conception to completion. In this paper, we describe the experiences of the Clinical Sequence Evidence-Generating Research (CSER) consortium Data Coordinating Center (DCC), which coordinated harmonized survey and genomic sequencing data from seven clinical research sites from 2020 to 2022. Using input from multiple consortium working groups and from CSER leadership, we first identify 14 lessons learned from CSER in the categories of communication, harmonization, informatics, compliance, and analytics. We then distill these lessons learned into 11 recommendations for future research consortia in the areas of planning, communication, informatics, and analytics. We recommend that planning and budgeting for data coordination activities occur as early as possible during consortium conceptualization and development to minimize downstream complications. We also find that clear, reciprocal, and continuous communication between consortium stakeholders and the DCC is equally important to maintaining a secure and centralized informatics ecosystem for pooling data. Finally, we discuss the importance of actively interrogating current approaches to data governance, particularly for research studies that straddle the research-clinical divide.
|