1
|
Faria D, Eugénio P, Contreiras Silva M, Balbi L, Bedran G, Kallor AA, Nunes S, Palkowski A, Waleron M, Alfaro JA, Pesquita C. The Immunopeptidomics Ontology (ImPO). Database (Oxford) 2024; 2024:baae014. [PMID: 38857186 PMCID: PMC11164101 DOI: 10.1093/database/baae014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2023] [Revised: 11/30/2023] [Accepted: 02/22/2024] [Indexed: 06/12/2024]
Abstract
The adaptive immune response plays a vital role in eliminating infected and aberrant cells from the body. This process hinges on the presentation of short peptides by major histocompatibility complex Class I molecules on the cell surface. Immunopeptidomics, the study of peptides displayed on cells, delves into the wide variety of these peptides. Understanding the mechanisms behind antigen processing and presentation is crucial for effectively evaluating cancer immunotherapies. As an emerging domain, immunopeptidomics currently lacks standardization-there is neither an established terminology nor formally defined semantics-a critical concern considering the complexity, heterogeneity, and growing volume of data involved in immunopeptidomics studies. Additionally, there is a disconnection between how the proteomics community delivers the information about antigen presentation and its uptake by the clinical genomics community. Considering the significant relevance of immunopeptidomics in cancer, this shortcoming must be addressed to bridge the gap between research and clinical practice. In this work, we detail the development of the ImmunoPeptidomics Ontology, ImPO, the first effort at standardizing the terminology and semantics in the domain. ImPO aims to encapsulate and systematize data generated by immunopeptidomics experimental processes and bioinformatics analysis. ImPO establishes cross-references to 24 relevant ontologies, including the National Cancer Institute Thesaurus, Mondo Disease Ontology, Logical Observation Identifier Names and Codes and Experimental Factor Ontology. Although ImPO was developed using expert knowledge to characterize a large and representative data collection, it may be readily used to encode other datasets within the domain. Ultimately, ImPO facilitates data integration and analysis, enabling querying, inference and knowledge generation and importantly bridging the gap between the clinical proteomics and genomics communities. As the field of immunogenomics uses protein-level immunopeptidomics data, we expect ImPO to play a key role in supporting a rich and standardized description of the large-scale data that emerging high-throughput technologies are expected to bring in the near future. Ontology URL: https://zenodo.org/record/10237571 Project GitHub: https://github.com/liseda-lab/ImPO/blob/main/ImPO.owl.
Collapse
Affiliation(s)
- Daniel Faria
- INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Rua Alves Redol, 9, Lisboa 1000-029, Portugal
| | - Patrícia Eugénio
- LASIGE, Faculdade de Ciências da Universidade de Lisboa, Campo Grande, Lisboa 1749-016, Portugal
| | - Marta Contreiras Silva
- LASIGE, Faculdade de Ciências da Universidade de Lisboa, Campo Grande, Lisboa 1749-016, Portugal
| | - Laura Balbi
- LASIGE, Faculdade de Ciências da Universidade de Lisboa, Campo Grande, Lisboa 1749-016, Portugal
| | - Georges Bedran
- International Centre for Cancer Vaccine Science, University of Gdansk, ul. Kładki 24, Gdańsk 80-822, Poland
| | - Ashwin Adrian Kallor
- International Centre for Cancer Vaccine Science, University of Gdansk, ul. Kładki 24, Gdańsk 80-822, Poland
| | - Susana Nunes
- LASIGE, Faculdade de Ciências da Universidade de Lisboa, Campo Grande, Lisboa 1749-016, Portugal
| | - Aleksander Palkowski
- International Centre for Cancer Vaccine Science, University of Gdansk, ul. Kładki 24, Gdańsk 80-822, Poland
| | - Michal Waleron
- International Centre for Cancer Vaccine Science, University of Gdansk, ul. Kładki 24, Gdańsk 80-822, Poland
| | - Javier A Alfaro
- International Centre for Cancer Vaccine Science, University of Gdansk, ul. Kładki 24, Gdańsk 80-822, Poland
- Department of Biochemistry and Microbiology, University of Victoria, 3800 Finnerty Rd, Victoria, British Columbia, BC V8P 5C2, Canada
- Institute for Adaptive and Neural Computation, School of Informatics, University of Edinburgh, Old College, South Bridge, Edinburgh, EH8 9YL, UK
- The Canadian Association for Responsible AI in Medicine, Victoria, Canada
| | - Catia Pesquita
- LASIGE, Faculdade de Ciências da Universidade de Lisboa, Campo Grande, Lisboa 1749-016, Portugal
| |
Collapse
|
2
|
Callahan TJ, Tripodi IJ, Stefanski AL, Cappelletti L, Taneja SB, Wyrwa JM, Casiraghi E, Matentzoglu NA, Reese J, Silverstein JC, Hoyt CT, Boyce RD, Malec SA, Unni DR, Joachimiak MP, Robinson PN, Mungall CJ, Cavalleri E, Fontana T, Valentini G, Mesiti M, Gillenwater LA, Santangelo B, Vasilevsky NA, Hoehndorf R, Bennett TD, Ryan PB, Hripcsak G, Kahn MG, Bada M, Baumgartner WA, Hunter LE. An open source knowledge graph ecosystem for the life sciences. Sci Data 2024; 11:363. [PMID: 38605048 PMCID: PMC11009265 DOI: 10.1038/s41597-024-03171-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2023] [Accepted: 03/21/2024] [Indexed: 04/13/2024] Open
Abstract
Translational research requires data at multiple scales of biological organization. Advancements in sequencing and multi-omics technologies have increased the availability of these data, but researchers face significant integration challenges. Knowledge graphs (KGs) are used to model complex phenomena, and methods exist to construct them automatically. However, tackling complex biomedical integration problems requires flexibility in the way knowledge is modeled. Moreover, existing KG construction methods provide robust tooling at the cost of fixed or limited choices among knowledge representation models. PheKnowLator (Phenotype Knowledge Translator) is a semantic ecosystem for automating the FAIR (Findable, Accessible, Interoperable, and Reusable) construction of ontologically grounded KGs with fully customizable knowledge representation. The ecosystem includes KG construction resources (e.g., data preparation APIs), analysis tools (e.g., SPARQL endpoint resources and abstraction algorithms), and benchmarks (e.g., prebuilt KGs). We evaluated the ecosystem by systematically comparing it to existing open-source KG construction methods and by analyzing its computational performance when used to construct 12 different large-scale KGs. With flexible knowledge representation, PheKnowLator enables fully customizable KGs without compromising performance or usability.
Collapse
Affiliation(s)
- Tiffany J Callahan
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA.
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, 10032, USA.
| | - Ignacio J Tripodi
- Computer Science Department, Interdisciplinary Quantitative Biology, University of Colorado Boulder, Boulder, CO, 80301, USA
| | - Adrianne L Stefanski
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
| | - Luca Cappelletti
- AnacletoLab, Dipartimento di Informatica, Universit`a degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
| | - Sanya B Taneja
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, 15260, USA
| | - Jordan M Wyrwa
- Department of Physical Medicine and Rehabilitation, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
| | - Elena Casiraghi
- AnacletoLab, Dipartimento di Informatica, Universit`a degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
| | | | - Justin Reese
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
| | - Jonathan C Silverstein
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15206, USA
| | - Charles Tapley Hoyt
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, 02115, USA
| | - Richard D Boyce
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15206, USA
| | - Scott A Malec
- Division of Translational Informatics, University of New Mexico School of Medicine, Albuquerque, NM, 87131, USA
| | - Deepak R Unni
- SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| | - Marcin P Joachimiak
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
| | - Peter N Robinson
- Berlin Institute of Health at Charité-Universitatsmedizin, 10117, Berlin, Germany
| | - Christopher J Mungall
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
| | - Emanuele Cavalleri
- AnacletoLab, Dipartimento di Informatica, Universit`a degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
| | - Tommaso Fontana
- AnacletoLab, Dipartimento di Informatica, Universit`a degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
| | - Giorgio Valentini
- AnacletoLab, Dipartimento di Informatica, Universit`a degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
- ELLIS, European Laboratory for Learning and Intelligent Systems, Milan Unit, Italy
| | - Marco Mesiti
- AnacletoLab, Dipartimento di Informatica, Universit`a degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
| | - Lucas A Gillenwater
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
| | - Brook Santangelo
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
| | - Nicole A Vasilevsky
- Data Collaboration Center, Critical Path Institute, 1840 E River Rd. Suite 100, Tucson, AZ, 85718, USA
| | - Robert Hoehndorf
- Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Kingdom of Saudi Arabia
| | - Tellen D Bennett
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Department of Pediatrics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
| | - Patrick B Ryan
- Janssen Research and Development, Raritan, NJ, 08869, USA
| | - George Hripcsak
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, 10032, USA
| | - Michael G Kahn
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
| | - Michael Bada
- Division of General Internal Medicine, University of Colorado School of Medicine, Aurora, CO, 80045, USA
| | - William A Baumgartner
- Division of General Internal Medicine, University of Colorado School of Medicine, Aurora, CO, 80045, USA.
| | - Lawrence E Hunter
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA.
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA.
| |
Collapse
|
3
|
Kilicoglu H, Ensan F, McInnes B, Wang LL. Semantics-enabled biomedical literature analytics. J Biomed Inform 2024; 150:104588. [PMID: 38244957 DOI: 10.1016/j.jbi.2024.104588] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2024] [Accepted: 01/10/2024] [Indexed: 01/22/2024]
Affiliation(s)
- Halil Kilicoglu
- School of Information Sciences, University of Illinois Urbana Champaign, Champaign, IL, USA.
| | - Faezeh Ensan
- Department of Electrical, Computer, and Biomedical Engineering, Toronto Metropolitan University, Toronto, ON, Canada.
| | - Bridget McInnes
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA.
| | - Lucy Lu Wang
- Information School, University of Washington, Seattle, WA, USA.
| |
Collapse
|
4
|
Zhang Y, Petukhov V, Biederstedt E, Que R, Zhang K, Kharchenko PV. Gene panel selection for targeted spatial transcriptomics. Genome Biol 2024; 25:35. [PMID: 38273415 PMCID: PMC10811939 DOI: 10.1186/s13059-024-03174-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2023] [Accepted: 01/12/2024] [Indexed: 01/27/2024] Open
Abstract
Targeted spatial transcriptomics hold particular promise in analyzing complex tissues. Most such methods, however, measure only a limited panel of transcripts, which need to be selected in advance to inform on the cell types or processes being studied. A limitation of existing gene selection methods is their reliance on scRNA-seq data, ignoring platform effects between technologies. Here we describe gpsFISH, a computational method performing gene selection through optimizing detection of known cell types. By modeling and adjusting for platform effects, gpsFISH outperforms other methods. Furthermore, gpsFISH can incorporate cell type hierarchies and custom gene preferences to accommodate diverse design requirements.
Collapse
Affiliation(s)
- Yida Zhang
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Department of Neurobiology, Duke University, Durham, NC, USA
| | - Viktor Petukhov
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Evan Biederstedt
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Richard Que
- Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
| | - Kun Zhang
- Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
- San Diego Institute of Science, Altos Labs, San Diego, CA, USA
| | - Peter V Kharchenko
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
- San Diego Institute of Science, Altos Labs, San Diego, CA, USA.
| |
Collapse
|
5
|
Xiong YX, Zhang XF. scDOT: enhancing single-cell RNA-Seq data annotation and uncovering novel cell types through multi-reference integration. Brief Bioinform 2024; 25:bbae072. [PMID: 38436563 PMCID: PMC10939303 DOI: 10.1093/bib/bbae072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2023] [Revised: 01/12/2024] [Accepted: 02/07/2024] [Indexed: 03/05/2024] Open
Abstract
The proliferation of single-cell RNA-seq data has greatly enhanced our ability to comprehend the intricate nature of diverse tissues. However, accurately annotating cell types in such data, especially when handling multiple reference datasets and identifying novel cell types, remains a significant challenge. To address these issues, we introduce Single Cell annotation based on Distance metric learning and Optimal Transport (scDOT), an innovative cell-type annotation method adept at integrating multiple reference datasets and uncovering previously unseen cell types. scDOT introduces two key innovations. First, by incorporating distance metric learning and optimal transport, it presents a novel optimization framework. This framework effectively learns the predictive power of each reference dataset for new query data and simultaneously establishes a probabilistic mapping between cells in the query data and reference-defined cell types. Secondly, scDOT develops an interpretable scoring system based on the acquired probabilistic mapping, enabling the precise identification of previously unseen cell types within the data. To rigorously assess scDOT's capabilities, we systematically evaluate its performance using two diverse collections of benchmark datasets encompassing various tissues, sequencing technologies and diverse cell types. Our experimental results consistently affirm the superior performance of scDOT in cell-type annotation and the identification of previously unseen cell types. These advancements provide researchers with a potent tool for precise cell-type annotation, ultimately enriching our understanding of complex biological tissues.
Collapse
Affiliation(s)
- Yi-Xuan Xiong
- School of Mathematics and Statistics, Central China Normal University, Wuhan 430079, China
- Key Laboratory of Nonlinear Analysis & Applications (Ministry of Education), Central China Normal University, Wuhan 430079, China
| | - Xiao-Fei Zhang
- School of Mathematics and Statistics, Central China Normal University, Wuhan 430079, China
- Key Laboratory of Nonlinear Analysis & Applications (Ministry of Education), Central China Normal University, Wuhan 430079, China
| |
Collapse
|
6
|
Taing L, Dandawate A, L’Yi S, Gehlenborg N, Brown M, Meyer C. Cistrome Data Browser: integrated search, analysis and visualization of chromatin data. Nucleic Acids Res 2024; 52:D61-D66. [PMID: 37971305 PMCID: PMC10767960 DOI: 10.1093/nar/gkad1069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2023] [Revised: 10/14/2023] [Accepted: 11/02/2023] [Indexed: 11/19/2023] Open
Abstract
The Cistrome Data Browser is a resource of ChIP-seq, ATAC-seq and DNase-seq data from humans and mice. It provides maps of the genome-wide locations of transcription factors, cofactors, chromatin remodelers, histone post-translational modifications and regions of chromatin accessible to endonuclease activity. Cistrome DB v3.0 contains approximately 45 000 human and 44 000 mouse samples with about 32 000 newly collected datasets compared to the previous release. The Cistrome DB v3.0 user interface is implemented as a single page application that unifies menu driven and data driven search functions and provides an embedded genome browser, which allows users to find and visualize data more effectively. Users can find informative chromatin profiles through keyword, menu, and data-driven search tools. Browser search functions can predict the regulators of query genes as well as the cell type and factor dependent functionality of potential cis-regulatory elements. Cistrome DB v3.0 expands the display of quality control statistics, incorporates sequence logos into motif enrichment displays and includes more expansive sample metadata. Cistrome DB v3.0 is available at http://db3.cistrome.org/browser.
Collapse
Affiliation(s)
- Len Taing
- Center for Functional Cancer Epigenetics, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Ariaki Dandawate
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Sehi L’Yi
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Nils Gehlenborg
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Myles Brown
- Center for Functional Cancer Epigenetics, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Medical Oncology, Dana-Farber Cancer Institute, Brigham and Women's Hospital, and Harvard Medical School, Boston, MA, USA
| | - Clifford A Meyer
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| |
Collapse
|
7
|
Ng GYL, Tan SC, Ong CS. On the use of QDE-SVM for gene feature selection and cell type classification from scRNA-seq data. PLoS One 2023; 18:e0292961. [PMID: 37856458 PMCID: PMC10586655 DOI: 10.1371/journal.pone.0292961] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2023] [Accepted: 10/03/2023] [Indexed: 10/21/2023] Open
Abstract
Cell type identification is one of the fundamental tasks in single-cell RNA sequencing (scRNA-seq) studies. It is a key step to facilitate downstream interpretations such as differential expression, trajectory inference, etc. scRNA-seq data contains technical variations that could affect the interpretation of the cell types. Therefore, gene selection, also known as feature selection in data science, plays an important role in selecting informative genes for scRNA-seq cell type identification. Generally speaking, feature selection methods are categorized into filter-, wrapper-, and embedded-based approaches. From the existing literature, methods from filter- and embedded-based approaches are widely applied in scRNA-seq gene selection tasks. The wrapper-based method that gives promising results in other fields has yet been extensively utilized for selecting gene features from scRNA-seq data; in addition, most of the existing wrapper methods used in this field are clustering instead of classification-based. With a large number of annotated data available today, this study applied a classification-based approach as an alternative to the clustering-based wrapper method. In our work, a quantum-inspired differential evolution (QDE) wrapped with a classification method was introduced to select a subset of genes from twelve well-known scRNA-seq transcriptomic datasets to identify cell types. In particular, the QDE was combined with different machine-learning (ML) classifiers namely logistic regression, decision tree, support vector machine (SVM) with linear and radial basis function kernels, as well as extreme learning machine. The linear SVM wrapped with QDE, namely QDE-SVM, was chosen by referring to the feature selection results from the experiment. QDE-SVM showed a superior cell type classification performance among QDE wrapping with other ML classifiers as well as the recent wrapper methods (i.e., FSCAM, SSD-LAHC, MA-HS, and BSF). QDE-SVM achieved an average accuracy of 0.9559, while the other wrapper methods achieved average accuracies in the range of 0.8292 to 0.8872.
Collapse
Affiliation(s)
- Grace Yee Lin Ng
- Faculty of Information Science and Technology, Multimedia University, Bukit Beruang, Melaka, Malaysia
| | - Shing Chiang Tan
- Faculty of Information Science and Technology, Multimedia University, Bukit Beruang, Melaka, Malaysia
| | - Chia Sui Ong
- Faculty of Information Science and Technology, Multimedia University, Bukit Beruang, Melaka, Malaysia
| |
Collapse
|
8
|
Ahmed SH, Deng AT, Huntley RP, Campbell NH, Lovering RC. Capturing heart valve development with Gene Ontology. Front Genet 2023; 14:1251902. [PMID: 37915827 PMCID: PMC10616796 DOI: 10.3389/fgene.2023.1251902] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2023] [Accepted: 09/29/2023] [Indexed: 11/03/2023] Open
Abstract
Introduction: The normal development of all heart valves requires highly coordinated signaling pathways and downstream mediators. While genomic variants can be responsible for congenital valve disease, environmental factors can also play a role. Later in life valve calcification is a leading cause of aortic valve stenosis, a progressive disease that may lead to heart failure. Current research into the causes of both congenital valve diseases and valve calcification is using a variety of high-throughput methodologies, including transcriptomics, proteomics and genomics. High quality genetic data from biological knowledge bases are essential to facilitate analyses and interpretation of these high-throughput datasets. The Gene Ontology (GO, http://geneontology.org/) is a major bioinformatics resource used to interpret these datasets, as it provides structured, computable knowledge describing the role of gene products across all organisms. The UCL Functional Gene Annotation team focuses on GO annotation of human gene products. Having identified that the GO annotations included in transcriptomic, proteomic and genomic data did not provide sufficient descriptive information about heart valve development, we initiated a focused project to address this issue. Methods: This project prioritized 138 proteins for GO annotation, which led to the curation of 100 peer-reviewed articles and the creation of 400 heart valve development-relevant GO annotations. Results: While the focus of this project was heart valve development, around 600 of the 1000 annotations created described the broader cellular role of these proteins, including those describing aortic valve morphogenesis, BMP signaling and endocardial cushion development. Our functional enrichment analysis of the 28 proteins known to have a role in bicuspid aortic valve disease confirmed that this annotation project has led to an improved interpretation of a heart valve genetic dataset. Discussion: To address the needs of the heart valve research community this project has provided GO annotations to describe the specific roles of key proteins involved in heart valve development. The breadth of GO annotations created by this project will benefit many of those seeking to interpret a wide range of cardiovascular genomic, transcriptomic, proteomic and metabolomic datasets.
Collapse
Affiliation(s)
- Saadullah H. Ahmed
- Functional Gene Annotation, Pre-clinical and Fundamental Science, Institute of Cardiovascular Science, University College London, London, United Kingdom
| | - Alexander T. Deng
- Department of Clinical Genetics, Guy’s and St Thomas’s NHS Foundation Trust, London, United Kingdom
| | - Rachael P. Huntley
- SciBite Limited, BioData Innovation Centre, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom
| | | | - Ruth C. Lovering
- Functional Gene Annotation, Pre-clinical and Fundamental Science, Institute of Cardiovascular Science, University College London, London, United Kingdom
| |
Collapse
|
9
|
Cheng C, Chen W, Jin H, Chen X. A Review of Single-Cell RNA-Seq Annotation, Integration, and Cell-Cell Communication. Cells 2023; 12:1970. [PMID: 37566049 PMCID: PMC10417635 DOI: 10.3390/cells12151970] [Citation(s) in RCA: 18] [Impact Index Per Article: 18.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2023] [Revised: 07/10/2023] [Accepted: 07/21/2023] [Indexed: 08/12/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool for investigating cellular biology at an unprecedented resolution, enabling the characterization of cellular heterogeneity, identification of rare but significant cell types, and exploration of cell-cell communications and interactions. Its broad applications span both basic and clinical research domains. In this comprehensive review, we survey the current landscape of scRNA-seq analysis methods and tools, focusing on count modeling, cell-type annotation, data integration, including spatial transcriptomics, and the inference of cell-cell communication. We review the challenges encountered in scRNA-seq analysis, including issues of sparsity or low expression, reliability of cell annotation, and assumptions in data integration, and discuss the potential impact of suboptimal clustering and differential expression analysis tools on downstream analyses, particularly in identifying cell subpopulations. Finally, we discuss recent advancements and future directions for enhancing scRNA-seq analysis. Specifically, we highlight the development of novel tools for annotating single-cell data, integrating and interpreting multimodal datasets covering transcriptomics, epigenomics, and proteomics, and inferring cellular communication networks. By elucidating the latest progress and innovation, we provide a comprehensive overview of the rapidly advancing field of scRNA-seq analysis.
Collapse
Affiliation(s)
- Changde Cheng
- Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA;
| | - Wenan Chen
- Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA; (W.C.); (H.J.)
| | - Hongjian Jin
- Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA; (W.C.); (H.J.)
| | - Xiang Chen
- Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA;
| |
Collapse
|
10
|
Boppana A, Lee S, Malhotra R, Halushka M, Gustilo KS, Quardokus EM, Herr BW, Börner K, Weber GM. Anatomical structures, cell types, and biomarkers of the healthy human blood vasculature. Sci Data 2023; 10:452. [PMID: 37468503 PMCID: PMC10356915 DOI: 10.1038/s41597-023-02018-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2022] [Accepted: 01/30/2023] [Indexed: 07/21/2023] Open
Abstract
More than 150 scientists from 17 consortia are collaborating on an international project to build a Human Reference Atlas, which maps all 37 trillion cells in the healthy adult human body. The initial release of this atlas provided hierarchical lists of the anatomical structures, cell types, and biomarkers in 11 organs. Here, we describe the methods we used as part of this initiative to build the first open, computer-readable, and comprehensive database of the adult human blood vasculature, called the Human Reference Atlas-Vasculature Common Coordinate Framework (HRA-VCCF). It includes 993 vessels and their branching connections, 10 cell types, and 10 biomarkers. With this paper we are releasing additional details on vessel types and subtypes, branching sequence, anastomoses, portal systems, microvasculature, functional tissue units, mappings to regions vessels supply or drain, geometric properties of vessels, and links to 3D reference objects. Future versions will add variants and connections to the lymph vasculature; and, it will iteratively expand and improve the database as additional experimental data become available through the participating consortia.
Collapse
Affiliation(s)
| | - Sujin Lee
- Department of Surgery, Massachusetts General Hospital, Boston, Massachusetts, USA
| | - Rajeev Malhotra
- Department of Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA
| | - Marc Halushka
- Department of Pathology, Johns Hopkins Medicine, Baltimore, Maryland, USA
| | - Katherine S Gustilo
- Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, Indiana, USA
| | - Ellen M Quardokus
- Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, Indiana, USA
| | - Bruce W Herr
- Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, Indiana, USA
| | - Katy Börner
- Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, Indiana, USA
| | - Griffin M Weber
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA.
- Department of Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA.
| |
Collapse
|
11
|
Callahan TJ, Stefanski AL, Wyrwa JM, Zeng C, Ostropolets A, Banda JM, Baumgartner WA, Boyce RD, Casiraghi E, Coleman BD, Collins JH, Deakyne Davies SJ, Feinstein JA, Lin AY, Martin B, Matentzoglu NA, Meeker D, Reese J, Sinclair J, Taneja SB, Trinkley KE, Vasilevsky NA, Williams AE, Zhang XA, Denny JC, Ryan PB, Hripcsak G, Bennett TD, Haendel MA, Robinson PN, Hunter LE, Kahn MG. Ontologizing health systems data at scale: making translational discovery a reality. NPJ Digit Med 2023; 6:89. [PMID: 37208468 PMCID: PMC10196319 DOI: 10.1038/s41746-023-00830-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2022] [Accepted: 04/28/2023] [Indexed: 05/21/2023] Open
Abstract
Common data models solve many challenges of standardizing electronic health record (EHR) data but are unable to semantically integrate all of the resources needed for deep phenotyping. Open Biological and Biomedical Ontology (OBO) Foundry ontologies provide computable representations of biological knowledge and enable the integration of heterogeneous data. However, mapping EHR data to OBO ontologies requires significant manual curation and domain expertise. We introduce OMOP2OBO, an algorithm for mapping Observational Medical Outcomes Partnership (OMOP) vocabularies to OBO ontologies. Using OMOP2OBO, we produced mappings for 92,367 conditions, 8611 drug ingredients, and 10,673 measurement results, which covered 68-99% of concepts used in clinical practice when examined across 24 hospitals. When used to phenotype rare disease patients, the mappings helped systematically identify undiagnosed patients who might benefit from genetic testing. By aligning OMOP vocabularies to OBO ontologies our algorithm presents new opportunities to advance EHR-based deep phenotyping.
Collapse
Affiliation(s)
- Tiffany J Callahan
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA.
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, 10032, USA.
| | - Adrianne L Stefanski
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
| | - Jordan M Wyrwa
- Department of Physical Medicine and Rehabilitation, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
| | - Chenjie Zeng
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
| | - Anna Ostropolets
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, 10032, USA
| | - Juan M Banda
- Department of Computer Science, Georgia State University, Atlanta, GA, 30303, USA
| | - William A Baumgartner
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
| | - Richard D Boyce
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15260, USA
| | - Elena Casiraghi
- Computer Science, Università degli Studi di Milano, Milan, Italy
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
| | - Ben D Coleman
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
| | - Janine H Collins
- Department of Haematology, University of Cambridge, Cambridge, UK
| | - Sara J Deakyne Davies
- Department of Research Informatics & Data Science, Analytics Resource Center, Children's Hospital Colorado, Aurora, CO, 80045, USA
| | - James A Feinstein
- Adult and Child Center for Health Outcomes Research and Delivery Science (ACCORDS), University of Colorado Anschutz School of Medicine, Aurora, CO, 80045, USA
| | - Asiyah Y Lin
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
| | - Blake Martin
- Departments of Biomedical Informatics and Pediatrics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
| | | | | | - Justin Reese
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
| | | | - Sanya B Taneja
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, 15260, USA
| | - Katy E Trinkley
- Department of Family Medicine, University of Colorado Anschutz School of Medicine, Aurora, CO, 80045, USA
| | - Nicole A Vasilevsky
- Translational and Integrative Sciences Lab, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
| | - Andrew E Williams
- Tufts Institute for Clinical Research and Health Policy Studies, Tufts University, Boston, MA, 02155, USA
| | - Xingmin A Zhang
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
| | - Joshua C Denny
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
| | - Patrick B Ryan
- Janssen Research and Development, Raritan, NJ, 08869, USA
| | - George Hripcsak
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, 10032, USA
| | - Tellen D Bennett
- Departments of Biomedical Informatics and Pediatrics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
| | - Melissa A Haendel
- Departments of Biomedical Informatics and Pediatrics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
| | - Peter N Robinson
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
| | - Lawrence E Hunter
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
| | - Michael G Kahn
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
| |
Collapse
|
12
|
Taneja SB, Callahan TJ, Paine MF, Kane-Gill SL, Kilicoglu H, Joachimiak MP, Boyce RD. Developing a Knowledge Graph for Pharmacokinetic Natural Product-Drug Interactions. J Biomed Inform 2023; 140:104341. [PMID: 36933632 PMCID: PMC10150409 DOI: 10.1016/j.jbi.2023.104341] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2022] [Revised: 01/09/2023] [Accepted: 03/13/2023] [Indexed: 03/17/2023]
Abstract
BACKGROUND Pharmacokinetic natural product-drug interactions (NPDIs) occur when botanical or other natural products are co-consumed with pharmaceutical drugs. With the growing use of natural products, the risk for potential NPDIs and consequent adverse events has increased. Understanding mechanisms of NPDIs is key to preventing or minimizing adverse events. Although biomedical knowledge graphs (KGs) have been widely used for drug-drug interaction applications, computational investigation of NPDIs is novel. We constructed NP-KG as a first step toward computational discovery of plausible mechanistic explanations for pharmacokinetic NPDIs that can be used to guide scientific research. METHODS We developed a large-scale, heterogeneous KG with biomedical ontologies, linked data, and full texts of the scientific literature. To construct the KG, biomedical ontologies and drug databases were integrated with the Phenotype Knowledge Translator framework. The semantic relation extraction systems, SemRep and Integrated Network and Dynamic Reasoning Assembler, were used to extract semantic predications (subject-relation-object triples) from full texts of the scientific literature related to the exemplar natural products green tea and kratom. A literature-based graph constructed from the predications was integrated into the ontology-grounded KG to create NP-KG. NP-KG was evaluated with case studies of pharmacokinetic green tea- and kratom-drug interactions through KG path searches and meta-path discovery to determine congruent and contradictory information in NP-KG compared to ground truth data. We also conducted an error analysis to identify knowledge gaps and incorrect predications in the KG. RESULTS The fully integrated NP-KG consisted of 745,512 nodes and 7,249,576 edges. Evaluation of NP-KG resulted in congruent (38.98% for green tea, 50% for kratom), contradictory (15.25% for green tea, 21.43% for kratom), and both congruent and contradictory (15.25% for green tea, 21.43% for kratom) information compared to ground truth data. Potential pharmacokinetic mechanisms for several purported NPDIs, including the green tea-raloxifene, green tea-nadolol, kratom-midazolam, kratom-quetiapine, and kratom-venlafaxine interactions were congruent with the published literature. CONCLUSION NP-KG is the first KG to integrate biomedical ontologies with full texts of the scientific literature focused on natural products. We demonstrate the application of NP-KG to identify known pharmacokinetic interactions between natural products and pharmaceutical drugs mediated by drug metabolizing enzymes and transporters. Future work will incorporate context, contradiction analysis, and embedding-based methods to enrich NP-KG. NP-KG is publicly available at https://doi.org/10.5281/zenodo.6814507. The code for relation extraction, KG construction, and hypothesis generation is available at https://github.com/sanyabt/np-kg.
Collapse
Affiliation(s)
- Sanya B Taneja
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15206, USA.
| | - Tiffany J Callahan
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
| | - Mary F Paine
- Department of Pharmaceutical Sciences, College of Pharmacy and Pharmaceutical Sciences, Washington State University, Spokane, WA 99202, USA
| | | | - Halil Kilicoglu
- School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL 61820, USA
| | - Marcin P Joachimiak
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Richard D Boyce
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
| |
Collapse
|
13
|
Herr BW, Hardi J, Quardokus EM, Bueckle A, Chen L, Wang F, Caron AR, Osumi-Sutherland D, Musen MA, Börner K. Specimen, biological structure, and spatial ontologies in support of a Human Reference Atlas. Sci Data 2023; 10:171. [PMID: 36973309 PMCID: PMC10043028 DOI: 10.1038/s41597-023-01993-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2022] [Accepted: 01/30/2023] [Indexed: 03/29/2023] Open
Abstract
The Human Reference Atlas (HRA) is defined as a comprehensive, three-dimensional (3D) atlas of all the cells in the healthy human body. It is compiled by an international team of experts who develop standard terminologies that they link to 3D reference objects, describing anatomical structures. The third HRA release (v1.2) covers spatial reference data and ontology annotations for 26 organs. Experts access the HRA annotations via spreadsheets and view reference object models in 3D editing tools. This paper introduces the Common Coordinate Framework (CCF) Ontology v2.0.1 that interlinks specimen, biological structure, and spatial data, together with the CCF API that makes the HRA programmatically accessible and interoperable with Linked Open Data (LOD). We detail how real-world user needs and experimental data guide CCF Ontology design and implementation, present CCF Ontology classes and properties together with exemplary usage, and report on validation methods. The CCF Ontology graph database and API are used in the HuBMAP portal, HRA Organ Gallery, and other applications that support data queries across multiple, heterogeneous sources.
Collapse
Affiliation(s)
- Bruce W Herr
- Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, 47408, USA
| | - Josef Hardi
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, 94305, USA
| | - Ellen M Quardokus
- Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, 47408, USA
| | - Andreas Bueckle
- Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, 47408, USA.
| | - Lu Chen
- Department of Computer Science, Stony Brook University, Stony Brook, NY, 11794, USA
| | - Fusheng Wang
- Department of Computer Science, Stony Brook University, Stony Brook, NY, 11794, USA
- Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, 11794, USA
| | - Anita R Caron
- European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
| | | | - Mark A Musen
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, 94305, USA
| | - Katy Börner
- Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, 47408, USA.
| |
Collapse
|
14
|
Zhang Y, Petukhov V, Biederstedt E, Que R, Zhang K, Kharchenko PV. Gene panel selection for targeted spatial transcriptomics. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.02.03.527053. [PMID: 36993340 PMCID: PMC10054990 DOI: 10.1101/2023.02.03.527053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 06/19/2023]
Abstract
Targeted spatial transcriptomics hold particular promise in analysis of complex tissues. Most such methods, however, measure only a limited panel of transcripts, which need to be selected in advance to inform on the cell types or processes being studied. A limitation of existing gene selection methods is that they rely on scRNA-seq data, ignoring platform effects between technologies. Here we describe gpsFISH, a computational method to perform gene selection through optimizing detection of known cell types. By modeling and adjusting for platform effects, gpsFISH outperforms other methods. Furthermore, gpsFISH can incorporate cell type hierarchies and custom gene preferences to accommodate diverse design requirements.
Collapse
Affiliation(s)
- Yida Zhang
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Department of Neurobiology, Duke University, Durham, NC, USA
| | - Viktor Petukhov
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Evan Biederstedt
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Richard Que
- Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
| | - Kun Zhang
- Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
- San Diego Institute of Science, Altos Labs, San Diego, CA, USA
| | - Peter V. Kharchenko
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- San Diego Institute of Science, Altos Labs, San Diego, CA, USA
| |
Collapse
|
15
|
Domcke S, Shendure J. A reference cell tree will serve science better than a reference cell atlas. Cell 2023; 186:1103-1114. [PMID: 36931241 DOI: 10.1016/j.cell.2023.02.016] [Citation(s) in RCA: 29] [Impact Index Per Article: 29.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2022] [Revised: 01/15/2023] [Accepted: 02/10/2023] [Indexed: 03/18/2023]
Abstract
Single-cell biology is facing a crisis of sorts. Vast numbers of single-cell molecular profiles are being generated, clustered and annotated. However, this is overwhelmingly ad hoc, and we continue to lack a principled, unified, and well-moored system for defining, naming, and organizing cell types. In this perspective, we argue against an atlas or periodic table-like discretization as the right metaphor for a reference taxonomy of cell types. In its place, we advocate for a data-driven, tree-based nomenclature that is rooted in a "consensus ontogeny" spanning the life cycle of a given species. We explore how such a reference cell tree, inclusive of both lineage histories and molecular states, could be constructed, represented, and segmented in practice. Analogous to the taxonomic classification of species, a consensus ontogeny would provide a universal, stable, and extendable framework for precise scientific communication, both contemporaneously and across the ages.
Collapse
Affiliation(s)
- Silvia Domcke
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA.
| | - Jay Shendure
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA; Brotman Baty Institute for Precision Medicine, Seattle, WA, USA; Allen Discovery Center for Cell Lineage Tracing, Seattle, WA, USA; Howard Hughes Medical Institute, Seattle, WA, USA.
| |
Collapse
|
16
|
Hu Y, Liu C, Han W, Wang P. A theoretical framework of immune cell phenotypic classification and discovery. Front Immunol 2023; 14:1128423. [PMID: 36936975 PMCID: PMC10018129 DOI: 10.3389/fimmu.2023.1128423] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2022] [Accepted: 02/20/2023] [Indexed: 03/06/2023] Open
Abstract
Immune cells are highly heterogeneous and show diverse phenotypes, but the underlying mechanism remains to be elucidated. In this study, we proposed a theoretical framework for immune cell phenotypic classification based on gene plasticity, which herein refers to expressional change or variability in response to conditions. The system contains two core points. One is that the functional subsets of immune cells can be further divided into subdivisions based on their highly plastic genes, and the other is that loss of phenotype accompanies gain of phenotype during phenotypic conversion. The first point suggests phenotypic stratification or layerability according to gene plasticity, while the second point reveals expressional compatibility and mutual exclusion during the change in gene plasticity states. Abundant transcriptome data analysis in this study from both microarray and RNA sequencing in human CD4 and CD8 single-positive T cells, B cells, natural killer cells and monocytes supports the logical rationality and generality, as well as expansibility, across immune cells. A collection of thousands of known immunophenotypes reported in the literature further supports that highly plastic genes play an important role in maintaining immune cell phenotypes and reveals that the current classification model is compatible with the traditionally defined functional subsets. The system provides a new perspective to understand the characteristics of dynamic, diversified immune cell phenotypes and intrinsic regulation in the immune system. Moreover, the current substantial results based on plasticitomics analysis of bulk and single-cell sequencing data provide a useful resource for big-data-driven experimental studies and knowledge discoveries.
Collapse
Affiliation(s)
- Yuzhe Hu
- Department of Immunology, NHC Key Laboratory of Medical Immunology (Peking University), School of Basic Medical Sciences, Peking University Health Science Center, Beijing, China
- Peking University Center for Human Disease Genomics, Beijing, China
| | - Chen Liu
- Department of Clinical Laboratory, Peking University People’s Hospital, Beijing, China
| | - Wenling Han
- Department of Immunology, NHC Key Laboratory of Medical Immunology (Peking University), School of Basic Medical Sciences, Peking University Health Science Center, Beijing, China
- Peking University Center for Human Disease Genomics, Beijing, China
| | - Pingzhang Wang
- Department of Immunology, NHC Key Laboratory of Medical Immunology (Peking University), School of Basic Medical Sciences, Peking University Health Science Center, Beijing, China
- Peking University Center for Human Disease Genomics, Beijing, China
- *Correspondence: Pingzhang Wang,
| |
Collapse
|
17
|
Christensen E, Luo P, Turinsky A, Husić M, Mahalanabis A, Naidas A, Diaz-Mejia JJ, Brudno M, Pugh T, Ramani A, Shooshtari P. Evaluation of single-cell RNAseq labelling algorithms using cancer datasets. Brief Bioinform 2022; 24:6965910. [PMID: 36585784 PMCID: PMC9851326 DOI: 10.1093/bib/bbac561] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2022] [Revised: 09/19/2022] [Accepted: 11/01/2022] [Indexed: 01/01/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) clustering and labelling methods are used to determine precise cellular composition of tissue samples. Automated labelling methods rely on either unsupervised, cluster-based approaches or supervised, cell-based approaches to identify cell types. The high complexity of cancer poses a unique challenge, as tumor microenvironments are often composed of diverse cell subpopulations with unique functional effects that may lead to disease progression, metastasis and treatment resistance. Here, we assess 17 cell-based and 9 cluster-based scRNA-seq labelling algorithms using 8 cancer datasets, providing a comprehensive large-scale assessment of such methods in a cancer-specific context. Using several performance metrics, we show that cell-based methods generally achieved higher performance and were faster compared to cluster-based methods. Cluster-based methods more successfully labelled non-malignant cell types, likely because of a lack of gene signatures for relevant malignant cell subpopulations. Larger cell numbers present in some cell types in training data positively impacted prediction scores for cell-based methods. Finally, we examined which methods performed favorably when trained and tested on separate patient cohorts in scenarios similar to clinical applications, and which were able to accurately label particularly small or under-represented cell populations in the given datasets. We conclude that scPred and SVM show the best overall performances with cancer-specific data and provide further suggestions for algorithm selection. Our analysis pipeline for assessing the performance of cell type labelling algorithms is available in https://github.com/shooshtarilab/scRNAseq-Automated-Cell-Type-Labelling.
Collapse
Affiliation(s)
| | | | - Andrei Turinsky
- Centre for Computational Medicine, The Hospital for Sick Children, Toronto, ON, Canada
| | - Mia Husić
- Centre for Computational Medicine, The Hospital for Sick Children, Toronto, ON, Canada
| | - Alaina Mahalanabis
- Centre for Computational Medicine, The Hospital for Sick Children, Toronto, ON, Canada
| | - Alaine Naidas
- Children’s Health Research Institute, Lawson Research Institute, London, ON, Canada
- Department of Pathology and Lab Medicine, University of Western Ontario, London, ON, Canada
| | | | - Michael Brudno
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
| | - Trevor Pugh
- Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- Ontario Institute for Cancer Research, Toronto, ON, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
| | - Arun Ramani
- Centre for Computational Medicine, The Hospital for Sick Children, Toronto, ON, Canada
| | - Parisa Shooshtari
- Corresponding author: Parisa Shooshtari, Department of Pathology and Lab Medicine, University of Western Ontario, London, ON, Canada. Tel.: +1 (519) 685-8500 x55427. E-mail:
| |
Collapse
|
18
|
Börner K, Bueckle A, Herr BW, Cross LE, Quardokus EM, Record EG, Ju Y, Silverstein JC, Browne KM, Jain S, Wasserfall CH, Jorgensen ML, Spraggins JM, Patterson NH, Weber GM. Tissue registration and exploration user interfaces in support of a human reference atlas. Commun Biol 2022; 5:1369. [PMID: 36513738 PMCID: PMC9747802 DOI: 10.1038/s42003-022-03644-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2022] [Accepted: 06/27/2022] [Indexed: 12/14/2022] Open
Abstract
Seventeen international consortia are collaborating on a human reference atlas (HRA), a comprehensive, high-resolution, three-dimensional atlas of all the cells in the healthy human body. Laboratories around the world are collecting tissue specimens from donors varying in sex, age, ethnicity, and body mass index. However, harmonizing tissue data across 25 organs and more than 15 bulk and spatial single-cell assay types poses challenges. Here, we present software tools and user interfaces developed to spatially and semantically annotate ("register") and explore the tissue data and the evolving HRA. A key part of these tools is a common coordinate framework, providing standard terminologies and data structures for describing specimen, biological structure, and spatial data linked to existing ontologies. As of April 22, 2022, the "registration" user interface has been used to harmonize and publish data on 5,909 tissue blocks collected by the Human Biomolecular Atlas Program (HuBMAP), the Stimulating Peripheral Activity to Relieve Conditions program (SPARC), the Human Cell Atlas (HCA), the Kidney Precision Medicine Project (KPMP), and the Genotype Tissue Expression project (GTEx). Further, 5,856 tissue sections were derived from 506 HuBMAP tissue blocks. The second "exploration" user interface enables consortia to evaluate data quality, explore tissue data spatially within the context of the HRA, and guide data acquisition. A companion website is at https://cns-iu.github.io/HRA-supporting-information/ .
Collapse
Affiliation(s)
- Katy Börner
- Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA.
| | - Andreas Bueckle
- Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA.
| | - Bruce W Herr
- Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA
| | - Leonard E Cross
- Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA
| | - Ellen M Quardokus
- Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA
| | - Elizabeth G Record
- Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA
| | - Yingnan Ju
- Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA
| | - Jonathan C Silverstein
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
| | - Kristen M Browne
- Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD, USA
| | - Sanjay Jain
- Department of Medicine, Department of Pathology & Immunology, Department of Pediatrics, Washington University School of Medicine, Saint Louis, MO, USA
| | - Clive H Wasserfall
- Departments of Pathology and Pediatrics, University of Florida, Gainesville, FL, USA
| | - Marda L Jorgensen
- Departments of Pathology and Pediatrics, University of Florida, Gainesville, FL, USA
| | - Jeffrey M Spraggins
- Mass Spectrometry Research Center, Vanderbilt University, Nashville, TN, USA
| | - N Heath Patterson
- Mass Spectrometry Research Center, Vanderbilt University, Nashville, TN, USA
| | - Griffin M Weber
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| |
Collapse
|
19
|
Hawkins NT, Maldaver M, Yannakopoulos A, Guare LA, Krishnan A. Systematic tissue annotations of genomics samples by modeling unstructured metadata. Nat Commun 2022; 13:6736. [PMID: 36347858 PMCID: PMC9643451 DOI: 10.1038/s41467-022-34435-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2021] [Accepted: 10/25/2022] [Indexed: 11/10/2022] Open
Abstract
There are currently >1.3 million human -omics samples that are publicly available. This valuable resource remains acutely underused because discovering particular samples from this ever-growing data collection remains a significant challenge. The major impediment is that sample attributes are routinely described using varied terminologies written in unstructured natural language. We propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for genomics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms. Our approach significantly outperforms an advanced graph-based reasoning annotation method (MetaSRA) and a baseline exact string matching method (TAGGER). Model similarities between related tissues demonstrate that NLP-ML models capture biologically-meaningful signals in text. Additionally, these models correctly classify tissue-associated biological processes and diseases based on their text descriptions alone. NLP-ML models are nearly as accurate as models based on gene-expression profiles in predicting sample tissue annotations but have the distinct capability to classify samples irrespective of the genomics experiment type based on their text metadata. Python NLP-ML prediction code and trained tissue models are available at https://github.com/krishnanlab/txt2onto .
Collapse
Affiliation(s)
- Nathaniel T Hawkins
- Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA
| | - Marc Maldaver
- Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA
| | - Anna Yannakopoulos
- Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA
| | - Lindsay A Guare
- Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, 48824, USA
- Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI, 48824, USA
| | - Arjun Krishnan
- Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA.
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, 48824, USA.
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA.
| |
Collapse
|
20
|
A curated collection of human vaccination response signatures. Sci Data 2022; 9:678. [PMID: 36347894 PMCID: PMC9643367 DOI: 10.1038/s41597-022-01558-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2021] [Accepted: 07/14/2022] [Indexed: 11/09/2022] Open
Abstract
AbstractRecent advances in high-throughput experiments and systems biology approaches have resulted in hundreds of publications identifying “immune signatures”. Unfortunately, these are often described within text, figures, or tables in a format not amenable to computational processing, thus severely hampering our ability to fully exploit this information. Here we present a data model to represent immune signatures, along with the Human Immunology Project Consortium (HIPC) Dashboard (www.hipc-dashboard.org), a web-enabled application to facilitate signature access and querying. The data model captures the biological response components (e.g., genes, proteins, cell types or metabolites) and metadata describing the context under which the signature was identified using standardized terms from established resources (e.g., HGNC, Protein Ontology, Cell Ontology). We have manually curated a collection of >600 immune signatures from >60 published studies profiling human vaccination responses for the current release. The system will aid in building a broader understanding of the human immune response to stimuli by enabling researchers to easily access and interrogate published immune signatures.
Collapse
|
21
|
Matentzoglu N, Goutte-Gattat D, Tan SZK, Balhoff JP, Carbon S, Caron AR, Duncan WD, Flack JE, Haendel M, Harris NL, Hogan WR, Hoyt CT, Jackson RC, Kim H, Kir H, Larralde M, McMurry JA, Overton JA, Peters B, Pilgrim C, Stefancsik R, Robb SMC, Toro S, Vasilevsky NA, Walls R, Mungall CJ, Osumi-Sutherland D. Ontology Development Kit: a toolkit for building, maintaining and standardizing biomedical ontologies. Database (Oxford) 2022; 2022:6754192. [PMID: 36208225 PMCID: PMC9547537 DOI: 10.1093/database/baac087] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2022] [Revised: 08/19/2022] [Accepted: 09/23/2022] [Indexed: 11/21/2022]
Abstract
Similar to managing software packages, managing the ontology life cycle involves multiple complex workflows such as preparing releases, continuous quality control checking and dependency management. To manage these processes, a diverse set of tools is required, from command-line utilities to powerful ontology-engineering environmentsr. Particularly in the biomedical domain, which has developed a set of highly diverse yet inter-dependent ontologies, standardizing release practices and metadata and establishing shared quality standards are crucial to enable interoperability. The Ontology Development Kit (ODK) provides a set of standardized, customizable and automatically executable workflows, and packages all required tooling in a single Docker image. In this paper, we provide an overview of how the ODK works, show how it is used in practice and describe how we envision it driving standardization efforts in our community. Database URL: https://github.com/INCATools/ontology-development-kit.
Collapse
Affiliation(s)
| | - Damien Goutte-Gattat
- Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3DY, UK
| | - Shawn Zheng Kai Tan
- Samples Phenotypes and Ontologies Team (SPOT), European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - James P Balhoff
- RENCI, University of North Carolina, Chapel Hill, NC, North Carolina 27517, USA
| | - Seth Carbon
- Berkeley Bioinformatics Open-source Projects (BBOP), Lawrence Berkeley National Laboratory (LBNL), 1 Cyclotron Road, Mailstop 977-0257, Berkeley, CA 94720, USA
| | - Anita R Caron
- Samples Phenotypes and Ontologies Team (SPOT), European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - William D Duncan
- Berkeley Bioinformatics Open-source Projects (BBOP), Lawrence Berkeley National Laboratory (LBNL), 1 Cyclotron Road, Mailstop 977-0257, Berkeley, CA 94720, USA,College of Dentistry; Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, William D. Duncan: 1395 Center Dr, Gainesville, William R. Hogan: 1600 SW Archer Rd, Gainesville, FL 32610, USA
| | - Joe E Flack
- School of Medicine, Johns Hopkins University, 733 N Broadway, Baltimore, Baltimore, MD 21205, USA
| | - Melissa Haendel
- University of Colorado Anschutz Medical Campus, 13001 E 17th Pl, Aurora, CO 80045, USA
| | - Nomi L Harris
- Berkeley Bioinformatics Open-source Projects (BBOP), Lawrence Berkeley National Laboratory (LBNL), 1 Cyclotron Road, Mailstop 977-0257, Berkeley, CA 94720, USA
| | - William R Hogan
- College of Dentistry; Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, William D. Duncan: 1395 Center Dr, Gainesville, William R. Hogan: 1600 SW Archer Rd, Gainesville, FL 32610, USA
| | - Charles Tapley Hoyt
- Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Avenue Armenise Building Room 109, Boston, MA 02115, USA
| | - Rebecca C Jackson
- Bend Informatics LLC, 5305 RIVER RD NORTH, STE B, KEIZER, OR 97303, USA
| | | | - Huseyin Kir
- Samples Phenotypes and Ontologies Team (SPOT), European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - Martin Larralde
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstraße 1, Heidelberg 69117, Germany
| | - Julie A McMurry
- University of Colorado Anschutz Medical Campus, 13001 E 17th Pl, Aurora, CO 80045, USA
| | | | - Bjoern Peters
- Institute for Allergy & Immunology, La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA 92037, USA
| | - Clare Pilgrim
- Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3DY, UK
| | - Ray Stefancsik
- Samples Phenotypes and Ontologies Team (SPOT), European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - Sofia MC Robb
- Stowers Institute for Medical Research, 1000 E. 50th St., Kansas City, MO 64110, USA
| | - Sabrina Toro
- University of Colorado Anschutz Medical Campus, 13001 E 17th Pl, Aurora, CO 80045, USA
| | - Nicole A Vasilevsky
- University of Colorado Anschutz Medical Campus, 13001 E 17th Pl, Aurora, CO 80045, USA
| | - Ramona Walls
- Critical Path Institute, 1730 E River Road, Tucson, AZ 85718, USA
| | - Christopher J Mungall
- Berkeley Bioinformatics Open-source Projects (BBOP), Lawrence Berkeley National Laboratory (LBNL), 1 Cyclotron Road, Mailstop 977-0257, Berkeley, CA 94720, USA
| | | |
Collapse
|
22
|
Wood EC, Glen AK, Kvarfordt LG, Womack F, Acevedo L, Yoon TS, Ma C, Flores V, Sinha M, Chodpathumwan Y, Termehchy A, Roach JC, Mendoza L, Hoffman AS, Deutsch EW, Koslicki D, Ramsey SA. RTX-KG2: a system for building a semantically standardized knowledge graph for translational biomedicine. BMC Bioinformatics 2022; 23:400. [PMID: 36175836 PMCID: PMC9520835 DOI: 10.1186/s12859-022-04932-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2022] [Accepted: 09/14/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Biomedical translational science is increasingly using computational reasoning on repositories of structured knowledge (such as UMLS, SemMedDB, ChEMBL, Reactome, DrugBank, and SMPDB in order to facilitate discovery of new therapeutic targets and modalities. The NCATS Biomedical Data Translator project is working to federate autonomous reasoning agents and knowledge providers within a distributed system for answering translational questions. Within that project and the broader field, there is a need for a framework that can efficiently and reproducibly build an integrated, standards-compliant, and comprehensive biomedical knowledge graph that can be downloaded in standard serialized form or queried via a public application programming interface (API). RESULTS To create a knowledge provider system within the Translator project, we have developed RTX-KG2, an open-source software system for building-and hosting a web API for querying-a biomedical knowledge graph that uses an Extract-Transform-Load approach to integrate 70 knowledge sources (including the aforementioned core six sources) into a knowledge graph with provenance information including (where available) citations. The semantic layer and schema for RTX-KG2 follow the standard Biolink model to maximize interoperability. RTX-KG2 is currently being used by multiple Translator reasoning agents, both in its downloadable form and via its SmartAPI-registered interface. Serializations of RTX-KG2 are available for download in both the pre-canonicalized form and in canonicalized form (in which synonyms are merged). The current canonicalized version (KG2.7.3) of RTX-KG2 contains 6.4M nodes and 39.3M edges with a hierarchy of 77 relationship types from Biolink. CONCLUSION RTX-KG2 is the first knowledge graph that integrates UMLS, SemMedDB, ChEMBL, DrugBank, Reactome, SMPDB, and 64 additional knowledge sources within a knowledge graph that conforms to the Biolink standard for its semantic layer and schema. RTX-KG2 is publicly available for querying via its API at arax.rtx.ai/api/rtxkg2/v1.2/openapi.json . The code to build RTX-KG2 is publicly available at github:RTXteam/RTX-KG2 .
Collapse
Affiliation(s)
- E C Wood
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
| | - Amy K Glen
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA.
| | - Lindsey G Kvarfordt
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
| | - Finn Womack
- Computer Science and Engineering, Penn State University, State College, PA, USA
| | - Liliana Acevedo
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
| | - Timothy S Yoon
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
| | - Chunyu Ma
- Huck Institutes of the Life Sciences, Penn State University, State College, PA, USA
| | - Veronica Flores
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
| | - Meghamala Sinha
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
| | | | - Arash Termehchy
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
| | | | | | - Andrew S Hoffman
- Interdisciplinary Hub for Digitalization and Society, Radboud University, Nijmegen, The Netherlands
| | | | - David Koslicki
- Computer Science and Engineering, Penn State University, State College, PA, USA
- Huck Institutes of the Life Sciences, Penn State University, State College, PA, USA
- Department of Biology, Penn State University, State College, PA, USA
| | - Stephen A Ramsey
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Department of Biomedical Sciences, Oregon State University, Corvallis, OR, USA
| |
Collapse
|
23
|
Dai Y, Hu R, Liu A, Cho KS, Manuel AM, Li X, Dong X, Jia P, Zhao Z. WebCSEA: web-based cell-type-specific enrichment analysis of genes. Nucleic Acids Res 2022; 50:W782-W790. [PMID: 35610053 DOI: 10.1093/nar/gkac392] [Citation(s) in RCA: 30] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2022] [Revised: 04/19/2022] [Accepted: 05/04/2022] [Indexed: 02/02/2023] Open
Abstract
Human complex traits and common diseases show tissue- and cell-type- specificity. Recently, single-cell RNA sequencing (scRNA-seq) technology has successfully depicted cellular heterogeneity in human tissue, providing an unprecedented opportunity to understand the context-specific expression of complex trait-associated genes in human tissue-cell types (TCs). Here, we present the first web-based application to quickly assess the cell-type-specificity of genes, named Web-based Cell-type Specific Enrichment Analysis of Genes (WebCSEA, available at https://bioinfo.uth.edu/webcsea/). Specifically, we curated a total of 111 scRNA-seq panels of human tissues and 1,355 TCs from 61 different general tissues across 11 human organ systems. We adapted our previous decoding tissue-specificity (deTS) algorithm to measure the enrichment for each tissue-cell type (TC). To overcome the potential bias from the number of signature genes between different TCs, we further developed a permutation-based method that accurately estimates the TC-specificity of a given inquiry gene list. WebCSEA also provides an interactive heatmap that displays the cell-type specificity across 1355 human TCs, and other interactive and static visualizations of cell-type specificity by human organ system, developmental stage, and top-ranked tissues and cell types. In short, WebCSEA is a one-click application that provides a comprehensive exploration of the TC-specificity of genes among human major TC map.
Collapse
Affiliation(s)
- Yulin Dai
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Ruifeng Hu
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA.,Center for Advanced Parkinson Research, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA.,Genomics and Bioinformatics Hub, Department of Neurology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA
| | - Andi Liu
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA.,Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Kyung Serk Cho
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Astrid Marilyn Manuel
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Xiaoyang Li
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA.,Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Xianjun Dong
- Center for Advanced Parkinson Research, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA.,Genomics and Bioinformatics Hub, Department of Neurology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA
| | - Peilin Jia
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA.,MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA
| |
Collapse
|
24
|
Chen S, Luo Y, Gao H, Li F, Chen Y, Li J, You R, Hao M, Bian H, Xi X, Li W, Li W, Ye M, Meng Q, Zou Z, Li C, Li H, Zhang Y, Cui Y, Wei L, Chen F, Wang X, Lv H, Hua K, Jiang R, Zhang X. hECA: The cell-centric assembly of a cell atlas. iScience 2022; 25:104318. [PMID: 35602947 PMCID: PMC9114628 DOI: 10.1016/j.isci.2022.104318] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2021] [Revised: 03/18/2022] [Accepted: 04/25/2022] [Indexed: 12/04/2022] Open
Abstract
The accumulation of massive single-cell omics data provides growing resources for building biomolecular atlases of all cells of human organs or the whole body. The true assembly of a cell atlas should be cell-centric rather than file-centric. We developed a unified informatics framework for seamless cell-centric data assembly and built the human Ensemble Cell Atlas (hECA) from scattered data. hECA v1.0 assembled 1,093,299 labeled human cells from 116 published datasets, covering 38 organs and 11 systems. We invented three new methods of atlas applications based on the cell-centric assembly: “in data” cell sorting for targeted data retrieval with customizable logic expressions, “quantitative portraiture” for multi-view representations of biological entities, and customizable reference creation for generating references for automatic annotations. Case studies on agile construction of user-defined sub-atlases and “in data” investigation of CAR-T off-targets in multiple organs showed the great potential enabled by the cell-centric ensemble atlas. A unified informatics framework for seamless cell-centric assembly of massive single-cell data Built the general-purpose human Ensemble Cell Atlas (hECA) V1.0 from scattered data Three new methods of applications enabling “in data” cell experiments and portraiture Case studies of agile atlas reconstruction and target therapies side-effect discovery
Collapse
Affiliation(s)
- Sijie Chen
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Yanting Luo
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Haoxiang Gao
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Fanhong Li
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Yixin Chen
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Jiaqi Li
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Renke You
- Fuzhou Institute of Data Technology, Changle, Fuzhou 350200, China
| | - Minsheng Hao
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Haiyang Bian
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Xi Xi
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Wenrui Li
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Weiyu Li
- Fuzhou Institute of Data Technology, Changle, Fuzhou 350200, China
| | - Mingli Ye
- Fuzhou Institute of Data Technology, Changle, Fuzhou 350200, China
| | - Qiuchen Meng
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Ziheng Zou
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Chen Li
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Haochen Li
- School of Medicine, Tsinghua University, Beijing 100084, China
| | - Yangyuan Zhang
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Yanfei Cui
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Lei Wei
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Fufeng Chen
- Fuzhou Institute of Data Technology, Changle, Fuzhou 350200, China
| | - Xiaowo Wang
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Hairong Lv
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China.,Fuzhou Institute of Data Technology, Changle, Fuzhou 350200, China
| | - Kui Hua
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Rui Jiang
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China
| | - Xuegong Zhang
- MOE Key Lab of Bioinformatics, Bioinformatics Division of BNRIST and Department of Automation, Tsinghua University, Beijing 100084, China.,School of Medicine, Tsinghua University, Beijing 100084, China.,School of Life Sciences, Center for Synthetic and Systems Biology, Tsinghua University, Beijing 100084, China
| |
Collapse
|
25
|
de Bono B, Gillespie T, Surles-Zeigler MC, Kokash N, Grethe JS, Martone M. Representing Normal and Abnormal Physiology as Routes of Flow in ApiNATOMY. Front Physiol 2022; 13:795303. [PMID: 35547570 PMCID: PMC9083405 DOI: 10.3389/fphys.2022.795303] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2021] [Accepted: 02/07/2022] [Indexed: 01/04/2023] Open
Abstract
We present (i) the ApiNATOMY workflow to build knowledge models of biological connectivity, as well as (ii) the ApiNATOMY TOO map, a topological scaffold to organize and visually inspect these connectivity models in the context of a canonical architecture of body compartments. In this work, we outline the implementation of ApiNATOMY's knowledge representation in the context of a large-scale effort, SPARC, to map the autonomic nervous system. Within SPARC, the ApiNATOMY modeling effort has generated the SCKAN knowledge graph that combines connectivity models and TOO map. This knowledge graph models flow routes for a number of normal and disease scenarios in physiology. Calculations over SCKAN to infer routes are being leveraged to classify, navigate and search for semantically-linked metadata of multimodal experimental datasets for a number of cross-scale, cross-disciplinary projects.
Collapse
Affiliation(s)
- Bernard de Bono
- Auckland Bioengineering Institute, University of Auckland, Auckland, New Zealand
| | - Tom Gillespie
- Department of Neuroscience, University of California, San Diego, San Diego, CA, United States
| | | | - Natallia Kokash
- Faculty of Humanities, University of Amsterdam, Amsterdam, Netherlands
| | - Jeff S. Grethe
- Department of Neuroscience, University of California, San Diego, San Diego, CA, United States
| | - Maryann Martone
- Department of Neuroscience, University of California, San Diego, San Diego, CA, United States
| |
Collapse
|
26
|
Furrer L, Cornelius J, Rinaldi F. Parallel sequence tagging for concept recognition. BMC Bioinformatics 2022; 22:623. [PMID: 35331131 PMCID: PMC8943923 DOI: 10.1186/s12859-021-04511-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2021] [Accepted: 12/01/2021] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND Named Entity Recognition (NER) and Normalisation (NEN) are core components of any text-mining system for biomedical texts. In a traditional concept-recognition pipeline, these tasks are combined in a serial way, which is inherently prone to error propagation from NER to NEN. We propose a parallel architecture, where both NER and NEN are modeled as a sequence-labeling task, operating directly on the source text. We examine different harmonisation strategies for merging the predictions of the two classifiers into a single output sequence. RESULTS We test our approach on the recent Version 4 of the CRAFT corpus. In all 20 annotation sets of the concept-annotation task, our system outperforms the pipeline system reported as a baseline in the CRAFT shared task, a competition of the BioNLP Open Shared Tasks 2019. We further refine the systems from the shared task by optimising the harmonisation strategy separately for each annotation set. CONCLUSIONS Our analysis shows that the strengths of the two classifiers can be combined in a fruitful way. However, prediction harmonisation requires individual calibration on a development set for each annotation set. This allows achieving a good trade-off between established knowledge (training set) and novel information (unseen concepts).
Collapse
Affiliation(s)
- Lenz Furrer
- Department of Computational Linguistics, University of Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Joseph Cornelius
- Dalle Molle Institute for Artificial Intelligence Research (IDSIA USI/SUPSI), Lugano, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Fabio Rinaldi
- Dalle Molle Institute for Artificial Intelligence Research (IDSIA USI/SUPSI), Lugano, Switzerland.
- Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland.
- Swiss Institute of Bioinformatics, Zurich, Switzerland.
- Fondazione Bruno Kessler, Trento, Italy.
| |
Collapse
|
27
|
Porto DS, Dahdul WM, Lapp H, Balhoff JP, Vision TJ, Mabee PM, Uyeda J. Assessing Bayesian Phylogenetic Information Content of Morphological Data Using Knowledge from Anatomy Ontologies. Syst Biol 2022; 71:1290-1306. [PMID: 35285502 PMCID: PMC9558846 DOI: 10.1093/sysbio/syac022] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2021] [Revised: 02/09/2022] [Accepted: 03/05/2022] [Indexed: 11/18/2022] Open
Abstract
Morphology remains a primary source of phylogenetic information for many groups of organisms, and the only one for most fossil taxa. Organismal anatomy is not a collection of randomly assembled and independent “parts”, but instead a set of dependent and hierarchically nested entities resulting from ontogeny and phylogeny. How do we make sense of these dependent and at times redundant characters? One promising approach is using ontologies—structured controlled vocabularies that summarize knowledge about different properties of anatomical entities, including developmental and structural dependencies. Here, we assess whether evolutionary patterns can explain the proximity of ontology-annotated characters within an ontology. To do so, we measure phylogenetic information across characters and evaluate if it matches the hierarchical structure given by ontological knowledge—in much the same way as across-species diversity structure is given by phylogeny. We implement an approach to evaluate the Bayesian phylogenetic information (BPI) content and phylogenetic dissonance among ontology-annotated anatomical data subsets. We applied this to data sets representing two disparate animal groups: bees (Hexapoda: Hymenoptera: Apoidea, 209 chars) and characiform fishes (Actinopterygii: Ostariophysi: Characiformes, 463 chars). For bees, we find that BPI is not substantially explained by anatomy since dissonance is often high among morphologically related anatomical entities. For fishes, we find substantial information for two clusters of anatomical entities instantiating concepts from the jaws and branchial arch bones, but among-subset information decreases and dissonance increases substantially moving to higher-level subsets in the ontology. We further applied our approach to address particular evolutionary hypotheses with an example of morphological evolution in miniature fishes. While we show that phylogenetic information does match ontology structure for some anatomical entities, additional relationships and processes, such as convergence, likely play a substantial role in explaining BPI and dissonance, and merit future investigation. Our work demonstrates how complex morphological data sets can be interrogated with ontologies by allowing one to access how information is spread hierarchically across anatomical concepts, how congruent this information is, and what sorts of processes may play a role in explaining it: phylogeny, development, or convergence. [Apidae; Bayesian phylogenetic information; Ostariophysi; Phenoscape; phylogenetic dissonance; semantic similarity.]
Collapse
Affiliation(s)
- Diego S Porto
- Department of Biological Sciences, Virginia Polytechnic Institute and State University, 926 West Campus Drive, Blacksburg, VA 24061, USA
| | - Wasila M Dahdul
- UCI Libraries,University of California, Irvine, Irvine, CA 92623, USA
- Department of Biology, University of South Dakota, 414 East Clark Street, Vermillion, SD 57069, USA
| | - Hilmar Lapp
- Center for Genomic and Computational Biology, Duke University, 101 Science Drive, Durham, NC 27708, USA
| | - James P Balhoff
- Renaissance Computing Institute, University of North Carolina, 100 Europa Drive, Suite 540, Chapel Hill, NC 27517, USA
| | - Todd J Vision
- Department of Biology and School of Information and Library Sciences, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| | - Paula M Mabee
- Department of Biology, University of South Dakota, 414 East Clark Street, Vermillion, SD 57069, USA
- Battelle, National Ecological Observatory Network, Boulder, CO 80301, USA
| | - Josef Uyeda
- Department of Biological Sciences, Virginia Polytechnic Institute and State University, 926 West Campus Drive, Blacksburg, VA 24061, USA
| |
Collapse
|
28
|
Martens M, Evelo CT, Willighagen EL. Providing Adverse Outcome Pathways from the AOP-Wiki in a Semantic Web Format to Increase Usability and Accessibility of the Content. APPLIED IN VITRO TOXICOLOGY 2022; 8:2-13. [PMID: 35388368 DOI: 10.26434/chemrxiv.13524191] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/22/2023]
Abstract
INTRODUCTION The AOP-Wiki is the main platform for the development and storage of adverse outcome pathways (AOPs). These AOPs describe mechanistic information about toxicodynamic processes and can be used to develop effective risk assessment strategies. However, it is challenging to automatically and systematically parse, filter, and use its contents. We explored solutions to better structure the AOP-Wiki content, and to link it with chemical and biological resources. Together, this allows more detailed exploration, which can be automated. MATERIALS AND METHODS We converted the complete AOP-Wiki content into resource description framework (RDF) triples. We used >20 ontologies for the semantic annotation of property-object relations, including the Chemical Information Ontology, Dublin Core, and the AOP Ontology. RESULTS The resulting RDF contains >122,000 triples describing 158 unique properties of >15,000 unique subjects. Furthermore, >3500 link-outs were added to 12 chemical databases, and >7500 link-outs to 4 gene and protein databases. The AOP-Wiki RDF has been made available at https://aopwiki.rdf.bigcat-bioinformatics.org. DISCUSSION SPARQL queries can be used to answer biological and toxicological questions, such as listing measurement methods for all Key Events leading to an Adverse Outcome of interest. The full power that the use of this new resource provides becomes apparent when combining the content with external databases using federated queries. CONCLUSION Overall, the AOP-Wiki RDF allows new ways to explore the rapidly growing AOP knowledge and makes the integration of this database in automated workflows possible, making the AOP-Wiki more FAIR.
Collapse
Affiliation(s)
- Marvin Martens
- Department of Bioinformatics-BiGCaT, NUTRIM, and Maastricht University, Maastricht, The Netherlands
| | - Chris T Evelo
- Department of Bioinformatics-BiGCaT, NUTRIM, and Maastricht University, Maastricht, The Netherlands
- Maastricht Centre for Systems Biology (MaCSBio), Maastricht University, Maastricht, The Netherlands
| | - Egon L Willighagen
- Department of Bioinformatics-BiGCaT, NUTRIM, and Maastricht University, Maastricht, The Netherlands
| |
Collapse
|
29
|
Martens M, Evelo CT, Willighagen EL. Providing Adverse Outcome Pathways from the AOP-Wiki in a Semantic Web Format to Increase Usability and Accessibility of the Content. APPLIED IN VITRO TOXICOLOGY 2022; 8:2-13. [PMID: 35388368 PMCID: PMC8978481 DOI: 10.1089/aivt.2021.0010] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/01/2022]
Abstract
Introduction: The AOP-Wiki is the main platform for the development and storage of adverse outcome pathways (AOPs). These AOPs describe mechanistic information about toxicodynamic processes and can be used to develop effective risk assessment strategies. However, it is challenging to automatically and systematically parse, filter, and use its contents. We explored solutions to better structure the AOP-Wiki content, and to link it with chemical and biological resources. Together, this allows more detailed exploration, which can be automated. Materials and Methods: We converted the complete AOP-Wiki content into resource description framework (RDF) triples. We used >20 ontologies for the semantic annotation of property–object relations, including the Chemical Information Ontology, Dublin Core, and the AOP Ontology. Results: The resulting RDF contains >122,000 triples describing 158 unique properties of >15,000 unique subjects. Furthermore, >3500 link-outs were added to 12 chemical databases, and >7500 link-outs to 4 gene and protein databases. The AOP-Wiki RDF has been made available at https://aopwiki.rdf.bigcat-bioinformatics.org Discussion: SPARQL queries can be used to answer biological and toxicological questions, such as listing measurement methods for all Key Events leading to an Adverse Outcome of interest. The full power that the use of this new resource provides becomes apparent when combining the content with external databases using federated queries. Conclusion: Overall, the AOP-Wiki RDF allows new ways to explore the rapidly growing AOP knowledge and makes the integration of this database in automated workflows possible, making the AOP-Wiki more FAIR.
Collapse
Affiliation(s)
- Marvin Martens
- Department of Bioinformatics—BiGCaT, NUTRIM, and Maastricht University, Maastricht, The Netherlands
| | - Chris T. Evelo
- Department of Bioinformatics—BiGCaT, NUTRIM, and Maastricht University, Maastricht, The Netherlands
- Maastricht Centre for Systems Biology (MaCSBio), Maastricht University, Maastricht, The Netherlands
| | - Egon L. Willighagen
- Department of Bioinformatics—BiGCaT, NUTRIM, and Maastricht University, Maastricht, The Netherlands
| |
Collapse
|
30
|
Li J, Sheng Q, Shyr Y, Liu Q. scMRMA: single cell multiresolution marker-based annotation. Nucleic Acids Res 2022; 50:e7. [PMID: 34648021 PMCID: PMC8789072 DOI: 10.1093/nar/gkab931] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2021] [Revised: 09/10/2021] [Accepted: 09/28/2021] [Indexed: 01/22/2023] Open
Abstract
Single-cell RNA sequencing has become a powerful tool for identifying and characterizing cellular heterogeneity. One essential step to understanding cellular heterogeneity is determining cell identities. The widely used strategy predicts identities by projecting cells or cell clusters unidirectionally against a reference to find the best match. Here, we develop a bidirectional method, scMRMA, where a hierarchical reference guides iterative clustering and deep annotation with enhanced resolutions. Taking full advantage of the reference, scMRMA greatly improves the annotation accuracy. scMRMA achieved better performance than existing methods in four benchmark datasets and successfully revealed the expansion of CD8 T cell populations in squamous cell carcinoma after anti-PD-1 treatment.
Collapse
Affiliation(s)
- Jia Li
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA
| | - Quanhu Sheng
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA
| | - Yu Shyr
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA
| | - Qi Liu
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA
| |
Collapse
|
31
|
Aevermann B, Zhang Y, Novotny M, Keshk M, Bakken T, Miller J, Hodge R, Lelieveldt B, Lein E, Scheuermann RH. A machine learning method for the discovery of minimum marker gene combinations for cell type identification from single-cell RNA sequencing. Genome Res 2021; 31:1767-1780. [PMID: 34088715 PMCID: PMC8494219 DOI: 10.1101/gr.275569.121] [Citation(s) in RCA: 40] [Impact Index Per Article: 13.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2021] [Accepted: 05/24/2021] [Indexed: 11/24/2022]
Abstract
Single-cell genomics is rapidly advancing our knowledge of the diversity of cell phenotypes, including both cell types and cell states. Driven by single-cell/-nucleus RNA sequencing (scRNA-seq), comprehensive cell atlas projects characterizing a wide range of organisms and tissues are currently underway. As a result, it is critical that the transcriptional phenotypes discovered are defined and disseminated in a consistent and concise manner. Molecular biomarkers have historically played an important role in biological research, from defining immune cell types by surface protein expression to defining diseases by their molecular drivers. Here, we describe a machine learning-based marker gene selection algorithm, NS-Forest version 2.0, which leverages the nonlinear attributes of random forest feature selection and a binary expression scoring approach to discover the minimal marker gene expression combinations that optimally capture the cell type identity represented in complete scRNA-seq transcriptional profiles. The marker genes selected provide an expression barcode that serves as both a useful tool for downstream biological investigation and the necessary and sufficient characteristics for semantic cell type definition. The use of NS-Forest to identify marker genes for human brain middle temporal gyrus cell types reveals the importance of cell signaling and noncoding RNAs in neuronal cell type identity.
Collapse
Affiliation(s)
| | - Yun Zhang
- J. Craig Venter Institute, La Jolla, California 92037, USA
| | - Mark Novotny
- J. Craig Venter Institute, La Jolla, California 92037, USA
| | - Mohamed Keshk
- J. Craig Venter Institute, La Jolla, California 92037, USA
| | - Trygve Bakken
- Allen Institute for Brain Science, Seattle, Washington 98109, USA
| | - Jeremy Miller
- Allen Institute for Brain Science, Seattle, Washington 98109, USA
| | - Rebecca Hodge
- Allen Institute for Brain Science, Seattle, Washington 98109, USA
| | - Boudewijn Lelieveldt
- Department of Radiology, Leiden University Medical Center, 2300 Leiden, The Netherlands
- Department of Intelligent Systems, Delft University of Technology, 2628 Delft, The Netherlands
| | - Ed Lein
- Allen Institute for Brain Science, Seattle, Washington 98109, USA
| | - Richard H Scheuermann
- J. Craig Venter Institute, La Jolla, California 92037, USA
- University of California San Diego, La Jolla, California 92093, USA
- La Jolla Institute for Immunology, La Jolla, California 92037, USA
| |
Collapse
|
32
|
Wang S, Pisco AO, McGeever A, Brbic M, Zitnik M, Darmanis S, Leskovec J, Karkanias J, Altman RB. Leveraging the Cell Ontology to classify unseen cell types. Nat Commun 2021; 12:5556. [PMID: 34548483 PMCID: PMC8455606 DOI: 10.1038/s41467-021-25725-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2021] [Accepted: 08/17/2021] [Indexed: 11/09/2022] Open
Abstract
Single cell technologies are rapidly generating large amounts of data that enables us to understand biological systems at single-cell resolution. However, joint analysis of datasets generated by independent labs remains challenging due to a lack of consistent terminology to describe cell types. Here, we present OnClass, an algorithm and accompanying software for automatically classifying cells into cell types that are part of the controlled vocabulary that forms the Cell Ontology. A key advantage of OnClass is its capability to classify cells into cell types not present in the training data because it uses the Cell Ontology graph to infer cell type relationships. Furthermore, OnClass can be used to identify marker genes for all the cell ontology categories, regardless of whether the cell types are present or absent in the training data, suggesting that OnClass goes beyond a simple annotation tool for single cell datasets, being the first algorithm capable to identify marker genes specific to all terms of the Cell Ontology and offering the possibility of refining the Cell Ontology using a data-centric approach.
Collapse
Affiliation(s)
- Sheng Wang
- Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA
- Department of Genetics, Stanford University, Stanford, CA, 94305, USA
| | | | | | - Maria Brbic
- Department of Computer Science, Stanford University, Stanford, CA, 94305, USA
| | - Marinka Zitnik
- Department of Computer Science, Stanford University, Stanford, CA, 94305, USA
| | | | - Jure Leskovec
- Chan Zuckerberg Biohub, San Francisco, CA, 94158, USA
- Department of Computer Science, Stanford University, Stanford, CA, 94305, USA
| | - Jim Karkanias
- Chan Zuckerberg Biohub, San Francisco, CA, 94158, USA
| | - Russ B Altman
- Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA.
- Department of Genetics, Stanford University, Stanford, CA, 94305, USA.
- Chan Zuckerberg Biohub, San Francisco, CA, 94158, USA.
| |
Collapse
|
33
|
Abstract
Cell type annotation is important in the analysis of single-cell RNA-seq data. CellO is a machine-learning-based tool for annotating cells using the Cell Ontology, a rich hierarchy of known cell types. We provide a protocol for using the CellO Python package to annotate human cells. We demonstrate how to use CellO in conjunction with Scanpy, a Python library for performing single-cell analysis, annotate a lung tissue data set, interpret its hierarchically structured cell type annotations, and create publication-ready figures. For complete details on the use and execution of this protocol, please refer to Bernstein et al. (2021). CellO is a Python package for annotating cell types in single-cell RNA-seq data CellO classifies cells against the hierarchically structured Cell Ontology CellO can be integrated into single-cell analysis pipelines implemented with Scanpy We present a tutorial that classifies cells in an existing lung tumor data set
Collapse
Affiliation(s)
| | - Colin N Dewey
- Department of Biostatistics and Medical Informatics, University of Wisconsin - Madison, Madison, WI 53792, USA.,Department of Computer Sciences, University of Wisconsin - Madison, Madison, WI 53706, USA
| |
Collapse
|
34
|
Jha SG, Borowsky AT, Cole BJ, Fahlgren N, Farmer A, Huang SSC, Karia P, Libault M, Provart NJ, Rice SL, Saura-Sanchez M, Agarwal P, Ahkami AH, Anderton CR, Briggs SP, Brophy JAN, Denolf P, Di Costanzo LF, Exposito-Alonso M, Giacomello S, Gomez-Cano F, Kaufmann K, Ko DK, Kumar S, Malkovskiy AV, Nakayama N, Obata T, Otegui MS, Palfalvi G, Quezada-Rodríguez EH, Singh R, Uhrig RG, Waese J, Van Wijk K, Wright RC, Ehrhardt DW, Birnbaum KD, Rhee SY. Vision, challenges and opportunities for a Plant Cell Atlas. eLife 2021; 10:e66877. [PMID: 34491200 PMCID: PMC8423441 DOI: 10.7554/elife.66877] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2021] [Accepted: 08/26/2021] [Indexed: 02/06/2023] Open
Abstract
With growing populations and pressing environmental problems, future economies will be increasingly plant-based. Now is the time to reimagine plant science as a critical component of fundamental science, agriculture, environmental stewardship, energy, technology and healthcare. This effort requires a conceptual and technological framework to identify and map all cell types, and to comprehensively annotate the localization and organization of molecules at cellular and tissue levels. This framework, called the Plant Cell Atlas (PCA), will be critical for understanding and engineering plant development, physiology and environmental responses. A workshop was convened to discuss the purpose and utility of such an initiative, resulting in a roadmap that acknowledges the current knowledge gaps and technical challenges, and underscores how the PCA initiative can help to overcome them.
Collapse
Affiliation(s)
- Suryatapa Ghosh Jha
- Department of Plant Biology, Carnegie Institution for ScienceStanfordUnited States
| | - Alexander T Borowsky
- Department of Botany and Plant Sciences, University of California, RiversideRiversideUnited States
| | - Benjamin J Cole
- Joint Genome Institute, Lawrence Berkeley National LaboratoryWalnut CreekUnited States
| | - Noah Fahlgren
- Donald Danforth Plant Science CenterSt. LouisUnited States
| | - Andrew Farmer
- National Center for Genome ResourcesSanta FeUnited States
| | | | - Purva Karia
- Department of Plant Biology, Carnegie Institution for ScienceStanfordUnited States
- Department of Cell and Systems Biology, University of TorontoTorontoCanada
| | - Marc Libault
- Department of Agronomy and Horticulture, University of Nebraska-LincolnLincolnUnited States
| | - Nicholas J Provart
- Department of Cell and Systems Biology and the Centre for the Analysis of Genome Evolution and Function, University of TorontoTorontoCanada
| | - Selena L Rice
- Department of Plant Biology, Carnegie Institution for ScienceStanfordUnited States
| | - Maite Saura-Sanchez
- Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Investigaciones Fisiológicas y Ecológicas Vinculadas a la Agricultura, Facultad de Agronomía, Universidad de Buenos AiresBuenos AiresArgentina
| | - Pinky Agarwal
- National Institute of Plant Genome ResearchNew DelhiIndia
| | - Amir H Ahkami
- Environmental Molecular Sciences Division, Pacific Northwest National LaboratoryRichlandUnited States
| | - Christopher R Anderton
- Environmental Molecular Sciences Division, Pacific Northwest National LaboratoryRichlandUnited States
| | - Steven P Briggs
- Department of Biological Sciences, University of California, San DiegoSan DiegoUnited States
| | | | | | - Luigi F Di Costanzo
- Department of Agricultural Sciences, University of Naples Federico IINapoliItaly
| | - Moises Exposito-Alonso
- Department of Plant Biology, Carnegie Institution for ScienceStanfordUnited States
- Department of Plant Biology, Carnegie Institution for ScienceTübingenGermany
| | | | - Fabio Gomez-Cano
- Department of Biochemistry and Molecular Biology, Michigan State UniversityEast LansingUnited States
| | - Kerstin Kaufmann
- Department for Plant Cell and Molecular Biology, Institute for Biology, Humboldt-Universitaet zu BerlinBerlinGermany
| | - Dae Kwan Ko
- Great Lakes Bioenergy Research Center, Michigan State UniversityEast LansingUnited States
| | - Sagar Kumar
- Department of Plant Breeding & Genetics, Mata Gujri College, Fatehgarh Sahib, Punjabi UniversityPatialaIndia
| | - Andrey V Malkovskiy
- Department of Plant Biology, Carnegie Institution for ScienceStanfordUnited States
| | - Naomi Nakayama
- Department of Bioengineering, Imperial College LondonLondonUnited Kingdom
| | - Toshihiro Obata
- Department of Biochemistry, University of Nebraska-LincolnMadisonUnited States
| | - Marisa S Otegui
- Department of Botany, University of Wisconsin-MadisonMadisonUnited States
| | - Gergo Palfalvi
- Division of Evolutionary Biology, National Institute for Basic BiologyOkazakiJapan
| | - Elsa H Quezada-Rodríguez
- Ciencias Agrogenómicas, Escuela Nacional de Estudios Superiores Unidad León, Universidad Nacional Autónoma de MéxicoLeónMexico
| | - Rajveer Singh
- School of Agricultural Biotechnology, Punjab Agricultural UniversityLudhianaIndia
| | - R Glen Uhrig
- Department of Science, University of AlbertaEdmontonCanada
| | - Jamie Waese
- Department of Cell and Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of TorontoTorontoCanada
| | - Klaas Van Wijk
- School of Integrated Plant Science, Plant Biology Section, Cornell UniversityIthacaUnited States
| | - R Clay Wright
- Department of Biological Systems Engineering, Virginia TechBlacksburgUnited States
| | - David W Ehrhardt
- Department of Plant Biology, Carnegie Institution for ScienceStanfordUnited States
| | - Kenneth D Birnbaum
- Center for Genomics and Systems Biology, New York UniversityNew YorkUnited States
| | - Seung Y Rhee
- Department of Plant Biology, Carnegie Institution for ScienceStanfordUnited States
| |
Collapse
|
35
|
Abstract
AbstractDimensionality reduction techniques, such as t-SNE, can construct informative visualizations of high-dimensional data. When jointly visualising multiple data sets, a straightforward application of these methods often fails; instead of revealing underlying classes, the resulting visualizations expose dataset-specific clusters. To circumvent these batch effects, we propose an embedding procedure that uses a t-SNE visualization constructed on a reference data set as a scaffold for embedding new data points. Each data instance from a new, unseen, secondary data is embedded independently and does not change the reference embedding. This prevents any interactions between instances in the secondary data and implicitly mitigates batch effects. We demonstrate the utility of this approach by analyzing six recently published single-cell gene expression data sets with up to tens of thousands of cells and thousands of genes. The batch effects in our studies are particularly strong as the data comes from different institutions using different experimental protocols. The visualizations constructed by our proposed approach are clear of batch effects, and the cells from secondary data sets correctly co-cluster with cells of the same type from the primary data. We also show the predictive power of our simple, visual classification approach in t-SNE space matches the accuracy of specialized machine learning techniques that consider the entire compendium of features that profile single cells.
Collapse
|
36
|
Schaffer LV, Ideker T. Mapping the multiscale structure of biological systems. Cell Syst 2021; 12:622-635. [PMID: 34139169 PMCID: PMC8245186 DOI: 10.1016/j.cels.2021.05.012] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2021] [Revised: 05/04/2021] [Accepted: 05/14/2021] [Indexed: 01/14/2023]
Abstract
Biological systems are by nature multiscale, consisting of subsystems that factor into progressively smaller units in a deeply hierarchical structure. At any level of the hierarchy, an ever-increasing diversity of technologies can be applied to characterize the corresponding biological units and their relations, resulting in large networks of physical or functional proximities-e.g., proximities of amino acids within a protein, of proteins within a complex, or of cell types within a tissue. Here, we review general concepts and progress in using network proximity measures as a basis for creation of multiscale hierarchical maps of biological systems. We discuss the functionalization of these maps to create predictive models, including those useful in translation of genotype to phenotype, along with strategies for model visualization and challenges faced by multiscale modeling in the near future. Collectively, these approaches enable a unified hierarchical approach to biological data, with application from the molecular to the macroscopic.
Collapse
Affiliation(s)
- Leah V Schaffer
- Division of Genetics, Department of Medicine, University of California San Diego, San Diego, La Jolla, CA 92093, USA
| | - Trey Ideker
- Division of Genetics, Department of Medicine, University of California San Diego, San Diego, La Jolla, CA 92093, USA.
| |
Collapse
|
37
|
A Literature-Derived Knowledge Graph Augments the Interpretation of Single Cell RNA-seq Datasets. Genes (Basel) 2021; 12:genes12060898. [PMID: 34200671 PMCID: PMC8229796 DOI: 10.3390/genes12060898] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2021] [Revised: 06/04/2021] [Accepted: 06/04/2021] [Indexed: 01/05/2023] Open
Abstract
Technology to generate single cell RNA-sequencing (scRNA-seq) datasets and tools to annotate them have advanced rapidly in the past several years. Such tools generally rely on existing transcriptomic datasets or curated databases of cell type defining genes, while the application of scalable natural language processing (NLP) methods to enhance analysis workflows has not been adequately explored. Here we deployed an NLP framework to objectively quantify associations between a comprehensive set of over 20,000 human protein-coding genes and over 500 cell type terms across over 26 million biomedical documents. The resultant gene-cell type associations (GCAs) are significantly stronger between a curated set of matched cell type-marker pairs than the complementary set of mismatched pairs (Mann Whitney p = 6.15 × 10−76, r = 0.24; cohen’s D = 2.6). Building on this, we developed an augmented annotation algorithm (single cell Annotation via Literature Encoding, or scALE) that leverages GCAs to categorize cell clusters identified in scRNA-seq datasets, and we tested its ability to predict the cellular identity of 133 clusters from nine datasets of human breast, colon, heart, joint, ovary, prostate, skin, and small intestine tissues. With the optimized settings, the true cellular identity matched the top prediction in 59% of tested clusters and was present among the top five predictions for 91% of clusters. scALE slightly outperformed an existing method for reference data driven automated cluster annotation, and we demonstrate that integration of scALE can meaningfully improve the annotations derived from such methods. Further, contextualization of differential expression analyses with these GCAs highlights poorly characterized markers of well-studied cell types, such as CLIC6 and DNASE1L3 in retinal pigment epithelial cells and endothelial cells, respectively. Taken together, this study illustrates for the first time how the systematic application of a literature-derived knowledge graph can expedite and enhance the annotation and interpretation of scRNA-seq data.
Collapse
|
38
|
Zheng WM, Chai QW, Zhang J, Xue X. Ternary compound ontology matching for cognitive green computing. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2021; 18:4860-4870. [PMID: 34198469 DOI: 10.3934/mbe.2021247] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Cognitive green computing (CGC) dedicates to study the designing, manufacturing, using and disposing of computers, servers and associated subsystems with minimal environmental damage. These solutions should provide efficient mechanisms for maximizing the efficiency of use of computing resources. Evolutionary algorithm (EA) is a well-known global search algorithm, which has been successfully used to solve various complex optimization problems. However, a run of population-based EA often requires huge memory consumption, which limited their applications in the memory-limited hardware. To overcome this drawback, in this work, we propose a compact EA (CEA) for the sake of CGC, whose compact encoding and evolving mechanism is able to significantly reduce the memory consumption. After that, we use it to address the ternary compound ontology matching problem. Six testing cases that consist of nine ontologies are used to test CEA's performance, and the experimental results show its effectiveness.
Collapse
Affiliation(s)
- Wei-Min Zheng
- College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
| | - Qing-Wei Chai
- College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
| | - Jie Zhang
- School of Computer Science and Engineering, Yulin Normal University, Yulin 537000, China
| | - Xingsi Xue
- School of Computer Science and Mathematics, Fujian University of Technology, Fuzhou 350118, China
| |
Collapse
|
39
|
Lauriola I, Aiolli F, Lavelli A, Rinaldi F. Learning adaptive representations for entity recognition in the biomedical domain. J Biomed Semantics 2021; 12:10. [PMID: 34001263 PMCID: PMC8127187 DOI: 10.1186/s13326-021-00238-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2019] [Accepted: 03/09/2021] [Indexed: 11/10/2022] Open
Abstract
Background Named Entity Recognition is a common task in Natural Language Processing applications, whose purpose is to recognize named entities in textual documents. Several systems exist to solve this task in the biomedical domain, based on Natural Language Processing techniques and Machine Learning algorithms. A crucial step of these applications is the choice of the representation which describes data. Several representations have been proposed in the literature, some of which are based on a strong knowledge of the domain, and they consist of features manually defined by domain experts. Usually, these representations describe the problem well, but they require a lot of human effort and annotated data. On the other hand, general-purpose representations like word-embeddings do not require human domain knowledge, but they could be too general for a specific task. Results This paper investigates methods to learn the best representation from data directly, by combining several knowledge-based representations and word embeddings. Two mechanisms have been considered to perform the combination, which are neural networks and Multiple Kernel Learning. To this end, we use a hybrid architecture for biomedical entity recognition which integrates dictionary look-up (also known as gazetteers) with machine learning techniques. Results on the CRAFT corpus clearly show the benefits of the proposed algorithm in terms of F1 score. Conclusions Our experiments show that the principled combination of general, domain specific, word-, and character-level representations improves the performance of entity recognition. We also discussed the contribution of each representation in the final solution.
Collapse
Affiliation(s)
- Ivano Lauriola
- Department of Mathematics, University of Padova, Via Trieste 63, Padova, 35121, Italy. .,Fondazione Bruno Kessler, Via Sommarive 18, Trento, 38123, Italy.
| | - Fabio Aiolli
- Department of Mathematics, University of Padova, Via Trieste 63, Padova, 35121, Italy
| | - Alberto Lavelli
- Fondazione Bruno Kessler, Via Sommarive 18, Trento, 38123, Italy
| | - Fabio Rinaldi
- Fondazione Bruno Kessler, Via Sommarive 18, Trento, 38123, Italy.,Dalle Molle Institute for Artificial Intelligence Research (IDSIA), Via Cantonale 2C, Manno, 6928, Svizzera.,Department of Quantitative Biomedicine, University of Zurich, Andreasstrasse 15, Zürich, 8050, Svizzera.,SIB, Swiss Institute of Bioinformatics, Génopode, Quartier UNIL-Sorge, bâtiment, Lausanne, 1015, Svizzera
| |
Collapse
|
40
|
Touré V, Vercruysse S, Acencio ML, Lovering RC, Orchard S, Bradley G, Casals-Casas C, Chaouiya C, Del-Toro N, Flobak Å, Gaudet P, Hermjakob H, Hoyt CT, Licata L, Lægreid A, Mungall CJ, Niknejad A, Panni S, Perfetto L, Porras P, Pratt D, Saez-Rodriguez J, Thieffry D, Thomas PD, Türei D, Kuiper M. The Minimum Information about a Molecular Interaction CAusal STatement (MI2CAST). Bioinformatics 2021; 36:5712-5718. [PMID: 32637990 PMCID: PMC8023674 DOI: 10.1093/bioinformatics/btaa622] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2020] [Revised: 06/06/2020] [Accepted: 06/30/2020] [Indexed: 12/30/2022] Open
Abstract
Motivation A large variety of molecular interactions occurs between biomolecular components in cells. When a molecular interaction results in a regulatory effect, exerted by one component onto a downstream component, a so-called ‘causal interaction’ takes place. Causal interactions constitute the building blocks in our understanding of larger regulatory networks in cells. These causal interactions and the biological processes they enable (e.g. gene regulation) need to be described with a careful appreciation of the underlying molecular reactions. A proper description of this information enables archiving, sharing and reuse by humans and for automated computational processing. Various representations of causal relationships between biological components are currently used in a variety of resources. Results Here, we propose a checklist that accommodates current representations, called the Minimum Information about a Molecular Interaction CAusal STatement (MI2CAST). This checklist defines both the required core information, as well as a comprehensive set of other contextual details valuable to the end user and relevant for reusing and reproducing causal molecular interaction information. The MI2CAST checklist can be used as reporting guidelines when annotating and curating causal statements, while fostering uniformity and interoperability of the data across resources. Availability and implementation The checklist together with examples is accessible at https://github.com/MI2CAST/MI2CAST Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Vasundra Touré
- Department of Biology, Norwegian University of Science and Technology (NTNU), Trondheim 7491, Norway
| | - Steven Vercruysse
- Department of Biology, Norwegian University of Science and Technology (NTNU), Trondheim 7491, Norway
| | - Marcio Luis Acencio
- Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology (NTNU), Trondheim 7491, Norway
| | - Ruth C Lovering
- Functional Gene Annotation, Preclinical and Fundamental Science, Institute of Cardiovascular Science, UCL, University College London, London WC1E 6JF, UK
| | - Sandra Orchard
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Glyn Bradley
- Computational Biology, Functional Genomics, GSK, Stevenage SG1 2NY, UK
| | | | - Claudine Chaouiya
- Aix Marseille Univ, CNRS, Centrale Marseille, I2M Marseille 13331, France
| | - Noemi Del-Toro
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Åsmund Flobak
- Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology (NTNU), Trondheim 7491, Norway.,The Cancer Clinic, St. Olav's Hospital, Trondheim University Hospital, Trondheim 7030, Norway
| | - Pascale Gaudet
- SIB Swiss Institute of Bioinformatics, Geneva 1211, Switzerland
| | - Henning Hermjakob
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | | | - Luana Licata
- Department of Biology, University of Rome Tor Vergata, Via della Ricerca Scientifica, 00133 Rome, Italy
| | - Astrid Lægreid
- Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology (NTNU), Trondheim 7491, Norway
| | - Christopher J Mungall
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Anne Niknejad
- Vital-IT Group, SIB Swiss Institute of Bioinformatics, Quartier Sorge, Amphipole Building, 1015 Lausanne, Switzerland
| | - Simona Panni
- Department of Biology, Ecology and Earth Sciences, University of Calabria, Ecology and Earth Science, Via Pietro Bucci Cubo 6/C, Rende 87036, CS, Italy
| | - Livia Perfetto
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Pablo Porras
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Dexter Pratt
- Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA
| | - Julio Saez-Rodriguez
- Institute of Computational Biomedicine, Heidelberg University, Faculty of Medicine, 69120 Heidelberg, Germany.,Joint Research Centre for Computational Biomedicine (JRC-COMBINE), Faculty of Medicine, RWTH Aachen University, Aachen 52062, Germany
| | - Denis Thieffry
- Institut de Biologie de l'ENS (IBENS), Département de Biologie, École Normale Supérieure, CNRS, INSERM, Université PSL, 75005 Paris, France
| | - Paul D Thomas
- Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90007, USA
| | - Dénes Türei
- Joint Research Centre for Computational Biomedicine (JRC-COMBINE), Faculty of Medicine, RWTH Aachen University, Aachen 52062, Germany
| | - Martin Kuiper
- Department of Biology, Norwegian University of Science and Technology (NTNU), Trondheim 7491, Norway
| |
Collapse
|
41
|
Kuksin M, Morel D, Aglave M, Danlos FX, Marabelle A, Zinovyev A, Gautheret D, Verlingue L. Applications of single-cell and bulk RNA sequencing in onco-immunology. Eur J Cancer 2021; 149:193-210. [PMID: 33866228 DOI: 10.1016/j.ejca.2021.03.005] [Citation(s) in RCA: 62] [Impact Index Per Article: 20.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2020] [Revised: 02/26/2021] [Accepted: 03/04/2021] [Indexed: 02/08/2023]
Abstract
The rising interest for precise characterization of the tumour immune contexture has recently brought forward the high potential of RNA sequencing (RNA-seq) in identifying molecular mechanisms engaged in the response to immunotherapy. In this review, we provide an overview of the major principles of single-cell and conventional (bulk) RNA-seq applied to onco-immunology. We describe standard preprocessing and statistical analyses of data obtained from such techniques and highlight some computational challenges relative to the sequencing of individual cells. We notably provide examples of gene expression analyses such as differential expression analysis, dimensionality reduction, clustering and enrichment analysis. Additionally, we used public data sets to exemplify how deconvolution algorithms can identify and quantify multiple immune subpopulations from either bulk or single-cell RNA-seq. We give examples of machine and deep learning models used to predict patient outcomes and treatment effect from high-dimensional data. Finally, we balance the strengths and weaknesses of single-cell and bulk RNA-seq regarding their applications in the clinic.
Collapse
Affiliation(s)
- Maria Kuksin
- ENS de Lyon, 15 Parvis René Descartes, 69007, Lyon, France; Département d'Innovations Thérapeutiques et Essais Précoces (DITEP), Gustave Roussy Cancer Campus, 114 rue Edouard Vaillant, 94800, Villejuif, France
| | - Daphné Morel
- Département d'Innovations Thérapeutiques et Essais Précoces (DITEP), Gustave Roussy Cancer Campus, 114 rue Edouard Vaillant, 94800, Villejuif, France; Département de Radiothérapie, Gustave Roussy Cancer Campus, Gustave Roussy, 114 rue Edouard Vaillant, 94800, Villejuif, France; INSERM UMR1030, Molecular Radiotherapy and Therapeutic Innovations, Gustave Roussy, 114 rue Edouard Vaillant, 94800, Villejuif, France
| | - Marine Aglave
- INSERM US23, CNRS UMS 3655, Gustave Roussy Cancer Campus, 114 rue Edouard Vaillant, 94800, Villejuif, France
| | | | - Aurélien Marabelle
- Département d'Innovations Thérapeutiques et Essais Précoces (DITEP), Gustave Roussy Cancer Campus, 114 rue Edouard Vaillant, 94800, Villejuif, France; INSERM U1015, Gustave Roussy, Université Paris Saclay, France
| | - Andrei Zinovyev
- Institut Curie, PSL Research University, F-75005, Paris, France; INSERM, U900, F-75005, Paris, France; MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, F-75006, Paris, France; Laboratory of Advanced Methods for High-dimensional Data Analysis, Lobachevsky University, 603000, Nizhny Novgorod, Russia
| | - Daniel Gautheret
- Institute for Integrative Biology of the Cell, UMR 9198, CEA, CNRS, Université Paris-Saclay, Gif-Sur-Yvette, France; IHU PRISM, Gustave Roussy Cancer Campus, Gustave Roussy, 114 Rue Edouard Vaillant, 94800, Villejuif, France; Université Paris-Saclay, France
| | - Loïc Verlingue
- Département d'Innovations Thérapeutiques et Essais Précoces (DITEP), Gustave Roussy Cancer Campus, 114 rue Edouard Vaillant, 94800, Villejuif, France; INSERM UMR1030, Molecular Radiotherapy and Therapeutic Innovations, Gustave Roussy, 114 rue Edouard Vaillant, 94800, Villejuif, France; Institut Curie, PSL Research University, F-75005, Paris, France; Université Paris-Saclay, France.
| |
Collapse
|
42
|
Ruiz C, Zitnik M, Leskovec J. Identification of disease treatment mechanisms through the multiscale interactome. Nat Commun 2021; 12:1796. [PMID: 33741907 PMCID: PMC7979814 DOI: 10.1038/s41467-021-21770-8] [Citation(s) in RCA: 61] [Impact Index Per Article: 20.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2020] [Accepted: 02/04/2021] [Indexed: 12/12/2022] Open
Abstract
Most diseases disrupt multiple proteins, and drugs treat such diseases by restoring the functions of the disrupted proteins. How drugs restore these functions, however, is often unknown as a drug's therapeutic effects are not limited to the proteins that the drug directly targets. Here, we develop the multiscale interactome, a powerful approach to explain disease treatment. We integrate disease-perturbed proteins, drug targets, and biological functions into a multiscale interactome network. We then develop a random walk-based method that captures how drug effects propagate through a hierarchy of biological functions and physical protein-protein interactions. On three key pharmacological tasks, the multiscale interactome predicts drug-disease treatment, identifies proteins and biological functions related to treatment, and predicts genes that alter a treatment's efficacy and adverse reactions. Our results indicate that physical interactions between proteins alone cannot explain treatment since many drugs treat diseases by affecting the biological functions disrupted by the disease rather than directly targeting disease proteins or their regulators. We provide a general framework for explaining treatment, even when drugs seem unrelated to the diseases they are recommended for.
Collapse
Affiliation(s)
- Camilo Ruiz
- Computer Science Department, Stanford University, Stanford, CA, USA
- Bioengineering Department, Stanford University, Stanford, CA, USA
| | - Marinka Zitnik
- Biomedical Informatics Department, Harvard University, Boston, MA, USA
| | - Jure Leskovec
- Computer Science Department, Stanford University, Stanford, CA, USA.
- Chan Zuckerberg Biohub, San Francisco, CA, USA.
| |
Collapse
|
43
|
Bernstein MN, Ni Z, Collins M, Burkard ME, Kendziorski C, Stewart R. CHARTS: a web application for characterizing and comparing tumor subpopulations in publicly available single-cell RNA-seq data sets. BMC Bioinformatics 2021; 22:83. [PMID: 33622236 PMCID: PMC7903756 DOI: 10.1186/s12859-021-04021-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2020] [Accepted: 02/11/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Single-cell RNA-seq (scRNA-seq) enables the profiling of genome-wide gene expression at the single-cell level and in so doing facilitates insight into and information about cellular heterogeneity within a tissue. This is especially important in cancer, where tumor and tumor microenvironment heterogeneity directly impact development, maintenance, and progression of disease. While publicly available scRNA-seq cancer data sets offer unprecedented opportunity to better understand the mechanisms underlying tumor progression, metastasis, drug resistance, and immune evasion, much of the available information has been underutilized, in part, due to the lack of tools available for aggregating and analysing these data. RESULTS We present CHARacterizing Tumor Subpopulations (CHARTS), a web application for exploring publicly available scRNA-seq cancer data sets in the NCBI's Gene Expression Omnibus. More specifically, CHARTS enables the exploration of individual gene expression, cell type, malignancy-status, differentially expressed genes, and gene set enrichment results in subpopulations of cells across tumors and data sets. Along with the web application, we also make available the backend computational pipeline that was used to produce the analyses that are available for exploration in the web application. CONCLUSION CHARTS is an easy to use, comprehensive platform for exploring single-cell subpopulations within tumors across the ever-growing collection of public scRNA-seq cancer data sets. CHARTS is freely available at charts.morgridge.org.
Collapse
Affiliation(s)
| | - Zijian Ni
- Department of Statistics, University of Wisconsin - Madison, Madison, WI, 53706, USA
| | | | - Mark E Burkard
- Department of Medicine, Hematology/Oncology, University of Wisconsin - Madison, Madison, WI, 53705, USA
- University of Wisconsin Carbone Cancer Center, Madison, WI, 53705, USA
| | - Christina Kendziorski
- Department of Biostatistics and Medical Informatics, University of Wisconsin - Madison, Madison, WI, 53792, USA.
| | - Ron Stewart
- Morgridge Institute for Research, Madison, WI, 53715, USA.
| |
Collapse
|
44
|
Cabau-Laporta J, Ascensión AM, Arrospide-Elgarresta M, Gerovska D, Araúzo-Bravo MJ. FOntCell: Fusion of Ontologies of Cells. Front Cell Dev Biol 2021; 9:562908. [PMID: 33644039 PMCID: PMC7905052 DOI: 10.3389/fcell.2021.562908] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2020] [Accepted: 01/05/2021] [Indexed: 11/25/2022] Open
Abstract
High-throughput cell-data technologies such as single-cell RNA-seq create a demand for algorithms for automatic cell classification and characterization. There exist several cell classification ontologies with complementary information. However, one needs to merge them to synergistically combine their information. The main difficulty in merging is to match the ontologies since they use different naming conventions. Therefore, we developed an algorithm that merges ontologies by integrating the name matching between class label names with the structure mapping between the ontology elements based on graph convolution. Since the structure mapping is a time consuming process, we designed two methods to perform the graph convolution: vectorial structure matching and constraint-based structure matching. To perform the vectorial structure matching, we designed a general method to calculate the similarities between vectors of different lengths for different metrics. Additionally, we adapted the slower Blondel method to work for structure matching. We implemented our algorithms into FOntCell, a software module in Python for efficient automatic parallel-computed merging/fusion of ontologies in the same or similar knowledge domains. FOntCell can unify dispersed knowledge from one domain into a unique ontology in OWL format and iteratively reuse it to continuously adapt ontologies with new data endlessly produced by data-driven classification methods, such as of the Human Cell Atlas. To navigate easily across the merged ontologies, it generates HTML files with tabulated and graphic summaries, and interactive circular Directed Acyclic Graphs. We used FOntCell to merge the CELDA, LifeMap and LungMAP Human Anatomy cell ontologies into a comprehensive cell ontology. We compared FOntCell with tools used for the alignment of mouse and human anatomy ontologies task proposed by the Ontology Alignment Evaluation Initiative (OAEI) and found that the Fβ alignment accuracies of FOntCell are above the geometric mean of the other tools; more importantly, it outperforms significantly the best OAEI tools in cell ontology alignment in terms of Fβ alignment accuracies.
Collapse
Affiliation(s)
- Javier Cabau-Laporta
- Computational Biology and Systems Biomedicine Group, Biodonostia Health Research Institute, San Sebastián, Spain
| | - Alex M Ascensión
- Computational Biology and Systems Biomedicine Group, Biodonostia Health Research Institute, San Sebastián, Spain
| | - Mikel Arrospide-Elgarresta
- Computational Biology and Systems Biomedicine Group, Biodonostia Health Research Institute, San Sebastián, Spain
| | - Daniela Gerovska
- Computational Biology and Systems Biomedicine Group, Biodonostia Health Research Institute, San Sebastián, Spain.,Computational Biomedicine Data Analysis Platform, Biodonostia Health Research Institute, San Sebastián, Spain
| | - Marcos J Araúzo-Bravo
- Computational Biology and Systems Biomedicine Group, Biodonostia Health Research Institute, San Sebastián, Spain.,Computational Biomedicine Data Analysis Platform, Biodonostia Health Research Institute, San Sebastián, Spain.,Basque Foundation for Science (IKERBASQUE), Bilbao, Spain.,Centro de Investigación Biomédica en Red (CIBER) of Frailty and Healthy Aging (CIBERfes), Madrid, Spain.,TransBioNet Thematic Network of Excellence for Transitional Bioinformatics, Barcelona Supercomputing Center, Barcelona, Spain.,Computational Biology and Bioinformatics, Department Cell and Developmental Biology Max Planck Institute for Molecular Biomedicine, Münster, Germany
| |
Collapse
|
45
|
Baldarelli RM, Smith CM, Finger JH, Hayamizu TF, McCright IJ, Xu J, Shaw DR, Beal JS, Blodgett O, Campbell J, Corbani LE, Frost PJ, Giannatto SC, Miers DB, Kadin JA, Richardson JE, Ringwald M. The mouse Gene Expression Database (GXD): 2021 update. Nucleic Acids Res 2021; 49:D924-D931. [PMID: 33104772 PMCID: PMC7778941 DOI: 10.1093/nar/gkaa914] [Citation(s) in RCA: 49] [Impact Index Per Article: 16.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2020] [Revised: 10/01/2020] [Accepted: 10/02/2020] [Indexed: 01/05/2023] Open
Abstract
The Gene Expression Database (GXD; www.informatics.jax.org/expression.shtml) is an extensive and well-curated community resource of mouse developmental gene expression information. For many years, GXD has collected and integrated data from RNA in situ hybridization, immunohistochemistry, RT-PCR, northern blot, and western blot experiments through curation of the scientific literature and by collaborations with large-scale expression projects. Since our last report in 2019, we have continued to acquire these classical types of expression data; developed a searchable index of RNA-Seq and microarray experiments that allows users to quickly and reliably find specific mouse expression studies in ArrayExpress (https://www.ebi.ac.uk/arrayexpress/) and GEO (https://www.ncbi.nlm.nih.gov/geo/); and expanded GXD to include RNA-Seq data. Uniformly processed RNA-Seq data are imported from the EBI Expression Atlas and then integrated with the other types of expression data in GXD, and with the genetic, functional, phenotypic and disease-related information in Mouse Genome Informatics (MGI). This integration has made the RNA-Seq data accessible via GXD’s enhanced searching and filtering capabilities. Further, we have embedded the Morpheus heat map utility into the GXD user interface to provide additional tools for display and analysis of RNA-Seq data, including heat map visualization, sorting, filtering, hierarchical clustering, nearest neighbors analysis and visual enrichment.
Collapse
Affiliation(s)
| | | | | | - Terry F Hayamizu
- The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609, USA
| | | | - Jingxia Xu
- The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609, USA
| | - David R Shaw
- The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609, USA
| | - Jonathan S Beal
- The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609, USA
| | - Olin Blodgett
- The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609, USA
| | - Jeffrey Campbell
- The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609, USA
| | - Lori E Corbani
- The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609, USA
| | - Pete J Frost
- The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609, USA
| | | | - Dave B Miers
- The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609, USA
| | - James A Kadin
- The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609, USA
| | | | - Martin Ringwald
- The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609, USA
| |
Collapse
|
46
|
Smith CM, Kadin JA, Baldarelli RM, Beal JS, Blodgett O, Giannatto SC, Richardson JE, Ringwald M. GXD's RNA-Seq and Microarray Experiment Search: using curated metadata to reliably find mouse expression studies of interest. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2021; 2020:5756137. [PMID: 32140729 PMCID: PMC7058436 DOI: 10.1093/database/baaa002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/31/2019] [Revised: 12/09/2019] [Accepted: 01/06/2020] [Indexed: 12/28/2022]
Abstract
The Gene Expression Database (GXD), an extensive community resource of curated expression information for the mouse, has developed an RNA-Seq and Microarray Experiment Search (http://www.informatics.jax.org/gxd/htexp_index). This tool allows users to quickly and reliably find specific experiments in ArrayExpress and the Gene Expression Omnibus (GEO) that study endogenous gene expression in wild-type and mutant mice. Standardized metadata annotations, curated by GXD, allow users to specify the anatomical structure, developmental stage, mutated gene, strain and sex of samples of interest, as well as the study type and key parameters of the experiment. These searches, powered by controlled vocabularies and ontologies, can be combined with free text searching of experiment titles and descriptions. Search result summaries include link-outs to ArrayExpress and GEO, providing easy access to the expression data itself. Links to the PubMed entries for accompanying publications are also included. More information about this tool and GXD can be found at the GXD home page (http://www.informatics.jax.org/expression.shtml). Database URL:http://www.informatics.jax.org/expression.shtml
Collapse
Affiliation(s)
| | - James A Kadin
- The Jackson Laboratory 600 Main Street Bar Harbor, ME 04609, USA
| | | | - Jonathan S Beal
- The Jackson Laboratory 600 Main Street Bar Harbor, ME 04609, USA
| | - Olin Blodgett
- The Jackson Laboratory 600 Main Street Bar Harbor, ME 04609, USA
| | | | | | - Martin Ringwald
- The Jackson Laboratory 600 Main Street Bar Harbor, ME 04609, USA
| |
Collapse
|
47
|
|
48
|
Sarwar DM, Nickerson DP. CellML Model Discovery with the Physiome Model Repository. SYSTEMS MEDICINE 2021. [DOI: 10.1016/b978-0-12-801238-3.11681-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022] Open
|
49
|
Slater LT, Gkoutos GV, Hoehndorf R. Towards semantic interoperability: finding and repairing hidden contradictions in biomedical ontologies. BMC Med Inform Decis Mak 2020; 20:311. [PMID: 33319712 PMCID: PMC7736131 DOI: 10.1186/s12911-020-01336-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2020] [Accepted: 11/16/2020] [Indexed: 12/25/2022] Open
Abstract
Background Ontologies are widely used throughout the biomedical domain. These ontologies formally represent the classes and relations assumed to exist within a domain. As scientific domains are deeply interlinked, so too are their representations. While individual ontologies can be tested for consistency and coherency using automated reasoning methods, systematically combining ontologies of multiple domains together may reveal previously hidden contradictions. Methods We developed a method that tests for hidden unsatisfiabilities in an ontology that arise when combined with other ontologies. For this purpose, we combined sets of ontologies and use automated reasoning to determine whether unsatisfiable classes are present. In addition, we designed and implemented a novel algorithm that can determine justifications for contradictions across extremely large and complicated ontologies, and use these justifications to semi-automatically repair ontologies by identifying a small set of axioms that, when removed, result in a consistent and coherent set of ontologies.
Results We tested the mutual consistency of the OBO Foundry and the OBO ontologies and find that the combined OBO Foundry gives rise to at least 636 unsatisfiable classes, while the OBO ontologies give rise to more than 300,000 unsatisfiable classes. We also applied our semi-automatic repair algorithm to each combination of OBO ontologies that resulted in unsatisfiable classes, finding that only 117 axioms could be removed to account for all cases of unsatisfiability across all OBO ontologies. Conclusions We identified a large set of hidden unsatisfiability across a broad range of biomedical ontologies, and we find that this large set of unsatisfiable classes is the result of a relatively small amount of axiomatic disagreements. Our results show that hidden unsatisfiability is a serious problem in ontology interoperability; however, our results also provide a way towards more consistent ontologies by addressing the issues we identified.
Collapse
Affiliation(s)
- Luke T Slater
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, Birmingham, B15 2TT, UK. .,Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, Birmingham, B15 2TT, UK.
| | - Georgios V Gkoutos
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, Birmingham, B15 2TT, UK.,Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, Birmingham, B15 2TT, UK.,NIHR Experimental Cancer Medicine Centre, Birmingham, B15 2TT, UK.,NIHR Surgical Reconstruction and Microbiology Research Centre, Birmingham, B15 2TT, UK.,NIHR Biomedical Research Centre, Birmingham, B15 2TT, UK.,MRC Health Data Research UK (HDR UK Midlands, Birmingham, B15 2TT, UK
| | - Robert Hoehndorf
- Computer, Electrical and Mathematical Sciences and Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
| |
Collapse
|
50
|
Bernstein MN, Ma Z, Gleicher M, Dewey CN. CellO: comprehensive and hierarchical cell type classification of human cells with the Cell Ontology. iScience 2020; 24:101913. [PMID: 33364592 PMCID: PMC7753962 DOI: 10.1016/j.isci.2020.101913] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2020] [Revised: 10/28/2020] [Accepted: 12/02/2020] [Indexed: 12/15/2022] Open
Abstract
Cell type annotation is a fundamental task in the analysis of single-cell RNA-sequencing data. In this work, we present CellO, a machine learning-based tool for annotating human RNA-seq data with the Cell Ontology. CellO enables accurate and standardized cell type classification of cell clusters by considering the rich hierarchical structure of known cell types. Furthermore, CellO comes pre-trained on a comprehensive data set of human, healthy, untreated primary samples in the Sequence Read Archive. CellO's comprehensive training set enables it to run out of the box on diverse cell types and achieves competitive or even superior performance when compared to existing state-of-the-art methods. Lastly, CellO's linear models are easily interpreted, thereby enabling exploration of cell-type-specific expression signatures across the ontology. To this end, we also present the CellO Viewer: a web application for exploring CellO's models across the ontology.
Collapse
Affiliation(s)
| | - Zhongjie Ma
- Department of Computer Sciences, University of Wisconsin - Madison, Madison, WI 53706, USA
| | - Michael Gleicher
- Department of Computer Sciences, University of Wisconsin - Madison, Madison, WI 53706, USA
| | - Colin N Dewey
- Department of Computer Sciences, University of Wisconsin - Madison, Madison, WI 53706, USA.,Department of Biostatistics and Medical Informatics, University of Wisconsin - Madison, Madison, WI 53792, USA
| |
Collapse
|