1
Wu J, Wan C, Ji Z, Zhou Y, Hou W. EpiFoundation: A Foundation Model for Single-Cell ATAC-seq via Peak-to-Gene Alignment. bioRxiv 2025:2025.02.05.636688. [PMID: 39975086 PMCID: PMC11839112 DOI: 10.1101/2025.02.05.636688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/21/2025]
Abstract
Foundation models exhibit strong capabilities for downstream tasks by learning generalized representations through self-supervised pre-training on large datasets. While several foundation models have been developed for single-cell RNA-seq (scRNA-seq) data, there is still a lack of models specifically tailored for single-cell ATAC-seq (scATAC-seq), which measures epigenetic information in individual cells. The principal challenge in developing such a model lies in the vast number of scATAC peaks and the significant sparsity of the data, which complicates the formulation of peak-to-peak correlations. To address this challenge, we introduce EpiFoundation, a foundation model for learning cell representations from the high-dimensional and sparse space of peaks. EpiFoundation relies on an innovative cross-modality pre-training procedure with two key technical innovations. First, EpiFoundation exclusively processes the non-zero peak set, thereby enhancing the density of cell-specific information within the input data. Second, EpiFoundation utilizes dense gene expression information to supervise the pre-training process, aligning peak-to-gene correlations. EpiFoundation can handle various types of downstream tasks, including cell-type annotation, batch correction, and gene expression prediction. To train and validate EpiFoundation, we curated MiniAtlas, a dataset of 100,000+ single cells with paired scRNA-seq and scATAC-seq data, along with diverse test sets spanning various tissues and cell types for robust evaluation. EpiFoundation demonstrates state-of-the-art performance across multiple tissues and diverse downstream tasks.
2
Levine D, Rizvi SA, Lévy S, Pallikkavaliyaveetil N, Zhang D, Chen X, Ghadermarzi S, Wu R, Zheng Z, Vrkic I, Zhong A, Raskin D, Han I, de Oliveira Fonseca AH, Caro JO, Karbasi A, Dhodapkar RM, van Dijk D. Cell2Sentence: Teaching Large Language Models the Language of Biology. bioRxiv 2024:2023.09.11.557287. [PMID: 39554079 PMCID: PMC11565894 DOI: 10.1101/2023.09.11.557287] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 11/19/2024]
Abstract
We introduce Cell2Sentence (C2S), a novel method to directly adapt large language models to a biological context, specifically single-cell transcriptomics. By transforming gene expression data into "cell sentences," C2S bridges the gap between natural language processing and biology. We demonstrate that cell sentences enable the fine-tuning of language models for diverse tasks in biology, including cell generation, complex cell type annotation, and direct data-driven text generation. Our experiments reveal that GPT-2, when fine-tuned with C2S, can generate biologically valid cells based on cell type inputs, and accurately predict cell types from cell sentences. This illustrates that language models, through C2S fine-tuning, can acquire a significant understanding of single-cell biology while maintaining robust text generation capabilities. C2S offers a flexible, accessible framework to integrate natural language processing with transcriptomics, utilizing existing models and libraries for a wide range of biological applications.
Affiliation(s)
- Daniel Levine
  - Department of Computer Science, Yale University, New Haven, CT, USA
- Syed Asad Rizvi
  - Department of Computer Science, Yale University, New Haven, CT, USA
- Sacha Lévy
  - Department of Computer Science, Yale University, New Haven, CT, USA
- David Zhang
  - Department of Computer Science, Yale University, New Haven, CT, USA
- Xingyu Chen
  - Department of Computer Science, Yale University, New Haven, CT, USA
- Sina Ghadermarzi
  - Department of Computer Science, Yale University, New Haven, CT, USA
- Ruiming Wu
  - School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA
- Zihe Zheng
  - Department of Computer Science, Yale University, New Haven, CT, USA
- Ivan Vrkic
  - School of Computer and Communication Sciences, Swiss Federal Institute of Technology Lausanne, Lausanne, Switzerland
- Anna Zhong
  - Department of Computer Science, Yale University, New Haven, CT, USA
- Daphne Raskin
  - Department of Computer Science, Yale University, New Haven, CT, USA
- Insu Han
  - Department of Computer Science, Yale University, New Haven, CT, USA
- Josue Ortega Caro
  - Department of Computer Science, Yale University, New Haven, CT, USA
  - Department of Neuroscience, Yale School of Medicine, New Haven, CT, USA
  - Wu Tsai Institute, Yale University, New Haven, CT, USA
- Amin Karbasi
  - Department of Computer Science, Yale University, New Haven, CT, USA
  - Google
  - Yale Institute for Foundations of Data Science, New Haven, CT, USA
  - Yale School of Engineering and Applied Science, New Haven, CT, USA
- Rahul M. Dhodapkar
  - Roski Eye Institute, University of Southern California, Los Angeles, CA, USA
  - Department of Internal Medicine (Cardiology), Yale School of Medicine, New Haven, CT, USA
- David van Dijk
  - Department of Computer Science, Yale University, New Haven, CT, USA
  - Wu Tsai Institute, Yale University, New Haven, CT, USA
  - Yale Institute for Foundations of Data Science, New Haven, CT, USA
  - Department of Internal Medicine (Cardiology), Yale School of Medicine, New Haven, CT, USA
  - Cardiovascular Research Center, Yale School of Medicine, New Haven, CT, USA
  - Interdepartmental Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT, USA
3
Wang J, Ye Q, Liu L, Guo NL, Hu G. Scientific figures interpreted by ChatGPT: strengths in plot recognition and limits in color perception. NPJ Precis Oncol 2024; 8:84. [PMID: 38580746 PMCID: PMC10997760 DOI: 10.1038/s41698-024-00576-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Accepted: 02/27/2024] [Indexed: 04/07/2024]
Abstract
Emerging studies underscore the promising capabilities of large language model-based chatbots in conducting basic bioinformatics data analyses. The recent feature of accepting image inputs by ChatGPT, also known as GPT-4V(ision), motivated us to explore its efficacy in deciphering bioinformatics scientific figures. Our evaluation with examples in cancer research, including sequencing data analysis, multimodal network-based drug repositioning, and tumor clonal evolution, revealed that ChatGPT can proficiently explain different plot types and apply biological knowledge to enrich interpretations. However, it struggled to provide accurate interpretations when color perception and quantitative analysis of visual elements were involved. Furthermore, while the chatbot can draft figure legends and summarize findings from the figures, stringent proofreading is imperative to ensure the accuracy and reliability of the content.
Affiliation(s)
- Jinge Wang
  - Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV, 26506, USA
- Qing Ye
  - West Virginia University Cancer Institute, West Virginia University, Morgantown, WV, 26506, USA
- Li Liu
  - College of Health Solutions, Arizona State University, Phoenix, AZ, 85004, USA
  - Biodesign Institute, Arizona State University, Tempe, AZ, 85281, USA
- Nancy Lan Guo
  - West Virginia University Cancer Institute, West Virginia University, Morgantown, WV, 26506, USA
  - Department of Occupational and Environmental Health Sciences, West Virginia University, Morgantown, WV, 26506, USA
- Gangqing Hu
  - Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV, 26506, USA
  - West Virginia University Cancer Institute, West Virginia University, Morgantown, WV, 26506, USA
4
Chen Y, Zou J. GenePT: A Simple But Effective Foundation Model for Genes and Cells Built From ChatGPT. bioRxiv 2024:2023.10.16.562533. [PMID: 37905130 PMCID: PMC10614824 DOI: 10.1101/2023.10.16.562533] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 11/02/2023]
Abstract
There has been significant recent progress in leveraging large-scale gene expression data to develop foundation models for single-cell biology. Models such as Geneformer and scGPT implicitly learn gene and cellular functions from the gene expression profiles of millions of cells, which requires extensive data curation and resource-intensive training. Here we explore a much simpler alternative by leveraging ChatGPT embeddings of genes based on literature. Our proposal, GenePT, uses NCBI text descriptions of individual genes with GPT-3.5 to generate gene embeddings. From there, GenePT generates single-cell embeddings in two ways: (i) by averaging the gene embeddings, weighted by each gene's expression level; or (ii) by creating a sentence embedding for each cell, using gene names ordered by the expression level. Without the need for dataset curation and additional pretraining, GenePT is efficient and easy to use. On many downstream tasks used to evaluate recent single-cell foundation models - e.g., classifying gene properties and cell types - GenePT achieves comparable, and often better, performance than Geneformer and other models. GenePT demonstrates that large language model embedding of literature is a simple and effective path for biological foundation models.
Affiliation(s)
- Yiqun Chen
  - Department of Biomedical Data Science, Stanford University, Stanford, 94305, CA, USA
- James Zou
  - Department of Biomedical Data Science, Stanford University, Stanford, 94305, CA, USA
  - Department of Electrical Engineering, Stanford University, Stanford, 94305, CA, USA
  - Department of Computer Science, Stanford University, Stanford, 94305, CA, USA
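The two cell-embedding strategies named in the GenePT abstract (an expression-weighted average of gene embeddings, and a "sentence" of gene names ordered by expression level) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names `weighted_cell_embedding` and `cell_sentence`, the toy two-dimensional embeddings, and the tie-breaking rule are hypothetical and are not taken from the GenePT code. In the actual method the gene embeddings come from GPT-3.5 applied to NCBI gene descriptions, and the sentence is itself embedded by a language model, a step this sketch does not perform.

```python
from typing import Dict, List, Optional, Sequence


def weighted_cell_embedding(
    expression: Dict[str, float],
    gene_embeddings: Dict[str, Sequence[float]],
) -> List[float]:
    """Average gene embeddings, weighted by each gene's expression level.

    Genes with zero expression or without an embedding are skipped.
    (Hypothetical helper; not the authors' implementation.)
    """
    genes = [g for g, x in expression.items() if x > 0 and g in gene_embeddings]
    total = sum(expression[g] for g in genes)
    if total == 0:
        raise ValueError("no expressed gene has an embedding")
    dim = len(gene_embeddings[genes[0]])
    cell = [0.0] * dim
    for g in genes:
        weight = expression[g] / total  # weights sum to 1 across kept genes
        for i, value in enumerate(gene_embeddings[g]):
            cell[i] += weight * value
    return cell


def cell_sentence(expression: Dict[str, float], top_k: Optional[int] = None) -> str:
    """Join expressed gene names in descending order of expression.

    Ties are broken alphabetically (an assumption made here for determinism).
    """
    ranked = sorted(
        (g for g, x in expression.items() if x > 0),
        key=lambda g: (-expression[g], g),
    )
    return " ".join(ranked[:top_k])


# Toy inputs (made up for illustration only).
toy_embeddings = {"GENE_A": [1.0, 0.0], "GENE_B": [0.0, 1.0]}
toy_expression = {"GENE_A": 3.0, "GENE_B": 1.0, "GENE_C": 0.0}

embedding = weighted_cell_embedding(toy_expression, toy_embeddings)
sentence = cell_sentence(toy_expression)
```

With these toy inputs, `GENE_A` contributes three quarters of the weight, and `GENE_C` is dropped from both outputs because its expression is zero.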
5
Wang J, Ye Q, Liu L, Lan Guo N, Hu G. Bioinformatics Illustrations Decoded by ChatGPT: The Good, The Bad, and The Ugly. bioRxiv 2023:2023.10.15.562423. [PMID: 37904927 PMCID: PMC10614796 DOI: 10.1101/2023.10.15.562423] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023]
Abstract
Emerging studies underscore the promising capabilities of large language model-based chatbots in conducting fundamental bioinformatics data analyses. The recent feature of accepting image-inputs by ChatGPT motivated us to explore its efficacy in deciphering bioinformatics illustrations. Our evaluation with examples in cancer research, including sequencing data analysis, multimodal network-based drug repositioning, and tumor clonal evolution, revealed that ChatGPT can proficiently explain different plot types and apply biological knowledge to enrich interpretations. However, it struggled to provide accurate interpretations when quantitative analysis of visual elements was involved. Furthermore, while the chatbot can draft figure legends and summarize findings from the figures, stringent proofreading is imperative to ensure the accuracy and reliability of the content.
Affiliation(s)
- Jinge Wang
  - Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV 26506, USA
- Qing Ye
  - West Virginia University Cancer Institute, West Virginia University, Morgantown, WV 26506, USA
- Li Liu
  - College of Health Solutions, Arizona State University, Phoenix, AZ 85004, USA
  - Biodesign Institute, Arizona State University, Tempe, AZ 85281, USA
- Nancy Lan Guo
  - West Virginia University Cancer Institute, West Virginia University, Morgantown, WV 26506, USA
  - Department of Occupational and Environmental Health Sciences, West Virginia University, Morgantown, WV 26506, USA
- Gangqing Hu
  - Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV 26506, USA
  - West Virginia University Cancer Institute, West Virginia University, Morgantown, WV 26506, USA
6
Liu G, Ma X, Zhang Y, Su B, Liu P. GPT4: The Indispensable Helper for Neurosurgeons in the New Era. Ann Biomed Eng 2023; 51:2113-2115. [PMID: 37204548 DOI: 10.1007/s10439-023-03241-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 05/10/2023] [Accepted: 05/11/2023] [Indexed: 05/20/2023]
Abstract
GPT4 is the newest multimodal language model released by OpenAI. With its powerful capabilities, GPT4 has great potential to revolutionize the healthcare industry. In this study, we propose various ways GPT4 could be applied in the field of neurosurgery in the future. We believe that GPT4 is likely to become an indispensable assistant for neurosurgeons in the new era.
Affiliation(s)
- Gemingtian Liu
  - Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Xin Ma
  - Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Yu Zhang
  - Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Boyan Su
  - Department of Neurosurgery, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, Beijing, China
- Pinan Liu
  - Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
  - Department of Neural Reconstruction, Beijing Neurosurgery Institute, Capital Medical University, Beijing, China
7
Analyzing the Future of ChatGPT in Medical Research. Artificial Intelligence Applications Using ChatGPT in Education 2023:114-125. [DOI: 10.4018/978-1-6684-9300-7.ch011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/02/2024]
Abstract
ChatGPT, an advanced language model based on the GPT-3.5 architecture developed by OpenAI, has garnered significant attention and widespread discussion across various domains. Students, educators, professionals, and businesses alike are engaging in dialogues about the capabilities and potential applications of this cutting-edge technology. The objective of the study is to identify current research directions for ChatGPT by surveying various preprint servers. The current research surrounding ChatGPT demonstrates a growing interest in its application within the context of medical examination boards. Researchers have observed the potential of ChatGPT as a beneficial tool in supporting medical assessments and evaluations. Other research directions include literature synthesis and clinical decision-making.