1
|
Sun F, Li H, Sun D, Fu S, Gu L, Shao X, Wang Q, Dong X, Duan B, Xing F, Wu J, Xiao M, Zhao F, Han JDJ, Liu Q, Fan X, Li C, Wang C, Shi T. Single-cell omics: experimental workflow, data analyses and applications. SCIENCE CHINA. LIFE SCIENCES 2024:10.1007/s11427-023-2561-0. [PMID: 39060615 DOI: 10.1007/s11427-023-2561-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/07/2023] [Accepted: 04/18/2024] [Indexed: 07/28/2024]
Abstract
Cells are the fundamental units of biological systems and exhibit unique development trajectories and molecular features. Our exploration of how the genomes orchestrate the formation and maintenance of each cell, and control the cellular phenotypes of various organismsis, is both captivating and intricate. Since the inception of the first single-cell RNA technology, technologies related to single-cell sequencing have experienced rapid advancements in recent years. These technologies have expanded horizontally to include single-cell genome, epigenome, proteome, and metabolome, while vertically, they have progressed to integrate multiple omics data and incorporate additional information such as spatial scRNA-seq and CRISPR screening. Single-cell omics represent a groundbreaking advancement in the biomedical field, offering profound insights into the understanding of complex diseases, including cancers. Here, we comprehensively summarize recent advances in single-cell omics technologies, with a specific focus on the methodology section. This overview aims to guide researchers in selecting appropriate methods for single-cell sequencing and related data analysis.
Collapse
Affiliation(s)
- Fengying Sun
- Department of Clinical Laboratory, the Affiliated Wuhu Hospital of East China Normal University (The Second People's Hospital of Wuhu City), Wuhu, 241000, China
| | - Haoyan Li
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
| | - Dongqing Sun
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China
| | - Shaliu Fu
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Research Institute of Intelligent Computing, Zhejiang Lab, Hangzhou, 311121, China
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai, 201210, China
| | - Lei Gu
- Center for Single-cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China
| | - Xin Shao
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314103, China
| | - Qinqin Wang
- Center for Single-cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China
| | - Xin Dong
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China
| | - Bin Duan
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Research Institute of Intelligent Computing, Zhejiang Lab, Hangzhou, 311121, China
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai, 201210, China
| | - Feiyang Xing
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China
| | - Jun Wu
- Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, 200241, China
| | - Minmin Xiao
- Department of Clinical Laboratory, the Affiliated Wuhu Hospital of East China Normal University (The Second People's Hospital of Wuhu City), Wuhu, 241000, China.
| | - Fangqing Zhao
- Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, 100101, China.
| | - Jing-Dong J Han
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Center for Quantitative Biology (CQB), Peking University, Beijing, 100871, China.
| | - Qi Liu
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China.
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China.
- Research Institute of Intelligent Computing, Zhejiang Lab, Hangzhou, 311121, China.
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai, 201210, China.
| | - Xiaohui Fan
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China.
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314103, China.
- Zhejiang Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women's Hospital, Zhejiang University School of Medicine, Hangzhou, 310006, China.
| | - Chen Li
- Center for Single-cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China.
| | - Chenfei Wang
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China.
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China.
| | - Tieliu Shi
- Department of Clinical Laboratory, the Affiliated Wuhu Hospital of East China Normal University (The Second People's Hospital of Wuhu City), Wuhu, 241000, China.
- Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, 200241, China.
- Key Laboratory of Advanced Theory and Application in Statistics and Data Science-MOE, School of Statistics, East China Normal University, Shanghai, 200062, China.
| |
Collapse
|
2
|
Gao Q, Ai Q. DCRELM: dual correlation reduction network-based extreme learning machine for single-cell RNA-seq data clustering. Sci Rep 2024; 14:13541. [PMID: 38866896 PMCID: PMC11169517 DOI: 10.1038/s41598-024-64217-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2024] [Accepted: 06/06/2024] [Indexed: 06/14/2024] Open
Abstract
Single-cell ribonucleic acid sequencing (scRNA-seq) is a high-throughput genomic technique that is utilized to investigate single-cell transcriptomes. Cluster analysis can effectively reveal the heterogeneity and diversity of cells in scRNA-seq data, but existing clustering algorithms struggle with the inherent high dimensionality, noise, and sparsity of scRNA-seq data. To overcome these limitations, we propose a clustering algorithm: the Dual Correlation Reduction network-based Extreme Learning Machine (DCRELM). First, DCRELM obtains the low-dimensional and dense result features of scRNA-seq data in an extreme learning machine (ELM) random mapping space. Second, the ELM graph distortion module is employed to obtain a dual view of the resulting features, effectively enhancing their robustness. Third, the autoencoder fusion module is employed to learn the attributes and structural information of the resulting features, and merge these two types of information to generate consistent latent representations of these features. Fourth, the dual information reduction network is used to filter the redundant information and noise in the dual consistent latent representations. Last, a triplet self-supervised learning mechanism is utilized to further improve the clustering performance. Extensive experiments show that the DCRELM performs well in terms of clustering performance and robustness. The code is available at https://github.com/gaoqingyun-lucky/awesome-DCRELM .
Collapse
Affiliation(s)
- Qingyun Gao
- School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, 114051, China
| | - Qing Ai
- School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, 114051, China.
| |
Collapse
|
3
|
Sun Y, Kong L, Huang J, Deng H, Bian X, Li X, Cui F, Dou L, Cao C, Zou Q, Zhang Z. A comprehensive survey of dimensionality reduction and clustering methods for single-cell and spatial transcriptomics data. Brief Funct Genomics 2024:elae023. [PMID: 38860675 DOI: 10.1093/bfgp/elae023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2023] [Revised: 02/29/2024] [Accepted: 05/27/2024] [Indexed: 06/12/2024] Open
Abstract
In recent years, the application of single-cell transcriptomics and spatial transcriptomics analysis techniques has become increasingly widespread. Whether dealing with single-cell transcriptomic or spatial transcriptomic data, dimensionality reduction and clustering are indispensable. Both single-cell and spatial transcriptomic data are often high-dimensional, making the analysis and visualization of such data challenging. Through dimensionality reduction, it becomes possible to visualize the data in a lower-dimensional space, allowing for the observation of relationships and differences between cell subpopulations. Clustering enables the grouping of similar cells into the same cluster, aiding in the identification of distinct cell subpopulations and revealing cellular diversity, providing guidance for downstream analyses. In this review, we systematically summarized the most widely recognized algorithms employed for the dimensionality reduction and clustering analysis of single-cell transcriptomic and spatial transcriptomic data. This endeavor provides valuable insights and ideas that can contribute to the development of novel tools in this rapidly evolving field.
Collapse
Affiliation(s)
- Yidi Sun
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Lingling Kong
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Jiayi Huang
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Hongyan Deng
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Xinling Bian
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Xingfeng Li
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Feifei Cui
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Lijun Dou
- Genomic Medicine Institute, Lerner Research Institute, Cleveland, OH 44106, United States
| | - Chen Cao
- School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 210029, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
| | - Zilong Zhang
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| |
Collapse
|
4
|
Abdalgader K, Matroud AA, Hossin K. Experimental study on short-text clustering using transformer-based semantic similarity measure. PeerJ Comput Sci 2024; 10:e2078. [PMID: 38855231 PMCID: PMC11157522 DOI: 10.7717/peerj-cs.2078] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2023] [Accepted: 05/03/2024] [Indexed: 06/11/2024]
Abstract
Sentence clustering plays a central role in various text-processing activities and has received extensive attention for measuring semantic similarity between compared sentences. However, relatively little focus has been placed on evaluating clustering performance using available similarity measures that adopt low-dimensional continuous representations. Such representations are crucial in domains like sentence clustering, where traditional word co-occurrence representations often achieve poor results when clustering semantically similar sentences that share no common words. This article presents a new implementation that incorporates a sentence similarity measure based on the notion of embedding representation for evaluating the performance of three types of text clustering methods: partitional clustering, hierarchical clustering, and fuzzy clustering, on standard textual datasets. This measure derives its semantic information from pre-training models designed to simulate human knowledge about words in natural language. The article also compares the performance of the used similarity measure by training it on two state-of-the-art pre-training models to investigate which yields better results. We argue that the superior performance of the selected clustering methods stems from their more effective use of the semantic information offered by this embedding-based similarity measure. Furthermore, we use hierarchical clustering, the best-performing method, for a text summarization task and report the results. The implementation in this article demonstrates that incorporating the sentence embedding measure leads to significantly improved performance in both text clustering and text summarization tasks.
Collapse
Affiliation(s)
- Khaled Abdalgader
- Department of Computer Science and Engineering, American University of Ras Al Khaimah, Ras Al Khaimah, United Arab Emirates
| | | | - Khaled Hossin
- Department of Mechanical and Industrial Engineering, American University of Ras Al Khaimah, Ras Al Khaimah, United Arab Emirates
| |
Collapse
|
5
|
Wang H, Liu Z, Ma X. Learning Consistency and Specificity of Cells From Single-Cell Multi-Omic Data. IEEE J Biomed Health Inform 2024; 28:3134-3145. [PMID: 38709615 DOI: 10.1109/jbhi.2024.3370868] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/08/2024]
Abstract
Advancements in single-cell technologies concomitantly develop the epigenomic and transcriptomic profiles at the cell levels, providing opportunities to explore the potential biological mechanisms. Even though significant efforts have been dedicated to them, it remains challenging for the integration analysis of multi-omic data of single-cell because of the heterogeneity, complicated coupling and interpretability of data. To handle these issues, we propose a novel self-representation Learning-based Multi-omics data Integrative Clustering algorithm (sLMIC) for the integration of single-cell epigenomic profiles (DNA methylation or scATAC-seq) and transcriptomic (scRNA-seq), which the consistent and specific features of cells are explicitly extracted facilitating the cell clustering. Specifically, sLMIC constructs a graph for each type of single-cell data, thereby transforming omics data into multi-layer networks, which effectively removes heterogeneity of omic data. Then, sLMIC employs the low-rank and exclusivity constraints to separate the self-representation of cells into two parts, i.e., the shared and specific features, which explicitly characterize the consistency and diversity of omic data, providing an effective strategy to model the structure of cell types. Feature extraction and cell clustering are jointly formulated as an overall objective function, where latent features of data are obtained under the guidance of cell clustering. The extensive experimental results on 13 multi-omics datasets of single-cell from diverse organisms and tissues indicate that sLMIC observably exceeds the advanced algorithms regarding various measurements.
Collapse
|
6
|
Zhao S, Huang S, Niu M, Xu L, Xu L. iTTCA-MVL: A multi-view learning model based on physicochemical information and sequence statistical information for tumor T cell antigens identification. Comput Biol Med 2024; 170:107941. [PMID: 38217976 DOI: 10.1016/j.compbiomed.2024.107941] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2023] [Revised: 12/13/2023] [Accepted: 01/01/2024] [Indexed: 01/15/2024]
Abstract
Immunotherapy is an emerging treatment method aimed at activating the human immune system and relying on its own immune function to kill cancer cells and tumor tissues. It has the advantages of wide applicability and minimal side effects. Effective identification of tumor T cell antigens (TTCAs) will help researchers understand their functions and mechanisms and carry out research on anti-tumor vaccine development. Considering that using biological experimental technology to identify TTCAs can be costly and time-consuming, it is necessary to develop a robust bioinformatics computing tool. At present, different machine learning models have been proposed for identifying TTCAs, but there is still room for further improvement in their performance. To establish a TTCA predictor with better prediction performance, we propose a prediction model called iTTCA-MVL in this paper. We extracted three sets of features from the views of physicochemical information and sequence statistics, namely the distribution descriptor of composition, transition, and distribution (CTDD), TF-IDF, and LSA topic. Then, we used least squares support vector machines (LSSVMs) as submodels and Hilbert‒Schmidt independence criteria (HSIC) as constraints to establish an independent and complementary multi-view learning model. The prediction accuracy of iTTCA-MVL on the independent test set is 0.873, and Matthew's correlation coefficient is 0.747, which is significantly better than those of existing methods. Therefore, iTTCA-MVL is an excellent prediction tool that researchers can use to accurately identify TTCAs.
Collapse
Affiliation(s)
- Shulin Zhao
- The Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou People's Hospital, Quzhou, 32400, China
| | - Shibo Huang
- Beidahuang Industry Group General Hospital, Harbin, 150001, China
| | - Mengting Niu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic University, Shenzhen, 518055, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic University, Shenzhen, 518055, China
| | - Lifeng Xu
- The Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou People's Hospital, Quzhou, 32400, China.
| |
Collapse
|
7
|
Lan W, Liu M, Chen J, Ye J, Zheng R, Zhu X, Peng W. JLONMFSC: Clustering scRNA-seq data based on joint learning of non-negative matrix factorization and subspace clustering. Methods 2024; 222:1-9. [PMID: 38128706 DOI: 10.1016/j.ymeth.2023.11.019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2023] [Revised: 11/07/2023] [Accepted: 11/29/2023] [Indexed: 12/23/2023] Open
Abstract
The development of single cell RNA sequencing (scRNA-seq) has provided new perspectives to study biological problems at the single cell level. One of the key issues in scRNA-seq data analysis is to divide cells into several clusters for discovering the heterogeneity and diversity of cells. However, the existing scRNA-seq data are high-dimensional, sparse, and noisy, which challenges the existing single-cell clustering methods. In this study, we propose a joint learning framework (JLONMFSC) for clustering scRNA-seq data. In our method, the dimension of the original data is reduced to minimize the effect of noise. In addition, the graph regularized matrix factorization is used to learn the local features. Further, the Low-Rank Representation (LRR) subspace clustering is utilized to learn the global features. Finally, the joint learning of local features and global features is performed to obtain the results of clustering. We compare the proposed algorithm with eight state-of-the-art algorithms for clustering performance on six datasets, and the experimental results demonstrate that the JLONMFSC achieves better performance in all datasets. The code is avalable at https://github.com/lanbiolab/JLONMFSC.
Collapse
Affiliation(s)
- Wei Lan
- School of Computer, Electronic and Information, Guangxi University, Nanning, China; Guangxi Key Laboratory of Multimedia Communications and Network Technology, Guangxi University, Nanning, China.
| | - Mingyang Liu
- School of Computer, Electronic and Information, Guangxi University, Nanning, China
| | - Jianwei Chen
- School of Computer, Electronic and Information, Guangxi University, Nanning, China
| | - Jin Ye
- School of Computer, Electronic and Information, Guangxi University, Nanning, China
| | - Ruiqing Zheng
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
| | - Xiaoshu Zhu
- School of Computer Science and Information Security, Guilin University of Science and Technology, Guilin, China
| | - Wei Peng
- School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China
| |
Collapse
|
8
|
Fang Z, Zheng R, Li M. scMAE: a masked autoencoder for single-cell RNA-seq clustering. Bioinformatics 2024; 40:btae020. [PMID: 38230824 PMCID: PMC10832357 DOI: 10.1093/bioinformatics/btae020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2023] [Revised: 01/07/2024] [Accepted: 01/12/2024] [Indexed: 01/18/2024] Open
Abstract
MOTIVATION Single-cell RNA sequencing has emerged as a powerful technology for studying gene expression at the individual cell level. Clustering individual cells into distinct subpopulations is fundamental in scRNA-seq data analysis, facilitating the identification of cell types and exploration of cellular heterogeneity. Despite the recent development of many deep learning-based single-cell clustering methods, few have effectively exploited the correlations among genes, resulting in suboptimal clustering outcomes. RESULTS Here, we propose a novel masked autoencoder-based method, scMAE, for cell clustering. scMAE perturbs gene expression and employs a masked autoencoder to reconstruct the original data, learning robust and informative cell representations. The masked autoencoder introduces a masking predictor, which captures relationships among genes by predicting whether gene expression values are masked. By integrating this masking mechanism, scMAE effectively captures latent structures and dependencies in the data, enhancing clustering performance. We conducted extensive comparative experiments using various clustering evaluation metrics on 15 scRNA-seq datasets from different sequencing platforms. Experimental results indicate that scMAE outperforms other state-of-the-art methods on these datasets. In addition, scMAE accurately identifies rare cell types, which are challenging to detect due to their low abundance. Furthermore, biological analyses confirm the biological significance of the identified cell subpopulations. AVAILABILITY AND IMPLEMENTATION The source code of scMAE is available at: https://zenodo.org/records/10465991.
Collapse
Affiliation(s)
- Zhaoyu Fang
- School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
| | - Ruiqing Zheng
- School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
| | - Min Li
- School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
| |
Collapse
|
9
|
Erfani M, Baalousha M, Goharian E. Unveiling elemental fingerprints: A comparative study of clustering methods for multi-element nanoparticle data. THE SCIENCE OF THE TOTAL ENVIRONMENT 2023; 905:167176. [PMID: 37730026 DOI: 10.1016/j.scitotenv.2023.167176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/02/2023] [Revised: 09/03/2023] [Accepted: 09/16/2023] [Indexed: 09/22/2023]
Abstract
Single particle-inductively coupled plasma-time of flight-mass spectrometers (SP-ICP-TOF-MS) generates large datasets of the multi-elemental composition of nanoparticles. However, extracting useful information from such datasets is challenging. Hierarchical clustering (HC) has been successfully applied to extract elemental fingerprints from multi-element nanoparticle data obtained by SP-ICP-TOF-MS. However, many other clustering approaches can be applied to analyze SP-ICP-TOF-MS data that have not yet been evaluated. This study fills this knowledge gap by comparing the performance of three clustering approaches: HC, spectral clustering, and t-distributed Stochastic Neighbor Embedding coupled with Density-Based Spatial Clustering of Applications with Noise (tSNE-DBSCAN) for analyzing SP-ICP-TOF-MS data. The performance of these clustering techniques was evaluated by comparing the size of the extracted clusters and the similarity of the elemental composition of nanoparticles within each cluster. Hierarchical clustering often failed to achieve an optimal clustering solution for SP-ICP-TOF-MS data because HC is sensitive to the presence of outliers. Spectral clustering and tSNE-DBSCAN extracted clusters that were not identified by HC. This is because spectral clustering, a method developed based on graph theory, reveals the global and local structure in the data. tSNE reduces and maps the data into a lower-dimensional space, enabling clustering algorithms such as DBSCAN to identify subclusters with subtle differences in their elemental composition. However, tSNE-DBSCAN can lead to unsatisfactory clustering solutions because tuning the perplexity hyperparameter of tSNE is a difficult and a time-consuming task, and the relative distance between datapoints is not maintained. Although the three clustering approaches successfully extract useful information from SP-ICP-TOF-MS data, spectral clustering outperforms HC and tSNE-DBSCAN by generating clusters of a large number of nanoparticles with similar elemental compositions.
Collapse
Affiliation(s)
- Mahdi Erfani
- Department of Civil and Environmental Engineering, University of South Carolina, SC 29208, USA
| | - Mohammed Baalousha
- Center for Environmental Nanoscience and Risk, Department of Environmental Health Sciences, Arnold School of Public Health, University of South Carolina, Columbia, SC, 29201, USA.
| | - Erfan Goharian
- Department of Civil and Environmental Engineering, University of South Carolina, SC 29208, USA.
| |
Collapse
|
10
|
Liu Y, Wei G, Li C, Shen LC, Gasser RB, Song J, Chen D, Yu DJ. TripletCell: a deep metric learning framework for accurate annotation of cell types at the single-cell level. Brief Bioinform 2023; 24:bbad132. [PMID: 37080771 PMCID: PMC10199768 DOI: 10.1093/bib/bbad132] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2022] [Revised: 02/02/2023] [Accepted: 03/14/2023] [Indexed: 04/22/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) has significantly accelerated the experimental characterization of distinct cell lineages and types in complex tissues and organisms. Cell-type annotation is of great importance in most of the scRNA-seq analysis pipelines. However, manual cell-type annotation heavily relies on the quality of scRNA-seq data and marker genes, and therefore can be laborious and time-consuming. Furthermore, the heterogeneity of scRNA-seq datasets poses another challenge for accurate cell-type annotation, such as the batch effect induced by different scRNA-seq protocols and samples. To overcome these limitations, here we propose a novel pipeline, termed TripletCell, for cross-species, cross-protocol and cross-sample cell-type annotation. We developed a cell embedding and dimension-reduction module for the feature extraction (FE) in TripletCell, namely TripletCell-FE, to leverage the deep metric learning-based algorithm for the relationships between the reference gene expression matrix and the query cells. Our experimental studies on 21 datasets (covering nine scRNA-seq protocols, two species and three tissues) demonstrate that TripletCell outperformed state-of-the-art approaches for cell-type annotation. More importantly, regardless of protocols or species, TripletCell can deliver outstanding and robust performance in annotating different types of cells. TripletCell is freely available at https://github.com/liuyan3056/TripletCell. We believe that TripletCell is a reliable computational tool for accurately annotating various cell types using scRNA-seq data and will be instrumental in assisting the generation of novel biological hypotheses in cell biology.
Collapse
Affiliation(s)
- Yan Liu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Guo Wei
- School of Life Sciences, Nanjing University, Nanjing 210023, China
| | - Chen Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
| | - Long-Chen Shen
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Robin B Gasser
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
| | - Dijun Chen
- School of Life Sciences, Nanjing University, Nanjing 210023, China
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| |
Collapse
|
11
|
Avşar G, Pir P. A comparative performance evaluation of imputation methods in spatially resolved transcriptomics data. Mol Omics 2023; 19:162-173. [PMID: 36562244 DOI: 10.1039/d2mo00266c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]
Abstract
Spatially resolved transcriptomics technologies have drawn enormous attention by providing RNA expression patterns together with their spatial information. Even though improved techniques are being developed rapidly, the technologies which give spatially whole transcriptome level profiles suffer from dropout problems because of the low capture rate. Imputation of missing data is one strategy to eliminate this technical problem. We evaluated the imputation performance of five available methods (SpaGE, stPlus, gimVI, Tangram and stLearn) which were indicated as capable of making predictions for the dropouts in spatially resolved transcriptomics datasets. The evaluation was performed qualitatively via visualization of the predictions against the original values and quantitatively with Pearson's correlation coefficient, cosine similarity, root mean squared log-error, Silhouette Index and Calinski Harabasz Index. We found that stPlus and gimVI outperform the other three. However, the performance of all methods was lower than expected which indicates that there is still a gap for imputation tools dealing with dropout events in spatially resolved transcriptomics.
Collapse
Affiliation(s)
- Gülben Avşar
- Department of Bioengineering, Gebze Technical University, 41400 Kocaeli, Turkey.
| | - Pınar Pir
- Department of Bioengineering, Gebze Technical University, 41400 Kocaeli, Turkey.
| |
Collapse
|
12
|
Lu S, Liang Y, Li L, Liao S, Ouyang D. Inferring human miRNA–disease associations via multiple kernel fusion on GCNII. Front Genet 2022; 13:980497. [PMID: 36134032 PMCID: PMC9483142 DOI: 10.3389/fgene.2022.980497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2022] [Accepted: 07/20/2022] [Indexed: 11/16/2022] Open
Abstract
Increasing evidence shows that the occurrence of human complex diseases is closely related to the mutation and abnormal expression of microRNAs(miRNAs). MiRNAs have complex and fine regulatory mechanisms, which makes it a promising target for drug discovery and disease diagnosis. Therefore, predicting the potential miRNA-disease associations has practical significance. In this paper, we proposed an miRNA–disease association predicting method based on multiple kernel fusion on Graph Convolutional Network via Initial residual and Identity mapping (GCNII), called MKFGCNII. Firstly, we built a heterogeneous network of miRNAs and diseases to extract multi-layer features via GCNII. Secondly, multiple kernel fusion method was applied to weight fusion of embeddings at each layer. Finally, Dual Laplacian Regularized Least Squares was used to predict new miRNA–disease associations by the combined kernel in miRNA and disease spaces. Compared with the other methods, MKFGCNII obtained the highest AUC value of 0.9631. Code is available at https://github.com/cuntjx/bioInfo.
Collapse
Affiliation(s)
- Shanghui Lu
- School of Computer Science and Engineering, Macau University of Science and Technology, Taipa, China
- School of Mathematics and Physics, Hechi University, Hechi, China
| | - Yong Liang
- School of Computer Science and Engineering, Macau University of Science and Technology, Taipa, China
- Peng Cheng Laboratory, Shenzhen, China
- *Correspondence: Yong Liang,
| | - Le Li
- School of Computer Science and Engineering, Macau University of Science and Technology, Taipa, China
| | - Shuilin Liao
- School of Computer Science and Engineering, Macau University of Science and Technology, Taipa, China
| | - Dong Ouyang
- School of Computer Science and Engineering, Macau University of Science and Technology, Taipa, China
| |
Collapse
|
13
|
Wu X, Zhou Y. GE-Impute: graph embedding-based imputation for single-cell RNA-seq data. Brief Bioinform 2022; 23:6651303. [PMID: 35901457 DOI: 10.1093/bib/bbac313] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2022] [Revised: 06/27/2022] [Accepted: 07/11/2022] [Indexed: 11/12/2022] Open
Abstract
Single-cell RNA-sequencing (scRNA-seq) has been widely used to depict gene expression profiles at the single-cell resolution. However, its relatively high dropout rate often results in artificial zero expressions of genes and therefore compromised reliability of results. To overcome such unwanted sparsity of scRNA-seq data, several imputation algorithms have been developed to recover the single-cell expression profiles. Here, we propose a novel approach, GE-Impute, to impute the dropout zeros in scRNA-seq data with graph embedding-based neural network model. GE-Impute learns the neural graph representation for each cell and reconstructs the cell-cell similarity network accordingly, which enables better imputation of dropout zeros based on the more accurately allocated neighbors in the similarity network. Gene expression correlation analysis between true expression data and simulated dropout data suggests significantly better performance of GE-Impute on recovering dropout zeros for both droplet- and plated-based scRNA-seq data. GE-Impute also outperforms other imputation methods in identifying differentially expressed genes and improving the unsupervised clustering on datasets from various scRNA-seq techniques. Moreover, GE-Impute enhances the identification of marker genes, facilitating the cell type assignment of clusters. In trajectory analysis, GE-Impute improves time-course scRNA-seq data analysis and reconstructing differentiation trajectory. The above results together demonstrate that GE-Impute could be a useful method to recover the single-cell expression profiles, thus enabling better biological interpretation of scRNA-seq data. GE-Impute is implemented in Python and is freely available at https://github.com/wxbCaterpillar/GE-Impute.
Collapse
Affiliation(s)
- Xiaobin Wu
- Department of Biomedical Informatics, Center for Noncoding RNA Medicine, School of Basic Medical Sciences, Peking University, Beijing, China.,MOE Key Laboratory of Molecular Cardiovascular Sciences, Peking University, Beijing, China
| | - Yuan Zhou
- Department of Biomedical Informatics, Center for Noncoding RNA Medicine, School of Basic Medical Sciences, Peking University, Beijing, China.,MOE Key Laboratory of Molecular Cardiovascular Sciences, Peking University, Beijing, China
| |
Collapse
|
14
|
Liu G, Li M, Wang H, Lin S, Xu J, Li R, Tang M, Li C. D3K: The Dissimilarity-Density-Dynamic Radius K-means Clustering Algorithm for scRNA-Seq Data. Front Genet 2022; 13:912711. [PMID: 35846121 PMCID: PMC9284269 DOI: 10.3389/fgene.2022.912711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2022] [Accepted: 04/25/2022] [Indexed: 12/02/2022] Open
Abstract
A single-cell sequencing data set has always been a challenge for clustering because of its high dimension and multi-noise points. The traditional K-means algorithm is not suitable for this type of data. Therefore, this study proposes a Dissimilarity-Density-Dynamic Radius-K-means clustering algorithm. The algorithm adds the dynamic radius parameter to the calculation. It flexibly adjusts the active radius according to the data characteristics, which can eliminate the influence of noise points and optimize the clustering results. At the same time, the algorithm calculates the weight through the dissimilarity density of the data set, the average contrast of candidate clusters, and the dissimilarity of candidate clusters. It obtains a set of high-quality initial center points, which solves the randomness of the K-means algorithm in selecting the center points. Finally, compared with similar algorithms, this algorithm shows a better clustering effect on single-cell data. Each clustering index is higher than other single-cell clustering algorithms, which overcomes the shortcomings of the traditional K-means algorithm.
Collapse
Affiliation(s)
- Guoyun Liu
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Manzhi Li
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, China
- *Correspondence: Manzhi Li,
| | - Hongtao Wang
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Shijun Lin
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Junlin Xu
- College of Information Science and Engineering, Hunan University, Changsha, China
| | - Ruixi Li
- Geneis Beijing Co., Ltd., Beijing, China
| | - Min Tang
- School of Life Sciences, Jiangsu University, Zhenjiang, China
| | - Chun Li
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| |
Collapse
|
15
|
He S, Dou L, Li X, Zhang Y. Review of bioinformatics in Azheimer's Disease Research. Comput Biol Med 2022; 143:105269. [PMID: 35158118 DOI: 10.1016/j.compbiomed.2022.105269] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2022] [Revised: 01/21/2022] [Accepted: 01/23/2022] [Indexed: 01/05/2023]
Abstract
Alzheimer's disease (AD) is a severe neurodegenerative disease with slow course of onset and deterioration with time. With the speedup of global aging, AD has become a disease that seriously threatens the physical health of the elderly; therefore, the effective prevention and treatments of AD is an extremely important area of study for researchers and clinicians. Rapid technological developments have promoted the analysis of various kinds of complex data sets using machine learning methods. The common machine learning algorithms, such as Lasso, SVM and Random Forest, are very important in AD research. To help accelerate AD-related research, we review some recent research progress on Alzheimer's disease, including database, image analysis, gene expression, etc., which can provide AD researchers with more comprehensive research methods.
Collapse
Affiliation(s)
- Shida He
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China; Department of Computer Science, University of Tsukuba, Japan
| | - Lijun Dou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China; School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Xuehong Li
- Beidahuang Industry Group General Hospital, Harbin, China.
| | - Ying Zhang
- Department of Anesthesiology, Hospital (T.C.M) Affiliated To Southwest Medical University, Luzhou, China.
| |
Collapse
|
16
|
Gan S, Deng H, Qiu Y, Alshahrani M, Liu S. DSAE-Impute: Learning Discriminative Stacked Autoencoders for Imputing Single-cell RNA-seq Data. Curr Bioinform 2022. [DOI: 10.2174/1574893617666220330151024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Aims:
In this research, we aim to propose an accurate deep learning method to impute the missing values in scRNA-seq data. DSAE-Impute employs stacked autoencoders to capture gene expression characteristics in the original missing data and combines the discriminative correlation matrix between cells to capture global expression features during the training process, so as to accurately predict missing values.
Background:
Due to the limited amount of mRNA in single-cell, there are always many missing values in scRNA-seq data, which makes it impossible to accurately quantify the expression of single-cell RNA. The dropout phenomenon makes it impossible to detect the truly expressed genes in some cells, which greatly affects the downstream analysis on scRNA-seq data, such as cell cluster analysis and cell development trajectories.
Objective:
In this research, we aim to propose an accurate deep learning method to impute the missing values in scRNA-seq data. DSAE-Impute employs stacked autoencoders to capture gene expression characteristics in the original missing data and combines the discriminative correlation matrix between cells to capture global expression features during the training process, so as to accurately predict missing values.
Method:
We propose a novel deep learning model based on the discriminative stacked autoencoders to impute the missing values in scRNA-seq data, named DSAE-Impute. DSAE-Impute embeds the discriminative cell similarity to perfect the feature representation of stacked autoencoders, and comprehensively learns the scRNA-seq data expression pattern through layer-by-layer training to achieve accurate imputation.
Result:
We have systematically evaluated the performance of DSAE-Impute in the simulation and real datasets. The experimental results demonstrate that DSAE-Impute significantly improves downstream analysis, and its imputation results are more accurate compared with other state-of-the-art imputation methods.
Conclusion:
Extensive experiments show that compared with other state-of-the-art methods, the imputation results of DSAE-Impute on simulated and real datasets are more accurate and helpful for downstream analysis.
Collapse
Affiliation(s)
- Shengfeng Gan
- College of Computer, Hubei University of Education, Wuhan, China
| | - Huan Deng
- College of Informatics, Huazhong Agricultural University, Wuhan, China
| | - Yang Qiu
- College of Informatics, Huazhong Agricultural University, Wuhan, China
| | | | - Shichao Liu
- College of Informatics, Huazhong Agricultural University, Wuhan, China
| |
Collapse
|
17
|
HKAM-MKM: A hybrid kernel alignment maximization-based multiple kernel model for identifying DNA-binding proteins. Comput Biol Med 2022; 145:105395. [PMID: 35334314 DOI: 10.1016/j.compbiomed.2022.105395] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2022] [Revised: 03/08/2022] [Accepted: 03/08/2022] [Indexed: 12/24/2022]
Abstract
The identification of DNA-binding proteins (DBPs) has always been a hot issue in the field of sequence classification. However, considering that the experimental identification method is very resource-intensive, the construction of a computational prediction model is worthwhile. This study developed and evaluated a hybrid kernel alignment maximization-based multiple kernel model (HKAM-MKM) for predicting DBPs. First, we collected two datasets and performed feature extraction on the sequences to obtain six feature groups, and then constructed the corresponding kernels. To ensure the effective utilisation of the base kernel and avoid ignoring the difference between the sample and its neighbours, we proposed local kernel alignment to calculate the kernel between the sample and its neighbours, with each sample as the centre. We combined the global and local kernel alignments to develop a hybrid kernel alignment model, and balance the relationship between the two through parameters. By maximising the hybrid kernel alignment value, we obtained the weight of each kernel and then linearly combined the kernels in the form of weights. Finally, the fused kernel was input into a support vector machine for training and prediction. Finally, in the independent test sets PDB186 and PDB2272, we obtained the highest Matthew's correlation coefficient (MCC) (0.768 and 0.5962, respectively) and the highest accuracy (87.1% and 78.43%, respectively), which were superior to the other predictors. Therefore, HKAM-MKM is an efficient prediction tool for DBPs.
Collapse
|
18
|
Han S, Wang N, Guo Y, Tang F, Xu L, Ju Y, Shi L. Application of Sparse Representation in Bioinformatics. Front Genet 2021; 12:810875. [PMID: 34976030 PMCID: PMC8715914 DOI: 10.3389/fgene.2021.810875] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2021] [Accepted: 12/01/2021] [Indexed: 11/15/2022] Open
Abstract
Inspired by L1-norm minimization methods, such as basis pursuit, compressed sensing, and Lasso feature selection, in recent years, sparse representation shows up as a novel and potent data processing method and displays powerful superiority. Researchers have not only extended the sparse representation of a signal to image presentation, but also applied the sparsity of vectors to that of matrices. Moreover, sparse representation has been applied to pattern recognition with good results. Because of its multiple advantages, such as insensitivity to noise, strong robustness, less sensitivity to selected features, and no “overfitting” phenomenon, the application of sparse representation in bioinformatics should be studied further. This article reviews the development of sparse representation, and explains its applications in bioinformatics, namely the use of low-rank representation matrices to identify and study cancer molecules, low-rank sparse representations to analyze and process gene expression profiles, and an introduction to related cancers and gene expression profile database.
Collapse
Affiliation(s)
- Shuguang Han
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Ning Wang
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Yuxin Guo
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Furong Tang
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen, China
- *Correspondence: Ying Ju, ; Lei Shi,
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Shanghai, China
- *Correspondence: Ying Ju, ; Lei Shi,
| |
Collapse
|
19
|
SCMAG: A Semisupervised Single-Cell Clustering Method Based on Matrix Aggregation Graph Convolutional Neural Network. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:6842752. [PMID: 34646337 PMCID: PMC8505105 DOI: 10.1155/2021/6842752] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/05/2021] [Accepted: 09/14/2021] [Indexed: 11/17/2022]
Abstract
Clustering analysis is one of the most important technologies for single-cell data mining. It is widely used in the division of different gene sequences, the identification of functional genes, and the detection of new cell types. Although the traditional unsupervised clustering method does not require label data, the distribution of the original data, the setting of hyperparameters, and other factors all affect the effectiveness of the clustering algorithm. While in some cases the type of some cells is known, it is hoped to achieve high accuracy if the prior information about those cells is utilized sufficiently. In this study, we propose SCMAG (a semisupervised single-cell clustering method based on a matrix aggregation graph convolutional neural network) that takes into full consideration the prior information for single-cell data. To evaluate the performance of the proposed semisupervised clustering method, we test on different single-cell datasets and compare with the current semisupervised clustering algorithm in recognizing cell types on various real scRNA-seq data; the results show that it is a more accurate and significant model.
Collapse
|
20
|
Tang B, Wang Y, Chen Y, Li M, Tao Y. A Novel Early-Stage Lung Adenocarcinoma Prognostic Model Based on Feature Selection With Orthogonal Regression. Front Cell Dev Biol 2021; 8:620746. [PMID: 33585460 PMCID: PMC7874010 DOI: 10.3389/fcell.2020.620746] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2020] [Accepted: 11/17/2020] [Indexed: 11/13/2022] Open
Abstract
Carcinoma diagnosis and prognosis are still hindered by the lack of effective prediction model and integration methodology. We proposed a novel feature selection with orthogonal regression (FSOR) method to resolve predictor selection and performance optimization. Functional enrichment and clinical outcome analyses with multi-omics information validated the method's robustness in the early-stage prognosis of lung adenocarcinoma. Furthermore, compared with the classic least absolute shrinkage and selection operator (LASSO) regression method [the averaged 1- to 4-years predictive area under the receiver operating characteristic curve (AUC) measure, 0.6998], the proposed one outperforms more accurately by 0.7208 with fewer predictors, particularly its averaged 1- to 3-years AUC reaches 0.723, vs. classic 0.6917 on The Cancer Genome Atlas (TCGA). In sum, the proposed method can deliver better prediction performance for early-stage prognosis and improve therapy strategy but with less predictor consideration and computation burden. The self-composed running scripts, together with the processed results, are available at https://github.com/gladex/PM-FSOR.
Collapse
Affiliation(s)
- Binhua Tang
- Epigenetics & Function Group, Hohai University, Nanjing, China
| | - Yuqi Wang
- Epigenetics & Function Group, Hohai University, Nanjing, China
| | - Yu Chen
- Epigenetics & Function Group, Hohai University, Nanjing, China
| | - Ming Li
- Epigenetics & Function Group, Hohai University, Nanjing, China
| | - Yongfeng Tao
- Epigenetics & Function Group, Hohai University, Nanjing, China
| |
Collapse
|