1
Dutta P, Patra AP, Saha S. DeePROG: Deep Attention-Based Model for Diseased Gene Prognosis by Fusing Multi-Omics Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:2770-2781. [PMID: 34166198 DOI: 10.1109/tcbb.2021.3090302] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/13/2023]
Abstract
An in-depth exploration of gene prognosis using different methodologies aids in understanding various biological regulations of genes in disease pathobiology and molecular functions. Interpreting gene functions at biological and molecular levels remains a daunting yet crucial task in domains such as drug design, personalized medicine, and next-generation diagnostics. Recent advancements in omics technologies have produced diverse heterogeneous genomic datasets, such as micro-array gene expression, miRNA expression, DNA sequence, and 3D structures, which are significant resources for understanding gene functions. In this paper, we propose a novel self-attention based deep multi-modal model, named DeePROG, for the prognosis of disease affected genes based on heterogeneous omics data. We use three NCBI datasets covering three modalities, namely the gene expression profile, the underlying DNA sequence, and the 3D protein structures. To extract useful features from each modality, we develop several context-specific deep learning models. In addition, we develop three attention-based deep bi-modal architectures along with DeePROG to leverage the prognosis of the underlying biomedical data. We assess the performance of the models in terms of computational assessment of function annotation (CAFA2) metrics. Moreover, we analyze the results in terms of the receiver operating characteristic (ROC) curve in a high-class-imbalance data setting and perform statistical significance tests using Welch's t-test. Experimental results show that DeePROG significantly outperforms baseline models across the performance metrics. The source code and all preprocessed datasets used in this study are available at https://github.com/duttaprat/DeePROG.
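The abstract reports significance testing with Welch's t-test but gives no details. As a reading aid only, here is a minimal sketch of the statistic and the Welch-Satterthwaite degrees of freedom using the Python standard library. The function name `welch_t_test` and the per-run scores are hypothetical and do not come from the paper.

```python
import math
from statistics import mean, variance


def welch_t_test(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(a), len(b)
    # Squared standard error of each sample mean (sample variance / n).
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical per-run scores for two models (illustrative numbers only).
model_a = [0.81, 0.83, 0.80, 0.84, 0.82]
model_b = [0.76, 0.79, 0.75, 0.78, 0.77]
t_stat, dof = welch_t_test(model_a, model_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Unlike Student's t-test, this form does not assume that the two groups share a variance, which is why the degrees of freedom are estimated from the data.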

2
Huang Y, Ling J, Chang A, Ye H, Zhao H, Zhuo X. Identification of an immune-related key gene, PPARGC1A, in the development of anaplastic thyroid carcinoma: in-silico study and in-vitro evaluation. Minerva Endocrinol (Torino) 2022; 47:150-159. [DOI: 10.23736/s2724-6507.20.03182-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]

3
Yu K, Xie W, Wang L, Zhang S, Li W. Determination of biomarkers from microarray data using graph neural network and spectral clustering. Sci Rep 2021; 11:23828. [PMID: 34903818 PMCID: PMC8668890 DOI: 10.1038/s41598-021-03316-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/15/2021] [Accepted: 12/02/2021] [Indexed: 11/26/2022]
Abstract
In bioinformatics, the rapid development of gene sequencing technology has produced an increasing amount of microarray data. This type of data shares the typical characteristics of small sample size and high feature dimensions. Searching microarray data for biomarkers, which express the features of various diseases, is essential for disease classification. Feature selection, which is designed to remove irrelevant and redundant features, has therefore become fundamental to the analysis of microarray data. Microarray data contain a large number of redundant and irrelevant features, which severely degrade classification effectiveness. We propose an innovative feature selection method with the goal of obtaining feature dependencies from a priori knowledge and removing redundant features using spectral clustering. In this paper, the graph structure is first constructed by using the gene interaction network as a priori knowledge, and then a link prediction method based on a graph neural network is proposed to enhance the graph structure data. Finally, a feature selection method based on spectral clustering is proposed to determine biomarkers. The classification accuracy on DLBCL and Prostate can be improved by 10.90% and 16.22% compared to traditional methods. Link prediction provides an average classification accuracy improvement of 1.96% and 1.31%, and is up to 16.98% higher than the published method. The results show that the proposed method can make full use of a priori knowledge to effectively select disease prediction biomarkers with high classification accuracy.
Affiliation(s)
- Kun Yu
  - College of Medicine and Bioinformation Engineering, Northeastern University, Shenyang, China
- Weidong Xie
  - School of Computer Science and Engineering, Northeastern University, Shenyang, China
- Linjie Wang
  - School of Computer Science and Engineering, Northeastern University, Shenyang, China
- Shoujia Zhang
  - School of Computer Science and Engineering, Northeastern University, Shenyang, China
- Wei Li
  - Key Laboratory of Intelligent Computing in Medical Image MIIC, Northeastern University, Ministry of Education, Shenyang, China

4
Tan K, Huang W, Liu X, Hu J, Dong S. A Hierarchical Graph Convolution Network for Representation Learning of Gene Expression Data. IEEE J Biomed Health Inform 2021; 25:3219-3229. [PMID: 33449889 DOI: 10.1109/jbhi.2021.3052008] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/09/2022]
Abstract
The curse of dimensionality, which is caused by high dimensionality and low sample size, is a major challenge in gene expression data analysis. However, the real situation is even worse: labelling data is laborious and time-consuming, so only a small part of the limited samples will be labelled. Having so few labelled samples further increases the difficulty of training deep learning models. Interpretability is an important requirement in biomedicine. Many existing deep learning methods try to provide interpretability, but they are rarely applied to gene expression data. Recent semi-supervised graph convolution network methods try to address these problems by smoothing the label information over a graph. However, to the best of our knowledge, these methods only utilize graphs in either the feature space or the sample space, which restricts their performance. We propose a transductive semi-supervised representation learning method called a hierarchical graph convolution network (HiGCN) to aggregate the information of gene expression data in both feature and sample spaces. HiGCN first utilizes external knowledge to construct a feature graph and a similarity kernel to construct a sample graph. Then, two spatial-based GCNs are used to aggregate information on these graphs. To validate the model's performance, synthetic and real datasets are provided to lend empirical support. Compared with two recent models and three traditional models, HiGCN learns better representations of gene expression data, and these representations improve the performance of downstream tasks, especially when the model is trained on a few labelled samples. Important features can be extracted from our model to provide reliable interpretability.
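To make the idea of aggregating over both a feature graph and a sample graph concrete, the sketch below applies one plain neighborhood-averaging step along each axis of a toy expression matrix. It is not the HiGCN architecture: the learned weights, nonlinearities, and graph construction are omitted, and the graphs, values, and the helper name `mean_aggregate` are assumptions made for illustration.

```python
def mean_aggregate(values, neighbors):
    """One spatial aggregation step: average each node with its neighbors."""
    out = []
    for i in range(len(values)):
        group = [i] + list(neighbors.get(i, []))  # include the node itself
        out.append(sum(values[j] for j in group) / len(group))
    return out


# Toy expression matrix: 3 samples (rows) x 4 genes (columns), made-up values.
X = [[1.0, 2.0, 0.0, 4.0],
     [3.0, 0.0, 2.0, 0.0],
     [5.0, 4.0, 4.0, 2.0]]
gene_graph = {0: [1], 1: [0], 2: [3], 3: [2]}  # hypothetical feature graph
sample_graph = {0: [2], 2: [0]}                # hypothetical sample graph

# Smooth along genes within each sample, then along samples for each gene.
X_feat = [mean_aggregate(row, gene_graph) for row in X]
cols = [mean_aggregate([row[g] for row in X_feat], sample_graph)
        for g in range(4)]
X_both = [[cols[g][s] for g in range(4)] for s in range(3)]
print(X_both)
```

Sample 1 has no neighbors in the toy sample graph, so only the feature-graph step changes its row.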

5
Wang Z, Xiao Y, Weng F, Li X, Zhu D, Lu F, Liu X, Hou M, Meng Y. R-JaunLab: Automatic Multi-Class Recognition of Jaundice on Photos of Subjects with Region Annotation Networks. J Digit Imaging 2021; 34:337-350. [PMID: 33634415 DOI: 10.1007/s10278-021-00432-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/22/2020] [Revised: 07/01/2020] [Accepted: 02/09/2021] [Indexed: 12/21/2022]
Abstract
Jaundice occurs as a symptom of various diseases, such as hepatitis, liver cancer, and diseases of the gallbladder or pancreas. Therefore, clinical measurement with special equipment is a common method used to identify the total serum bilirubin level in patients. Fully automated multi-class recognition of jaundice involves two key issues: (1) the critical difficulties of multi-class recognition approaches compared with binary classification and (2) the subtle difficulties that arise from extensive inter-individual variability in high-resolution photos of subjects, strong similarity between healthy controls and occult jaundice, and broadly inhomogeneous color distribution. We introduce a novel approach for multi-class recognition of jaundice to detect occult jaundice, obvious jaundice and healthy controls. First, a region annotation network is developed and trained to propose eye candidates. Subsequently, an efficient jaundice recognizer is proposed to learn similarities, context, localization features and globalization characteristics from photos of subjects. Finally, both networks are unified by using a shared convolutional layer. Evaluation of the structured model in a comparative study resulted in a significant performance boost (mean categorical accuracy of 91.38%) over the independent human observer. Our work was also evaluated against the state-of-the-art convolutional neural network (96.85% and 90.06% for the training and validation subsets, respectively) and showed a remarkable mean categorical accuracy of 95.33% on the testing subset. The proposed network performs better than physicians. This work demonstrates the strength of our proposal to help bring an efficient tool for multi-class recognition of jaundice into clinical practice.
Affiliation(s)
- Zheng Wang
  - School of Mathematics and Statistics, Central South University, Changsha, Hunan, 410083, China
  - Science and Engineering School, Hunan First Normal University, Changsha, 410205, China
- Ying Xiao
  - Gastroenterology Department of Xiangya Hospital, Central South University, Changsha, 410083, China
- Futian Weng
  - School of Mathematics and Statistics, Central South University, Changsha, Hunan, 410083, China
- Xiaojun Li
  - Gastroenterology Department of Xiangya Hospital, Central South University, Changsha, 410083, China
- Danhua Zhu
  - Department of Gastroenterology, Hunan Provincial People's Hospital, Changsha, 410002, China
- Fanggen Lu
  - The Second Xiangya Hospital, Central South University, 410083, Changsha, China
- Xiaowei Liu
  - Gastroenterology Department of Xiangya Hospital, Central South University, Changsha, 410083, China
- Muzhou Hou
  - School of Mathematics and Statistics, Central South University, Changsha, Hunan, 410083, China
- Yu Meng
  - Department of Gastroenterology and Hepatology, Shenzhen University General Hospital, Shenzhen, 518055, China

6
Dutta P, Saha S, Chopra S, Miglani V. Ensembling of Gene Clusters Utilizing Deep Learning and Protein-Protein Interaction Information. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:2005-2016. [PMID: 31135367 DOI: 10.1109/tcbb.2019.2918523] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 06/09/2023]
Abstract
Cluster ensemble techniques aim to combine the outputs of multiple clustering algorithms to obtain a single consensus partitioning. The current paper reports the development of a cluster ensemble based technique combining the concepts of multiobjective optimization and deep-learning models for gene clustering, where additional protein-protein interaction information is utilized for generating the consensus partitioning. The proposed ensemble based framework works in four phases: (i) filtering out the irrelevant genes from the microarray dataset: only the statistically significant genes are considered for further data analysis; (ii) generation of diverse base partitionings: a multi-objective optimization-based clustering technique is proposed which simultaneously optimizes three different cluster quality measures and generates a set of partitioning solutions on the Pareto optimal front; (iii) generation of a consensus partitioning: mentha scores, calculated by accessing a highly enriched protein-protein interaction archive named mentha, of different clustering solutions are considered for generating a weighted incidence matrix; (iv) finally, two approaches are used to generate a consensus partitioning from the obtained incidence matrix. The first approach is based on a traditional machine learning method, and the other exploits a graph partitioning algorithm and two deep neural models to generate the final clustering. To validate the efficacy of the proposed ensemble framework, it is applied to five gene expression datasets. We present a comparative analysis of the proposed technique against different clustering algorithms in terms of biological homogeneity index (BHI) and biological stability index (BSI). The traditional approach attains average improvements of 3 and 2 percent over the best non-dominated solution with respect to BHI and BSI, respectively, whereas the deep learning models show average improvements of 6.8 and 1.5 percent over the proposed traditional approach with respect to BHI and BSI, respectively. Subsequently, Welch's t-test is executed to prove that the results obtained by the proposed methods are statistically significant. Availability of data and materials: https://github.com/sduttap16/DeepEnsm.
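Phase (iii) combines several base partitionings into one weighted matrix. The abstract does not define its weighted incidence matrix, so the sketch below substitutes a related, widely used cluster-ensemble construct: a weighted co-association matrix, where each entry is the weighted share of base partitionings that place two genes in the same cluster. The weights stand in for per-solution quality scores (such as the mentha-derived ones), and every value here is made up.

```python
def weighted_coassociation(partitions, weights):
    """Weighted fraction of base partitionings that co-cluster each pair."""
    n = len(partitions[0])
    total = sum(weights)
    matrix = [[0.0] * n for _ in range(n)]
    for labels, w in zip(partitions, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += w / total
    return matrix


# Three hypothetical base partitionings of four genes (cluster labels).
partitions = [[0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 1, 1, 1]]
weights = [2.0, 1.0, 1.0]  # assumed per-solution quality scores
C = weighted_coassociation(partitions, weights)
for row in C:
    print(row)
```

A consensus partitioning can then be obtained by clustering this matrix, for example with a graph partitioning method as the abstract mentions.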

7
Dutta P, Saha S, Pai S, Kumar A. A Protein Interaction Information-based Generative Model for Enhancing Gene Clustering. Sci Rep 2020; 10:665. [PMID: 31959782 PMCID: PMC6971242 DOI: 10.1038/s41598-020-57437-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/02/2019] [Accepted: 12/20/2019] [Indexed: 11/18/2022]
Abstract
In the field of computational bioinformatics, identifying a set of genes that are responsible for a particular cellular mechanism is essential for tasks such as medical diagnosis or disease gene identification. Accurately grouping (clustering) the genes is one of the important tasks in understanding the functionalities of the disease genes. In this regard, ensemble clustering is a promising approach that combines different clustering solutions to generate a nearly accurate gene partitioning. Recently, researchers have used generative models as a smart ensemble method to produce the right consensus solution. In the current paper, we develop a protein-protein interaction-based generative model that can efficiently perform gene clustering. Utilizing protein interaction information as the generative model's latent variable enhances the generative model's efficiency in inferring final probabilistic labels. The proposed generative model utilizes different weak supervision sources rather than any ground truth information. For weak supervision sources, we use a multi-objective optimization based clustering technique together with the world's largest gene ontology based knowledge-base, named Gene Ontology Consortium (GOC). These weakly supervised labels are supplied to a generative model that eventually assigns probabilistic labels to all genes. The comparative study with respect to silhouette score, Biological Homogeneity Index (BHI) and Biological Stability Index (BSI) proves that the proposed generative model outperforms other state-of-the-art techniques.
Affiliation(s)
- Pratik Dutta
  - Department of Computer Science and Engineering, Indian Institute of Technology Patna, Bihta, 801103, India
- Sriparna Saha
  - Department of Computer Science and Engineering, Indian Institute of Technology Patna, Bihta, 801103, India
- Sanket Pai
  - Department of Chemical Science and Technology, Indian Institute of Technology Patna, Bihta, 801103, India
- Aviral Kumar
  - Department of Chemical Science and Technology, Indian Institute of Technology Patna, Bihta, 801103, India

8
Shi J, Zhang P, Liu L, Min X, Xiao Y. Weighted gene coexpression network analysis identifies a new biomarker of CENPF for prediction disease prognosis and progression in nonmuscle invasive bladder cancer. Mol Genet Genomic Med 2019; 7:e982. [PMID: 31566930 PMCID: PMC6825849 DOI: 10.1002/mgg3.982] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 03/01/2019] [Revised: 07/23/2019] [Accepted: 08/29/2019] [Indexed: 11/08/2022]
Abstract
BACKGROUND: The dreadful prognosis of nonmuscle invasive bladder cancer (NMIBC) mainly results from the delay in recognizing individuals with a high risk of progression. Thus, the emphasis of this work lies in developing valuable biomarkers that are conducive to accurately predicting the progression of NMIBC. METHODS: Microarray data from GSE32894, including 209 NMIBC samples, were analyzed by weighted gene coexpression network analysis (WGCNA), which can find modules of highly correlated genes and relate modules to external sample traits. In addition, we constructed a protein-protein interaction network to facilitate screening for the hub gene. Finally, we used RNA-seq and microarray data and clinical information from ArrayExpress (E-MTAB-4321) and GSE13507 to select and validate the candidate gene. RESULTS: In the current paper, the blue module, one of the 13 gene coexpression clusters we identified, was selected as the key module. Seven genes, namely CDCA8, CENPF, MCM6, MELK, PRC1, STIL, and TPX2, were identified as candidate genes. Notably, among them, only elevated CENPF in NMIBC tissue was closely associated with low progression-free survival (PFS) and overall survival (OS) rates in three datasets and had a large area under the receiver operating characteristic (ROC) curve. Finally, CENPF was identified as an effective biomarker in NMIBC. CONCLUSION: Our findings therefore present a new progressive and prognostic molecular marker and therapeutic target for NMIBC. Moreover, these genes, which deserve further research, may improve the understanding of the occurrence and development of superficial bladder cancer.
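WGCNA finds modules of highly correlated genes by first turning pairwise correlations into a soft-thresholded adjacency. The sketch below shows that first step for an unsigned network, a_ij = |cor(x_i, x_j)|^beta. The gene names, expression values, and the power beta = 6 are illustrative assumptions, not settings reported in the abstract.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def soft_threshold_adjacency(expr, beta=6):
    """Unsigned WGCNA-style adjacency: |correlation| raised to the power beta."""
    genes = list(expr)
    return {(g, h): abs(pearson(expr[g], expr[h])) ** beta
            for g in genes for h in genes if g != h}


# Made-up expression profiles across four samples (placeholder gene names).
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
adj = soft_threshold_adjacency(expr)
for pair, weight in sorted(adj.items()):
    print(pair, round(weight, 6))
```

Raising the correlation to a power suppresses weak correlations while keeping strong ones close to 1, which is what lets tightly co-expressed genes separate into modules in later clustering steps.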
Affiliation(s)
- Jiawei Shi
  - Department of Urology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Pu Zhang
  - Department of Urology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Lilong Liu
  - Department of Urology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Xiaobo Min
  - Department of Hepatology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Yajun Xiao
  - Department of Urology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China