1
Wang H, Zheng H, Chen DZ. TANGO: A GO-Term Embedding Based Method for Protein Semantic Similarity Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:694-706. [PMID: 35030084 DOI: 10.1109/tcbb.2022.3143480] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
We aim to quantitatively predict protein semantic similarities (PSS), which is vital to making biological discoveries. Previously, researchers commonly exploited Gene Ontology (GO) graphs (containing standardized hierarchically-organized GO terms for annotating distinct protein attributes) to learn GO term embeddings (vector representations) for quantifying protein attribute similarities and aggregate these embeddings to form protein embeddings for similarity measurement. However, two key properties of GO terms and annotated proteins are not yet well-explored by these learning-based methods: (1) taxonomy relations between GO terms; (2) GO terms' different contributions in describing protein semantics. In this paper, we propose TANGO, a new framework composed of a TAxoNomy-aware embedding module and an aggreGatiOn module. Our Embedding Module encodes taxonomic information into GO term embeddings by incorporating GO term topological distances in the GO graph hierarchy. Hence, distances between GO term embeddings can be used to more accurately measure shared meanings between correlated protein attributes. Our Aggregation Module automatically determines the contributions of GO terms when merging into the target protein embeddings, by mining GO term concept dependency relations in the GO graph and correlations in protein annotations. We conduct extensive experiments on several public datasets. On two PSS metrics, our new method significantly outperforms known methods by a large margin.
2
Chen Y, Hu Y, Hu X, Feng C, Chen M. CoGO: a contrastive learning framework to predict disease similarity based on gene network and ontology structure. Bioinformatics 2022; 38:4380-4386. [PMID: 35900147 DOI: 10.1093/bioinformatics/btac520] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2022] [Revised: 06/16/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION Quantifying the similarity of human diseases provides guiding insights to the discovery of micro-scope mechanisms from a macro scale. Previous work demonstrated that better performance can be gained by integrating multiview data sources or applying machine learning techniques. However, designing an efficient framework to extract and incorporate information from different biological data using deep learning models remains unexplored. RESULTS We present CoGO, a Contrastive learning framework to predict disease similarity based on Gene network and Ontology structure, which incorporates the gene interaction network and gene ontology (GO) domain knowledge using graph deep learning models. First, graph deep learning models are applied to encode the features of genes and GO terms from separate graph structure data. Next, gene and GO features are projected to a common embedding space via a nonlinear projection. Then cross-view contrastive loss is applied to maximize the agreement of corresponding gene-GO associations and lead to meaningful gene representation. Finally, CoGO infers the similarity between diseases by the cosine similarity of disease representation vectors derived from related gene embedding. In our experiments, CoGO outperforms the most competitive baseline method on both AUROC and AUPRC, especially improves 19.57% in AUPRC (0.7733). The prediction results are significantly comparable with other disease similarity studies and thus highly credible. Furthermore, we conduct a detailed case study of top similar disease pairs which is demonstrated by other studies. Empirical results show that CoGO achieves powerful performance in disease similarity problem. AVAILABILITY AND IMPLEMENTATION https://github.com/yhchen1123/CoGO.
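The cross-view contrastive step mentioned in this abstract can be illustrated with a small sketch. The abstract does not give the loss formula, so the InfoNCE-style form below, the `temperature` value, and all function names are assumptions made for illustration; this is not the CoGO implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_view_contrastive_loss(gene_views, go_views, temperature=0.5):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    Row i of gene_views and row i of go_views are treated as a positive
    pair; every other row of go_views acts as a negative for gene i.
    """
    g = [normalize(v) for v in gene_views]
    o = [normalize(v) for v in go_views]
    total = 0.0
    for i, gi in enumerate(g):
        logits = [dot(gi, oj) / temperature for oj in o]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(g)


# Toy 2-dimensional projections for two genes (hypothetical values).
gene_views = [[1.0, 0.0], [0.0, 1.0]]
aligned_go = [[1.0, 0.0], [0.0, 1.0]]
shuffled_go = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = cross_view_contrastive_loss(gene_views, aligned_go)
shuffled_loss = cross_view_contrastive_loss(gene_views, shuffled_go)
```

In this toy setting the loss is lower when each gene projection agrees with its own GO projection than when the pairs are shuffled, which is the behaviour a contrastive objective is meant to reward.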
Affiliation(s)
- Yuhao Chen
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Yanshi Hu
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Xiaotian Hu
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Cong Feng
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Ming Chen
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
  - Biomedical Big Data Center, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, 310058, China
  - Institute of Hematology, Zhejiang University, Hangzhou, 310058, China
3
Han Y, Klinger K, Rajpal DK, Zhu C, Teeple E. Empowering the discovery of novel target-disease associations via machine learning approaches in the open targets platform. BMC Bioinformatics 2022; 23:232. [PMID: 35710324 PMCID: PMC9202116 DOI: 10.1186/s12859-022-04753-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/07/2022] [Accepted: 05/26/2022] [Indexed: 11/10/2022]
Abstract
Background The Open Targets (OT) Platform integrates a wide range of data sources on target-disease associations to facilitate identification of potential therapeutic drug targets to treat human diseases. However, due to the complexity that targets are usually functionally pleiotropic and efficacious for multiple indications, challenges in identifying novel target to indication associations remain. Specifically, persistent need exists for new methods for integration of novel target-disease association evidence and biological knowledge bases via advanced computational methods. These offer promise for increasing power for identification of the most promising target-disease pairs for therapeutic development. Here we introduce a novel approach by integrating additional target-disease features with machine learning models to further uncover druggable disease to target indications. Results We derived novel target-disease associations as supplemental features to OT platform-based associations using three data sources: (1) target tissue specificity from GTEx expression profiles; (2) target semantic similarities based on gene ontology; and (3) functional interactions among targets by embedding them from protein–protein interaction (PPI) networks. Machine learning models were applied to evaluate feature importance and performance benchmarks for predicting targets with known drug indications. The evaluation results show the newly integrated features demonstrate higher importance than current features in OT. In addition, these also show superior performance over association benchmarks and may support discovery of novel therapeutic indications for highly pursued targets. Conclusion Our newly generated features can be used to represent additional underlying biological relatedness among targets and diseases to further empower improved performance for predicting novel indications for drug targets through advanced machine learning models. 
The proposed methodology enables a powerful new approach for systematic evaluation of drug targets with novel indications. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04753-4.
Affiliation(s)
- Yingnan Han
  - Translational Sciences, Sanofi US, Framingham, MA, 01701, USA
- Deepak K Rajpal
  - Translational Sciences, Sanofi US, Framingham, MA, 01701, USA
- Cheng Zhu
  - Translational Sciences, Sanofi US, Framingham, MA, 01701, USA
- Erin Teeple
  - Translational Sciences, Sanofi US, Framingham, MA, 01701, USA
4
Yuan L, Yang Z, Zhao J, Sun T, Hu C, Shen Z, Yu G. Pan-Cancer Bioinformatics Analysis of Gene UBE2C. Front Genet 2022; 13:893358. [PMID: 35571064 PMCID: PMC9091452 DOI: 10.3389/fgene.2022.893358] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/10/2022] [Accepted: 03/29/2022] [Indexed: 11/30/2022]
Abstract
Ubiquitin-Conjugating Enzyme E2 C (UBE2C) is a protein-coding gene. Disorders associated with UBE2C include methotrexate-related lymphatic hyperplasia and complement component 7 deficiency. The encoded protein is necessary for the destruction of mitotic cell cyclins and cell cycle progression, and may be involved in cancer progression. In this paper, on the basis of public databases, we study the differential expression of UBE2C in various tumors and its relationship to prognosis, clinical features, immunity, methylation, etc.
Affiliation(s)
- Lin Yuan
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Zhenyu Yang
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Jing Zhao
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Tao Sun
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Chunyu Hu
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Zhen Shen
  - School of Computer and Software, Nanyang Institute of Technology, Nanyang, China
- Guanying Yu
  - Department of Gastrointestinal Surgery, Central Hospital Affiliated to Shandong First Medical University, Jinan, China
  - *Correspondence: Guanying Yu,
5
Kamran AB, Naveed H. GOntoSim: a semantic similarity measure based on LCA and common descendants. Sci Rep 2022; 12:3818. [PMID: 35264663 PMCID: PMC8907294 DOI: 10.1038/s41598-022-07624-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/05/2021] [Accepted: 02/14/2022] [Indexed: 11/20/2022]
Abstract
The Gene Ontology (GO) is a controlled vocabulary that captures the semantics or context of an entity based on its functional role. Biomedical entities are frequently compared to each other to find similarities to help in data annotation and knowledge transfer. In this study, we propose GOntoSim, a novel method to determine the functional similarity between genes. GOntoSim quantifies the similarity between pairs of GO terms by taking the graph structure and the information content of nodes into consideration. Our measure quantifies the similarity between the ancestors of the GO terms accurately. It also takes into account the common children of the GO terms. GOntoSim is evaluated using the entire Enzyme Dataset containing 10,890 proteins and 97,544 GO annotations. The enzymes are clustered and compared with the Gold Standard EC numbers. At level 1 of the EC Numbers for Molecular Function, GOntoSim achieves a purity score of 0.75, compared with 0.47 and 0.51 for GOGO and Wang, respectively. GOntoSim can handle the noisy IEA annotations. We achieve a purity score of 0.94 in contrast to 0.48 for both GOGO and Wang at level 1 of the EC Numbers with IEA annotations. GOntoSim can be freely accessed at http://www.cbrlab.org/GOntoSim.html.
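The purity score reported in this abstract is a standard clustering metric: each cluster is credited with its most frequent reference class, and the credited items are divided by the total. The sketch below shows that computation on toy data; the cluster assignments and EC labels are invented for illustration, and the paper's clustering pipeline itself is not reproduced here.

```python
from collections import Counter


def purity(cluster_labels, true_labels):
    """Fraction of items that belong to the majority class of their cluster."""
    if len(cluster_labels) != len(true_labels):
        raise ValueError("both label lists must have the same length")
    if not cluster_labels:
        raise ValueError("at least one item is required")
    # Count reference classes inside each cluster.
    counts_by_cluster = {}
    for cluster, truth in zip(cluster_labels, true_labels):
        counts_by_cluster.setdefault(cluster, Counter())[truth] += 1
    # Credit each cluster with the size of its largest class.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(true_labels)


# Hypothetical example: six enzymes, two clusters, top-level EC classes.
clusters = [0, 0, 0, 1, 1, 1]
ec_level1 = ["EC1", "EC1", "EC2", "EC2", "EC2", "EC3"]
score = purity(clusters, ec_level1)
```

Purity rises trivially as the number of clusters grows (one item per cluster gives 1.0), so it is most informative when the compared methods use the same number of clusters.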
Affiliation(s)
- Amna Binte Kamran
  - Computational Biology Research Lab, Department of Computer Science, National University of Computer & Emerging Sciences (NUCES-FAST), Islamabad, 44800, Pakistan
- Hammad Naveed
  - Computational Biology Research Lab, Department of Computer Science, National University of Computer & Emerging Sciences (NUCES-FAST), Islamabad, 44800, Pakistan
6
Mallick K, Mallik S, Bandyopadhyay S, Chakraborty S. A Novel Graph Topology-Based GO-Similarity Measure for Signature Detection From Multi-Omics Data and its Application to Other Problems. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:773-785. [PMID: 32866101 DOI: 10.1109/tcbb.2020.3020537] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/11/2023]
Abstract
Large scale multi-omics data analysis and signature prediction have been a topic of interest in the last two decades. Although various traditional clustering/correlation-based methods have been proposed, the overall prediction is not always satisfactory. To solve these challenges, in this article, we propose a new approach by leveraging the Gene Ontology (GO) similarity combined with multi-omics data. A new GO similarity measure, ModSchlicker, is proposed, and the effectiveness of the proposed measure along with other standardized measures is reviewed while using various graph topology-based Information Content (IC) values of GO terms. The proposed measure is deployed to PPI prediction. Furthermore, by involving GO similarity, we propose a new framework for stronger disease-based gene signature detection from the multi-omics data. For the first objective, we predict interactions from various benchmark PPI datasets of Yeast and Human species. For the latter, the gene expression and methylation profiles are used to identify Differentially Expressed and Methylated (DEM) genes. Thereafter, the GO similarity score along with a statistical method is used to determine the potential gene signature. Interestingly, the proposed method produces a better performance (0.9 avg. accuracy and 0.95 AUC) than the other existing related methods during the classification of the participating features (genes) of the signature. Moreover, the proposed method is highly useful in other prediction/classification problems for any kind of large scale omics data.
7
Harikumar H, Quinn TP, Rana S, Gupta S, Venkatesh S. Personalized single-cell networks: a framework to predict the response of any gene to any drug for any patient. BioData Min 2021; 14:37. [PMID: 34353329 PMCID: PMC8340371 DOI: 10.1186/s13040-021-00263-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2021] [Accepted: 05/10/2021] [Indexed: 11/15/2022]
Abstract
BACKGROUND The last decade has seen a major increase in the availability of genomic data. This includes expert-curated databases that describe the biological activity of genes, as well as high-throughput assays that measure gene expression in bulk tissue and single cells. Integrating these heterogeneous data sources can generate new hypotheses about biological systems. Our primary objective is to combine population-level drug-response data with patient-level single-cell expression data to predict how any gene will respond to any drug for any patient. METHODS We take 2 approaches to benchmarking a "dual-channel" random walk with restart (RWR) for data integration. First, we evaluate how well RWR can predict known gene functions from single-cell gene co-expression networks. Second, we evaluate how well RWR can predict known drug responses from individual cell networks. We then present two exploratory applications. In the first application, we combine the Gene Ontology database with glioblastoma single cells from 5 individual patients to identify genes whose functions differ between cancers. In the second application, we combine the LINCS drug-response database with the same glioblastoma data to identify genes that may exhibit patient-specific drug responses. CONCLUSIONS Our manuscript introduces two innovations to the integration of heterogeneous biological data. First, we use a "dual-channel" method to predict up-regulation and down-regulation separately. Second, we use individualized single-cell gene co-expression networks to make personalized predictions. These innovations let us predict gene function and drug response for individual patients. Taken together, our work shows promise that single-cell co-expression data could be combined in heterogeneous information networks to facilitate precision medicine.
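For readers unfamiliar with random walk with restart (RWR), the sketch below shows the basic single-channel iteration on a toy undirected network. It is only an illustration of plain RWR: the "dual-channel" variant described in this abstract is not reproduced, and the function name, the `restart` value of 0.3, and the toy network are assumptions.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-12, max_iter=10_000):
    """Return the steady-state visiting probability of every node.

    adjacency: dict mapping each node to a list of its neighbours
               (every neighbour must also be a key).
    seeds:     non-empty collection of start nodes; at each step the walker
               jumps back to a uniformly chosen seed with probability `restart`.
    """
    seeds = set(seeds)
    if not seeds or not seeds <= set(adjacency):
        raise ValueError("seeds must be a non-empty subset of the network nodes")
    nodes = list(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart mass first, then add the mass that keeps walking.
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            neighbours = adjacency[u]
            if neighbours:
                share = (1.0 - restart) * p[u] / len(neighbours)
                for v in neighbours:
                    nxt[v] += share
            else:
                # Dangling node: send its walking mass back to the seeds.
                for s in seeds:
                    nxt[s] += (1.0 - restart) * p[u] * p0[s]
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Toy path network A - B - C with the walk seeded at A (hypothetical data).
toy_network = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
scores = random_walk_with_restart(toy_network, seeds=["A"])
```

The resulting scores sum to 1 and decay with distance from the seed, which is why RWR scores are commonly used to rank nodes by relevance to a query set.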
Affiliation(s)
- Haripriya Harikumar
  - Applied Artificial Intelligence Institute, Deakin University, Geelong, Australia
  - Institute for Health Transformation, Deakin University, Geelong, Australia
- Thomas P Quinn
  - Applied Artificial Intelligence Institute, Deakin University, Geelong, Australia
- Santu Rana
  - Applied Artificial Intelligence Institute, Deakin University, Geelong, Australia
- Sunil Gupta
  - Applied Artificial Intelligence Institute, Deakin University, Geelong, Australia
- Svetha Venkatesh
  - Applied Artificial Intelligence Institute, Deakin University, Geelong, Australia
8
Kulmanov M, Smaili FZ, Gao X, Hoehndorf R. Semantic similarity and machine learning with ontologies. Brief Bioinform 2021; 22:bbaa199. [PMID: 33049044 PMCID: PMC8293838 DOI: 10.1093/bib/bbaa199] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 05/07/2020] [Revised: 08/03/2020] [Accepted: 08/04/2020] [Indexed: 12/13/2022]
Abstract
Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge and they are employed in almost every major biological database. Recently, ontologies are increasingly being used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview over the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in ontologies and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.
Affiliation(s)
- Xin Gao
  - Computational Bioscience Research Center and lead of the Structural and Functional Bioinformatics Group at King Abdullah University of Science and Technology
9
Li Y, Wang K, Wang G. Evaluating Disease Similarity Based on Gene Network Reconstruction and Representation. Bioinformatics 2021; 37:3579-3587. [PMID: 33978702 DOI: 10.1093/bioinformatics/btab252] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/22/2020] [Revised: 03/01/2021] [Accepted: 04/28/2021] [Indexed: 11/13/2022]
Abstract
MOTIVATION Quantifying the associations between diseases is of great significance in increasing our understanding of disease biology, improving disease diagnosis, re-positioning, and developing drugs. Therefore, in recent years, the research of disease similarity has received a lot of attention in the field of bioinformatics. Previous work has shown that the combination of the ontology (such as disease ontology and gene ontology) and disease-gene interactions are worthy to be regarded to elucidate diseases and disease associations. However, most of them are either based on the overlap between disease-related gene sets or distance within the ontology's hierarchy. The diseases in these methods are represented by discrete or sparse feature vectors, which cannot grasp the deep semantic information of diseases. Recently, deep representation learning has been widely studied and gradually applied to various fields of bioinformatics. Based on the hypothesis that disease representation depends on its related gene representations, we propose a disease representation model using two most representative gene resources HumanNet and Gene Ontology to construct a new gene network and learn gene (disease) representations. The similarity between two diseases is computed by the cosine similarity of their corresponding representations. RESULTS We propose a novel approach to compute disease similarity, which integrates two important factors disease-related genes and gene ontology hierarchy to learn disease representation based on deep representation learning. Under the same experimental settings, the AUC value of our method is 0.8074, which improves the most competitive baseline method by 10.1%. The quantitative and qualitative experimental results show that our model can learn effective disease representations and improve the accuracy of disease similarity computation significantly. 
AVAILABILITY The research shows that this method has certain applicability in the prediction of gene-related diseases, the migration of disease treatment methods, drug development, and so on. SUPPLEMENTARY INFORMATION Supplementary data are available at https://github.com/catly/disease_similarity.
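The final step described in this abstract, comparing two diseases by the cosine similarity of their representations, can be sketched as follows. The abstract only states that a disease representation depends on its related gene representations; averaging the gene vectors is an assumption made here, and the gene names and embedding values are invented.

```python
import math


def mean_vector(vectors):
    # Element-wise average of equally sized vectors.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def disease_similarity(genes_a, genes_b, gene_embedding):
    """Cosine similarity of two diseases, each represented (by assumption)
    as the mean embedding of its associated genes."""
    vec_a = mean_vector([gene_embedding[g] for g in genes_a])
    vec_b = mean_vector([gene_embedding[g] for g in genes_b])
    return cosine_similarity(vec_a, vec_b)


# Hypothetical 2-dimensional gene embeddings.
gene_embedding = {"G1": [1.0, 0.0], "G2": [0.0, 1.0], "G3": [1.0, 1.0]}
sim_ab = disease_similarity(["G1", "G2"], ["G3"], gene_embedding)
sim_ac = disease_similarity(["G1", "G2"], ["G1"], gene_embedding)
```

Because cosine similarity ignores vector length, a disease with many associated genes and one with few can still score as highly similar when their averaged embeddings point in the same direction.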
Affiliation(s)
- Yang Li
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150004, China
- Keqi Wang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150004, China
- Guohua Wang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150004, China
10
Liu W, Sun X, Peng L, Zhou L, Lin H, Jiang Y. RWRNET: A Gene Regulatory Network Inference Algorithm Using Random Walk With Restart. Front Genet 2020; 11:591461. [PMID: 33101398 PMCID: PMC7545090 DOI: 10.3389/fgene.2020.591461] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/04/2020] [Accepted: 09/02/2020] [Indexed: 11/30/2022]
Abstract
Inferring gene regulatory networks from expression data is essential in identifying complex regulatory relationships among genes and revealing the mechanism of certain diseases. Various computation methods have been developed for inferring gene regulatory networks. However, these methods focus on the local topology of the network rather than on the global topology. From network optimisation standpoint, emphasising the global topology of the network also reduces redundant regulatory relationships. In this study, we propose a novel network inference algorithm using Random Walk with Restart (RWRNET) that combines local and global topology relationships. The method first captures the local topology through three elements of random walk and then combines the local topology with the global topology by Random Walk with Restart. The Markov Blanket discovery algorithm is then used to deal with isolated genes. The proposed method is compared with several state-of-the-art methods on the basis of six benchmark datasets. Experimental results demonstrated the effectiveness of the proposed method.
Affiliation(s)
- Wei Liu
  - School of Computer Science, Xiangtan University, Xiangtan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, China
- Xingen Sun
  - School of Computer Science, Xiangtan University, Xiangtan, China
- Li Peng
  - School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, China
- Lili Zhou
  - School of Computer Science, Xiangtan University, Xiangtan, China
- Hui Lin
  - School of Computer Science, Xiangtan University, Xiangtan, China
- Yi Jiang
  - School of Computer Science, Xiangtan University, Xiangtan, China
11
Peng J, Guan J, Hui W, Shang X. A novel subnetwork representation learning method for uncovering disease-disease relationships. Methods 2020; 192:77-84. [PMID: 32946974 DOI: 10.1016/j.ymeth.2020.09.002] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 06/17/2020] [Revised: 08/20/2020] [Accepted: 09/07/2020] [Indexed: 12/12/2022]
Abstract
Analyzing disease-disease relationships plays an important role in understanding disease mechanisms and finding alternative uses for a drug. A disease is usually the result of an abnormal state of multiple molecular processes. Since biological networks can model the interplay of multiple molecular processes, network-based methods have recently been proposed to uncover disease-disease relationships. Given a disease and a network, the disease can be represented as a subnetwork constructed from the disease genes involved in the given network, named the disease subnetwork. Because it is difficult to learn the feature representation of disease subnetworks, most existing methods are unsupervised and do not use labeled information. To fill this gap, we propose a novel method named SubNet2vec to learn the feature vectors of diseases from their corresponding subnetworks in the biological network. By utilizing the feature representation of disease subnetworks, we can analyze disease-disease relationships in a supervised fashion. The evaluation results show that the proposed framework outperforms some state-of-the-art approaches by a large margin on disease-disease/disease-drug association prediction. The source code and data are available at https://github.com/MedicineBiology-AI/SubNet2vec.git.
Affiliation(s)
- Jiajie Peng
  - School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China
- Jiaojiao Guan
  - School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China
- Weiwei Hui
  - Vivo mobile communications (Hang Zhou) co. LTD, China
- Xuequn Shang
  - School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China
12
Lu K, Yang K, Niyongabo E, Shu Z, Wang J, Chang K, Zou Q, Jiang J, Jia C, Liu B, Zhou X. Integrated network analysis of symptom clusters across disease conditions. J Biomed Inform 2020; 107:103482. [PMID: 32535270 DOI: 10.1016/j.jbi.2020.103482] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/24/2019] [Revised: 05/18/2020] [Accepted: 06/08/2020] [Indexed: 10/24/2022]
Abstract
Identifying symptom clusters (two or more related symptoms) with shared underlying molecular mechanisms has been a vital analysis task for promoting symptom science and precision health. Related studies have applied clustering algorithms (e.g. k-means, latent class model) to detect symptom clusters, mostly from various kinds of clinical data. In addition, they focused on identifying the symptom clusters (SCs) for a specific disease and were mainly concerned with the clinical regularities for symptom management. Here, we utilized a network-based clustering algorithm (i.e., BigCLAM) to obtain 208 typical SCs across disease conditions on a large-scale symptom network derived from integrated high-quality disease-symptom associations. Furthermore, we evaluated the underlying shared molecular mechanisms for SCs, i.e., shared genes, protein-protein interaction (PPI) and gene functional annotations, using integrated networks and similarity measures. We found that the symptoms in the same SCs tend to share a higher degree of genes and PPIs and have higher functional homogeneities. In addition, we found that most SCs have related symptoms with shared underlying molecular mechanisms (e.g. enriched pathways) across different disease conditions. Our work demonstrated that the integrated network analysis method could be used for identifying robust SCs and investigating the molecular mechanisms of these SCs, which would be valuable for symptom science and precision health.
Affiliation(s)
- Kezhi Lu
  - Institute of Medical Intelligence, School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
- Kuo Yang
  - Institute of Medical Intelligence, School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
- Edouard Niyongabo
  - Institute of Medical Intelligence, School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
- Zixin Shu
  - Institute of Medical Intelligence, School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
- Jingjing Wang
  - Institute of Medical Intelligence, School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
- Kai Chang
  - Institute of Medical Intelligence, School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
- Qunsheng Zou
  - Institute of Medical Intelligence, School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
- Jiyue Jiang
  - Institute of Medical Intelligence, School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
- Caiyan Jia
  - Institute of Medical Intelligence, School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
- Baoyan Liu
  - Data Center of Traditional Chinese Medicine, China Academy of Chinese Medical Sciences, Beijing 100700, China
- Xuezhong Zhou
  - Institute of Medical Intelligence, School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  - Data Center of Traditional Chinese Medicine, China Academy of Chinese Medical Sciences, Beijing 100700, China
13
Peng J, Zhu L, Wang Y, Chen J. Mining Relationships among Multiple Entities in Biological Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:769-776. [PMID: 30872239 DOI: 10.1109/tcbb.2019.2904965] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 06/09/2023]
Abstract
Identifying topological relationships among multiple entities in biological networks is critical to understanding the organizational principles of network functionality. Theoretically, this problem can be solved using minimum Steiner tree (MSTT) algorithms. However, due to large network size, it remains computationally challenging, and the predictive value of multi-entity topological relationships is still unclear. We present a novel solution called Cluster-based Steiner Tree Miner (CST-Miner) to instantly identify multi-entity topological relationships in biological networks. Given a list of user-specific entities, CST-Miner decomposes a biological network into nested cluster-based subgraphs, on which multiple minimum Steiner trees are identified. By merging all of them into a minimum cost tree, the optimal topological relationships among all the user-specific entities are revealed. Experimental results showed that CST-Miner can finish in nearly log-linear time and the tree constructed by CST-Miner is close to the global minimum.
|
14
|
Zhao Y, Wang J, Chen J, Zhang X, Guo M, Yu G. A Literature Review of Gene Function Prediction by Modeling Gene Ontology. Front Genet 2020; 11:400. [PMID: 32391061 PMCID: PMC7193026 DOI: 10.3389/fgene.2020.00400] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2020] [Accepted: 03/30/2020] [Indexed: 12/14/2022] Open
Abstract
Annotating the functional properties of gene products, i.e., RNAs and proteins, is a fundamental task in biology. The Gene Ontology database (GO) was developed to systematically describe the functional properties of gene products across species, and to facilitate the computational prediction of gene function. As GO is routinely updated, it serves as the gold standard and main knowledge source in functional genomics. Many gene function prediction methods making use of GO have been proposed. But no literature review has summarized these methods and the possibilities for future efforts from the perspective of GO. To bridge this gap, we review the existing methods with an emphasis on recent solutions. First, we introduce the conventions of GO and the widely adopted evaluation metrics for gene function prediction. Next, we summarize current methods of gene function prediction that apply GO in different ways, such as using hierarchical or flat inter-relationships between GO terms, compressing massive GO terms and quantifying semantic similarities. Although many efforts have improved performance by harnessing GO, we conclude that there remain many largely overlooked but important topics for future research.
Affiliation(s)
- Yingwen Zhao
- College of Computer and Information Science, Southwest University, Chongqing, China
| | - Jun Wang
- College of Computer and Information Science, Southwest University, Chongqing, China
| | - Jian Chen
- State Key Laboratory of Agrobiotechnology and National Maize Improvement Center, China Agricultural University, Beijing, China
| | - Xiangliang Zhang
- CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
| | - Guoxian Yu
- College of Computer and Information Science, Southwest University, Chongqing, China
- CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| |
|
15
|
Liu H, Guan J, Li H, Bao Z, Wang Q, Luo X, Xue H. Predicting the Disease Genes of Multiple Sclerosis Based on Network Representation Learning. Front Genet 2020; 11:328. [PMID: 32373160 PMCID: PMC7186413 DOI: 10.3389/fgene.2020.00328] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/23/2020] [Accepted: 03/19/2020] [Indexed: 02/02/2023] Open
Abstract
Multiple sclerosis (MS) is an autoimmune disease for which it is difficult to find exact disease-related genes. Effectively identifying disease-related genes would contribute to improving the treatment and diagnosis of multiple sclerosis. Current methods for identifying disease-related genes mainly focus on the hypothesis of guilt-by-association and pay little attention to the global topological information of the whole protein-protein interaction (PPI) network. Meanwhile, network representation learning (NRL) has attracted a great deal of attention in network analysis because of its promising performance in node representation and many downstream tasks. In this paper, we introduce NRL into the task of disease-related gene prediction and propose a novel framework for identifying the disease-related genes of multiple sclerosis. The proposed framework contains three main steps: capturing the topological structure of the PPI network using NRL-based methods, encoding learned features into low-dimensional space using a stacked autoencoder, and training a support vector machine (SVM) classifier to predict disease-related genes. Compared with three state-of-the-art algorithms, our proposed framework shows superior performance on the task of predicting disease-related genes of multiple sclerosis.
Affiliation(s)
- Haijie Liu
- Department of Neurology, Xuanwu Hospital, Capital Medical University, Beijing, China
- Department of Physical Medicine and Rehabilitation, Tianjin Medical University General Hospital, Tianjin, China
- Stroke Biological Recovery Laboratory, Department of Physical Medicine and Rehabilitation, Spaulding Rehabilitation Hospital, The Teaching Affiliate of Harvard Medical School Charlestown, Boston, MA, United States
| | - Jiaojiao Guan
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
| | - He Li
- Department of Automation, College of Information Science and Engineering, Tianjin Tianshi College, Tianjin, China
| | - Zhijie Bao
- School of Textile Science and Engineering, Tiangong University, Tianjin, China
| | - Qingmei Wang
- Stroke Biological Recovery Laboratory, Department of Physical Medicine and Rehabilitation, Spaulding Rehabilitation Hospital, The Teaching Affiliate of Harvard Medical School Charlestown, Boston, MA, United States
| | - Xun Luo
- Kerry Rehabilitation Medicine Research Institute, Shenzhen, China
- Shenzhen Dapeng New District Nan'ao People's Hospital, Shenzhen, China
| | - Hansheng Xue
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
| |
|
16
|
Wang H, Wang J, Dong C, Lian Y, Liu D, Yan Z. A Novel Approach for Drug-Target Interactions Prediction Based on Multimodal Deep Autoencoder. Front Pharmacol 2020; 10:1592. [PMID: 32047432 PMCID: PMC6997437 DOI: 10.3389/fphar.2019.01592] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2019] [Accepted: 12/09/2019] [Indexed: 01/09/2023] Open
Abstract
Drug targets are biomacromolecules or biomolecular structures that bind to specific drugs and produce therapeutic effects. Therefore, the prediction of drug-target interactions (DTIs) is important for disease therapy. Incorporating multiple similarity measures for drugs and targets is essential for improving the accuracy of DTI prediction. However, existing studies with multiple similarity measures ignored the global structure information of the similarity measures and required manual extraction of features of drug-target pairs, ignoring the non-linear relationships among features. In this paper, we propose MDADTI, a novel approach for DTI prediction based on a multimodal deep autoencoder (MDA). MDADTI applies the random walk with restart method and positive pointwise mutual information to calculate the topological similarity matrices of drugs and targets, capturing the global structure information of the similarity measures. Then, MDADTI applies the multimodal deep autoencoder to fuse multiple topological similarity matrices of drugs and targets, automatically learns low-dimensional features of drugs and targets, and applies a deep neural network to predict DTIs. The results of 5 repeats of 10-fold cross-validation under three different cross-validation settings indicate that MDADTI is superior to the other four baseline methods. In addition, we validated the predictions of MDADTI in six drug-target interaction reference databases, and the results show that MDADTI can effectively identify unknown DTIs.
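As an illustration of the positive pointwise mutual information step named in this abstract, the following is a minimal generic sketch, not the MDADTI implementation. It assumes the input is a square, non-negative matrix (for example, co-occurrence or diffusion values produced by a random walk); the function name `ppmi` and the toy matrices are hypothetical.

```python
import math

def ppmi(counts):
    """Positive pointwise mutual information of a non-negative square matrix.

    PMI(i, j) = log( p(i, j) / (p(i) * p(j)) ); negative values are clipped to 0.
    """
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, value in enumerate(row):
            if value == 0:
                # log(0) is undefined; PPMI treats the entry as 0.
                out_row.append(0.0)
                continue
            pmi = math.log(value * total / (row_sums[i] * col_sums[j]))
            out_row.append(max(0.0, pmi))
        result.append(out_row)
    return result

# Toy example (hypothetical values): a block-diagonal matrix.
matrix = ppmi([[2, 0], [0, 2]])
```

Entries that co-occur more often than expected under independence receive a positive score, while all other entries become zero, which keeps the resulting matrix sparse and non-negative.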
Affiliation(s)
- Huiqing Wang
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, China
| | - Jingjing Wang
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, China
| | - Chunlin Dong
- Dryland Agriculture Research Center, Shanxi Academy of Agricultural Sciences, Taiyuan, China
| | - Yuanyuan Lian
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, China
| | - Dan Liu
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, China
| | - Zhiliang Yan
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, China
| |
|
17
|
Peng J, Lu G, Xue H, Wang T, Shang X. TS-GOEA: a web tool for tissue-specific gene set enrichment analysis based on gene ontology. BMC Bioinformatics 2019; 20:572. [PMID: 31760951 PMCID: PMC6876092 DOI: 10.1186/s12859-019-3125-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
Abstract
BACKGROUND The Gene Ontology (GO) knowledgebase is the world's largest source of information on the functions of genes. Since the beginning of the GO project, various tools have been developed to perform GO enrichment analysis experiments. GO enrichment analysis has become a commonly used method of gene function analysis. Existing GO enrichment analysis tools do not consider tissue-specific information, although this information is very important to current research. RESULTS In this paper, we built an easy-to-use web tool called TS-GOEA that allows users to easily perform experiments based on tissue-specific GO enrichment analysis. TS-GOEA uses a strict-threshold statistical method for GO enrichment analysis and provides statistical tests to improve the reliability of the analysis results. TS-GOEA also provides tools for comparing different experimental results. To evaluate its performance, we tested the genes associated with platelet disease with TS-GOEA. CONCLUSIONS TS-GOEA is an effective GO analysis tool with unique features. The experimental results show that our method has better performance and provides a useful supplement for the existing GO enrichment analysis tools. TS-GOEA is available at http://120.77.47.2:5678.
Affiliation(s)
- Jiajie Peng
- School of Computer Science, Northwestern Polytechnical University, Xi’an, 710129 China
| | - Guilin Lu
- School of Computer Science, Northwestern Polytechnical University, Xi’an, 710129 China
| | - Hansheng Xue
- School of Computer Science, Northwestern Polytechnical University, Xi’an, 710129 China
| | - Tao Wang
- School of Computer Science, Harbin Institute of Technology, Harbin, 150001 China
| | - Xuequn Shang
- School of Computer Science, Northwestern Polytechnical University, Xi’an, 710129 China
| |
|
18
|
Wang Y, Juan L, Peng J, Zang T, Wang Y. Prioritizing candidate diseases-related metabolites based on literature and functional similarity. BMC Bioinformatics 2019; 20:574. [PMID: 31760947 PMCID: PMC6876110 DOI: 10.1186/s12859-019-3127-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022] Open
Abstract
Background As the terminal products of cellular regulatory processes, functionally related metabolites have a close relationship with complex diseases and are often associated with the same or similar diseases. Therefore, identifying disease-related metabolites plays a critical role in comprehensively understanding the pathogenesis of disease, with the aim of improving clinical medicine. Considering that a large number of metabolic markers of diseases remain to be explored, we propose a computational model to identify potential disease-related metabolites based on functional relationships and literature-derived scores between metabolites. First, after obtaining associations between metabolites and diseases from the Human Metabolome database, we calculate the similarities of metabolites using a modified collaborative-filtering recommendation strategy that utilizes the similarities between diseases. Next, a disease-associated metabolite network (DMN) is built with the similarities between metabolites as edge weights. To improve the ability to identify disease-related metabolites, we introduce text-mining scores from an existing database of chemicals and proteins into the DMN and build a new disease-associated metabolite network (FLDMN) by fusing functional associations and literature scores. Finally, we use random walk with restart (RWR) in this network to predict candidate metabolites related to diseases. Results We construct the disease-associated metabolite network and its improved network (FLDMN) with 245 diseases, 587 metabolites and 28,715 disease-metabolite associations. Subsequently, we extract training sets and testing sets from two different versions of the Human Metabolome database and assess the performance of DMN and FLDMN on 19 diseases, respectively. As a result, the average AUC (area under the receiver operating characteristic curve) of DMN is 64.35%. As a further improved network, FLDMN is proven to be successful in predicting potential metabolic signatures for 19 diseases with an average AUC value of 76.03%. Conclusion In this paper, a computational model is proposed for exploring metabolite-disease pairs, and adequate validation shows that it performs well in predicting potential metabolites related to diseases. This result suggests that integrating literature and functional associations can be an effective way to construct a disease-associated metabolite network for prioritizing candidate disease-related metabolites.
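The random walk with restart step mentioned in this abstract can be sketched generically as follows. This is a minimal illustration under stated assumptions, not the authors' code: the network is given as a symmetric weight matrix, the restart probability of 0.5 is an arbitrary choice, and the function name and the three-node toy network are hypothetical.

```python
def random_walk_with_restart(weights, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W * p + r * p0 until the L1 change drops below tol.

    weights: square matrix of non-negative edge weights.
    seeds:   indices of the nodes the walker restarts from.
    """
    n = len(weights)
    # Column-normalize so that each column holds a node's transition probabilities.
    col_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    trans = [
        [weights[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # The initial distribution puts equal mass on every seed node.
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(trans[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

# Toy path network 0 - 1 - 2 with node 0 as the only seed (hypothetical data).
scores = random_walk_with_restart([[0, 1, 0], [1, 0, 1], [0, 1, 0]], seeds=[0])
```

Nodes are then ranked by their steady-state score, so candidates that are close to the seed nodes in the weighted network rise to the top of the list.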
Affiliation(s)
- Yongtian Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
| | - Liran Juan
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
| | - Jiajie Peng
- School of Computer Science, Northwestern Polytechnical University, Xi'an, People's Republic of China
| | - Tianyi Zang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China.
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China.
| |
|
19
|
Zhan Q, Wang N, Jin S, Tan R, Jiang Q, Wang Y. ProbPFP: a multiple sequence alignment algorithm combining hidden Markov model optimized by particle swarm optimization with partition function. BMC Bioinformatics 2019; 20:573. [PMID: 31760933 PMCID: PMC6876095 DOI: 10.1186/s12859-019-3132-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In multiple sequence alignment (MSA), it is essential to use the substitution scores of pairwise alignments. To compute adaptive scores for alignment, researchers usually use a hidden Markov model (HMM) or probabilistic consistency methods such as the partition function. Recent studies show that optimizing the parameters of the hidden Markov model, as well as integrating the hidden Markov model with the partition function, can raise the accuracy of alignment. However, the combination of the partition function and an optimized HMM, which could further improve alignment accuracy, was ignored by these studies. RESULTS A novel algorithm for MSA called ProbPFP is presented in this paper. It integrates an HMM optimized by particle swarm optimization (PSO) with the partition function. PSO is applied to optimize the HMM's parameters. The posterior probability obtained by the HMM is then combined with the one obtained by the partition function to calculate an integrated substitution score for alignment. To evaluate the effectiveness of ProbPFP, we compared it with 13 outstanding or classic MSA methods. The results demonstrate that the alignments obtained by ProbPFP achieved the highest mean TC scores and mean SP scores on two benchmark datasets, SABmark and OXBench, and the second highest mean TC scores and mean SP scores on the benchmark dataset BAliBASE. ProbPFP was also compared with 4 other outstanding methods by reconstructing the phylogenetic trees for six protein families extracted from the database TreeFam, based on the alignments obtained by these 5 methods. The result indicates that the reference trees are closer to the phylogenetic trees reconstructed from the alignments obtained by ProbPFP than to those from the other methods. CONCLUSIONS We propose a new multiple sequence alignment method combining an optimized HMM and the partition function. The results show that this method can substantially improve alignment accuracy.
Affiliation(s)
- Qing Zhan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
| | - Nan Wang
- Department of Mathematics, Harbin Institute of Technology, Harbin, 150001, China
| | - Shuilin Jin
- Department of Mathematics, Harbin Institute of Technology, Harbin, 150001, China
| | - Renjie Tan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
| | - Qinghua Jiang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China.
| |
|
20
|
Wang Y, Nie C, Zang T, Wang Y. Predicting circRNA-Disease Associations Based on circRNA Expression Similarity and Functional Similarity. Front Genet 2019; 10:832. [PMID: 31572444 PMCID: PMC6751509 DOI: 10.3389/fgene.2019.00832] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2019] [Accepted: 08/13/2019] [Indexed: 12/19/2022] Open
Abstract
Circular RNAs (circRNAs) are a novel class of endogenous noncoding RNAs that have well-conserved sequences. Emerging evidence has shown that circRNAs can be novel biomarkers or therapeutic targets for many diseases and play an important role in the development of various pathological conditions. Therefore, identifying potential disease-related circRNAs is helpful in improving the efficiency of finding therapeutic targets for diseases. Here, we propose a computational model (PreCDA) to predict potential circRNA-disease associations. First, we calculated the circRNA expression similarity based on circRNA expression profiles. The circRNA functional similarity is calculated based on cosine similarity, and the disease similarity is used as the dimension of each circRNA vector. The associations between circRNAs and diseases are defined based on the circRNA functional similarity and expression similarity. We constructed a disease-related circRNA association network and used a graph-based recommendation algorithm (PersonalRank) to sort candidate disease-related circRNAs. As a result, PreCDA has an average area under the receiver operating characteristic curve value of 78.15% in predicting candidate disease-related circRNAs. In addition, we discuss the factors that affect the performance of this method and find some unknown circRNAs related to diseases, with several common diseases used as case studies. These results show that PreCDA has good performance in predicting potential circRNA-disease associations and is helpful for the diagnosis and treatment of human diseases.
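The cosine-based functional similarity described in this abstract can be illustrated with a short generic sketch. It assumes each circRNA is represented by a vector whose entries are disease-similarity values, as the abstract states; the function name and the toy vectors are hypothetical, and this is not the PreCDA code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A vector with no disease associations has no defined direction.
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical circRNA vectors over three diseases.
circ_a = [0.9, 0.1, 0.0]
circ_b = [0.8, 0.2, 0.0]
similarity = cosine_similarity(circ_a, circ_b)
```

Because cosine similarity depends only on direction, two circRNAs with proportional disease profiles score close to 1 even if one has uniformly weaker associations.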
Affiliation(s)
| | | | - Tianyi Zang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| |
|
21
|
Peng J, Wang X, Shang X. Combining gene ontology with deep neural networks to enhance the clustering of single cell RNA-Seq data. BMC Bioinformatics 2019; 20:284. [PMID: 31182005 PMCID: PMC6557741 DOI: 10.1186/s12859-019-2769-6] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022] Open
Abstract
Background Single cell RNA sequencing (scRNA-seq) is applied to assay the individual transcriptomes of large numbers of cells. Gene expression at the single-cell level provides an opportunity for a better understanding of cell function and for new discoveries in biomedical areas. To ensure that single-cell based gene expression data are interpreted appropriately, it is crucial to develop new computational methods. Results In this article, we try to re-construct a neural network based on Gene Ontology (GO) for dimension reduction of scRNA-seq data. By integrating GO with both unsupervised and supervised models, two novel methods are proposed, named GOAE (Gene Ontology AutoEncoder) and GONN (Gene Ontology Neural Network) respectively. Conclusions The evaluation results show that the proposed models outperform some state-of-the-art dimensionality reduction approaches. Furthermore, by incorporating GO, we provide an opportunity to interpret the underlying biological mechanism behind the neural network-based model.
Affiliation(s)
- Jiajie Peng
- School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China; Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi'an, 710072, China; Centre for Multidisciplinary Convergence Computing, School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China
| | - Xiaoyu Wang
- School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China
| | - Xuequn Shang
- School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China; Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi'an, 710072, China.
| |
|
22
|
Xue H, Peng J, Shang X. Predicting disease-related phenotypes using an integrated phenotype similarity measurement based on HPO. BMC SYSTEMS BIOLOGY 2019; 13:34. [PMID: 30953559 PMCID: PMC6449884 DOI: 10.1186/s12918-019-0697-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
Background Improving the efficiency of disease diagnosis based on phenotype ontology is a critical yet challenging research area. Recently, Human Phenotype Ontology (HPO)-based semantic similarity has been effectively and widely used to identify causative genes and diseases. However, current phenotype similarity measurements consider only the annotations and hierarchy structure of HPO, neglecting the definitions of phenotype terms. Results In this paper, we propose a novel phenotype similarity measurement, termed DisPheno, which incorporates the definitions of phenotype terms in addition to HPO structure and annotations to measure the similarity between phenotype terms. DisPheno also integrates phenotype term associations into phenotype-set similarity measurement using gene and disease annotations of phenotype terms. Conclusions Compared with five existing state-of-the-art methods, DisPheno shows great performance in HPO-based phenotype semantic similarity measurement and improves the efficiency of disease identification, especially on noisy patient datasets.
Affiliation(s)
- Hansheng Xue
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
| | - Jiajie Peng
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China.
| | - Xuequn Shang
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China.
| |
|
23
|
Peng J, Guan J, Shang X. Predicting Parkinson's Disease Genes Based on Node2vec and Autoencoder. Front Genet 2019; 10:226. [PMID: 31001311 PMCID: PMC6454041 DOI: 10.3389/fgene.2019.00226] [Citation(s) in RCA: 56] [Impact Index Per Article: 11.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2018] [Accepted: 02/28/2019] [Indexed: 12/26/2022] Open
Abstract
Identifying genes associated with Parkinson's disease plays an extremely important role in the diagnosis and treatment of Parkinson's disease. In recent years, based on the guilt-by-association hypothesis, many methods have been proposed to predict disease-related genes, but few of these methods are designed or used for Parkinson's disease gene prediction. In this paper, we propose a novel method for Parkinson's disease gene prediction, named N2A-SVM. N2A-SVM includes three parts: extracting features of genes based on the network, reducing the dimension using a deep neural network, and predicting Parkinson's disease genes using a machine learning method. The evaluation test shows that N2A-SVM performs better than existing methods. Furthermore, we evaluate the significance of each step in the N2A-SVM algorithm and the influence of the hyper-parameters on the result. In addition, we train N2A-SVM on the recent dataset and use it to predict Parkinson's disease genes. The top-ranked predicted genes can be verified through a literature study.
Affiliation(s)
| | | | - Xuequn Shang
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
| |
|
24
|
Cheng L, Zhuang H, Ju H, Yang S, Han J, Tan R, Hu Y. Exposing the Causal Effect of Body Mass Index on the Risk of Type 2 Diabetes Mellitus: A Mendelian Randomization Study. Front Genet 2019; 10:94. [PMID: 30891058 PMCID: PMC6413727 DOI: 10.3389/fgene.2019.00094] [Citation(s) in RCA: 43] [Impact Index Per Article: 8.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2018] [Accepted: 01/29/2019] [Indexed: 12/17/2022] Open
Abstract
Introduction: High body mass index (BMI) is a phenotype positively associated with type 2 diabetes mellitus (T2DM). Abundant studies have observed this from a clinical perspective. With the rapid increase in the number of genetic variants identified by genome-wide association studies (GWAS), common SNPs of BMI and T2DM were identified as the genetic basis for understanding their association. Currently, their causality remains unclear. Materials and Methods: To clarify it, Mendelian randomization (MR), which uses genetic instrumental variables (IVs) to explore the causality between an intermediate phenotype and a disease, was used here to test the effect of BMI on the risk of T2DM. In this article, MR was carried out on GWAS data using 52 independent BMI SNPs as IVs. The pooled odds ratio (OR) of these SNPs was calculated using the inverse-variance weighted method to assess the effect of a 5 kg/m2 higher BMI on the risk of T2DM. Leave-one-out validation was conducted to identify the effect of individual SNPs. MR-Egger regression was used to detect potential pleiotropic bias of variants. Results: We obtained a high OR (1.470; 95% CI 1.170 to 1.847; P = 0.001), a low intercept (0.004, P = 0.661), and small fluctuation of ORs in leave-one-out validation {from -0.039 [(1.412 - 1.470) / 1.470] to 0.075 [(1.568 - 1.470) / 1.470]}. Conclusion: We validate the causal effect of high BMI on the risk of T2DM. The low intercept shows no pleiotropic bias of IVs. The small changes in ORs caused by removing individual SNPs show that no single SNP drives our estimate.
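The inverse-variance weighted pooling named in this abstract can be sketched with the standard fixed-effect formula on summary statistics. This is a generic illustration, not the authors' analysis: it assumes per-SNP effect estimates on the exposure and on the log-odds of the outcome, with standard errors for the outcome effects, and every number in the toy example is hypothetical rather than taken from the study.

```python
import math

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted causal estimate.

    Each SNP contributes the ratio beta_outcome / beta_exposure,
    weighted by beta_exposure**2 / se_outcome**2.
    Returns (pooled log-odds estimate, its standard error).
    """
    weights = [bx * bx / (se * se) for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total = sum(weights)
    beta = sum(w * r for w, r in zip(weights, ratios)) / total
    return beta, math.sqrt(1.0 / total)

def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a log-odds estimate into an odds ratio with a 95% interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical summary statistics for two SNPs.
beta, se = ivw_estimate([0.1, 0.2], [0.04, 0.08], [0.01, 0.02])
odds_ratio, lower, upper = odds_ratio_with_ci(beta, se)
```

A leave-one-out check, as described in the abstract, would rerun `ivw_estimate` with each SNP removed in turn and compare the resulting odds ratios against the full estimate.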
Affiliation(s)
- Liang Cheng
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - He Zhuang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Hong Ju
- Department of Information Engineering, Heilongjiang Biological Science and Technology Career Academy, Harbin, China
| | - Shuo Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Junwei Han
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Renjie Tan
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Yang Hu
- School of Life Sciences and Technology, Harbin Institute of Technology, Harbin, China
| |
|
25
|
Xu L, Liang G, Liao C, Chen GD, Chang CC. k-Skip-n-Gram-RF: A Random Forest Based Method for Alzheimer's Disease Protein Identification. Front Genet 2019; 10:33. [PMID: 30809242 PMCID: PMC6379451 DOI: 10.3389/fgene.2019.00033] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2018] [Accepted: 01/17/2019] [Indexed: 11/18/2022] Open
Abstract
In this paper, a computational method based on machine learning for identifying Alzheimer's disease genes is proposed. Most existing machine learning based methods predict Alzheimer's disease genes by using structural magnetic resonance imaging (MRI). Most of these methods attain acceptable results, but they are expensive and time consuming. Thus, we propose a computational method that identifies Alzheimer's disease genes from the sequence information of proteins and classifies the feature vectors with a random forest. In the proposed method, the protein information of the genes is extracted by adaptive k-skip-n-gram features. Experimental results show that the proposed method attains an accuracy of 85.5% on the selected UniProt dataset.
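To make the k-skip-n-gram idea concrete, here is a minimal generic sketch for the n = 2 case only. It is an assumption-laden illustration: the abstract does not specify how the "adaptive" variant chooses its parameters, so this code simply counts residue pairs separated by at most k skipped positions and normalizes the counts; the function name and the toy sequence are hypothetical.

```python
from collections import Counter

def k_skip_bigram_features(sequence, k):
    """Normalized frequencies of residue pairs separated by 0..k skipped positions."""
    counts = Counter()
    for gap in range(k + 1):
        # Pair position i with the position gap + 1 steps ahead.
        for i in range(len(sequence) - gap - 1):
            counts[(sequence[i], sequence[i + gap + 1])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in counts.items()}

# Toy protein fragment (hypothetical).
features = k_skip_bigram_features("ACA", k=1)
```

In a full pipeline, such a feature dictionary would be mapped onto a fixed-length vector (one slot per possible residue pair) before being passed to a classifier such as a random forest.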
Affiliation(s)
- Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Guangmin Liang
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Changrui Liao
- Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province, College of Optoelectronic Engineering, Shenzhen University, Shenzhen, China
| | - Gin-Den Chen
- Department of Obstetrics and Gynecology, Chung Shan Medical University Hospital, Taichung, Taiwan
| | - Chi-Chang Chang
- School of Medical Informatics, Chung Shan Medical University, Taichung, Taiwan
- IT Office, Chung Shan Medical University Hospital, Taichung, Taiwan
| |
|
26
|
Cheng L, Zhuang H, Yang S, Jiang H, Wang S, Zhang J. Exposing the Causal Effect of C-Reactive Protein on the Risk of Type 2 Diabetes Mellitus: A Mendelian Randomization Study. Front Genet 2018; 9:657. [PMID: 30619477 PMCID: PMC6306438 DOI: 10.3389/fgene.2018.00657] [Citation(s) in RCA: 57] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2018] [Accepted: 12/03/2018] [Indexed: 12/21/2022] Open
Abstract
As a biomarker of inflammation, C-reactive protein (CRP) has attracted much attention due to its role in the incidence of type 2 diabetes mellitus (T2DM). Prospective studies have observed a positive correlation between the level of serum CRP and the incidence of T2DM. Recently, studies have reported that drugs for curing T2DM can also decrease the level of serum CRP. However, it is not yet clear whether high CRP levels cause T2DM. To evaluate this, we conducted a Mendelian randomization (MR) analysis using genetic variations as instrumental variables (IVs). Significantly associated single nucleotide polymorphisms (SNPs) of CRP were obtained from a genome-wide study and a replication study. Therein, 17,967 participants were utilized for the genome-wide association study (GWAS), and another 14,747 participants were utilized for the replication of identifying SNPs associated with CRP levels. The associations between SNPs and T2DM were from the DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) consortium. After removing SNPs in linkage disequilibrium (LD) and T2DM-related SNPs, the four remaining CRP-related SNPs were deemed as IVs. To evaluate the pooled influence of these IVs on the risk of developing T2DM through CRP, the penalized robust inverse-variance weighted (IVW) method was carried out. The combined result (OR 1.114048; 95% CI 1.058656 to 1.172338; P = 0.024) showed that high levels of CRP significantly increase the risk of T2DM. In the subsequent analysis of the relationship between CRP and type 1 diabetes mellitus (T1DM), the pooled result (OR 1.017145; 95% CI 0.9066489 to 1.14225; P = 0.909) supported that CRP levels cannot determine the risk of developing T1DM.
Affiliation(s)
- Liang Cheng
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- He Zhuang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Shuo Yang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Huijie Jiang
  - Department of Radiology, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
- Song Wang
  - Department of Radiology, Longhua Hospital, Shanghai University of Traditional Chinese Medicine, Shanghai, China
- Jun Zhang
  - Heilongjiang Provincial Hospital, Harbin, China
|
27
|
Peng J, Xue H, Hui W, Lu J, Chen B, Jiang Q, Shang X, Wang Y. An online tool for measuring and visualizing phenotype similarities using HPO. BMC Genomics 2018; 19:571. [PMID: 30367579 PMCID: PMC6101067 DOI: 10.1186/s12864-018-4927-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022] Open
Abstract
Background The Human Phenotype Ontology (HPO) is one of the most popular bioinformatics resources. Recently, HPO-based phenotype semantic similarity has been effectively applied to model patient phenotype data. However, the existing tools are adapted from Gene Ontology (GO)-based term similarity measures, and the design of their models is not optimized for the unique features of HPO. In addition, existing tools only allow HPO terms as input and only provide pure text-based outputs. Results We present PhenoSimWeb, a web application that allows researchers to measure HPO-based phenotype semantic similarities using four approaches borrowed from GO-based similarity measurements. In addition, we provide an approach that considers the unique properties of HPO. PhenoSimWeb also accepts text describing phenotypes as input, since clinical phenotype data is always in text form, and provides a graphical interface to visualize the resulting phenotype network. Conclusions PhenoSimWeb is an easy-to-use and functional online application. Researchers can use it to calculate phenotype similarity conveniently, predict phenotype-associated genes or diseases, and visualize the network of phenotype interactions. PhenoSimWeb is available at http://120.77.47.2:8080.
Affiliation(s)
- Jiajie Peng
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China
- Hansheng Xue
  - Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, 518055, China
- Weiwei Hui
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China
- Junya Lu
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China
- Bolin Chen
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China
- Qinghua Jiang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
- Xuequn Shang
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China
- Yadong Wang
  - Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, 518055, China; School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
|
28
|
A Similarity Regression Fusion Model for Integrating Multi-Omics Data to Identify Cancer Subtypes. Genes (Basel) 2018; 9:genes9070314. [PMID: 29933539 PMCID: PMC6070922 DOI: 10.3390/genes9070314] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 03/31/2018] [Revised: 06/08/2018] [Accepted: 06/11/2018] [Indexed: 12/26/2022] Open
Abstract
The identification of cancer subtypes is crucial to cancer diagnosis and treatments. A number of methods have been proposed to identify cancer subtypes by integrating multi-omics data in recent years. However, the existing methods rarely consider the biases of similarity between samples and weights of different omics data in integration. More accurate and flexible integration approaches need to be developed to comprehensively investigate cancer subtypes. In this paper, we propose a simple and flexible similarity fusion model for integrating multi-omics data to identify cancer subtypes. We consider the similarity biases between samples in each omics data and predict corrected similarities between samples using a generalized linear model. We integrate the corrected similarity information from multi-omics data according to different data-view weights. Based on the integrative similarity information, we cluster patient samples into different subtype groups. Comprehensive experiments demonstrate that the proposed approach obtains more significant results than the state-of-the-art integrative methods. In conclusion, our approach provides an effective and flexible tool to investigate subtypes in cancer by integrating multi-omics data.
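The weighted integration step mentioned in this abstract can be sketched as a weighted average of per-omics sample-similarity matrices. This is only an illustration under stated assumptions: the generalized linear model that corrects similarity biases is omitted, and the matrices, the weights, and the function name `fuse_similarities` are hypothetical rather than taken from the paper.

```python
import numpy as np


def fuse_similarities(similarities, weights):
    """Combine per-view similarity matrices using normalized view weights."""
    weights = np.asarray(weights, dtype=float)
    if len(similarities) != len(weights):
        raise ValueError("one weight is required per similarity matrix")
    if np.any(weights < 0) or weights.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    weights = weights / weights.sum()   # normalize so the weights sum to 1

    fused = np.zeros_like(np.asarray(similarities[0], dtype=float))
    for matrix, weight in zip(similarities, weights):
        fused += weight * np.asarray(matrix, dtype=float)
    return fused


# Hypothetical similarities among three samples in two omics views.
expression_sim = np.array([[1.0, 0.8, 0.2],
                           [0.8, 1.0, 0.3],
                           [0.2, 0.3, 1.0]])
methylation_sim = np.array([[1.0, 0.4, 0.6],
                            [0.4, 1.0, 0.1],
                            [0.6, 0.1, 1.0]])

# Assumed view weights: expression counts three times as much as methylation.
fused = fuse_similarities([expression_sim, methylation_sim], weights=[3, 1])
print(fused)
```

The fused matrix could then be passed to any clustering method that accepts a precomputed similarity matrix in order to group samples into candidate subtypes.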
|
29
|
Abstract
BACKGROUND Recently, measuring phenotype similarity has begun to play an important role in disease diagnosis, and researchers have begun to pay attention to developing phenotype similarity measures. However, existing methods ignore the interactions between phenotype-associated proteins, which may lead to inaccurate phenotype similarity. RESULTS We proposed a network-based method, PhenoNet, to calculate the similarity between phenotypes. We localized phenotypes in the network and calculated the similarity between phenotype-associated modules by modeling both the inter- and intra-similarity. CONCLUSIONS PhenoNet was evaluated on two independent evaluation datasets: gene ontology and gene expression data. The results show that PhenoNet performs better than the state-of-the-art methods on all evaluation tests.
Affiliation(s)
- Jiajie Peng
  - School of Computer Science, Northwestern Polytechnical University, Xi’an, China
- Weiwei Hui
  - School of Computer Science, Northwestern Polytechnical University, Xi’an, China
- Xuequn Shang
  - School of Computer Science, Northwestern Polytechnical University, Xi’an, China
|
30
|
Wang Z, Wu X, Wang Y. A framework for analyzing DNA methylation data from Illumina Infinium HumanMethylation450 BeadChip. BMC Bioinformatics 2018; 19:115. [PMID: 29671397 PMCID: PMC5907140 DOI: 10.1186/s12859-018-2096-3] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Indexed: 12/14/2022] Open
Abstract
Background DNA methylation has been found to be widely associated with complex diseases. Among biological platforms to profile DNA methylation in humans, the Illumina Infinium HumanMethylation450 BeadChip (450K) has been accepted as one of the most efficient technologies. However, challenges exist in the analysis of DNA methylation data generated by this technology due to widespread biases. Results Here we proposed a generalized framework for evaluating data analysis methods for the Illumina 450K array. This framework considers the following steps towards a successful analysis: importing data, quality control, within-array normalization, correcting type bias, detecting differentially methylated probes or regions, and biological interpretation. Conclusions We evaluated five methods using three real datasets and identified the best-performing methods for Illumina 450K array data analysis. Minfi and methylumi are the optimal choices when analyzing small datasets. BMIQ and RCP are suitable for correcting type bias, and their normalized results can be used to discover differentially methylated probes (DMPs). The R package missMethyl is suitable for GO term enrichment analysis and biological interpretation.
Affiliation(s)
- Zhenxing Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
- XiaoLiang Wu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
|
31
|
Chu Y, Teng M, Wang Y. Modeling and correct the GC bias of tumor and normal WGS data for SCNA based tumor subclonal population inferring. BMC Bioinformatics 2018; 19:112. [PMID: 29671389 PMCID: PMC5907144 DOI: 10.1186/s12859-018-2099-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND Somatic copy number alterations (SCNAs) can be utilized to infer tumor subclonal populations in whole genome sequencing studies, where their read count ratios between tumor-normal paired samples usually serve as the inference proxy. Existing SCNA-based subclonal population inference tools assume that the GC bias of tumor and normal samples has the same features and can be fully offset by the read count ratio. However, we found that the read count ratio on SCNA segments presents a log-linear biased pattern, which influences the performance of existing read count ratio-based subclonal inference tools. Currently, no correction tools take the read ratio bias into account. RESULTS We present Pre-SCNAClonal, a tool that improves tumor subclonal population inference by correcting GC bias at the SCNA level. Pre-SCNAClonal first corrects GC bias using a Markov chain Monte Carlo probability model, then accurately locates baseline DNA segments (not containing any SCNAs) with a hierarchical clustering model. We show Pre-SCNAClonal's superiority to existing GC-bias correction methods at any level of subclonal population. CONCLUSIONS Pre-SCNAClonal can be run independently or serve as a pre-processing/GC-correction step in conjunction with existing SCNA-based subclonal inference tools.
Affiliation(s)
- Yanshuo Chu
  - Center for Bioinformatics, Harbin Institute of Technology, Harbin, China
- Mingxiang Teng
  - Center for Bioinformatics, Harbin Institute of Technology, Harbin, China
- Yadong Wang
  - Center for Bioinformatics, Harbin Institute of Technology, Harbin, China
|
32
|
Hu Y, Zhao T, Zhang N, Zang T, Zhang J, Cheng L. Identifying diseases-related metabolites using random walk. BMC Bioinformatics 2018; 19:116. [PMID: 29671398 PMCID: PMC5907145 DOI: 10.1186/s12859-018-2098-1] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Indexed: 02/06/2023] Open
Abstract
Background Metabolites disrupted by an abnormal state of the human body are regarded as effects of disease. Compared with causes of disease such as genes, these markers are easier to capture for the prevention and diagnosis of metabolic diseases. Currently, a large number of metabolic markers of diseases remain to be explored, which motivates this work. Methods Existing metabolite-disease associations were extracted from the Human Metabolome Database (HMDB) using the text mining tool NCBO Annotator as prior knowledge. Next, we calculated the similarity of each pair of metabolites based on the similarity of their disease sets. Then, the similarities of all metabolite pairs were used to construct a weighted metabolite association network (WMAN). Subsequently, the network was used to predict novel metabolic markers of diseases using random walk. Results In total, 604 metabolites and 228 diseases were extracted from HMDB. From the 604 metabolites, 453 were selected to construct the WMAN, where each metabolite is a node and the similarity of two metabolites is the weight of the edge linking them. The performance of the network was validated using the leave-one-out method, achieving an area under the receiver operating characteristic curve (AUC) of 0.7048. Further case studies identifying novel metabolites of diabetes mellitus were validated by recent studies. Conclusion In this paper, we presented a novel method for prioritizing metabolite-disease pairs. The superior performance validates its reliability for exploring novel metabolic markers of diseases.
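The network propagation step in this abstract can be sketched as a random walk with restart over a weighted similarity network. The abstract only says "random walk", so the restart formulation, the restart probability, and the four-node weight matrix below are assumptions for illustration, and `random_walk_with_restart` is a hypothetical name rather than the authors' code.

```python
import numpy as np


def random_walk_with_restart(weights, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Score all nodes by their proximity to the seed nodes.

    weights: square matrix of edge weights (e.g. metabolite similarities)
    seeds:   indices of nodes already known to be associated with the disease
    restart: probability of jumping back to the seeds at each step
    """
    w = np.asarray(weights, dtype=float)
    column_sums = w.sum(axis=0)
    column_sums[column_sums == 0] = 1.0     # avoid dividing by zero for isolated nodes
    transition = w / column_sums            # column-normalized transition matrix

    p0 = np.zeros(w.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)      # start with uniform mass on the seeds
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * (transition @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:  # stop once the L1 change is negligible
            return p_next
        p = p_next
    return p


# Hypothetical weighted network of four metabolites (symmetric similarities).
network = np.array([[0.0, 0.9, 0.1, 0.0],
                    [0.9, 0.0, 0.5, 0.2],
                    [0.1, 0.5, 0.0, 0.7],
                    [0.0, 0.2, 0.7, 0.0]])

scores = random_walk_with_restart(network, seeds=[0])
ranking = np.argsort(-scores)               # candidate nodes, best first
print(scores, ranking)
```

Nodes that are strongly connected to the seed set receive higher steady-state probability, so ranking the non-seed nodes by score gives a prioritized list of candidates.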
Affiliation(s)
- Yang Hu
  - School of Life Science and Technology, Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
- Tianyi Zhao
  - School of Life Science and Technology, Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
- Ningyi Zhang
  - School of Life Science and Technology, Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
- Tianyi Zang
  - School of Life Science and Technology, Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
- Jun Zhang
  - Department of Rehabilitation, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, 150001, People's Republic of China
- Liang Cheng
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150001, China
|
33
|
Hao X, Hao J, Wang L, Hou H. Effective norm emergence in cell systems under limited communication. BMC Bioinformatics 2018; 19:119. [PMID: 29671391 PMCID: PMC5907317 DOI: 10.1186/s12859-018-2097-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022] Open
Abstract
Background The cooperation of cells in biological systems is similar to that of agents in cooperative multi-agent systems. Research findings in the multi-agent systems literature can provide valuable inspiration for biological research. The well-coordinated states in cell systems can be viewed as desirable social norms in cooperative multi-agent systems. One important research question is how a norm can rapidly emerge with limited communication resources. Results In this work, we propose a learning approach that trades off the agents’ performance in coordinating on a consistent norm against the communication cost involved. During the learning process, the agents can dynamically adjust their coordination set according to their own observations and pick out the most crucial agents to coordinate with. In this way, our method significantly reduces the coordination dependence among agents. Conclusion The experimental results show that our method can efficiently facilitate social norm emergence among agents and also scales well to large populations.
Affiliation(s)
- Xiaotian Hao
  - School of Computer Science and Software, Tianjin University, Peiyang Park Campus: No.135 Yaguan Road, Haihe Education Park, Tianjin, 300350, China
- Jianye Hao
  - School of Computer Science and Software, Tianjin University, Peiyang Park Campus: No.135 Yaguan Road, Haihe Education Park, Tianjin, 300350, China
- Li Wang
  - School of Computer Science and Software, Tianjin University, Peiyang Park Campus: No.135 Yaguan Road, Haihe Education Park, Tianjin, 300350, China
- Hanxu Hou
  - School of Electrical Engineering and Intelligentization, Dongguan University of Technology, No. 1, University Road, Songshan Lake District, Dongguan, 221116, China
|
34
|
Guo Y, Liu S, Li Z, Shang X. BCDForest: a boosting cascade deep forest model towards the classification of cancer subtypes based on gene expression data. BMC Bioinformatics 2018; 19:118. [PMID: 29671390 PMCID: PMC5907304 DOI: 10.1186/s12859-018-2095-4] [Citation(s) in RCA: 46] [Impact Index Per Article: 7.7] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The classification of cancer subtypes is of great importance to cancer diagnosis and therapy. Many supervised learning approaches have been applied to cancer subtype classification in the past few years, especially deep learning-based approaches. Recently, the deep forest model has been proposed as an alternative to deep neural networks for learning hyper-representations by using cascaded ensembles of decision trees. It has been shown that the deep forest model has competitive or even better performance than deep neural networks to some extent. However, the standard deep forest model may face overfitting and ensemble diversity challenges when dealing with small-sample, high-dimensional biology data. RESULTS In this paper, we propose a deep learning model, called BCDForest, to address cancer subtype classification on small-scale biology datasets; it can be viewed as a modification of the standard deep forest model. BCDForest differs from the standard deep forest model through two main contributions. First, a method named multi-class-grained scanning is proposed to train multiple binary classifiers to encourage ensemble diversity, while the fitting quality of each classifier is considered in representation learning. Second, we propose a boosting strategy to emphasize more important features in cascade forests, thereby propagating the benefits of discriminative features among cascade layers to improve classification performance. Systematic comparison experiments on both microarray and RNA-Seq gene expression datasets demonstrate that our method consistently outperforms the state-of-the-art methods in cancer subtype classification. CONCLUSIONS The multi-class-grained scanning and boosting strategy in our model provide an effective solution to ease the overfitting challenge and improve the robustness of the deep forest model on small-scale data. Our model provides a useful approach to the classification of cancer subtypes by using deep learning on high-dimensional and small-scale biology data.
Affiliation(s)
- Yang Guo
  - School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an, 710072 People’s Republic of China
- Shuhui Liu
  - School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an, 710072 People’s Republic of China
- Zhanhuai Li
  - School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an, 710072 People’s Republic of China
- Xuequn Shang
  - School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an, 710072 People’s Republic of China
|
35
|
Peng J, Wang H, Lu J, Hui W, Wang Y, Shang X. Identifying term relations cross different gene ontology categories. BMC Bioinformatics 2017; 18:573. [PMID: 29297309 PMCID: PMC5751813 DOI: 10.1186/s12859-017-1959-3] [Citation(s) in RCA: 39] [Impact Index Per Article: 5.6] [Indexed: 11/24/2022] Open
Abstract
Background The Gene Ontology (GO) is a community-based bioinformatics resource that employs ontologies to represent biological knowledge and describes information about gene and gene product function. GO includes three independent categories: molecular function, biological process and cellular component. For better biological reasoning, identifying the biological relationships between terms in different categories is important. However, the existing measures of similarity between terms in different categories are either developed using GO data only or use only part of the combined gene co-function network information. Results We propose an iterative ranking-based method called CroGO2 to measure cross-category GO term similarities by incorporating level information of GO terms with both direct and indirect interactions in the gene co-function network. Conclusions The evaluation test shows that CroGO2 performs better than the existing methods. A genome-specific term association network for yeast is also generated by connecting terms with high confidence scores. The linkages in the term association network are supported by the literature. Given a gene set, the related terms identified using the association network overlap with the related terms identified by GO enrichment analysis.
Affiliation(s)
- Jiajie Peng
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Honggang Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Junya Lu
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Weiwei Hui
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Xuequn Shang
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
|