1
|
Zhang Y, Wang Z, Ge F, Wang X, Zhang Y, Li S, Guo Y, Song J, Yu DJ. MLSNet: a deep learning model for predicting transcription factor binding sites. Brief Bioinform 2024; 25:bbae489. [PMID: 39350338 PMCID: PMC11442149 DOI: 10.1093/bib/bbae489] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2024] [Revised: 09/05/2024] [Accepted: 09/16/2024] [Indexed: 10/04/2024] Open
Abstract
Accurate prediction of transcription factor binding sites (TFBSs) is essential for understanding gene regulation mechanisms and the etiology of diseases. Despite numerous advances in deep learning for predicting TFBSs, their performance can still be enhanced. In this study, we propose MLSNet, a novel deep learning architecture designed specifically to predict TFBSs. MLSNet innovatively integrates multisize convolutional fusion with long short-term memory (LSTM) networks to effectively capture DNA-sparse higher-order sequence features. Further, MLSNet incorporates super token attention and Bi-LSTM to systematically extract and integrate higher-order DNA shape features. Experimental results on 165 ChIP-seq (chromatin immunoprecipitation followed by sequencing) datasets indicate that MLSNet consistently outperforms several state-of-the-art algorithms in the prediction of TFBSs. Specifically, MLSNet reports average metrics: 0.8306 for ACC, 0.8992 for AUROC, and 0.9035 for AUPRC, surpassing the second-best methods by 1.82%, 1.68%, and 1.54%, respectively. This research delineates the effectiveness of combining multi-size convolutional layers with LSTM and DNA shape-based features in enhancing predictive accuracy. Moreover, this study comprehensively assesses the variability in model performance across different cell lines and transcription factors. The source code of MLSNet is available at https://github.com/minghaidea/MLSNet.
Collapse
Affiliation(s)
- Yuchuan Zhang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Zhikang Wang
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Wellington Rd, Clayton, Melbourne, VIC 3800, Australia
| | - Fang Ge
- State Key Laboratory of Organic Electronics and Information Displays & Institute of Advanced Materials (IAM), Nanjing University of Posts & Telecommunications, 9 Wenyuan, Nanjing, 210023, China
| | - Xiaoyu Wang
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Wellington Rd, Clayton, Melbourne, VIC 3800, Australia
| | - Yiwen Zhang
- Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, 553 St Kilda Road, Melbourne, VIC 3004, Australia
| | - Shanshan Li
- Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, 553 St Kilda Road, Melbourne, VIC 3004, Australia
| | - Yuming Guo
- Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, 553 St Kilda Road, Melbourne, VIC 3004, Australia
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Wellington Rd, Clayton, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, Wellington Rd, Clayton, Melbourne, VIC 3800, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| |
Collapse
|
2
|
Ye Q, Zhang X, Lin X. Drug-Target Interaction Prediction via Graph Auto-Encoder and Multi-Subspace Deep Neural Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:2647-2658. [PMID: 36107905 DOI: 10.1109/tcbb.2022.3206907] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
Computational prediction of drug-target interaction (DTI) is important for the new drug discovery. Currently, the deep neural network (DNN) has been widely used in DTI prediction. However, parameters of the DNN could be insufficiently trained and features of the data could be insufficiently utilized, because the DTI data is limited and its dimension is very high. To deal with the above problems, in this paper, a graph auto-encoder and multi-subspace deep neural network (GAEMSDNN) is designed. GAEMSDNN enhances its learning ability with a graph auto-encoder, a subspace layer and an ensemble layer. The graph auto-encoder can preserve the reconstruction information. The subspace layer can obtain different strong feature subsets. The ensemble layer in the GAEMSDNN can comprehensively utilize these strong feature subsets in a unified optimization framework. As a result, more features can be extracted from the network input and the DNN network can be better trained. In experiments, the results of GAEMSDNN are significantly improved compared to the previous methods, which validates the effectiveness of our strategies.
Collapse
|
3
|
Zhang P, Zhang W, Sun W, Li L, Xu J, Wang L, Wong L. A lncRNA-disease association prediction tool development based on bridge heterogeneous information network via graph representation learning for family medicine and primary care. Front Genet 2023; 14:1084482. [PMID: 37274787 PMCID: PMC10234424 DOI: 10.3389/fgene.2023.1084482] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2022] [Accepted: 05/02/2023] [Indexed: 06/07/2023] Open
Abstract
Identification of long non-coding RNAs (lncRNAs) associated with common diseases is crucial for patient self-diagnosis and monitoring of health conditions using artificial intelligence (AI) technology at home. LncRNAs have gained significant attention due to their crucial roles in the pathogenesis of complex human diseases and identifying their associations with diseases can aid in developing diagnostic biomarkers at the molecular level. Computational methods for predicting lncRNA-disease associations (LDAs) have become necessary due to the time-consuming and labor-intensive nature of wet biological experiments in hospitals, enabling patients to access LDAs through their AI terminal devices at any time. Here, we have developed a predictive tool, LDAGRL, for identifying potential LDAs using a bridge heterogeneous information network (BHnet) constructed via Structural Deep Network Embedding (SDNE). The BHnet consists of three types of molecules as bridge nodes to implicitly link the lncRNA with disease nodes and the SDNE is used to learn high-quality node representations and make LDA predictions in a unified graph space. To assess the feasibility and performance of LDAGRL, extensive experiments, including 5-fold cross-validation, comparison with state-of-the-art methods, comparison on different classifiers and comparison of different node feature combinations, were conducted, and the results showed that LDAGRL achieved satisfactory prediction performance, indicating its potential as an effective LDAs prediction tool for family medicine and primary care.
Collapse
Affiliation(s)
- Ping Zhang
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
| | - Weihan Zhang
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
| | - Weicheng Sun
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
| | - Li Li
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
| | - Jinsheng Xu
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
| | - Lei Wang
- Guangxi Key Lab of Human-Machine Interaction and Intelligent Decision, Guangxi Academy of Sciences, Nanning, China
| | - Leon Wong
- Guangxi Key Lab of Human-Machine Interaction and Intelligent Decision, Guangxi Academy of Sciences, Nanning, China
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
| |
Collapse
|
4
|
Li B, Tian Y, Tian Y, Zhang S, Zhang X. Predicting Cancer Lymph-Node Metastasis From LncRNA Expression Profiles Using Local Linear Reconstruction Guided Distance Metric Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3179-3189. [PMID: 35139024 DOI: 10.1109/tcbb.2022.3149791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
Lymph-node metastasis is the most perilous cancer progressive state, where long non-coding RNA (lncRNA) has been confirmed to be an important genetic indicator in cancer prediction. However, lncRNA expression profile is often characterized of large features and small samples, it is urgent to establish an efficient judgment to deal with such high dimensional lncRNA data, which will aid in clinical targeted treatment. Thus, in this study, a local linear reconstruction guided distance metric learning is put forward to handle lncRNA data for determination of cancer lymph-node metastasis. In the original locally linear embedding (LLE) approach, any point can be approximately linearly reconstructed using its nearest neighborhood points, from which a novel distance metric can be learned by satisfying both nonnegative and sum-to-one constraints on the reconstruction weights. Taking the defined distance metric and lncRNA data supervised information into account, a local margin model will be deduced to find a low dimensional subspace for lncRNA signature extraction. At last, a classifier is constructed to predict cancer lymph-node metastasis, where the learned distance metric is also adopted. Several experiments on lncRNA data sets have been carried out, and experimental results show the performance of the proposed method by making comparisons with some other related dimensionality reduction methods and the classical classifier models.
Collapse
|
5
|
Liu J, Cheng Y, Wang X, Ge S. One-Step Robust Low-Rank Subspace Segmentation for Tumor Sample Clustering. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2021; 2021:9990297. [PMID: 34925501 PMCID: PMC8674076 DOI: 10.1155/2021/9990297] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/23/2021] [Revised: 11/13/2021] [Accepted: 11/19/2021] [Indexed: 11/18/2022]
Abstract
Clustering of tumor samples can help identify cancer types and discover new cancer subtypes, which is essential for effective cancer treatment. Although many traditional clustering methods have been proposed for tumor sample clustering, advanced algorithms with better performance are still needed. Low-rank subspace clustering is a popular algorithm in recent years. In this paper, we propose a novel one-step robust low-rank subspace segmentation method (ORLRS) for clustering the tumor sample. For a gene expression data set, we seek its lowest rank representation matrix and the noise matrix. By imposing the discrete constraint on the low-rank matrix, without performing spectral clustering, ORLRS learns the cluster indicators of subspaces directly, i.e., performing the clustering task in one step. To improve the robustness of the method, capped norm is adopted to remove the extreme data outliers in the noise matrix. Furthermore, we conduct an efficient solution to solve the problem of ORLRS. Experiments on several tumor gene expression data demonstrate the effectiveness of ORLRS.
Collapse
Affiliation(s)
- Jian Liu
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
| | - Yuhu Cheng
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
| | - Xuesong Wang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
| | - Shuguang Ge
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
| |
Collapse
|
6
|
Wang S, He Y, Chen Z, Zhang Q. FCNGRU: Locating Transcription Factor Binding Sites by combing Fully Convolutional Neural Network with Gated Recurrent Unit. IEEE J Biomed Health Inform 2021; 26:1883-1890. [PMID: 34613923 DOI: 10.1109/jbhi.2021.3117616] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Deciphering the relationship between transcription factors (TFs) and DNA sequences is very helpful for computational inference of gene regulation and a comprehensive understanding of gene regulation mechanisms. Transcription factor binding sites (TFBSs) are specific DNA short sequences that play a pivotal role in controlling gene expression through interaction with TF proteins. Although recently many computational and deep learning methods have been proposed to predict TFBSs aiming to predict sequence specificity of TF-DNA binding, there is still a lack of effective methods to directly locate TFBSs. In order to address this problem, we propose FCNGRU combing a fully convolutional neural network (FCN) with the gated recurrent unit (GRU) to directly locate TFBSs in this paper. Furthermore, we present a two-task framework (FCNGRU-double): one is a classification task at nucleotide level which predicts the probability of each nucleotide and locates TFBSs, and the other is a regression task at sequence level which predicts the intensity of each sequence. A series of experiments are conducted on 45 in-vitro datasets collected from the UniPROBE database derived from universal protein binding microarrays (uPBMs). Compared with competing methods, FCNGRU-double achieves much better results on these datasets. Moreover, FCNGRU-double has an advantage over a single-task framework, FCNGRU-single, which only contains the branch of locating TFBSs. In additionwe combine with in vivo datasets to make a further analysis and discussion. The source codes are avaiable at https://github.com/wangguoguoa/FCNGRU.
Collapse
|
7
|
Yang B, Bao W, Zhang W, Wang H, Song C, Chen Y, Jiang X. Reverse engineering gene regulatory network based on complex-valued ordinary differential equation model. BMC Bioinformatics 2021; 22:448. [PMID: 34544363 PMCID: PMC8451084 DOI: 10.1186/s12859-021-04367-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2021] [Accepted: 09/09/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The growing researches of molecular biology reveal that complex life phenomena have the ability to demonstrating various types of interactions in the level of genomics. To establish the interactions between genes or proteins and understand the intrinsic mechanisms of biological systems have become an urgent need and study hotspot. RESULTS In order to forecast gene expression data and identify more accurate gene regulatory network, complex-valued version of ordinary differential equation (CVODE) is proposed in this paper. In order to optimize CVODE model, a complex-valued hybrid evolutionary method based on Grammar-guided genetic programming and complex-valued firefly algorithm is presented. CONCLUSIONS When tested on three real gene expression datasets from E. coli and Human Cell, the experiment results suggest that CVODE model could improve 20-50% prediction accuracy of gene expression data, which could also infer more true-positive regulatory relationships and less false-positive regulations than ordinary differential equation.
Collapse
Affiliation(s)
- Bin Yang
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277160, China
| | - Wenzheng Bao
- School of Information and Electrical Engineering, Xuzhou University of Technology, Xuzhou, 221018, China.
| | - Wei Zhang
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277160, China
| | - Haifeng Wang
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277160, China
| | - Chuandong Song
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277160, China
| | - Yuehui Chen
- School of Information Science and Engineering, University of Jinan, Jinan, 250022, China
| | - Xiuying Jiang
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277160, China
| |
Collapse
|
8
|
Liu Y, Wang SL, Zhang JF, Zhang W, Zhou S, Li W. DMFMDA: Prediction of Microbe-Disease Associations Based on Deep Matrix Factorization Using Bayesian Personalized Ranking. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1763-1772. [PMID: 32816678 DOI: 10.1109/tcbb.2020.3018138] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Identifying the microbe-disease associations is conducive to understanding the pathogenesis of disease from the perspective of microbe. In this paper, we propose a deep matrix factorization prediction model (DMFMDA) based on deep neural network. First, the disease one-hot encoding is fed into neural network, which is transformed into a low-dimensional dense vector in implicit semantic space via embedding layer, and so is microbe. Then, matrix factorization is realized by neural network with embedding layer. Furthermore, our model synthesizes the non-linear modeling advantages of multi-layer perceptron based on the linear modeling advantages of matrix factorization. Finally, different from other methods using square error loss function, Bayesian Personalized Ranking optimizes the model from a ranking perspective to obtain the optimal model parameters, which makes full use of the unobserved data. Experiments show that DMFMDA reaches average AUCs of 0.9091 and 0.9103 in the framework of 5-fold cross validation and Leave-one-out cross validation, which is superior to three the-state-of-art methods. In case studies, 10, 9 and 9 out of top-10 candidate microbes are verified by recently published literature for asthma, inflammatory bowel disease and colon cancer, respectively. In conclusion, DMFMDA is successful application of deep learning in the prediction of microbe-disease association.
Collapse
|
9
|
Sharma G, Rana PS, Bawa S. Hybrid Machine Learning Models for Predicting Types of Human T-cell Lymphotropic Virus. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1524-1534. [PMID: 31567100 DOI: 10.1109/tcbb.2019.2944610] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Life threatening diseases like adult T-cell leukemia, neurodegenerative diseases, and demyelinating diseases such as HTLV-1 based myelopathy/tropical spastic paraparesis (HAM/TSP), hypocalcaemia, and bone lesions are caused by a group of human retrovirus known as Human T-cell Lymphotropic virus (HTLV). Out of the four different types of HTLVs, HTLV-1 is most prominent in scourging over 20 million people around the world and still not much effort has been made in understanding the epidemiology and controlling the prevalence of this virus. This condition further worsens when most of the infected cases remain asymptomatic throughout their lifetime due to the limited diagnostic methods; that are most of the times unavailable for timely detection of infected individuals. Moreover, at present, there is no licensed vaccination for HTLV-1 infection. Therefore, there is a need to develop the faster and efficient diagnostic method for the detection of HTLV-1. Influenced from the outcomes of the machine learning techniques in the field of bio-informatics, this is the first study in which 64 hybrid machine learning techniques have been proposed for the prediction of different type of HTLVs (HTLV-1, HTLV-2, and HTLV-3). The hybrid techniques are built by permutation and combination of four classification methods, four feature weighting, and four feature selection techniques. The proposed hybrid models when evaluated on the basis of various model evaluation parameters are found to be capable of efficiently predicting the type of HTLVs. The best hybrid model has been identified by having accuracy, an AUROC value, and F1 score of 99.85 percent, 0.99, and 0.99, respectively. This kind of the system can assist the current diagnostic system for the detection of HTLV-1 as after the molecular diagnostics of HTLV by various screening tests like enzyme-linked immunoassay or particle agglutination assays there is always a need of confirmatory tests like western blotting, immuno-fluorescence assay, or radio-immuno-precipitation assay for distinguishing HTLV-1 from HTLV-2. These confirmatory tests are indeed very complex analytical techniques involving various steps. The proposed hybrid techniques can be used to support and verify the results of confirmatory test from the protein mixture. Furthermore, better insights about the virus can be obtained by exploring the physicochemical properties of the protein sequences of HTLVs.
Collapse
|
10
|
Wang S, Zhang Q, Shen Z, He Y, Chen ZH, Li J, Huang DS. Predicting transcription factor binding sites using DNA shape features based on shared hybrid deep learning architecture. MOLECULAR THERAPY-NUCLEIC ACIDS 2021; 24:154-163. [PMID: 33767912 PMCID: PMC7972936 DOI: 10.1016/j.omtn.2021.02.014] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/19/2020] [Accepted: 02/14/2021] [Indexed: 12/26/2022]
Abstract
The study of transcriptional regulation is still difficult yet fundamental in molecular biology research. Recent research has shown that the double helix structure of nucleotides plays an important role in improving the accuracy and interpretability of transcription factor binding sites (TFBSs). Although several computational methods have been designed to take both DNA sequence and DNA shape features into consideration simultaneously, how to design an efficient model is still an intractable topic. In this paper, we proposed a hybrid convolutional recurrent neural network (CNN/RNN) architecture, CRPTS, to predict TFBSs by combining DNA sequence and DNA shape features. The novelty of our proposed method relies on three critical aspects: (1) the application of a shared hybrid CNN and RNN has the ability to efficiently extract features from large-scale genomic sequences obtained by high-throughput technology; (2) the common patterns were found from DNA sequences and their corresponding DNA shape features; (3) our proposed CRPTS can capture local structural information of DNA sequences without completely relying on DNA shape data. A series of comprehensive experiments on 66 in vitro datasets derived from universal protein binding microarrays (uPBMs) shows that our proposed method CRPTS obviously outperforms the state-of-the-art methods.
Collapse
Affiliation(s)
- Siguo Wang
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China
| | - Qinhu Zhang
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China.,Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Tongji University, Siping Road 1239, Shanghai 200092, China
| | - Zhen Shen
- School of Computer and Software, Nanyang Institute of Technology, Changjiang Road 80, Nanyang, Henan 473004, China
| | - Ying He
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China
| | - Zhen-Heng Chen
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
| | - Jianqiang Li
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
| | - De-Shuang Huang
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China
| |
Collapse
|
11
|
Abstract
Aims:
Robust and more accurate method for identifying transcription factor binding sites
(TFBS) for gene expression.
Background:
Deep neural networks (DNNs) have shown promising growth in solving complex
machine learning problems. Conventional techniques are comfortably replaced by DNNs in
computer vision, signal processing, healthcare, and genomics. Understanding DNA sequences is
always a crucial task in healthcare and regulatory genomics. For DNA motif prediction, choosing the
right dataset with a sufficient number of input sequences is crucial in order to design an effective
model.
Objective:
Designing a new algorithm which works on different dataset while an improved
performance for TFBS prediction.
Methods:
With the help of Layerwise Relevance Propagation, the proposed algorithm identifies the
invariant features with adaptive noise patterns.
Results:
The performance is compared by calculating various metrics on standard as well as recent
methods and significant improvement is noted.
Conclusion:
By identifying the invariant and robust features in the DNA sequences, the
classification performance can be increased.
Collapse
Affiliation(s)
- Kanu Geete
- Department of Computer Science & Engineering, Maulana Azad National Institute of Technology, Bhopal, India
| | - Manish Pandey
- Department of Computer Science & Engineering, Maulana Azad National Institute of Technology, Bhopal, India
| |
Collapse
|
12
|
Ding C, Tian W, Geng C, Zhu X, Peng Q, Zhao Z. Multi-template supervised descent method for face alignment. COGN SYST RES 2021. [DOI: 10.1016/j.cogsys.2020.09.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
13
|
Hind J, Lisboa P, Hussain AJ, Al-Jumeily D. A Novel Approach to Detecting Epistasis using Random Sampling Regularisation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1535-1545. [PMID: 31634840 DOI: 10.1109/tcbb.2019.2948330] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Epistasis is a progressive approach that complements the 'common disease, common variant' hypothesis that highlights the potential for connected networks of genetic variants collaborating to produce a phenotypic expression. Epistasis is commonly performed as a pairwise or limitless-arity capacity that considers variant networks as either variant vs variant or as high order interactions. This type of analysis extends the number of tests that were previously performed in a standard approach such as Genome-Wide Association Study (GWAS), in which False Discovery Rate (FDR) is already an issue, therefore by multiplying the number of tests up to a factorial rate also increases the issue of FDR. Further to this, epistasis introduces its own limitations of computational complexity and intensity that are generated based on the analysis performed; to consider the most intense approach, a multivariate analysis introduces a time complexity of O(n!). Proposed in this paper is a novel methodology for the detection of epistasis using interpretable methods and best practice to outline interactions through filtering processes. Using a process of Random Sampling Regularisation which randomly splits and produces sample sets to conduct a voting system to regularise the significance and reliability of biological markers, SNPs. Preliminary results are promising, outlining a concise detection of interactions. Results for the detection of epistasis, in the classification of breast cancer patients, indicated eight outlined risk candidate interactions from five variants and a singular candidate variant with high protective association.
Collapse
|
14
|
Peng C, Zheng Y, Huang DS. Capsule Network Based Modeling of Multi-omics Data for Discovery of Breast Cancer-Related Genes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1605-1612. [PMID: 30969931 DOI: 10.1109/tcbb.2019.2909905] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/23/2023]
Abstract
Breast cancer is one of the most common cancers all over the world, which bring about more than 450,000 deaths each year. Although this malignancy has been extensively studied by a large number of researchers, its prognosis is still poor. Since therapeutic advance can be obtained based on gene signatures, there is an urgent need to discover genes related to breast cancer that may help uncover the mechanisms in cancer progression. We propose a deep learning method for the discovery of breast cancer-related genes by using Capsule Network based Modeling of Multi-omics Data (CapsNetMMD). In CapsNetMMD, we make use of known breast cancer-related genes to transform the issue of gene identification into the issue of supervised classification. The features of genes are generated through comprehensive integration of multi-omics data, e.g., mRNA expression, z scores for mRNA expression, DNA methylation, and two forms of DNA copy-number alterations (CNAs). By modeling features based on the capsule network, we identify breast cancer-related genes with a significantly better performance than other existing machine learning methods. The predicted genes with prognostic values play potential important roles in breast cancer and may serve as candidates for biologists and medical scientists in the future studies of biomarkers.
Collapse
|
15
|
Zhao L, Yan H. MCNF: A Novel Method for Cancer Subtyping by Integrating Multi-Omics and Clinical Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1682-1690. [PMID: 30990192 DOI: 10.1109/tcbb.2019.2910515] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
In the age of personalized medicine, there is a great need to classify cancer (from the same organ site) into homogeneous subtypes. Recent technology advancements in genome-wide molecular profiling have made it possible to profiling multiple molecular datasets to characterize the genomic changes in various cancer types. How to take full advantage of the availability of these omics data? And how to integrate these molecular data with patient clinical data to do a more systematic subtyping of cancer are the focuses of the paper. We proposed a new method called Molecular and Clinical Networks Fusion (MCNF) to classify cancer into homogeneous subtypes. Our method has two highlights: one is that it can integrate both numerical and non-numerical data into the fused network; the next highlight is that it is unsupervised, which means it can automatically determine the optimal number of clusters.
Collapse
|
16
|
Abstract
Aims:Post-Translational Modifications (PTMs), which include more than 450 types, can be regarded as the fundamental cellular regulation.Background:Recently, experiments demonstrated that the lysine malonylation modification is a significant process in several organisms and cells. Meanwhile, malonylation plays an important role in the regulation of protein subcellular localization, stability, translocation to lipid rafts and many other protein functions.Objective:Identification of malonylation will contribute to understanding the molecular mechanism in the field of biology. Nevertheless, several existing experimental approaches, which can hardly meet the need of the high speed data generation, are expensive and time-consuming. Moreover, some machine learning methods can hardly meet the high-accuracy need in this issue.Methods:In this study, we proposed a method, named MSIT that means malonylation sites identification tree, utilized the amino acid residues and profile information to identify the lysine malonylation sites with the tree structural neural network in the peptides sequence level.Methods:The proposed algorithm can get 0.8699 of F1 score and 89.34% in true positive ratio in E. coli. MSIT outperformed existing malonylation site identification methods and features on different species datasets.Conclusion:Based on these measures, it can be demonstrated that MSIT will be helpful in identifying candidate malonylation sites.
Collapse
Affiliation(s)
- Wenzheng Bao
- School of Information and Electrical Engineering, Xuzhou University of Technology, Xuzhou 221018, China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
| | - Yue-Hui Chen
- School of Information, University of Jinan, Jinan 250022, China
| |
Collapse
|
17
|
Akram P, Liao L. Prediction of comorbid diseases using weighted geometric embedding of human interactome. BMC Med Genomics 2019; 12:161. [PMID: 31888634 PMCID: PMC6936100 DOI: 10.1186/s12920-019-0605-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/01/2019] [Accepted: 10/16/2019] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND Comorbidity is the phenomenon of two or more diseases occurring simultaneously not by random chance and presents great challenges to accurate diagnosis and treatment. As an effort toward better understanding the genetic causes of comorbidity, in this work, we have developed a computational method to predict comorbid diseases. Two diseases sharing common genes tend to increase their comorbidity. Previous work shows that after mapping the associated genes onto the human interactome the distance between the two disease modules (subgraphs) is correlated with comorbidity. METHODS To fully incorporate structural characteristics of interactome as features into prediction of comorbidity, our method embeds the human interactome into a high dimensional geometric space with weights assigned to the network edges and uses the projection onto different dimension to "fingerprint" disease modules. A supervised machine learning classifier is then trained to discriminate comorbid diseases versus non-comorbid diseases. RESULTS In cross-validation using a benchmark dataset of more than 10,000 disease pairs, we report that our model achieves remarkable performance of ROC score = 0.90 for comorbidity threshold at relative risk RR = 0 and 0.76 for comorbidity threshold at RR = 1, and significantly outperforms the previous method and the interactome generated by annotated data. To further incorporate prior knowledge pathways association with diseases, we weight the protein-protein interaction network edges according to their frequency of occurring in those pathways in such a way that edges with higher frequency will more likely be selected in the minimum spanning tree for geometric embedding. Such weighted embedding is shown to lead to further improvement of comorbid disease prediction. CONCLUSION The work demonstrates that embedding the two-dimension planar graph of human interactome into a high dimensional geometric space allows for characterizing and capturing disease modules (subgraphs formed by the disease associated genes) from multiple perspectives, and hence provides enriched features for a supervised classifier to discriminate comorbid disease pairs from non-comorbid disease pairs more accurately than based on simply the module separation.
Collapse
Affiliation(s)
- Pakeeza Akram
- School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), H-12, Islamabad, Pakistan
- Department of Computer Science, University of Delaware, Newark, USA
| | - Li Liao
- Department of Computer Science, University of Delaware, Newark, USA
| |
Collapse
|
18
|
Wu H, Yang R, Fu Q, Chen J, Lu W, Li H. Research on predicting 2D-HP protein folding using reinforcement learning with full state space. BMC Bioinformatics 2019; 20:685. [PMID: 31874607 PMCID: PMC6929271 DOI: 10.1186/s12859-019-3259-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Protein structure prediction has always been an important issue in bioinformatics. Prediction of the two-dimensional structure of proteins based on the hydrophobic polarity model is a typical non-deterministic polynomial hard problem. Currently reported hydrophobic polarity model optimization methods, greedy method, brute-force method, and genetic algorithm usually cannot converge robustly to the lowest energy conformations. Reinforcement learning with the advantages of continuous Markov optimal decision-making and maximizing global cumulative return is especially suitable for solving global optimization problems of biological sequences. RESULTS In this study, we proposed a novel hydrophobic polarity model optimization method derived from reinforcement learning which structured the full state space, and designed an energy-based reward function and a rigid overlap detection rule. To validate the performance, sixteen sequences were selected from the classical data set. The results indicated that reinforcement learning with full states successfully converged to the lowest energy conformations against all sequences, while the reinforcement learning with partial states folded 50% sequences to the lowest energy conformations. Reinforcement learning with full states hits the lowest energy on an average 5 times, which is 40 and 100% higher than the three and zero hit by the greedy algorithm and reinforcement learning with partial states respectively in the last 100 episodes. CONCLUSIONS Our results indicate that reinforcement learning with full states is a powerful method for predicting two-dimensional hydrophobic-polarity protein structure. It has obvious competitive advantages compared with greedy algorithm and reinforcement learning with partial states.
Collapse
Affiliation(s)
- Hongjie Wu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China
| | - Ru Yang
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China
| | - Qiming Fu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China. .,Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, 215009, China.
| | - Jianping Chen
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China.,Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, 215009, China
| | - Weizhong Lu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China
| | - Haiou Li
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China
| |
Collapse
|
19
|
Abstract
Background Cost-sensitive algorithm is an effective strategy to solve imbalanced classification problem. However, the misclassification costs are usually determined empirically based on user expertise, which leads to unstable performance of cost-sensitive classification. Therefore, an efficient and accurate method is needed to calculate the optimal cost weights. Results In this paper, two approaches are proposed to search for the optimal cost weights, targeting at the highest weighted classification accuracy (WCA). One is the optimal cost weights grid searching and the other is the function fitting. Comparisons are made between these between the two algorithms above. In experiments, we classify imbalanced gene expression data using extreme learning machine to test the cost weights obtained by the two approaches. Conclusions Comprehensive experimental results show that the function fitting method is generally more efficient, which can well find the optimal cost weights with acceptable WCA.
Collapse
|
20
|
Wu MJ, Gao YL, Liu JX, Zhu R, Wang J. Principal Component Analysis Based on Graph Laplacian and Double Sparse Constraints for Feature Selection and Sample Clustering on Multi-View Data. Hum Hered 2019; 84:47-58. [PMID: 31466072 DOI: 10.1159/000501653] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2019] [Accepted: 06/23/2019] [Indexed: 11/19/2022] Open
Abstract
Principal component analysis (PCA) is a widely used method for evaluating low-dimensional data. Some variants of PCA have been proposed to improve the interpretation of the principal components (PCs). One of the most common methods is sparse PCA which aims at finding a sparse basis to improve the interpretability over the dense basis of PCA. However, the performances of these improved methods are still far from satisfactory because the data still contain redundant PCs. In this paper, a novel method called PCA based on graph Laplacian and double sparse constraints (GDSPCA) is proposed to improve the interpretation of the PCs and consider the internal geometry of the data. In detail, GDSPCA utilizes L2,1-norm and L1-norm regularization terms simultaneously to enforce the matrix to be sparse by filtering redundant and irrelative PCs, where the L2,1-norm regularization term can produce row sparsity, while the L1-norm regularization term can enforce element sparsity. This way, we can make a better interpretation of the new PCs in low-dimensional subspace. Meanwhile, the method of GDSPCA integrates graph Laplacian into PCA to explore the geometric structure hidden in the data. A simple and effective optimization solution is provided. Extensive experiments on multi-view biological data demonstrate the feasibility and effectiveness of the proposed approach.
Collapse
Affiliation(s)
- Ming-Juan Wu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| | - Ying-Lian Gao
- Library of Qufu Normal University, Qufu Normal University, Rizhao, China,
| | - Jin-Xing Liu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| | - Rong Zhu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| | - Juan Wang
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| |
Collapse
|
21
|
Zhang Q, Zhu L, Huang DS. High-Order Convolutional Neural Network Architecture for Predicting DNA-Protein Binding Sites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1184-1192. [PMID: 29993783 DOI: 10.1109/tcbb.2018.2819660] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/15/2023]
Abstract
Although Deep learning algorithms have outperformed conventional methods in predicting the sequence specificities of DNA-protein binding, they lack to consider the dependencies among nucleotides and the diverse binding lengths for different transcription factors (TFs). To address the above two limitations simultaneously, in this paper, we propose a high-order convolutional neural network architecture (HOCNN), which employs a high-order encoding method to build high-order dependencies among nucleotides, and a multi-scale convolutional layer to capture the motif features of different length. The experimental results on real ChIP-seq datasets show that the proposed method outperforms the state-of-the-art deep learning method (DeepBind) in the motif discovery task. In addition, we provide further insights about the importance of introducing additional convolutional kernels and the degeneration problem of importing high-order in the motif discovery task.
Collapse
|
22
|
Zhang Q, Shen Z, Huang DS. Modeling in-vivo protein-DNA binding by combining multiple-instance learning with a hybrid deep neural network. Sci Rep 2019; 9:8484. [PMID: 31186519 PMCID: PMC6559991 DOI: 10.1038/s41598-019-44966-x] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2018] [Accepted: 05/15/2019] [Indexed: 01/26/2023] Open
Abstract
Modeling in-vivo protein-DNA binding is not only fundamental for further understanding of the regulatory mechanisms, but also a challenging task in computational biology. Deep-learning based methods have succeed in modeling in-vivo protein-DNA binding, but they often (1) follow the fully supervised learning framework and overlook the weakly supervised information of genomic sequences that a bound DNA sequence may has multiple TFBS(s), and, (2) use one-hot encoding to encode DNA sequences and ignore the dependencies among nucleotides. In this paper, we propose a weakly supervised framework, which combines multiple-instance learning with a hybrid deep neural network and uses k-mer encoding to transform DNA sequences, for modeling in-vivo protein-DNA binding. Firstly, this framework segments sequences into multiple overlapping instances using a sliding window, and then encodes all instances into image-like inputs of high-order dependencies using k-mer encoding. Secondly, it separately computes a score for all instances in the same bag using a hybrid deep neural network that integrates convolutional and recurrent neural networks. Finally, it integrates the predicted values of all instances as the final prediction of this bag using the Noisy-and method. The experimental results on in-vivo datasets demonstrate the superior performance of the proposed framework. In addition, we also explore the performance of the proposed framework when using k-mer encoding, and demonstrate the performance of the Noisy-and method by comparing it with other fusion methods, and find that adding recurrent layers can improve the performance of the proposed framework.
Collapse
Affiliation(s)
- Qinhu Zhang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - Zhen Shen
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China.
| |
Collapse
|
23
|
You ZH, Huang W, Zhang S, Huang YA, Yu CQ, Li LP. An Efficient Ensemble Learning Approach for Predicting Protein-Protein Interactions by Integrating Protein Primary Sequence and Evolutionary Information. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 16:809-817. [PMID: 30475726 DOI: 10.1109/tcbb.2018.2882423] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Protein-protein interactions (PPIs) perform a very important function in many cellular processes, including signal transduction, post-translational modifications, apoptosis, and cell growth. Deregulation of PPIs results in many diseases, including cancer and pernicious anemia. Although many high-throughput methods have been applied to generate a large amount of PPIs data, they are generally expensive, inefficient and labor-intensive. Hence, there is an urgent need for developing a computational method to accurately and rapidly detect PPIs. In this article, we proposed a highly efficient approach to predict PPIs by integrating a new protein sequence substitution matrix feature representation and ensemble weighted sparse representation model classifier. The proposed method is demonstrated on Saccharomyces cerevisiae dataset and achieved 99.26% prediction accuracy with 98.53% sensitivity at precision of 100%, which is shown to have much higher predictive accuracy than current state-of-the-art algorithms. Extensive experiments are performed with the benchmark data set from Human and Helicobacter pylori that the proposed method achieves outstanding better success rates than other existing approaches in this problem. Experiment results illustrate that our proposed method presents an economical approach for computational building of PPI networks, which can be a helpful supplementary method for future proteomics researches.
Collapse
|
24
|
Laplacian regularized low-rank representation for cancer samples clustering. Comput Biol Chem 2018; 78:504-509. [PMID: 30528509 DOI: 10.1016/j.compbiolchem.2018.11.003] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2018] [Accepted: 11/07/2018] [Indexed: 12/18/2022]
Abstract
Cancer samples clustering based on biomolecular data has been becoming an important tool for cancer classification. The recognition of cancer types is of great importance for cancer treatment. In this paper, in order to improve the accuracy of cancer recognition, we propose to use Laplacian regularized Low-Rank Representation (LLRR) to cluster the cancer samples based on genomic data. In LLRR method, the high-dimensional genomic data are approximately treated as samples extracted from a combination of several low-rank subspaces. The purpose of LLRR method is to seek the lowest-rank representation matrix based on a dictionary. Because a Laplacian regularization based on manifold is introduced into LLRR, compared to the Low-Rank Representation (LRR) method, besides capturing the global geometric structure, LLRR can capture the intrinsic local structure of high-dimensional observation data well. And what is more, in LLRR, the original data themselves are selected as a dictionary, so the lowest-rank representation is actually a similar expression between the samples. Therefore, corresponding to the low-rank representation matrix, the samples with high similarity are considered to come from the same subspace and are grouped into a class. The experiment results on real genomic data illustrate that LLRR method, compared with LRR and MLLRR, is more robust to noise and has a better ability to learn the inherent subspace structure of data, and achieves remarkable performance in the clustering of cancer samples.
Collapse
|
25
|
Recurrent Neural Network for Predicting Transcription Factor Binding Sites. Sci Rep 2018; 8:15270. [PMID: 30323198 PMCID: PMC6189047 DOI: 10.1038/s41598-018-33321-1] [Citation(s) in RCA: 104] [Impact Index Per Article: 14.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2018] [Accepted: 09/25/2018] [Indexed: 12/23/2022] Open
Abstract
It is well known that DNA sequence contains a certain amount of transcription factors (TF) binding sites, and only part of them are identified through biological experiments. However, these experiments are expensive and time-consuming. To overcome these problems, some computational methods, based on k-mer features or convolutional neural networks, have been proposed to identify TF binding sites from DNA sequences. Although these methods have good performance, the context information that relates to TF binding sites is still lacking. Research indicates that standard recurrent neural networks (RNN) and its variants have better performance in time-series data compared with other models. In this study, we propose a model, named KEGRU, to identify TF binding sites by combining Bidirectional Gated Recurrent Unit (GRU) network with k-mer embedding. Firstly, DNA sequences are divided into k-mer sequences with a specified length and stride window. And then, we treat each k-mer as a word and pre-trained word representation model though word2vec algorithm. Thirdly, we construct a deep bidirectional GRU model for feature learning and classification. Experimental results have shown that our method has better performance compared with some state-of-the-art methods. Additional experiments about embedding strategy show that k-mer embedding will be helpful to enhance model performance. The robustness of KEGRU is proved by experiments with different k-mer length, stride window and embedding vector dimension.
Collapse
|
26
|
Deng SP, Guo WL. Identifying Key Genes of Liver Cancer by Networking of Multiple Data Sets. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 16:792-800. [PMID: 30296239 DOI: 10.1109/tcbb.2018.2874238] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Liver cancer is one of the deadliest cancers in the world. To find effective therapies for this cancer, it is indispensable to identify key genes, which may play critical roles in the incidence of the liver cancer. To identify key genes of the liver cancer with high accuracy, we integrated multiple microarray gene expression data sets to compute common differentially expressed genes, which will result more accurate than those from individual data set. To find the main functions or pathways that these genes are involved in, some enrichment analyses were performed including functional enrichment analysis, pathway enrichment analysis, and disease association study. Based on these genes, a protein-protein interaction network was constructed and analyzed to identify key genes of the liver cancer by combining the local and global influence of nodes in the network. The identified key genes, such as TOP2A, ESR1, and KMO, have been demonstrated to be key biomarkers of the liver cancer in many publications. All the results suggest that our method can effectively identify key genes of the liver cancer. Moreover, our method can be applied to other types of data sets to select key genes of other complex diseases.
Collapse
|
27
|
Liu J, Cheng Y, Wang X, Cui X, Kong Y, Du J. Low Rank Subspace Clustering via Discrete Constraint and Hypergraph Regularization for Tumor Molecular Pattern Discovery. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1500-1512. [PMID: 29993749 DOI: 10.1109/tcbb.2018.2834371] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Tumor clustering is a powerful approach for cancer class discovery which is crucial to the effective treatment of cancer. Many traditional clustering methods such as NMF-based models, have been widely used to identify tumors. However, they cannot achieve satisfactory results. Recently, subspace clustering approaches have been proposed to improve the performance by dividing the original space into multiple low-dimensional subspaces. Among them, low rank representation is becoming a popular approach to attain subspace clustering. In this paper, we propose a novel Low Rank Subspace Clustering model via Discrete Constraint and Hypergraph Regularization (DHLRS). The proposed method learns the cluster indicators directly by using discrete constraint, which makes the clustering task simple. For each subspace, we adopt Schatten -norm to better approximate the low rank constraint. Moreover, Hypergraph Regularization is adopted to infer the complex relationship between genes and intrinsic geometrical structure of gene expression data in each subspace. Finally, the molecular pattern of tumor gene expression data sets is discovered according to the optimized cluster indicators. Experiments on both synthetic data and real tumor gene expression data sets prove the effectiveness of proposed DHLRS.
Collapse
|
28
|
Bao W, Yuan CA, Zhang Y, Han K, Nandi AK, Honig B, Huang DS. Mutli-Features Prediction of Protein Translational Modification Sites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1453-1460. [PMID: 28961121 DOI: 10.1109/tcbb.2017.2752703] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Post translational modification plays a significiant role in the biological processing. The potential post translational modification is composed of the center sites and the adjacent amino acid residues which are fundamental protein sequence residues. It can be helpful to perform their biological functions and contribute to understanding the molecular mechanisms that are the foundations of protein design and drug design. The existing algorithms of predicting modified sites often have some shortcomings, such as lower stability and accuracy. In this paper, a combination of physical, chemical, statistical, and biological properties of a protein have been ulitized as the features, and a novel framework is proposed to predict a protein's post translational modification sites. The multi-layer neural network and support vector machine are invoked to predict the potential modified sites with the selected features that include the compositions of amino acid residues, the E-H description of protein segments, and several properties from the AAIndex database. Being aware of the possible redundant information, the feature selection is proposed in the propocessing step in this research. The experimental results show that the proposed method has the ability to improve the accuracy in this classification issue.
Collapse
|
29
|
Yuan L, Guo LH, Yuan CA, Zhang YH, Han K, Nandi A, Honig B, Huang DS. Integration of Multi-omics Data for Gene Regulatory Network Inference and Application to Breast Cancer. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 16:782-791. [PMID: 30137012 DOI: 10.1109/tcbb.2018.2866836] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Underlying a cancer phenotype is a specific gene regulatory network that represents the complex regulatory relationships between genes. However, it remains a challenge to find cancer-related gene regulatory network because of insufficient sample sizes and complex regulatory mechanisms in which gene is influenced by not only other genes but also other biological factors. With the development of high-throughput technologies and the unprecedented wealth of multi-omics data give us a new opportunity to design machine learning method to investigate underlying gene regulatory network. In this paper, we propose an approach, which use biweight midcorrelation to measure the correlation between factors and make use of nonconvex penalty based sparse regression for gene regulatory network inference (BMNPGRN). BMNCGRN incorporates multi-omics data (including DNA methylation and copy number variation) and their interactions in gene regulatory network model. The experimental results on synthetic datasets show that BMNPGRN outperforms popular and state-of-the-art methods (including DCGRN, ARACNE and CLR) under false positive control. Furthermore, we applied BMNPGRN on breast cancer (BRCA) data from The Cancer Genome Atlas database and provided gene regulatory network.
Collapse
|
30
|
Detection of Protein Complexes Based on Penalized Matrix Decomposition in a Sparse Protein⁻Protein Interaction Network. Molecules 2018; 23:molecules23061460. [PMID: 29914123 PMCID: PMC6100434 DOI: 10.3390/molecules23061460] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2018] [Revised: 06/11/2018] [Accepted: 06/12/2018] [Indexed: 01/20/2023] Open
Abstract
High-throughput technology has generated large-scale protein interaction data, which is crucial in our understanding of biological organisms. Many complex identification algorithms have been developed to determine protein complexes. However, these methods are only suitable for dense protein interaction networks, because their capabilities decrease rapidly when applied to sparse protein–protein interaction (PPI) networks. In this study, based on penalized matrix decomposition (PMD), a novel method of penalized matrix decomposition for the identification of protein complexes (i.e., PMDpc) was developed to detect protein complexes in the human protein interaction network. This method mainly consists of three steps. First, the adjacent matrix of the protein interaction network is normalized. Second, the normalized matrix is decomposed into three factor matrices. The PMDpc method can detect protein complexes in sparse PPI networks by imposing appropriate constraints on factor matrices. Finally, the results of our method are compared with those of other methods in human PPI network. Experimental results show that our method can not only outperform classical algorithms, such as CFinder, ClusterONE, RRW, HC-PIN, and PCE-FR, but can also achieve an ideal overall performance in terms of a composite score consisting of F-measure, accuracy (ACC), and the maximum matching ratio (MMR).
Collapse
|
31
|
Ge SG, Xia J, Sha W, Zheng CH. Cancer Subtype Discovery Based on Integrative Model of Multigenomic Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:1115-1121. [PMID: 28113782 DOI: 10.1109/tcbb.2016.2621769] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
One major goal of large-scale cancer omics study is to understand molecular mechanisms of cancer and find new biomedical targets. To deal with the high-dimensional multidimensional cancer omics data (DNA methylation, mRNA expression, etc.), which can be used to discover new insight on identifying cancer subtypes, clustering methods are usually used to find an effective low-dimensional subspace of the original data and then cluster cancer samples in the reduced subspace. However, due to data-type diversity and big data volume, few methods can integrate these data and map them into an effective low-dimensional subspace. In this paper, we develop a dimension-reduction and data-integration method for indentifying cancer subtypes, named Scluster. First, Scluster, respectively, projects the different original data into the principal subspaces by an adaptive sparse reduced-rank regression method. Then, a fused patient-by-patient network is obtained for these subgroups through a scaled exponential similarity kernel method. Finally, candidate cancer subtypes are identified using spectral clustering method. We demonstrate the efficiency of our Scluster method using three cancers by jointly analyzing mRNA expression, miRNA expression, and DNA methylation data. The evaluation results and analyses show that Scluster is effective for predicting survival and identifies novel cancer subtypes of large-scale multi-omics data.
Collapse
|
32
|
Yuan L, Zhu L, Guo WL, Zhou X, Zhang Y, Huang Z, Huang DS. Nonconvex Penalty Based Low-Rank Representation and Sparse Regression for eQTL Mapping. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:1154-1164. [PMID: 28114074 DOI: 10.1109/tcbb.2016.2609420] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
This paper addresses the problem of accounting for confounding factors and expression quantitative trait loci (eQTL) mapping in the study of SNP-gene associations. The existing convex penalty based algorithm has limited capacity to keep main information of matrix in the process of reducing matrix rank. We present an algorithm, which use nonconvex penalty based low-rank representation to account for confounding factors and make use of sparse regression for eQTL mapping (NCLRS). The efficiency of the presented algorithm is evaluated by comparing the results of 18 synthetic datasets given by NCLRS and presented algorithm, respectively. The experimental results or biological dataset show that our approach is an effective tool to account for non-genetic effects than currently existing methods.
Collapse
|
33
|
He F, Zhu G, Wang YY, Zhao XM, Huang DS. PCID: A Novel Approach for Predicting Disease Comorbidity by Integrating Multi-Scale Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:678-686. [PMID: 27076462 DOI: 10.1109/tcbb.2016.2550443] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Disease comorbidity is the presence of one or more diseases along with a primary disorder, which causes additional pain to patients and leads to the failure of standard treatments compared with single diseases. Therefore, the identification of potential comorbidity can help prevent those comorbid diseases when treating a primary disease. Unfortunately, most of current known disease comorbidities are discovered occasionally in clinic, and our knowledge about comorbidity is far from complete. Despite the fact that many efforts have been made to predict disease comorbidity, the prediction accuracy of existing computational approaches needs to be improved. By investigating the factors underlying disease comorbidity, e.g., mutated genes and rewired protein-protein interactions (PPIs), we here present a novel algorithm to predict disease comorbidity by integrating multi-scale data ranging from genes to phenotypes. Benchmark results on real data show that our approach outperforms existing algorithms, and some of our novel predictions are validated with those reported in literature, indicating the effectiveness and predictive power of our approach. In addition, we identify some pathway and PPI patterns that underlie the co-occurrence between a primary disease and certain disease classes, which can help explain how the comorbidity is initiated from molecular perspectives.
Collapse
|
34
|
Liu JX, Wang D, Gao YL, Zheng CH, Shang JL, Liu F, Xu Y. A joint-L2,1-norm-constraint-based semi-supervised feature extraction for RNA-Seq data analysis. Neurocomputing 2017. [DOI: 10.1016/j.neucom.2016.09.083] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
|
35
|
Han F, Yang C, Wu YQ, Zhu JS, Ling QH, Song YQ, Huang DS. A Gene Selection Method for Microarray Data Based on Binary PSO Encoding Gene-to-Class Sensitivity Information. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:85-96. [PMID: 28182545 DOI: 10.1109/tcbb.2015.2465906] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Traditional gene selection methods for microarray data mainly considered the features' relevance by evaluating their utility for achieving accurate predication or exploiting data variance and distribution, and the selected genes were usually poorly explicable. To improve the interpretability of the selected genes as well as prediction accuracy, an improved gene selection method based on binary particle swarm optimization (BPSO) and prior information is proposed in this paper. In the proposed method, BPSO encoding gene-to-class sensitivity (GCS) information is used to perform gene selection. The gene-to-class sensitivity information, extracted from the samples by extreme learning machine (ELM), is encoded into the selection process in four aspects: initializing particles, updating the particles, modifying maximum velocity, and adopting mutation operation adaptively. Constrained by the gene-to-class sensitivity information, the new method can select functional gene subsets which are significantly sensitive to the samples' classes. With the few discriminative genes selected by the proposed method, ELM, K-nearest neighbor and support vector machine classifiers achieve much high prediction accuracy on five public microarray data, which in turn verifies the efficiency and effectiveness of the proposed gene selection method.
Collapse
|
36
|
Wei PJ, Zhang D, Xia J, Zheng CH. LNDriver: identifying driver genes by integrating mutation and expression data based on gene-gene interaction network. BMC Bioinformatics 2016; 17:467. [PMID: 28155630 PMCID: PMC5259866 DOI: 10.1186/s12859-016-1332-y] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND Cancer is a complex disease which is characterized by the accumulation of genetic alterations during the patient's lifetime. With the development of the next-generation sequencing technology, multiple omics data, such as cancer genomic, epigenomic and transcriptomic data etc., can be measured from each individual. Correspondingly, one of the key challenges is to pinpoint functional driver mutations or pathways, which contributes to tumorigenesis, from millions of functional neutral passenger mutations. RESULTS In this paper, in order to identify driver genes effectively, we applied a generalized additive model to mutation profiles to filter genes with long length and constructed a new gene-gene interaction network. Then we integrated the mutation data and expression data into the gene-gene interaction network. Lastly, greedy algorithm was used to prioritize candidate driver genes from the integrated data. We named the proposed method Length-Net-Driver (LNDriver). CONCLUSIONS Experiments on three TCGA datasets, i.e., head and neck squamous cell carcinoma, kidney renal clear cell carcinoma and thyroid carcinoma, demonstrated that the proposed method was effective. Also, it can identify not only frequently mutated drivers, but also rare candidate driver genes.
Collapse
Affiliation(s)
- Pi-Jing Wei
- College of Electrical Engineering and Automation, Anhui University, Hefei, Anhui 230601 China
| | - Di Zhang
- College of Computer Science and Technology, Anhui University, Hefei, Anhui 230601 China
| | - Junfeng Xia
- Institute of Health Sciences, Anhui University, Hefei, Anhui 230601 China
| | - Chun-Hou Zheng
- College of Computer Science and Technology, Anhui University, Hefei, Anhui 230601 China
| |
Collapse
|
37
|
Vural S, Wang X, Guda C. Classification of breast cancer patients using somatic mutation profiles and machine learning approaches. BMC SYSTEMS BIOLOGY 2016; 10 Suppl 3:62. [PMID: 27587275 PMCID: PMC5009820 DOI: 10.1186/s12918-016-0306-z] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
BACKGROUND The high degree of heterogeneity observed in breast cancers makes it very difficult to classify the cancer patients into distinct clinical subgroups and consequently limits the ability to devise effective therapeutic strategies. Several classification strategies based on ER/PR/HER2 expression or the expression profiles of a panel of genes have helped, but such methods often produce misleading results due to their dynamic nature. In contrast, somatic DNA mutations are relatively stable and lead to initiation and progression of many sporadic cancers. Hence in this study, we explore the use of gene mutation profiles to classify, characterize and predict the subgroups of breast cancers. RESULTS We analyzed the whole exome sequencing data from 358 ethnically similar breast cancer patients in The Cancer Genome Atlas (TCGA) project. Somatic and non-synonymous single nucleotide variants identified from each patient were assigned a quantitative score (C-score) that represents the extent of negative impact on the gene function. Using these scores with non-negative matrix factorization method, we clustered the patients into three subgroups. By comparing the clinical stage of patients, we identified an early-stage-enriched and a late-stage-enriched subgroup. Comparison of the mutation scores of early and late-stage-enriched subgroups identified 358 genes that carry significantly higher mutations rates in the late stage subgroup. Functional characterization of these genes revealed important functional gene families that carry a heavy mutational load in the late state rich subgroup of patients. Finally, using the identified subgroups, we also developed a supervised classification model to predict the stage of the patients. CONCLUSIONS This study demonstrates that gene mutation profiles can be effectively used with unsupervised machine-learning methods to identify clinically distinguishable breast cancer subgroups. The classification model developed in this method could provide a reasonable prediction of the cancer patients' stage solely based on their mutation profiles. This study represents the first use of only somatic mutation profile data to identify and predict breast cancer subgroups and this generic methodology can also be applied to other cancer datasets.
Collapse
Affiliation(s)
- Suleyman Vural
- Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE, 68198, USA
| | - Xiaosheng Wang
- School of Basic Medicine and Clinic Pharmacy, China Pharmaceutical University, Nanjing, 211198, China
| | - Chittibabu Guda
- Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE, 68198, USA. .,Bioinformatics and Systems Biology Core, University of Nebraska Medical Center, Omaha, NE, 68198, USA. .,Department of Biochemistry and Molecular Biology, University of Nebraska Medical Center, Omaha, NE, 68198, USA. .,Fred and Pamela Buffet Cancer Center, University of Nebraska Medical Center, Omaha, NE, 68198, USA.
| |
Collapse
|
38
|
Zhang X, Zong L, Liu X, Luo J. Constrained Clustering With Nonnegative Matrix Factorization. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2016; 27:1514-1526. [PMID: 26241978 DOI: 10.1109/tnnls.2015.2448653] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/04/2023]
Abstract
Nonnegative matrix factorization (NMF) and symmetric NMF (SymNMF) have been shown to be effective for clustering linearly separable data and nonlinearly separable data, respectively. Nevertheless, many practical applications demand constrained algorithms in which a small number of constraints in the form of must-link and cannot-link are available. In this paper, we propose an NMF-based constrained clustering framework in which the similarity between two points on a must-link is enforced to approximate 1 and the similarity between two points on a cannot-link is enforced to approximate 0. We then formulate the framework using NMF and SymNMF to deal with clustering of linearly separable data and nonlinearly separable data, respectively. Furthermore, we present multiplicative update rules to solve them and show the correctness and convergence. Experimental results on various text data sets, University of California, Irvine (UCI) data sets, and gene expression data sets demonstrate the superiority of our algorithms over existing constrained clustering algorithms.
Collapse
|
39
|
Deng SP, Zhu L, Huang DS. Predicting Hub Genes Associated with Cervical Cancer through Gene Co-Expression Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:27-35. [PMID: 26415208 DOI: 10.1109/tcbb.2015.2476790] [Citation(s) in RCA: 86] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Cervical cancer is the third most common malignancy in women worldwide. It remains a leading cause of cancer-related death for women in developing countries. In order to contribute to the treatment of the cervical cancer, in our work, we try to find a few key genes resulting in the cervical cancer. Employing functions of several bioinformatics tools, we selected 143 differentially expressed genes (DEGs) associated with the cervical cancer. The results of bioinformatics analysis show that these DEGs play important roles in the development of cervical cancer. Through comparing two differential co-expression networks (DCNs) at two different states, we found a common sub-network and two differential sub-networks as well as some hub genes in three sub-networks. Moreover, some of the hub genes have been reported to be related to the cervical cancer. Those hub genes were analyzed from Gene Ontology function enrichment, pathway enrichment and protein binding three aspects. The results can help us understand the development of the cervical cancer and guide further experiments about the cervical cancer.
Collapse
|
40
|
Bonilla-Huerta E, Hernández-Montiel A, Caporal RM, López MA. Hybrid Framework Using Multiple-Filters and an Embedded Approach for an Efficient Selection and Classification of Microarray Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:12-26. [PMID: 26336138 DOI: 10.1109/tcbb.2015.2474384] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
A hybrid framework composed of two stages for gene selection and classification of DNA microarray data is proposed. At the first stage, five traditional statistical methods are combined for preliminary gene selection (Multiple Fusion Filter). Then, different relevant gene subsets are selected by using an embedded Genetic Algorithm (GA), Tabu Search (TS), and Support Vector Machine (SVM). A gene subset, consisting of the most relevant genes, is obtained from this process, by analyzing the frequency of each gene in the different gene subsets. Finally, the most frequent genes are evaluated by the embedded approach to obtain a final relevant small gene subset with high performance. The proposed method is tested in four DNA microarray datasets. From simulation study, it is observed that the proposed approach works better than other methods reported in the literature.
Collapse
|
41
|
Han K, Lee J. GeneNetFinder2: Improved Inference of Dynamic Gene Regulatory Relations with Multiple Regulators. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:4-11. [PMID: 26886731 DOI: 10.1109/tcbb.2015.2450728] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
A gene involved in complex regulatory interactions may have multiple regulators since gene expression in such interactions is often controlled by more than one gene. Another thing that makes gene regulatory interactions complicated is that regulatory interactions are not static, but change over time during the cell cycle. Most research so far has focused on identifying gene regulatory relations between individual genes in a particular stage of the cell cycle. In this study we developed a method for identifying dynamic gene regulations of several types from the time-series gene expression data. The method can find gene regulations with multiple regulators that work in combination or individually as well as those with single regulators. The method has been implemented as the second version of GeneNetFinder (hereafter called GeneNetFinder2) and tested on several gene expression datasets. Experimental results with gene expression data revealed the existence of genes that are not regulated by individual genes but rather by a combination of several genes. Such gene regulatory relations cannot be found by conventional methods. Our method finds such regulatory relations as well as those with multiple, independent regulators or single regulators, and represents gene regulatory relations as a dynamic network in which different gene regulatory relations are shown in different stages of the cell cycle. GeneNetFinder2 is available at http://bclab.inha.ac.kr/GeneNetFinder and will be useful for modeling dynamic gene regulations with multiple regulators.
Collapse
|
42
|
Hsieh SY, Chou YC. A Faster cDNA Microarray Gene Expression Data Classifier for Diagnosing Diseases. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:43-54. [PMID: 26336139 DOI: 10.1109/tcbb.2015.2474389] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Profiling cancer molecules has several advantages; however, using microarray technology in routine clinical diagnostics is challenging for physicians. The classification of microarray data has two main limitations: 1) the data set is unreliable for building classifiers; and 2) the classifiers exhibit poor performance. Current microarray classification algorithms typically yield a high rate of false-positives cases, which is unacceptable in diagnostic applications. Numerous algorithms have been developed to detect false-positive cases; however, they require a considerable computation time. To address this problem, this study enhanced a previously proposed gene expression graph (GEG)-based classifier to shorten the computation time. The modified classifier filters genes by using an edge weight to determine their significance, thereby facilitating accurate comparison and classification. This study experimentally compared the proposed classifier with a GEG-based classifier by using real data and benchmark tests. The results show that the proposed classifier is faster at detecting false-positives.
Collapse
|
43
|
Yu Z, Chen H, You J, Liu J, Wong HS, Han G, Li L. Adaptive Fuzzy Consensus Clustering Framework for Clustering Analysis of Cancer Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:887-901. [PMID: 26357330 DOI: 10.1109/tcbb.2014.2359433] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Performing clustering analysis is one of the important research topics in cancer discovery using gene expression profiles, which is crucial in facilitating the successful diagnosis and treatment of cancer. While there are quite a number of research works which perform tumor clustering, few of them considers how to incorporate fuzzy theory together with an optimization process into a consensus clustering framework to improve the performance of clustering analysis. In this paper, we first propose a random double clustering based cluster ensemble framework (RDCCE) to perform tumor clustering based on gene expression data. Specifically, RDCCE generates a set of representative features using a randomly selected clustering algorithm in the ensemble, and then assigns samples to their corresponding clusters based on the grouping results. In addition, we also introduce the random double clustering based fuzzy cluster ensemble framework (RDCFCE), which is designed to improve the performance of RDCCE by integrating the newly proposed fuzzy extension model into the ensemble framework. RDCFCE adopts the normalized cut algorithm as the consensus function to summarize the fuzzy matrices generated by the fuzzy extension models, partition the consensus matrix, and obtain the final result. Finally, adaptive RDCFCE (A-RDCFCE) is proposed to optimize RDCFCE and improve the performance of RDCFCE further by adopting a self-evolutionary process (SEPP) for the parameter set. Experiments on real cancer gene expression profiles indicate that RDCFCE and A-RDCFCE works well on these data sets, and outperform most of the state-of-the-art tumor clustering algorithms.
Collapse
|
44
|
Zhou Y, Zhao H, Pan X, Shang L. Deception detecting from speech signal using relevance vector machine and non-linear dynamics features. Neurocomputing 2015. [DOI: 10.1016/j.neucom.2014.04.083] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
|
45
|
Deng SP, Zhu L, Huang DS. Mining the bladder cancer-associated genes by an integrated strategy for the construction and analysis of differential co-expression networks. BMC Genomics 2015; 16 Suppl 3:S4. [PMID: 25707808 PMCID: PMC4331807 DOI: 10.1186/1471-2164-16-s3-s4] [Citation(s) in RCA: 89] [Impact Index Per Article: 8.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Open
Abstract
Background Bladder cancer is the most common malignant tumor of the urinary system and it is a heterogeneous disease with both superficial and invasive growth. However, its aetiological agent is still unclear. And it is indispensable to find key genes or modules causing the bladder cancer. Based on gene expression microarray datasets, constructing differential co-expression networks (DCNs) is an important method to investigate diseases and there have been some relevant good tools such as R package 'WGCNA', 'DCGL'. Results Employing an integrated strategy, 36 up-regulated differentially expressed genes (DEGs) and 356 down-regulated DEGs were selected and main functions of those DEGs are cellular physiological precess(24 up-regulated DEGs; 167 down-regulated DEGs) and cellular metabolism (19 up-regulated DEGs; 104 down-regulated DEGs). The up-regulated DEGs are mainly involved in the the pathways related to "metabolism". By comparing two DCNs between the normal and cancer states, we found some great changes in hub genes and topological structure, which suggest that the modules of two different DCNs change a lot. Especially, we screened some hub genes of a differential subnetwork between the normal and the cancer states and then do bioinformatics analysis for them. Conclusions Through constructing and analyzing two differential co-expression networks at different states using the screened DEGs, we found some hub genes associated with the bladder cancer. The results of the bioinformatics analysis for those hub genes will support the biological experiments and the further treatment of the bladder cancer.
Collapse
|
46
|
You ZH, Zhu L, Zheng CH, Yu HJ, Deng SP, Ji Z. Prediction of protein-protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set. BMC Bioinformatics 2014; 15 Suppl 15:S9. [PMID: 25474679 PMCID: PMC4271571 DOI: 10.1186/1471-2105-15-s15-s9] [Citation(s) in RCA: 84] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Identifying protein-protein interactions (PPIs) is essential for elucidating protein functions and understanding the molecular mechanisms inside the cell. However, the experimental methods for detecting PPIs are both time-consuming and expensive. Therefore, computational prediction of protein interactions are becoming increasingly popular, which can provide an inexpensive way of predicting the most likely set of interactions at the entire proteome scale, and can be used to complement experimental approaches. Although much progress has already been achieved in this direction, the problem is still far from being solved and new approaches are still required to overcome the limitations of the current prediction models. RESULTS In this work, a sequence-based approach is developed by combining a novel Multi-scale Continuous and Discontinuous (MCD) feature representation and Support Vector Machine (SVM). The MCD representation gives adequate consideration to the interactions between sequentially distant but spatially close amino acid residues, thus it can sufficiently capture multiple overlapping continuous and discontinuous binding patterns within a protein sequence. An effective feature selection method mRMR was employed to construct an optimized and more discriminative feature set by excluding redundant features. Finally, a prediction model is trained and tested based on SVM algorithm to predict the interaction probability of protein pairs. CONCLUSIONS When performed on the yeast PPIs data set, the proposed approach achieved 91.36% prediction accuracy with 91.94% precision at the sensitivity of 90.67%. Extensive experiments are conducted to compare our method with the existing sequence-based method. Experimental results show that the performance of our predictor is better than several other state-of-the-art predictors, whose average prediction accuracy is 84.91%, sensitivity is 83.24%, and precision is 86.12%. Achieved results show that the proposed approach is very promising for predicting PPI, so it can be a useful supplementary tool for future proteomics studies. The source code and the datasets are freely available at http://csse.szu.edu.cn/staff/youzh/MCDPPI.zip for academic use.
Collapse
|
47
|
Gene differential coexpression analysis based on biweight correlation and maximum clique. BMC Bioinformatics 2014; 15 Suppl 15:S3. [PMID: 25474074 PMCID: PMC4271563 DOI: 10.1186/1471-2105-15-s15-s3] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open
Abstract
Differential coexpression analysis usually requires the definition of 'distance' or 'similarity' between measured datasets. Until now, the most common choice is Pearson correlation coefficient. However, Pearson correlation coefficient is sensitive to outliers. Biweight midcorrelation is considered to be a good alternative to Pearson correlation since it is more robust to outliers. In this paper, we introduce to use Biweight Midcorrelation to measure 'similarity' between gene expression profiles, and provide a new approach for gene differential coexpression analysis. Firstly, we calculate the biweight midcorrelation coefficients between all gene pairs. Then, we filter out non-informative correlation pairs using the 'half-thresholding' strategy and calculate the differential coexpression value of gene, The experimental results on simulated data show that the new approach performed better than three previously published differential coexpression analysis (DCEA) methods. Moreover, we use the maximum clique analysis to gene subset included genes identified by our approach and previously reported T2D-related genes, many additional discoveries can be found through our method.
Collapse
|
48
|
Bevilacqua V, Pietroleonardo N, Giannino E, Stroppa F, Simone D, Pesole G, Picardi E. EasyCluster2: an improved tool for clustering and assembling long transcriptome reads. BMC Bioinformatics 2014; 15 Suppl 15:S7. [PMID: 25474441 PMCID: PMC4271567 DOI: 10.1186/1471-2105-15-s15-s7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Expressed sequences (e.g. ESTs) are a strong source of evidence to improve gene structures and predict reliable alternative splicing events. When a genome assembly is available, ESTs are suitable to generate gene-oriented clusters through the well-established EasyCluster software. Nowadays, EST-like sequences can be massively produced using Next Generation Sequencing (NGS) technologies. In order to handle genome-scale transcriptome data, we present here EasyCluster2, a reimplementation of EasyCluster able to speed up the creation of gene-oriented clusters and facilitate downstream analyses as the assembly of full-length transcripts and the detection of splicing isoforms. RESULTS EasyCluster2 has been developed to facilitate the genome-based clustering of EST-like sequences generated through the NGS 454 technology. Reads mapped onto the reference genome can be uploaded using the standard GFF3 file format. Alignment parsing is initially performed to produce a first collection of pseudo-clusters by grouping reads according to the overlap of their genomic coordinates on the same strand. EasyCluster2 then refines read grouping by including in each cluster only reads sharing at least one splice site and optionally performs a Smith-Waterman alignment in the region surrounding splice sites in order to correct for potential alignment errors. In addition, EasyCluster2 can include unspliced reads, which generally account for >50% of 454 datasets, and collapses overlapping clusters. Finally, EasyCluster2 can assemble full-length transcripts using a Directed-Acyclic-Graph-based strategy, simplifying the identification of alternative splicing isoforms, thanks also to the implementation of the widespread AStalavista methodology. Accuracy and performances have been tested on real as well as simulated datasets. CONCLUSIONS EasyCluster2 represents a unique tool to cluster and assemble transcriptome reads produced with 454 technology, as well as ESTs and full-length transcripts. The clustering procedure is enhanced with the employment of genome annotations and unspliced reads. Overall, EasyCluster2 is able to perform an effective detection of splicing isoforms, since it can refine exon-exon junctions and explore alternative splicing without known reference transcripts. Results in GFF3 format can be browsed in the UCSC Genome Browser. Therefore, EasyCluster2 is a powerful tool to generate reliable clusters for gene expression studies, facilitating the analysis also to researchers not skilled in bioinformatics.
Collapse
|
49
|
|
50
|
Wang JJY, Gao X. Max-min distance nonnegative matrix factorization. Neural Netw 2014; 61:75-84. [PMID: 25462636 DOI: 10.1016/j.neunet.2014.10.006] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2014] [Revised: 10/12/2014] [Accepted: 10/16/2014] [Indexed: 11/16/2022]
Abstract
Nonnegative Matrix Factorization (NMF) has been a popular representation method for pattern classification problems. It tries to decompose a nonnegative matrix of data samples as the product of a nonnegative basis matrix and a nonnegative coefficient matrix. The columns of the coefficient matrix can be used as new representations of these data samples. However, traditional NMF methods ignore class labels of the data samples. In this paper, we propose a novel supervised NMF algorithm to improve the discriminative ability of the new representation by using the class labels. Using the class labels, we separate all the data sample pairs into within-class pairs and between-class pairs. To improve the discriminative ability of the new NMF representations, we propose to minimize the maximum distance of the within-class pairs in the new NMF space, and meanwhile to maximize the minimum distance of the between-class pairs. With this criterion, we construct an objective function and optimize it with regard to basis and coefficient matrices, and slack variables alternatively, resulting in an iterative algorithm. The proposed algorithm is evaluated on three pattern classification problems and experiment results show that it outperforms the state-of-the-art supervised NMF methods.
Collapse
Affiliation(s)
- Jim Jing-Yan Wang
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia.
| | | |
Collapse
|