1
Yan J, Qu W, Li X, Wang R, Tan J. GATLGEMF: A graph attention model with line graph embedding multi-complex features for ncRNA-protein interactions prediction. Comput Biol Chem 2024; 108:108000. [PMID: 38070456 DOI: 10.1016/j.compbiolchem.2023.108000] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2023] [Revised: 11/27/2023] [Accepted: 12/03/2023] [Indexed: 01/22/2024]
Abstract
Non-coding RNAs (ncRNAs) play an important role in many fundamental biological processes and may be closely associated with many complex human diseases. ncRNAs exert their functions by interacting with proteins. Therefore, identifying novel ncRNA-protein interactions (NPIs) is important for understanding how ncRNAs function. Computational approaches have the advantage of low cost and high efficiency. Machine learning and deep learning have achieved great success by making full use of sequence and structure information. Graph neural networks (GNNs) are deep learning algorithms suited to link prediction in complex networks, as they can extract and discover features in graph topology data. In this study, we propose a new computational model called GATLGEMF. We use a line graph transformation strategy to obtain the most valuable feature information and feed these features into an attention network to predict NPIs. The results on four benchmark datasets show that our method achieves superior performance. We further compare GATLGEMF with existing state-of-the-art methods to evaluate the model performance. GATLGEMF shows the best performance, with an area under the curve (AUC) of 92.41% and 98.93% on the RPI2241 and NPInter v2.0 datasets, respectively. In addition, a case study shows that GATLGEMF can predict new interactions based on known interactions. The source code is available at https://github.com/JianjunTan-Beijing/GATLGEMF.
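The line graph transformation mentioned in the abstract can be illustrated in a few lines. In a line graph, every edge of the original graph becomes a node, and two such nodes are connected when the original edges share an endpoint, so a question about an edge (an interaction) becomes a question about a node. The sketch below is a minimal standard-library illustration of that general construction, not the GATLGEMF implementation; the function name and the toy ncRNA and protein labels are hypothetical.

```python
from itertools import combinations

def line_graph(edges):
    """Return the line graph of an undirected graph as an adjacency dict.

    Each original edge becomes a node; two nodes are adjacent when the
    corresponding original edges share at least one endpoint.
    """
    # Normalise edge orientation and drop duplicates while keeping order.
    nodes = list(dict.fromkeys(tuple(sorted(e)) for e in edges))
    adjacency = {n: set() for n in nodes}
    for a, b in combinations(nodes, 2):
        if set(a) & set(b):  # shared endpoint in the original graph
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

# Hypothetical toy interaction network (labels are illustrative only).
toy_edges = [("rna1", "protA"), ("rna1", "protB"), ("rna2", "protB")]
lg = line_graph(toy_edges)
```

Here the interaction `("protB", "rna1")` ends up adjacent to both other interactions, because it shares `rna1` with one and `protB` with the other.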
Affiliation(s)
- Jing Yan
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing University of Technology, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing 100124, China
- Wenyan Qu
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing University of Technology, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing 100124, China
- Xiaoyi Li
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing University of Technology, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing 100124, China
- Ruobing Wang
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing University of Technology, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing 100124, China
- Jianjun Tan
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing University of Technology, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing 100124, China.
2
Zhang Q, Zhang Y, Wang S, Chen ZH, Gribova V, Filaretov VF, Huang DS. Predicting In-Vitro DNA-Protein Binding With a Spatially Aligned Fusion of Sequence and Shape. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3144-3153. [PMID: 34882561 DOI: 10.1109/tcbb.2021.3133869] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Discovery of transcription factor binding sites (TFBSs) is of primary importance for understanding the underlying binding mechanisms and gene regulation processes. Growing evidence indicates that, apart from the primary DNA sequence, the DNA shape landscape has a significant influence on transcription factor binding preference. To effectively model the co-influence of sequence and shape features, we emphasize the importance of the position information of sequence motifs and shape patterns. In this paper, we propose a novel deep learning-based architecture, named hybridShape eDeepCNN, for TFBS prediction, which integrates DNA sequence and shape information in a spatially aligned manner. Our model utilizes the power of the multi-layer convolutional neural network and constructs an independent subnetwork to adapt to the distinct data distributions of heterogeneous features. In addition, we explore the use of continuous embedding vectors as the representation of DNA sequences. Based on experiments on 20 in-vitro datasets derived from universal protein binding microarrays (uPBMs), we demonstrate the superiority of our proposed method and validate the underlying design logic.
3
Shen Z, Zhang Q, Han K, Huang DS. A Deep Learning Model for RNA-Protein Binding Preference Prediction Based on Hierarchical LSTM and Attention Network. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:753-762. [PMID: 32750884 DOI: 10.1109/tcbb.2020.3007544] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Indexed: 06/11/2023]
Abstract
The attention mechanism has the ability to find important information in a sequence. The regions of an RNA sequence that can bind to proteins are more important than those that cannot. Neither conventional methods nor deep learning-based methods are good at learning this information. In this study, LSTM is used to extract the correlation features between different sites in the RNA sequence. We also use the attention mechanism to evaluate the importance of different sites in the RNA sequence. We obtain the optimal combination of k-mer length, k-mer stride window, k-mer sentence length, k-mer sentence stride window, and optimization function through hyperparameter experiments. The results show that our method performs better than other methods. We test the effect of changes in k-mer vector length on model performance and show how model performance changes under various k-mer related parameter settings. Furthermore, we investigate the effect of the attention mechanism and RNA structure data on model performance.
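The four k-mer hyperparameters named in the abstract (k-mer length, k-mer stride window, k-mer sentence length, k-mer sentence stride window) can be made concrete with a small sketch. It assumes that a "sentence" is a run of consecutive k-mer tokens taken with its own sliding window; this reading, the function names, and the example sequence are assumptions for illustration, not the paper's code.

```python
def kmers(seq, k, stride):
    """Split a sequence into k-mers, advancing `stride` positions per step."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

def sentences(tokens, length, stride):
    """Group consecutive k-mer tokens into overlapping 'sentences'."""
    return [
        tokens[i:i + length]
        for i in range(0, len(tokens) - length + 1, stride)
    ]

rna = "AUGGCUACGUAG"  # hypothetical 12-nt sequence
tokens = kmers(rna, k=3, stride=1)             # 10 overlapping 3-mers
sents = sentences(tokens, length=4, stride=2)  # 4 sentences of 4 tokens
```

In a hierarchical model of this kind, one recurrent layer would typically read the tokens inside each sentence and a second would read the sequence of sentence summaries, which is why both window sizes are tuned.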
4
Hu P, Huang YA, Mei J, Leung H, Chen ZH, Kuang ZM, You ZH, Hu L. Learning from low-rank multimodal representations for predicting disease-drug associations. BMC Med Inform Decis Mak 2021; 21:308. [PMID: 34736437 PMCID: PMC8567544 DOI: 10.1186/s12911-021-01648-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/20/2021] [Accepted: 10/06/2021] [Indexed: 12/15/2022] Open
Abstract
Background: Disease-drug associations provide essential information for drug discovery and disease treatment. Many disease-drug associations remain unobserved or unknown, and trials to confirm these associations are time-consuming and expensive. To better understand and explore these valuable associations, it would be useful to develop computational methods for predicting unobserved disease-drug associations. With the advent of various datasets describing diseases and drugs, it has become more feasible to build a model describing the potential correlation between diseases and drugs.
Results: In this work, we propose a new prediction method, called LMFDA, which works in several stages. First, it studies the drug chemical structure, disease MeSH descriptors, disease-related phenotypic terms, and drug-drug interactions. On this basis, similarity networks from different sources are constructed to enrich the representation of drugs and diseases. Based on the fused disease similarity network and drug similarity network, LMFDA calculates the association score of each disease-drug pair in the database. This method achieves good performance on Fdataset and Cdataset, with AUROCs of 91.6% and 92.1%, respectively, higher than those of many existing computational models.
Conclusions: The novelty of LMFDA lies in the introduction of multimodal fusion using low-rank tensors to fuse multiple similarity networks, combined with matrix completion to predict potential associations. We have demonstrated that LMFDA displays excellent network integration ability for accurate disease-drug association inference and achieves substantial improvement over the advanced approach. Overall, experimental results on two real-world network datasets demonstrate that LMFDA is able to deliver excellent detection performance. The results also suggest that refining similarity networks with as much domain knowledge as possible is a promising direction for drug repositioning.
Affiliation(s)
- Pengwei Hu
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- Yu-An Huang
- The Hong Kong Polytechnic University, Hong Kong SAR, China
- Henry Leung
- Electrical and Computer Engineering, University of Calgary, Calgary, Canada
- Zhan-Heng Chen
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- Ze-Min Kuang
- Beijing Anzhen Hospital of Capital Medical University, Beijing, China
- Zhu-Hong You
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China.
- Lun Hu
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China.
5
Wang S, He Y, Chen Z, Zhang Q. FCNGRU: Locating Transcription Factor Binding Sites by combing Fully Convolutional Neural Network with Gated Recurrent Unit. IEEE J Biomed Health Inform 2021; 26:1883-1890. [PMID: 34613923 DOI: 10.1109/jbhi.2021.3117616] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/08/2022]
Abstract
Deciphering the relationship between transcription factors (TFs) and DNA sequences is very helpful for computational inference of gene regulation and a comprehensive understanding of gene regulation mechanisms. Transcription factor binding sites (TFBSs) are specific short DNA sequences that play a pivotal role in controlling gene expression through interaction with TF proteins. Although many computational and deep learning methods have recently been proposed to predict the sequence specificity of TF-DNA binding, there is still a lack of effective methods to directly locate TFBSs. To address this problem, we propose FCNGRU, which combines a fully convolutional neural network (FCN) with a gated recurrent unit (GRU) to directly locate TFBSs. Furthermore, we present a two-task framework (FCNGRU-double): one task is classification at the nucleotide level, which predicts the probability of each nucleotide and locates TFBSs, and the other is regression at the sequence level, which predicts the intensity of each sequence. A series of experiments is conducted on 45 in-vitro datasets collected from the UniPROBE database and derived from universal protein binding microarrays (uPBMs). Compared with competing methods, FCNGRU-double achieves much better results on these datasets. Moreover, FCNGRU-double has an advantage over a single-task framework, FCNGRU-single, which contains only the branch for locating TFBSs. In addition, we combine our results with in vivo datasets for further analysis and discussion. The source code is available at https://github.com/wangguoguoa/FCNGRU.
6
Zhang Q, Yu W, Han K, Nandi AK, Huang DS. Multi-Scale Capsule Network for Predicting DNA-Protein Binding Sites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1793-1800. [PMID: 32960766 DOI: 10.1109/tcbb.2020.3025579] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 06/11/2023]
Abstract
Discovering DNA-protein binding sites, also known as motif discovery, is the foundation for further analysis of transcription factors (TFs). Deep learning algorithms such as convolutional neural networks (CNNs) have been introduced to the motif discovery task and have achieved state-of-the-art performance. However, due to the limitations of CNNs, CNN-based motif discovery methods do not take full advantage of the large-scale sequencing data generated by high-throughput sequencing technology. Hence, in this paper we propose a multi-scale capsule network architecture (MSC) that integrates a multi-scale CNN, a CNN variant able to extract motif features of different lengths, with a capsule network, a novel type of artificial neural network architecture aimed at improving on CNNs. The proposed method is tested on real ChIP-seq datasets, and the experimental results show a considerable improvement compared with two well-tested deep learning-based sequence models, DeepBind and DeepSEA.
7
Zhang Q, Wang D, Han K, Huang DS. Predicting TF-DNA Binding Motifs from ChIP-seq Datasets Using the Bag-Based Classifier Combined With a Multi-Fold Learning Scheme. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1743-1751. [PMID: 32946398 DOI: 10.1109/tcbb.2020.3025007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/11/2023]
Abstract
The rapid development of high-throughput sequencing technology provides unique opportunities for studying transcription factor binding sites, but also brings new computational challenges. Recently, a series of discriminative motif discovery (DMD) methods have been proposed and offer promising solutions for addressing these challenges. However, because of the huge computational cost, most of them have to choose approximate schemes that either sacrifice the accuracy of motif representation or tune motif parameters indirectly. In this paper, we propose a bag-based classifier combined with a multi-fold learning scheme (BCMF) to discover motifs from ChIP-seq datasets. First, BCMF naturally formulates input sequences as labeled bags. Then, a bag-based classifier, combined with a bag feature extraction strategy, is applied to construct the objective function, and a multi-fold learning scheme is used to solve it. Compared with existing DMD tools, BCMF features three improvements: 1) learning the position weight matrix (PWM) directly in a continuous space; 2) representing a positive bag with a feature fused from its k "most positive" patterns; and 3) applying a more advanced learning scheme. The experimental results on 134 ChIP-seq datasets show that BCMF substantially outperforms existing DMD methods (including DREME, HOMER, XXmotif, motifRG, EDCOD and our previous work).
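The idea of representing a bag by its k "most positive" patterns can be sketched as follows. Each sequence is treated as a bag of windows, every window is scored against a position weight matrix, and the k highest scores are fused. Averaging is used as the fusion step here purely as an assumption, since the abstract does not state how BCMF fuses the patterns; the PWM values and function names are likewise hypothetical.

```python
def window_scores(seq, pwm):
    """Score every window of the sequence against a position weight matrix.

    `pwm` is a list of dicts, one per motif position, mapping base -> weight.
    """
    width = len(pwm)
    return [
        sum(pwm[j][seq[i + j]] for j in range(width))
        for i in range(len(seq) - width + 1)
    ]

def bag_score(seq, pwm, k):
    """Fuse the k highest window scores by averaging (assumed fusion rule)."""
    top = sorted(window_scores(seq, pwm), reverse=True)[:k]
    if not top:
        raise ValueError("sequence shorter than the motif, or k is zero")
    return sum(top) / len(top)

# Hypothetical 2-position PWM that prefers the pattern "GC".
pwm = [
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
]
scores = window_scores("AGCGT", pwm)   # windows: AG, GC, CG, GT
fused = bag_score("AGCGT", pwm, k=2)   # mean of the two best windows
```

Because the weights are real-valued, a representation like this can be optimised directly in a continuous space, which is the first improvement the abstract lists.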
8
Lin X, Zhang X. Identification of hot regions in hub protein-protein interactions by clustering and PPRA optimization. BMC Med Inform Decis Mak 2021; 21:143. [PMID: 33941163 PMCID: PMC8094484 DOI: 10.1186/s12911-020-01350-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/19/2020] [Accepted: 11/23/2020] [Indexed: 11/24/2022] Open
Abstract
Background: Protein–protein interactions (PPIs) are the core of protein function and provide an effective means of understanding function at the cellular level. Identification of PPIs is a crucial foundation for predicting drug-target interactions. Although traditional biological experiments for identifying PPIs are becoming available, they remain extremely time-consuming and expensive. Therefore, various computational models have been introduced to identify PPIs. In a protein-protein interaction network (PPIN), a hub protein, as a highly connected node, can coordinate PPIs and carry out biological functions. Detecting hot regions on hub protein interaction interfaces is an issue worth investigating.
Methods: Two clustering methods, LCSD and RCNOIK, are used in this paper to detect hot regions on hub protein interaction interfaces. To improve the efficiency of the K-means clustering algorithm, the best k value is selected by calculating the sum of squared distances and the average silhouette coefficient. Then, the residue coordination number optimization strategy is used to calculate the average coordination number. In addition, the pair potentials and relative ASA (PPRA) strategy is used to optimize the predicted results.
Results: The DataHub and PartyHub datasets were used to train the two clustering models, respectively. Experiments show that LCSD and RCNOIK have the same coverage on the hub protein datasets, and RCNOIK is slightly higher than LCSD in precision. The predicted hot regions are closer to the standard hot regions.
Conclusions: This paper optimizes two clustering methods based on the PPRA strategy. Compared with well-known approaches for hot region prediction, our improved methods have higher reliability and are effective for predicting hot regions on hub protein interaction interfaces.
Affiliation(s)
- Xiaoli Lin
- Hubei Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, 430065, People's Republic of China.
- Xiaolong Zhang
- Hubei Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, 430065, People's Republic of China
9
Wang J, Zhao Y, Gong W, Liu Y, Wang M, Huang X, Tan J. EDLMFC: an ensemble deep learning framework with multi-scale features combination for ncRNA-protein interaction prediction. BMC Bioinformatics 2021; 22:133. [PMID: 33740884 PMCID: PMC7980572 DOI: 10.1186/s12859-021-04069-9] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 01/23/2021] [Accepted: 03/05/2021] [Indexed: 11/29/2022] Open
Abstract
Background: Non-coding RNA (ncRNA) and protein interactions play essential roles in various physiological and pathological processes. The experimental methods used for identifying ncRNA–protein interactions are time-consuming and labor-intensive. Therefore, there is an increasing demand for computational methods to accurately and efficiently predict ncRNA–protein interactions.
Results: In this work, we present an ensemble deep learning-based method, EDLMFC, to predict ncRNA–protein interactions using a combination of multi-scale features, including primary sequence features, secondary structure sequence features, and tertiary structure features. Conjoint k-mer is used to extract protein/ncRNA sequence features, which are integrated with tertiary structure features and then fed into an ensemble deep learning model that combines a convolutional neural network (CNN), to learn dominating biological information, with a bi-directional long short-term memory network (BLSTM), to capture long-range dependencies among the features identified by the CNN. Compared with other state-of-the-art methods under five-fold cross-validation, EDLMFC shows the best performance, with accuracies of 93.8%, 89.7%, and 86.1% on the RPI1807, NPInter v2.0, and RPI488 datasets, respectively. The results of the independent test demonstrate that EDLMFC can effectively predict potential ncRNA–protein interactions from different organisms. Furthermore, EDLMFC is also shown to successfully predict hub ncRNAs and proteins present in ncRNA–protein networks of Mus musculus.
Conclusions: In general, our proposed method EDLMFC improves the accuracy of ncRNA–protein interaction predictions and is expected to provide helpful guidance for research on ncRNA functions. The source code of EDLMFC and the datasets used in this work are available at https://github.com/JingjingWang-87/EDLMFC.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04069-9.
Affiliation(s)
- Jingjing Wang
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
- Yanpeng Zhao
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
- Weikang Gong
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
- Yang Liu
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
- Mei Wang
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
- Xiaoqian Huang
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
- Jianjun Tan
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China.
10
Zhang Q, Shen Z, Huang DS. Predicting in-vitro Transcription Factor Binding Sites Using DNA Sequence + Shape. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:667-676. [PMID: 31634140 DOI: 10.1109/tcbb.2019.2947461] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Indexed: 06/10/2023]
Abstract
Discovery of transcription factor binding sites (TFBSs) is essential for understanding the underlying binding mechanisms and cellular functions. Recently, convolutional neural networks (CNNs) have succeeded in predicting TFBSs from primary DNA sequences. In addition to DNA sequences, several lines of evidence suggest that protein-DNA binding is partly mediated by properties of DNA shape. Although many methods have been proposed to jointly account for DNA sequences and shape properties in predicting TFBSs, they ignore the power of combining deep learning with DNA sequence + shape. Therefore, in this paper we develop a deep-learning-based sequence + shape framework (DLBSS), which appropriately integrates DNA sequences and shape properties, to better understand protein-DNA binding preference. This method uses a shared CNN to find common patterns in DNA sequences and their corresponding shape features, which are then concatenated to compute a predicted value. Using 66 in-vitro datasets derived from universal protein binding microarrays (uPBMs), we show that our proposed method DLBSS significantly improves the performance of TFBS prediction. In addition, through a series of experiments, we explain why the shared CNN should be used and explore the performance of DLBSS with a deeper CNN.
11
Yuan X, Yu J, Xi J, Yang L, Shang J, Li Z, Duan J. CNV_IFTV: An Isolation Forest and Total Variation-Based Detection of CNVs from Short-Read Sequencing Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:539-549. [PMID: 31180897 DOI: 10.1109/tcbb.2019.2920889] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Indexed: 06/09/2023]
Abstract
Accurate detection of copy number variations (CNVs) from short-read sequencing data is challenging due to the uneven distribution of reads and the unbalanced amplitudes of gains and losses. The direct use of read depths to measure CNVs tends to limit performance. Thus, robust computational approaches equipped with appropriate statistics are required to detect CNV regions and boundaries. This study proposes a new method called CNV_IFTV to address this need. CNV_IFTV assigns an anomaly score to each genome bin through a collection of isolation trees. The trees are trained with the isolation forest algorithm by subsampling the measured read depths. With the anomaly scores, CNV_IFTV uses a total variation model to smooth adjacent bins, leading to a denoised score profile. Finally, a statistical model is established to test the denoised scores for calling CNVs. CNV_IFTV is tested on both simulated and real data in comparison with several peer methods. The results indicate that the proposed method outperforms the peer methods. CNV_IFTV is a reliable tool for detecting CNVs from short-read sequencing data, even at low coverage and low tumor purity. The detection results on tumor samples can help evaluate known cancer genes and predict target drugs for disease diagnosis.
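The anomaly score assigned by a collection of isolation trees follows the standard isolation forest definition, s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the mean depth at which a point is isolated and c(n) normalises for the subsample size n. The sketch below computes only this score from a given mean depth; how CNV_IFTV builds its trees over read-depth bins is not shown, and the function names and example numbers are assumptions for illustration.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def average_path_length(n):
    """c(n): average path length of an unsuccessful search in a binary tree."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0  # exact value, since H(1) = 1
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_depth, n):
    """Isolation forest score; values near 1 indicate likely anomalies."""
    return 2.0 ** (-mean_depth / average_path_length(n))

# Hypothetical bins scored with a subsample size of 256.
shallow = anomaly_score(2.0, 256)   # isolated quickly, so more anomalous
deep = anomaly_score(12.0, 256)     # hard to isolate, so more typical
```

A bin that is isolated after only a few splits scores higher than one that needs many, which is what allows unusual read depths to stand out before the total variation smoothing step.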
12
Wang S, Zhang Q, Shen Z, He Y, Chen ZH, Li J, Huang DS. Predicting transcription factor binding sites using DNA shape features based on shared hybrid deep learning architecture. MOLECULAR THERAPY-NUCLEIC ACIDS 2021; 24:154-163. [PMID: 33767912 PMCID: PMC7972936 DOI: 10.1016/j.omtn.2021.02.014] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 11/19/2020] [Accepted: 02/14/2021] [Indexed: 12/26/2022]
Abstract
The study of transcriptional regulation is still difficult yet fundamental in molecular biology research. Recent research has shown that the double helix structure of nucleotides plays an important role in improving the accuracy and interpretability of transcription factor binding site (TFBS) prediction. Although several computational methods have been designed to take both DNA sequence and DNA shape features into consideration simultaneously, how to design an efficient model remains an open problem. In this paper, we propose a hybrid convolutional recurrent neural network (CNN/RNN) architecture, CRPTS, to predict TFBSs by combining DNA sequence and DNA shape features. The novelty of our proposed method relies on three critical aspects: (1) the shared hybrid CNN and RNN can efficiently extract features from large-scale genomic sequences obtained by high-throughput technology; (2) common patterns are found from DNA sequences and their corresponding DNA shape features; (3) CRPTS can capture local structural information of DNA sequences without completely relying on DNA shape data. A series of comprehensive experiments on 66 in vitro datasets derived from universal protein binding microarrays (uPBMs) shows that CRPTS clearly outperforms state-of-the-art methods.
Affiliation(s)
- Siguo Wang
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China
- Qinhu Zhang
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China.,Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Tongji University, Siping Road 1239, Shanghai 200092, China
- Zhen Shen
- School of Computer and Software, Nanyang Institute of Technology, Changjiang Road 80, Nanyang, Henan 473004, China
- Ying He
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China
- Zhen-Heng Chen
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Jianqiang Li
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- De-Shuang Huang
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China
13
Hind J, Lisboa P, Hussain AJ, Al-Jumeily D. A Novel Approach to Detecting Epistasis using Random Sampling Regularisation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1535-1545. [PMID: 31634840 DOI: 10.1109/tcbb.2019.2948330] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/10/2023]
Abstract
Epistasis analysis is a progressive approach that complements the 'common disease, common variant' hypothesis by highlighting the potential for connected networks of genetic variants to collaborate in producing a phenotypic expression. Epistasis analysis is commonly performed in a pairwise or unlimited-arity fashion, treating variant networks either as variant-versus-variant pairs or as higher-order interactions. This type of analysis increases the number of tests beyond those performed in a standard approach such as a Genome-Wide Association Study (GWAS), in which the False Discovery Rate (FDR) is already an issue; multiplying the number of tests, up to a factorial rate, therefore aggravates the FDR problem. In addition, epistasis analysis introduces its own limitations of computational complexity and intensity, which depend on the analysis performed; in the most intensive case, a multivariate analysis has a time complexity of O(n!). This paper proposes a novel methodology for the detection of epistasis that uses interpretable methods and best practice to outline interactions through filtering processes. It uses a process of Random Sampling Regularisation, which randomly splits the data into sample sets and applies a voting system to regularise the significance and reliability of biological markers (SNPs). Preliminary results are promising, outlining a concise detection of interactions. Results for the detection of epistasis in the classification of breast cancer patients indicated eight risk candidate interactions from five variants and a single candidate variant with a high protective association.
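The growth in the number of tests described in the abstract is easy to quantify for fixed interaction orders. The sketch below simply counts how many distinct variant combinations of a given order exist; it illustrates the multiple-testing burden only and is not part of the authors' Random Sampling Regularisation procedure. The function name and variant count are hypothetical.

```python
import math

def interaction_tests(n_variants, order):
    """Number of distinct variant combinations of the given order."""
    return math.comb(n_variants, order)

n = 1000  # hypothetical number of variants retained after filtering
single = interaction_tests(n, 1)     # one test per variant, as in GWAS
pairwise = interaction_tests(n, 2)   # variant-versus-variant tests
three_way = interaction_tests(n, 3)  # third-order interactions
```

Even with only 1000 variants, the pairwise analysis needs about half a million tests and the three-way analysis more than 166 million, which shows why filtering before testing matters for controlling the FDR.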
14
Lin X, Zhang X, Xu X. Efficient Classification of Hot Spots and Hub Protein Interfaces by Recursive Feature Elimination and Gradient Boosting. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1525-1534. [PMID: 31380766 DOI: 10.1109/tcbb.2019.2931717] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 06/10/2023]
Abstract
Proteins are not isolated biological molecules; they have specific three-dimensional structures and interact with other proteins to perform their functions. A small number of residues (hot spots) in protein-protein interactions (PPIs) play a vital role in influencing and controlling biological processes. This paper uses boosting and gradient boosting algorithms, based on two feature selection strategies, to classify hot spots on three common datasets and two hub protein datasets. First, correlation-based feature selection is used to remove highly correlated features and improve prediction accuracy. Then, recursive feature elimination based on a support vector machine (SVM-RFE) is adopted to select the optimal feature subset and improve training performance. Finally, boosting and gradient boosting (G-boosting) methods are invoked to generate classification results. Gradient boosting is capable of obtaining an excellent model by reducing the loss function in the gradient direction to avoid overfitting. Five datasets from different protein databases are used to verify our models in the experiments. Experimental results show that the proposed classification models perform competitively with existing classification methods.
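The first filtering step described in this abstract (removing highly correlated features) can be illustrated with a small sketch. The abstract does not specify the exact criterion, so the greedy pairwise Pearson filter below, the 0.9 threshold, and the feature names are all assumptions made for illustration, not the authors' implementation.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / sqrt(var_a * var_b)


def drop_correlated(features, threshold=0.9):
    """Greedily keep a feature only if |r| with every kept feature is <= threshold.

    `features` maps a feature name to its column of values.
    The threshold is a hypothetical choice, not taken from the paper.
    """
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature table: "asa_scaled" duplicates "asa" up to a scale factor.
features = {
    "asa": [1.0, 2.0, 3.0, 4.0],
    "asa_scaled": [2.0, 4.0, 6.0, 8.0],
    "conservation": [4.0, 1.0, 3.0, 2.0],
}
selected = drop_correlated(features)
```

In this toy table the scaled duplicate is dropped and the other two features are kept; a wrapper method such as SVM-RFE would then operate on the reduced set.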
|
15
|
Shen Z, Deng SP, Huang DS. RNA-Protein Binding Sites Prediction via Multi Scale Convolutional Gated Recurrent Unit Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1741-1750. [PMID: 30990191 DOI: 10.1109/tcbb.2019.2910513] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 06/09/2023]
Abstract
RNA-protein binding plays important roles in gene expression. With the development of high-throughput sequencing, several conventional and deep learning-based methods have been proposed to predict RNA-protein binding preferences. These methods can hardly account for the dependencies between subsequences and the varying motif lengths of different translation factors (TFs). To overcome these limitations, we propose a predictive model that combines multi-scale convolutional layers with a bidirectional gated recurrent unit (GRU) layer. The multi-scale convolutional layer captures motif features of different lengths, and the bidirectional GRU layer captures the dependencies among subsequences. Experimental results show that the proposed method performs better than four state-of-the-art methods in this field. In addition, we investigate the effect of model structure on performance by running the proposed method with a different convolutional layer and a different number of kernel sizes. We also demonstrate the effectiveness of the bidirectional GRU in improving model performance through comparative experiments.
|
16
|
Li J, Shi X, You ZH, Yi HC, Chen Z, Lin Q, Fang M. Using Weighted Extreme Learning Machine Combined With Scale-Invariant Feature Transform to Predict Protein-Protein Interactions From Protein Evolutionary Information. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1546-1554. [PMID: 31940546 DOI: 10.1109/tcbb.2020.2965919] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 06/10/2023]
Abstract
Protein-protein interactions (PPIs) play an irreplaceable role in the biological activities of organisms. Although many high-throughput methods are used to identify PPIs in different kinds of organisms, they have shortcomings such as high cost and being time-consuming. To address these problems, computational methods have been developed to predict PPIs. In this paper, we present a method to predict PPIs from protein sequences. First, protein sequences are transformed into a Position Weight Matrix (PWM), from which the Scale-Invariant Feature Transform (SIFT) algorithm extracts features. Then Principal Component Analysis (PCA) is applied to reduce the dimension of the features. Finally, a Weighted Extreme Learning Machine (WELM) classifier is employed to predict PPIs, and a series of evaluation results is obtained. Since SIFT and WELM are used for feature extraction and classification, respectively, we call the proposed method SIFT-WELM. When the proposed method is applied to three well-known PPI datasets (Yeast, Human and Helicobacter pylori), its average accuracies under five-fold cross-validation reach 94.83, 97.60 and 83.64 percent, respectively. To evaluate the proposed approach properly, we compare it with the Support Vector Machine (SVM) classifier and other recently developed methods in different aspects. Moreover, the training time of our method is greatly shortened compared with previous methods such as SVM, ACC and PCVMZM.
|
17
|
Peng C, Zheng Y, Huang DS. Capsule Network Based Modeling of Multi-omics Data for Discovery of Breast Cancer-Related Genes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1605-1612. [PMID: 30969931 DOI: 10.1109/tcbb.2019.2909905] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Indexed: 05/23/2023]
Abstract
Breast cancer is one of the most common cancers worldwide, causing more than 450,000 deaths each year. Although this malignancy has been extensively studied by a large number of researchers, its prognosis is still poor. Since therapeutic advances can be obtained based on gene signatures, there is an urgent need to discover genes related to breast cancer that may help uncover the mechanisms of cancer progression. We propose a deep learning method for the discovery of breast cancer-related genes by using Capsule Network based Modeling of Multi-omics Data (CapsNetMMD). In CapsNetMMD, we make use of known breast cancer-related genes to transform the problem of gene identification into a supervised classification problem. The features of genes are generated through comprehensive integration of multi-omics data, e.g., mRNA expression, z scores for mRNA expression, DNA methylation, and two forms of DNA copy-number alterations (CNAs). By modeling features based on the capsule network, we identify breast cancer-related genes with significantly better performance than other existing machine learning methods. The predicted genes with prognostic value play potentially important roles in breast cancer and may serve as candidates for biologists and medical scientists in future studies of biomarkers.
|
18
|
Shen Z, Deng SP, Huang DS. Capsule Network for Predicting RNA-Protein Binding Preferences Using Hybrid Feature. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1483-1492. [PMID: 31562101 DOI: 10.1109/tcbb.2019.2943465] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 06/10/2023]
Abstract
RNA-protein binding is involved in many different biological processes. With the progress of technology, more and more data are available for research. Based on these data, many methods have been proposed to predict RNA-protein binding preferences. Some of these methods use only RNA sequence features for prediction, while others use multiple features. However, the performance of these methods is not satisfactory. In this study, we propose an improved capsule network to predict RNA-protein binding preferences, which can use both RNA sequence features and structure features. Experimental results show that our proposed method, iCapsule, performs better than three baseline methods in this field. Because the model uses both RNA sequence features and structure features, we tested the effect of changes to the primary capsule layer on model performance. In addition, we studied the impact of model structure on model performance by running the proposed method with different numbers of convolution layers and different kernel sizes.
|
19
|
Zhang Q, Zhu L, Bao W, Huang DS. Weakly-Supervised Convolutional Neural Network Architecture for Predicting Protein-DNA Binding. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:679-689. [PMID: 30106688 DOI: 10.1109/tcbb.2018.2864203] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 06/08/2023]
Abstract
Although convolutional neural networks (CNNs) have outperformed conventional methods in predicting the sequence specificities of protein-DNA binding in recent years, they do not take full advantage of the intrinsic weakly supervised information in DNA sequences, namely that a bound sequence may contain multiple TFBSs. Here, we propose a weakly-supervised convolutional neural network architecture (WSCNN), combining multiple-instance learning (MIL) with CNN, to further boost the performance of predicting protein-DNA binding. WSCNN first divides each DNA sequence into multiple overlapping subsequences (instances) with a sliding window, then separately models each instance using a CNN, and finally fuses the predicted scores of all instances in the same bag using four fusion methods: Max, Average, Linear Regression, and Top-Bottom Instances. The experimental results on in vivo and in vitro datasets illustrate the performance of the proposed approach. Moreover, models built on in vitro data using WSCNN can predict in vivo protein-DNA binding with good accuracy. In addition, through a series of experiments, we give a quantitative analysis of the importance of the reverse-complement mode in predicting in vivo protein-DNA binding, and explain why advanced pooling layers are not used directly to combine MIL with CNN.
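The bag construction and score fusion described in this abstract can be sketched as follows. Only the Max and Average fusion rules are shown; the `gc_score` function is a hypothetical stand-in for the per-instance CNN, and the window and stride values are arbitrary illustrative choices.

```python
def sliding_instances(sequence, window, stride):
    """Split a sequence into overlapping subsequences (the instances of one bag)."""
    if window > len(sequence):
        return [sequence]
    return [
        sequence[i:i + window]
        for i in range(0, len(sequence) - window + 1, stride)
    ]


def gc_score(instance):
    """Placeholder instance scorer (GC fraction); stands in for the CNN."""
    return sum(base in "GC" for base in instance) / len(instance)


def fuse(scores, method="max"):
    """Fuse the instance scores of one bag into a single bag-level score."""
    if method == "max":
        return max(scores)
    if method == "average":
        return sum(scores) / len(scores)
    raise ValueError(f"unsupported fusion method: {method}")


bag = sliding_instances("ACGTGGCCAT", window=4, stride=2)
scores = [gc_score(instance) for instance in bag]
bag_max = fuse(scores, "max")
bag_avg = fuse(scores, "average")
```

Max fusion lets a single strongly scoring instance decide the bag label, while Average fusion dilutes it across all instances; the other two fusion methods named in the abstract are not reproduced here.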
|
20
|
Zhang Q, Zhu L, Huang DS. High-Order Convolutional Neural Network Architecture for Predicting DNA-Protein Binding Sites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1184-1192. [PMID: 29993783 DOI: 10.1109/tcbb.2018.2819660] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Indexed: 05/15/2023]
Abstract
Although deep learning algorithms have outperformed conventional methods in predicting the sequence specificities of DNA-protein binding, they fail to consider the dependencies among nucleotides and the diverse binding lengths of different transcription factors (TFs). To address these two limitations simultaneously, we propose a high-order convolutional neural network architecture (HOCNN), which employs a high-order encoding method to build high-order dependencies among nucleotides, and a multi-scale convolutional layer to capture motif features of different lengths. The experimental results on real ChIP-seq datasets show that the proposed method outperforms the state-of-the-art deep learning method (DeepBind) in the motif discovery task. In addition, we provide further insights into the importance of introducing additional convolutional kernels and the degeneration problem of importing high-order encoding in the motif discovery task.
|
21
|
Zhang Q, Shen Z, Huang DS. Modeling in-vivo protein-DNA binding by combining multiple-instance learning with a hybrid deep neural network. Sci Rep 2019; 9:8484. [PMID: 31186519 PMCID: PMC6559991 DOI: 10.1038/s41598-019-44966-x] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 10/19/2018] [Accepted: 05/15/2019] [Indexed: 01/26/2023]
Abstract
Modeling in-vivo protein-DNA binding is not only fundamental for further understanding of regulatory mechanisms, but also a challenging task in computational biology. Deep-learning based methods have succeeded in modeling in-vivo protein-DNA binding, but they often (1) follow the fully supervised learning framework and overlook the weakly supervised information in genomic sequences, namely that a bound DNA sequence may have multiple TFBSs, and (2) use one-hot encoding to encode DNA sequences and ignore the dependencies among nucleotides. In this paper, we propose a weakly supervised framework, which combines multiple-instance learning with a hybrid deep neural network and uses k-mer encoding to transform DNA sequences, for modeling in-vivo protein-DNA binding. First, this framework segments sequences into multiple overlapping instances using a sliding window, and then encodes all instances into image-like inputs of high-order dependencies using k-mer encoding. Second, it separately computes a score for each instance in the same bag using a hybrid deep neural network that integrates convolutional and recurrent neural networks. Finally, it integrates the predicted values of all instances as the final prediction of this bag using the Noisy-and method. The experimental results on in-vivo datasets demonstrate the superior performance of the proposed framework. In addition, we explore the performance of the proposed framework when using k-mer encoding, demonstrate the performance of the Noisy-and method by comparing it with other fusion methods, and find that adding recurrent layers can improve the performance of the proposed framework.
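The final fusion step can be illustrated with one commonly used form of the noisy-and pooling function from the multiple-instance learning literature. This is a sketch under assumptions: the abstract does not give the formula, and the slope `a` and threshold `b` below are fixed illustrative values, whereas in a trained model the threshold would typically be learned.

```python
from math import exp


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + exp(-x))


def noisy_and(instance_probs, a=10.0, b=0.5):
    """Noisy-and pooling of instance probabilities into one bag probability.

    The bag score rises sharply once the mean instance probability exceeds
    the threshold b; a controls the steepness. The output is rescaled so
    that a mean of 0 maps to 0 and a mean of 1 maps to 1.
    """
    mean_p = sum(instance_probs) / len(instance_probs)
    low = sigmoid(-a * b)
    high = sigmoid(a * (1.0 - b))
    return (sigmoid(a * (mean_p - b)) - low) / (high - low)


weak_bag = noisy_and([0.1, 0.2, 0.1, 0.3])
strong_bag = noisy_and([0.9, 0.8, 0.7, 0.95])
```

Unlike Max fusion, this rule requires a sufficient fraction of instances to score highly before the bag is called positive, which is the usual motivation given for it.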
Affiliation(s)
- Qinhu Zhang: Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
- Zhen Shen: Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
- De-Shuang Huang: Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
|
22
|
Yuan L, Huang DS. A Network-guided Association Mapping Approach from DNA Methylation to Disease. Sci Rep 2019; 9:5601. [PMID: 30944378 PMCID: PMC6447594 DOI: 10.1038/s41598-019-42010-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 01/31/2018] [Accepted: 03/12/2019] [Indexed: 01/11/2023]
Abstract
Aberrant DNA methylation may contribute to the development of cancer. However, understanding the associations between DNA methylation and cancer remains a challenge because of the complex mechanisms involved and insufficient sample sizes. The unprecedented wealth of DNA methylation, gene expression and disease status data gives us a new opportunity to design machine learning methods to investigate the underlying mechanisms. In this paper, we propose a network-guided association mapping approach from DNA methylation to disease (NAMDD). Compared with existing methods, NAMDD finds methylation-disease path associations by integrating the analysis of multiple data types with a stability selection strategy, thereby mining more information from the datasets and improving the quality of the resulting methylation sites. The experimental results on both synthetic and real ovarian cancer data show that NAMDD substantially outperforms earlier methods for identifying disease-related methylation sites (including NsRRR and PCLOGIT) under false positive control. Furthermore, we applied NAMDD to ovarian cancer data, identified significant path associations and provided hypothetical biological path associations to explain our findings.
Affiliation(s)
- Lin Yuan: Institute of Machine Learning and Systems Biology, College of Electronic and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
- De-Shuang Huang: Institute of Machine Learning and Systems Biology, College of Electronic and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
|
23
|
Xu W, Zhu L, Huang DS. DCDE: An Efficient Deep Convolutional Divergence Encoding Method for Human Promoter Recognition. IEEE Trans Nanobioscience 2019; 18:136-145. [PMID: 30624223 DOI: 10.1109/tnb.2019.2891239] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Indexed: 11/09/2022]
Abstract
Efficient human promoter feature extraction is still a major challenge in genome analysis; solving it can improve the understanding of human gene regulation and will be useful for experimental guidance. Although many machine learning algorithms have been developed for eukaryotic gene recognition, performance on promoters is unsatisfactory due to their diverse nature. To extract discriminative features from human promoters, an efficient deep convolutional divergence encoding method (DCDE) is proposed based on statistical divergence (SD) and a convolutional neural network (CNN). SD can help optimize k-mer feature extraction for human promoters. CNN can also be used to automatically extract features in gene analysis. In DCDE, we first perform informative k-mer settlement to encode the original gene sequences. A series of SD methods can optimize the most discriminative k-mer distributions while maintaining important positional information. Then, CNN is utilized to extract lower-dimensional deep features by secondary encoding. Finally, we construct a hybrid recognition architecture with multiple support vector machines and a bilayer decision method. The architecture makes it easy to add new features or new models and can be extended to identify other genomic functional elements. Extensive experiments demonstrate that DCDE is effective in promoter encoding and can significantly improve the performance of promoter recognition.
|
24
|
Zhang H, Zhu L, Huang DS. DiscMLA: An Efficient Discriminative Motif Learning Algorithm over High-Throughput Datasets. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1810-1820. [PMID: 27164602 DOI: 10.1109/tcbb.2016.2561930] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 06/05/2023]
Abstract
Transcription factors (TFs) can activate or suppress gene expression by binding to specific sites, and hence are crucial regulatory elements for transcription. Recently, a series of discriminative motif finders has been tailored to offer a promising strategy for harnessing the power of large quantities of accumulated high-throughput experimental data. However, in order to achieve high speed, these algorithms have to sacrifice accuracy by employing simplified statistical models during the search process. In this paper, we propose a novel approach named Discriminative Motif Learning via AUC (DiscMLA) to discover motifs on high-throughput datasets. Unlike previous approaches, DiscMLA optimizes a more comprehensive criterion (AUC) during motif searching. In addition, based on an experimental observation of motif identification on large-scale datasets, some novel procedures are designed to accelerate DiscMLA. The experimental results on 52 real-world datasets demonstrate that our approach substantially outperforms previous methods on discriminative motif learning problems. DiscMLA's stability, discriminability, and validity will help to exploit high-throughput datasets and answer many fundamental biological questions.
|
25
|
Recurrent Neural Network for Predicting Transcription Factor Binding Sites. Sci Rep 2018; 8:15270. [PMID: 30323198 PMCID: PMC6189047 DOI: 10.1038/s41598-018-33321-1] [Citation(s) in RCA: 104] [Impact Index Per Article: 17.3] [Received: 04/27/2018] [Accepted: 09/25/2018] [Indexed: 12/23/2022]
Abstract
It is well known that DNA sequences contain a certain number of transcription factor (TF) binding sites, only some of which have been identified through biological experiments. However, these experiments are expensive and time-consuming. To overcome these problems, some computational methods, based on k-mer features or convolutional neural networks, have been proposed to identify TF binding sites from DNA sequences. Although these methods perform well, they still lack the context information related to TF binding sites. Research indicates that standard recurrent neural networks (RNNs) and their variants perform better on time-series data than other models. In this study, we propose a model, named KEGRU, to identify TF binding sites by combining a Bidirectional Gated Recurrent Unit (GRU) network with k-mer embedding. First, DNA sequences are divided into k-mer sequences with a specified length and stride window. Second, we treat each k-mer as a word and pre-train a word representation model through the word2vec algorithm. Third, we construct a deep bidirectional GRU model for feature learning and classification. Experimental results show that our method performs better than some state-of-the-art methods. Additional experiments on the embedding strategy show that k-mer embedding helps to enhance model performance. The robustness of KEGRU is demonstrated by experiments with different k-mer lengths, stride windows and embedding vector dimensions.
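The first preprocessing step described here, dividing a DNA sequence into k-mers with a given length and stride, can be sketched as follows. The function names and the integer vocabulary are illustrative assumptions; the word2vec pre-training and the GRU model are not reproduced.

```python
def kmer_tokens(sequence, k, stride):
    """Split a sequence into k-mers of length k, advancing by `stride` bases."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def build_vocab(token_lists):
    """Assign each distinct k-mer an integer id in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


tokens = kmer_tokens("ACGTACGT", k=3, stride=2)
vocab = build_vocab([tokens])
token_ids = [vocab[token] for token in tokens]
```

Treating each k-mer as a word, the resulting id sequence is the kind of input an embedding layer or a word-embedding trainer would consume.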
|
26
|
Deng SP, Hu W, Calhoun VD, Wang YP. Integrating Imaging Genomic Data in the Quest for Biomarkers of Schizophrenia Disease. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1480-1491. [PMID: 28880187 PMCID: PMC6207076 DOI: 10.1109/tcbb.2017.2748944] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 05/06/2023]
Abstract
It's increasingly important but difficult to determine potential biomarkers of schizophrenia (SCZ) disease, owing to the complex pathophysiology of this disease. In this study, a network-fusion based framework was proposed to identify genetic biomarkers of the SCZ disease. A three-step feature selection was applied to single nucleotide polymorphisms (SNPs), DNA methylation, and functional magnetic resonance imaging (fMRI) data to select important features, which were then used to construct two gene networks in different states for the SNPs and DNA methylation data, respectively. Two health networks (one is for SNP data and the other is for DNA methylation data) were combined into one health network from which health minimum spanning trees (MSTs) were extracted. Two disease networks also followed the same procedures. Those genes with significant changes were determined as SCZ biomarkers by comparing MSTs in two different states and they were finally validated from five aspects. The effectiveness of the proposed discovery framework was also demonstrated by comparing with other network-based discovery methods. In summary, our approach provides a general framework for discovering gene biomarkers of the complex diseases by integrating imaging genomic data, which can be applied to the diagnosis of the complex diseases in the future.
Affiliation(s)
- Su-Ping Deng: Department of Biomedical Engineering, School of Science and Engineering, Tulane University, New Orleans, LA 70118, USA
- Wenxing Hu: Department of Biomedical Engineering, School of Science and Engineering, Tulane University, New Orleans, LA 70118, USA
- Yu-Ping Wang: Department of Biomedical Engineering, School of Science and Engineering, Tulane University, New Orleans, LA 70118, USA. Telephone: (504)865-5867, Fax: (504)862-8779
|
27
|
Lin X, Zhang X. Prediction of Hot Regions in PPIs Based on Improved Local Community Structure Detecting. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1470-1479. [PMID: 29994749 DOI: 10.1109/tcbb.2018.2793858] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 06/08/2023]
Abstract
Hot regions in PPIs are assembly regions composed of tightly packed hot spots. The discovery of hot regions helps to understand life activities and has very important value for biological applications. The identification of hot regions is the basis for protein design and cancer prevention. Existing algorithms for predicting hot regions often have defects such as low accuracy and instability. This paper proposes a novel hot region prediction method based on diverse biological characteristics. First, feature evaluation is performed using an improved mRMR method. Then, SVM is adopted to create a classification model based on the selected features. In addition, a new clustering algorithm, namely LCSD (Local Community Structure Detecting), is developed to detect and analyze the conformation of hot regions. In the clustering process, the link similarity of protein residues is introduced to handle the boundary nodes. This algorithm can effectively deal with missing residue nodes and control the local community boundaries. The results indicate that the spatial structure of hot regions can be obtained more effectively, and that our method is more effective than previous methods for precise identification of hot regions.
|
28
|
Liu J, Cheng Y, Wang X, Cui X, Kong Y, Du J. Low Rank Subspace Clustering via Discrete Constraint and Hypergraph Regularization for Tumor Molecular Pattern Discovery. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1500-1512. [PMID: 29993749 DOI: 10.1109/tcbb.2018.2834371] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 06/08/2023]
Abstract
Tumor clustering is a powerful approach for cancer class discovery, which is crucial to the effective treatment of cancer. Many traditional clustering methods, such as NMF-based models, have been widely used to identify tumors. However, they cannot achieve satisfactory results. Recently, subspace clustering approaches have been proposed to improve performance by dividing the original space into multiple low-dimensional subspaces. Among them, low rank representation is becoming a popular approach to subspace clustering. In this paper, we propose a novel Low Rank Subspace Clustering model via Discrete Constraint and Hypergraph Regularization (DHLRS). The proposed method learns the cluster indicators directly by using a discrete constraint, which simplifies the clustering task. For each subspace, we adopt the Schatten p-norm to better approximate the low rank constraint. Moreover, Hypergraph Regularization is adopted to infer the complex relationships between genes and the intrinsic geometrical structure of gene expression data in each subspace. Finally, the molecular pattern of tumor gene expression data sets is discovered according to the optimized cluster indicators. Experiments on both synthetic data and real tumor gene expression data sets demonstrate the effectiveness of the proposed DHLRS.
|
29
|
Bao W, Yuan CA, Zhang Y, Han K, Nandi AK, Honig B, Huang DS. Mutli-Features Prediction of Protein Translational Modification Sites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1453-1460. [PMID: 28961121 DOI: 10.1109/tcbb.2017.2752703] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Indexed: 06/07/2023]
Abstract
Post-translational modification plays a significant role in biological processing. A potential post-translational modification consists of the center site and the adjacent amino acid residues, which are fundamental protein sequence residues. Identifying such sites helps proteins' biological functions to be understood and contributes to understanding the molecular mechanisms that are the foundations of protein design and drug design. Existing algorithms for predicting modified sites often have shortcomings such as lower stability and accuracy. In this paper, a combination of physical, chemical, statistical, and biological properties of a protein is utilized as the features, and a novel framework is proposed to predict a protein's post-translational modification sites. A multi-layer neural network and a support vector machine are invoked to predict the potential modified sites with the selected features, which include the compositions of amino acid residues, the E-H description of protein segments, and several properties from the AAIndex database. To account for possibly redundant information, feature selection is applied in the preprocessing step of this research. The experimental results show that the proposed method is able to improve accuracy in this classification problem.
|
30
|
Yuan L, Guo LH, Yuan CA, Zhang YH, Han K, Nandi A, Honig B, Huang DS. Integration of Multi-omics Data for Gene Regulatory Network Inference and Application to Breast Cancer. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 16:782-791. [PMID: 30137012 DOI: 10.1109/tcbb.2018.2866836] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Indexed: 06/08/2023]
Abstract
Underlying a cancer phenotype is a specific gene regulatory network that represents the complex regulatory relationships between genes. However, it remains a challenge to find cancer-related gene regulatory networks because of insufficient sample sizes and complex regulatory mechanisms in which a gene is influenced not only by other genes but also by other biological factors. With the development of high-throughput technologies, the unprecedented wealth of multi-omics data gives us a new opportunity to design machine learning methods to investigate the underlying gene regulatory network. In this paper, we propose an approach that uses biweight midcorrelation to measure the correlation between factors and nonconvex penalty based sparse regression for gene regulatory network inference (BMNPGRN). BMNPGRN incorporates multi-omics data (including DNA methylation and copy number variation) and their interactions in the gene regulatory network model. The experimental results on synthetic datasets show that BMNPGRN outperforms popular and state-of-the-art methods (including DCGRN, ARACNE and CLR) under false positive control. Furthermore, we applied BMNPGRN to breast cancer (BRCA) data from The Cancer Genome Atlas database and provide the inferred gene regulatory network.
|
31
|
Gao L, Bao W, Zhang H, Yuan CA, Huang DS. Fast sequence analysis based on diamond sampling. PLoS One 2018; 13:e0198922. [PMID: 29953448 PMCID: PMC6023231 DOI: 10.1371/journal.pone.0198922] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 01/27/2018] [Accepted: 05/29/2018] [Indexed: 12/02/2022]
Abstract
In both DNA and protein contexts, an important method for modelling motifs in biological sequences is the position weight matrix (PWM). With the development of genome sequencing technology, the quantity of sequence data is increasing explosively, so faster search algorithms that can meet this growing need are desirable. In this paper, we propose a method for speeding up the search for candidate transcription factor binding sites (TFBS); users can specify a p threshold to obtain the desired trade-off between speed and sensitivity for a particular sequence analysis. Moreover, the proposed method can be generalized to large-scale annotation and sequence projects.
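For orientation, the exhaustive PWM scan that such speed-up methods aim to accelerate can be sketched as below. This is the plain baseline, not the diamond sampling procedure of the paper; the matrix values and the score threshold are made-up illustrative numbers, and the score threshold is not the same quantity as the paper's p threshold.

```python
def scan_pwm(sequence, pwm, threshold):
    """Score every window of the sequence against a PWM and keep the hits.

    `pwm` is a list of per-position dicts mapping a base to its score.
    Returns (start, score) pairs for windows scoring at least `threshold`.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(pwm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Hypothetical 3-column matrix favouring the motif "ACG".
pwm = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -1},
]
hits = scan_pwm("TTACGACG", pwm, threshold=6)
```

The cost of this baseline grows with sequence length times motif width, which is what motivates sampling-based or indexed alternatives on genome-scale data.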
Affiliation(s)
- Liangxin Gao
  - Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
- Wenzhen Bao
  - Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
- Hongbo Zhang
  - Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
- Chang-An Yuan
  - Science Computing and Intelligent Information Processing of GuangXi Higher Education Key Laboratory, Guangxi Teachers Education University, Nanning, Guangxi, China
- De-Shuang Huang
  - Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
|
32
|
Liu J, Wang X, Cheng Y, Zhang L. Tumor gene expression data classification via sample expansion-based deep learning. Oncotarget 2017; 8:109646-109660. [PMID: 29312636 PMCID: PMC5752549 DOI: 10.18632/oncotarget.22762] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 08/31/2017] [Accepted: 10/29/2017] [Indexed: 12/15/2022]
Abstract
Because tumors are seriously harmful to human health, effective diagnostic measures are urgently needed for tumor therapy. Early detection of tumors is particularly important for better treatment of patients. A notable issue is how to effectively discriminate tumor samples from normal ones. Many classification methods, such as Support Vector Machines (SVMs), have been proposed for tumor classification. Recently, deep learning has achieved satisfactory performance in classification tasks in many areas. However, deep learning is rarely applied to tumor classification because gene expression data provide too few training samples. In this paper, a Sample Expansion method is proposed to address the problem. Inspired by the idea of the Denoising Autoencoder (DAE), a large number of samples are obtained by randomly cleaning partially corrupted input many times. The expanded samples not only maintain the merits of corrupted data in DAE but also, to a certain extent, address the problem of insufficient training samples of gene expression data. Since Stacked Autoencoder (SAE) and Convolutional Neural Network (CNN) models show excellent performance in classification tasks, the applicability of SAE and 1-dimensional CNN (1DCNN) to gene expression data is analyzed. Finally, two deep learning models, Sample Expansion-Based SAE (SESAE) and Sample Expansion-Based 1DCNN (SE1DCNN), are designed to carry out tumor gene expression data classification by using the expanded samples. Experimental studies indicate that SESAE and SE1DCNN are very effective in tumor classification.
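The sample expansion idea in the abstract above can be sketched roughly as follows. This Python example assumes one possible reading of the method: DAE-style masking noise (randomly setting a fraction of features to zero) is applied to the training matrix several times, and each corrupted copy keeps the label of its source sample. The function name, the masking scheme, and the default parameters are assumptions for illustration, not the authors' published procedure.

```python
import numpy as np

def expand_samples(X, y, n_copies=5, corruption=0.2, seed=0):
    """Return the original samples plus n_copies randomly masked copies.

    X: (n_samples, n_genes) expression matrix; y: (n_samples,) labels.
    corruption: assumed fraction of entries zeroed in each copy.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    copies, labels = [X], [y]
    for _ in range(n_copies):
        # Keep each entry with probability (1 - corruption), else zero it.
        keep = rng.random(X.shape) >= corruption
        copies.append(X * keep)
        labels.append(y)
    return np.vstack(copies), np.concatenate(labels)

# Toy data: 3 samples with 4 features each.
X_toy = np.arange(12.0).reshape(3, 4)
y_toy = np.array([0, 1, 0])
X_big, y_big = expand_samples(X_toy, y_toy, n_copies=5, corruption=0.2)
```

With 3 input samples and 5 copies, the expanded set holds 18 rows, which could then be fed to a classifier such as an SAE or a 1D CNN.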
Affiliation(s)
- Jian Liu
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Xuesong Wang
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Yuhu Cheng
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Lin Zhang
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
|
33
|
Lee W, Park B, Han K. Sequence-based prediction of putative transcription factor binding sites in DNA sequences of any length. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 15:1461-1469. [PMID: 29990126 DOI: 10.1109/tcbb.2017.2773075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
A transcription factor (TF) is a protein that regulates gene expression by binding to specific DNA sequences. Despite recent advances in experimental techniques for identifying transcription factor binding sites (TFBS) in DNA sequences, a large number of TFBS remain to be unveiled in many species. Several computational methods developed for predicting TFBS in DNA are tissue- or species-specific, so they cannot be used without prior knowledge of the tissue or species. Some computational methods are applicable to finding TFBS in short DNA sequences only. In this paper we propose a new learning method for predicting TFBS in DNA of any length using the composition, transition and distribution of nucleotides and amino acids in DNA and TF sequences. In independent testing of the method on datasets that were not used in training, its accuracy and MCC were as high as 81.84% and 0.634, respectively. The proposed method can be a useful aid for selecting potential TFBS in a large amount of DNA sequences before conducting biochemical experiments to empirically determine TFBS. The program and data sets are available at http://bclab.inha.ac.kr/TFbinding.
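To make the composition and transition descriptors mentioned in the abstract above more concrete, here is a minimal Python sketch for nucleotide sequences. It assumes the usual reading of these terms: composition is the frequency of each nucleotide, and transition is the fraction of adjacent positions that switch between a given unordered pair of nucleotides. The distribution descriptor is omitted, and the exact definitions used by the authors are not stated in the abstract, so the function names and formulas here are illustrative assumptions.

```python
from itertools import combinations

def composition(seq, alphabet="ACGT"):
    """Frequency of each letter of the alphabet in seq (seq must be non-empty)."""
    seq = seq.upper()
    n = len(seq)
    return {b: seq.count(b) / n for b in alphabet}

def transition(seq, alphabet="ACGT"):
    """Fraction of adjacent pairs that switch between each unordered letter pair.

    Requires len(seq) >= 2. Keys are two-letter strings such as "AC",
    counting both A->C and C->A steps.
    """
    seq = seq.upper()
    pairs = list(zip(seq, seq[1:]))
    total = len(pairs)
    feats = {}
    for a, b in combinations(alphabet, 2):
        changes = sum(1 for p, q in pairs if {p, q} == {a, b})
        feats[a + b] = changes / total
    return feats

comp = composition("AACG")
trans = transition("AACG")
```

Because both descriptors are normalized by sequence length, they yield fixed-size feature vectors (4 and 6 values here) for DNA of any length, which is what allows a single classifier to handle inputs of varying size.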
|
34
|
Li JQ, You ZH, Li X, Ming Z, Chen X. PSPEL: In Silico Prediction of Self-Interacting Proteins from Amino Acids Sequences Using Ensemble Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:1165-1172. [PMID: 28092572 DOI: 10.1109/tcbb.2017.2649529] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Indexed: 06/06/2023]
Abstract
Self-interacting proteins (SIPs) play an important role in various aspects of the structural and functional organization of the cell. Detecting SIPs is one of the most important issues in current molecular biology. Although a large amount of SIP data has been generated by experimental methods, wet laboratory approaches are both time-consuming and costly. In addition, they yield high false negative and false positive rates. Thus, there is a great need for in silico methods to predict SIPs accurately and efficiently. In this study, a new sequence-based method is proposed to predict SIPs. The evolutionary information contained in the Position-Specific Scoring Matrix (PSSM) is extracted from proteins with known sequences. Then, the features are fed to an ensemble classifier to distinguish self-interacting from non-self-interacting proteins. When applied to Saccharomyces cerevisiae and Human SIP data sets, the proposed method achieves high accuracies of 86.86 and 91.30 percent, respectively. Our method also performs well when compared with the SVM classifier and previous methods. Consequently, the proposed method can be considered a novel and promising tool for predicting SIPs.
|
35
|
Chen Q, Lan C, Chen B, Wang L, Li J, Zhang C. Exploring Consensus RNA Substructural Patterns Using Subgraph Mining. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:1134-1146. [PMID: 28026781 DOI: 10.1109/tcbb.2016.2645202] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/06/2023]
Abstract
Frequently recurring RNA structural motifs play important roles in the RNA folding process and in interactions with other molecules. Traditional index-based and shape-based schemas are useful in modeling RNA secondary structures but ignore the structural discrepancies among individual RNA family members. Further, in-depth analysis of the underlying substructure patterns is insufficient because substructure data are varied and unnormalized. This prevents us from understanding RNA functions and their inherent synergistic regulation networks. This article thus proposes a novel labeled graph-based algorithm, RnaGraph, to uncover frequent RNA substructure patterns. Attribute data and graph data are combined to characterize diverse substructures and their correlations, respectively. Further, a top-k graph pattern mining algorithm is developed to extract interesting substructure motifs by integrating frequency and similarity. The experimental results show that our methods assist not only in modelling complex RNA secondary structures but also in identifying hidden but interesting RNA substructure patterns.
|
36
|
Yuan L, Zhu L, Guo WL, Zhou X, Zhang Y, Huang Z, Huang DS. Nonconvex Penalty Based Low-Rank Representation and Sparse Regression for eQTL Mapping. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:1154-1164. [PMID: 28114074 DOI: 10.1109/tcbb.2016.2609420] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Indexed: 06/06/2023]
Abstract
This paper addresses the problem of accounting for confounding factors and expression quantitative trait loci (eQTL) mapping in the study of SNP-gene associations. The existing convex penalty based algorithm has limited capacity to retain the main information of a matrix while reducing its rank. We present an algorithm that uses nonconvex penalty based low-rank representation to account for confounding factors and sparse regression for eQTL mapping (NCLRS). The efficiency of the presented algorithm is evaluated by comparing the results of 18 synthetic datasets given by NCLRS and presented algorithm, respectively. The experimental results on a biological dataset show that our approach is a more effective tool to account for non-genetic effects than currently existing methods.
|
37
|
Robust Significance Analysis of Microarrays by Minimum β-Divergence Method. BioMed Research International 2017; 2017:5310198. [PMID: 28819626 PMCID: PMC5551475 DOI: 10.1155/2017/5310198] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 03/29/2017] [Accepted: 05/28/2017] [Indexed: 11/18/2022]
Abstract
Identification of differentially expressed (DE) genes under two or more conditions is an important task for the discovery of a few biomarker genes. Significance Analysis of Microarrays (SAM) is a popular statistical approach for identifying DE genes in both small- and large-sample cases. However, it is sensitive to outlying gene expressions and has low power in the presence of outliers. Therefore, in this paper, an attempt is made to robustify the SAM approach using the minimum β-divergence estimators instead of the maximum likelihood estimators of the parameters. We demonstrate the performance of the proposed method in comparison with other popular statistical methods such as ANOVA, SAM, LIMMA, KW, EBarrays, GaGa, and BRIDGE using both simulated and real gene expression datasets. We observe that all methods show good and almost equal performance in the absence of outliers for the large-sample cases, while in the small-sample cases only three methods (SAM, LIMMA, and the proposed method) show almost equal and better performance than the others with two or more conditions. However, in the presence of outliers, on average, only the proposed method performs better than the others for both small- and large-sample cases with each condition.
|
38
|
Tang B. Genomic feature extraction and comparison based on global alignment of ChIP-sequencing data. Bioengineered 2017; 8:248-255. [PMID: 27690208 PMCID: PMC5470523 DOI: 10.1080/21655979.2016.1226714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
Abstract
Enhanced accuracy and high-throughput capability in capturing genetic activities have led ChIP-sequencing technology to be widely applied in diverse studies of DNA-protein interaction problems. Yet questions such as how to choose suitable ChIP-seq arguments and how to compare sample quality still trouble biologists. We propose methods for answering such questions by deciding optimal argument pairs in the global alignment of ChIP-sequencing data. We then employ a modern signal processing approach to extract inherent genomic features from the global alignments of transcriptional binding activities, together with pairwise comparison from intra- and inter-sample perspectives, so that we can further determine alignment quality and decide the optimal candidate for multi-source heterogeneous high-throughput sequences. The work provides a practical approach to quantitatively compare the alignment quality of heterogeneous sequencing data, especially in determining the efficiency of transcriptional binding from replicate samples, and thus helps exploit the potential of ChIP-seq for a deeper understanding of the biological meaning of high-throughput genomic sequences.
Affiliation(s)
- Binhua Tang
  - (a) Epigenetics & Function Group, College of the Internet of Things, Hohai University, Jiangsu, China; (b) School of Public Health, Shanghai Jiao Tong University, Shanghai, China
|
39
|
Zhu L, Guo WL, Lu C, Huang DS. Collaborative Completion of Transcription Factor Binding Profiles via Local Sensitive Unified Embedding. IEEE Trans Nanobioscience 2016; 15:946-958. [PMID: 27845669 DOI: 10.1109/tnb.2016.2625823] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/09/2022]
Abstract
Although newly available ChIP-seq data provide immense opportunities for comparative study of regulatory activities across different biological conditions, cost, time, or sample material availability mean that researchers cannot always obtain binding profiles for every protein in every sample of interest, which considerably limits the power of integrative studies. Recently, by leveraging related information from measured data, Ernst et al. proposed ChromImpute for predicting additional ChIP-seq and other types of datasets; they demonstrated that the imputed signal tracks accurately approximate the experimentally measured signals and could thereby enhance the power of integrative analysis. Despite the success of ChromImpute, in this paper we reexamine its learning process and show that its performance may degrade substantially, and that it may sometimes even fail to output a prediction, when the available data are scarce. This limitation could hurt its applicability to important predictive tasks, such as the imputation of TF binding data. To alleviate this problem, we propose a novel method called Local Sensitive Unified Embedding (LSUE) for imputing new ChIP-seq datasets. In LSUE, the ChIP-seq data compendium is fused by mapping proteins, samples, and genomic positions simultaneously into Euclidean space, thereby making their underlying associations directly evaluable using simple calculations. In contrast to ChromImpute, which mainly makes use of the local correlations between available datasets, LSUE can better estimate the overall data structure by formulating the representation learning of all involved entities as a single unified optimization problem. Meanwhile, a novel form of local sensitive low-rank regularization is also proposed to further improve the performance of LSUE. Experimental evaluations on the ENCODE TF ChIP-seq data illustrate the performance of the proposed model.
The code of LSUE is available at https://github.com/ekffar/LSUE.
|