1
Deng Y, Tang J, Zhang J, Zou J, Zhu Q, Fan S. GraphCpG: imputation of single-cell methylomes based on locus-aware neighboring subgraphs. Bioinformatics 2023; 39:btad533. [PMID: 37647650 PMCID: PMC10516632 DOI: 10.1093/bioinformatics/btad533] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Revised: 07/24/2023] [Accepted: 08/28/2023] [Indexed: 09/01/2023] Open
Abstract
MOTIVATION Single-cell DNA methylation sequencing can assay DNA methylation at single-cell resolution. However, incomplete coverage compromises downstream analyses, underscoring the importance of imputation techniques. With the rising number of cells in recent large datasets, scalable and efficient imputation models are critical for addressing sparsity in genome-wide analyses. RESULTS We proposed a novel graph-based deep learning approach to impute methylation matrices based on locus-aware neighboring subgraphs, with locus-aware encoding orienting on one cell type. Using only the CpG methylation matrix, GraphCpG outperforms previous methods on datasets containing more than hundreds of cells and achieves competitive performance on smaller datasets, with subgraphs of predicted sites visualized by retrievable bipartite graphs. Besides improving imputation performance as the number of cells increases, it significantly reduces computation time and improves downstream analysis. AVAILABILITY AND IMPLEMENTATION The source code is freely available at https://github.com/yuzhong-deng/graphcpg.git.
Affiliation(s)
- Yuzhong Deng
  - School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, Sichuan, China
- Jianxiong Tang
  - School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, Sichuan, China
- Jiyang Zhang
  - School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, Sichuan, China
- Jianxiao Zou
  - School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, Sichuan, China
  - Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen 518110, Guangdong, China
- Que Zhu
  - Department of Out-patient, The Second Affiliated Hospital of Chongqing Medical University, Chongqing 400010, China
- Shicai Fan
  - School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, Sichuan, China
  - Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen 518110, Guangdong, China
2
Park S, Rehman MU, Ullah F, Tayara H, Chong KT. iCpG-Pos: an accurate computational approach for identification of CpG sites using positional features on single-cell whole genome sequence data. Bioinformatics 2023; 39:btad474. [PMID: 37555812 PMCID: PMC10444964 DOI: 10.1093/bioinformatics/btad474] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/02/2023] [Revised: 05/11/2023] [Accepted: 08/08/2023] [Indexed: 08/10/2023] Open
Abstract
MOTIVATION The investigation of DNA methylation can shed light on the processes underlying human well-being and help determine overall human health. However, insufficient coverage makes it challenging to implement single-stranded DNA methylation sequencing technologies, highlighting the need for an efficient prediction model. Models are required to create an understanding of the underlying biological systems and to project single-cell (methylated) data accurately. RESULTS In this study, we developed positional features for predicting CpG sites. Positional characteristics of the sequence are derived using data from CpG regions and the separation between nearby CpG sites. Multiple optimized classifiers and different ensemble learning approaches are evaluated. The OPTUNA framework is used to optimize the algorithms. The CatBoost algorithm followed by the stacking algorithm outperformed existing DNA methylation identifiers. AVAILABILITY AND IMPLEMENTATION The data and methodologies used in this study are openly accessible to the research community. Researchers can access the positional features and algorithms used for predicting CpG site methylation patterns. The proposed iCpG-Pos approach uses only positional features, resulting in a substantial reduction in computational complexity compared to other known approaches for detecting CpG site methylation patterns. By focusing on positional features, the model offers both accuracy and efficiency, making it a promising tool for advancing DNA methylation research and its applications in human health and well-being.
Affiliation(s)
- Sehi Park
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Mobeen Ur Rehman
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Farman Ullah
  - College of Information Technology in the United Arab Emirates University (UAEU), Abu Dhabi 15551, UAE
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
  - Advances Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
3
Dodlapati S, Jiang Z, Sun J. Completing Single-Cell DNA Methylome Profiles via Transfer Learning Together With KL-Divergence. Front Genet 2022; 13:910439. [PMID: 35938031 PMCID: PMC9353187 DOI: 10.3389/fgene.2022.910439] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2022] [Accepted: 05/25/2022] [Indexed: 11/13/2022] Open
Abstract
Methylome profiles obtained by whole-genome bisulfite sequencing from small amounts of biological material are highly sparse, which limits their value in the study of systems in which large samples are difficult to assemble, such as mammalian preimplantation embryonic development. Recently developed computational methods that address the sparsity by imputing missing values reach their limits when the required minimum data coverage or profiles of the same tissue in other modalities are not available. In this study, we explored the use of transfer learning together with Kullback-Leibler (KL) divergence to train predictive models for completing methylome profiles with very low coverage (below 2%). Transfer learning was used to leverage less sparse profiles that are typically available for different tissues of the same species, while KL divergence was employed to maximize the use of information carried in the input data. A deep neural network was adopted to extract both DNA sequence and local methylation patterns for imputation. Our study of training models for completing methylome profiles of bovine oocytes and early embryos demonstrates the effectiveness of transfer learning and KL divergence, with individual increases of 29.98% and 29.43%, respectively, in prediction performance and a 38.70% increase when the two were used together. The drastically increased data coverage (43.80-73.6%) after imputation powers downstream analyses involving methylomes that cannot be done effectively using the very low coverage profiles (0.06-1.47%) before imputation.
Affiliation(s)
- Sanjeeva Dodlapati
  - Department of Computer Science, Old Dominion University, Norfolk, VA, United States
- Zongliang Jiang
  - School of Animal Sciences, AgCenter, Louisiana State University, Baton Rouge, LA, United States
- Jiangwen Sun
  - Department of Computer Science, Old Dominion University, Norfolk, VA, United States
4
Yu B, Zhang Y, Wang X, Gao H, Sun J, Gao X. Identification of DNA modification sites based on elastic net and bidirectional gated recurrent unit with convolutional neural network. Biomed Signal Process Control 2022. [DOI: 10.1016/j.bspc.2022.103566] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 02/06/2023]
5
Guo Y, Wu C, Yuan Z, Wang Y, Liang Z, Wang Y, Zhang Y, Xu L. Gene-Based Testing of Interactions Using XGBoost in Genome-Wide Association Studies. Front Cell Dev Biol 2021; 9:801113. [PMID: 34977040 PMCID: PMC8716787 DOI: 10.3389/fcell.2021.801113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2021] [Accepted: 11/23/2021] [Indexed: 11/30/2022] Open
Abstract
Among the many statistical methods that identify gene–gene interactions in qualitative genome-wide association studies, gene-based methods are not only statistically powerful but also biologically interpretable. However, their detection power is limited by the assumptions they make about the association between traits and single nucleotide polymorphisms. This article therefore proposes a gene-based method, GGInt-XGBoost, built on XGBoost. Assuming that the log odds ratio of disease traits is additive when a pair of genes has no interaction, the difference in error between XGBoost models fitted with and without the additive constraint can indicate a gene–gene interaction; we then used a permutation-based statistical test to assess this difference and to provide a p-value representing the significance of the interaction. Experimental results on both simulated and real data showed that our approach outperformed previous approaches in detecting gene–gene interactions.
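The permutation step described in this abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `stat_fn` is a hypothetical placeholder for the error difference between the constrained and unconstrained XGBoost models, and shuffling the trait labels is an assumed permutation scheme that the abstract does not specify.

```python
import random


def permutation_p_value(observed, null_stats):
    """One-sided permutation p-value with the usual +1 correction."""
    exceed = sum(1 for s in null_stats if s >= observed)
    return (exceed + 1) / (len(null_stats) + 1)


def permutation_test(stat_fn, genotypes, labels, n_perm=999, seed=0):
    """Compare the observed statistic with statistics from shuffled labels.

    stat_fn(genotypes, labels) is a hypothetical callback; in the setting of
    the abstract it would return the error gap between the additive and the
    unconstrained model.
    """
    rng = random.Random(seed)
    observed = stat_fn(genotypes, labels)
    shuffled = list(labels)
    null_stats = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-trait association
        null_stats.append(stat_fn(genotypes, shuffled))
    return observed, permutation_p_value(observed, null_stats)
```

A small p-value here would mean that the observed gap is rarely matched under shuffled labels, which is the sense in which the abstract speaks of the significance of an interaction.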
Affiliation(s)
- Yingjie Guo
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Chenxi Wu
  - Department of Mathematics, University of Wisconsin-Madison, Madison, WI, United States
- Zhian Yuan
  - Research Institute of Big Data Science and Industry, Shanxi University, Taiyuan, China
- Yansu Wang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Zhen Liang
  - School of Life Science, Shanxi University, Taiyuan, China
- Yang Wang
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
- Yi Zhang
  - Beidahuang Industry Group General Hospital, Harbin, China
  - *Correspondence: Yi Zhang; Lei Xu
- Lei Xu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
  - *Correspondence: Yi Zhang; Lei Xu
6
Ao C, Zou Q, Yu L. NmRF: identification of multispecies RNA 2'-O-methylation modification sites from RNA sequences. Brief Bioinform 2021; 23:6446272. [PMID: 34850821 DOI: 10.1093/bib/bbab480] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 08/09/2021] [Revised: 10/05/2021] [Accepted: 10/18/2021] [Indexed: 12/12/2022] Open
Abstract
2'-O-methylation (Nm) is a post-transcriptional modification of RNA that is catalyzed by 2'-O-methyltransferase and involves replacing the H on the 2'-hydroxyl group with a methyl group. The 2'-O-methylation modification site is detected in a variety of RNA types (miRNA, tRNA, mRNA, etc.), plays an important role in biological processes and is associated with different diseases. Few of its functional mechanisms have been elucidated so far, and traditional high-throughput experiments for exploring them are time-consuming and expensive. For a deeper understanding of the relevant biological mechanisms, it is necessary to develop efficient and accurate recognition tools based on machine learning. We therefore constructed a predictor called NmRF, based on optimal mixed features and a random forest classifier, to identify 2'-O-methylation modification sites. The predictor can identify modification sites of multiple species at the same time. To obtain a better prediction model, a two-step strategy is adopted; that is, the optimal hybrid feature set is obtained by combining the light gradient boosting algorithm and an incremental feature selection strategy. In 10-fold cross-validation, the accuracies for Homo sapiens and Saccharomyces cerevisiae were 89.069% and 93.885%, and the AUCs were 0.9498 and 0.9832, respectively. The rigorous 10-fold cross-validation and independent tests confirm that the proposed method is significantly better than existing tools. A user-friendly web server is accessible at http://lab.malab.cn/~acy/NmRF.
Affiliation(s)
- Chunyan Ao
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Liang Yu
  - School of Computer Science and Technology, Xidian University, Xi'an, China
7
Lyu Y, He W, Li S, Zou Q, Guo F. iPro2L-PSTKNC: A Two-Layer Predictor for Discovering Various Types of Promoters by Position Specific of Nucleotide Composition. IEEE J Biomed Health Inform 2021; 25:2329-2337. [PMID: 32976109 DOI: 10.1109/jbhi.2020.3026735] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/08/2022]
Abstract
Promoters are DNA regulatory elements located proximal to the transcription start site, which are in charge of the initiation of specific gene transcription. In Escherichia coli, promoters can be recognized by σ factors that fall into multiple families based on distinct function and structure, such as σ24, σ28, σ32, σ38, σ54 and σ70. At present, biological methods are mainly used to identify these promoters. However, because biological experiments are time-consuming and material-consuming, computational algorithms have emerged as a more efficient way to predict and classify promoters. In this study, we develop a novel two-layer seamless predictor called iPro2L-PSTKNC to identify the promoters of the E. coli genome, which is based on our newly proposed feature extraction model, the position specific tendencies of k-mer nucleotide composition (PSTKNC). The first layer is a binary classifier that predicts whether a sequence is a promoter or not. The second layer is a multi-class classifier that identifies which type the identified promoter belongs to. The ensemble SVM classifier performs best compared with other algorithms, achieving a promising accuracy and Matthews correlation coefficient (MCC) of [Formula: see text] and [Formula: see text], respectively. Our data and code are available at https://github.com/lyuyinuo/iPro2L-PSTKNC.
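The reported accuracy and MCC values are elided in this listing, so the sketch below only illustrates how the two metrics named in the abstract are defined for a binary classifier such as the first layer. The function names and the confusion-matrix counts in the usage note are illustrative assumptions, not values from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

For example, `mcc(50, 50, 0, 0)` is 1.0 for a perfect classifier, while a classifier that is right exactly half the time on balanced data, `mcc(25, 25, 25, 25)`, scores 0.0.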
8
Zhang Z, Cui F, Lin C, Zhao L, Wang C, Zou Q. Critical downstream analysis steps for single-cell RNA sequencing data. Brief Bioinform 2021; 22:6210064. [PMID: 33822873 DOI: 10.1093/bib/bbab105] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 01/05/2021] [Revised: 02/20/2021] [Accepted: 03/09/2021] [Indexed: 12/13/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) has enabled us to study biological questions at the single-cell level. Currently, many analysis tools are available to better utilize these relatively noisy data. In this review, we summarize the most widely used methods for critical downstream analysis steps (i.e. clustering, trajectory inference, cell-type annotation and integrating datasets). The advantages and limitations are comprehensively discussed, and we provide suggestions for choosing proper methods in different situations. We hope this paper will be useful for scRNA-seq data analysts and bioinformatics tool developers.
Affiliation(s)
- Zilong Zhang
  - University of Electronic Science and Technology of China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China
9
Feng J, Jiang L, Li S, Tang J, Wen L. Multi-Omics Data Fusion via a Joint Kernel Learning Model for Cancer Subtype Discovery and Essential Gene Identification. Front Genet 2021; 12:647141. [PMID: 33747053 PMCID: PMC7969795 DOI: 10.3389/fgene.2021.647141] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/29/2020] [Accepted: 02/02/2021] [Indexed: 01/17/2023] Open
Abstract
The multiple sources of cancer determine its multiple causes, and the same cancer can be composed of many different subtypes. Identification of cancer subtypes is a key part of personalized cancer treatment and provides an important reference for clinical diagnosis and treatment. Some studies have shown that there are significant differences in the genetic and epigenetic profiles among different cancer subtypes during carcinogenesis and development. In this study, we first collect seven cancer datasets from the Broad Institute GDAC Firehose, including gene expression profile, isoform expression profile, DNA methylation expression data, and survival information correspondingly. Furthermore, we employ kernel principal component analysis (PCA) to extract features for each expression profile, convert them into three similarity kernel matrices by Gaussian kernel function, and then fuse these matrices as a global kernel matrix. Finally, we apply it to spectral clustering algorithm to get the clustering results of different cancer subtypes. In the experimental results, besides using the P-value from the Cox regression model and survival analysis as the primary evaluation measures, we also introduce statistical indicators such as Rand index (RI) and adjusted RI (ARI) to verify the performance of clustering. Then combining with gene expression profile, we obtain the differential expression of genes among different subtypes by gene set enrichment analysis. For lung cancer, GMPS, EPHA10, C10orf54, and MAGEA6 are highly expressed in different subtypes; for liver cancer, CMYA5, DEPDC6, FAU, VPS24, RCBTB2, LOC100133469, and SLC35B4 are significantly expressed in different subtypes.
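The kernel-construction and fusion steps in this abstract can be sketched as below. The Gaussian kernel is the standard one; the fusion rule is an assumption, since the abstract does not say how the three matrices are combined (the title suggests learned weights), so an equal-weight average is used here as a stand-in. Function names and the `sigma` default are illustrative.

```python
import math


def gaussian_kernel(samples, sigma=1.0):
    """Similarity matrix K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    n = len(samples)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
            kernel[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return kernel


def fuse_kernels(kernels, weights=None):
    """Weighted sum of same-sized kernel matrices.

    Equal weights are an assumption made for this sketch; the paper's
    joint kernel learning model presumably learns them.
    """
    if weights is None:
        weights = [1.0 / len(kernels)] * len(kernels)
    n = len(kernels[0])
    return [
        [sum(w * k[i][j] for w, k in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]
```

In the pipeline the abstract describes, each input to `gaussian_kernel` would be the kernel PCA features of one data type, and the fused matrix would be passed to spectral clustering as a precomputed affinity.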
Affiliation(s)
- Jie Feng
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Limin Jiang
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Shuhao Li
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Jijun Tang
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
  - School of Computational Science and Engineering, University of South Carolina, Columbia, SC, United States
  - Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
- Lan Wen
  - Changsha Municipal Center of Disease Control, Changsha, China
10
Tang J, Zou J, Fan M, Tian Q, Zhang J, Fan S. CaMelia: imputation in single-cell methylomes based on local similarities between cells. Bioinformatics 2021; 37:1814-1820. [PMID: 33459762 DOI: 10.1093/bioinformatics/btab029] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 10/27/2020] [Revised: 12/11/2020] [Accepted: 01/12/2021] [Indexed: 12/16/2022] Open
Abstract
MOTIVATION Single-cell DNA methylation sequencing detects methylation levels with single-cell resolution, and this technology is advancing our understanding of the regulation of gene expression through epigenetic modifications. However, almost all current technologies suffer from the inherent problem of low CpG coverage. Therefore, addressing the inherent sparsity of raw data is essential for quantitative analysis of the whole genome. RESULTS Here, we reported CaMelia, a CatBoost gradient boosting method for predicting the missing methylation states based on the locally paired similarity of intercellular methylation patterns. On real single-cell methylation data sets, CaMelia yielded significant imputation performance gains over previous methods. Furthermore, applying the imputed data to the downstream analysis of cell-type identification, we found that CaMelia helped to discover more intercellular differentially methylated loci that were masked by the sparsity in raw data, and the clustering results demonstrated that CaMelia could preserve cell-cell relationships and improve the identification of cell types and cell subpopulations. AVAILABILITY Python code is available at https://github.com/JxTang-bioinformatics/CaMelia. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jianxiong Tang
  - Department of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Jianxiao Zou
  - Department of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Mei Fan
  - Chengdu Women's and Children's Central Hospital, School of Medicine, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Qi Tian
  - Department of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Jiyang Zhang
  - Department of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Shicai Fan
  - Department of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 611731, China
11
Li J, Zhang L, He S, Guo F, Zou Q. SubLocEP: a novel ensemble predictor of subcellular localization of eukaryotic mRNA based on machine learning. Brief Bioinform 2021; 22:6059770. [PMID: 33388743 DOI: 10.1093/bib/bbaa401] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 09/02/2020] [Revised: 11/28/2020] [Accepted: 12/08/2020] [Indexed: 01/23/2023] Open
Abstract
MOTIVATION mRNA location corresponds to the location of protein translation and contributes to precise spatial and temporal management of protein function. However, current assignment of subcellular localization of eukaryotic mRNA has important limitations: (1) turning multi-class classification into multiple binary classifications makes the training process tedious; (2) the majority of the models trained by classical algorithms are based on the extraction of single sequence information; (3) the existing state-of-the-art models have not reached an ideal level in terms of prediction and generalization ability. To achieve better assignment of subcellular localization of eukaryotic mRNA, a better and more comprehensive model must be developed. RESULTS In this paper, SubLocEP is proposed as a two-layer integrated prediction model for accurate prediction of the location of sequence samples. Unlike the existing models based on limited features, SubLocEP comprehensively considers additional feature attributes and is combined with LightGBM to generate single feature classifiers. The initial integration model (single-layer model) is generated according to the categories of a feature. Subsequently, two single-layer integration models are weighted (sequence-based: physicochemical properties = 3:2) to produce the final two-layer model. The performance of SubLocEP on independent datasets indicates that it is an accurate and stable prediction model with strong generalization ability. Additionally, an online tool has been developed that contains the experimental data and makes estimating the subcellular localization of eukaryotic mRNA convenient for users.
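The 3:2 weighting of the two single-layer models can be illustrated with a short sketch. It assumes that each single-layer model outputs one probability per location class and that the weights are applied as a normalized weighted average; the abstract states only the ratio, so the combination rule, the function names and the class labels in the usage note are assumptions.

```python
def fuse_two_layers(seq_probs, phys_probs, w_seq=3, w_phys=2):
    """Weighted average of class probabilities from two single-layer models.

    w_seq and w_phys default to the 3:2 ratio quoted in the abstract
    (sequence-based : physicochemical properties).
    """
    total = w_seq + w_phys
    return [
        (w_seq * s + w_phys * p) / total
        for s, p in zip(seq_probs, phys_probs)
    ]


def predict_label(probs, labels):
    """Return the label with the highest fused probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best]
```

With hypothetical labels `["nucleus", "cytoplasm"]`, sequence-based output `[0.7, 0.3]` and physicochemical output `[0.4, 0.6]`, the fused scores favor the first class because the sequence-based model carries the larger weight.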
Affiliation(s)
- Lichao Zhang
  - School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology
12
The progress on the estimation of DNA methylation level and the detection of abnormal methylation. Quantitative Biology 2021. [DOI: 10.15302/j-qb-022-0289] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
13
Li Y, Zhang Z, Teng Z, Liu X. PredAmyl-MLP: Prediction of Amyloid Proteins Using Multilayer Perceptron. Computational and Mathematical Methods in Medicine 2020; 2020:8845133. [PMID: 33294004 PMCID: PMC7700051 DOI: 10.1155/2020/8845133] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 09/14/2020] [Revised: 10/06/2020] [Accepted: 10/31/2020] [Indexed: 01/20/2023]
Abstract
Amyloid is generally an aggregate of insoluble fibrin; its abnormal deposition is the pathogenic mechanism of various diseases, such as Alzheimer's disease and type II diabetes. Therefore, accurately identifying amyloid is necessary to understand its role in pathology. We proposed a machine learning-based prediction model called PredAmyl-MLP, which consists of the following three steps: feature extraction, feature selection, and classification. In the step of feature extraction, seven feature extraction algorithms and different combinations of them are investigated, and the combination of SVMProt-188D and tripeptide composition (TPC) is selected according to the experimental results. In the step of feature selection, maximum relevant maximum distance (MRMD) and binomial distribution (BD) are, respectively, used to remove the redundant or noise features, and the appropriate features are selected according to the experimental results. In the step of classification, we employed multilayer perceptron (MLP) to train the prediction model. The 10-fold cross-validation results show that the overall accuracy of PredAmyl-MLP reached 91.59%, and the performance was better than the existing methods.
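Of the features named in this abstract, tripeptide composition (TPC) is simple enough to sketch directly: it is the frequency of each of the 20^3 = 8000 possible tripeptides among the overlapping length-3 windows of a protein sequence. The code below is an illustrative sketch, not the authors' implementation; skipping windows that contain non-standard residues is a choice made here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]
INDEX = {tri: i for i, tri in enumerate(TRIPEPTIDES)}


def tripeptide_composition(sequence):
    """Return an 8000-dimensional vector of tripeptide frequencies.

    Windows containing a residue outside the 20 standard amino acids are
    skipped (an assumption of this sketch), and frequencies are normalized
    by the number of windows actually counted.
    """
    counts = [0] * len(TRIPEPTIDES)
    windows = 0
    for i in range(len(sequence) - 2):
        idx = INDEX.get(sequence[i:i + 3])
        if idx is not None:
            counts[idx] += 1
            windows += 1
    if windows == 0:
        return [0.0] * len(TRIPEPTIDES)
    return [c / windows for c in counts]
```

A vector like this would then be concatenated with the other descriptors and passed through feature selection before training the classifier.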
Affiliation(s)
- Yanjuan Li
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
- Zitong Zhang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
- Zhixia Teng
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
- Xiaoyan Liu
  - College of Computer Science and Technology, Harbin Institute of Technology, Harbin 150040, China
14
Dou L, Li X, Zhang L, Xiang H, Xu L. iGlu_AdaBoost: Identification of Lysine Glutarylation Using the AdaBoost Classifier. J Proteome Res 2020; 20:191-201. [PMID: 33090794 DOI: 10.1021/acs.jproteome.0c00314] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Indexed: 02/01/2023]
Abstract
Lysine glutarylation is a newly reported post-translational modification (PTM) that plays significant roles in regulating metabolic and mitochondrial processes. Accurate identification of protein glutarylation is the primary task to better investigate molecular functions and various applications. Because traditional biological sequencing techniques are time-consuming and expensive and protein data are growing explosively, building precise computational models to rapidly identify glutarylation is a popular and feasible solution. In this work, we proposed a novel AdaBoost-based predictor called iGlu_AdaBoost to distinguish glutarylation and non-glutarylation sequences. Here, the top 37 features were chosen from a total of 1768 combined features using Chi2 followed by incremental feature selection (IFS) to build the model, including 188D, the composition of k-spaced amino acid pairs (CKSAAP), and enhanced amino acid composition (EAAC). With the help of the hybrid-sampling method SMOTE-Tomek, the AdaBoost algorithm achieved satisfactory recall, specificity, and AUC values of 87.48%, 72.49%, and 0.89 over 10-fold cross validation as well as 72.73%, 71.92%, and 0.63 over independent test, respectively. Further feature analysis inferred that the positively charged amino acids R and K play critical roles in glutarylation recognition. Our model showed good generalization ability and consistent prediction results for positive and negative samples, which is comparable to four published tools. The proposed predictor is an efficient tool to find potential glutarylation sites and provides helpful suggestions for further research on glutarylation mechanisms and related disease treatments.
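Incremental feature selection, as named in this abstract, is usually run on a pre-ranked feature list: features are added one at a time in rank order and the prefix with the best score is kept. The sketch below assumes that reading. `evaluate` is a hypothetical callback (for example, a cross-validated score of the classifier on the given subset), and the ranking is assumed to come from a Chi2 scoring step that is not shown.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Grow the feature set in rank order and keep the best-scoring prefix.

    ranked_features: feature names sorted from most to least relevant.
    evaluate: hypothetical callback mapping a list of features to a score
              where higher is better.
    """
    best_score = float("-inf")
    best_subset = []
    subset = []
    for feature in ranked_features:
        subset.append(feature)
        score = evaluate(subset)
        if score > best_score:  # strict '>' keeps the smallest best prefix
            best_score = score
            best_subset = list(subset)
    return best_subset, best_score
```

Applied to the setting in the abstract, the ranked list would hold the 1768 combined features and the selected prefix would correspond to the top 37.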
Affiliation(s)
- Lijun Dou
  - School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen 518055, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Xiaoling Li
  - Department of Oncology, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin 150000, China
- Lichao Zhang
  - School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen 518172, China
- Huaikun Xiang
  - School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen 518055, China
- Lei Xu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518055, China
15
Cai J, Xu Y, Zhang W, Ding S, Sun Y, Lyu J, Duan M, Liu S, Huang L, Zhou F. A comprehensive comparison of residue-level methylation levels with the regression-based gene-level methylation estimations by ReGear. Brief Bioinform 2020; 22:5921981. [PMID: 33048108 DOI: 10.1093/bib/bbaa253] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/11/2020] [Revised: 08/10/2020] [Accepted: 09/08/2020] [Indexed: 02/07/2023] Open
Abstract
MOTIVATION DNA methylation is a biological process that affects gene function without changing the underlying DNA sequence. The DNA methylation machinery usually attaches methyl groups to specific cytosine residues, which modifies the chromatin architecture. Such modifications in promoter regions can inactivate some tumor-suppressor genes. DNA methylation within the coding region may significantly reduce the transcription elongation efficiency. Gene function may thus be tuned through the methylation of certain cytosines. METHODS This study hypothesizes that the overall methylation level across a gene may have a better association with sample labels such as diseases than the methylation of individual cytosines. The gene methylation level is formulated as a regression model using the methylation levels of all the cytosines within the gene. A comprehensive evaluation of various feature selection algorithms and classification algorithms is carried out between the gene-level and residue-level methylation levels. RESULTS A comprehensive evaluation was conducted to compare the gene and cytosine methylation levels for their associations with the sample labels and classification performances. Unsupervised clustering was also improved using the gene methylation levels. Some genes demonstrated statistically significant associations with the class label even when no residue-level methylation feature had a statistically significant association with the class label. In summary, the trained gene methylation levels improved various methylome-based machine learning models. Both the methodological development of regression algorithms and the experimental validation of gene-level methylation biomarkers are worthy of further investigation in future studies. The source code, example data files and manual are available at http://www.healthinformaticslab.org/supp/.
|
16
|
Yasen A, Aini A, Wang H, Li W, Zhang C, Ran B, Tuxun T, Maimaitinijiati Y, Shao Y, Aji T, Wen H. Progress and applications of single-cell sequencing techniques. Infection, Genetics and Evolution 2020; 80:104198. [PMID: 31958516 DOI: 10.1016/j.meegid.2020.104198] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 10/16/2019] [Revised: 01/07/2020] [Accepted: 01/16/2020] [Indexed: 01/06/2023]
Abstract
Single-cell sequencing (SCS) is a next-generation sequencing method that is mainly used to analyze differences in genetic and protein information between cells, to obtain genetic information on microorganisms that are difficult to cultivate at a single-cell level and to better understand their specific roles in the microenvironment. By sequencing the whole genome, transcriptome and epigenome of a single cell, the complex heterogeneous mechanisms involved in disease occurrence and progression can be revealed, further improving disease diagnosis, prognosis prediction and monitoring of the therapeutic effects of drugs. In this study, we mainly summarized the methods and application fields of SCS, which may provide potential references for its future clinical applications, including the analysis of embryonic and organ development, the immune system, cancer progression, and parasitic and infectious diseases as well as stem cell research, antibody screening, and therapeutic research and development.
Affiliation(s)
- Aimaiti Yasen
- State Key Laboratory of Pathogenesis, Prevention and Treatment of High Incidence Diseases in Central Asia, Xinjiang Medical University, 393 Xin Yi Road, Urumqi 830011, Xinjiang Uyghur Autonomous Region, People's Republic of China; The First Affiliated Hospital of Xinjiang Medical University, Urumqi 830011, Xinjiang Uyghur Autonomous Region, People's Republic of China; Department of Hepatobiliary and Hydatid Disease, Digestive and Vascular Surgery Center, The First Affiliated Hospital of Xinjiang Medical University, Urumqi 830011, Xinjiang Uyghur Autonomous Region, People's Republic of China
- Abudusalamu Aini
- The First Affiliated Hospital of Xinjiang Medical University, Urumqi 830011, Xinjiang Uyghur Autonomous Region, People's Republic of China; Department of Hepatobiliary and Hydatid Disease, Digestive and Vascular Surgery Center, The First Affiliated Hospital of Xinjiang Medical University, Urumqi 830011, Xinjiang Uyghur Autonomous Region, People's Republic of China
- Hui Wang
- Clinical Medical Research Institute, The First Affiliated Hospital of Xinjiang Medical University, Urumqi 830011, Xinjiang Uyghur Autonomous Region, People's Republic of China
- Wending Li
- The First Affiliated Hospital of Xinjiang Medical University, Urumqi 830011, Xinjiang Uyghur Autonomous Region, People's Republic of China
- Chuanshan Zhang
- Clinical Medical Research Institute, The First Affiliated Hospital of Xinjiang Medical University, Urumqi 830011, Xinjiang Uyghur Autonomous Region, People's Republic of China
- Bo Ran
- Department of Hepatobiliary and Hydatid Disease, Digestive and Vascular Surgery Center, The First Affiliated Hospital of Xinjiang Medical University, Urumqi 830011, Xinjiang Uyghur Autonomous Region, People's Republic of China
- Tuerhongjiang Tuxun
- Department of Hepatobiliary and Hydatid Disease, Digestive and Vascular Surgery Center, The First Affiliated Hospital of Xinjiang Medical University, Urumqi 830011, Xinjiang Uyghur Autonomous Region, People's Republic of China
- Yusufukadier Maimaitinijiati
- The First Affiliated Hospital of Xinjiang Medical University, Urumqi 830011, Xinjiang Uyghur Autonomous Region, People's Republic of China; Department of Hepatobiliary and Hydatid Disease, Digestive and Vascular Surgery Center, The First Affiliated Hospital of Xinjiang Medical University, Urumqi 830011, Xinjiang Uyghur Autonomous Region, People's Republic of China
- Yingmei Shao
- Department of Hepatobiliary and Hydatid Disease, Digestive and Vascular Surgery Center, The First Affiliated Hospital of Xinjiang Medical University, Urumqi 830011, Xinjiang Uyghur Autonomous Region, People's Republic of China
- Tuerganaili Aji
- State Key Laboratory of Pathogenesis, Prevention and Treatment of High Incidence Diseases in Central Asia, Xinjiang Medical University, 393 Xin Yi Road, Urumqi 830011, Xinjiang Uyghur Autonomous Region, People's Republic of China; Department of Hepatobiliary and Hydatid Disease, Digestive and Vascular Surgery Center, The First Affiliated Hospital of Xinjiang Medical University, Urumqi 830011, Xinjiang Uyghur Autonomous Region, People's Republic of China
- Hao Wen
- State Key Laboratory of Pathogenesis, Prevention and Treatment of High Incidence Diseases in Central Asia, Xinjiang Medical University, 393 Xin Yi Road, Urumqi 830011, Xinjiang Uyghur Autonomous Region, People's Republic of China; Department of Hepatobiliary and Hydatid Disease, Digestive and Vascular Surgery Center, The First Affiliated Hospital of Xinjiang Medical University, Urumqi 830011, Xinjiang Uyghur Autonomous Region, People's Republic of China
|
17
|
Wang Z, He W, Tang J, Guo F. Identification of Highest-Affinity Binding Sites of Yeast Transcription Factor Families. J Chem Inf Model 2020; 60:1876-1883. [DOI: 10.1021/acs.jcim.9b01012] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Indexed: 12/19/2022]
Affiliation(s)
- Zongyu Wang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Wenying He
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Jijun Tang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin 300072, P. R. China
- Department of Computer Science and Engineering, University of South Carolina, Columbia, South Carolina 29208, United States
- Fei Guo
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
|
18
|
Jiang L, Wang C, Tang J, Guo F. Correction to: LightCpG: a multi-view CpG sites detection on single-cell whole genome sequence data. BMC Genomics 2019; 20:365. [PMID: 31084602 PMCID: PMC6513517 DOI: 10.1186/s12864-019-5742-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2019] [Accepted: 04/29/2019] [Indexed: 11/10/2022] Open
Affiliation(s)
- Limin Jiang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Chongqing Wang
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
- Jijun Tang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China; Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
- Fei Guo
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
|