1
|
Jin J, Feng J. iDHS-RGME: Identification of DNase I hypersensitive sites by integrating information on nucleotide composition and physicochemical properties. Biochem Biophys Res Commun 2024; 734:150618. [PMID: 39222575 DOI: 10.1016/j.bbrc.2024.150618] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2024] [Revised: 08/19/2024] [Accepted: 08/28/2024] [Indexed: 09/04/2024]
Abstract
As pivotal markers of chromatin accessibility, DNase I hypersensitive sites (DHSs) intimately link to fundamental biological processes encompassing gene expression regulation and disease pathogenesis. Developing efficient and precise algorithms for DHSs identification holds paramount importance for unraveling genome functionality and elucidating disease mechanisms. This study innovatively presents iDHS-RGME, an Extremely Randomized Trees (Extra-Trees)-based algorithm that integrates unique feature extraction techniques for enhanced DHSs prediction. Specifically, iDHS-RGME utilizes two feature extraction approaches: Reverse Complementary Kmer (RCKmer) and Geary Spatial Autocorrelation (GSA), which comprehensively capture sequence attributes from diverse angles, bolstering information richness and accuracy. To address data imbalance, Borderline-SMOTE is employed, followed by Maximum Information Coefficient (MIC) for meticulous feature selection. Comparative evaluations underscored the superiority of the Extra-Trees classifier, which was subsequently adopted for model prediction. Through rigorous five-fold cross-validation, iDHS-RGME achieved remarkable accuracies of 94.71 % and 95.07 % on two independent datasets, outperforming previous models in terms of both precision and effectiveness.
Collapse
Affiliation(s)
- Jian Jin
- School of Science, Minzu University of China, Beijing, 100081, China
| | - Jie Feng
- School of Science, Minzu University of China, Beijing, 100081, China.
| |
Collapse
|
2
|
Yao Y, Zhang S, Xue T. Integrating LASSO Feature Selection and Soft Voting Classifier to Identify Origins of Replication Sites. Curr Genomics 2022; 23:83-93. [PMID: 36778978 PMCID: PMC9878833 DOI: 10.2174/1389202923666220214122506] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2021] [Revised: 12/11/2021] [Accepted: 01/18/2022] [Indexed: 11/22/2022] Open
Abstract
Background: DNA replication plays an indispensable role in the transmission of genetic information. It is considered to be the basis of biological inheritance and the most fundamental process in all biological life. Considering that DNA replication initiates with a special location, namely the origin of replication, a better and accurate prediction of the origins of replication sites (ORIs) is essential to gain insight into the relationship with gene expression. Objective: In this study, we have developed an efficient predictor called iORI-LAVT for ORIs identification. Methods: This work focuses on extracting feature information from three aspects, including mono-nucleotide encoding, k-mer and ring-function-hydrogen-chemical properties. Subsequently, least absolute shrinkage and selection operator (LASSO) as a feature selection is applied to select the optimal features. Comparing the different combined soft voting classifiers results, the soft voting classifier based on GaussianNB and Logistic Regression is employed as the final classifier. Results: Based on 10-fold cross-validation test, the prediction accuracies of two benchmark datasets are 90.39% and 95.96%, respectively. As for the independent dataset, our method achieves high accuracy of 91.3%. Conclusion: Compared with previous predictors, iORI-LAVT outperforms the existing methods. It is believed that iORI-LAVT predictor is a promising alternative for further research on identifying ORIs.
Collapse
Affiliation(s)
- Yingying Yao
- School of Mathematics and Statistics, Xidian University, Xi’an 710071, P.R. China
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi’an 710071, P.R. China,Address correspondence to this author at the School of Mathematics and Statistics, Xidian University, Xi’an 710071, P.R. China; Tel/Fax: +86-29- 88202860; E-mail:
| | - Tian Xue
- School of Mathematics and Statistics, Xidian University, Xi’an 710071, P.R. China
| |
Collapse
|
3
|
Guo W, Liu X, Ma Y, Zhang R. iRspot-DCC: Recombination hot/ cold spots identification based on dinucleotide-based correlation coefficient and convolutional neural network. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2021. [DOI: 10.3233/jifs-210213] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022]
Abstract
The correct identification of gene recombination cold/hot spots is of great significance for studying meiotic recombination and genetic evolution. However, most of the existing recombination spots recognition methods ignore the global sequence information hidden in the DNA sequence, resulting in their low recognition accuracy. A computational predictor called iRSpot-DCC was proposed in this paper to improve the accuracy of cold/hot spots identification. In this approach, we propose a feature extraction method based on dinucleotide correlation coefficients that focus more on extracting potential DNA global sequence information. Then, 234 representative features vectors are filtered by SVM weight calculation. Finally, a convolutional neural network with better performance than SVM is selected as a classifier. The experimental results of 5-fold cross-validation test on two standard benchmark datasets showed that the prediction accuracy of our recognition method reached 95.11%, and the Mathew correlation coefficient (MCC) reaches 90.04%, outperforming most other methods. Therefore, iRspot-DCC is a high-precision cold/hot spots identification method for gene recombination, which effectively extracts potential global sequence information from DNA sequences.
Collapse
Affiliation(s)
- Wang Guo
- Chongqing Key Laboratory of Complex Systems and Bionic Control, Chongqing University of Posts and Telecommunications, Chongqing, China
| | - Xingmou Liu
- Chongqing Key Laboratory of Complex Systems and Bionic Control, Chongqing University of Posts and Telecommunications, Chongqing, China
| | - You Ma
- Chongqing Key Laboratory of Complex Systems and Bionic Control, Chongqing University of Posts and Telecommunications, Chongqing, China
| | - Rongjie Zhang
- Chongqing Key Laboratory of Complex Systems and Bionic Control, Chongqing University of Posts and Telecommunications, Chongqing, China
| |
Collapse
|
4
|
Semwal R, Varadwaj PK. HumDLoc: Human Protein Subcellular Localization Prediction Using Deep Neural Network. Curr Genomics 2020; 21:546-557. [PMID: 33214771 PMCID: PMC7604748 DOI: 10.2174/1389202921999200528160534] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2020] [Revised: 03/27/2020] [Accepted: 03/30/2020] [Indexed: 11/24/2022] Open
Abstract
Aims To develop a tool that can annotate subcellular localization of human proteins. Background With the progression of high throughput human proteomics projects, an enormous amount of protein sequence data has been discovered in the recent past. All these raw sequence data require precise mapping and annotation for their respective biological role and functional attributes. The functional characteristics of protein molecules are highly dependent on the subcellular localization/compartment. Therefore, a fully automated and reliable protein subcellular localization prediction system would be very useful for current proteomic research. Objective To develop a machine learning-based predictive model that can annotate the subcellular localization of human proteins with high accuracy and precision. Methods In this study, we used the PSI-CD-HIT homology criterion and utilized the sequence-based features of protein sequences to develop a powerful subcellular localization predictive model. The dataset used to train the HumDLoc model was extracted from a reliable data source, Uniprot knowledge base, which helps the model to generalize on the unseen dataset. Results The proposed model, HumDLoc, was compared with two of the most widely used techniques: CELLO and DeepLoc, and other machine learning-based tools. The result demonstrated promising predictive performance of HumDLoc model based on various machine learning parameters such as accuracy (≥97.00%), precision (≥0.86), recall (≥0.89), MCC score (≥0.86), ROC curve (0.98 square unit), and precision-recall curve (0.93 square unit). Conclusion In conclusion, HumDLoc was able to outperform several alternative tools for correctly predicting subcellular localization of human proteins. The HumDLoc has been hosted as a web-based tool at https://bioserver.iiita.ac.in/HumDLoc/.
Collapse
Affiliation(s)
- Rahul Semwal
- 1Department of Information Technology (Bioinformatics), Indian Institute of Information Technology-Allahabad, Jhalwa, Prayagraj, India; 2Department of Bioinformatics and Applied Science, Indian Institute of Information Technology-Allahabad, Jhalwa, Prayagraj, India
| | - Pritish Kumar Varadwaj
- 1Department of Information Technology (Bioinformatics), Indian Institute of Information Technology-Allahabad, Jhalwa, Prayagraj, India; 2Department of Bioinformatics and Applied Science, Indian Institute of Information Technology-Allahabad, Jhalwa, Prayagraj, India
| |
Collapse
|
5
|
Zhang S, Qiao H. KD-KLNMF: Identification of lncRNAs subcellular localization with multiple features and nonnegative matrix factorization. Anal Biochem 2020; 610:113995. [PMID: 33080214 DOI: 10.1016/j.ab.2020.113995] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2020] [Revised: 09/07/2020] [Accepted: 10/12/2020] [Indexed: 12/18/2022]
Abstract
Long non-coding RNAs (lncRNAs) refer to functional RNA molecules with a length more than 200 nucleotides and have minimal or no function to encode proteins. In recent years, more studies show that lncRNAs subcellular localization has valuable clues for their biological functions. So it is count for much to identify lncRNAs subcellular localization. In this paper, a novel statistical model named KD-KLNMF is constructed to predict lncRNAs subcellular localization. Firstly, k-mer and dinucleotide-based spatial autocorrelation are incorporated as the feature vector. Then, Synthetic Minority Over-sampling Technique is used to deal with the imbalance dataset. Next, Kullback-Leibler divergence-based nonnegative matrix factorization is applied to select optimal features. And then we utilize support vector machine as the classifier after comparing with other classifiers. Finally, the jackknife test is performed to evaluate the model. The overall accuracies reach 97.24% and 92.86% on training dataset and independent dataset, respectively. The results are better than the previous methods, which indicate that our model will be a useful and feasible tool to identify lncRNAs subcellular localization. The datasets and source code are freely available at https://github.com/HuijuanQiao/KD-KLNMF.
Collapse
Affiliation(s)
- Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China.
| | - Huijuan Qiao
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| |
Collapse
|
6
|
Chou KC. An Insightful 10-year Recollection Since the Emergence of the 5-steps Rule. Curr Pharm Des 2020; 25:4223-4234. [PMID: 31782354 DOI: 10.2174/1381612825666191129164042] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2019] [Accepted: 11/25/2019] [Indexed: 11/22/2022]
Abstract
OBJECTIVE One of the most challenging and also the most difficult problems is how to formulate a biological sequence with a vector but considerably keep its sequence order information. METHODS To address such a problem, the approach of Pseudo Amino Acid Components or PseAAC has been developed. RESULTS AND CONCLUSION It has become increasingly clear via the 10-year recollection that the aforementioned proposal has been indeed very powerful.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, Massachusetts 02478, United States.,Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
7
|
Maghsoudloo M, Azimzadeh Jamalkandi S, Najafi A, Masoudi-Nejad A. An efficient hybrid feature selection method to identify potential biomarkers in common chronic lung inflammatory diseases. Genomics 2020; 112:3284-3293. [PMID: 32540493 DOI: 10.1016/j.ygeno.2020.06.010] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2020] [Revised: 05/21/2020] [Accepted: 06/04/2020] [Indexed: 12/13/2022]
Abstract
Asthma, chronic obstructive pulmonary disease (COPD), and idiopathic pulmonary fibrosis (IPF) are three serious lung inflammatory diseases. The understanding of the pathogenesis mechanism and the identification of potential prognostic biomarkers of these diseases can provide the patients with more efficient treatments. In this study, an efficient hybrid feature selection method was introduced in order to extract informative genes. We implemented an ontology-based ranking approach on differentially expressed genes following a wrapper method. The examination of the different gene ontologies and their combinations motivated us to propose a biological functional-based method to improve the performance of further wrapper methods. The results identified: TOM1L1, SRSF1, and GIT2 in asthma; CHCHD4, PAIP2, CRLF3, UBQLN4, TRAK1, PRELID1, VAMP4, CCM2, and APBB1IP in COPD; and TUFT1, GAB2, B4GALNT1, TNFRSF17, PRDM8, and SETDB2 in IPF as the potential biomarkers. The proposed method can be used to identify hub genes in other high-throughput datasets.
Collapse
Affiliation(s)
- Mazaher Maghsoudloo
- Laboratory of Systems Biology and Bioinformatics (LBB), Department of Bioinformatics, Kish International Campus, University of Tehran, Kish Island, Iran; Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
| | | | - Ali Najafi
- Molecular Biology Research Center, Systems Biology and Poisonings Institute, Tehran, Iran
| | - Ali Masoudi-Nejad
- Laboratory of Systems Biology and Bioinformatics (LBB), Department of Bioinformatics, Kish International Campus, University of Tehran, Kish Island, Iran; Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran.
| |
Collapse
|
8
|
|
9
|
Some illuminating remarks on molecular genetics and genomics as well as drug development. Mol Genet Genomics 2020; 295:261-274. [PMID: 31894399 DOI: 10.1007/s00438-019-01634-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2019] [Accepted: 12/05/2019] [Indexed: 02/07/2023]
Abstract
Facing the explosive growth of biological sequences unearthed in the post-genomic age, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector, but still keep it with considerable sequence-order information or its special pattern. To deal with such a challenging problem, the ideas of "pseudo amino acid components" and "pseudo K-tuple nucleotide composition" have been proposed. The ideas and their approaches have further stimulated the birth for "distorted key theory", "wenxing diagram", and substantially strengthening the power in treating the multi-label systems, as well as the establishment of the famous "5-steps rule". All these logic developments are quite natural that are very useful not only for theoretical scientists but also for experimental scientists in conducting genetics/genomics analysis and drug development. Presented in this review paper are also their future perspectives; i.e., their impacts will become even more significant and propounding.
Collapse
|
10
|
Lu Z, Chou KC. pLoc_Deep-mGpos: Predict Subcellular Localization of Gram Positive Bacteria Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/jbise.2020.135005] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
11
|
Shao Y, Chou KC. pLoc_Deep-mVirus: A CNN Model for Predicting Subcellular Localization of Virus Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/ns.2020.126033] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
12
|
Shao Y, Chou KC. pLoc_Deep-mEuk: Predict Subcellular Localization of Eukaryotic Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/ns.2020.126034] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
13
|
Chou KC. Impacts of Pseudo Amino Acid Components and 5-steps Rule to Proteomics and Proteome Analysis. Curr Top Med Chem 2019; 19:2283-2300. [DOI: 10.2174/1568026619666191018100141] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2019] [Revised: 08/18/2019] [Accepted: 08/26/2019] [Indexed: 01/27/2023]
Abstract
Stimulated by the 5-steps rule during the last decade or so, computational proteomics has achieved remarkable progresses in the following three areas: (1) protein structural class prediction; (2) protein subcellular location prediction; (3) post-translational modification (PTM) site prediction. The results obtained by these predictions are very useful not only for an in-depth study of the functions of proteins and their biological processes in a cell, but also for developing novel drugs against major diseases such as cancers, Alzheimer’s, and Parkinson’s. Moreover, since the targets to be predicted may have the multi-label feature, two sets of metrics are introduced: one is for inspecting the global prediction quality, while the other for the local prediction quality. All the predictors covered in this review have a userfriendly web-server, through which the majority of experimental scientists can easily obtain their desired data without the need to go through the complicated mathematics.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
| |
Collapse
|
14
|
Chou KC. Advances in Predicting Subcellular Localization of Multi-label Proteins and its Implication for Developing Multi-target Drugs. Curr Med Chem 2019; 26:4918-4943. [PMID: 31060481 DOI: 10.2174/0929867326666190507082559] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2018] [Revised: 01/29/2019] [Accepted: 01/31/2019] [Indexed: 12/16/2022]
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most
of the functions critical to the cell’s survival are performed by these proteins located in its different
organelles, usually called ‘‘subcellular locations”. Information of subcellular localization
for a protein can provide useful clues about its function. To reveal the intricate pathways at the
cellular level, knowledge of the subcellular localization of proteins in a cell is prerequisite.
Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine
the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing
and selecting the right targets for drug development. Unfortunately, it is both timeconsuming
and costly to determine the subcellular locations of proteins purely based on experiments.
With the avalanche of protein sequences generated in the post-genomic age, it is highly
desired to develop computational methods for rapidly and effectively identifying the subcellular
locations of uncharacterized proteins based on their sequences information alone. Actually,
considerable progresses have been achieved in this regard. This review is focused on those
methods, which have the capacity to deal with multi-label proteins that may simultaneously
exist in two or more subcellular location sites. Protein molecules with this kind of characteristic
are vitally important for finding multi-target drugs, a current hot trend in drug development.
Focused in this review are also those methods that have use-friendly web-servers established so
that the majority of experimental scientists can use them to get the desired results without the
need to go through the detailed mathematics involved.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
15
|
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most
of the functions critical to the cell’s survival are performed by these proteins located in its different
organelles, usually called ‘‘subcellular locations”. Information of subcellular localization
for a protein can provide useful clues about its function. To reveal the intricate pathways at the
cellular level, knowledge of the subcellular localization of proteins in a cell is prerequisite.
Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine
the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing
and selecting the right targets for drug development. Unfortunately, it is both timeconsuming
and costly to determine the subcellular locations of proteins purely based on experiments.
With the avalanche of protein sequences generated in the post-genomic age, it is highly
desired to develop computational methods for rapidly and effectively identifying the subcellular
locations of uncharacterized proteins based on their sequences information alone. Actually,
considerable progresses have been achieved in this regard. This review is focused on those
methods, which have the capacity to deal with multi-label proteins that may simultaneously
exist in two or more subcellular location sites. Protein molecules with this kind of characteristic
are vitally important for finding multi-target drugs, a current hot trend in drug development.
Focused in this review are also those methods that have use-friendly web-servers established so
that the majority of experimental scientists can use them to get the desired results without the
need to go through the detailed mathematics involved.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
16
|
|