1
|
Ye DX, Yu JW, Li R, Hao YD, Wang TY, Yang H, Ding H. The Prediction of Recombination Hotspot Based on Automated Machine Learning. J Mol Biol 2024:168653. [PMID: 38871176 DOI: 10.1016/j.jmb.2024.168653] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2024] [Revised: 05/12/2024] [Accepted: 06/06/2024] [Indexed: 06/15/2024]
Abstract
Meiotic recombination plays a pivotal role in genetic evolution. Genetic variation induced by recombination is a crucial factor in generating biodiversity and a driving force for evolution. At present, the development of recombination hotspot prediction methods has encountered challenges related to insufficient feature extraction and limited generalization capabilities. This paper focused on the research of recombination hotspot prediction methods. We explored deep learning-based recombination hotspot prediction and scrutinized the shortcomings of prevalent models in addressing the challenge of recombination hotspot prediction. To addressing these deficiencies, an automated machine learning approach was utilized to construct recombination hotspot prediction model. The model combined sequence information with physicochemical properties by employing TF-IDF-Kmer and DNA composition components to acquire more effective feature data. Experimental results validate the effectiveness of the feature extraction method and automated machine learning technology used in this study. The final model was validated on three distinct datasets and yielded accuracy rates of 97.14%, 79.71%, and 98.73%, surpassing the current leading models by 2%, 2.56%, and 4%, respectively. In addition, we incorporated tools such as SHAP and AutoGluon to analyze the interpretability of black-box models, delved into the impact of individual features on the results, and investigated the reasons behind misclassification of samples. Finally, an application of recombination hotspot prediction was established to facilitate easy access to necessary information and tools for researchers. The research outcomes of this paper underscore the enormous potential of automated machine learning methods in gene sequence prediction.
Collapse
Affiliation(s)
- Dong-Xin Ye
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Jun-Wen Yu
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Rui Li
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Yu-Duo Hao
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Tian-Yu Wang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hui Yang
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China.
| | - Hui Ding
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| |
Collapse
|
2
|
Guo W, Liu X, Ma Y, Zhang R. iRspot-DCC: Recombination hot/ cold spots identification based on dinucleotide-based correlation coefficient and convolutional neural network. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2021. [DOI: 10.3233/jifs-210213] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022]
Abstract
The correct identification of gene recombination cold/hot spots is of great significance for studying meiotic recombination and genetic evolution. However, most of the existing recombination spots recognition methods ignore the global sequence information hidden in the DNA sequence, resulting in their low recognition accuracy. A computational predictor called iRSpot-DCC was proposed in this paper to improve the accuracy of cold/hot spots identification. In this approach, we propose a feature extraction method based on dinucleotide correlation coefficients that focus more on extracting potential DNA global sequence information. Then, 234 representative features vectors are filtered by SVM weight calculation. Finally, a convolutional neural network with better performance than SVM is selected as a classifier. The experimental results of 5-fold cross-validation test on two standard benchmark datasets showed that the prediction accuracy of our recognition method reached 95.11%, and the Mathew correlation coefficient (MCC) reaches 90.04%, outperforming most other methods. Therefore, iRspot-DCC is a high-precision cold/hot spots identification method for gene recombination, which effectively extracts potential global sequence information from DNA sequences.
Collapse
Affiliation(s)
- Wang Guo
- Chongqing Key Laboratory of Complex Systems and Bionic Control, Chongqing University of Posts and Telecommunications, Chongqing, China
| | - Xingmou Liu
- Chongqing Key Laboratory of Complex Systems and Bionic Control, Chongqing University of Posts and Telecommunications, Chongqing, China
| | - You Ma
- Chongqing Key Laboratory of Complex Systems and Bionic Control, Chongqing University of Posts and Telecommunications, Chongqing, China
| | - Rongjie Zhang
- Chongqing Key Laboratory of Complex Systems and Bionic Control, Chongqing University of Posts and Telecommunications, Chongqing, China
| |
Collapse
|
3
|
Haque HMF, Rafsanjani M, Arifin F, Adilina S, Shatabda S. SubFeat: Feature subspacing ensemble classifier for function prediction of DNA, RNA and protein sequences. Comput Biol Chem 2021; 92:107489. [PMID: 33932779 DOI: 10.1016/j.compbiolchem.2021.107489] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/21/2020] [Revised: 03/07/2021] [Accepted: 04/19/2021] [Indexed: 11/16/2022]
Abstract
The information of a cell is primarily contained in deoxyribonucleic acid (DNA). There is a flow of DNA information to protein sequences via ribonucleic acids (RNA) through transcription and translation. These entities are vital for the genetic process. Recent epigenetics developments also show the importance of the genetic material and knowledge of their attributes and functions. However, the growth in these entities' available features or functionalities is still slow due to the time-consuming and expensive in vitro experimental methods. In this paper, we have proposed an ensemble classification algorithm called SubFeat to predict biological entities' functionalities from different types of datasets. Our model uses a feature subspace-based novel ensemble method. It divides the feature space into sub-spaces, which are then passed to learn individual classifier models. The ensemble is built on these base classifiers that use a weighted majority voting mechanism. SubFeat tested on four datasets comprising two DNA, one RNA, and one protein dataset, and it outperformed all the existing single classifiers and the ensemble classifiers. SubFeat is made available as a Python-based tool. We have made the package SubFeat available online along with a user manual. It is freely accessible from here: https://github.com/fazlulhaquejony/SubFeat.
Collapse
Affiliation(s)
- H M Fazlul Haque
- Department of Computer Science and Engineering, United International University, United City, Madani Avenue, Badda, Dhaka 1212, Bangladesh
| | - Muhammod Rafsanjani
- Department of Computer Science and Engineering, United International University, United City, Madani Avenue, Badda, Dhaka 1212, Bangladesh
| | - Fariha Arifin
- Department of Computer Science and Engineering, United International University, United City, Madani Avenue, Badda, Dhaka 1212, Bangladesh
| | - Sheikh Adilina
- Department of Computer Science and Engineering, United International University, United City, Madani Avenue, Badda, Dhaka 1212, Bangladesh
| | - Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, United City, Madani Avenue, Badda, Dhaka 1212, Bangladesh.
| |
Collapse
|
4
|
Khan F, Khan M, Iqbal N, Khan S, Muhammad Khan D, Khan A, Wei DQ. Prediction of Recombination Spots Using Novel Hybrid Feature Extraction Method via Deep Learning Approach. Front Genet 2020; 11:539227. [PMID: 33093842 PMCID: PMC7527634 DOI: 10.3389/fgene.2020.539227] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2020] [Accepted: 08/13/2020] [Indexed: 01/20/2023] Open
Abstract
Meiotic recombination is the driving force of evolutionary development and an important source of genetic variation. The meiotic recombination does not take place randomly in a chromosome but occurs in some regions of the chromosome. A region in chromosomes with higher rate of meiotic recombination events are considered as hotspots and a region where frequencies of the recombination events are lower are called coldspots. Prediction of meiotic recombination spots provides useful information about the basic functionality of inheritance and genome diversity. This study proposes an intelligent computational predictor called iRSpots-DNN for the identification of recombination spots. The proposed predictor is based on a novel feature extraction method and an optimized deep neural network (DNN). The DNN was employed as a classification engine whereas, the novel features extraction method was developed to extract meaningful features for the identification of hotspots and coldspots across the yeast genome. Unlike previous algorithms, the proposed feature extraction avoids bias among different selected features and preserved the sequence discriminant properties along with the sequence-structure information simultaneously. This study also considered other effective classifiers named support vector machine (SVM), K-nearest neighbor (KNN), and random forest (RF) to predict recombination spots. Experimental results on a benchmark dataset with 10-fold cross-validation showed that iRSpots-DNN achieved the highest accuracy, i.e., 95.81%. Additionally, the performance of the proposed iRSpots-DNN is significantly better than the existing predictors on a benchmark dataset. The relevant benchmark dataset and source code are freely available at: https://github.com/Fatima-Khan12/iRspot_DNN/tree/master/iRspot_DNN.
Collapse
Affiliation(s)
- Fatima Khan
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
| | - Mukhtaj Khan
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
| | - Nadeem Iqbal
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
| | - Salman Khan
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
| | - Dost Muhammad Khan
- Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan
| | - Abbas Khan
- Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
| | - Dong-Qing Wei
- Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China.,State Key Laboratory of Microbial Metabolism, Shanghai-Islamabad-Belgrade Joint Innovation Center on Antibacterial Resistances, Joint Laboratory of International Cooperation in Metabolic and Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Ministry of Education, Shanghai, China.,Peng Cheng Laboratory, Shenzhen, China
| |
Collapse
|
5
|
|
6
|
Some illuminating remarks on molecular genetics and genomics as well as drug development. Mol Genet Genomics 2020; 295:261-274. [PMID: 31894399 DOI: 10.1007/s00438-019-01634-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2019] [Accepted: 12/05/2019] [Indexed: 02/07/2023]
Abstract
Facing the explosive growth of biological sequences unearthed in the post-genomic age, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector, but still keep it with considerable sequence-order information or its special pattern. To deal with such a challenging problem, the ideas of "pseudo amino acid components" and "pseudo K-tuple nucleotide composition" have been proposed. The ideas and their approaches have further stimulated the birth for "distorted key theory", "wenxing diagram", and substantially strengthening the power in treating the multi-label systems, as well as the establishment of the famous "5-steps rule". All these logic developments are quite natural that are very useful not only for theoretical scientists but also for experimental scientists in conducting genetics/genomics analysis and drug development. Presented in this review paper are also their future perspectives; i.e., their impacts will become even more significant and propounding.
Collapse
|
7
|
Shao YT, Liu XX, Lu Z, Chou KC. pLoc_Deep-mHum: Predict Subcellular Localization of Human Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/ns.2020.127042] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
8
|
Shao Y, Chou KC. pLoc_Deep-mEuk: Predict Subcellular Localization of Eukaryotic Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/ns.2020.126034] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
9
|
Chou KC. Advances in Predicting Subcellular Localization of Multi-label Proteins and its Implication for Developing Multi-target Drugs. Curr Med Chem 2019; 26:4918-4943. [PMID: 31060481 DOI: 10.2174/0929867326666190507082559] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2018] [Revised: 01/29/2019] [Accepted: 01/31/2019] [Indexed: 12/16/2022]
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most
of the functions critical to the cell’s survival are performed by these proteins located in its different
organelles, usually called ‘‘subcellular locations”. Information of subcellular localization
for a protein can provide useful clues about its function. To reveal the intricate pathways at the
cellular level, knowledge of the subcellular localization of proteins in a cell is prerequisite.
Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine
the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing
and selecting the right targets for drug development. Unfortunately, it is both timeconsuming
and costly to determine the subcellular locations of proteins purely based on experiments.
With the avalanche of protein sequences generated in the post-genomic age, it is highly
desired to develop computational methods for rapidly and effectively identifying the subcellular
locations of uncharacterized proteins based on their sequences information alone. Actually,
considerable progresses have been achieved in this regard. This review is focused on those
methods, which have the capacity to deal with multi-label proteins that may simultaneously
exist in two or more subcellular location sites. Protein molecules with this kind of characteristic
are vitally important for finding multi-target drugs, a current hot trend in drug development.
Focused in this review are also those methods that have use-friendly web-servers established so
that the majority of experimental scientists can use them to get the desired results without the
need to go through the detailed mathematics involved.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
10
|
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most
of the functions critical to the cell’s survival are performed by these proteins located in its different
organelles, usually called ‘‘subcellular locations”. Information of subcellular localization
for a protein can provide useful clues about its function. To reveal the intricate pathways at the
cellular level, knowledge of the subcellular localization of proteins in a cell is prerequisite.
Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine
the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing
and selecting the right targets for drug development. Unfortunately, it is both timeconsuming
and costly to determine the subcellular locations of proteins purely based on experiments.
With the avalanche of protein sequences generated in the post-genomic age, it is highly
desired to develop computational methods for rapidly and effectively identifying the subcellular
locations of uncharacterized proteins based on their sequences information alone. Actually,
considerable progresses have been achieved in this regard. This review is focused on those
methods, which have the capacity to deal with multi-label proteins that may simultaneously
exist in two or more subcellular location sites. Protein molecules with this kind of characteristic
are vitally important for finding multi-target drugs, a current hot trend in drug development.
Focused in this review are also those methods that have use-friendly web-servers established so
that the majority of experimental scientists can use them to get the desired results without the
need to go through the detailed mathematics involved.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
11
|
|
12
|
Jia J, Li X, Qiu W, Xiao X, Chou KC. iPPI-PseAAC(CGR): Identify protein-protein interactions by incorporating chaos game representation into PseAAC. J Theor Biol 2019; 460:195-203. [DOI: 10.1016/j.jtbi.2018.10.021] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2018] [Revised: 09/16/2018] [Accepted: 10/08/2018] [Indexed: 01/11/2023]
|
13
|
Ju Z, Wang SY. Prediction of S-sulfenylation sites using mRMR feature selection and fuzzy support vector machine algorithm. J Theor Biol 2018; 457:6-13. [DOI: 10.1016/j.jtbi.2018.08.022] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2018] [Revised: 08/07/2018] [Accepted: 08/15/2018] [Indexed: 11/29/2022]
|