1
Liu X, Teng L, Luo Y, Xu Y. Prediction of prokaryotic and eukaryotic promoters based on information-theoretic features. Biosystems 2023; 231:104979. [PMID: 37423595 DOI: 10.1016/j.biosystems.2023.104979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2022] [Revised: 07/06/2023] [Accepted: 07/07/2023] [Indexed: 07/11/2023]
Abstract
Promoters are DNA regulatory elements located near the transcription start site and are responsible for regulating the transcription of genes. DNA fragments arranged in a certain order form specific functional regions with different information contents. Information theory is the science that studies the extraction, measurement and transmission of information. The genetic information contained in DNA follows the general laws of information storage. Therefore, methods from information theory can be used to analyze promoters, which carry genetic information. In this study, we introduced the concepts of information theory to the study of promoter prediction. We used 107 features extracted with information-theoretic methods and a backpropagation neural network to build a classifier. Then, the trained classifier was applied to predict the promoters of 6 organisms. The average AUCs of the 6 organisms obtained by using hold-out validation and ten-fold cross-validation were 0.885 and 0.886, respectively. The results verified the effectiveness of information-theoretic features in promoter prediction. Considering the possible redundancy in the feature set, we performed feature selection and obtained key feature subsets related to promoter characteristics. The results indicate the potential utility of information-theoretic features in promoter prediction.
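As an illustration of what an information-theoretic sequence feature can look like, the sketch below computes the Shannon entropy of the k-mer distribution of a DNA fragment. The abstract does not list the 107 features used, so the function name `kmer_entropy` and the choice of k-mer entropy are assumptions made for this example only, not the authors' feature set.

```python
from collections import Counter
from math import log2


def kmer_entropy(seq: str, k: int = 1) -> float:
    """Shannon entropy (in bits) of the overlapping k-mer distribution of seq.

    Hypothetical helper for illustration; not taken from the cited paper.
    """
    seq = seq.upper()
    # Collect all overlapping k-mers of the sequence.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    # H = -sum(p * log2(p)) over the observed k-mers.
    return -sum((c / total) * log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # A fragment using all four bases equally has the maximum 1-mer entropy.
    print(kmer_entropy("ACGT"))
    # A homopolymer carries no information under this measure.
    print(kmer_entropy("AAAA"))
```

Values of this kind, computed for several k and for different windows around the transcription start site, could be concatenated into a feature vector for a classifier.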
Affiliation(s)
- Xiao Liu
- School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China.
- Li Teng
- School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China
- Yachuan Luo
- School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China
- Yuqiao Xu
- School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China
2
Affinity and Correlation in DNA. J 2022. [DOI: 10.3390/j5020016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
A statistical analysis of important DNA sequences and related proteins has been performed to study the relationships between monomers, and some general considerations about these macromolecules can be provided from the results. First, the most important relationship between sites in all the DNA sequences examined is that between two consecutive base pairs. This is an indication of an energetic stabilization due to the stacking interaction of these couples of base pairs. Secondly, the difference between human chromosome sequences and their coding parts is relevant both in the relationships between sites and in some specific compositional rules, such as the second Chargaff rule. Third, the evidence of the relationship in two successive triplets of DNA coding sequences generates a relationship between two successive amino acids in the proteins. This is obviously impossible if all the relationships between the sites are statistical evidence and do not involve causes; therefore, in this article, due to stacking interactions and this relationship in coding sequences, we will divide the concept of the relationship between sites into two concepts: affinity and correlation, the first with physical causes and the second without. Finally, from the statistical analyses carried out, it will emerge that the human genome is uniform, with the only significant exception being the Y chromosome.
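A relationship between two sites of a sequence can be quantified in several ways; one common choice is the mutual information between the bases at positions i and i + d. The sketch below is a minimal example under that assumption. The abstract does not say which statistic the authors used, and the function names `mutual_information` and `chargaff_deviation` are hypothetical. The second helper checks Chargaff's second rule, which states that within a single strand the counts of A and T, and of G and C, are approximately equal.

```python
from collections import Counter
from math import log2


def mutual_information(seq: str, d: int = 1) -> float:
    """Mutual information (bits) between bases separated by distance d."""
    pairs = [(seq[i], seq[i + d]) for i in range(len(seq) - d)]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    # I = sum p(a,b) * log2( p(a,b) / (p(a) * p(b)) )
    return sum(
        (c / n) * log2(c * n / (left[a] * right[b]))
        for (a, b), c in joint.items()
    )


def chargaff_deviation(seq: str) -> tuple[float, float]:
    """Relative single-strand imbalance |A-T|/(A+T) and |G-C|/(G+C)."""
    counts = Counter(seq.upper())

    def imbalance(x: str, y: str) -> float:
        total = counts[x] + counts[y]
        return abs(counts[x] - counts[y]) / total if total else 0.0

    return imbalance("A", "T"), imbalance("G", "C")


if __name__ == "__main__":
    # Strictly alternating bases: neighbours are strongly dependent.
    print(mutual_information("ACACACAC", d=1))
    # Balanced strand: both deviations are zero.
    print(chargaff_deviation("ATGCATGC"))
```

Comparing the value at d = 1 with values at larger d is one way to see whether consecutive positions stand out, which is the kind of observation the abstract reports for consecutive base pairs.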
3
Li J, Li H, Ye X, Zhang L, Xu Q, Ping Y, Jing X, Jiang W, Liao Q, Liu B, Wang Y. IIMLP: integrated information-entropy-based method for LncRNA prediction. BMC Bioinformatics 2021; 22:243. [PMID: 33980144 PMCID: PMC8117603 DOI: 10.1186/s12859-020-03884-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/15/2020] [Accepted: 11/17/2020] [Indexed: 11/26/2022] Open
Abstract
BACKGROUND The prediction of long non-coding RNA (lncRNA) has attracted great attention from researchers, as more and more evidence indicates that various complex human diseases are closely related to lncRNAs. In the era of biomedical big data, in addition to the prediction of lncRNAs by biological experimental methods, many computational methods based on machine learning have been proposed to make better use of the sequence resources of lncRNAs. RESULTS We developed an lncRNA prediction method by integrating information-entropy-based features and machine learning algorithms. We calculate generalized topological entropy and generate 6 novel features for lncRNA sequences. By employing these 6 features and other features such as open reading frame, we apply support vector machine, XGBoost and random forest algorithms to distinguish human lncRNAs. We compare our method with one that uses more k-mer features, and the results show that our method achieves a higher area under the curve, up to 99.7905%. CONCLUSIONS We developed an accurate and efficient method that uses novel information entropy features to analyze and classify lncRNAs. Our method is also extendable to research on other functional elements in DNA sequences.
Affiliation(s)
- Junyi Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China.
- Huinian Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
- Xiao Ye
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
- Li Zhang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
- Qingzhe Xu
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
- Yuan Ping
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
- Xiaozhu Jing
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
- Wei Jiang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
- Qing Liao
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
- Bo Liu
- Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, Heilongjiang, China
- Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China.
- Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, Heilongjiang, China.
4
Spatial constrains and information content of sub-genomic regions of the human genome. iScience 2021; 24:102048. [PMID: 33554061 PMCID: PMC7843455 DOI: 10.1016/j.isci.2021.102048] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/12/2020] [Revised: 11/30/2020] [Accepted: 01/06/2021] [Indexed: 02/08/2023] Open
Abstract
Complexity metrics and machine learning (ML) models have been utilized to analyze the lengths of segmental genomic entities of DNA sequences (exonic, intronic, intergenic, repeat, unique) in order to ask questions regarding the segmental organization of the human genome within the size distribution of these sequences. For this we developed an integrated methodology based on the reconstructed phase space theorem, the non-extensive statistical theory of Tsallis, ML techniques, and a technical index that integrates the generated information, which we introduce and name the complexity factor (COFA). Our analysis revealed that the size distributions of the genomic regions within chromosomes are not random but follow patterns with characteristic features that are seen through their complexity character, and that they are part of the dynamics of the whole genome. Finally, this picture of dynamics in DNA is recognized using ML tools for clustering, classification, and prediction with high accuracy.
- The lengths of DNA subgenomic entities satisfied the Tsallis non-extensive statistics
- The size distribution of the subgenomic entities within chromosomes follows specific patterns
- A technical index, COFA, was introduced to characterize the degree of complexity
- The degree of complexity behavior in DNA is identifiable using ML approaches
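For readers unfamiliar with Tsallis statistics, the non-extensive entropy of a discrete distribution p with index q is S_q = (1 - sum_i p_i^q) / (q - 1), which reduces to the Shannon entropy (in nats) as q approaches 1. The sketch below computes it for a normalized histogram, such as one built from segment lengths. The function name `tsallis_entropy` and the example histogram are assumptions for illustration; the abstract does not describe how the authors estimated q or how COFA is defined, so neither is attempted here.

```python
from math import log


def tsallis_entropy(probs: list[float], q: float) -> float:
    """Tsallis entropy S_q of a discrete probability distribution.

    probs must be non-negative and sum to 1. For q == 1 the
    Shannon entropy in nats is returned (the q -> 1 limit).
    """
    positive = [p for p in probs if p > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(p * log(p) for p in positive)
    return (1.0 - sum(p ** q for p in positive)) / (q - 1.0)


if __name__ == "__main__":
    # Hypothetical normalized histogram of segment lengths in four bins.
    hist = [0.4, 0.3, 0.2, 0.1]
    for q in (0.5, 1.0, 2.0):
        print(q, tsallis_entropy(hist, q))
```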
5
Ghosh SK, Ghosh A. A Novel Human Diabetes Biomarker Recognition Approach Using Fuzzy Rough Multigranulation Nearest Neighbour Classifier Model. Interdiscip Sci 2020; 12:461-475. [PMID: 32920773 DOI: 10.1007/s12539-020-00391-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/09/2020] [Revised: 08/22/2020] [Accepted: 08/31/2020] [Indexed: 10/23/2022]
Abstract
The selection of gene identifiers from microarray databases is a challenging task, since a microarray contains a large number of gene attributes for only a few samples. This article proposes a novel fuzzy-rough set-based gene expression feature selection method using a fuzzy-rough reduct under multi-granular space for human diabetes patients. Firstly, fuzzy multi-granular gain is computed from the expression datasets via fuzzy entropy, which reduces the dimension of the database. Thereafter, features are selected from the microarray using the fuzzy-rough reduct and information gain with respect to their expression patterns. To reduce the computational cost, a decision-making scheme has been designed using a rough approximation of a fuzzy concept in the multi-granulation framework. Finally, we have recognized the association among the genes that have significantly different expression patterns from the controlled state to the diabetic state with respect to their impression, using a modified fuzzy-rough nearest neighbour classifier (FRNNC). Five standard diabetic microarray datasets have been considered to quantify the efficiency of the designed FRNNC model, which is validated with the F measure using the diabetes gene expression NCBI database and performs better than existing methods.
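The information gain mentioned above can be illustrated in its ordinary (crisp) form: the reduction in class-label entropy obtained by splitting the samples on a discretized feature. The sketch below is that plain version only. It does not implement the fuzzy entropy, the fuzzy-rough reduct, or the FRNNC classifier of the paper, and the function names and the toy expression levels are assumptions for this example.

```python
from collections import Counter
from math import log2


def entropy(labels: list) -> float:
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values: list, labels: list) -> float:
    """Entropy of labels minus the weighted entropy after grouping by feature."""
    n = len(labels)
    groups: dict = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


if __name__ == "__main__":
    # Toy data: discretized expression level of two hypothetical genes
    # for four samples, with 1 = diabetic and 0 = control.
    labels = [1, 1, 0, 0]
    gene_a = ["high", "high", "low", "low"]   # separates the classes
    gene_b = ["high", "low", "high", "low"]   # carries no class information
    print(information_gain(gene_a, labels))
    print(information_gain(gene_b, labels))
```

Ranking genes by a score of this kind and keeping the top-scoring ones is a common filter-style feature selection step for data with many attributes and few samples.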
Affiliation(s)
- Swarup Kr Ghosh
- Department of Computer Science and Engineering, Sister Nivedita University, Kolkata, India.
- Anupam Ghosh
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, India
6
Identification of Regulatory SNPs Associated with Vicine and Convicine Content of Vicia faba Based on Genotyping by Sequencing Data Using Deep Learning. Genes (Basel) 2020; 11:genes11060614. [PMID: 32516876 PMCID: PMC7349281 DOI: 10.3390/genes11060614] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 04/29/2020] [Revised: 05/26/2020] [Accepted: 05/28/2020] [Indexed: 12/15/2022] Open
Abstract
Faba bean (Vicia faba) is a grain legume, which is globally grown for both human consumption as well as feed for livestock. Despite its agro-ecological importance the usage of Vicia faba is severely hampered by its anti-nutritive seed-compounds vicine and convicine (V+C). The genes responsible for a low V+C content have not yet been identified. In this study, we aim to computationally identify regulatory SNPs (rSNPs), i.e., SNPs in promoter regions of genes that are deemed to govern the V+C content of Vicia faba. For this purpose we first trained a deep learning model with the gene annotations of seven related species of the Leguminosae family. Applying our model, we predicted putative promoters in a partial genome of Vicia faba that we assembled from genotyping-by-sequencing (GBS) data. Exploiting the synteny between Medicago truncatula and Vicia faba, we identified two rSNPs which are statistically significantly associated with V+C content. In particular, the allele substitutions regarding these rSNPs result in dramatic changes of the binding sites of the transcription factors (TFs) MYB4, MYB61, and SQUA. The knowledge about TFs and their rSNPs may enhance our understanding of the regulatory programs controlling V+C content of Vicia faba and could provide new hypotheses for future breeding programs.