1
|
Rudenko V, Korotkov E. Study of Dispersed Repeats in the Cyanidioschyzon merolae Genome. Int J Mol Sci 2024; 25:4441. [PMID: 38674025 PMCID: PMC11050394 DOI: 10.3390/ijms25084441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2024] [Revised: 04/08/2024] [Accepted: 04/15/2024] [Indexed: 04/28/2024] Open
Abstract
In this study, we applied the iterative procedure (IP) method to search for families of highly diverged dispersed repeats in the genome of Cyanidioschyzon merolae, which contains over 16 million bases. The algorithm included the construction of position weight matrices (PWMs) for repeat families and the identification of more dispersed repeats based on the PWMs using dynamic programming. The results showed that the C. merolae genome contained 20 repeat families comprising a total of 33,938 dispersed repeats, which is significantly more than has been previously found using other methods. The repeats varied in length from 108 to 600 bp (522.54 bp on average) and occupied more than 72% of the C. merolae genome, whereas previously identified repeats, including tandem repeats, have been shown to constitute only about 28%. The high genomic content of dispersed repeats and their location in the coding regions suggest a significant role in the regulation of the functional activity of the genome.
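To make the PWM step concrete, the following is a minimal sketch of building a log-odds PWM from aligned family members and scanning a sequence with it. It is not the IP method itself: the pseudocount, the uniform background of 0.25, the toy sequences, and the function names are all assumptions, and the scan is ungapped, whereas the abstract describes a dynamic-programming search.

```python
import math

ALPHABET = "ACGT"

def build_pwm(aligned_seqs, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM (one dict per column) from equal-length sequences."""
    length = len(aligned_seqs[0])
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases get a finite weight.
        counts = {base: pseudocount for base in ALPHABET}
        for seq in aligned_seqs:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2((counts[base] / total) / background)
                    for base in ALPHABET})
    return pwm

def score_window(pwm, window):
    """Sum the column weights for an ungapped window of the same width as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring ungapped window."""
    width = len(pwm)
    return max((score_window(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

# Hypothetical toy family; real repeat families are hundreds of bases long.
family = ["ACGT", "ACGT", "ACGA", "ACGT"]
pwm = build_pwm(family)
score, start = best_hit(pwm, "TTTTACGTTTTT")
```

An ungapped scan keeps the example short; detecting highly diverged repeats with insertions and deletions would need an alignment of the sequence against the PWM instead.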
Affiliation(s)
- Valentina Rudenko
- Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Moscow 119071, Russia
|
2
|
Lavezzo GM, Lauretto MDS, Andrioli LPM, Machado-Lima A. Position Weight Matrix or Acyclic Probabilistic Finite Automaton: Which model to use? A decision rule inferred for the prediction of transcription factor binding sites. Genet Mol Biol 2024; 46:e20230048. [PMID: 38285430 PMCID: PMC10945726 DOI: 10.1590/1678-4685-gmb-2023-0048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Accepted: 10/18/2023] [Indexed: 01/30/2024] Open
Abstract
Prediction of transcription factor binding sites (TFBS) is an example of application of Bioinformatics where DNA molecules are represented as sequences of A, C, G and T symbols. The most used model in this problem is Position Weight Matrix (PWM). Notwithstanding the advantage of being simple, PWMs cannot capture dependency between nucleotide positions, which may affect prediction performance. Acyclic Probabilistic Finite Automata (APFA) is an alternative model able to accommodate position dependencies. However, APFA is a more complex model, which means more parameters have to be learned. In this paper, we propose an innovative method to identify when position dependencies influence preference for PWMs or APFAs. This implied using position dependency features extracted from 1106 sets of TFBS to infer a decision tree able to predict which is the best model - PWM or APFA - for a given set of TFBSs. According to our results, as few as three pinpointed features are able to choose the best model, providing a balance of performance (average precision) and model simplicity.
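One way to see what a "position dependency" means in practice is the mutual information between two columns of a set of aligned binding sites. The sketch below is an illustration only: the abstract does not say which dependency features were used, so the choice of mutual information, the function name, and the toy sites are assumptions.

```python
import math
from collections import Counter

def mutual_information(sites, i, j):
    """Mutual information (in bits) between columns i and j of aligned sites."""
    n = len(sites)
    count_i = Counter(site[i] for site in sites)
    count_j = Counter(site[j] for site in sites)
    count_ij = Counter((site[i], site[j]) for site in sites)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        # Compare the joint frequency with what independence would predict.
        mi += p_ab * math.log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi

# Hypothetical sites: columns 0 and 1 always agree, column 2 varies freely.
sites = ["AAT", "CCT", "AAG", "CCG"]
dependent = mutual_information(sites, 0, 1)    # columns move together
independent = mutual_information(sites, 0, 2)  # columns are unrelated
```

A PWM treats every column independently, so it cannot represent the coupling between columns 0 and 1 here; a model with position dependencies can, at the cost of more parameters.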
Affiliation(s)
- Guilherme Miura Lavezzo
- Universidade de São Paulo, Instituto de Matemática e Estatística, Programa Interunidades de Pós-Graduação em Bioinformática, São Paulo, SP, Brazil
- Ariane Machado-Lima
- Universidade de São Paulo, Escola de Artes, Ciências e Humanidades, São Paulo, SP, Brazil
|
3
|
Ali S, Bello B, Chourasia P, Punathil RT, Zhou Y, Patterson M. PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences. Biology (Basel) 2022; 11:418. [PMID: 35336792 DOI: 10.3390/biology11030418] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/04/2022] [Revised: 02/24/2022] [Accepted: 03/07/2022] [Indexed: 01/14/2023]
Abstract
Simple Summary: The family of coronaviruses comprises a diverse set of strains and variants which cause diseases from the common cold to COVID-19. Moreover, they infect a wide array of hosts from bats, camels, birds, to humans. Studying coronaviruses through the lens of host specificity provides a unique perspective to understanding the evolution, diversity and dynamics of this family. In particular, this can reveal groups of different hosts infected by similar strains, giving clues on strains which were more likely to have evolved to jump from one host to another. In this work, we frame host specificity as a classification task, in designing a very compact numerical representation of the spike sequences of different coronaviruses. Based on this numerical representation, classification methods are able to detect the target host with high accuracy. Such an approach can be used to efficiently scale to large volumes of sequences, in order to unveil trends in the host specificity of different coronavirus strains.
Abstract: The study of host specificity has important connections to the question about the origin of SARS-CoV-2 in humans which led to the COVID-19 pandemic, which remains an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona)viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating, and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is important in determining host specificity, since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among birds, bats, camels, swine, humans, and weasels, to name a few.
We propose a feature embedding based on the well-known position weight matrix (PWM), which we call PWM2Vec, and we use it to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications, such as determining protein function and identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs from viral sequences to generate fixed-length feature vector representations, and use them in the context of host classification. The results on real world data show that when using PWM2Vec, machine learning classifiers are able to perform comparably to the baseline models in terms of predictive performance and runtime—in some cases, the performance is better. We also measure the importance of different amino acids using information gain to show the amino acids which are important for predicting the host of a given coronavirus. Finally, we perform some statistical analyses on these results to show that our embedding is more compact than the embeddings of the baseline models.
|
4
|
Abstract
Multiple sequence alignment (MSA) is the basis for almost all sequence comparison and molecular phylogenetic inferences. Large-scale genomic analyses are typically associated with automated progressive MSA without subsequent manual adjustment, which itself is often error-prone because of the lack of a consistent and explicit criterion. Here, I outlined several commonly encountered alignment errors that cannot be avoided by progressive MSA for nucleotide, amino acid, and codon sequences. Methods that could be automated to fix such alignment errors were then presented. I emphasized the utility of position weight matrix as a new tool for MSA refinement and illustrated its usage by refining the MSA of nucleotide and amino acid sequences. The main advantages of the position weight matrix approach include (1) its use of information from all sequences, in contrast to other commonly used methods based on pairwise alignment scores and inconsistency measures, and (2) its speedy computation, making it suitable for a large number of long viral genomic sequences.
Affiliation(s)
- Xuhua Xia
- Department of Biology, University of Ottawa, Marie-Curie Private, Ottawa, ON K1N 9A7, Canada; Tel.: +1-613-562-5718
- Ottawa Institute of Systems Biology, University of Ottawa, Ottawa, ON K1H 8M5, Canada
|
5
|
Jin Y, Jiang J, Wang R, Qin ZS. Systematic Evaluation of DNA Sequence Variations on in vivo Transcription Factor Binding Affinity. Front Genet 2021; 12:667866. [PMID: 34567058 PMCID: PMC8458901 DOI: 10.3389/fgene.2021.667866] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/15/2021] [Accepted: 08/02/2021] [Indexed: 02/01/2023] Open
Abstract
The majority of the single nucleotide variants (SNVs) identified by genome-wide association studies (GWAS) fall outside of the protein-coding regions. Elucidating the functional implications of these variants has been a major challenge. A possible mechanism for functional non-coding variants is that they disrupt canonical transcription factor (TF) binding sites and thereby affect the in vivo binding of the TF. However, their impact varies since many positions within a TF binding motif are not well conserved. Therefore, simply annotating all variants located in putative TF binding sites may overestimate the functional impact of these SNVs. We conducted a comprehensive survey to study the effect of SNVs on TF binding affinity. A sequence-based machine learning method was used to estimate the change in binding affinity for each SNV located inside a putative motif site. From the results obtained on 18 TF binding motifs, we found that there is substantial variation in an SNV's impact on TF binding affinity. We found that only about 20% of SNVs located inside putative TF binding sites are likely to have a significant impact on TF-DNA binding.
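The point that an SNV's impact depends on how conserved its motif position is can be illustrated with a simple PWM score difference. This is a deliberate simplification and not the sequence-based machine learning method used in the study; the PWM weights, the site, and the function name below are hypothetical.

```python
def snv_score_change(pwm, site, offset, alt_base):
    """Change in PWM score when the base at `offset` in `site` becomes `alt_base`."""
    ref_base = site[offset]
    return pwm[offset][alt_base] - pwm[offset][ref_base]

# Hypothetical two-column log-odds PWM:
# column 0 is strongly conserved (A), column 1 is nearly uninformative.
toy_pwm = [
    {"A": 1.8, "C": -2.0, "G": -2.0, "T": -2.0},
    {"A": 0.1, "C": 0.0, "G": -0.1, "T": 0.0},
]

site = "AC"
delta_conserved = snv_score_change(toy_pwm, site, 0, "G")   # large drop in score
delta_degenerate = snv_score_change(toy_pwm, site, 1, "G")  # almost no change
```

Both variants fall inside the same putative site, yet only the one at the conserved position changes the score appreciably, which is why annotating every in-site variant as functional can overestimate impact.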
Affiliation(s)
- Yutong Jin
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, United States
- Jiahui Jiang
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, United States
- Ruixuan Wang
- College of Environmental Sciences and Engineering, Peking University, Beijing, China
- Zhaohui S Qin
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, United States
|
6
|
Yu CP, Kuo CH, Nelson CW, Chen CA, Soh ZT, Lin JJ, Hsiao RX, Chang CY, Li WH. Discovering unknown human and mouse transcription factor binding sites and their characteristics from ChIP-seq data. Proc Natl Acad Sci U S A 2021; 118:e2026754118. [PMID: 33975951 DOI: 10.1073/pnas.2026754118] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/14/2022] Open
Abstract
Transcription factor binding sites (TFBSs) are essential for gene regulation, but the number of known TFBSs remains limited. We aimed to discover and characterize unknown TFBSs by developing a computational pipeline for analyzing ChIP-seq (chromatin immunoprecipitation followed by sequencing) data. Applying it to the latest ENCODE ChIP-seq data for human and mouse, we found that using the irreproducible discovery rate as a quality-control criterion resulted in many experiments being unnecessarily discarded. By contrast, the number of motif occurrences in ChIP-seq peak regions provides a highly effective criterion, which is reliable even if supported by only one experimental replicate. In total, we obtained 2,058 motifs from 1,089 experiments for 354 human TFs and 163 motifs from 101 experiments for 34 mouse TFs. Among these motifs, 487 have not previously been reported. Mapping the canonical motifs to the human genome reveals a high TFBS density ±2 kb around transcription start sites (TSSs) with a peak at -50 bp. On average, a promoter contains 5.7 TFBSs. However, 70% of TFBSs are in introns (41%) and intergenic regions (29%), whereas only 12% are in promoters (-1 kb to +100 bp from TSSs). Notably, some TFs (e.g., CTCF, JUN, JUNB, and NFE2) have motifs enriched in intergenic regions, including enhancers. We inferred 142 cobinding TF pairs and 186 (including 115 completely) tethered binding TF pairs, indicating frequent interactions between TFs and a higher frequency of tethered binding than cobinding. This study provides a large number of previously undocumented motifs and insights into the biological and genomic features of TFBSs.
|
7
|
Cui Y, Xu Z, Li J. [Identification of nucleosome positioning using support vector machine method based on comprehensive DNA sequence feature]. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi 2020; 37:496-501. [PMID: 32597092 PMCID: PMC10319573 DOI: 10.7507/1001-5515.201911064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2019] [Indexed: 11/03/2022]
Abstract
In this article, a model for nucleosome sequences was constructed based on Z-curve theory and the position weight matrix (PWM). The nucleosome sequence dataset was transformed into three-dimensional coordinates, the PWM of the nucleosome sequences was calculated, and a similarity score was obtained. After integrating them, a nucleosome feature model based on comprehensive DNA sequence features was obtained and named CSeqFM. We calculated the Euclidean distance between nucleosome sequence candidates or linker sequences and the CSeqFM model as the feature dataset, and fed the feature datasets into a support vector machine (SVM) for training and testing by ten-fold cross-validation. The results showed that the sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC) of identifying nucleosome positioning for S. cerevisiae were 97.1%, 96.9%, 94.2% and 0.89, respectively, and the area under the receiver operating characteristic curve (AUC) was 0.9801. Compared with another Z-curve method, our method identified nucleosome positions more effectively and performed better on every evaluation measure. The CSeqFM method was then applied to identify nucleosome positioning for three other species: C. elegans, H. sapiens and D. melanogaster. The AUCs for the three species were all higher than 0.90, and CSeqFM also showed better stability and effectiveness than the iNuc-STNC and iNuc-PseKNC methods, which further demonstrates that the CSeqFM method has strong reliability and good identification performance.
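The transformation of a DNA sequence into three-dimensional coordinates can be sketched with the standard Z-curve definition, in which the three axes separate purines from pyrimidines, amino from keto bases, and weak from strong hydrogen-bonding bases. How CSeqFM combines these coordinates with the PWM score is not described in the abstract, so only the coordinate mapping is shown; the function name is an assumption.

```python
# Per-base steps along the three Z-curve axes:
#   x: purine (A, G) vs pyrimidine (C, T)
#   y: amino (A, C) vs keto (G, T)
#   z: weak H-bond (A, T) vs strong H-bond (G, C)
Z_STEPS = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "T": (-1, -1, 1),
}

def z_curve(sequence):
    """Return the cumulative (x, y, z) Z-curve coordinates for each prefix."""
    x = y = z = 0
    coords = []
    for base in sequence.upper():
        dx, dy, dz = Z_STEPS[base]
        x, y, z = x + dx, y + dy, z + dz
        coords.append((x, y, z))
    return coords

points = z_curve("ACGT")
```

Each sequence of length n maps to n points, so sequences of equal length (such as fixed-size nucleosome candidates) yield coordinate vectors that can be compared by Euclidean distance.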
Affiliation(s)
- Ying Cui (崔颖)
- Electronic Engineering College, Heilongjiang University, Harbin 150080, P.R. China
- School of Bioinformatics Sciences and Technology, Harbin Medical University, Harbin 150081, P.R. China
- Zelong Xu (徐泽龙)
- Electronic Engineering College, Heilongjiang University, Harbin 150080, P.R. China
- Jianzhong Li (李建中)
- Electronic Engineering College, Heilongjiang University, Harbin 150080, P.R. China
- School of Bioinformatics Sciences and Technology, Harbin Medical University, Harbin 150081, P.R. China
|
8
|
Hu X, Feng Z, Zhang X, Liu L, Wang S. The Identification of Metal Ion Ligand-Binding Residues by Adding the Reclassified Relative Solvent Accessibility. Front Genet 2020; 11:214. [PMID: 32265982 PMCID: PMC7096583 DOI: 10.3389/fgene.2020.00214] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/09/2019] [Accepted: 02/24/2020] [Indexed: 11/13/2022] Open
Abstract
Many proteins realize their special functions by binding with specific metal ion ligands during a cell's life cycle. The ability to correctly identify metal ion ligand-binding residues is valuable for human health and the design of molecular drugs. Precisely identifying these residues, however, remains challenging. We present an improved computational approach for predicting the binding residues of 10 metal ion ligands (Zn2+, Cu2+, Fe2+, Fe3+, Co2+, Ca2+, Mg2+, Mn2+, Na+, and K+) by adding reclassified relative solvent accessibility (RSA). The best accuracy of fivefold cross-validation was higher than 77.9%, which was about 16% higher than the previous result on the same dataset. It was found that different reclassifications of the RSA information can make different contributions to the identification of specific ligand-binding residues. Our study provides an additional understanding of the effect of the RSA on the identification of metal ion ligand-binding residues.
Affiliation(s)
- Zhenxing Feng
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Xiaojin Zhang
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
|
9
|
Cui Y, Xu Z, Li J. ZCMM: A Novel Method Using Z-Curve Theory- Based and Position Weight Matrix for Predicting Nucleosome Positioning. Genes (Basel) 2019; 10:E765. [PMID: 31569414 DOI: 10.3390/genes10100765] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/17/2019] [Revised: 09/25/2019] [Accepted: 09/26/2019] [Indexed: 02/04/2023] Open
Abstract
Nucleosomes are the basic units of eukaryotes. The accurate positioning of nucleosomes plays a significant role in understanding many biological processes such as transcriptional regulation mechanisms and DNA replication and repair. Here, we describe the development of a novel method, termed ZCMM, based on Z-curve theory and position weight matrix (PWM). The ZCMM was trained and tested using the nucleosomal and linker sequences determined by support vector machine (SVM) in Saccharomyces cerevisiae (S. cerevisiae), and experimental results showed that the sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews correlation coefficient (MCC) values for ZCMM were 91.40%, 96.56%, 96.75%, and 0.88, respectively, and the average area under the receiver operating characteristic curve (AUC) value was 0.972. A ZCMM predictor was developed to predict nucleosome positioning in Homo sapiens (H. sapiens), Caenorhabditis elegans (C. elegans), and Drosophila melanogaster (D. melanogaster) genomes, and the accuracy (Acc) values were 77.72%, 85.34%, and 93.62%, respectively. The maximum AUC values of the four species were 0.982, 0.861, 0.912 and 0.911, respectively. Another independent dataset for S. cerevisiae was used to predict nucleosome positioning. Compared with the results of Wu's method, the Sn, Sp, Acc, and MCC of the ZCMM results for S. cerevisiae were all higher, reaching 96.72%, 96.54%, 94.10%, and 0.88. Compared with Guo's method 'iNuc-PseKNC', the results of ZCMM for D. melanogaster were better. Meanwhile, the ZCMM was compared with some in vitro and in vivo experimental data for S. cerevisiae, and the results showed that the nucleosomes predicted by ZCMM were highly consistent with those confirmed by these experiments. Therefore, it was further confirmed that the ZCMM method has good accuracy and reliability in predicting nucleosome positioning.
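For readers comparing the reported Sn, Sp, Acc, and MCC values, the sketch below shows how these four measures are computed from a binary confusion matrix using their standard definitions. The counts are hypothetical and are not taken from the paper; the function name is an assumption.

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy, MCC) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # fraction of true nucleosomes recovered
    specificity = tn / (tn + fp)          # fraction of linkers correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, mcc

# Hypothetical counts: 100 nucleosomal and 100 linker sequences.
sn, sp, acc, mcc = classification_metrics(tp=90, fn=10, tn=80, fp=20)
```

Because accuracy is a weighted average of sensitivity and specificity, it always lies between the two, which is a quick sanity check when reading reported results.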
|
10
|
Townley RA, Bülow HE. Deciphering functional glycosaminoglycan motifs in development. Curr Opin Struct Biol 2018; 50:144-154. [PMID: 29579579 PMCID: PMC6078790 DOI: 10.1016/j.sbi.2018.03.011] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Received: 10/25/2017] [Revised: 03/07/2018] [Accepted: 03/08/2018] [Indexed: 01/12/2023]
Abstract
Glycosaminoglycans (GAGs) such as heparan sulfate, chondroitin/dermatan sulfate, and keratan sulfate are linear glycans, which when attached to protein backbones form proteoglycans. GAGs are essential components of the extracellular space in metazoans. Extensive modifications of the glycans such as sulfation, deacetylation and epimerization create structural GAG motifs. These motifs regulate protein-protein interactions and are thereby responsible for many of the essential functions of GAGs. This review focusses on recent genetic approaches to characterize GAG motifs and their function in defined signaling pathways during development. We discuss a coding approach for GAGs that would enable computational analyses of GAG sequences such as alignments and the computation of position weight matrices to describe GAG motifs.
Affiliation(s)
- Robert A Townley
- Department of Biological Sciences, Columbia University, New York, NY 10027, United States
- Hannes E Bülow
- Department of Genetics, Albert Einstein College of Medicine, Bronx, NY 10461, United States; Dominick P. Purpura Department of Neuroscience, Albert Einstein College of Medicine, Bronx, NY 10461, United States
|
11
|
Javed M, Solanki M, Sinha A, Shukla LI. Position Based Nucleotide Analysis of miR168 Family in Higher Plants and its Targets in Mammalian Transcripts. Microrna 2018; 6:136-142. [PMID: 28215140 DOI: 10.2174/2211536606666170215154151] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 11/18/2016] [Revised: 01/20/2017] [Accepted: 02/10/2017] [Indexed: 11/22/2022]
Abstract
BACKGROUND: miRNAs are post-transcriptional regulators of genes. The conserved miR168 family is evaluated for position-based nucleotide preference in higher plants. Low density lipoprotein receptor adaptor protein 1 (LDLRAP1), a validated target of miR168a from rice, is reported.
METHODS: The mature miRNA sequences, including miR168-5p and miR168-3p, were obtained from miRBase (v21, June 2014) for 15 families (28 plants). The preferred position-based nucleotide sequences were obtained using the Data Analysis in Molecular Biology and Evolution software. miR168-5p was subjected to cross-kingdom analysis using psRNATarget. Target expression and functional annotation were analyzed using the Human Protein Atlas database and the WEB-based Gene SeT AnaLysis Toolkit.
RESULTS: miR168-5p shows the same nucleotides at positions 1-6, 8-9, 11-12, 15-17 and 19. miR168-3p, which is present in 3 families (10 plants), shows the same nucleotide at positions 1-11, 13-15 and 17-21. In total, 123 targets were identified in the human transcriptome, showing 58% cleavage and 41% translation repression. LDLRAP1, the validated target of miR168a from rice, could also be targeted by miR168 from any other plant source. The 10 randomly selected targets include some important genes such as RPL34, ATXN1, AKAPI3 and ALS2, which are involved in transcription, cell trafficking, cell metabolism and neurodegenerative disorders.
CONCLUSION: Our work suggests that the miR168 family has a conserved sequence in higher plants. The seed region (positions 2-8) shows 70-95% pairing with human targets. The cleavage site (positions 10-14), analysed for base preference with the targets, showed 80-96% Watson-Crick pairing.
Affiliation(s)
- Mohammed Javed
- Department of Biotechnology, School of Life Sciences, Pondicherry University, Kalapet, Puducherry 605014, India
- Manish Solanki
- Department of Biotechnology, School of Life Sciences, Pondicherry University, Kalapet, Puducherry 605014, India
- Anshika Sinha
- Department of Biotechnology, School of Life Sciences, Pondicherry University, Kalapet, Puducherry 605014, India
- Lata I Shukla
- Department of Biotechnology, School of Life Sciences, Pondicherry University, Kalapet, Puducherry 605014, India
|
12
|
Zhang N, Rao RSP, Salvato F, Havelund JF, Møller IM, Thelen JJ, Xu D. MU-LOC: A Machine-Learning Method for Predicting Mitochondrially Localized Proteins in Plants. Front Plant Sci 2018; 9:634. [PMID: 29875778 PMCID: PMC5974146 DOI: 10.3389/fpls.2018.00634] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 02/19/2018] [Accepted: 04/23/2018] [Indexed: 05/19/2023]
Abstract
Targeting and translocation of proteins to the appropriate subcellular compartments are crucial for cell organization and function. Newly synthesized proteins are transported to mitochondria with the assistance of complex targeting sequences containing either an N-terminal pre-sequence or a multitude of internal signals. Compared with experimental approaches, computational predictions provide an efficient way to infer subcellular localization of a protein. However, it is still challenging to predict plant mitochondrially localized proteins accurately due to various limitations. Consequently, the performance of current tools can be improved with new data and new machine-learning methods. We present MU-LOC, a novel computational approach for large-scale prediction of plant mitochondrial proteins. We collected a comprehensive dataset of plant subcellular localization, extracted features including amino acid composition, protein position weight matrix, and gene co-expression information, and trained predictors using deep neural network and support vector machine. Benchmarked on two independent datasets, MU-LOC achieved substantial improvements over six state-of-the-art tools for plant mitochondrial targeting prediction. In addition, MU-LOC has the advantage of predicting plant mitochondrial proteins either possessing or lacking N-terminal pre-sequences. We applied MU-LOC to predict candidate mitochondrial proteins for the whole proteome of Arabidopsis and potato. MU-LOC is publicly available at http://mu-loc.org.
Affiliation(s)
- Ning Zhang
- Informatics Institute, University of Missouri, Columbia, MO, United States
- Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- R. S. P. Rao
- Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Department of Biochemistry, University of Missouri, Columbia, MO, United States
- Fernanda Salvato
- Department of Biochemistry, University of Missouri, Columbia, MO, United States
- Jesper F. Havelund
- Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark
- Ian M. Møller
- Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark
- Jay J. Thelen
- Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Department of Biochemistry, University of Missouri, Columbia, MO, United States
- Dong Xu
- Informatics Institute, University of Missouri, Columbia, MO, United States
- Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
- *Correspondence: Dong Xu
|
13
|
Dresch JM, Zellers RG, Bork DK, Drewell RA. Nucleotide Interdependency in Transcription Factor Binding Sites in the Drosophila Genome. Gene Regul Syst Bio 2016; 10:21-33. [PMID: 27330274 PMCID: PMC4907338 DOI: 10.4137/grsb.s38462] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 02/05/2016] [Revised: 04/17/2016] [Accepted: 04/28/2016] [Indexed: 01/14/2023]
Abstract
A long-standing objective in modern biology is to characterize the molecular components that drive the development of an organism. At the heart of eukaryotic development lies gene regulation. On the molecular level, much of the research in this field has focused on the binding of transcription factors (TFs) to regulatory regions in the genome known as cis-regulatory modules (CRMs). However, relatively little is known about the sequence-specific binding preferences of many TFs, especially with respect to the possible interdependencies between the nucleotides that make up binding sites. A particular limitation of many existing algorithms that aim to predict binding site sequences is that they do not allow for dependencies between nonadjacent nucleotides. In this study, we use a recently developed computational algorithm, MARZ, to compare binding site sequences using 32 distinct models in a systematic and unbiased approach to explore nucleotide dependencies within binding sites for 15 distinct TFs known to be critical to Drosophila development. Our results indicate that many of these proteins have varying levels of nucleotide interdependencies within their DNA recognition sequences, and that, in some cases, models that account for these dependencies greatly outperform traditional models that are used to predict binding sites. We also directly compare the ability of different models to identify the known KRUPPEL TF binding sites in CRMs and demonstrate that a more complex model that accounts for nucleotide interdependencies performs better when compared with simple models. This ability to identify TFs with critical nucleotide interdependencies in their binding sites will lead to a deeper understanding of how these molecular characteristics contribute to the architecture of CRMs and the precise regulation of transcription during organismal development.
Affiliation(s)
- Jacqueline M. Dresch
- Department of Mathematics and Computer Science, Clark University, Worcester, MA, USA
- Rowan G. Zellers
- Computer Science Department, Harvey Mudd College, Claremont, CA, USA
- Mathematics Department, Harvey Mudd College, Claremont, CA, USA
- Daniel K. Bork
- Computer Science Department, Harvey Mudd College, Claremont, CA, USA
- Mathematics Department, Harvey Mudd College, Claremont, CA, USA
|
14
|
Zemlyanskaya EV, Levitsky VG, Oshchepkov DY, Grosse I, Mironova VV. The Interplay of Chromatin Landscape and DNA-Binding Context Suggests Distinct Modes of EIN3 Regulation in Arabidopsis thaliana. Front Plant Sci 2016; 7:2044. [PMID: 28119721 PMCID: PMC5220190 DOI: 10.3389/fpls.2016.02044] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 10/20/2016] [Accepted: 12/21/2016] [Indexed: 05/08/2023]
Abstract
The plant hormone ethylene regulates numerous developmental processes and stress responses. Ethylene signaling proceeds via a linear pathway, which activates transcription factor (TF) EIN3, a primary transcriptional regulator of ethylene response. EIN3 influences gene expression upon binding to a specific sequence in gene promoters. This interaction, however, might be considerably affected by additional co-factors. In this work, we perform a whole-genome bioinformatics study to identify the impact of epigenetic factors on EIN3 functioning. The analysis of publicly available ChIP-Seq data on EIN3 binding in Arabidopsis thaliana showed a bimodal distribution of EIN3 binding regions (EBRs) in gene promoters. Besides a sharp peak in close proximity to the transcription start site, which is a common binding region for a wide variety of TFs, we found an additional extended peak in the distal promoter region. We characterized all EBRs with respect to their epigenetic status, drawing on a previously published genome-wide map of nine chromatin states in A. thaliana. We found that the implicit distal peak was associated with a specific chromatin state (referred to as chromatin state 4 in the primary source), which was only poorly represented in the pronounced proximal peak. Intriguingly, EBRs corresponding to chromatin state 4 were significantly associated with ethylene response, unlike the others, which represent the overwhelming majority of EBRs and are related to the explicit proximal peak. Moreover, we found that specific EIN3 binding sequences predicted with a previously described model were enriched in the EBRs mapped to chromatin state 4, but not in the remaining ones. These results allow us to conclude that the interplay of genetic and epigenetic factors might cause the distinct modes of EIN3 regulation.
Affiliation(s)
- Elena V. Zemlyanskaya
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences (SB RAS), Novosibirsk, Russia
- Department of Natural Sciences, Novosibirsk State University, Novosibirsk, Russia
- *Correspondence: Elena V. Zemlyanskaya,
| | - Victor G. Levitsky
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences (SB RAS), Novosibirsk, Russia
- Department of Natural Sciences, Novosibirsk State University, Novosibirsk, Russia
| | - Dmitry Y. Oshchepkov
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences (SB RAS), Novosibirsk, Russia
| | - Ivo Grosse
- Department of Natural Sciences, Novosibirsk State University, Novosibirsk, Russia
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
| | - Victoria V. Mironova
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences (SB RAS), Novosibirsk, Russia
- Department of Natural Sciences, Novosibirsk State University, Novosibirsk, Russia
| |
|
15
|
Nettling M, Treutler H, Grau J, Keilwagen J, Posch S, Grosse I. DiffLogo: a comparative visualization of sequence motifs. BMC Bioinformatics 2015; 16:387. [PMID: 26577052 PMCID: PMC4650857 DOI: 10.1186/s12859-015-0767-x] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.0] [Received: 04/10/2015] [Accepted: 10/08/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND For three decades, sequence logos have been the de facto standard for the visualization of sequence motifs in biology and bioinformatics. Reasons for this success story are their simplicity and clarity. The number of inferred and published motifs grows with the number of data sets and motif extraction algorithms. Hence, it becomes more and more important to perceive differences between motifs. However, motif differences are hard to detect from individual sequence logos in the case of multiple motifs for one transcription factor, highly similar binding motifs of different transcription factors, or multiple motifs for one protein domain. RESULTS Here, we present DiffLogo, a freely available, extensible, and user-friendly R package for visualizing motif differences. DiffLogo is capable of showing differences between DNA motifs as well as protein motifs in a pair-wise manner, resulting in publication-ready figures. In the case of more than two motifs, DiffLogo is capable of visualizing pair-wise differences in a tabular form. Here, the motifs are ordered by similarity, and the difference logos are colored for clarity. We demonstrate the benefit of DiffLogo on CTCF motifs from different human cell lines, on E-box motifs of three basic helix-loop-helix transcription factors as examples for the comparison of DNA motifs, and on F-box domains from three different families as an example for the comparison of protein motifs. CONCLUSIONS DiffLogo provides an intuitive visualization of motif differences. It enables the illustration and investigation of differences between highly similar motifs such as binding patterns of transcription factors for different cell types, treatments, and algorithmic approaches.
Affiliation(s)
- Martin Nettling
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.
| | - Hendrik Treutler
- Leibniz Institute of Plant Biochemistry, Halle (Saale), Germany.
| | - Jan Grau
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.
| | - Jens Keilwagen
- Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI), Federal Research Centre for Cultivated Plants, Quedlinburg, Germany.
| | - Stefan Posch
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.
| | - Ivo Grosse
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany.
| |
|
16
|
Bragin EY, Shtratnikova VY, Dovbnya DV, Schelkunov MI, Pekov YA, Malakho SG, Egorova OV, Ivashina TV, Sokolov SL, Ashapkin VV, Donova MV. Comparative analysis of genes encoding key steroid core oxidation enzymes in fast-growing Mycobacterium spp. strains. J Steroid Biochem Mol Biol 2013; 138:41-53. [PMID: 23474435 DOI: 10.1016/j.jsbmb.2013.02.016] [Citation(s) in RCA: 51] [Impact Index Per Article: 4.6] [Received: 10/24/2012] [Revised: 01/28/2013] [Accepted: 02/24/2013] [Indexed: 11/27/2022]
Abstract
A comparative genome analysis of Mycobacterium spp. VKM Ac-1815D, 1816D and 1817D strains used for efficient production of key steroid intermediates (androst-4-ene-3,17-dione, AD, androsta-1,4-diene-3,17-dione, ADD, 9α-hydroxy androst-4-ene-3,17-dione, 9-OH-AD) from phytosterol has been carried out by deep sequencing. The assembled contig sequences were analyzed for the presence of putative genes of steroid catabolism pathways. Since 3-ketosteroid-9α-hydroxylases (KSH) and 3-ketosteroid-Δ(1)-dehydrogenase (Δ(1) KSTD) play a key role in steroid core oxidation, special attention was paid to the genes encoding these enzymes. At least three genes of Δ(1) KSTD (kstD), five genes of KSH subunit A (kshA), and one gene of KSH subunit B of 3-ketosteroid-9α-hydroxylases (kshB) have been found in Mycobacterium sp. VKM Ac-1817D. Strains of Mycobacterium spp. VKM Ac-1815D and 1816D were found to possess at least one kstD, one kshB and two kshA genes. The assembled genome sequence of Mycobacterium sp. VKM Ac-1817D differs from those of the 1815D and 1816D strains, whereas these last two are nearly identical, differing by 13 single nucleotide substitutions (SNPs). One of these SNPs is located in the coding region of a kstD gene and corresponds to an amino acid substitution Lys (135) in 1816D for Ser (135) in 1815D. The findings may be useful for targeted genetic engineering of the biocatalysts for biotechnological application.
Key Words
- 2,3-dehydroxyphenyl dioxygenase
- 2-enoyl acyl-CoA hydratase
- 2-hydroxypenta-2,4-dienoate hydratase
- 3,4-dihydroxy-9,10-secoandrosta-1,3,5(10)-triene-9,17-dione 4,5-dioxygenase
- 3-hydroxy-9,10-secoandrosta-1,3,5(10)-triene-9,17-dione monooxygenase
- 3-hydroxy-9,10-secoandrosta-1,3,5(10)-triene-9,17-dione monooxygenase subunit
- 3-ketosteroid-9α-hydroxylase
- 3-ketosteroid-Δ(1)-dehydrogenase
- 3β-hydroxysteroid-dehydrogenase
- 4,5:9,10-diseco-3-hydroxy-5,9,17-trioxoandrosta-1(10),2-diene-4-oate hydrolase
- 4-hydroxy-2-oxovalerate aldolase
- 9-OH-AD
- 9α-hydroxy androst-4-ene-3,17-dione
- AD
- ADD
- Androst-1,4-diene-3,17-dione
- Androst-4-ene-3,17-dione
- BWA
- Burrows-Wheeler Aligner
- CTAB
- ChoX
- ChoX(D,E)
- EchA19
- FAD
- FadA5
- FadD17
- FadD19
- FadE26
- FadE27
- FadE28
- Genome sequencing
- HSD
- HTH-type transcriptional repressor
- HsaA
- HsaAB
- HsaB
- HsaC
- HsaD
- HsaE
- HsaF
- HsaG
- Hsd4A
- Hsd4B
- KSH
- KshA
- KshB
- KstR
- KstR2
- Ltp2
- Ltp3
- Ltp4
- Mycobacterium
- ORFs
- PWM
- Phytosterol
- SNP
- Steroid bioconversion
- TesB
- YrbE4A
- YrbE4B
- acetaldehyde dehydrogenase
- acetyl-CoA acetyltransferase
- acyl-CoA dehydrogenase
- acyl-CoA synthetase
- acyl-CoA thioesterase II
- androst-4-ene-3,17-dione
- androsta-1,4-diene-3,17-dione
- base pair
- bp
- cetyl trimethyl ammonium bromide
- cholesterol oxidase
- enoyl-CoA hydratase
- flavin adenine dinucleotide
- hydroxysteroid dehydrogenase
- integral membrane protein
- lipid transfer protein 4 (keto acyl-CoA thiolase)
- lipid-transfer protein 2
- lipid-transfer protein 3 (acetyl-CoA acetyltransferase)
- open reading frames
- position weight matrix
- single nucleotide substitution
- subunit A of 3-ketosteroid-9α-hydroxylase
- subunit B of 3-ketosteroid-9α-hydroxylases
- Δ(1) KSTD
Affiliation(s)
- E Yu Bragin
- Center of Innovations and Technologies "Biological Active Compounds and Their Applications", Russian Academy of Sciences, Moscow 119991, Russian Federation; G.K.Skryabin Institute of Biochemistry & Physiology of Microorganisms, Russian Academy of Sciences, Pushchino, Moscow Region, Russian Federation.
|
17
|
Nandi S, Ioshikhes I. Optimizing the GATA-3 position weight matrix to improve the identification of novel binding sites. BMC Genomics 2012; 13:416. [PMID: 22913572 PMCID: PMC3481455 DOI: 10.1186/1471-2164-13-416] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 11/08/2011] [Accepted: 08/02/2012] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND Identifying binding sites for transcription factors is a key component of gene regulatory network analysis. This is often done using position-weight matrices (PWMs). Because of the importance of in silico mapping of tentative binding sites, we previously developed an approach for PWM optimization that substantially improves the accuracy of such mapping. RESULTS The present work applies the optimization algorithm to the existing PWM for the GATA-3 transcription factor and builds a new di-nucleotide PWM. The existing available PWM is based on experimental data adopted from Jaspar. The optimized PWM substantially improves the sensitivity and specificity of the TF mapping compared to the conventional applications. The refined PWM also facilitates in silico identification of novel binding sites that are supported by experimental data. We also describe uncommon positioning of binding motifs for several T-cell lineage-specific factors in human promoters. CONCLUSION Our proposed di-nucleotide PWM approach outperforms the conventional mono-nucleotide PWM approach with respect to GATA-3. Therefore, our new di-nucleotide PWM provides new insight into plausible transcriptional regulatory interactions in human promoters.
Affiliation(s)
- Soumyadeep Nandi
- Ottawa Institute of Systems Biology and Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada
| | - Ilya Ioshikhes
- Ottawa Institute of Systems Biology and Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada
| |
|
18
|
Sinha S, Ling X, Whitfield CW, Zhai C, Robinson GE. Genome scan for cis-regulatory DNA motifs associated with social behavior in honey bees. Proc Natl Acad Sci U S A 2006; 103:16352-7. [PMID: 17065326 PMCID: PMC1637586 DOI: 10.1073/pnas.0607448103] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.8] [Indexed: 12/23/2022] Open
Abstract
Honey bees (Apis mellifera) undergo an age-related, socially regulated transition from working in the hive to foraging, which is associated with changes in the expression of thousands of genes in the brain. To begin to study the cis-regulatory code underlying this massive social regulation of gene expression, we used the newly sequenced honey bee genome to scan the promoter regions of eight sets of behaviorally related genes differentially expressed in the brain in the context of division of labor among worker bees, for 41 cis-regulatory motifs previously characterized in Drosophila melanogaster. Binding sites for the transcription factors Hairy, GAGA, Adf1, Cf1, Snail, and Dri, known to function in nervous system development, olfactory learning, or hormone binding in Drosophila, were significantly associated with one or more gene sets. The presence of some binding sites also predicted expression patterns for as many as 71% of the genes in some gene sets. These results suggest that there is a robust relationship between cis and social regulation of brain gene expression, especially considering that we studied <15% of all known transcription factors. These results also suggest that transcriptional networks involved in the regulation of development in Drosophila are used to regulate behavioral development in adult honey bees. However, differences in gene regulation between these two processes are suggested by the finding that the promoter regions for the behaviorally related bee genes differed in both motif occurrence and G/C content relative to their Drosophila orthologs.
Affiliation(s)
| | - Xu Ling
- Departments of *Computer Science and
| | - Charles W. Whitfield
- Entomology
- Institute of Genomic Biology, and
- Neuroscience Program, University of Illinois at Urbana–Champaign, Urbana, IL 61801
| | - Chengxiang Zhai
- Departments of *Computer Science and
- Institute of Genomic Biology, and
| | - Gene E. Robinson
- Entomology
- Institute of Genomic Biology, and
- Neuroscience Program, University of Illinois at Urbana–Champaign, Urbana, IL 61801
| |
|