1
Cao T, Li Q, Huang Y, Li A. plotnineSeqSuite: a Python package for visualizing sequence data using ggplot2 style. BMC Genomics 2023; 24:585. [PMID: 37789265 PMCID: PMC10546746 DOI: 10.1186/s12864-023-09677-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 05/05/2023] [Accepted: 09/14/2023] [Indexed: 10/05/2023] Open
Abstract
BACKGROUND Sequence logo visualization has been an active area in the development of bioinformatics tools. ggseqlogo, written in R, has been the most popular API since it was published. With the rise of artificial intelligence and deep learning, Python has become the most popular programming language, and bioinformaticians have begun to shift to it. Providing Python APIs that are similar to those in R reduces the cost of relearning a programming language. However, Python's drawing frameworks are not as easy to use as ggplot2 in R. The appearance of plotnine (a Python implementation of ggplot2) makes it possible to unify the programming methods of bioinformatics visualization tools between R and Python.
RESULTS Here, we introduce plotnineSeqSuite, a new plotnine-based Python package that provides a ggseqlogo-like API for programmatic drawing of sequence logos, sequence alignment diagrams and sequence histograms. It supports custom letters, color themes, and fonts. Moreover, the classes for drawing layers follow an object-oriented design, so users can easily encapsulate and extend them.
CONCLUSIONS plotnineSeqSuite is the first ggplot2-style package to implement visualization of sequence-related graphs in Python. It enhances the uniformity of programmatic plotting between R and Python. Compared with existing tools, plotnineSeqSuite supports a more complete range of plot categories. The source code of plotnineSeqSuite can be obtained on GitHub (https://github.com/caotianze/plotnineseqsuite) and PyPI (https://pypi.org/project/plotnineseqsuite), and the documentation homepage is freely available on GitHub at https://caotianze.github.io/plotnineseqsuite/.
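The sequence histogram mentioned in this abstract is, at its core, a per-position tally of letters across equal-length aligned sequences. The following is a minimal sketch of that tally in plain Python. It is not plotnineSeqSuite's API; the function name `position_counts` and the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter


def position_counts(seqs):
    """Count each letter at each column of equal-length aligned sequences."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    # zip(*seqs) walks the alignment column by column.
    return [Counter(column) for column in zip(*seqs)]


# Toy alignment (hypothetical data, not from the paper).
toy = ["ACGT", "ACGA", "AGGT"]
counts = position_counts(toy)

for position, column in enumerate(counts, start=1):
    print(position, dict(column))
```

Each `Counter` holds the bar heights for one position of the histogram; a plotting layer would only need to draw them.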
Affiliation(s)
- Tianze Cao
  - School of Mathematics, Hangzhou Normal University, Hangzhou, Zhejiang Province, China
- Qian Li
  - Department of Rehabilitation, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei Province, China
- Yuexia Huang
  - School of Mathematics, Hangzhou Normal University, Hangzhou, Zhejiang Province, China
- Anshui Li
  - Department of Statistics, Shaoxing University, Shaoxing, Zhejiang Province, China
2
Onah E, Uzor PF, Ugwoke IC, Eze JU, Ugwuanyi ST, Chukwudi IR, Ibezim A. Prediction of HIV-1 protease cleavage site from octapeptide sequence information using selected classifiers and hybrid descriptors. BMC Bioinformatics 2022; 23:466. [DOI: 10.1186/s12859-022-05017-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2022] [Accepted: 10/11/2022] [Indexed: 11/10/2022] Open
Abstract
Background
In most parts of the world, especially in underdeveloped countries, acquired immunodeficiency syndrome (AIDS) still remains a major cause of death, disability, and unfavorable economic outcomes. This has necessitated intensive research to develop effective therapeutic agents for the treatment of human immunodeficiency virus (HIV) infection, which is responsible for AIDS. Peptide cleavage by HIV-1 protease is an essential step in the replication of HIV-1. Thus, correct and timely prediction of the cleavage site of HIV-1 protease can significantly speed up and optimize the drug discovery process of novel HIV-1 protease inhibitors. In this work, we built and compared the performance of selected machine learning models for the prediction of HIV-1 protease cleavage site utilizing a hybrid of octapeptide sequence information comprising bond composition, amino acid binary profile (AABP), and physicochemical properties as numerical descriptors serving as input variables for some selected machine learning algorithms. Our work differs from antecedent studies exploring the same subject in the combination of octapeptide descriptors and method used. Instead of using various subsets of the dataset for training and testing the models, we combined the dataset, applied a 3-way data split, and then used a "stratified" 10-fold cross-validation technique alongside the testing set to evaluate the models.
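Two steps in this background lend themselves to a short illustration: turning an octapeptide into a binary profile and splitting samples into stratified folds. The sketch below assumes that the amino acid binary profile is a one-hot vector of 20 values per residue (160 values per octapeptide) and uses a simple round-robin stratification. Both are assumptions for illustration, not the authors' code; the function names and the label vector are hypothetical.

```python
import random
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def binary_profile(peptide):
    """One-hot encode a peptide: 20 binary values per residue."""
    vector = []
    for residue in peptide:
        if residue not in AMINO_ACIDS:
            raise ValueError(f"unknown residue: {residue!r}")
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Dealing indices round-robin keeps class ratios similar in every fold.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


features = binary_profile("SQNYPIVQ")  # illustrative octapeptide
labels = [1] * 20 + [0] * 80           # hypothetical, imbalanced labels
folds = stratified_folds(labels, k=10)
print(len(features), [len(fold) for fold in folds])
```

Stratifying matters here because cleaved peptides are typically the minority class; without it, some folds could contain very few positive examples.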
Results
Among the 8 models evaluated in the “stratified” 10-fold CV experiment, logistic regression, multi-layer perceptron classifier, linear discriminant analysis, gradient boosting classifier, Naive Bayes classifier, and decision tree classifier, with AUC, F-score, and B. Acc. scores in the ranges of 0.91–0.96, 0.81–0.88, and 80.1–86.4%, respectively, had predictive performance closest to that of the state-of-the-art model (AUC 0.96, F-score 0.80 and B. Acc. ~80.0%). In contrast, the perceptron classifier and K-nearest neighbors performed significantly worse (AUC 0.77–0.82, F-score 0.53–0.69, and B. Acc. 60.0–68.5%) at p < 0.05. On further evaluation on the testing set, logistic regression and the multi-layer perceptron classifier performed best (AUC of 0.97, F-score > 0.89, and B. Acc. > 90.0%), though linear discriminant analysis, gradient boosting classifier, and Naive Bayes classifier also performed well (AUC > 0.94, F-score > 0.87, and B. Acc. > 86.0%).
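For readers unfamiliar with two of the metrics reported here, the sketch below computes the F-score and a balanced accuracy from confusion-matrix counts using their standard definitions. It assumes that "B. Acc." denotes balanced accuracy (the mean of sensitivity and specificity); the counts are hypothetical and are not taken from the paper.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall (F1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity (true-positive rate) and specificity (true-negative rate)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical confusion-matrix counts for an imbalanced test set.
tp, fp, tn, fn = 40, 10, 140, 10
f1 = f_score(tp, fp, fn)
bacc = balanced_accuracy(tp, fp, tn, fn)
print(round(f1, 3), round(bacc, 3))
```

Unlike plain accuracy, both metrics stay informative when non-cleaved peptides greatly outnumber cleaved ones.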
Conclusions
Logistic regression and multi-layer perceptron classifiers have predictive performance comparable to that of the state-of-the-art model when octapeptide sequence descriptors consisting of AABP, bond composition and standard physicochemical properties are used as input variables. In our future work, we hope to develop standalone software for HIV-1 protease cleavage site prediction utilizing the linear regression algorithm and the aforementioned octapeptide sequence descriptors.
3
Zhou R, Tian K, Huang J, Duan W, Fu H, Feng Y, Wang H, Jiang Y, Li Y, Wang R, Hu J, Ma H, Qi Z, Ji X. CTCF DNA binding domain undergoes dynamic and selective protein–protein interactions. iScience 2022; 25:105011. [PMID: 36117989 PMCID: PMC9474293 DOI: 10.1016/j.isci.2022.105011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2022] [Revised: 05/13/2022] [Accepted: 08/19/2022] [Indexed: 11/24/2022] Open
Abstract
CTCF is a predominant insulator protein required for three-dimensional chromatin organization. However, the roles of its insulation of enhancers in a 3D nuclear organization have not been fully explained. Here, we found that the CTCF DNA-binding domain (DBD) forms dynamic self-interacting clusters. Strikingly, CTCF DBD clusters were found to incorporate other insulator proteins but are not coenriched with transcriptional activators in the nucleus. This property is not observed in other domains of CTCF or the DBDs of other transcription factors. Moreover, endogenous CTCF shows a phenotype consistent with the DBD by forming small protein clusters and interacting with CTCF motif arrays that have fewer transcriptional activators bound. Our results reveal an interesting phenomenon in which CTCF DBD interacts with insulator proteins and selectively localizes to nuclear positions with lower concentrations of transcriptional activators, providing insights into the insulation function of CTCF.
- The CTCF DNA-binding domain forms protein clusters in vivo and in vitro
- CTCF DBD clusters colocalize with insulator proteins but not with activators
- Arginine residues of CTCF DBD are frequently mutated in cancers
- Multiple transcription factor DBDs form protein clusters
4
Rudić J, Dragićević MB, Momčilović I, Simonović AD, Pantelić D. In Silico Study of Superoxide Dismutase Gene Family in Potato and Effects of Elevated Temperature and Salicylic Acid on Gene Expression. Antioxidants (Basel) 2022; 11:488. [PMID: 35326138 PMCID: PMC8944489 DOI: 10.3390/antiox11030488] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 12/31/2021] [Revised: 02/14/2022] [Accepted: 02/22/2022] [Indexed: 12/13/2022] Open
Abstract
Potato (Solanum tuberosum L.) is the most important vegetable crop globally and is very susceptible to high ambient temperatures. Since heat stress causes the accumulation of reactive oxygen species (ROS), investigations regarding major enzymatic components of the antioxidative system are of the essence. Superoxide dismutases (SODs) represent the first line of defense against ROS but detailed in silico analysis and characterization of the potato SOD gene family have not been performed thus far. We have analyzed eight functional SOD genes, three StCuZnSODs, one StMnSOD, and four StFeSODs, annotated in the updated version of potato genome (Spud DB DM v6.1). The StSOD genes and their respective proteins were analyzed in silico to determine the exon-intron organization, splice variants, cis-regulatory promoter elements, conserved domains, signals for subcellular targeting, 3D-structures, and phylogenetic relations. Quantitative PCR analysis revealed higher induction of StCuZnSODs (the major potato SODs) and StFeSOD3 in thermotolerant cultivar Désirée than in thermosensitive Agria and Kennebec during long-term exposure to elevated temperature. StMnSOD was constitutively expressed, while expression of StFeSODs was cultivar-dependent. The effects of salicylic acid (10⁻⁵ M) on StSODs expression were minor. Our results provide the basis for further research on StSODs and their regulation in potato, particularly in response to elevated temperatures.
5
Tareen A, Kinney JB. Logomaker: beautiful sequence logos in Python. Bioinformatics 2020; 36:2272-2274. [PMID: 31821414 PMCID: PMC7141850 DOI: 10.1093/bioinformatics/btz921] [Citation(s) in RCA: 187] [Impact Index Per Article: 46.8] [Received: 05/10/2019] [Revised: 11/14/2019] [Accepted: 12/06/2019] [Indexed: 01/09/2023] Open
Abstract
Summary
Sequence logos are visually compelling ways of illustrating the biological properties of DNA, RNA and protein sequences, yet it is currently difficult to generate and customize such logos within the Python programming environment. Here we introduce Logomaker, a Python API for creating publication-quality sequence logos. Logomaker can produce both standard and highly customized logos from either a matrix-like array of numbers or a multiple-sequence alignment. Logos are rendered as native matplotlib objects that are easy to stylize and incorporate into multi-panel figures.
Availability and implementation
Logomaker can be installed using the pip package manager and is compatible with both Python 2.7 and Python 3.6. Documentation is provided at http://logomaker.readthedocs.io; source code is available at http://github.com/jbkinney/logomaker.
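To make the "matrix-like array of numbers" concrete, the sketch below converts a per-position count matrix into the letter heights of a classic information-content logo: each column's total height is log2(alphabet size) minus its Shannon entropy, shared among letters in proportion to their frequencies. This is plain Python and does not use Logomaker's API; the function name, the toy matrix, and the omission of any small-sample correction are assumptions made for illustration.

```python
import math


def information_heights(count_matrix, alphabet_size=4):
    """Convert per-position letter counts into letter heights in bits."""
    heights = []
    for column in count_matrix:
        total = sum(column.values())
        if total == 0:
            heights.append({})  # empty column: nothing to draw
            continue
        freqs = {ch: n / total for ch, n in column.items() if n > 0}
        entropy = -sum(p * math.log2(p) for p in freqs.values())
        info = math.log2(alphabet_size) - entropy
        # Each letter receives a share of the column height equal to its frequency.
        heights.append({ch: p * info for ch, p in freqs.items()})
    return heights


# Toy DNA count matrix (hypothetical): one dict of counts per position.
toy_counts = [
    {"A": 4, "C": 0, "G": 0, "T": 0},  # fully conserved
    {"A": 2, "C": 2, "G": 0, "T": 0},  # two letters, equally frequent
    {"A": 1, "C": 1, "G": 1, "T": 1},  # uniform, no information
]
heights = information_heights(toy_counts)
for position, column in enumerate(heights, start=1):
    print(position, column)
```

A fully conserved DNA position reaches the maximum of 2 bits, while a uniform position collapses to zero height, which is why conserved motif positions stand out in a logo.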
Affiliation(s)
- Ammar Tareen
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Justin B Kinney
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
6
Dhar J, Kishore R, Chakrabarti P. Delineation of a new structural motif involving NHN γ-turn. Proteins 2019; 88:431-439. [PMID: 31587358 DOI: 10.1002/prot.25820] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/16/2019] [Revised: 09/17/2019] [Accepted: 09/18/2019] [Indexed: 10/25/2022]
Abstract
Macromolecules are characterized by distinctive arrangements of hydrogen bonds. Different patterns of hydrogen bonds give rise to distinct and stable structural motifs. An analysis of 4114 non-redundant protein chains reveals the existence of a three-residue, (i - 1) to (i + 1), structural motif, having two hydrogen-bonded five-membered pseudo rings (the first, an NH···OC involving the first residue, and the second being NH···N involving the last two residues), separated by a peptide bond. There could be an additional hydrogen bond between the side-chain at (i - 1) and the main-chain NH of (i + 1). The average backbone torsion angles of -76(±21)° and -12(±17)° at i create a tight turn in the polypeptide chain, akin to a γ-turn. Indeed, a search of three-residue fragments with restriction on the terminal Cα···Cα distance and the existence of the two pseudo rings on either side revealed the presence of 14 846 cases of a variant, termed NHN γ-turn, distinct from the NHO γ-turn (2032 cases) that has traditionally been characterized by the presence of an NHO hydrogen bond linking the terminal main-chain atoms. As in the latter, the newly identified γ-turns are also of two types, classical and inverse, occurring in the ratio of 1:6. The propensities of residues to occur in these turns and their secondary structural features have been enumerated. An understanding of these turns would be useful for structure prediction and loop modeling, and may serve as models to represent some of the unfolded state or disordered region in proteins.
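Backbone torsion angles such as those quoted in this abstract are signed dihedral angles defined by four consecutive atoms. The sketch below shows one common way to compute such an angle from Cartesian coordinates, by projecting the two outer bonds onto the plane perpendicular to the central bond. It is an illustrative implementation, not the authors' code; the helper names and the test coordinates are hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees defined by four points."""
    b0 = _sub(p0, p1)  # from the second atom back to the first
    b1 = _sub(p2, p1)  # central bond
    b2 = _sub(p3, p2)  # from the third atom on to the fourth
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Remove the components along the central bond, leaving the
    # projections onto the plane perpendicular to it.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates: a cis, a trans, and a quarter-turn arrangement.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(cis, trans, quarter)
```

For a residue i, φ would be computed from the atoms C(i - 1), N(i), Cα(i), C(i) and ψ from N(i), Cα(i), C(i), N(i + 1), using coordinates read from a structure file.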
Affiliation(s)
- Jesmita Dhar
  - Bioinformatics Centre, Bose Institute, Kolkata, India
- Raghuvansh Kishore
  - Department of Zoology and Department of Biotechnology, Mizoram University, Aizawl, India
- Pinak Chakrabarti
  - Bioinformatics Centre, Bose Institute, Kolkata, India; Department of Biochemistry, Bose Institute, Kolkata, India
7
Beltran T, Barroso C, Birkle TY, Stevens L, Schwartz HT, Sternberg PW, Fradin H, Gunsalus K, Piano F, Sharma G, Cerrato C, Ahringer J, Martínez-Pérez E, Blaxter M, Sarkies P. Comparative Epigenomics Reveals that RNA Polymerase II Pausing and Chromatin Domain Organization Control Nematode piRNA Biogenesis. Dev Cell 2019; 48:793-810.e6. [PMID: 30713076 PMCID: PMC6436959 DOI: 10.1016/j.devcel.2018.12.026] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 02/23/2018] [Revised: 12/06/2018] [Accepted: 12/27/2018] [Indexed: 12/30/2022]
Abstract
Piwi-interacting RNAs (piRNAs) are important for genome regulation across metazoans, but their biogenesis evolves rapidly. In Caenorhabditis elegans, piRNA loci are clustered within two 3-Mb regions on chromosome IV. Each piRNA locus possesses an upstream motif that recruits RNA polymerase II to produce an ∼28 nt primary transcript. We used comparative epigenomics across nematodes to gain insight into the origin, evolution, and mechanism of nematode piRNA biogenesis. We show that the piRNA upstream motif is derived from core promoter elements controlling snRNA transcription. We describe two alternative modes of piRNA organization in nematodes: in C. elegans and closely related nematodes, piRNAs are clustered within repressive H3K27me3 chromatin, while in other species, typified by Pristionchus pacificus, piRNAs are found within introns of active genes. Additionally, we discover that piRNA production depends on sequence signals associated with RNA polymerase II pausing. We show that pausing signals synergize with chromatin to control piRNA transcription.
- Nematode piRNA transcription evolved from small nuclear RNA biogenesis
- Clustered piRNAs are produced from regulated (H3K27me3) chromatin domains
- Dispersed piRNAs are produced from active (H3K36me3) chromatin domains
- RNA polymerase II pausing determines the short (∼28 nt) length of piRNA precursors
Affiliation(s)
- Toni Beltran
  - MRC London Institute of Medical Sciences, London W12 0NN, UK; Institute of Clinical Sciences, Imperial College London, London W12 0NN, UK
- Consuelo Barroso
  - MRC London Institute of Medical Sciences, London W12 0NN, UK; Institute of Clinical Sciences, Imperial College London, London W12 0NN, UK
- Timothy Y Birkle
  - MRC London Institute of Medical Sciences, London W12 0NN, UK; Institute of Clinical Sciences, Imperial College London, London W12 0NN, UK
- Lewis Stevens
  - Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3TF, UK
- Hillel T Schwartz
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
- Paul W Sternberg
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
- Hélène Fradin
  - Department of Biology, New York University, New York, NY 10003, USA; Center for Genomics and Systems Biology, New York University, New York, NY 10003, USA; Center for Genomics and Systems Biology, New York University Abu Dhabi, Abu Dhabi, United Arab Emirates
- Kristin Gunsalus
  - Department of Biology, New York University, New York, NY 10003, USA; Center for Genomics and Systems Biology, New York University, New York, NY 10003, USA; Center for Genomics and Systems Biology, New York University Abu Dhabi, Abu Dhabi, United Arab Emirates
- Fabio Piano
  - Department of Biology, New York University, New York, NY 10003, USA; Center for Genomics and Systems Biology, New York University, New York, NY 10003, USA; Center for Genomics and Systems Biology, New York University Abu Dhabi, Abu Dhabi, United Arab Emirates
- Garima Sharma
  - The Gurdon Institute and Department of Genetics, University of Cambridge, Cambridge, UK
- Chiara Cerrato
  - The Gurdon Institute and Department of Genetics, University of Cambridge, Cambridge, UK
- Julie Ahringer
  - The Gurdon Institute and Department of Genetics, University of Cambridge, Cambridge, UK
- Enrique Martínez-Pérez
  - MRC London Institute of Medical Sciences, London W12 0NN, UK; Institute of Clinical Sciences, Imperial College London, London W12 0NN, UK
- Mark Blaxter
  - Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3TF, UK
- Peter Sarkies
  - MRC London Institute of Medical Sciences, London W12 0NN, UK; Institute of Clinical Sciences, Imperial College London, London W12 0NN, UK
8
ChEC-seq kinetics discriminates transcription factor binding sites by DNA sequence and shape in vivo. Nat Commun 2015; 6:8733. [PMID: 26490019 PMCID: PMC4618392 DOI: 10.1038/ncomms9733] [Citation(s) in RCA: 117] [Impact Index Per Article: 13.0] [Received: 05/27/2015] [Accepted: 09/25/2015] [Indexed: 12/31/2022] Open
Abstract
Chromatin endogenous cleavage (ChEC) uses fusion of a protein of interest to micrococcal nuclease (MNase) to target calcium-dependent cleavage to specific genomic loci in vivo. Here we report the combination of ChEC with high-throughput sequencing (ChEC-seq) to map budding yeast transcription factor (TF) binding. Temporal analysis of ChEC-seq data reveals two classes of sites for TFs, one displaying rapid cleavage at sites with robust consensus motifs and the second showing slow cleavage at largely unique sites with low-scoring motifs. Sites with high-scoring motifs also display asymmetric cleavage, indicating that ChEC-seq provides information on the directionality of TF-DNA interactions. Strikingly, similar DNA shape patterns are observed regardless of motif strength, indicating that the kinetics of ChEC-seq discriminates DNA recognition through sequence and/or shape. We propose that time-resolved ChEC-seq detects both high-affinity interactions of TFs with consensus motifs and sites preferentially sampled by TFs during diffusion and sliding.
In chromatin endogenous cleavage (ChEC), micrococcal nuclease (MNase) is fused to a protein of interest and its cleavage is thus targeted to specific genomic loci in vivo. Here, the authors show that time-resolved ChEC-seq (high-throughput sequencing after ChEC) can detect DNA shape patterns regardless of motif strength.
9
Simonti CN, Pollard KS, Schröder S, He D, Bruneau BG, Ott M, Capra JA. Evolution of lysine acetylation in the RNA polymerase II C-terminal domain. BMC Evol Biol 2015; 15:35. [PMID: 25887984 PMCID: PMC4362643 DOI: 10.1186/s12862-015-0327-z] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 11/24/2014] [Accepted: 02/24/2015] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND RPB1, the largest subunit of RNA polymerase II, contains a highly modifiable C-terminal domain (CTD) that consists of variations of a consensus heptad repeat sequence (Y1S2P3T4S5P6S7). The consensus CTD repeat motif and tandem organization represent the ancestral state of eukaryotic RPB1, but across eukaryotes CTDs show considerable diversity in repeat organization and sequence content. These differences may reflect lineage-specific CTD functions mediated by protein interactions. Mammalian CTDs contain eight non-consensus repeats with a lysine in the seventh position (K7). Posttranslational acetylation of these sites was recently shown to be required for proper polymerase pausing and regulation of two growth factor-regulated genes. RESULTS To investigate the origins and function of RPB1 CTD acetylation (acRPB1), we computationally reconstructed the evolution of the CTD repeat sequence across eukaryotes and analyzed the evolution and function of genes dysregulated when acRPB1 is disrupted. Modeling the evolutionary dynamics of CTD repeat count and sequence content across diverse eukaryotes revealed an expansion of the CTD in the ancestors of Metazoa. The new CTD repeats introduced the potential for acRPB1 due to the appearance of distal repeats with lysine at position seven. This was followed by a further increase in the number of lysine-containing repeats in developmentally complex clades like Deuterostomia. Mouse genes enriched for acRPB1 occupancy at their promoters and genes with significant expression changes when acRPB1 is disrupted are enriched for several functions, such as growth factor response, gene regulation, cellular adhesion, and vascular development. Genes occupied and regulated by acRPB1 show significant enrichment for evolutionary origins in the early history of eukaryotes through early vertebrates. 
CONCLUSIONS Our combined functional and evolutionary analyses show that RPB1 CTD acetylation was possible in the early history of animals, and that the K7 content of the CTD expanded in specific developmentally complex metazoan lineages. The functional analysis of genes regulated by acRPB1 highlights functions involved in the origin and diversification of complex Metazoa. This suggests that acRPB1 may have played a role in the success of animals.
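The repeat bookkeeping described in this abstract (consensus heptads versus repeats with lysine at position seven) can be sketched in a few lines. The example below assumes the CTD sequence is already in register, so that every seventh residue starts a new repeat; real CTDs contain irregular repeats and would need alignment first. The function names and the toy sequence are hypothetical.

```python
CONSENSUS = "YSPTSPS"  # consensus heptad Y1 S2 P3 T4 S5 P6 S7


def split_heptads(ctd):
    """Cut a CTD sequence into consecutive 7-residue repeats, dropping a partial tail."""
    usable = len(ctd) - len(ctd) % 7
    return [ctd[i:i + 7] for i in range(0, usable, 7)]


def classify(repeats):
    """Label each repeat as consensus, K7 (lysine at position 7), or other."""
    labels = []
    for repeat in repeats:
        if repeat == CONSENSUS:
            labels.append("consensus")
        elif repeat[6] == "K":
            labels.append("K7")
        else:
            labels.append("other")
    return labels


# Toy CTD fragment (hypothetical): four full repeats plus a two-residue tail.
toy_ctd = "YSPTSPS" + "YSPTSPK" + "YTPTSPS" + "YSPTSPS" + "YS"
repeats = split_heptads(toy_ctd)
labels = classify(repeats)
print(list(zip(repeats, labels)))
print("K7 repeats:", labels.count("K7"))
```

Counting `K7` labels per species is the kind of tally that would feed a comparison of lysine-containing repeats across lineages.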
Affiliation(s)
- Corinne N Simonti
  - Center for Human Genetics Research, Vanderbilt University, Nashville, TN, 37232, USA
- Katherine S Pollard
  - Gladstone Institutes, University of California, San Francisco, San Francisco, CA, 94158, USA; Department of Epidemiology & Biostatistics and Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, 94158, USA
- Sebastian Schröder
  - Gladstone Institutes, University of California, San Francisco, San Francisco, CA, 94158, USA
- Daniel He
  - Gladstone Institutes, University of California, San Francisco, San Francisco, CA, 94158, USA
- Benoit G Bruneau
  - Gladstone Institutes, University of California, San Francisco, San Francisco, CA, 94158, USA
- Melanie Ott
  - Gladstone Institutes, University of California, San Francisco, San Francisco, CA, 94158, USA
- John A Capra
  - Center for Human Genetics Research, Vanderbilt University, Nashville, TN, 37232, USA; Departments of Biological Sciences and Biomedical Informatics, Vanderbilt University, Nashville, TN, 37232, USA