Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Zhang R, Zhang CT. A Brief Review: The Z-curve Theory and its Application in Genome Analysis. Curr Genomics 2014;15:78-94. [PMID: 24822026 PMCID: PMC4009844 DOI: 10.2174/1389202915999140328162433] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 09/08/2013] [Revised: 10/16/2013] [Accepted: 10/16/2013] [Indexed: 11/22/2022] Open

Cited by Other Article(s)

Wei PJ, Guo Z, Gao Z, Ding Z, Cao RF, Su Y, Zheng CH. Inference of gene regulatory networks based on directed graph convolutional networks. Brief Bioinform 2024;25:bbae309. [PMID: 38935070 DOI: 10.1093/bib/bbae309] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2023] [Revised: 05/17/2024] [Indexed: 06/28/2024] Open

Biró B, Gál Z, Fekete Z, Klecska E, Hoffmann OI. Mitochondrial genome plasticity of mammalian species. BMC Genomics 2024;25:278. [PMID: 38486136 PMCID: PMC10941376 DOI: 10.1186/s12864-024-10201-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Accepted: 03/08/2024] [Indexed: 03/17/2024] Open

Musleh S, Arif M, Alajez NM, Alam T. Unified mRNA Subcellular Localization Predictor based on machine learning techniques. BMC Genomics 2024;25:151. [PMID: 38326777 PMCID: PMC10848524 DOI: 10.1186/s12864-024-10077-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Accepted: 02/01/2024] [Indexed: 02/09/2024] Open

Yin ZN, Lai FL, Gao F. Unveiling human origins of replication using deep learning: accurate prediction and comprehensive analysis. Brief Bioinform 2023;25:bbad432. [PMID: 38008420 PMCID: PMC10676776 DOI: 10.1093/bib/bbad432] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Revised: 10/11/2023] [Accepted: 11/06/2023] [Indexed: 11/28/2023] Open

Chen R. A Historic Retrospective on the Early Bioinformatics Research in China. GENOMICS, PROTEOMICS & BIOINFORMATICS 2023;21:897-899. [PMID: 37923291 PMCID: PMC10928369 DOI: 10.1016/j.gpb.2023.10.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2023] [Revised: 10/28/2023] [Accepted: 10/28/2023] [Indexed: 11/07/2023]

Musleh S, Islam MT, Qureshi R, Alajez N, Alam T. MSLP: mRNA subcellular localization predictor based on machine learning techniques. BMC Bioinformatics 2023;24:109. [PMID: 36949389 PMCID: PMC10035125 DOI: 10.1186/s12859-023-05232-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 11/04/2022] [Accepted: 03/15/2023] [Indexed: 03/24/2023] Open

Abstract

BACKGROUND

Subcellular localization of messenger RNAs (mRNAs) plays a pivotal role in the regulation of gene expression, cell migration, and cellular adaptation. Experimental techniques for pinpointing the subcellular localization of mRNAs are laborious, time-consuming, and expensive. Therefore, in silico approaches for this purpose are attracting great attention in the RNA community.

METHODS

In this article, we propose MSLP, a machine learning-based method to predict the subcellular localization of mRNAs. We propose a novel combination of four types of features, representing k-mer, pseudo k-tuple nucleotide composition (PseKNC), physicochemical properties of nucleotides, and a 3D representation of sequences based on the Z-curve transformation, to feed into a machine learning algorithm.

RESULTS

Considering the combination of the above-mentioned features, ensemble-based models achieved state-of-the-art results in mRNA subcellular localization prediction tasks for multiple benchmark datasets. We evaluated the performance of our method in ten subcellular locations, covering cytoplasm, nucleus, endoplasmic reticulum (ER), extracellular region (ExR), mitochondria, cytosol, pseudopodium, posterior, exosome, and the ribosome. An ablation study highlighted k-mer and PseKNC as more dominant than the other features for predicting cytoplasm, nucleus, and ER localizations. On the other hand, physicochemical properties and Z-curve based features contributed the most to ExR and mitochondria detection. SHAP-based analysis revealed the relative importance of features, providing better insight into the proposed approach.

AVAILABILITY

We have implemented a Docker container and an API for end users to run their sequences on our model. The datasets, the API code, and the Docker container are shared with the community on GitHub at: https://github.com/smusleh/MSLP
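
The Z-curve transformation listed among the MSLP features can be illustrated with a minimal sketch. It assumes the standard cumulative definition from Z-curve theory, in which each prefix of a DNA sequence maps to a point (x, y, z) built from the running counts of A, C, G and T. The function name `z_curve` is hypothetical, and this is not the MSLP feature-extraction code, which may use a different (for example, frequency-based) variant.

```python
def z_curve(seq):
    """Return the cumulative Z-curve points (x_n, y_n, z_n) of a DNA sequence.

    Assumed definition (standard Z-curve theory):
      x_n = (A_n + G_n) - (C_n + T_n)   purine vs. pyrimidine
      y_n = (A_n + C_n) - (G_n + T_n)   amino vs. keto
      z_n = (A_n + T_n) - (C_n + G_n)   weak vs. strong hydrogen bonding
    where A_n, C_n, G_n, T_n count each base in the first n positions.
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    points = []
    for base in seq.upper():
        if base not in counts:
            raise ValueError(f"unsupported base: {base!r}")
        counts[base] += 1
        a, c, g, t = counts["A"], counts["C"], counts["G"], counts["T"]
        points.append(((a + g) - (c + t), (a + c) - (g + t), (a + t) - (c + g)))
    return points


if __name__ == "__main__":
    for n, point in enumerate(z_curve("ACGT"), start=1):
        print(n, point)
```

For an mRNA sequence, one option is to map U to T before calling the function; that preprocessing step is likewise an assumption of this sketch.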

Genome-wide identification and characterization of DNA enhancers with a stacked multivariate fusion framework. PLoS Comput Biol 2022;18:e1010779. [PMID: 36520922 PMCID: PMC9836277 DOI: 10.1371/journal.pcbi.1010779] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/27/2022] [Revised: 01/12/2023] [Accepted: 11/29/2022] [Indexed: 12/23/2022] Open

Abstract

Enhancers are short non-coding DNA sequences outside of the target promoter regions that can be bound by specific proteins to increase a gene's transcriptional activity, which has a crucial role in the spatiotemporal and quantitative regulation of gene expression. However, enhancers do not have specific sequence motifs or structures, and their scattered distribution in the genome makes the identification of enhancers from human cell lines particularly challenging. Here we present a novel, stacked multivariate fusion framework called SMFM, which enables a comprehensive identification and analysis of enhancers from regulatory DNA sequences as well as their interpretation. Specifically, to characterize the hierarchical relationships of enhancer sequences, multi-source biological information and dynamic semantic information are fused to represent regulatory DNA enhancer sequences. Then, we implement a deep learning-based sequence network to learn the feature representation of the enhancer sequences comprehensively and to extract the implicit relationships in the dynamic semantic information. Ultimately, an ensemble machine learning classifier is trained based on the refined multi-source features and dynamic implicit relations obtained from the deep learning-based sequence network. Benchmarking experiments demonstrated that SMFM significantly outperforms other existing methods using several evaluation metrics. In addition, an independent test set was used to validate the generalization performance of SMFM by comparing it to other state-of-the-art enhancer identification methods. Moreover, we performed motif analysis based on the contribution scores of different bases of enhancer sequences to the final identification results. Besides, we conducted interpretability analysis of the identified enhancer sequences based on attention weights of EnhancerBERT, a fine-tuned BERT model that provides new insights into exploring the gene semantic information likely to underlie the discovered enhancers in an interpretable manner. Finally, in a human placenta study with 4,562 active distal gene regulatory enhancers, SMFM successfully exposed tissue-related placental development and the differential mechanism, demonstrating the generalizability and stability of our proposed framework.

Kania A, Sarapata K. Multifarious aspects of the chaos game representation and its applications in biological sequence analysis. Comput Biol Med 2022;151:106243. [PMID: 36335814 DOI: 10.1016/j.compbiomed.2022.106243] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2022] [Revised: 10/18/2022] [Accepted: 10/22/2022] [Indexed: 12/27/2022]

Abstract

Chaos game representation (CGR) has been successfully applied in bioinformatics for over 30 years. Since then, many further extensions have been announced. Numerical encoding of biological sequences is especially convenient in the visualisation process, free-alignment methods and input preparation for machine learning techniques. The development and applications of CGR have embraced mainly linear nucleotide sequences. However, there were also some attempts to create a representation of proteins. The latter need to be more sophisticated, as arbitrary coordinates for amino acids do not reflect their properties, which is crucial during the encoding process. In this paper, we summarised the variants of CGR and their limitations. We began by studying the PROSITE motifs and showed the immense number of amino acid properties employed by different proteins. To this aim, we harnessed Principal Component Analysis (PCA) and studied the relation between explained variance and the number of features that describe them. It appeared that even after many reductions, about 50 features are non-redundant. This was the reason we introduced an embedding concept from natural language processing, which enables adjusting features for a given list of sequences. We presented a simple neural network architecture with one hidden layer and one neuron within it and showed that it provides satisfactory results in phylogenetic tree construction in the ND5 and SPARC protein cases. To this aim, we transformed the CGR representations of all considered sequences using the Discrete Fourier Transform (DFT) and applied the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) algorithm. Moreover, we indicated some similarities between CGR and Recurrent Neural Networks (RNN). In the end, we attempted to include information about the RNA secondary structure and defined some measures to validate biological significance. We studied their properties and showed their usefulness on the ALMV-3 example.
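
As an illustration of the basic nucleotide CGR that the abstract builds on, here is a minimal sketch. It assumes the classic unit-square construction: start at the centre and, for each base, move halfway towards that base's corner. The corner assignment below and the name `cgr_points` are assumptions made for illustration; the paper's protein and embedding variants are not reproduced.

```python
# Assumed corner layout on the unit square; other conventions exist.
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_points(seq):
    """Return the list of CGR coordinates for a DNA sequence."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            raise ValueError(f"unsupported base: {base!r}")
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    for base, point in zip("AGCT", cgr_points("AGCT")):
        print(base, point)
```

Each point encodes the whole prefix read so far, which is why the resulting point cloud can serve as a numerical input for visualisation or machine learning.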

Hubert B. SkewDB, a comprehensive database of GC and 10 other skews for over 30,000 chromosomes and plasmids. Sci Data 2022;9:92. [PMID: 35318332 PMCID: PMC8941118 DOI: 10.1038/s41597-022-01179-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2021] [Accepted: 01/25/2022] [Indexed: 11/12/2022] Open

Azim SM, Haque MR, Shatabda S. OriC-ENS: A sequence-based ensemble classifier for predicting origin of replication in S. cerevisiae. Comput Biol Chem 2021;92:107502. [PMID: 33962169 DOI: 10.1016/j.compbiolchem.2021.107502] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/23/2021] [Accepted: 04/21/2021] [Indexed: 01/08/2023]

Kania A, Sarapata K. The robustness of the chaos game representation to mutations and its application in free-alignment methods. Genomics 2021;113:1428-1437. [PMID: 33713823 DOI: 10.1016/j.ygeno.2021.03.015] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/22/2020] [Revised: 01/22/2021] [Accepted: 03/05/2021] [Indexed: 02/06/2023]

Wang D, Lai FL, Gao F. Ori-Finder 3: a web server for genome-wide prediction of replication origins in Saccharomyces cerevisiae. Brief Bioinform 2020;22:6278693. [PMID: 34020544 DOI: 10.1093/bib/bbaa182] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 06/08/2020] [Revised: 06/29/2020] [Accepted: 07/15/2020] [Indexed: 12/26/2022] Open

Romdhane L, Bouhamed H, Ghedira K, Ben Hamda C, Louhichi A, Jmel H, Romdhane S, Charfeddine C, Mokni M, Abdelhak S, Rebai A. The morbid cutaneous anatomy of the human genome revealed by a bioinformatic approach. Genomics 2020;112:4232-4241. [PMID: 32650097 DOI: 10.1016/j.ygeno.2020.07.009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/13/2019] [Revised: 03/28/2020] [Accepted: 07/02/2020] [Indexed: 01/05/2023]

崔颖, 徐泽, 李建. [Identification of nucleosome positioning using support vector machine method based on comprehensive DNA sequence feature]. SHENG WU YI XUE GONG CHENG XUE ZA ZHI = JOURNAL OF BIOMEDICAL ENGINEERING = SHENGWU YIXUE GONGCHENGXUE ZAZHI 2020;37:496-501. [PMID: 32597092 PMCID: PMC10319573 DOI: 10.7507/1001-5515.201911064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2019] [Indexed: 11/03/2022]

ZCMM: A Novel Method Using Z-Curve Theory-Based and Position Weight Matrix for Predicting Nucleosome Positioning. Genes (Basel) 2019;10:genes10100765. [PMID: 31569414 PMCID: PMC6827144 DOI: 10.3390/genes10100765] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/17/2019] [Revised: 09/25/2019] [Accepted: 09/26/2019] [Indexed: 02/04/2023] Open

Abstract

Nucleosomes are the basic units of eukaryotic chromatin. The accurate positioning of nucleosomes plays a significant role in understanding many biological processes such as transcriptional regulation mechanisms and DNA replication and repair. Here, we describe the development of a novel method, termed ZCMM, based on Z-curve theory and position weight matrix (PWM). The ZCMM was trained and tested using the nucleosomal and linker sequences determined by support vector machine (SVM) in Saccharomyces cerevisiae (S. cerevisiae), and experimental results showed that the sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews correlation coefficient (MCC) values for ZCMM were 91.40%, 96.56%, 96.75%, and 0.88, respectively, and the average area under the receiver operating characteristic curve (AUC) value was 0.972. A ZCMM predictor was developed to predict nucleosome positioning in Homo sapiens (H. sapiens), Caenorhabditis elegans (C. elegans), and Drosophila melanogaster (D. melanogaster) genomes, and the accuracy (Acc) values were 77.72%, 85.34%, and 93.62%, respectively. The maximum AUC values of the four species were 0.982, 0.861, 0.912 and 0.911, respectively. Another independent dataset for S. cerevisiae was used to predict nucleosome positioning. Compared with the results of Wu's method, the Sn, Sp, Acc, and MCC of the ZCMM results for S. cerevisiae were all higher, reaching 96.72%, 96.54%, 94.10%, and 0.88. Compared with Guo's method 'iNuc-PseKNC', the results of ZCMM for D. melanogaster were better. Meanwhile, the ZCMM was compared with some experimental data in vitro and in vivo for S. cerevisiae, and the results showed that the nucleosomes predicted by ZCMM were highly consistent with those confirmed by these experiments. Therefore, it was further confirmed that the ZCMM method has good accuracy and reliability in predicting nucleosome positioning.
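
The four evaluation metrics named in this abstract (Sn, Sp, Acc, MCC) follow from a binary confusion matrix. The sketch below uses their standard textbook definitions; the function name `classification_metrics` and the counts in the usage example are made up for illustration and are not taken from the ZCMM paper.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute Sn, Sp, Acc and MCC from confusion-matrix counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    sn = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)  # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(classification_metrics(tp=90, fp=5, tn=95, fn=10))
```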

Wang D, Gao F. Comprehensive Analysis of Replication Origins in Saccharomyces cerevisiae Genomes. Front Microbiol 2019;10:2122. [PMID: 31572328 PMCID: PMC6753640 DOI: 10.3389/fmicb.2019.02122] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 11/16/2018] [Accepted: 08/29/2019] [Indexed: 12/15/2022] Open

Abstract

DNA replication initiates from multiple replication origins (ORIs) in eukaryotes. Discovery and characterization of replication origins are essential for a better understanding of the molecular mechanism of DNA replication. In this study, the features of autonomously replicating sequences (ARSs) in Saccharomyces cerevisiae have been comprehensively analyzed as follows. Firstly, we carried out the analysis of the ARSs available in S. cerevisiae S288C. By evaluating the sequence similarity of experimentally established ARSs, we found that 94.32% of ARSs are unique across the whole genome of S. cerevisiae S288C and those with high sequence similarity tend to be located in subtelomeres. Subsequently, we built a non-redundant dataset with a total of 520 ARSs, based on the ARS annotation of S. cerevisiae S288C from SGD and supplemented with those from the OriDB and DeOri databases. We conducted a large-scale comparison of ORIs among the diverse budding yeast strains from a population genomics perspective. We found that 82.7% of ARSs are not only conserved in genomic sequence but also relatively conserved in chromosomal position. The non-conserved ARSs tend to be distributed in the subtelomeric regions. We also conducted a pan-genome analysis of ARSs among the S. cerevisiae strains, and a total of 183 core ARSs existing in all yeast strains were determined. We extracted the genes adjacent to replication origins among the 104 yeast strains to examine whether there are differences in their gene functions. The result showed that the genes involved in the initiation of DNA replication, such as orc3, mcm2, mcm4, mcm6, and cdc45, are conservatively located adjacent to the replication origins. Furthermore, we found that the genes adjacent to conserved ARSs are significantly enriched in DNA binding, enzyme activity, transportation, and energy, whereas the genes adjacent to non-conserved ARSs are significantly enriched in response to environmental stress, metabolites biosynthetic process and biosynthesis of antibiotics. In general, we characterized the replication origins from the genome-wide and population genomics perspectives, which would provide new insights into the replication mechanism of S. cerevisiae and facilitate the design of algorithms to identify genome-wide replication origins in yeast.

Jani MR, Khan Mozlish MT, Ahmed S, Tahniat NS, Farid DM, Shatabda S. iRecSpot-EF: Effective sequence based features for recombination hotspot prediction. Comput Biol Med 2018;103:17-23. [DOI: 10.1016/j.compbiomed.2018.10.005] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 06/08/2018] [Revised: 10/07/2018] [Accepted: 10/07/2018] [Indexed: 01/19/2023]

2DPR-Tree: Two-Dimensional Priority R-Tree Algorithm for Spatial Partitioning in SpatialHadoop. ISPRS INTERNATIONAL JOURNAL OF GEO-INFORMATION 2018. [DOI: 10.3390/ijgi7050179] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022]

Luo H, Quan CL, Peng C, Gao F. Recent development of Ori-Finder system and DoriC database for microbial replication origins. Brief Bioinform 2018;20:1114-1124. [DOI: 10.1093/bib/bbx174] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 09/20/2017] [Revised: 12/04/2017] [Indexed: 01/28/2023] Open

Adetiba E, Olugbara OO, Taiwo TB, Adebiyi MO, Badejo JA, Akanle MB, Matthews VO. Alignment-Free Z-Curve Genomic Cepstral Coefficients and Machine Learning for Classification of Viruses. BIOINFORMATICS AND BIOMEDICAL ENGINEERING 2018. [PMCID: PMC7120486 DOI: 10.1007/978-3-319-78723-7_25] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]

Quantitative analysis of correlation between AT and GC biases among bacterial genomes. PLoS One 2017;12:e0171408. [PMID: 28158313 PMCID: PMC5291525 DOI: 10.1371/journal.pone.0171408] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 12/18/2016] [Accepted: 01/20/2017] [Indexed: 01/03/2023] Open

Li Y, Shi X, Liang Y, Xie J, Zhang Y, Ma Q. RNA-TVcurve: a Web server for RNA secondary structure comparison based on a multi-scale similarity of its triple vector curve representation. BMC Bioinformatics 2017;18:51. [PMID: 28109252 PMCID: PMC5251234 DOI: 10.1186/s12859-017-1481-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 08/25/2016] [Accepted: 01/10/2017] [Indexed: 01/10/2023] Open

Abstract

Background

RNAs have been found to carry diverse functionalities in nature. Inferring the similarity between two given RNAs is a fundamental step to understand and interpret their functional relationship. The majority of functional RNAs show conserved secondary structures, rather than sequence conservation. Those algorithms relying on sequence-based features usually have limitations in their prediction performance. Hence, integrating RNA structure features is very critical for RNA analysis. Existing algorithms mainly fall into two categories: alignment-based and alignment-free. The alignment-free algorithms of RNA comparison usually have lower time complexity than alignment-based algorithms.

Results

An alignment-free RNA comparison algorithm was proposed, in which a novel numerical representation, RNA-TVcurve (triple vector curve representation), of the RNA sequence and its corresponding secondary structure features is provided. Then a multi-scale similarity score of two given RNAs was designed based on wavelet decomposition of their numerical representation. In support of RNA mutation and phylogenetic analysis, a web server (RNA-TVcurve) was designed based on this alignment-free RNA comparison algorithm. It provides three functional modules: 1) visualization of numerical representation of RNA secondary structure; 2) detection of single-point mutation based on secondary structure; and 3) comparison of pairwise and multiple RNA secondary structures. The web server requires RNA primary sequences as input, while the corresponding secondary structures are optional. For primary sequences alone, the web server can compute the secondary structures using the free energy minimization algorithm of the RNAfold tool from the Vienna RNA package.

Conclusion

RNA-TVcurve is the first integrated web server, based on an alignment-free method, to deliver a suite of RNA analysis functions, including visualization, mutation analysis and multiple RNA structure comparison. The comparison results with two popular RNA comparison tools, RNApdist and RNAdistance, showed that RNA-TVcurve can efficiently capture subtle relationships among RNAs for mutation detection and non-coding RNA classification. All the relevant results are shown in an intuitive graphical manner and can be freely downloaded from this server. RNA-TVcurve, along with test examples and detailed documents, is available at: http://ml.jlu.edu.cn/tvcurve/.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-017-1481-7) contains supplementary material, which is available to authorized users.

Druzhinina IS, Kopchinskiy AG, Kubicek EM, Kubicek CP. A complete annotation of the chromosomes of the cellulase producer Trichoderma reesei provides insights in gene clusters, their expression and reveals genes required for fitness. BIOTECHNOLOGY FOR BIOFUELS 2016;9:75. [PMID: 27030800 PMCID: PMC4812632 DOI: 10.1186/s13068-016-0488-z] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 01/11/2016] [Accepted: 03/15/2016] [Indexed: 05/15/2023]

Abstract

BACKGROUND

Investigations on a few eukaryotic model organisms showed that many genes are non-randomly distributed on chromosomes. In addition, chromosome ends frequently possess genes that are important for the fitness of the organisms. Trichoderma reesei is an industrial producer of enzymes for food, feed and biorefinery production. Its seven chromosomes have recently been assembled, thus making an investigation of its chromosome architecture possible.

RESULTS

We manually annotated and mapped 9194 ORFs on their respective chromosomes and investigated the clustering of the major gene categories and of genes encoding carbohydrate-active enzymes (CAZymes), and the relationship between clustering and expression. Genes responsible for RNA processing and modification, amino acid metabolism, transcription, translation and ribosomal structure and biogenesis indeed showed loose clustering, but this had no impact on their expression. A third of the genes encoding CAZymes also occurred in loose clusters that also contained a high number of genes encoding small secreted cysteine-rich proteins. Five CAZyme clusters were located less than 50 kb apart from the chromosome ends. These genes exhibited the lowest basal (but not induced) expression level, which correlated with an enrichment of H3K9 methylation in the terminal 50 kb areas indicating gene silencing. No differences were found in the expression of CAZyme genes present in other parts of the chromosomes. The putative subtelomeric areas were also enriched in genes encoding secreted proteases, amino acid permeases, enzyme clusters for polyketide synthases (PKS)-non-ribosomal peptide synthase (NRPS) fusion proteins (PKS-NRPS) and proteins involved in iron scavenging. They were strongly upregulated during conidiation and interaction with other fungi.

CONCLUSIONS

Our findings suggest that gene clustering on the T. reesei chromosomes occurs but generally has no impact on their expression. CAZyme genes located in subtelomeres, however, exhibited a much lower basal expression level. The gene inventory of the subtelomeres suggests a major role of competition for nitrogen and iron, supported by antibiosis, for the fitness of T. reesei. The availability of fully annotated chromosomes will facilitate the use of genetic crossings in identifying still unknown genes responsible for specific traits of T. reesei.

Adetiba E, Olugbara OO. Improved Classification of Lung Cancer Using Radial Basis Function Neural Network with Affine Transforms of Voss Representation. PLoS One 2015;10:e0143542. [PMID: 26625358 PMCID: PMC4666594 DOI: 10.1371/journal.pone.0143542] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 08/17/2015] [Accepted: 11/05/2015] [Indexed: 11/18/2022] Open

Assessing the effects of data selection and representation on the development of reliable E. coli sigma 70 promoter region predictors. PLoS One 2015;10:e0119721. [PMID: 25803493 PMCID: PMC4372424 DOI: 10.1371/journal.pone.0119721] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 06/12/2014] [Accepted: 01/26/2015] [Indexed: 11/27/2022] Open

Abstract

As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for the annotation of functional elements (e.g., transcriptional regulatory elements) grows. Promoters are the key regulatory elements, which recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). The identification of the promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors including: i) training data; ii) data representation; iii) classification algorithms; iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and utilize them to experimentally examine the effect of these factors on the predictive performance of E. coli σ⁷⁰ promoter models. Our results suggest that under some combinations of the first three criteria, a prediction model might perform very well on cross-validation experiments while its performance on independent test data is drastically poorer. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that might be estimated using the cross-validation procedure. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data was obtained. On the other hand, poor prediction models seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best performing sequence-based classifiers outperform the best performing structure-based classifiers on both cross-validation and independent test performance evaluation experiments. Finally, we propose a meta-predictor method combining two top performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ⁷⁰ promoter prediction methods.
