1. Pham NT, Terrance AT, Jeon YJ, Rakkiyappan R, Manavalan B. ac4C-AFL: A high-precision identification of human mRNA N4-acetylcytidine sites based on adaptive feature representation learning. Mol Ther Nucleic Acids 2024; 35:102192. [PMID: 38779332 PMCID: PMC11108997 DOI: 10.1016/j.omtn.2024.102192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Accepted: 04/18/2024] [Indexed: 05/25/2024]
Abstract
RNA N4-acetylcytidine (ac4C) is a highly conserved RNA modification that plays a crucial role in controlling mRNA stability, processing, and translation. Consequently, accurate identification of ac4C sites across the genome is critical for understanding gene expression regulation mechanisms. In this study, we have developed ac4C-AFL, a bioinformatics tool that precisely identifies ac4C sites from primary RNA sequences. In ac4C-AFL, we identified the optimal sequence length for model building and implemented an adaptive feature representation strategy that is capable of extracting the most representative features from RNA. To identify the most relevant features, we proposed a novel ensemble feature importance scoring strategy to rank features effectively. We then used this information to conduct a sequential forward search, which determines the optimal feature set individually for each of the 16 sequence-derived feature descriptors. Utilizing these optimal feature descriptors, we constructed 176 baseline models using 11 popular classifiers. The most efficient baseline models were identified using the two-step feature selection approach, and their predicted scores were integrated and trained with the appropriate classifier to develop the final prediction model. Our rigorous cross-validations and independent tests demonstrate that ac4C-AFL surpasses contemporary tools in predicting ac4C sites. Moreover, we have developed a publicly accessible web server at https://balalab-skku.org/ac4C-AFL/.
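The ranked sequential forward search mentioned in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sequential_forward_search`, the `score_fn` callback (a stand-in for a cross-validated model score), and the toy scoring function are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def sequential_forward_search(
    ranked_features: Sequence[str],
    score_fn: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Grow a feature set along a precomputed importance ranking.

    Features are added one at a time in ranked order, and the prefix of the
    ranking with the highest score is returned. `score_fn` is a hypothetical
    stand-in for a cross-validated performance estimate.
    """
    best_subset: List[str] = []
    best_score = float("-inf")
    current: List[str] = []
    for feature in ranked_features:
        current.append(feature)
        score = score_fn(current)
        # Keep the earliest (smallest) subset that attains the best score.
        if score > best_score:
            best_score = score
            best_subset = list(current)
    return best_subset, best_score


def toy_score(subset: List[str]) -> float:
    """Assumed toy score: reward informative features, penalise subset size."""
    informative = {"f1", "f2", "f3"}
    hits = sum(1 for f in subset if f in informative)
    return hits - 0.1 * len(subset)


subset, score = sequential_forward_search(["f1", "f2", "f3", "f4", "f5"], toy_score)
```

In practice `score_fn` would train and cross-validate a classifier on the candidate subset; the ranking itself would come from whatever importance scoring is in use.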
Affiliation(s)
- Nhat Truong Pham
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do 16419, Republic of Korea
- Annie Terrina Terrance
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do 16419, Republic of Korea
- Young-Jun Jeon
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do 16419, Republic of Korea
- Rajan Rakkiyappan
  - Department of Mathematics, Bharathiar University, Coimbatore, Tamil Nadu 641046, India
- Balachandran Manavalan
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do 16419, Republic of Korea
2. Ataş PK. A novel hybrid model to predict concomitant diseases for Hashimoto's thyroiditis. BMC Bioinformatics 2023; 24:319. [PMID: 37620755 PMCID: PMC10464155 DOI: 10.1186/s12859-023-05443-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2023] [Accepted: 08/10/2023] [Indexed: 08/26/2023]
Abstract
Hashimoto's thyroiditis is an autoimmune disorder characterized by the destruction of thyroid cells through immune-mediated mechanisms involving cells and antibodies. The condition can trigger disturbances in metabolism, leading to the development of other autoimmune diseases, known as concomitant diseases. Multiple concomitant diseases may coexist in a single individual, making it challenging to diagnose and manage them effectively. This study aims to propose a novel hybrid algorithm that classifies concomitant diseases associated with Hashimoto's thyroiditis based on sequences. The approach involves building distinct prediction models for each class and using the output of one model as input for the subsequent one, resulting in a dynamic decision-making process. Genes associated with concomitant diseases were collected alongside those related to Hashimoto's thyroiditis, and their sequences were obtained from the NCBI site in fasta format. The hybrid algorithm was evaluated against common machine learning algorithms and their various combinations. The experimental results demonstrate that the proposed hybrid model outperforms existing classification methods in terms of performance metrics. The significance of this study lies in its two distinctive aspects. Firstly, it presents a new benchmarking dataset that has not been previously developed in this field, using diverse methods. Secondly, it proposes a more effective and efficient solution that accounts for the dynamic nature of the dataset. The hybrid approach holds promise in investigating the genetic heterogeneity of complex diseases such as Hashimoto's thyroiditis and identifying new autoimmune disease genes. Additionally, the results of this study may aid in the development of genetic screening tools and laboratory experiments targeting Hashimoto's thyroiditis genetic risk factors. 
New software, models, and techniques for computing, including systems biology, machine learning, and artificial intelligence, are used in our study.
Affiliation(s)
- Pınar Karadayı Ataş
  - Department of Computer Engineering, Istanbul Arel University, 34537, Buyukcekmece, Istanbul, Turkey.
3. Kim S, Yuan JB, Woods WS, Newton DA, Perez-Pinera P, Song JS. Chromatin structure and context-dependent sequence features control prime editing efficiency. Front Genet 2023; 14:1222112. [PMID: 37456665 PMCID: PMC10344898 DOI: 10.3389/fgene.2023.1222112] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/13/2023] [Accepted: 06/16/2023] [Indexed: 07/18/2023]
Abstract
Prime editing (PE) is a highly versatile CRISPR-Cas9 genome editing technique. The current constructs, however, have variable efficiency and may require laborious experimental optimization. This study presents statistical models for learning the salient epigenomic and sequence features of target sites modulating the editing efficiency and provides guidelines for designing optimal PEs. We found that both regional constitutive heterochromatin and local nucleosome occlusion of target sites impede editing, while position-specific G/C nucleotides in the primer-binding site (PBS) and reverse transcription (RT) template regions of PE guide RNA (pegRNA) yield high editing efficiency, especially for short PBS designs. The presence of G/C nucleotides was most critical immediately 5' to the protospacer adjacent motif (PAM) site for all designs. The effects of different last templated nucleotides were quantified and observed to depend on the length of both PBS and RT templates. Our models found AGG to be the preferred PAM and detected a guanine nucleotide four bases downstream of the PAM to facilitate editing, suggesting a hitherto-unrecognized interaction with Cas9. A neural network interpretation method based on nonextensive statistical mechanics further revealed multi-nucleotide preferences, indicating dependency among several bases across pegRNA. Our work clarifies previous conflicting observations and uncovers context-dependent features important for optimizing PE designs.
Affiliation(s)
- Somang Kim
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Jimmy B. Yuan
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Wendy S. Woods
  - Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Destry A. Newton
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Pablo Perez-Pinera
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Department of Biomedical and Translational Sciences, Carle-Illinois College of Medicine, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Cancer Center at Illinois, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Department of Molecular and Integrative Physiology, University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Jun S. Song
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Cancer Center at Illinois, University of Illinois at Urbana-Champaign, Urbana, IL, United States
  - Center for Theoretical Physics, Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, United States
  - Department of Statistics, Harvard University, Cambridge, MA, United States
4. Kim S, Yuan JB, Woods WS, Newton DA, Perez-Pinera P, Song JS. Chromatin structure and context-dependent sequence features control prime editing efficiency. bioRxiv 2023:2023.04.15.536944. [PMID: 37162994 PMCID: PMC10168420 DOI: 10.1101/2023.04.15.536944] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/11/2023]
Abstract
Prime editor (PE) is a highly versatile CRISPR-Cas9 genome editing technique. The current constructs, however, have variable efficiency and may require laborious experimental optimization. This study presents statistical models for learning the salient epigenomic and sequence features of target sites modulating the editing efficiency and provides guidelines for designing optimal PEs. We found that both regional constitutive heterochromatin and local nucleosome occlusion of target sites impede editing, while position-specific G/C nucleotides in the primer binding site (PBS) and reverse transcription (RT) template regions of PE guide-RNA (pegRNA) yield high editing efficiency, especially for short PBS designs. The presence of G/C nucleotides was most critical immediately 5' to the protospacer adjacent motif (PAM) site for all designs. The effects of different last templated nucleotides were quantified and seen to depend on both PBS and RT template lengths. Our models found AGG to be the preferred PAM and detected a guanine nucleotide four bases downstream of PAM to facilitate editing, suggesting a hitherto-unrecognized interaction with Cas9. A neural network interpretation method based on nonextensive statistical mechanics further revealed multi-nucleotide preferences, indicating dependency among several bases across pegRNA. Our work clarifies previous conflicting observations and uncovers context-dependent features important for optimizing PE designs.
Affiliation(s)
- Somang Kim
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Jimmy B. Yuan
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Wendy S. Woods
  - Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Destry A. Newton
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Pablo Perez-Pinera
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Department of Biomedical and Translational Sciences, Carle-Illinois College of Medicine, University of Illinois, Urbana, IL 61801, USA
  - Cancer Center at Illinois, University of Illinois, Urbana, IL 61801, USA
  - Department of Molecular and Integrative Physiology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Jun S. Song
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Cancer Center at Illinois, University of Illinois, Urbana, IL 61801, USA
5. Cai J, Wang T, Deng X, Tang L, Liu L. GM-lncLoc: LncRNAs subcellular localization prediction based on graph neural network with meta-learning. BMC Genomics 2023; 24:52. [PMID: 36709266 PMCID: PMC9883864 DOI: 10.1186/s12864-022-09034-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 06/29/2022] [Accepted: 11/21/2022] [Indexed: 01/29/2023]
Abstract
In recent years, a large number of studies have shown that the subcellular localization of long non-coding RNAs (lncRNAs) can provide crucial information for recognizing lncRNA function. Therefore, it is of great significance to establish a computational method to accurately predict the subcellular localization of lncRNAs. Previous prediction models are based on low-level sequence information and are hampered by the few-sample problem. In this study, we propose a new prediction model, GM-lncLoc, which is based on the initial information extracted from the lncRNA sequence and also incorporates graph structure information to extract high-level features of lncRNAs. In addition, a meta-learning training mode is introduced to obtain meta-parameters by training on a series of tasks. With the meta-parameters, the final parameters of other similar tasks can be learned quickly, which addresses the few-sample problem in lncRNA subcellular localization. Compared with previous methods, GM-lncLoc achieved the best results, with accuracies of 93.4% and 94.2% on the benchmark datasets of 5 and 4 subcellular compartments, respectively. Furthermore, the prediction performance of GM-lncLoc was also better on the independent dataset. These results show the effectiveness and great potential of our proposed method for lncRNA subcellular localization prediction. The datasets and source code are freely available at https://github.com/JunzheCai/GM-lncLoc.
Affiliation(s)
- Junzhe Cai
  - School of Information, Yunnan Normal University, Kunming, Yunnan, China (GRID: grid.410739.8; ISNI: 0000 0001 0723 6903)
- Ting Wang
  - School of Information, Yunnan Normal University, Kunming, Yunnan, China (GRID: grid.410739.8; ISNI: 0000 0001 0723 6903)
- Xi Deng
  - School of Information, Yunnan Normal University, Kunming, Yunnan, China (GRID: grid.410739.8; ISNI: 0000 0001 0723 6903)
- Lin Tang
  - Key Laboratory of Educational Information for Nationalities Ministry of Education, Yunnan Normal University, Kunming, Yunnan, China (GRID: grid.410739.8; ISNI: 0000 0001 0723 6903)
- Lin Liu
  - School of Information, Yunnan Normal University, Kunming, Yunnan, China (GRID: grid.410739.8; ISNI: 0000 0001 0723 6903)
6. Ding Y, Tiwari P, Guo F, Zou Q. Shared subspace-based radial basis function neural network for identifying ncRNAs subcellular localization. Neural Netw 2022; 156:170-178. [DOI: 10.1016/j.neunet.2022.09.026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2022] [Revised: 07/25/2022] [Accepted: 09/26/2022] [Indexed: 11/11/2022]
7. Yang Y, Lv H, Chen N. A Survey on ensemble learning under the era of deep learning. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10283-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
8. Roy T, Sharma K, Dhall A, Patiyal S, Raghava GPS. In silico method for predicting infectious strains of influenza A virus from its genome and protein sequences. J Gen Virol 2022; 103. [PMID: 36318663 DOI: 10.1099/jgv.0.001802] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Influenza A is a contagious viral disease responsible for four pandemics in the past and a major public health concern. Being zoonotic in nature, the virus can cross the species barrier and transmit from wild aquatic bird reservoirs to humans via intermediate hosts. In this study, we have developed a computational method for the prediction of human-associated and non-human-associated influenza A virus sequences. The models were trained and validated on protein and genome sequences of influenza A virus. Firstly, we have developed prediction models for 15 types of influenza A proteins using composition-based and one-hot-encoding features. We achieved the highest AUC of 0.98 for the HA protein on a validation dataset using dipeptide composition-based features. Of note, we obtained a maximum AUC of 0.99 using one-hot-encoding features for protein-based models on a validation dataset. Secondly, we built models using whole genome sequences, which achieved an AUC of 0.98 on a validation dataset. In addition, we showed that our method outperforms a similarity-based approach (i.e., BLAST) on the same validation dataset. Finally, we integrated our best models into a user-friendly web server 'FluSPred' (https://webs.iiitd.edu.in/raghava/fluspred/index.html) and a standalone version (https://github.com/raghavagps/FluSPred) for the prediction of human-associated/non-human-associated influenza A virus strains.
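For orientation, the two feature types named in this abstract, dipeptide composition and one-hot encoding, can be sketched as below. This is an illustrative sketch only, not the FluSPred code: the function names, the handling of non-standard residues, and the example peptide are assumptions.

```python
from itertools import product
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def dipeptide_composition(sequence: str) -> Dict[str, float]:
    """Return the fraction of each of the 400 dipeptides in a protein sequence."""
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # assumed choice: skip pairs with non-standard residues
            counts[pair] += 1
            total += 1
    return {pair: (n / total if total else 0.0) for pair, n in counts.items()}


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Encode each residue as a 20-dimensional indicator vector."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        if residue in index:  # assumed choice: unknown residues stay all-zero
            row[index[residue]] = 1
        encoded.append(row)
    return encoded


# Hypothetical five-residue peptide, used only to exercise the functions.
dpc = dipeptide_composition("MKAAV")
onehot = one_hot_encode("MKAAV")
```

Dipeptide composition yields a fixed-length vector regardless of sequence length, whereas one-hot encoding preserves residue order and therefore needs padding or truncation before it can feed a fixed-size model.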
Affiliation(s)
- Trinita Roy
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Khushal Sharma
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Anjali Dhall
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Sumeet Patiyal
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Gajendra Pal Singh Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
9. Zhou H, Wang H, Tang J, Ding Y, Guo F. Identify ncRNA Subcellular Localization via Graph Regularized k-Local Hyperplane Distance Nearest Neighbor Model on Multi-Kernel Learning. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:3517-3529. [PMID: 34432632 DOI: 10.1109/tcbb.2021.3107621] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 06/13/2023]
Abstract
Non-coding RNAs (ncRNAs) are RNAs that do not encode protein sequences. Emerging evidence shows that many ncRNAs participate in many biological processes and are likely to be widely involved in many types of cancer. Therefore, understanding their functionality is of great importance. As with proteins, the various functions of ncRNAs rely on their subcellular localizations. Traditional high-throughput wet-lab methods for identifying subcellular localization are time-consuming and costly. In this paper, we propose a novel computational method based on multi-kernel learning to identify multi-label ncRNA subcellular localizations via the graph regularized k-local hyperplane distance nearest neighbor algorithm. First, we construct six types of sequence-based feature descriptors and select important feature vectors. Then, we build a multi-kernel learning model with the Hilbert-Schmidt independence criterion (HSIC) to obtain optimal weights for the various features. Furthermore, we propose the graph regularized k-local hyperplane distance nearest neighbor algorithm (GHKNN) as a binary classification model for detecting one kind of non-coding RNA subcellular localization. Finally, we apply the One-vs-Rest strategy to decompose the multi-label problem of non-coding RNA subcellular localization. Our method achieves excellent performance on three ncRNA datasets and three human ncRNA datasets, and outperforms other outstanding machine learning methods. Compared with existing methods, our model also performs well, especially on small datasets. We expect that this model will be useful for the prediction of subcellular localization and the study of important functional mechanisms of ncRNAs. Furthermore, we establish a user-friendly web server (http://ncrna.lbci.net/) implementing our method, which can be easily used by most experimental scientists.
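The One-vs-Rest step named in this abstract turns one multi-label problem into one binary problem per class. A minimal sketch of that decomposition follows; the function name, the compartment names, and the example annotations are hypothetical, and the binary classifier itself (GHKNN in the paper) is not reproduced here.

```python
from typing import Dict, List, Sequence, Set


def one_vs_rest_targets(
    labels: Sequence[Set[str]], classes: Sequence[str]
) -> Dict[str, List[int]]:
    """Build one binary target vector per class from multi-label annotations.

    For class c, sample i gets 1 if c is among its labels and 0 otherwise,
    so each class can be handed to an independent binary classifier.
    """
    return {
        cls: [1 if cls in sample_labels else 0 for sample_labels in labels]
        for cls in classes
    }


# Hypothetical localization annotations for four ncRNAs.
labels = [{"nucleus"}, {"cytoplasm", "exosome"}, {"nucleus", "cytoplasm"}, set()]
classes = ["nucleus", "cytoplasm", "exosome"]
targets = one_vs_rest_targets(labels, classes)
```

At prediction time, each per-class classifier votes independently, so a sample can receive zero, one, or several localization labels.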
10. Asim MN, Ibrahim MA, Imran Malik M, Dengel A, Ahmed S. Circ-LocNet: A Computational Framework for Circular RNA Sub-Cellular Localization Prediction. Int J Mol Sci 2022; 23:ijms23158221. [PMID: 35897818 PMCID: PMC9329987 DOI: 10.3390/ijms23158221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2022] [Revised: 07/15/2022] [Accepted: 07/20/2022] [Indexed: 02/04/2023]
Abstract
Circular ribonucleic acids (circRNAs) are novel non-coding RNAs that emanate from alternative splicing of precursor mRNA in reversed order across exons. Despite the abundant presence of circRNAs in human genes and their involvement in diverse physiological processes, the functionality of most circRNAs remains a mystery. As with other non-coding RNAs, knowledge of the sub-cellular localization of circRNAs can help to demystify the influence of circRNAs on protein synthesis, degradation, and destination, their association with different diseases, and their potential for drug development. To date, wet-lab experimental approaches are used to detect the sub-cellular locations of circular RNAs. These approaches help to elucidate the role of circRNAs as protein scaffolds, RNA-binding protein (RBP) sponges, micro-RNA (miRNA) sponges, parental gene expression modifiers, alternative splicing regulators, and transcription regulators. To complement wet-lab experiments, and considering the progress made by machine learning approaches in determining the sub-cellular localization of other non-coding RNAs, this paper develops a computational framework, Circ-LocNet, to precisely detect circRNA sub-cellular localization. Circ-LocNet performs a comprehensive extrinsic evaluation of 7 residue frequency-based, residue order and frequency-based, and physicochemical property-based sequence descriptors using the five most widely used machine learning classifiers. Further, it explores the performance impact of K-order sequence descriptor fusion, in which it ensembles similar as well as dissimilar genres of statistical representation learning approaches to reap the combined benefits. Considering the diversity of statistical representation learning schemes, it assesses the performance of second-order and third-order fusion, going all the way up to seventh-order sequence descriptor fusion.
A comprehensive empirical evaluation of Circ-LocNet over a newly developed benchmark dataset using different settings reveals that standalone residue frequency-based sequence descriptors and tree-based classifiers are more suitable for predicting the sub-cellular localization of circular RNAs. Further, K-order heterogeneous sequence descriptor fusion in combination with tree-based classifiers most accurately predicts the sub-cellular localization of circular RNAs. We anticipate this study will act as a rich baseline and push the development of robust computational methodologies for the accurate sub-cellular localization determination of novel circRNAs.
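To give a sense of the search space implied by K-order descriptor fusion over 7 descriptors, the candidate fusion sets can be enumerated as below. This sketch assumes that every combination of K descriptors is a candidate and that fusion means concatenating feature vectors; neither is stated in the abstract, and the descriptor names are placeholders.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

# Placeholder names; the abstract does not list the seven descriptors.
descriptors = [f"descriptor_{i}" for i in range(1, 8)]

# All candidate fusion sets from second order up to seventh order.
fusion_sets: Dict[int, List[Tuple[str, ...]]] = {
    k: list(combinations(descriptors, k)) for k in range(2, len(descriptors) + 1)
}
counts = {k: len(sets) for k, sets in fusion_sets.items()}


def fuse(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-descriptor feature vectors into one fused vector."""
    fused: List[float] = []
    for vec in vectors:
        fused.extend(vec)
    return fused


example = fuse([[0.1, 0.2], [0.3], [0.4, 0.5, 0.6]])
```

Under these assumptions there are 120 fusion sets in total, each of which would be paired with every classifier under evaluation.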
Affiliation(s)
- Muhammad Nabeel Asim (corresponding author)
  - German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Muhammad Ali Ibrahim
  - German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Muhammad Imran Malik
  - School of Computer Science & Electrical Engineering, National University of Sciences and Technology, Islamabad 44000, Pakistan
- Andreas Dengel
  - German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Sheraz Ahmed
  - German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
  - DeepReader GmbH, Trippstadter Str. 122, 67663 Kaiserslautern, Germany
11. Li J, Yang Z, Wang D, Li Z. WAFNRLTG: A Novel Model for Predicting LncRNA Target Genes Based on Weighted Average Fusion Network Representation Learning Method. Front Cell Dev Biol 2022; 9:820342. [PMID: 35127729 PMCID: PMC8807548 DOI: 10.3389/fcell.2021.820342] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/22/2021] [Accepted: 12/14/2021] [Indexed: 11/29/2022]
Abstract
Long non-coding RNAs (lncRNAs) do not encode proteins, yet they have been well established to be involved in complex regulatory functions, and lncRNA regulatory dysfunction can lead to a variety of complex human diseases. LncRNAs mostly exert their functions by regulating the expression of target genes, and accurate prediction of potential lncRNA target genes would help to further the understanding of the functional annotation of lncRNAs. Considering the limitations of traditional computational methods for predicting lncRNA target genes, a novel model named Weighted Average Fusion Network Representation learning for predicting LncRNA Target Genes (WAFNRLTG) was proposed. First, a novel heterogeneous network was constructed by integrating an lncRNA sequence similarity network, an mRNA sequence similarity network, an lncRNA-mRNA interaction network, an lncRNA-miRNA interaction network and an mRNA-miRNA interaction network. Next, four popular network representation learning methods were utilized to obtain the representation vectors of lncRNA and mRNA nodes. Then, the representations of lncRNAs and target genes in the heterogeneous network were obtained with the weighted average fusion network representation learning method. Finally, we merged the representations of lncRNAs and related target genes to form lncRNA-gene pairs, trained the XGBoost classifier and predicted potential lncRNA target genes. In five-fold cross-validation on the training and independent datasets, the experimental results demonstrated that WAFNRLTG obtained better AUC scores (0.9410, 0.9350) and AUPR scores (0.9391, 0.9350). Moreover, case studies of three common lncRNAs were performed to predict their potential target genes, and the results confirmed the effectiveness of WAFNRLTG. The source code and all data of WAFNRLTG can be freely downloaded at https://github.com/HGDYZW/WAFNRLTG.
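The weighted average fusion step can be illustrated with a small sketch that combines several embeddings of the same node into one vector. It assumes the embeddings share one dimensionality and that the weights are supplied by the caller; how WAFNRLTG actually chooses its weights is not described in the abstract, and the function name and example values are hypothetical.

```python
from typing import List, Sequence


def weighted_average_fusion(
    embeddings: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Fuse several same-length embeddings of one node by a weighted average."""
    if not embeddings or len(embeddings) != len(weights):
        raise ValueError("need one weight per embedding")
    dim = len(embeddings[0])
    if any(len(vec) != dim for vec in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalise by the weight sum so the result stays on the input scale.
    return [
        sum(w * vec[i] for w, vec in zip(weights, embeddings)) / total
        for i in range(dim)
    ]


# Four hypothetical 2-d embeddings of one node, e.g. from four learning methods.
fused = weighted_average_fusion(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]],
    [2.0, 1.0, 1.0, 0.0],
)
```

A fused lncRNA vector and a fused gene vector could then be concatenated to represent an lncRNA-gene pair for a downstream classifier.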
Affiliation(s)
- Jianwei Li (corresponding author)
  - School of Artificial Intelligence, Institute of Computational Medicine, Hebei University of Technology, Tianjin, China
  - Hebei Province Key Laboratory of Big Data Calculation, Hebei University of Technology, Tianjin, China
- Zhenwu Yang
  - School of Artificial Intelligence, Institute of Computational Medicine, Hebei University of Technology, Tianjin, China
- Duanyang Wang
  - School of Artificial Intelligence, Institute of Computational Medicine, Hebei University of Technology, Tianjin, China
- Zhiguang Li
  - School of Artificial Intelligence, Institute of Computational Medicine, Hebei University of Technology, Tianjin, China
12. Giniūnaitė R, Petkevičiūtė-Gerlach D. Predicting the configuration and energy of DNA in a nucleosome by coarse-grain modelling. Phys Chem Chem Phys 2022; 24:26124-26133. [DOI: 10.1039/d2cp03553g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Abstract
We present a novel algorithm which uses a coarse-grained model and an energy minimisation procedure to predict the sequence-dependent DNA configuration in a nucleosome together with its energetic cost.
Affiliation(s)
- Rasa Giniūnaitė
  - Department of Applied Mathematics, Kaunas University of Technology, Studentų 50-318, 51368, Kaunas, Lithuania
  - Institute of Applied Mathematics, Vilnius University, Naugarduko 24, 03225, Vilnius, Lithuania
- Daiva Petkevičiūtė-Gerlach
  - Department of Applied Mathematics, Kaunas University of Technology, Studentų 50-318, 51368, Kaunas, Lithuania
13. Chien CH, Huang LY, Lo SF, Chen LJ, Liao CC, Chen JJ, Chu YW. Using Machine Learning Approaches to Predict Target Gene Expression in Rice T-DNA Insertional Mutants. Front Genet 2021; 12:798107. [PMID: 34976025 PMCID: PMC8718795 DOI: 10.3389/fgene.2021.798107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2021] [Accepted: 11/15/2021] [Indexed: 11/13/2022]
Abstract
Inserting T-DNA into the genome to change the expression of flanking genes is a common approach in rice functional gene research. However, whether the expression of a gene of interest is enhanced must be validated experimentally. Consequently, to improve the efficiency of screening activated genes, we established a model to predict gene expression in T-DNA mutants through machine learning methods. We gathered experimental datasets consisting of gene expression data in T-DNA mutants and captured the PROMOTER and MIDDLE sequences for encoding. In first-layer models, support vector machine (SVM) models were constructed with nine features consisting of information about biological function and local and global sequences. Feature encoding based on the PROMOTER sequence was weighted by logistic regression. The second-layer models integrated 16 first-layer models with minimum redundancy maximum relevance (mRMR) feature selection and the LADTree algorithm, which were selected from nine feature selection methods and 65 classification methods, respectively. The accuracy of the final two-layer machine learning model, referred to as TIMgo, was 99.3% based on fivefold cross-validation, and 85.6% based on independent testing. We discovered that the information within the local sequence contributed more to classification than the global sequence. TIMgo had a good predictive ability for target genes within 20 kb of the 35S enhancer. Based on the analysis of significant sequences, the G-box regulatory sequence may also play an important role in the activation mechanism of the 35S enhancer.
Affiliation(s)
- Ching-Hsuan Chien
- Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung, Taiwan
| | - Lan-Ying Huang
- Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung, Taiwan
| | - Shuen-Fang Lo
- Biotechnology Center, National Chung Hsing University, Taichung, Taiwan
| | - Liang-Jwu Chen
- Institute of Molecular Biology, National Chung Hsing University, Taichung, Taiwan
- Advanced Plant Biotechnology Center National Chung Hsing University, Taichung, Taiwan
| | - Chi-Chou Liao
- Institute of Molecular Biology, National Chung Hsing University, Taichung, Taiwan
| | - Jia-Jyun Chen
- Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
| | - Yen-Wei Chu
- Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung, Taiwan
- Biotechnology Center, National Chung Hsing University, Taichung, Taiwan
- Institute of Molecular Biology, National Chung Hsing University, Taichung, Taiwan
- Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
- Agricultural Biotechnology Center, National Chung Hsing University, Taichung, Taiwan
- Ph.D. Program in Translational Medicine, National Chung Hsing University, Taichung, Taiwan
- Rong Hsing Research Center for Translational Medicine, National Chung Hsing University, Taichung, Taiwan
- *Correspondence: Yen-Wei Chu,
|
14
|
Li HL, Pang YH, Liu B. BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models. Nucleic Acids Res 2021; 49:e129. [PMID: 34581805 PMCID: PMC8682797 DOI: 10.1093/nar/gkab829] [Citation(s) in RCA: 99] [Impact Index Per Article: 33.0] [Received: 07/18/2021] [Revised: 08/24/2021] [Accepted: 09/09/2021] [Indexed: 01/08/2023]
Abstract
To uncover the meaning of the ‘book of life’, this study discusses 155 different biological language models (BLMs) for DNA, RNA and protein sequence analysis, which are able to extract the linguistic properties of the ‘book of life’. We also extend the BLMs into a system called BioSeq-BLM for automatically representing and analyzing sequence data. Experimental results show that the predictors generated by BioSeq-BLM achieve comparable or even clearly better performance than the existing state-of-the-art predictors published in the literature, indicating that BioSeq-BLM will provide new approaches for biological sequence analysis based on natural language processing technologies, and contribute to the development of this very important field. To help readers use BioSeq-BLM for their own experiments, the corresponding web server and stand-alone package have been established and released, and can be freely accessed at http://bliulab.net/BioSeq-BLM/.
Affiliation(s)
- Hong-Liang Li
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Yi-He Pang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China.,Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
|
15
|
Chen Z, Zhao P, Li C, Li F, Xiang D, Chen YZ, Akutsu T, Daly RJ, Webb GI, Zhao Q, Kurgan L, Song J. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res 2021; 49:e60. [PMID: 33660783 PMCID: PMC8191785 DOI: 10.1093/nar/gkab122] [Citation(s) in RCA: 107] [Impact Index Per Article: 35.7] [Received: 10/26/2020] [Revised: 02/05/2021] [Accepted: 02/25/2021] [Indexed: 12/14/2022]
Abstract
Sequence-based analysis and prediction are fundamental bioinformatic tasks that facilitate understanding of the sequence(-structure)-function paradigm for DNAs, RNAs and proteins. Rapid accumulation of sequences requires equally pervasive development of new predictive models, which depends on the availability of effective tools that support these efforts. We introduce iLearnPlus, the first machine-learning platform with graphical- and web-based interfaces for the construction of machine-learning pipelines for analysis and predictions using nucleic acid and protein sequences. iLearnPlus provides a comprehensive set of algorithms and automates sequence-based feature extraction and analysis, construction and deployment of models, assessment of predictive performance, statistical analysis, and data visualization; all without programming. iLearnPlus includes a wide range of feature sets which encode information from the input sequences and over twenty machine-learning algorithms that cover several deep-learning approaches, outnumbering the current solutions by a wide margin. Our solution caters to experienced bioinformaticians, given the broad range of options, and biologists with no programming background, given the point-and-click interface and easy-to-follow design process. We showcase iLearnPlus with two case studies concerning prediction of long noncoding RNAs (lncRNAs) from RNA transcripts and prediction of crotonylation sites in protein chains. iLearnPlus is an open-source platform available at https://github.com/Superzchen/iLearnPlus/ with the webserver at http://ilearnplus.erc.monash.edu/.
Affiliation(s)
- Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
| | - Pei Zhao
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
| | - Chen Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Fuyi Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.,Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia.,Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, Victoria 3000, Australia
| | - Dongxu Xiang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.,Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Yong-Zi Chen
- Laboratory of Tumor Cell Biology, Key Laboratory of Cancer Prevention and Therapy, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin 300060, China
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
| | - Roger J Daly
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Geoffrey I Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Quanzhi Zhao
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China.,Key Laboratory of Rice Biology in Henan Province, Henan Agricultural University, Zhengzhou 450046, China
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.,Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
|
16
|
Abstract
Background: Many transcripts have been generated due to the development of sequencing technologies, and lncRNA is an important type of transcript. Predicting lncRNAs from transcripts is a challenging and important task. Traditional experimental lncRNA prediction methods are time-consuming and labor-intensive. Efficient computational methods for lncRNA prediction are in demand. Results: In this paper, we propose two lncRNA prediction methods based on feature ensemble learning strategies named LncPred-IEL and LncPred-ANEL. Specifically, we encode sequences into six different types of features including transcript-specified features and general sequence-derived features. Then we consider two feature ensemble strategies to utilize and integrate the information in different feature types, the iterative ensemble learning (IEL) and the attention network ensemble learning (ANEL). IEL employs a supervised iterative way to ensemble base predictors built on six different types of features. ANEL introduces an attention mechanism-based deep learning model to ensemble features by adaptively learning the weight of individual feature types. Experiments demonstrate that both LncPred-IEL and LncPred-ANEL can effectively separate lncRNAs and other transcripts in feature space. Moreover, comparison experiments demonstrate that LncPred-IEL and LncPred-ANEL outperform several state-of-the-art methods when evaluated by 5-fold cross-validation. Both methods have good performances in cross-species lncRNA prediction. Conclusions: LncPred-IEL and LncPred-ANEL are promising lncRNA prediction tools that can effectively utilize and integrate the information in different types of features.
Affiliation(s)
- Yanzhen Xu
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
| | - Xiaohan Zhao
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
| | - Shuai Liu
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
| | - Wen Zhang
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China.
|
17
|
MNase Profiling of Promoter Chromatin in Salmonella typhimurium-Stimulated GM12878 Cells Reveals Dynamic and Response-Specific Nucleosome Architecture. G3-GENES GENOMES GENETICS 2020; 10:2171-2178. [PMID: 32404364 PMCID: PMC7341138 DOI: 10.1534/g3.120.401266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
The nucleosome is the primary unit of chromatin structure and commonly imputed as a regulator of nuclear events, although the exact mechanisms remain unclear. Recent studies have shown that certain nucleosomes can have different sensitivities to micrococcal nuclease (MNase) digestion, resulting in the release of populations of nucleosomes dependent on the concentration of MNase. Mapping MNase sensitivity of nucleosomes at transcription start sites genome-wide reveals an important functional nucleosome organization that correlates with gene expression levels and transcription factor binding. In order to understand nucleosome distribution and sensitivity dynamics during a robust genome response, we mapped nucleosome position and sensitivity using multiple concentrations of MNase. We used the innate immune response as a model system to understand chromatin-mediated regulation. Herein we demonstrate that stimulation of a human lymphoblastoid cell line (GM12878) with heat-killed Salmonella typhimurium (HKST) results in changes in nucleosome sensitivity to MNase. We show that the HKST response alters the sensitivity of -1 nucleosomes at highly expressed promoters. Finally, we correlate the increased sensitivity with response-specific transcription factor binding. These results indicate that nucleosome sensitivity dynamics reflect the cellular response to HKST and pave the way for further studies that will deepen our understanding of the specificity of genome response.
|
18
|
Zuo Y, Zou Q, Lin J, Jiang M, Liu X. 2lpiRNApred: a two-layered integrated algorithm for identifying piRNAs and their functions based on LFE-GM feature selection. RNA Biol 2020; 17:892-902. [PMID: 32138598 PMCID: PMC7549647 DOI: 10.1080/15476286.2020.1734382] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 09/05/2019] [Revised: 12/16/2019] [Accepted: 02/18/2020] [Indexed: 12/18/2022]
Abstract
Piwi-interacting RNAs (piRNAs) are indispensable in transposon silencing, including in germ cell formation, germline stem cell maintenance, spermatogenesis, and oogenesis. piRNA pathways are amongst the major genome defence mechanisms, which maintain genome integrity. They also have important functions in tumorigenesis, as indicated by aberrantly expressed piRNAs being recently shown to play roles in the process of cancer development. A number of computational methods for identifying piRNAs have recently been proposed, but they still have not yielded satisfactory predictive performance. Moreover, only one computational method that identifies whether piRNAs function in inducing target mRNA deadenylation has been reported in the literature. In this study, we developed a two-layered integrated classifier algorithm, 2lpiRNApred. It identifies piRNAs in the first layer and determines whether they function in inducing target mRNA deadenylation in the second layer. A new feature selection algorithm, which was based on Luca fuzzy entropy and Gaussian membership function (LFE-GM), was proposed to reduce the dimensionality of the features. Five feature extraction strategies, namely, Kmer, General parallel correlation pseudo-dinucleotide composition, General series correlation pseudo-dinucleotide composition, Normalized Moreau-Broto autocorrelation, and Geary autocorrelation, and two types of classifier, Sparse Representation Classifier (SRC) and support vector machine with Mahalanobis distance-based radial basis function (SVMMDRBF), were used to construct 2lpiRNApred. The results indicate that 2lpiRNApred performs significantly better than six other existing prediction tools.
Affiliation(s)
- Yun Zuo
- Department of Computer Science, Xiamen University, Xiamen, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, China
| | - Jianyuan Lin
- Department of Computer Science, Xiamen University, Xiamen, China
| | - Min Jiang
- Department of Cognitive Science and Technology, Xiamen University, Xiamen, China
| | - Xiangrong Liu
- Department of Computer Science, Xiamen University, Xiamen, China
|
19
|
Liu B. BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief Bioinform 2020; 20:1280-1294. [PMID: 29272359 DOI: 10.1093/bib/bbx165] [Citation(s) in RCA: 188] [Impact Index Per Article: 47.0] [Received: 09/08/2017] [Revised: 11/08/2017] [Indexed: 01/07/2023]
Abstract
With the avalanche of biological sequences generated in the post-genomic age, one of the most challenging problems is how to computationally analyze their structures and functions. Machine learning techniques are playing key roles in this field. Typically, predictors based on machine learning techniques involve three main steps: feature extraction, predictor construction and performance evaluation. Although several Web servers and stand-alone tools have been developed to facilitate biological sequence analysis, each focuses on only an individual step. In this study, a powerful Web server called BioSeq-Analysis (http://bioinformatics.hitsz.edu.cn/BioSeq-Analysis/) has been proposed to automatically complete the three main steps for constructing a predictor. The user only needs to upload the benchmark data set. BioSeq-Analysis can generate the optimized predictor based on the benchmark data set, and the performance measures can be reported as well. Furthermore, to maximize user convenience, its stand-alone program was also released, which can be downloaded from http://bioinformatics.hitsz.edu.cn/BioSeq-Analysis/download/, and can be directly run on Windows, Linux and UNIX. Applied to three sequence analysis tasks, experimental results showed that the predictors generated by BioSeq-Analysis even outperformed some state-of-the-art methods. It is anticipated that BioSeq-Analysis will become a useful tool for biological sequence analysis.
|
20
|
Zhang J, Peng W, Wang L. LeNup: learning nucleosome positioning from DNA sequences with improved convolutional neural networks. Bioinformatics 2019; 34:1705-1712. [PMID: 29329398 PMCID: PMC5946947 DOI: 10.1093/bioinformatics/bty003] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 05/26/2017] [Accepted: 01/09/2018] [Indexed: 11/17/2022]
Abstract
Motivation: Nucleosome positioning plays significant roles in proper genome packing and its accessibility to execute transcription regulation. Despite a multitude of nucleosome positioning resources available online, including experimental datasets of genome-wide nucleosome occupancy profiles and computational tools for the analysis of these data, the complex language of eukaryotic nucleosome positioning remains incompletely understood. Results: Here, we address this challenge using an approach based on a state-of-the-art machine learning method. We present a novel convolutional neural network (CNN) to understand nucleosome positioning. We combined Inception-like networks with a gating mechanism for the response of multiple patterns and long term association in DNA sequences. We developed the open-source package LeNup based on the CNN to predict nucleosome positioning in Homo sapiens, Caenorhabditis elegans, Drosophila melanogaster as well as Saccharomyces cerevisiae genomes. We trained LeNup on four benchmark datasets. LeNup achieved greater predictive accuracy than previously published methods. Availability and implementation: LeNup is freely available as Python and Lua script source code under a BSD style license from https://github.com/biomedBit/LeNup. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Juhua Zhang
- Department of Biomedical Engineering.,Key Laboratory of Convergence Medical Engineering System and Healthcare Technology of the Ministry of Industry and Information Technology, School of Life Science, Beijing Institute of Technology, Beijing 100081, China
| | | | - Lei Wang
- Department of Biomedical Engineering
|
21
|
Zhang S, Lin J, Su L, Zhou Z. pDHS-DSET: Prediction of DNase I hypersensitive sites in plant genome using DS evidence theory. Anal Biochem 2019; 564-565:54-63. [DOI: 10.1016/j.ab.2018.10.018] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 08/23/2018] [Revised: 10/10/2018] [Accepted: 10/15/2018] [Indexed: 10/28/2022]
|
22
|
Tang G, Shi J, Wu W, Yue X, Zhang W. Sequence-based bacterial small RNAs prediction using ensemble learning strategies. BMC Bioinformatics 2018; 19:503. [PMID: 30577759 PMCID: PMC6302447 DOI: 10.1186/s12859-018-2535-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 01/08/2023]
Abstract
Background: Bacterial small non-coding RNAs (sRNAs) have emerged as important elements in diverse physiological processes, including growth, development, cell proliferation, differentiation, metabolic reactions and carbon metabolism, and attract great attention. Accurate prediction of sRNAs is important and challenging, and helps to explore the functions and mechanisms of sRNAs. Results: In this paper, we utilize a variety of sRNA sequence-derived features to develop ensemble learning methods for sRNA prediction. First, we compile a balanced dataset and four imbalanced datasets. Then, we investigate various sRNA sequence-derived features, such as spectrum profile, mismatch profile, reverse complement k-mer and pseudo nucleotide composition. Finally, we consider two ensemble learning strategies to integrate all features for building ensemble learning models for sRNA prediction. One is the weighted average ensemble method (WAEM), which uses the linear weighted sum of outputs from the individual feature-based predictors to predict sRNAs. The other is the neural network ensemble method (NNEM), which trains a deep neural network by combining diverse features. In the computational experiments, we evaluate our methods on these five datasets by using 5-fold cross validation. WAEM and NNEM can produce better results than existing state-of-the-art sRNA prediction methods. Conclusions: WAEM and NNEM have great potential for sRNA prediction, and are helpful for understanding the biological mechanism of bacteria.
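The weighted average ensemble described above, a linear weighted sum of the outputs of the individual feature-based predictors, can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, the weights are supplied by hand rather than learned, and normalizing the weights to sum to one is an added convention the abstract does not state.

```python
def weighted_average_ensemble(scores_per_predictor, weights):
    """Combine per-predictor scores by a linear weighted sum.

    scores_per_predictor[p][i] is the score predictor p gives sample i.
    Weights are normalized to sum to 1 (an assumption of this sketch).
    """
    if len(scores_per_predictor) != len(weights):
        raise ValueError("one weight per predictor is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [w / total for w in weights]
    n_samples = len(scores_per_predictor[0])
    return [
        sum(w * scores[i] for w, scores in zip(normalized, scores_per_predictor))
        for i in range(n_samples)
    ]


def classify(ensemble_scores, threshold=0.5):
    """Turn ensemble scores into 0/1 labels at a fixed threshold."""
    return [1 if s >= threshold else 0 for s in ensemble_scores]
```

For example, two predictors scoring two samples as `[0.9, 0.2]` and `[0.5, 0.4]`, combined with weights `[3, 1]`, give ensemble scores of about 0.8 and 0.25. In practice the weights would be tuned on validation data.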
Affiliation(s)
- Guifeng Tang
- School of Computer Science, Wuhan University, Wuhan, 430072, China
| | - Jingwen Shi
- School of Mathematics and Statistics, Wuhan University, Wuhan, 430072, China
| | - Wenjian Wu
- Electronic Information School, Wuhan University, Wuhan, 430072, China
| | - Xiang Yue
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210, USA
| | - Wen Zhang
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China.
|
23
|
Zhang S, Zhuang W, Xu Z. Prediction of DNase I hypersensitive sites in plant genome using multiple modes of pseudo components. Anal Biochem 2018; 549:149-156. [DOI: 10.1016/j.ab.2018.03.025] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 02/10/2018] [Revised: 03/23/2018] [Accepted: 03/27/2018] [Indexed: 12/25/2022]
|
24
|
Jia C, Yang Q, Zou Q. NucPosPred: Predicting species-specific genomic nucleosome positioning via four different modes of general PseKNC. J Theor Biol 2018; 450:15-21. [PMID: 29678692 DOI: 10.1016/j.jtbi.2018.04.025] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 01/26/2018] [Revised: 04/13/2018] [Accepted: 04/16/2018] [Indexed: 11/20/2022]
Abstract
The nucleosome is the basic structure of chromatin in eukaryotic cells, with essential roles in the regulation of many biological processes, such as DNA transcription, replication and repair, and RNA splicing. Because of the importance of nucleosomes, the factors that determine their positioning within genomes should be investigated. High-resolution nucleosome-positioning maps are now available for organisms including Saccharomyces cerevisiae, Drosophila melanogaster and Caenorhabditis elegans, enabling the identification of nucleosome positioning by application of computational tools. Here, we describe a novel predictor called NucPosPred, which was specifically designed for large-scale identification of nucleosome positioning in C. elegans and D. melanogaster genomes. NucPosPred was separately optimized for each species for four types of DNA sequence feature extraction, with consideration of two classification algorithms (gradient-boosting decision tree and support vector machine). The overall accuracy obtained with NucPosPred was 92.29% for C. elegans and 88.26% for D. melanogaster, outperforming previous methods and demonstrating the potential for species-specific prediction of nucleosome positioning. For the convenience of most experimental scientists, a web-server for the predictor NucPosPred is available at http://121.42.167.206/NucPosPred/index.jsp.
Affiliation(s)
- Cangzhi Jia
- Science of College, Dalian Maritime University, No. 1 Linghai Road, Dalian 116026, China.
| | - Qing Yang
- Science of College, Dalian Maritime University, No. 1 Linghai Road, Dalian 116026, China
| | - Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, China.
|
25
|
McCormick RF, Truong SK, Sreedasyam A, Jenkins J, Shu S, Sims D, Kennedy M, Amirebrahimi M, Weers BD, McKinley B, Mattison A, Morishige DT, Grimwood J, Schmutz J, Mullet JE. The Sorghum bicolor reference genome: improved assembly, gene annotations, a transcriptome atlas, and signatures of genome organization. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2018; 93:338-354. [PMID: 29161754 DOI: 10.1111/tpj.13781] [Citation(s) in RCA: 285] [Impact Index Per Article: 47.5] [Received: 04/05/2017] [Revised: 11/05/2017] [Accepted: 11/14/2017] [Indexed: 05/20/2023]
Abstract
Sorghum bicolor is a drought tolerant C4 grass used for the production of grain, forage, sugar, and lignocellulosic biomass and a genetic model for C4 grasses due to its relatively small genome (approximately 800 Mbp), diploid genetics, diverse germplasm, and colinearity with other C4 grass genomes. In this study, deep sequencing, genetic linkage analysis, and transcriptome data were used to produce and annotate a high-quality reference genome sequence. Reference genome sequence order was improved, 29.6 Mbp of additional sequence was incorporated, the number of genes annotated increased 24% to 34 211, average gene length and N50 increased, and error frequency was reduced 10-fold to 1 per 100 kbp. Subtelomeric repeats with characteristics of Tandem Repeats in Miniature (TRIM) elements were identified at the termini of most chromosomes. Nucleosome occupancy predictions identified nucleosomes positioned immediately downstream of transcription start sites and at different densities across chromosomes. Alignment of more than 50 resequenced genomes from diverse sorghum genotypes to the reference genome identified approximately 7.4 M single nucleotide polymorphisms (SNPs) and 1.9 M indels. Large-scale variant features in euchromatin were identified with periodicities of approximately 25 kbp. A transcriptome atlas of gene expression was constructed from 47 RNA-seq profiles of growing and developed tissues of the major plant organs (roots, leaves, stems, panicles, and seed) collected during the juvenile, vegetative and reproductive phases. Analysis of the transcriptome data indicated that tissue type and protein kinase expression had large influences on transcriptional profile clustering. The updated assembly, annotation, and transcriptome data represent a resource for C4 grass research and crop improvement.
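The abstract reports that N50 increased. As a reminder of what that statistic measures, here is a minimal sketch of the standard N50 definition; the function name `n50` and the example lengths are illustrative only and are not taken from the sorghum assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
```

For lengths 2 through 10 (total 54), the three longest sequences (10, 9, 8) reach 27 bases, so the N50 is 8. A higher N50 indicates that more of the assembly sits in long contiguous sequences.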
Affiliation(s)
- Ryan F McCormick
- Interdisciplinary Program in Genetics, Texas A&M University, College Station, TX, 77843, USA
- Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX, 77843, USA
| | - Sandra K Truong
- Interdisciplinary Program in Genetics, Texas A&M University, College Station, TX, 77843, USA
- Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX, 77843, USA
| | | | - Jerry Jenkins
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, 35806, USA
| | - Shengqiang Shu
- Department of Energy, Joint Genome Institute, Walnut Creek, CA, 94598, USA
| | - David Sims
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, 35806, USA
| | - Megan Kennedy
- Department of Energy, Joint Genome Institute, Walnut Creek, CA, 94598, USA
| | | | - Brock D Weers
- Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX, 77843, USA
| | - Brian McKinley
- Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX, 77843, USA
| | - Ashley Mattison
- Interdisciplinary Program in Genetics, Texas A&M University, College Station, TX, 77843, USA
- Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX, 77843, USA
| | - Daryl T Morishige
- Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX, 77843, USA
| | - Jane Grimwood
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, 35806, USA
- Department of Energy, Joint Genome Institute, Walnut Creek, CA, 94598, USA
| | - Jeremy Schmutz
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, 35806, USA
- Department of Energy, Joint Genome Institute, Walnut Creek, CA, 94598, USA
| | - John E Mullet
- Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX, 77843, USA
|
26
|
Druliner BR, Vera D, Johnson R, Ruan X, Apone LM, Dimalanta ET, Stewart FJ, Boardman L, Dennis JH. Comprehensive nucleosome mapping of the human genome in cancer progression. Oncotarget 2017; 7:13429-45. [PMID: 26735342 PMCID: PMC4924652 DOI: 10.18632/oncotarget.6811] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 12/18/2015] [Accepted: 12/21/2015] [Indexed: 11/25/2022]
Abstract
Altered chromatin structure is a hallmark of cancer, and inappropriate regulation of chromatin structure may represent the origin of transformation. Important studies have mapped human nucleosome distributions genome wide, but the role of chromatin structure in cancer progression has not been addressed. We developed a MNase-Transcription Start Site Sequence Capture method (mTSS-seq) to map the nucleosome distribution at human transcription start sites genome-wide in primary human lung and colon adenocarcinoma tissue. Here, we confirm that nucleosome redistribution is an early, widespread event in lung (LAC) and colon (CRC) adenocarcinoma. These altered nucleosome architectures are consistent between LAC and CRC patient samples indicating that they may serve as important early adenocarcinoma markers. We demonstrate that the nucleosome alterations are driven by the underlying DNA sequence and potentiate transcription factor binding. We conclude that DNA-directed nucleosome redistributions are widespread early in cancer progression. We have proposed an entirely new hierarchical model for chromatin-mediated genome regulation.
Affiliation(s)
- Brooke R Druliner
- Department of Biological Science, Florida State University, Tallahassee, Florida, United States of America.,Division of Gastroenterology and Hepatology, Department of Internal Medicine, Mayo Clinic, Rochester, Minnesota, United States of America
| | - Daniel Vera
- Department of Biological Science, Florida State University, Tallahassee, Florida, United States of America.,The Center for Genomics and Personalized Medicine, The Florida State University, Tallahassee, Florida, United States of America
| | - Ruth Johnson
- Department of Laboratory Medicine and Experimental Pathology, Mayo Clinic, Rochester, Minnesota, United States of America
| | - Xiaoyang Ruan
- Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, Minnesota, United States of America
| | - Lynn M Apone
- New England Biolabs Inc., Ipswich, Massachusetts, United States of America
| | - Eileen T Dimalanta
- New England Biolabs Inc., Ipswich, Massachusetts, United States of America
| | - Fiona J Stewart
- New England Biolabs Inc., Ipswich, Massachusetts, United States of America
| | - Lisa Boardman
- Division of Gastroenterology and Hepatology, Department of Internal Medicine, Mayo Clinic, Rochester, Minnesota, United States of America
| | - Jonathan H Dennis
- Department of Biological Science, Florida State University, Tallahassee, Florida, United States of America.,The Center for Genomics and Personalized Medicine, The Florida State University, Tallahassee, Florida, United States of America.,Institute of Molecular Biophysics, The Florida State University, Tallahassee, Florida, United States of America
|
27
|
Tahir M, Hayat M. iNuc-STNC: a sequence-based predictor for identification of nucleosome positioning in genomes by extending the concept of SAAC and Chou's PseAAC. MOLECULAR BIOSYSTEMS 2017; 12:2587-93. [PMID: 27271822 DOI: 10.1039/c6mb00221h] [Citation(s) in RCA: 89] [Impact Index Per Article: 12.7] [Indexed: 01/14/2023]
Abstract
The nucleosome is the fundamental unit of eukaryotic chromatin, which participates in regulating different cellular processes. Owing to the rapid growth in newly determined DNA primary sequences, it is indispensable to develop an automated model. However, identification of novel protein sequences using conventional methods is difficult or sometimes impossible because of vague motifs and the intricate structure of DNA. In this regard, an effective and high-throughput automated model, "iNuc-STNC", has been proposed to identify nucleosome positioning in genomes accurately and reliably. In the proposed model, DNA sequences are encoded using three distinct feature extraction strategies: dinucleotide composition, trinucleotide composition and split trinucleotide composition (STNC). Various statistical models were utilized as learner hypotheses. The jackknife test was employed to evaluate the success rates of the proposed model. The experimental results showed that SVM, in combination with STNC, obtained outstanding performance on all benchmark datasets. The success rates of the proposed model "iNuc-STNC" are higher than those of the current state-of-the-art methods in the literature. It is anticipated that the "iNuc-STNC" model will provide a rudimentary framework for the pharmaceutical industry in drug design.
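The composition features named in this abstract can be illustrated with a minimal sketch. It assumes plain normalized k-mer frequencies for the dinucleotide (k = 2) and trinucleotide (k = 3) compositions. The `split_composition` helper is a hypothetical reading of "split trinucleotide composition" (composition computed per segment and concatenated) and is not taken from the paper; function names and the three-way split are illustrative assumptions.

```python
from itertools import product


def kmer_composition(seq, k):
    """Normalized frequencies of all 4**k DNA k-mers, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing N or other symbols
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def split_composition(seq, k, parts=3):
    """Hypothetical 'split' variant (an assumption, not the published
    definition): k-mer composition of each segment, concatenated."""
    size = len(seq) // parts
    features = []
    for p in range(parts):
        start = p * size
        # the last segment absorbs any remainder
        end = (p + 1) * size if p < parts - 1 else len(seq)
        features.extend(kmer_composition(seq[start:end], k))
    return features


dnc = kmer_composition("ACGTACGTAC", 2)             # 16 features
tnc = kmer_composition("ACGTACGTAC", 3)             # 64 features
stnc = split_composition("ACGTACGTACGTACGTAC", 3)   # 3 * 64 features
```

Vectors of this kind would typically be passed to a classifier such as an SVM; the classifier itself is outside the scope of this sketch.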
Affiliation(s)
- Muhammad Tahir
- Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan.
| | - Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan.
|
28
|
Ramachandran S, Ahmad K, Henikoff S. Capitalizing on disaster: Establishing chromatin specificity behind the replication fork. Bioessays 2017; 39. [PMID: 28133760 PMCID: PMC5513704 DOI: 10.1002/bies.201600150] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 11/09/2022]
Abstract
Eukaryotic genomes are packaged into nucleosomal chromatin, and genomic activity requires the precise localization of transcription factors, histone modifications and nucleosomes. Classic work described the progressive reassembly and maturation of bulk chromatin behind replication forks. More recent proteomics has detailed the molecular machines that accompany the replicative polymerase to promote rapid histone deposition onto the newly replicated DNA. However, localized chromatin features are transiently obliterated by DNA replication every S phase of the cell cycle. Genomic strategies now observe the rebuilding of locus-specific chromatin features, and reveal surprising delays in transcription factor binding behind replication forks. This implies that transient chromatin disorganization during replication is a central juncture for targeted transcription factor binding within genomes. We propose that transient occlusion of regulatory elements by disorganized nucleosomes during chromatin maturation enforces specificity of factor binding.
Affiliation(s)
- Srinivas Ramachandran
- Division of Basic Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, USA.,Howard Hughes Medical Institute, Seattle, WA, USA
| | - Kami Ahmad
- Division of Basic Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
| | - Steven Henikoff
- Division of Basic Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, USA.,Howard Hughes Medical Institute, Seattle, WA, USA
|
29
|
Sexton BS, Druliner BR, Vera DL, Avey D, Zhu F, Dennis JH. Hierarchical regulation of the genome: global changes in nucleosome organization potentiate genome response. Oncotarget 2016; 7:6460-75. [PMID: 26771136 PMCID: PMC4872727 DOI: 10.18632/oncotarget.6841] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 12/22/2015] [Accepted: 12/28/2015] [Indexed: 11/25/2022] Open
Abstract
Nucleosome occupancy is critically important in regulating access to the eukaryotic genome. Few studies in human cells have measured genome-wide nucleosome distributions at high temporal resolution during a response to a common stimulus. We measured nucleosome distributions at high temporal resolution following Kaposi's-sarcoma-associated herpesvirus (KSHV) reactivation using our newly developed mTSS-seq technology, which maps nucleosome distribution at the transcription start sites (TSS) of all human genes. Nucleosomes underwent widespread changes in organization 24 hours after KSHV reactivation and returned to their basal nucleosomal architecture 48 hours after KSHV reactivation. The widespread changes consisted of an indiscriminate remodeling event resulting in the loss of nucleosome rotational phasing signals. Additionally, one in six TSSs in the human genome possessed nucleosomes that are translationally remodeled. 72% of the loci with translationally remodeled nucleosomes have nucleosomes that moved to positions encoded by the underlying DNA sequence. Finally we demonstrated that these widespread alterations in nucleosomal architecture potentiated regulatory factor binding. These descriptions of nucleosomal architecture changes provide a new framework for understanding the role of chromatin in the genomic response, and have allowed us to propose a hierarchical model for chromatin-based regulation of genome response.
Affiliation(s)
- Brittany S Sexton
- Department of Biological Science, The Florida State University, Tallahassee, Florida, USA
| | - Brooke R Druliner
- Department of Biological Science, The Florida State University, Tallahassee, Florida, USA.,Division of Gastroenterology and Hepatology, Mayo Clinic, Rochester, Minnesota, USA
| | - Daniel L Vera
- Department of Biological Science, The Florida State University, Tallahassee, Florida, USA.,The Center for Genomics and Personalized Medicine The Florida State University, Tallahassee, Florida, USA
| | - Denis Avey
- Department of Biological Science, The Florida State University, Tallahassee, Florida, USA
| | - Fanxiu Zhu
- Department of Biological Science, The Florida State University, Tallahassee, Florida, USA
| | - Jonathan H Dennis
- Department of Biological Science, The Florida State University, Tallahassee, Florida, USA
|
30
|
Li D, Luo L, Zhang W, Liu F, Luo F. A genetic algorithm-based weighted ensemble method for predicting transposon-derived piRNAs. BMC Bioinformatics 2016; 17:329. [PMID: 27578422 PMCID: PMC5006569 DOI: 10.1186/s12859-016-1206-3] [Citation(s) in RCA: 56] [Impact Index Per Article: 7.0] [Received: 03/18/2016] [Accepted: 08/24/2016] [Indexed: 02/05/2023] Open
Abstract
BACKGROUND Predicting piwi-interacting RNA (piRNA) is an important topic in small non-coding RNA research, and provides clues for understanding the mechanism of gamete generation. To the best of our knowledge, several machine learning approaches have been proposed for piRNA prediction, but there is still room for improvement. RESULTS In this paper, we develop a genetic algorithm-based weighted ensemble method for predicting transposon-derived piRNAs. We construct datasets for three species: Human, Mouse and Drosophila. For each species, we compile a balanced dataset and an imbalanced dataset, and thus obtain six datasets to build and evaluate prediction models. In the computational experiments, the genetic algorithm-based weighted ensemble method achieves 10-fold cross-validation AUC of 0.932, 0.937 and 0.995 on the balanced Human, Mouse and Drosophila datasets, respectively, and achieves AUC of 0.935, 0.939 and 0.996 on the imbalanced datasets of the three species. Further, we use the prediction models trained on the Mouse dataset to identify piRNAs of other species, and the models demonstrate good performance in cross-species prediction. CONCLUSIONS Compared with other state-of-the-art methods, our method achieves better performance. In conclusion, the proposed method is promising for transposon-derived piRNA prediction. The source code and datasets are available at https://github.com/zw9977129/piRNAPredictor.
Affiliation(s)
- Dingfang Li
- School of Mathematics and Statistics, Wuhan University, Wuhan, 430072 China
| | - Longqiang Luo
- School of Mathematics and Statistics, Wuhan University, Wuhan, 430072 China
| | - Wen Zhang
- State Key Lab of Software Engineering, Wuhan University, Wuhan, 430072 China
- School of Computer, Wuhan University, Wuhan, 430072 China
| | - Feng Liu
- International School of Software, Wuhan University, Wuhan, 430072 China
| | - Fei Luo
- State Key Lab of Software Engineering, Wuhan University, Wuhan, 430072 China
- School of Computer, Wuhan University, Wuhan, 430072 China
|
31
|
Dong J, Yao ZJ, Wen M, Zhu MF, Wang NN, Miao HY, Lu AP, Zeng WB, Cao DS. BioTriangle: a web-accessible platform for generating various molecular representations for chemicals, proteins, DNAs/RNAs and their interactions. J Cheminform 2016; 8:34. [PMID: 27330567 PMCID: PMC4915156 DOI: 10.1186/s13321-016-0146-2] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Received: 03/16/2016] [Accepted: 06/14/2016] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND A growing body of evidence from network biology indicates that most cellular components exert their functions through interactions with other cellular components, such as proteins, DNAs, RNAs and small molecules. The rapidly increasing amount of publicly available data in biology and chemistry enables researchers to revisit interaction problems by systematic integration and analysis of heterogeneous data. Currently, some tools have been developed to represent these components. However, they have limitations and focus only on the analysis of small molecules, proteins or DNAs/RNAs. To the best of our knowledge, there is still a lack of freely available, easy-to-use and integrated platforms for generating molecular descriptors of DNAs/RNAs, proteins, small molecules and their interactions. RESULTS Herein, we developed a comprehensive molecular representation platform, called BioTriangle, to emphasize the integration of cheminformatics and bioinformatics into a molecular informatics platform for computational biology study. It contains a feature-rich toolkit used for the characterization of various biological molecules and complex interaction samples including chemicals, proteins, DNAs/RNAs and even their interactions. By using BioTriangle, users can conveniently run a full pipeline from obtaining molecular data and computing molecular representations to constructing machine learning models. CONCLUSION BioTriangle provides a user-friendly interface to calculate various features of biological molecules and complex interaction samples conveniently. The computing tasks can be submitted and performed simply in a browser without any sophisticated installation and configuration process. BioTriangle is freely available at http://biotriangle.scbdd.com. Graphical abstract: An overview of BioTriangle, a platform for generating various molecular representations for chemicals, proteins, DNAs/RNAs and their interactions.
Affiliation(s)
- Jie Dong
- School of Pharmaceutical Sciences, Central South University, Changsha, People's Republic of China
| | - Zhi-Jiang Yao
- College of Chemistry and Chemical Engineering, Central South University, Changsha, People's Republic of China
| | - Ming Wen
- College of Chemistry and Chemical Engineering, Central South University, Changsha, People's Republic of China
| | - Min-Feng Zhu
- School of Mathematics and Statistics, Central South University, Changsha, People's Republic of China
| | - Ning-Ning Wang
- School of Pharmaceutical Sciences, Central South University, Changsha, People's Republic of China
| | - Hong-Yu Miao
- School of Public Health, University of Texas Health Science Center, Houston, TX USA
| | - Ai-Ping Lu
- Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, SAR People's Republic of China
| | - Wen-Bin Zeng
- School of Pharmaceutical Sciences, Central South University, Changsha, People's Republic of China
| | - Dong-Sheng Cao
- School of Pharmaceutical Sciences, Central South University, Changsha, People's Republic of China ; Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, SAR People's Republic of China
|
32
|
Eslami-Mossallam B, Schiessel H, van Noort J. Nucleosome dynamics: Sequence matters. Adv Colloid Interface Sci 2016; 232:101-113. [PMID: 26896338 DOI: 10.1016/j.cis.2016.01.007] [Citation(s) in RCA: 49] [Impact Index Per Article: 6.1] [Received: 10/09/2015] [Revised: 01/22/2016] [Accepted: 01/25/2016] [Indexed: 02/06/2023]
Abstract
About three quarters of all eukaryotic DNA is wrapped around protein cylinders, forming nucleosomes. Even though the histone proteins that make up the core of nucleosomes are highly conserved in evolution, nucleosomes can be very different from each other due to posttranslational modifications of the histones. Another crucial factor in making nucleosomes unique has so far been underappreciated: the sequence of their DNA. This review provides an overview of the experimental and theoretical progress that increasingly points to the importance of the nucleosomal base pair sequence. Specifically, we discuss the role of the underlying base pair sequence in nucleosome positioning, sliding, breathing, force-induced unwrapping, dissociation and partial assembly and also how the sequence can influence higher-order structures. A new view emerges: the physical properties of nucleosomes, especially their dynamical properties, are determined to a large extent by the mechanical properties of their DNA, which in turn depend on DNA sequence.
|
33
|
Liu B, Long R, Chou KC. iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework. Bioinformatics 2016; 32:2411-8. [PMID: 27153623 DOI: 10.1093/bioinformatics/btw186] [Citation(s) in RCA: 174] [Impact Index Per Article: 21.8] [Received: 01/22/2016] [Accepted: 04/03/2016] [Indexed: 11/13/2022]
Abstract
MOTIVATION Regulatory DNA elements are associated with DNase I hypersensitive sites (DHSs). Accordingly, identification of DHSs will provide useful insights for in-depth investigation into the function of noncoding genomic regions. RESULTS In this study, using an ensemble learning framework, we proposed a new predictor called iDHS-EL for identifying the location of DHSs in the human genome. It was formed by fusing three individual Random Forest (RF) classifiers into an ensemble predictor. The three RF operators were respectively based on three special modes of the general pseudo nucleotide composition (PseKNC): (i) kmer, (ii) reverse complement kmer and (iii) pseudo dinucleotide composition. It has been demonstrated that the new predictor remarkably outperforms the relevant state-of-the-art methods in both accuracy and stability. AVAILABILITY AND IMPLEMENTATION For the convenience of most experimental scientists, a web server for iDHS-EL is established at http://bioinformatics.hitsz.edu.cn/iDHS-EL, which is the first web-server predictor ever established for identifying DHSs, and by which users can easily get their desired results without the need to go through the mathematical details. We anticipate that iDHS-EL will become a very useful high-throughput tool for genome analysis. CONTACT bliu@gordonlifescience.org or bliu@insun.hit.edu.cn SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
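As an illustration of mode (ii) above, the following is a minimal sketch of a reverse complement k-mer composition. It assumes the common convention of merging each k-mer with its reverse complement and keeping the lexicographically smaller of the two as the key. The function names and the normalization are illustrative assumptions, not the iDHS-EL implementation.

```python
from itertools import product

# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Reverse complement of a DNA string, e.g. 'AAC' -> 'GTT'."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Representative of a k-mer and its reverse complement (the smaller one)."""
    return min(kmer, reverse_complement(kmer))


def revcomp_kmer_composition(seq, k):
    """Return (keys, frequencies) over canonical k-mers of a DNA sequence."""
    seq = seq.upper()
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):  # skip windows with ambiguous bases
            counts[canonical(window)] += 1
            total += 1
    return keys, [counts[key] / total if total else 0.0 for key in keys]


keys, features = revcomp_kmer_composition("ACGTACGTAC", 2)
```

Merging strand-equivalent k-mers reduces the feature count (for k = 2, from 16 to 10), which reflects the fact that a double-stranded site can be read from either strand.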
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China Gordon Life Science Institute, Belmont, MA 02478, USA
| | - Ren Long
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Belmont, MA 02478, USA Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia
|
34
|
Chen W, Feng P, Ding H, Lin H, Chou KC. Using deformation energy to analyze nucleosome positioning in genomes. Genomics 2016; 107:69-75. [DOI: 10.1016/j.ygeno.2015.12.005] [Citation(s) in RCA: 87] [Impact Index Per Article: 10.9] [Received: 10/12/2015] [Revised: 12/06/2015] [Accepted: 12/22/2015] [Indexed: 12/28/2022]
|
35
|
Global mapping of the regulatory interactions of histone residues. FEBS Lett 2015; 589:4061-70. [PMID: 26602082 DOI: 10.1016/j.febslet.2015.11.016] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 09/03/2015] [Revised: 11/10/2015] [Accepted: 11/11/2015] [Indexed: 11/23/2022]
Abstract
Histone residues can serve as platforms for specific regulatory function. Here we constructed a map of regulatory associations between histone residues and a wide spectrum of chromatin regulation factors based on gene expression changes by histone point mutations in Saccharomyces cerevisiae. Detailed analyses of this map revealed novel associations. Regarding the modulation of H3K4 and K36 methylation by Set1, Set2, or Jhd2, we proposed a role for H4K91 acetylation in early Pol II elongation, and for H4K16 deacetylation in late elongation and crosstalk with H3K4 demethylation for gene silencing. The association of H3K56 with nucleosome positioning suggested that this lysine residue and its acetylation might contribute to nucleosome mobility for transcription activation. Further insights into chromatin regulation are expected from this approach.
|
36
|
Brown AN, Vied C, Dennis JH, Bhide PG. Nucleosome Repositioning: A Novel Mechanism for Nicotine- and Cocaine-Induced Epigenetic Changes. PLoS One 2015; 10:e0139103. [PMID: 26414157 PMCID: PMC4586372 DOI: 10.1371/journal.pone.0139103] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 11/25/2014] [Accepted: 09/09/2015] [Indexed: 11/19/2022] Open
Abstract
Drugs of abuse modify behavior by altering gene expression in the brain. Gene expression can be regulated by changes in DNA methylation as well as by histone modifications, which alter chromatin structure, DNA compaction and DNA accessibility. In order to better understand the molecular mechanisms directing drug-induced changes in chromatin structure, we examined DNA-nucleosome interactions within promoter regions of 858 genes in human neuroblastoma cells (SH-SY5Y) exposed to nicotine or cocaine. Widespread, drug- and time-resolved repositioning of nucleosomes was identified at the transcription start site and promoter region of multiple genes. Nicotine and cocaine produced unique and shared changes in terms of the numbers and types of genes affected, as well as repositioning of nucleosomes at sites which could increase or decrease the probability of gene expression based on DNA accessibility. Half of the drug-induced nucleosome positions approximated a theoretical model of nucleosome occupancy based on physical and chemical characteristics of the DNA sequence, whereas the basal or drug naïve positions were generally DNA sequence independent. Thus we suggest that nucleosome repositioning represents an initial dynamic genome-wide alteration of the transcriptional landscape preceding more selective downstream transcriptional reprogramming, which ultimately characterizes the cell- and tissue-specific responses to drugs of abuse.
Affiliation(s)
- Amber N. Brown
- Center for Brain Repair, Department of Biomedical Sciences, Florida State University College of Medicine, Tallahassee, FL, United States of America
| | - Cynthia Vied
- Center for Brain Repair, Department of Biomedical Sciences, Florida State University College of Medicine, Tallahassee, FL, United States of America
| | - Jonathan H. Dennis
- Department of Biological Sciences, Florida State University, Tallahassee, Florida, United States of America
| | - Pradeep G. Bhide
- Center for Brain Repair, Department of Biomedical Sciences, Florida State University College of Medicine, Tallahassee, FL, United States of America
|
37
|
Liu B, Liu F, Wang X, Chen J, Fang L, Chou KC. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res 2015; 43:W65-71. [PMID: 25958395 PMCID: PMC4489303 DOI: 10.1093/nar/gkv458] [Citation(s) in RCA: 558] [Impact Index Per Article: 62.0] [Received: 01/16/2015] [Accepted: 04/27/2015] [Indexed: 11/12/2022] Open
Abstract
With the avalanche of biological sequences generated in the post-genomic age, one of the most challenging problems in computational biology is how to effectively formulate the sequence of a biological sample (such as DNA, RNA or protein) with a discrete model or a vector that can effectively reflect its sequence pattern information or capture its key features. Although several web servers and stand-alone tools were developed to address this problem, each of these tools can handle only one type of sample. Furthermore, the number of their built-in properties is limited, and hence it is often difficult for users to formulate the biological sequences according to their desired features or properties. In this article, we propose a much more flexible web server with a much larger number of built-in properties, called Pse-in-One (http://bioinformatics.hitsz.edu.cn/Pse-in-One/), which can, through its 28 different modes, generate nearly all the possible feature vectors for DNA, RNA and protein sequences. In particular, it can also generate feature vectors with properties defined by users themselves. These feature vectors can be easily combined with machine-learning algorithms to develop computational predictors and analysis methods for various tasks in bioinformatics and system biology. It is anticipated that the Pse-in-One web server will become a very useful tool in computational proteomics, genomics, as well as biological sequence analysis. Moreover, to maximize users’ convenience, its stand-alone version can also be downloaded from http://bioinformatics.hitsz.edu.cn/Pse-in-One/download/, and directly run on Windows, Linux, Unix and Mac OS.
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China Gordon Life Science Institute, Belmont, MA 02478, USA
| | - Fule Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
| | - Xiaolong Wang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
| | - Junjie Chen
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
| | - Longyun Fang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Belmont, MA 02478, USA Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia
|
38
|
DNA-Encoded Chromatin Structural Intron Boundary Signals Identify Conserved Genes with Common Function. Int J Genomics 2015; 2015:167578. [PMID: 25861617 PMCID: PMC4377520 DOI: 10.1155/2015/167578] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 10/01/2014] [Accepted: 02/15/2015] [Indexed: 12/14/2022] Open
Abstract
The regulation of metazoan gene expression occurs in part by pre-mRNA splicing into mature RNAs. Signals affecting the efficiency and specificity with which introns are removed have not been completely elucidated. Splicing likely occurs cotranscriptionally, with chromatin structure playing a key regulatory role. We calculated DNA encoded nucleosome occupancy likelihood (NOL) scores at the boundaries between introns and exons across five metazoan species. We found that (i) NOL scores reveal a sequence-based feature at the introns on both sides of the intron-exon boundary; (ii) this feature is not part of any recognizable consensus sequence; (iii) this feature is conserved throughout metazoa; (iv) this feature is enriched in genes sharing similar functions: ATPase activity, ATP binding, helicase activity, and motor activity; (v) genes with these functions exhibit different genomic characteristics; (vi) in vivo nucleosome positioning data confirm ontological enrichment at this feature; and (vii) genes with this feature exhibit unique dinucleotide distributions at the intron-exon boundary. The NOL scores point toward a physical property of DNA that may play a role in the mechanism of pre-mRNA splicing. These results provide a foundation for identification of a new set of regulatory DNA elements involved in splicing regulation.
|
39
|
Nüsgen N, Goering W, Dauksa A, Biswas A, Jamil MA, Dimitriou I, Sharma A, Singer H, Fimmers R, Fröhlich H, Oldenburg J, Gulbinas A, Schulz WA, El-Maarri O. Inter-locus as well as intra-locus heterogeneity in LINE-1 promoter methylation in common human cancers suggests selective demethylation pressure at specific CpGs. Clin Epigenetics 2015; 7:17. [PMID: 25798207 PMCID: PMC4367886 DOI: 10.1186/s13148-015-0051-y] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 10/14/2014] [Accepted: 02/02/2015] [Indexed: 11/10/2022] Open
Abstract
Background Hypomethylation of long interspersed element (LINE)-1 has been observed in tumorigenesis when using degenerate assays, which provide an average across all repeats. However, it is unknown whether individual LINE-1 loci or different CpGs within one specific LINE-1 promoter are equally affected by methylation changes. Conceivably, studying methylation changes at specific LINE-1 may be more informative than global assays for cancer diagnostics. Therefore, with the aim of mapping methylation at individual LINE-1 loci at single-CpG resolution and exploring the diagnostic potential of individual LINE-1 locus methylation, we analyzed methylation at 11 loci by pyrosequencing, next-generation bisulfite sequencing as well as global LINE-1 methylation in bladder, colon, pancreas, prostate, and stomach cancers compared to paired normal tissues and in blood samples from some of the patients compared to healthy donors. Results Most (72/80) tumor samples harbored significant methylation changes at at least one locus. Notably, our data revealed not only the expected hypomethylation but also hypermethylation at some loci. Specific CpGs within the LINE-1 consensus sequence appeared preferentially hypomethylated suggesting that these could act as seeds for hypomethylation. In silico analysis revealed that these CpG sites more likely faced the histones in the nucleosome. Multivariate logistic regression analysis did not reveal a significant clinical advantage of locus-specific methylation markers over global methylation markers in distinguishing tumors from normal tissues. Conclusions Methylation changes at individual LINE-1 loci are heterogeneous, whereas specific CpGs within the consensus sequence appear to be more prone to hypomethylation. With a broader selection of loci, locus-specific LINE-1 methylation could become a tool for tumor detection. 
Electronic supplementary material The online version of this article (doi:10.1186/s13148-015-0051-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Nicole Nüsgen
- Institute of Experimental Hematology and Transfusion Medicine, University of Bonn, Sigmund-Freud Str. 25, 53127 Bonn, Germany
| | - Wolfgang Goering
- Department of Urology, Medical Faculty, Heinrich-Heine-University, Moorenstr. 5, 40225 Düsseldorf, Germany
| | - Albertas Dauksa
- Institute for Digestive Research, Lithuanian University of Health Sciences, Eiveniu g. 2, Kaunas, 50009 Lithuania
| | - Arijit Biswas
- Institute of Experimental Hematology and Transfusion Medicine, University of Bonn, Sigmund-Freud Str. 25, 53127 Bonn, Germany
| | - Muhammad Ahmer Jamil
- Institute of Experimental Hematology and Transfusion Medicine, University of Bonn, Sigmund-Freud Str. 25, 53127 Bonn, Germany ; Bonn-Aachen International Center for IT (B-IT) Algorithmic Bioinformatics, University of Bonn, Dahlmannstr. 2, 53113 Bonn, Germany
| | - Ioanna Dimitriou
- Institute of Medical Biometry, Informatics and Epidemiology (IMBIE), University of Bonn, Sigmund-Freud-Straße 25, D-53127 Bonn, Germany
| | - Amit Sharma
- Institute of Experimental Hematology and Transfusion Medicine, University of Bonn, Sigmund-Freud Str. 25, 53127 Bonn, Germany
| | - Heike Singer
- Institute of Experimental Hematology and Transfusion Medicine, University of Bonn, Sigmund-Freud Str. 25, 53127 Bonn, Germany
| | - Rolf Fimmers
- Institute of Medical Biometry, Informatics and Epidemiology (IMBIE), University of Bonn, Sigmund-Freud-Straße 25, D-53127 Bonn, Germany
| | - Holger Fröhlich
- Bonn-Aachen International Center for IT (B-IT) Algorithmic Bioinformatics, University of Bonn, Dahlmannstr. 2, 53113 Bonn, Germany
| | - Johannes Oldenburg
- Institute of Experimental Hematology and Transfusion Medicine, University of Bonn, Sigmund-Freud Str. 25, 53127 Bonn, Germany
| | - Antanas Gulbinas
- Institute for Digestive Research, Lithuanian University of Health Sciences, Eiveniu g. 2, Kaunas, 50009 Lithuania
| | - Wolfgang A Schulz
- Department of Urology, Medical Faculty, Heinrich-Heine-University, Moorenstr. 5, 40225 Düsseldorf, Germany
| | - Osman El-Maarri
- Institute of Experimental Hematology and Transfusion Medicine, University of Bonn, Sigmund-Freud Str. 25, 53127 Bonn, Germany
|
40
|
Borozan I, Watt S, Ferretti V. Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification. Bioinformatics 2015; 31:1396-404. [PMID: 25573913 PMCID: PMC4410667 DOI: 10.1093/bioinformatics/btv006] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/05/2014] [Accepted: 01/05/2015] [Indexed: 01/02/2023]
Abstract
MOTIVATION Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim of improving the accuracy with which DNA and protein sequences are characterized. RESULTS Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure than with another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. AVAILABILITY AND IMPLEMENTATION All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. CONTACT ivan.borozan@gmail.com SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ivan Borozan
- Department of Informatics and Bio-computing, Ontario Institute for Cancer Research, MaRS Centre, South Tower, 101 College Street, Suite 800, Toronto, Ontario, Canada
| | - Stuart Watt
- Department of Informatics and Bio-computing, Ontario Institute for Cancer Research, MaRS Centre, South Tower, 101 College Street, Suite 800, Toronto, Ontario, Canada
| | - Vincent Ferretti
- Department of Informatics and Bio-computing, Ontario Institute for Cancer Research, MaRS Centre, South Tower, 101 College Street, Suite 800, Toronto, Ontario, Canada
|
41
|
Predicting nucleosome positioning based on geometrically transformed Tsallis entropy. PLoS One 2014; 9:e109395. [PMID: 25380134 PMCID: PMC4224380 DOI: 10.1371/journal.pone.0109395] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2014] [Accepted: 08/26/2014] [Indexed: 11/19/2022] Open
Abstract
As the fundamental unit of eukaryotic chromatin structure, the nucleosome plays critical roles in gene expression and regulation by controlling physical access to transcription factors. In this paper, based on the geometrically transformed Tsallis entropy and two index-vectors, a valid nucleosome positioning information model is developed to describe the distribution of A/T-rich and G/C-rich dimeric and trimeric motifs along the DNA duplex. When used to train a support vector machine, the model achieves high AUCs across five organisms, significantly outperforming previous studies. In addition, we adopt the concept of relative distance to describe the probability that an arbitrary DNA sequence is covered by a nucleosome. From this, the average nucleosome occupancy profile over the S. cerevisiae genome is calculated. With our peak detection model, the isolated nucleosomes along the genome sequence are located. Comparison with published results shows that our model is effective for nucleosome positioning. The index-vector component is identified as an important factor influencing nucleosome organization.
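The abstract above does not define its geometric transformation or its index-vectors, so they are not reproduced here. As a minimal sketch of the underlying quantity only, the standard Tsallis entropy, S_q = (1 - sum_i p_i^q) / (q - 1), can be computed over the dimer frequencies of a sequence. The function names, the choice q = 2, and the use of overlapping dimers are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def tsallis_entropy(probs, q=2.0):
    """Standard Tsallis entropy of a discrete distribution.

    As q approaches 1 it reduces to the Shannon entropy (in nats).
    """
    probs = [p for p in probs if p > 0]
    if not probs:
        return 0.0
    if math.isclose(q, 1.0):
        return -sum(p * math.log(p) for p in probs)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def dimer_frequencies(seq):
    """Relative frequencies of overlapping dimers (assumed encoding)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts.values())
    return [c / total for c in counts.values()] if total else []


# Hypothetical usage: entropy of the dimer distribution of a short sequence.
entropy = tsallis_entropy(dimer_frequencies("ACAC"), q=2.0)
```

A sequence made of one repeated dimer gives an entropy of zero, and more varied dimer content gives larger values.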
|
42
|
Kaer K, Speek M. Intronic retroelements: Not just "speed bumps" for RNA polymerase II. Mob Genet Elements 2014; 2:154-157. [PMID: 23061024 PMCID: PMC3463474 DOI: 10.4161/mge.20774] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022] Open
Abstract
Two well-known retroelements, L1 and Alu, comprise about one third of the human genome and are nearly equally distributed between the intergenic and intragenic regions. They carry different regulatory elements and contribute structurally and functionally to the expression of our genes. Recent data also suggest that hundreds of intronic L1s and Alus interfere with the transcription of human genes by inducing intron retention, forcing exonization and cryptic polyadenylation. These novel features can be explained with the RNA polymerase kinetic model and suggest that intronic L1s and Alus are not just "speed bumps" in regulation of RNA polymerase traffic. Here we discuss the complexity of the regulation of gene transcription imposed by intronic retroelements and predict that in addition to transcriptional activity, transcription factor binding and nucleosomal occupancy play a significant role in the transcriptional interference effects of the host genes.
Affiliation(s)
- Kristel Kaer
- Department of Gene Technology; Tallinn University of Technology; Tallinn, Estonia
|
43
|
Cui F, Chen L, LoVerso PR, Zhurkin VB. Prediction of nucleosome rotational positioning in yeast and human genomes based on sequence-dependent DNA anisotropy. BMC Bioinformatics 2014; 15:313. [PMID: 25244936 PMCID: PMC4261538 DOI: 10.1186/1471-2105-15-313] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2014] [Accepted: 08/29/2014] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND An organism's DNA sequence is one of the key factors guiding the positioning of nucleosomes within a cell's nucleus. Sequence-dependent bending anisotropy dictates how DNA is wrapped around a histone octamer. One of the best established sequence patterns consistent with this anisotropy is the periodic occurrence of AT-containing dinucleotides (WW) and GC-containing dinucleotides (SS) in the nucleosomal locations where DNA is bent in the minor and major grooves, respectively. Although this simple pattern has been observed in nucleosomes across eukaryotic genomes, its use for prediction of nucleosome positioning was not systematically tested. RESULTS We present a simple computational model, termed the W/S scheme, implementing this pattern, without using any training data. This model accurately predicts the rotational positioning of nucleosomes both in vitro and in vivo, in yeast and human genomes. About 65 - 75% of the experimentally observed nucleosome positions are predicted with the precision of one to two base pairs. The program is freely available at http://people.rit.edu/fxcsbi/WS_scheme/. We also introduce a simple and efficient way to compare the performance of different models predicting the rotational positioning of nucleosomes. CONCLUSIONS This paper presents the W/S scheme to achieve accurate prediction of rotational positioning of nucleosomes, solely based on the sequence-dependent anisotropic bending of nucleosomal DNA. This method successfully captures DNA features critical for the rotational positioning of nucleosomes, and can be further improved by incorporating additional terms related to the translational positioning of nucleosomes in a species-specific manner.
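The pattern described in this abstract, WW dinucleotides at minor-groove bending sites and SS dinucleotides at major-groove bending sites, can be illustrated with a toy scoring function. This is a hedged sketch, not the published W/S scheme: the scoring rule (+1 for a matching dinucleotide, -1 for the opposite class) and the site positions passed by the caller are hypothetical, since the abstract does not give them.

```python
W_BASES = set("AT")  # "weak" bases; WW = both bases of the dinucleotide in {A, T}
S_BASES = set("GC")  # "strong" bases; SS = both bases of the dinucleotide in {G, C}


def dinucleotide_class(dinuc):
    """Return 'WW', 'SS', or None for mixed or ambiguous dinucleotides."""
    if len(dinuc) != 2:
        return None
    if set(dinuc) <= W_BASES:
        return "WW"
    if set(dinuc) <= S_BASES:
        return "SS"
    return None


def ws_score(window, minor_sites, major_sites):
    """Toy score: reward WW at minor-groove sites and SS at major-groove sites.

    minor_sites and major_sites are 0-based dinucleotide start positions;
    both lists are hypothetical inputs supplied by the caller.
    """
    window = window.upper()
    score = 0
    for sites, favored, disfavored in (
        (minor_sites, "WW", "SS"),
        (major_sites, "SS", "WW"),
    ):
        for pos in sites:
            cls = dinucleotide_class(window[pos:pos + 2])
            if cls == favored:
                score += 1
            elif cls == disfavored:
                score -= 1
    return score
```

Sliding such a score along a sequence and picking the best-scoring offset is one simple way to compare candidate rotational settings, under the stated assumptions.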
Affiliation(s)
- Feng Cui
- Thomas H. Gosnell School of Life Sciences, Rochester Institute of Technology, Rochester, NY 14623, USA.
|
44
|
Babarinde IA, Saitou N. Heterogeneous tempo and mode of conserved noncoding sequence evolution among four mammalian orders. Genome Biol Evol 2014; 5:2330-43. [PMID: 24259317 PMCID: PMC3879966 DOI: 10.1093/gbe/evt177] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022] Open
Abstract
Conserved noncoding sequences (CNSs) of vertebrates are considered to be closely linked with protein-coding gene regulatory functions. We examined the abundance and genomic distribution of CNSs in four mammalian orders: primates, rodents, carnivores, and cetartiodactyls. We defined the two thresholds for CNS using conservation level of coding genes; using all the three coding positions and using only first and second codon positions. The abundance of CNSs varied among lineages, with primates and rodents having highest and lowest number of CNSs, respectively, whereas carnivores and cetartiodactyls had intermediate values. These CNSs cover 1.3-5.5% of the mammalian genomes and have signatures of selective constraints that are stronger in more ancestral than the recent ones. Evolution of new CNSs as well as retention of ancestral CNSs contribute to the differences in abundance. The genomic distribution of CNSs is dynamic with higher proportions of rodent and primate CNSs located in the introns compared with carnivores and cetartiodactyls. In fact, 19% of orthologous single-copy CNSs between human and dog are located in different genomic regions. If CNSs can be considered as candidates of gene expression regulatory sequences, heterogeneity of CNSs among the four mammalian orders may have played an important role in creating the order-specific phenotypes. Fewer CNSs in rodents suggest that rodent diversity is related to lower regulatory conservation. With CNSs shown to cluster around genes involved in nervous systems and the higher number of primate CNSs, our result suggests that CNSs may be involved in the higher complexity of the primate nervous system.
Affiliation(s)
- Isaac Adeyemi Babarinde
- Department of Genetics, School of Life Science, The Graduate University for Advanced Studies (SOKENDAI), Mishima Japan
|
45
|
Zheng Y, Li X, Hu H. Computational discovery of feature patterns in nucleosomal DNA sequences. Genomics 2014; 104:87-95. [PMID: 25063528 DOI: 10.1016/j.ygeno.2014.07.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2013] [Revised: 04/18/2014] [Accepted: 07/15/2014] [Indexed: 11/27/2022]
Abstract
The identification of important factors that affect nucleosome formation is critical to clarify nucleosome-forming mechanisms and the role of the nucleosome in gene regulation. Various features reported in the literature led to our hypothesis that multiple features can together contribute to nucleosome formation. Therefore, we compiled 779 features and developed a pattern discovery and scoring algorithm FFN (Finding Features for Nucleosomes) to identify feature patterns that are differentially enriched in nucleosome-forming sequences and nucleosome-depletion sequences. Applying FFN to genome-wide nucleosome occupancy data in yeast and human, we identified statistically significant feature patterns that may influence nucleosome formation, many of which are common to the two species. We found that both sequence and structural features are important in nucleosome occupancy prediction. We discovered that, even for the same feature combinations, variations in feature values may lead to differences in predictive power. We demonstrated that the identified feature patterns could be used to assist nucleosomal sequence prediction.
Affiliation(s)
- Yiyu Zheng
- Department of Electrical Engineering and Computer Science, University Of Central Florida, Orlando, FL 32816, USA
| | - Xiaoman Li
- Department of Electrical Engineering and Computer Science, University Of Central Florida, Orlando, FL 32816, USA; Burnett School of Biomedical Science, University Of Central Florida, Orlando, FL 32816, USA
| | - Haiyan Hu
- Department of Electrical Engineering and Computer Science, University Of Central Florida, Orlando, FL 32816, USA.
|
46
|
Zare H, Khodursky A, Sartorelli V. An evolutionarily biased distribution of miRNA sites toward regulatory genes with high promoter-driven intrinsic transcriptional noise. BMC Evol Biol 2014; 14:74. [PMID: 24707827 PMCID: PMC4031498 DOI: 10.1186/1471-2148-14-74] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2013] [Accepted: 03/24/2014] [Indexed: 12/21/2022] Open
Abstract
Background miRNAs are a major class of regulators of gene expression in metazoans. By targeting cognate mRNAs, miRNAs are involved in regulating most, if not all, biological processes in different cell and tissue types. To better understand how this regulatory potential is allocated among different target gene sets, we carried out a detailed and systematic analysis of miRNA target site distribution in the mouse genome. Results We used predicted conserved and non-conserved sites for 779 miRNAs in the 3′ UTRs of 18440 genes downloaded from the TargetScan website. Our analysis reveals that 3′ UTRs of genes encoding regulatory proteins harbor a significantly greater number of miRNA sites than those of non-regulatory (housekeeping and structural) genes. Analysis of miRNA sites for orthologous 3′ UTRs in 10 other species indicates that the regulatory genes were maintaining or accruing miRNA sites while non-regulatory genes gradually shed them in the course of evolution. Furthermore, we observed that 3′ UTRs of genes with higher gene expression variability driven by their promoter sequence content are targeted by many more distinct miRNAs compared to genes with low transcriptional noise. Conclusions Based on our results we envision a model, which we dubbed “selective inclusion”, whereby non-regulatory genes with low transcription noise and stable expression profiles lost their sites, while regulatory genes which endure higher transcription noise retained and gained new sites. This adaptation is consistent with the requirement that regulatory genes be tightly controlled in order to maintain the precise and optimal protein levels needed to function properly.
Affiliation(s)
- Hossein Zare
- Laboratory of Muscle Stem Cells and Gene Regulation, National Institute of Arthritis, Musculoskeletal and Skin Diseases, National Institutes of Health, 50 South Drive, Bethesda, MD 20892, USA.
|
47
|
Histone variants and epigenetic inheritance. BIOCHIMICA ET BIOPHYSICA ACTA-GENE REGULATORY MECHANISMS 2014; 1819:222-229. [PMID: 24459724 DOI: 10.1016/j.bbagrm.2011.06.007] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Nucleosome particles, which are composed of core histones and DNA, are the basic unit of eukaryotic chromatin. Histone modifications and histone composition determine the structure and function of the chromatin; this genome packaging, often referred to as "epigenetic information", provides additional information beyond the underlying genomic sequence. The epigenetic information must be transmitted from mother cells to daughter cells during mitotic division to maintain the cell lineage identity and proper gene expression. However, the mechanisms responsible for mitotic epigenetic inheritance remain largely unknown. In this review, we focus on recent studies regarding histone variants and discuss the assembly pathways that may contribute to epigenetic inheritance. This article is part of a Special Issue entitled: Histone chaperones and Chromatin assembly.
|
48
|
Guo SH, Deng EZ, Xu LQ, Ding H, Lin H, Chen W, Chou KC. iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics 2014; 30:1522-9. [PMID: 24504871 DOI: 10.1093/bioinformatics/btu083] [Citation(s) in RCA: 312] [Impact Index Per Article: 31.2] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022]
Abstract
MOTIVATION Nucleosome positioning participates in many cellular activities and plays significant roles in regulating cellular processes. With the avalanche of genome sequences generated in the post-genomic age, it is highly desired to develop automated methods for rapidly and effectively identifying nucleosome positioning. Although some computational methods were proposed, most of them were species specific and neglected the intrinsic local structural properties that might play important roles in determining the nucleosome positioning on a DNA sequence. RESULTS Here a predictor called 'iNuc-PseKNC' was developed for predicting nucleosome positioning in Homo sapiens, Caenorhabditis elegans and Drosophila melanogaster genomes, respectively. In the new predictor, the samples of DNA sequences were formulated by a novel feature-vector called 'pseudo k-tuple nucleotide composition', into which six DNA local structural properties were incorporated. It was observed by the rigorous cross-validation tests on the three stringent benchmark datasets that the overall success rates achieved by iNuc-PseKNC in predicting the nucleosome positioning of the aforementioned three genomes were 86.27%, 86.90% and 79.97%, respectively. Meanwhile, the results obtained by iNuc-PseKNC on various benchmark datasets used by the previous investigators for different genomes also indicated that the current predictor remarkably outperformed its counterparts. AVAILABILITY A user-friendly web-server, iNuc-PseKNC is freely accessible at http://lin.uestc.edu.cn/server/iNuc-PseKNC.
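The feature vector named in this abstract starts from the plain k-tuple nucleotide composition, that is, the normalized frequency of every k-mer in a sequence. The sketch below covers only that base component. The pseudo components built from the six DNA local structural properties are not specified in the abstract and are deliberately left out, and the function name and fixed A/C/G/T ordering are assumptions made for illustration.

```python
from itertools import product


def ktuple_composition(seq, k=2):
    """Normalized k-tuple (k-mer) frequencies in a fixed A/C/G/T order.

    Windows containing characters other than A, C, G, T are skipped.
    Returns a list of 4**k floats that sum to 1 (or all zeros if no
    valid k-mer is found).
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Hypothetical usage: a 16-dimensional dinucleotide composition vector.
features = ktuple_composition("ACGTACGT", k=2)
```

A vector like this could be fed to any standard classifier; the published predictor additionally appends sequence-order terms derived from structural properties, which this sketch does not attempt.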
Affiliation(s)
- Shou-Hui Guo
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China, Gordon Life Science Institute, Belmont, Massachusetts, USA, Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
| | - En-Ze Deng
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China, Gordon Life Science Institute, Belmont, Massachusetts, USA, Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
| | - Li-Qin Xu
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China, Gordon Life Science Institute, Belmont, Massachusetts, USA, Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China, Gordon Life Science Institute, Belmont, Massachusetts, USA, Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China, Gordon Life Science Institute, Belmont, Massachusetts, USA, Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
| | - Wei Chen
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China, Gordon Life Science Institute, Belmont, Massachusetts, USA, Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
| | - Kuo-Chen Chou
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China, Gordon Life Science Institute, Belmont, Massachusetts, USA, Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
|
49
|
Sexton BS, Avey D, Druliner BR, Fincher JA, Vera DL, Grau DJ, Borowsky ML, Gupta S, Girimurugan SB, Chicken E, Zhang J, Noble WS, Zhu F, Kingston RE, Dennis JH. The spring-loaded genome: nucleosome redistributions are widespread, transient, and DNA-directed. Genome Res 2013; 24:251-9. [PMID: 24310001 PMCID: PMC3912415 DOI: 10.1101/gr.160150.113] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
Abstract
Nucleosome occupancy plays a key role in regulating access to eukaryotic genomes. Although various chromatin regulatory complexes are known to regulate nucleosome occupancy, the role of DNA sequence in this regulation remains unclear, particularly in mammals. To address this problem, we measured nucleosome distribution at high temporal resolution in human cells at hundreds of genes during the reactivation of Kaposi's sarcoma–associated herpesvirus (KSHV). We show that nucleosome redistribution peaks at 24 h post-KSHV reactivation and that the nucleosomal redistributions are widespread and transient. To clarify the role of DNA sequence in these nucleosomal redistributions, we compared the genes with altered nucleosome distribution to a sequence-based computer model and in vitro–assembled nucleosomes. We demonstrate that both the predicted model and the assembled nucleosome distributions are concordant with the majority of nucleosome redistributions at 24 h post-KSHV reactivation. We suggest a model in which loci are held in an unfavorable chromatin architecture and “spring” to a transient intermediate state directed by DNA sequence information. We propose that DNA sequence plays a more considerable role in the regulation of nucleosome positions than was previously appreciated. The surprising findings that nucleosome redistributions are widespread, transient, and DNA-directed shift the current perspective regarding regulation of nucleosome distribution in humans.
Affiliation(s)
- Brittany S Sexton
- Department of Biological Science, The Florida State University, Tallahassee, Florida 32306-4295, USA
|
50
|
Nalabothula N, Xi L, Bhattacharyya S, Widom J, Wang JP, Reeve JN, Santangelo TJ, Fondufe-Mittendorf YN. Archaeal nucleosome positioning in vivo and in vitro is directed by primary sequence motifs. BMC Genomics 2013; 14:391. [PMID: 23758892 PMCID: PMC3691661 DOI: 10.1186/1471-2164-14-391] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2013] [Accepted: 05/31/2013] [Indexed: 02/03/2023] Open
Abstract
Background Histone wrapping of DNA into nucleosomes almost certainly evolved in the Archaea, and predates Eukaryotes. In Eukaryotes, nucleosome positioning plays a central role in regulating gene expression and is directed by primary sequence motifs that together form a nucleosome positioning code. The experiments reported were undertaken to determine if archaeal histone assembly conforms to the nucleosome positioning code. Results Eukaryotic nucleosome positioning is favored and directed by phased helical repeats of AA/TT/AT/TA and CC/GG/CG/GC dinucleotides, and disfavored by longer AT-rich oligonucleotides. Deep sequencing of genomic DNA protected from micrococcal nuclease digestion by assembly into archaeal nucleosomes has established that archaeal nucleosome assembly is also directed and positioned by these sequence motifs, both in vivo in Methanothermobacter thermautotrophicus and Thermococcus kodakarensis and in vitro in reaction mixtures containing only one purified archaeal histone and genomic DNA. Archaeal nucleosomes assembled at the same locations in vivo and in vitro, with much reduced assembly immediately upstream of open reading frames and throughout the ribosomal rDNA operons. Providing further support for a common positioning code, archaeal histones assembled into nucleosomes on eukaryotic DNA and eukaryotic histones into nucleosomes on archaeal DNA at the same locations. T. kodakarensis has two histones, designated HTkA and HTkB, and strains with either but not both histones deleted grow normally but do exhibit transcriptome differences. Comparisons of the archaeal nucleosome profiles in the intergenic regions immediately upstream of genes that exhibited increased or decreased transcription in the absence of HTkA or HTkB revealed substantial differences but no consistent pattern of changes that would correlate directly with archaeal nucleosome positioning inhibiting or stimulating transcription. 
Conclusions The results obtained establish that an archaeal histone and a genome sequence together are sufficient to determine where archaeal nucleosomes preferentially assemble and where they avoid assembly. We confirm that the same nucleosome positioning code operates in Archaea as in Eukaryotes and presumably therefore evolved with the histone-fold mechanism of DNA binding and compaction early in the archaeal lineage, before the divergence of Eukaryotes.
Affiliation(s)
- Narasimharao Nalabothula
- Department of Molecular and Cellular Biochemistry, College of Medicine, University of Kentucky, Lexington, KY 40536, USA
|