1
Gollapalli P, Rudrappa S, Kumar V, Santosh Kumar HS. Domain Architecture Based Methods for Comparative Functional Genomics Toward Therapeutic Drug Target Discovery. J Mol Evol 2023; 91:598-615. [PMID: 37626222 DOI: 10.1007/s00239-023-10129-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2022] [Accepted: 08/06/2023] [Indexed: 08/27/2023]
Abstract
New genes arise when existing genes duplicate, mutate, recombine, fuse, or undergo fission, or when genes form de novo, and novel functions emerge from these events during evolution. Researchers have tried to quantify the causes of these molecular diversification processes to understand how genes increase molecular complexity over time, for instance in protein domain organization. In contrast to global sequence similarity, protein domain architectures can capture key structural and functional characteristics, making them better proxies for describing functional equivalence. In both prokaryotes and eukaryotes, domain architectures have been shown to be retained over significant evolutionary distances. Protein domain architectures are now used to categorize and distinguish evolutionarily related proteins and to find homologs among evolutionarily distant species. Additionally, structural information stored in domain structures has accelerated homology identification and sequence search methods. Tools for functional protein annotation have been developed to identify protein domain content, domain order, domain recurrence, and domain position, all of which contribute to the accuracy of protein function prediction. This review summarises facts and speculations regarding the use of protein domain architecture and modularity to identify possible therapeutic targets among cellular activities, based on an understanding of their linked biological processes.
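As a rough illustration of how domain content, domain recurrence, and domain order could each be compared between two proteins, the sketch below scores two architectures given as ordered lists of domain names. The function names, the example domain names, and the choice of Jaccard index and longest common subsequence are assumptions made for illustration; they are not methods taken from the review.

```python
from collections import Counter


def content_similarity(a, b):
    """Jaccard index over the sets of domains (ignores order and copy number)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def recurrence_similarity(a, b):
    """Multiset Jaccard index: shared domain copies over total domain copies."""
    ca, cb = Counter(a), Counter(b)
    shared = sum((ca & cb).values())  # per-domain minimum of copy numbers
    total = sum((ca | cb).values())   # per-domain maximum of copy numbers
    return shared / total if total else 1.0


def order_similarity(a, b):
    """Longest common subsequence length divided by the longer architecture."""
    if not a and not b:
        return 1.0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b))


# Hypothetical architectures, N-terminus to C-terminus.
p1 = ["SH3", "SH2", "Pkinase"]
p2 = ["SH2", "SH3", "Pkinase"]
print(content_similarity(p1, p2), order_similarity(p1, p2))
```

Here the two proteins share identical domain content but differ in order, so the content score is 1.0 while the order score is lower. Keeping the three scores separate, rather than merging them into one number, makes it visible which aspect of the architecture diverges.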
Affiliation(s)
- Pavan Gollapalli
  - Center for Bioinformatics and Biostatistics, Nitte (Deemed to be University), Mangalore, Karnataka, 575018, India
- Sushmitha Rudrappa
  - Department of Biotechnology and Bioinformatics, Jnana Sahyadri Campus, Kuvempu University, Shankaraghatta, Shivamogga, Karnataka, 577451, India
- Vadlapudi Kumar
  - Department of Biochemistry, Davangere University, Shivagangothri, Davangere, Karnataka, 577007, India
- Hulikal Shivashankara Santosh Kumar
  - Department of Biotechnology and Bioinformatics, Jnana Sahyadri Campus, Kuvempu University, Shankaraghatta, Shivamogga, Karnataka, 577451, India.
2
Mining semantic information of co-word network to improve link prediction performance. Scientometrics 2022. [DOI: 10.1007/s11192-021-04247-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
3
Sikander R, Wang Y, Ghulam A, Wu X. Identification of Enzymes-specific Protein Domain Based on DDE, and Convolutional Neural Network. Front Genet 2021; 12:759384. [PMID: 34917128 PMCID: PMC8670239 DOI: 10.3389/fgene.2021.759384] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/16/2021] [Accepted: 10/25/2021] [Indexed: 11/21/2022]
Abstract
Predicting enzymes and non-enzymes from protein sequence information is an important but very challenging task. Existing methods use only protein geometric structures or only protein sequences to predict enzymatic functions, and their prediction results are therefore unsatisfactory. In this paper, we propose a novel approach for classifying amino acid sequences as enzymes or non-enzymes via a Convolutional Neural Network (CNN). In the CNN, the roles of enzymes are predicted from multiple kinds of biological information, including information on sequences and structures. We propose the use of two-dimensional data via a 2D CNN to classify proteins as enzymes or non-enzymes, using the same fivefold cross-validation procedure. We also use an independent dataset to test the performance of our model, and the results demonstrate that we are able to solve the overfitting problem. We evaluated the proposed CNN model with different numbers of filters (32, 64, and 128), using the fivefold cross-validation test set as the independent classification. Via the Dipeptide Deviation from Expected Mean (DDE) matrix, mutation information is extracted from amino acid sequences, and structural information is conveyed through the distances and angles of amino acids. The derived feature maps are then encoded using DDE. The results on the independent datasets are then compared with two other methods, namely GRU and XGBoost. All analyses were conducted using 32, 64, and 128 filters on our proposed CNN method. The cross-validation datasets achieved an accuracy of 0.8762, whereas the accuracy on the independent datasets was 0.7621. The ROC AUC obtained with fivefold cross-validation was 0.95. The performance of our model and that of other models was compared in terms of sensitivity (0.9028) and specificity (0.8497). The overall accuracy of our model was 0.9133 compared with 0.8310 for the other model.
Affiliation(s)
- Rahu Sikander
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Yuping Wang
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Ali Ghulam
  - Computerization and Network Section, Sindh Agriculture University, Tando Jam, Pakistan
- Xianjuan Wu
  - School of Computer Science and Technology, Xidian University, Xi'an, China
4
YAMATO K, KATO H, KATSURAGI T, TAKAHASHI Y. The Multiple Representation of Protein Sequence Motifs Using Sequence Binary Decision Diagrams. JOURNAL OF COMPUTER CHEMISTRY-JAPAN 2020. [DOI: 10.2477/jccj.2019-0028] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/05/2022]
Affiliation(s)
- Kohei YAMATO
  - Department of Computer Science and Engineering, Toyohashi University of Technology, 1-1 Hibarigaoka, Tempaku-cho, Toyohashi, Aichi 441-8580, Japan
- Hiroaki KATO
  - Department of Distribution and Information Engineering, National Institute of Technology, Hiroshima College, 4272-1 Higashino, Osakikamijima-cho, Toyota-gun, Hiroshima 725-0231, Japan
- Tetsuo KATSURAGI
  - Department of Computer Science and Engineering, Toyohashi University of Technology, 1-1 Hibarigaoka, Tempaku-cho, Toyohashi, Aichi 441-8580, Japan
- Yoshimasa TAKAHASHI
  - Department of Computer Science and Engineering, Toyohashi University of Technology, 1-1 Hibarigaoka, Tempaku-cho, Toyohashi, Aichi 441-8580, Japan
5
Gao R, Wang M, Zhou J, Fu Y, Liang M, Guo D, Nie J. Prediction of Enzyme Function Based on Three Parallel Deep CNN and Amino Acid Mutation. Int J Mol Sci 2019; 20:E2845. [PMID: 31212665 PMCID: PMC6600291 DOI: 10.3390/ijms20112845] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 05/05/2019] [Revised: 06/03/2019] [Accepted: 06/04/2019] [Indexed: 01/28/2023]
Abstract
Over the past decade, as the number of proteins in the PDB database has gradually increased, traditional methods have struggled to characterize the function of newly discovered enzymes in chemical reactions. Computational models and protein feature representations for predicting enzymatic function have therefore become more important. Most existing methods for predicting enzymatic function use protein geometric structure or protein sequence alone. In this paper, the functions of enzymes are predicted from multiple kinds of biological information, including sequence information and structure information. First, we extract mutation information from the amino acid sequence using the position scoring matrix and express structure information through amino acid distances and angles. Then, we use histograms to represent the extracted sequence and structural features, respectively. We establish a network model of three parallel Deep Convolutional Neural Networks (DCNN) to learn three features of an enzyme simultaneously for function prediction, and the outputs are fused through two different architectures. Finally, the proposed model was evaluated on a large dataset of 43,843 enzymes from the PDB and achieved 92.34% correct classification when sequence information is considered, demonstrating an improvement compared with the previous result.
Affiliation(s)
- Ruibo Gao
  - School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Mengmeng Wang
  - School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Jiaoyan Zhou
  - School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Yuhang Fu
  - School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Meng Liang
  - School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Dongliang Guo
  - School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
- Junlan Nie
  - School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, Hebei, China.
6
Liu T, Wang Z. Reconstructing high-resolution chromosome three-dimensional structures by Hi-C complex networks. BMC Bioinformatics 2018; 19:496. [PMID: 30591009 PMCID: PMC6309071 DOI: 10.1186/s12859-018-2464-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 01/15/2023]
Abstract
BACKGROUND Hi-C data have been widely used to reconstruct chromosomal three-dimensional (3D) structures. One of the key limitations of Hi-C is the unclear relationship between spatial distance and the number of Hi-C contacts. Many methods used a fixed parameter when converting the number of Hi-C contacts to wish distances. However, a single parameter cannot properly explain the relationship between wish distances and genomic distances or the locations of topologically associating domains (TADs). RESULTS We have addressed one of the key issues of using Hi-C data, that is, the unclear relationship between spatial distances and the number of Hi-C contacts, which is crucial to understand significant biological functions, such as the enhancer-promoter interactions. Specifically, we developed a new method to infer this converting parameter and pairwise Euclidean distances based on the topology of the Hi-C complex network (HiCNet). The inferred distances were modeled by clustering coefficient and multiple other types of constraints. We found that our inferred distances between bead-pairs within the same TAD were apparently smaller than those distances between bead-pairs from different TADs. Our inferred distances had a higher correlation with fluorescence in situ hybridization (FISH) data, fitted the localization patterns of Xist transcripts on DNA, and better matched 156 pairs of protein-enabled long-range chromatin interactions detected by ChIA-PET. Using the inferred distances and another round of optimization, we further reconstructed 40 kb high-resolution 3D chromosomal structures of mouse male ES cells. The high-resolution structures successfully illustrate TADs and DNA loops (peaks in Hi-C contact heatmaps) that usually indicate enhancer-promoter interactions. CONCLUSIONS We developed a novel method to infer the wish distances between DNA bead-pairs from Hi-C contacts. 
High-resolution 3D structures of chromosomes were built based on the newly-inferred wish distances. This whole process has been implemented as a tool named HiCNet, which is publicly available at http://dna.cs.miami.edu/HiCNet/ .
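To make the role of the converting parameter concrete, the sketch below uses a power-law conversion, d = 1 / c^alpha, which is a widely used convention in Hi-C structure modelling and is assumed here. It is not HiCNet's inference procedure, and the function names and the example contact counts are hypothetical.

```python
import math


def wish_distance(contacts, alpha):
    """Convert a Hi-C contact count into a wish distance as 1 / contacts**alpha.

    Pairs with no observed contacts get an infinite wish distance, meaning
    that no distance restraint can be derived for them.
    """
    if contacts <= 0:
        return math.inf
    return 1.0 / (contacts ** alpha)


def wish_distance_matrix(contact_matrix, alpha):
    """Apply the conversion to every off-diagonal entry of a square matrix."""
    n = len(contact_matrix)
    return [
        [0.0 if i == j else wish_distance(contact_matrix[i][j], alpha) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical contact counts between three beads.
contacts = [
    [0, 100, 4],
    [100, 0, 25],
    [4, 25, 0],
]
for alpha in (0.5, 1.0):
    print(alpha, wish_distance_matrix(contacts, alpha)[0])
```

Running the loop with two values of alpha shows how strongly the resulting distances depend on that single parameter, which is the sensitivity the abstract identifies as a limitation of fixed-parameter methods.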
Affiliation(s)
- Tong Liu
  - Department of Computer Science, University of Miami, 1365 Memorial Drive, Coral Gables, FL, 33124, USA
- Zheng Wang
  - Department of Computer Science, University of Miami, 1365 Memorial Drive, Coral Gables, FL, 33124, USA.
7
Co-Occurrence Network of High-Frequency Words in the Bioinformatics Literature: Structural Characteristics and Evolution. APPLIED SCIENCES-BASEL 2018. [DOI: 10.3390/app8101994] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/16/2022]
Abstract
The subjects of literature are the direct expression of the author’s research results. Mining valuable knowledge helps to save time for the readers to understand the content and direction of the literature quickly. Therefore, the co-occurrence network of high-frequency words in the bioinformatics literature and its structural characteristics and evolution were analysed in this paper. First, 242,891 articles from 47 top bioinformatics periodicals were chosen as the object of the study. Second, the co-occurrence relationship among high-frequency words of these articles was analysed by word segmentation and high-frequency word selection. Then, a co-occurrence network of high-frequency words in bioinformatics literature was built. Finally, the conclusions were drawn by analysing its structural characteristics and evolution. The results showed that the co-occurrence network of high-frequency words in the bioinformatics literature was a small-world network with scale-free distribution, rich-club phenomenon and disassortative matching characteristics. At the same time, the high-frequency words used by authors changed little in 2–3 years but varied greatly in four years because of the influence of the state-of-the-art technology.
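The network-building step described above can be sketched as follows: keep only words that appear in enough articles, then count how often each pair of kept words appears in the same article. The function name, the document-frequency threshold, and the toy word lists are assumptions for illustration; the paper's own word segmentation and selection criteria are not reproduced here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(documents, min_doc_freq=2):
    """Return a Counter mapping word pairs to the number of shared documents.

    documents: list of word lists, one per article (already segmented).
    min_doc_freq: a word is "high-frequency" if it occurs in at least this
    many documents (an assumed criterion).
    """
    doc_freq = Counter(word for doc in documents for word in set(doc))
    frequent = {word for word, n in doc_freq.items() if n >= min_doc_freq}

    edges = Counter()
    for doc in documents:
        kept = sorted(set(doc) & frequent)  # sorted so each pair has one key
        for u, v in combinations(kept, 2):
            edges[(u, v)] += 1
    return edges


# Toy corpus of three already-segmented articles.
docs = [
    ["protein", "network", "domain"],
    ["protein", "network"],
    ["protein", "genome"],
]
print(cooccurrence_edges(docs))
```

The edge weights returned here could be loaded into any graph library to study properties such as degree distribution or clustering, which is the kind of structural analysis the abstract reports.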
8
Keel BN, Deng B, Moriyama EN. MOCASSIN-prot: a multi-objective clustering approach for protein similarity networks. Bioinformatics 2018; 34:1270-1277. [PMID: 29186344 DOI: 10.1093/bioinformatics/btx755] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/31/2016] [Accepted: 11/23/2017] [Indexed: 11/14/2022]
Abstract
Motivation Proteins often include multiple conserved domains. Various evolutionary events including duplication and loss of domains, domain shuffling, as well as sequence divergence contribute to generating complexities in protein structures, and consequently, in their functions. The evolutionary history of proteins is hence best modeled through networks that incorporate information both from the sequence divergence and the domain content. Here, a game-theoretic approach proposed for protein network construction is adapted into the framework of multi-objective optimization, and extended to incorporate clustering refinement procedure. Results The new method, MOCASSIN-prot, was applied to cluster multi-domain proteins from ten genomes. The performance of MOCASSIN-prot was compared against two protein clustering methods, Markov clustering (TRIBE-MCL) and spectral clustering (SCPS). We showed that compared to these two methods, MOCASSIN-prot, which uses both domain composition and quantitative sequence similarity information, generates fewer false positives. It achieves more functionally coherent protein clusters and better differentiates protein families. Availability and implementation MOCASSIN-prot, implemented in Perl and Matlab, is freely available at http://bioinfolab.unl.edu/emlab/MOCASSINprot. Contact emoriyama2@unl.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Brittney N Keel
  - USDA, ARS, U.S. Meat Animal Research Center, Clay Center, NE 68933, USA
  - Department of Mathematics, University of Nebraska-Lincoln, Lincoln, NE 68588, USA
- Bo Deng
  - Department of Mathematics, University of Nebraska-Lincoln, Lincoln, NE 68588, USA
- Etsuko N Moriyama
  - School of Biological Sciences and Center for Plant Science Innovation, University of Nebraska-Lincoln, Lincoln, NE 68588, USA
9
Hansen BO, Meyer EH, Ferrari C, Vaid N, Movahedi S, Vandepoele K, Nikoloski Z, Mutwil M. Ensemble gene function prediction database reveals genes important for complex I formation in Arabidopsis thaliana. THE NEW PHYTOLOGIST 2018; 217:1521-1534. [PMID: 29205376 DOI: 10.1111/nph.14921] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 09/08/2017] [Accepted: 10/24/2017] [Indexed: 05/25/2023]
Abstract
Recent advances in gene function prediction rely on ensemble approaches that integrate results from multiple inference methods to produce superior predictions. Yet, these developments remain largely unexplored in plants. We have explored and compared two methods to integrate 10 gene co-function networks for Arabidopsis thaliana and demonstrate how the integration of these networks produces more accurate gene function predictions for a larger fraction of genes with unknown function. These predictions were used to identify genes involved in mitochondrial complex I formation, and for five of them, we confirmed the predictions experimentally. The ensemble predictions are provided as a user-friendly online database, EnsembleNet. The methods presented here demonstrate that ensemble gene function prediction is a powerful method to boost prediction performance, whereas the EnsembleNet database provides a cutting-edge community tool to guide experimentalists.
Affiliation(s)
- Bjoern Oest Hansen
  - Max Planck Institute of Molecular Plant Physiology, Am Muehlenberg 1, Potsdam, 14476, Germany
  - Institut für Medizinische Informatik, Universitätsmedizin Göttingen, Robert-Koch-Str. 40, Göttingen, 37075, Germany
- Etienne H Meyer
  - Max Planck Institute of Molecular Plant Physiology, Am Muehlenberg 1, Potsdam, 14476, Germany
- Camilla Ferrari
  - Max Planck Institute of Molecular Plant Physiology, Am Muehlenberg 1, Potsdam, 14476, Germany
- Neha Vaid
  - Max Planck Institute of Molecular Plant Physiology, Am Muehlenberg 1, Potsdam, 14476, Germany
- Sara Movahedi
  - Department of Plant Biotechnology and Bioinformatics, VIB Center for Plant Systems Biology, Ghent University, Technologiepark 927, Gent, B-9052, Belgium
  - Rijk Zwaan Breeding BV, Burgemeester Crezéelaan 40, PO Box 40, De Lier, 2678 ZG, the Netherlands
- Klaas Vandepoele
  - Department of Plant Biotechnology and Bioinformatics, VIB Center for Plant Systems Biology, Ghent University, Technologiepark 927, Gent, B-9052, Belgium
- Zoran Nikoloski
  - Max Planck Institute of Molecular Plant Physiology, Am Muehlenberg 1, Potsdam, 14476, Germany
  - Bioinformatics Group, Institute of Biochemistry and Biology, University of Potsdam, Karl-Liebknecht-Str. 24-25, Potsdam-Golm, 14476, Germany
- Marek Mutwil
  - Max Planck Institute of Molecular Plant Physiology, Am Muehlenberg 1, Potsdam, 14476, Germany
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
10
Wang Z, Zhao C, Wang Y, Sun Z, Wang N. PANDA: Protein function prediction using domain architecture and affinity propagation. Sci Rep 2018; 8:3484. [PMID: 29472600 PMCID: PMC5823857 DOI: 10.1038/s41598-018-21849-1] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 11/16/2017] [Accepted: 02/09/2018] [Indexed: 12/23/2022]
Abstract
We developed PANDA (Propagation of Affinity and Domain Architecture) to predict protein functions in the form of Gene Ontology (GO) terms. PANDA first runs a profile-profile alignment algorithm to search the PfamA, KOG, COG, and SwissProt databases, and then launches PSI-BLAST against UniProt for homologue search. PANDA integrates a domain architecture inference algorithm based on Bayesian statistics that calculates the probability of a protein having a GO term. All the candidate GO terms are pooled and filtered based on Z-score. The remaining GO terms are then clustered using an affinity propagation algorithm based on the GO directed acyclic graph, followed by a second round of filtering on the clusters of GO terms. We benchmarked the performance of all the baseline predictors that PANDA integrates, as well as every pooling and filtering step of PANDA. PANDA achieves better performance in terms of area under the precision-recall curve than the baseline predictors. PANDA can be accessed from http://dna.cs.miami.edu/PANDA/ .
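The Z-score filtering of pooled candidate GO terms might look roughly like the sketch below. The function name, the threshold value, the use of the population standard deviation, and the example scores are all assumptions for illustration; PANDA's actual scoring and cutoffs are not given in the abstract.

```python
from statistics import mean, pstdev


def zscore_filter(scores, threshold=1.0):
    """Keep candidate terms whose score lies at least `threshold` standard
    deviations above the mean of all pooled scores.

    scores: dict mapping a GO term identifier to its pooled score.
    """
    values = list(scores.values())
    if len(values) < 2:
        return dict(scores)  # too few candidates to standardize
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return dict(scores)  # all scores equal, nothing stands out
    return {term: s for term, s in scores.items() if (s - mu) / sigma >= threshold}


# Made-up pooled scores for four candidate GO terms.
pooled = {
    "GO:0003824": 0.9,
    "GO:0005515": 0.2,
    "GO:0016301": 0.1,
    "GO:0008152": 0.2,
}
print(zscore_filter(pooled))
```

Standardizing before thresholding lets the same cutoff be applied to pools whose raw scores sit on different scales, which is one plausible reason to filter on Z-score rather than on the raw values.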
Affiliation(s)
- Zheng Wang
  - Department of Computer Science, University of Miami, 1364 Memorial Drive, P.O. Box 248154, Coral Gables, FL, 33124, USA.
- Chenguang Zhao
  - School of Computing, University of Southern Mississippi, 118 College Drive #5106, Hattiesburg, MS, 39406, USA
- Yiheng Wang
  - School of Computing, University of Southern Mississippi, 118 College Drive #5106, Hattiesburg, MS, 39406, USA
- Zheng Sun
  - Department of Mathematics and Computer Science, The Citadel, 171 Moulrie Street, Charleston, SC, 29409, USA
- Nan Wang
  - Department of Computer Science, New Jersey City University, 2039 Kennedy Blvd, Jersey City, NJ, 07305, USA
11
ProLanGO: Protein Function Prediction Using Neural Machine Translation Based on a Recurrent Neural Network. Molecules 2017; 22:1732. [PMID: 29039790 PMCID: PMC6151571 DOI: 10.3390/molecules22101732] [Citation(s) in RCA: 114] [Impact Index Per Article: 16.3] [Received: 08/30/2017] [Revised: 10/11/2017] [Accepted: 10/11/2017] [Indexed: 11/25/2022]
Abstract
With the development of next-generation sequencing techniques, it is fast and cheap to determine protein sequences but relatively slow and expensive to extract useful information from them because of the limitations of traditional biological experimental techniques. Protein function prediction has been a long-standing challenge in closing the gap between the huge number of protein sequences and the known functions. In this paper, we propose a novel method that converts the protein function prediction problem into a language translation problem, from the newly proposed protein sequence language “ProLan” to the protein function language “GOLan”, and we build a neural machine translation model based on recurrent neural networks to translate “ProLan” into “GOLan”. We blindly tested our method by participating in the third Critical Assessment of Function Annotation (CAFA 3) in 2016, and we also evaluated its performance on selected proteins whose function was released after the CAFA competition. The good performance on the training and testing datasets demonstrates that the proposed method is a promising direction for protein function prediction. In summary, we propose for the first time a method that converts the protein function prediction problem into a language translation problem and applies a neural machine translation model to protein function prediction.
12
Cao R, Cheng J. Integrated protein function prediction by mining function associations, sequences, and protein-protein and gene-gene interaction networks. Methods 2016; 93:84-91. [PMID: 26370280 PMCID: PMC4894840 DOI: 10.1016/j.ymeth.2015.09.011] [Citation(s) in RCA: 66] [Impact Index Per Article: 8.3] [Received: 03/31/2015] [Revised: 09/03/2015] [Accepted: 09/10/2015] [Indexed: 11/30/2022]
Abstract
MOTIVATIONS Protein function prediction is an important and challenging problem in bioinformatics and computational biology. Functionally relevant biological information such as protein sequences, gene expression, and protein-protein interactions has been used mostly separately for protein function prediction. One of the major challenges is how to effectively integrate multiple sources of both traditional and new information such as spatial gene-gene interaction networks generated from chromosomal conformation data together to improve protein function prediction. RESULTS In this work, we developed three different probabilistic scores (MIS, SEQ, and NET score) to combine protein sequence, function associations, and protein-protein interaction and spatial gene-gene interaction networks for protein function prediction. The MIS score is mainly generated from homologous proteins found by PSI-BLAST search, and also association rules between Gene Ontology terms, which are learned by mining the Swiss-Prot database. The SEQ score is generated from protein sequences. The NET score is generated from protein-protein interaction and spatial gene-gene interaction networks. These three scores were combined in a new Statistical Multiple Integrative Scoring System (SMISS) to predict protein function. We tested SMISS on the data set of 2011 Critical Assessment of Function Annotation (CAFA). The method performed substantially better than three base-line methods and an advanced method based on protein profile-sequence comparison, profile-profile comparison, and domain co-occurrence networks according to the maximum F-measure.
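For readers unfamiliar with the maximum F-measure used in CAFA-style evaluation, the sketch below computes it in its usual protein-centric form: at each score threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all benchmark proteins, and the best harmonic mean across thresholds is reported. The function name, the threshold grid, and the toy data are assumptions; propagation of terms up the GO hierarchy, which real evaluations perform, is omitted.

```python
def f_max(predictions, truth, thresholds=None):
    """Protein-centric maximum F-measure.

    predictions: dict protein -> dict of term -> score in [0, 1].
    truth: dict protein -> set of true terms (proteins with no true terms
    are ignored).
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    benchmark = {p: terms for p, terms in truth.items() if terms}
    if not benchmark:
        return 0.0

    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in benchmark.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # no protein has any prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(benchmark)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example with two proteins and abstract term labels.
preds = {"P1": {"A": 0.9, "B": 0.4}, "P2": {"A": 0.6}}
truth = {"P1": {"A"}, "P2": {"A", "C"}}
print(round(f_max(preds, truth), 4))
```

In this toy case the best trade-off occurs at thresholds between 0.4 and 0.6, where the low-confidence wrong term for P1 is dropped but P2 still has a prediction.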
Affiliation(s)
- Renzhi Cao
  - Computer Science Department, Informatics Institute, University of Missouri, Columbia, MO 65211, USA
- Jianlin Cheng
  - Computer Science Department, Informatics Institute, University of Missouri, Columbia, MO 65211, USA.
13
Exploring soybean metabolic pathways based on probabilistic graphical model and knowledge-based methods. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2015; 2015:5. [PMID: 28194174 PMCID: PMC5270328 DOI: 10.1186/s13637-015-0026-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/16/2015] [Accepted: 06/09/2015] [Indexed: 12/02/2022]
Abstract
Soybean (Glycine max) is a major source of vegetable oil and protein for both animal and human consumption. The completion of soybean genome sequence led to a number of transcriptomic studies (RNA-seq), which provide a resource for gene discovery and functional analysis. Several data-driven (e.g., based on gene expression data) and knowledge-based (e.g., predictions of molecular interactions) methods have been proposed and implemented. In order to better understand gene relationships and protein interactions, we applied probabilistic graphical methods, based on Bayesian network and knowledgebase constraints using gene expression data to reconstruct soybean metabolic pathways. The results show that this method can predict new relationships between genes, improving on traditional reference pathway maps.
14
Li J, Hou J, Sun L, Wilkins JM, Lu Y, Niederhuth CE, Merideth BR, Mawhinney TP, Mossine VV, Greenlief CM, Walker JC, Folk WR, Hannink M, Lubahn DB, Birchler JA, Cheng J. From Gigabyte to Kilobyte: A Bioinformatics Protocol for Mining Large RNA-Seq Transcriptomics Data. PLoS One 2015; 10:e0125000. [PMID: 25902288 PMCID: PMC4406561 DOI: 10.1371/journal.pone.0125000] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 08/28/2014] [Accepted: 03/19/2015] [Indexed: 01/31/2023]
Abstract
RNA-Seq techniques generate hundreds of millions of short RNA reads using next-generation sequencing (NGS). These RNA reads can be mapped to reference genomes to investigate changes of gene expression but improved procedures for mining large RNA-Seq datasets to extract valuable biological knowledge are needed. RNAMiner--a multi-level bioinformatics protocol and pipeline--has been developed for such datasets. It includes five steps: Mapping RNA-Seq reads to a reference genome, calculating gene expression values, identifying differentially expressed genes, predicting gene functions, and constructing gene regulatory networks. To demonstrate its utility, we applied RNAMiner to datasets generated from Human, Mouse, Arabidopsis thaliana, and Drosophila melanogaster cells, and successfully identified differentially expressed genes, clustered them into cohesive functional groups, and constructed novel gene regulatory networks. The RNAMiner web service is available at http://calla.rnet.missouri.edu/rnaminer/index.html.
Affiliation(s)
- Jilong Li
  - Computer Science Department, University of Missouri, Columbia, Missouri, United States of America
  - MU Botanical Center, University of Missouri, Columbia, Missouri, United States of America
- Jie Hou
  - Computer Science Department, University of Missouri, Columbia, Missouri, United States of America
- Lin Sun
  - Division of Biological Sciences, University of Missouri, Columbia, Missouri, United States of America
- Yuan Lu
  - MU Botanical Center, University of Missouri, Columbia, Missouri, United States of America
  - Department of Biochemistry, University of Missouri, Columbia, Missouri, United States of America
- Chad E. Niederhuth
  - Division of Biological Sciences, University of Missouri, Columbia, Missouri, United States of America
- Benjamin Ryan Merideth
  - Department of Biochemistry, University of Missouri, Columbia, Missouri, United States of America
- Thomas P. Mawhinney
  - Department of Biochemistry, University of Missouri, Columbia, Missouri, United States of America
- Valeri V. Mossine
  - MU Botanical Center, University of Missouri, Columbia, Missouri, United States of America
  - Department of Biochemistry, University of Missouri, Columbia, Missouri, United States of America
- C. Michael Greenlief
  - MU Botanical Center, University of Missouri, Columbia, Missouri, United States of America
  - Department of Chemistry, University of Missouri, Columbia, Missouri, United States of America
- John C. Walker
  - Division of Biological Sciences, University of Missouri, Columbia, Missouri, United States of America
- William R. Folk
  - MU Botanical Center, University of Missouri, Columbia, Missouri, United States of America
  - Department of Biochemistry, University of Missouri, Columbia, Missouri, United States of America
- Mark Hannink
  - MU Botanical Center, University of Missouri, Columbia, Missouri, United States of America
  - Department of Biochemistry, University of Missouri, Columbia, Missouri, United States of America
- Dennis B. Lubahn
  - MU Botanical Center, University of Missouri, Columbia, Missouri, United States of America
  - Department of Biochemistry, University of Missouri, Columbia, Missouri, United States of America
- James A. Birchler
  - Division of Biological Sciences, University of Missouri, Columbia, Missouri, United States of America
- Jianlin Cheng
  - Computer Science Department, University of Missouri, Columbia, Missouri, United States of America
  - MU Botanical Center, University of Missouri, Columbia, Missouri, United States of America
  - Informatics Institute, University of Missouri, Columbia, Missouri, United States of America
  - C. Bond Life Science Center, University of Missouri, Columbia, Missouri, United States of America
|
15
|
Gong P, Madak-Erdogan Z, Li J, Cheng J, Greenlief CM, Helferich W, Katzenellenbogen JA, Katzenellenbogen BS. Transcriptomic analysis identifies gene networks regulated by estrogen receptor α (ERα) and ERβ that control distinct effects of different botanical estrogens. Nuclear Receptor Signaling 2014; 12:e001. [PMID: 25363786 PMCID: PMC4193135 DOI: 10.1621/nrs.12001] [Citation(s) in RCA: 52] [Impact Index Per Article: 5.2] [Received: 04/03/2014] [Revised: 04/28/2014] [Accepted: 05/13/2014] [Indexed: 12/31/2022]
Abstract
The estrogen receptors (ERs) ERα and ERβ mediate the actions of endogenous estrogens as well as those of botanical estrogens (BEs) present in plants. BEs are ingested in the diet and also widely consumed by postmenopausal women as dietary supplements, often as a substitute for the loss of endogenous estrogens at menopause. However, their activities and efficacies, and similarities and differences in gene expression programs with respect to endogenous estrogens such as estradiol (E2) are not fully understood. Because gene expression patterns underlie and control the broad physiological effects of estrogens, we have investigated and compared the gene networks that are regulated by different BEs and by E2. Our aim was to determine if the soy and licorice BEs control similar or different gene expression programs and to compare their gene regulations with that of E2. Gene expression was examined by RNA-Seq in human breast cancer (MCF7) cells treated with control vehicle, BE or E2. These cells contained three different complements of ERs, ERα only, ERα+ERβ, or ERβ only, reflecting the different ratios of these two receptors in different human breast cancers and in different estrogen target cells. Using principal component, hierarchical clustering, and gene ontology and interactome analyses, we found that BEs regulated many of the same genes as did E2. The genes regulated by each BE, however, were somewhat different from one another, with some genes being regulated uniquely by each compound. The overlap with E2 in regulated genes was greatest for the soy isoflavones genistein and S-equol, while the greatest difference from E2 in gene expression pattern was observed for the licorice root BE liquiritigenin. The gene expression pattern of each ligand depended greatly on the cell background of ERs present. 
Despite similarities in gene expression pattern with E2, the BEs were generally less stimulatory of genes promoting proliferation and were more pro-apoptotic in their gene regulations than E2. The distinctive patterns of gene regulation by the individual BEs and E2 may underlie differences in the activities of these soy and licorice-derived BEs in estrogen target cells containing different levels of the two ERs.
Affiliation(s)
- Jilong Li
- Botanical Research Center, University of Missouri, Columbia, MO 65211
| | - Jianlin Cheng
- Botanical Research Center, University of Missouri, Columbia, MO 65211
|
16
|
Abstract
Protein location and function can change dynamically depending on many factors, including environmental stress, disease state, age, developmental stage, and cell type. Here, we describe an integrative computational framework, called the conditional function predictor (CoFP; http://nbm.ajou.ac.kr/cofp/), for predicting changes in subcellular location and function on a proteome-wide scale. The essence of the CoFP approach is to cross-reference general knowledge about a protein and its known network of physical interactions, which typically pool measurements from diverse environments, against gene expression profiles that have been measured under specific conditions of interest. Using CoFP, we predict condition-specific subcellular locations, biological processes, and molecular functions of the yeast proteome under 18 specified conditions. In addition to highly accurate retrieval of previously known gold standard protein locations and functions, CoFP predicts previously unidentified condition-dependent locations and functions for nearly all yeast proteins. Many of these predictions can be confirmed using high-resolution cellular imaging. We show that, under DNA-damaging conditions, Tsr1, Caf120, Dip5, Skg6, Lte1, and Nnf2 change subcellular location and RNA polymerase I subunit A43, Ino2, and Ids2 show changes in DNA binding. Beyond specific predictions, this work reveals a global landscape of changing protein location and function, highlighting a surprising number of proteins that translocate from the mitochondria to the nucleus or from endoplasmic reticulum to Golgi apparatus under stress.
|
17
|
Qu Z, Meng F, Zhou H, Li J, Wang Q, Wei F, Cheng J, Greenlief CM, Lubahn DB, Sun GY, Liu S, Gu Z. NitroDIGE analysis reveals inhibition of protein S-nitrosylation by epigallocatechin gallates in lipopolysaccharide-stimulated microglial cells. J Neuroinflammation 2014; 11:17. [PMID: 24472655 PMCID: PMC3922161 DOI: 10.1186/1742-2094-11-17] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Received: 07/07/2013] [Accepted: 01/20/2014] [Indexed: 12/28/2022] Open
Abstract
Background Nitric oxide (NO) is a signaling molecule regulating numerous cellular functions in development and disease. In the brain, neuronal injury or neuroinflammation can lead to microglial activation, which induces NO production. NO can react with critical cysteine thiols of target proteins forming S-nitroso-proteins. This modification, known as S-nitrosylation, is an evolutionarily conserved redox-based post-translational modification (PTM) of specific proteins analogous to phosphorylation. In this study, we describe a protocol for analyzing S-nitrosylation of proteins using a gel-based proteomic approach and use it to investigate the modes of action of a botanical compound found in green tea, epigallocatechin-3-gallate (EGCG), on protein S-nitrosylation after microglial activation. Methods/Results To globally and quantitatively analyze NO-induced protein S-nitrosylation, the sensitive gel-based proteomic method, termed NitroDIGE, was developed by combining two-dimensional differential in-gel electrophoresis (2-D DIGE) with the modified biotin switch technique (BST) using fluorescence-tagged CyDye™ thiol reactive agents to label S-nitrosothiols. The NitroDIGE method showed high specificity and sensitivity in detecting S-nitrosylated proteins (SNO-proteins). Using this approach, we identified a subset of SNO-proteins ex vivo by exposing immortalized murine BV-2 microglial cells to a physiological NO donor, or in vivo by exposing BV-2 cells to endotoxin lipopolysaccharides (LPS) to induce a proinflammatory response. Moreover, EGCG was shown to attenuate S-nitrosylation of proteins after LPS-induced activation of microglial cells primarily by modulation of the nuclear factor erythroid 2-related factor 2 (Nrf2)-mediated oxidative stress response. 
Conclusions These results demonstrate that NitroDIGE is an effective proteomic strategy for “top-down” quantitative analysis of protein S-nitrosylation in multi-group samples in response to nitrosative stress due to excessive generation of NO in cells. Using this approach, we have revealed the ability of EGCG to down-regulate protein S-nitrosylation in LPS-stimulated BV-2 microglial cells, consistent with its known antioxidant effects.
Affiliation(s)
- Zezong Gu
- Department of Pathology & Anatomical Sciences, University of Missouri School of Medicine, Columbia, MO 65212, USA.
|
18
|
Zhu M, Dahmen JL, Stacey G, Cheng J. Predicting gene regulatory networks of soybean nodulation from RNA-Seq transcriptome data. BMC Bioinformatics 2013; 14:278. [PMID: 24053776 PMCID: PMC3854569 DOI: 10.1186/1471-2105-14-278] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Received: 12/20/2012] [Accepted: 09/03/2013] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND High-throughput RNA sequencing (RNA-Seq) is a revolutionary technique to study the transcriptome of a cell under various conditions at a systems level. Despite the wide application of RNA-Seq techniques to generate experimental data in the last few years, few computational methods are available to analyze this huge amount of transcription data. The computational methods for constructing gene regulatory networks from RNA-Seq expression data of hundreds or even thousands of genes are particularly lacking and urgently needed. RESULTS We developed an automated bioinformatics method to predict gene regulatory networks from the quantitative expression values of differentially expressed genes based on RNA-Seq transcriptome data of a cell in different stages and conditions, integrating transcriptional, genomic and gene function data. We applied the method to the RNA-Seq transcriptome data generated for soybean root hair cells in three different development stages of nodulation after rhizobium infection. The method predicted a soybean nodulation-related gene regulatory network consisting of 10 regulatory modules common for all three stages, and 24, 49 and 70 modules separately for the first, second and third stage, each containing both a group of co-expressed genes and several transcription factors collaboratively controlling their expression under different conditions. 8 of 10 common regulatory modules were validated by at least two kinds of validations, such as independent DNA binding motif analysis, gene function enrichment test, and previous experimental data in the literature. CONCLUSIONS We developed a computational method to reliably reconstruct gene regulatory networks from RNA-Seq transcriptome data. The method can generate valuable hypotheses for interpreting biological data and designing biological experiments such as ChIP-Seq, RNA interference, and yeast two hybrid experiments.
Affiliation(s)
- Mingzhu Zhu
- Department of Computer Science, University of Missouri, Columbia, MO 65211, USA
- Current address: Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
| | - Jeremy L Dahmen
- C.S. Bond Life Science Center, University of Missouri, Columbia, MO, USA
- Divisions of Plant Science and Biochemistry, Columbia, MO, USA
| | - Gary Stacey
- C.S. Bond Life Science Center, University of Missouri, Columbia, MO, USA
- Divisions of Plant Science and Biochemistry, Columbia, MO, USA
| | - Jianlin Cheng
- Department of Computer Science, University of Missouri, Columbia, MO 65211, USA
- Informatics Institute, University of Missouri, Columbia, MO, USA
- C.S. Bond Life Science Center, University of Missouri, Columbia, MO, USA
|
19
|
A novel function prediction approach using protein overlap networks. BMC Systems Biology 2013; 7:61. [PMID: 23866986 PMCID: PMC3720179 DOI: 10.1186/1752-0509-7-61] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 02/08/2013] [Accepted: 07/12/2013] [Indexed: 11/10/2022]
Abstract
BACKGROUND Construction of a reliable network remains the bottleneck for network-based protein function prediction. We built an artificial network model called protein overlap network (PON) for the entire genome of yeast, fly, worm, and human, respectively. Each node of the network represents a protein, and two proteins are connected if they share a domain according to the InterPro database. RESULTS The function of a protein can be predicted by counting the occurrence frequency of GO (gene ontology) terms associated with domains of direct neighbors. The average success rate and coverage were 34.3% and 43.9%, respectively, for the test genomes, and were increased to 37.9% and 51.3% when a composite PON of the four species was used for the prediction. As a comparison, the success rate was 7.0% in the random control procedure. We also made predictions with GO term annotations of the second layer nodes using the composite network and obtained an impressive success rate (>30%) and coverage (>30%), even for small genomes. Further improvement was achieved by statistical analysis of manually annotated GO terms for each neighboring protein. CONCLUSIONS The PONs are composed of dense modules accompanied by a few long distance connections. Based on the PONs, we developed multiple approaches effective for protein function prediction.
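The neighbor-counting idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy proteins, the domain labels, and the GO labels are all hypothetical, and the scoring is a plain frequency count with no statistical filtering.

```python
from collections import Counter, defaultdict


def build_overlap_network(protein_domains):
    """Link two proteins whenever they share at least one domain."""
    by_domain = defaultdict(set)
    for protein, domains in protein_domains.items():
        for domain in domains:
            by_domain[domain].add(protein)
    neighbors = {protein: set() for protein in protein_domains}
    for members in by_domain.values():
        for protein in members:
            neighbors[protein] |= members - {protein}
    return neighbors


def predict_go_terms(target, neighbors, protein_domains, domain_go, top_n=1):
    """Rank GO terms by how often they occur on domains of direct neighbors."""
    counts = Counter()
    for neighbor in neighbors[target]:
        for domain in protein_domains[neighbor]:
            counts.update(domain_go.get(domain, ()))
    return [term for term, _ in counts.most_common(top_n)]


# Toy data (hypothetical proteins, domains, and GO labels).
toy_proteins = {
    "P1": {"D1"},
    "P2": {"D1", "D2"},
    "P3": {"D1", "D3"},
    "P4": {"D4"},
}
toy_domain_go = {
    "D1": ["GO:toyA"],
    "D2": ["GO:toyB"],
    "D3": ["GO:toyA", "GO:toyD"],
    "D4": ["GO:toyC"],
}
toy_network = build_overlap_network(toy_proteins)
toy_prediction = predict_go_terms("P1", toy_network, toy_proteins, toy_domain_go)
```

In the toy network, P1 is linked to P2 and P3 through the shared domain D1, while P4 has no neighbors and therefore receives no prediction, which mirrors the coverage limit reported in the abstract.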
|
20
|
The properties of genome conformation and spatial gene interaction and regulation networks of normal and malignant human cell types. PLoS One 2013; 8:e58793. [PMID: 23536826 PMCID: PMC3594155 DOI: 10.1371/journal.pone.0058793] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 10/13/2012] [Accepted: 02/06/2013] [Indexed: 01/01/2023] Open
Abstract
The spatial conformation of a genome plays an important role in the long-range regulation of genome-wide gene expression and methylation, but has not been extensively studied due to lack of genome conformation data. The recently developed chromosome conformation capturing techniques such as the Hi-C method empowered by next generation sequencing can generate unbiased, large-scale, high-resolution chromosomal interaction (contact) data, providing an unprecedented opportunity to investigate the spatial structure of a genome and its applications in gene regulation, genomics, epigenetics, and cell biology. In this work, we conducted a comprehensive, large-scale computational analysis of this new stream of genome conformation data generated for three different human leukemia cells or cell lines by the Hi-C technique. We developed and applied a set of bioinformatics methods to reliably generate spatial chromosomal contacts from high-throughput sequencing data and to effectively use them to study the properties of the genome structures in one-dimension (1D) and two-dimension (2D). Our analysis demonstrates that Hi-C data can be effectively applied to study tissue-specific genome conformation, chromosome-chromosome interaction, chromosomal translocations, and spatial gene-gene interaction and regulation in a three-dimensional genome of primary tumor cells. Particularly, for the first time, we constructed genome-scale spatial gene-gene interaction network, transcription factor binding site (TFBS) – TFBS interaction network, and TFBS-gene interaction network from chromosomal contact information. Remarkably, all these networks possess the properties of scale-free modular networks.
|
21
|
Fang H, Gough J. A disease-drug-phenotype matrix inferred by walking on a functional domain network. Molecular BioSystems 2013; 9:1686-96. [PMID: 23462907 DOI: 10.1039/c3mb25495j] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 12/17/2022]
Abstract
Protein domains are classified as units of structure, evolution and function, and thus form the molecular backbone of the biosphere. Although functional networks at the protein level have been reported to be of value in predicting diseases (phenotypes or drugs), they have not previously been applied at the sub-protein resolution (protein domain in this case). We herein introduce a domain network with a functional perspective. This network has nodes consisting of protein domains (at the superfamily/evolutionary level), with edges weighted by the semantic similarity according to domain-centric Gene Ontology (dcGO) annotations, which henceforth we call "dcGOnet". By globally exploring this network via a random walk, we demonstrate its predictive value on disease, drug, or phenotype-related ontologies. On cross-validation recovering ontology labels for domains, we achieve an overall area under the ROC curve of 89.0% for drugs, 87.3% for diseases, 87.6% for human phenotypes and 88.2% for mouse phenotypes. We show that the performance using global information from this network is significantly better than using local information, and also illustrate that the better performance is not sensitive to network size, or the choice of algorithm parameters, and is universal to different ontologies. Based on the dcGOnet and its global properties, we further develop an approach to build a disease-drug-phenotype matrix. The predicted interconnections are statistically supported using a novel randomization procedure, and are also empirically supported by inspection for biological relevance. Most of the high-ranking predictions recover connections that are well known, but others uncover connections that have only suggestive or obscure support in the literature; we show that these are missed by simpler methods, in particular for drug-disease connections. 
The value of this work is threefold: we describe a general methodology and make the software available, we provide the functional domain network itself, and the ranked drug-disease-phenotype matrix provides rich targets for investigation. All three can be found at .
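The abstract says the network is explored "via a random walk" but does not state which variant was used. The sketch below assumes a random walk with restart, one common choice for propagating labels over a weighted network; the function name, the restart probability, and the toy edge weights are hypothetical and are not taken from the paper or its software.

```python
def random_walk_with_restart(weights, seeds, restart=0.5, n_iter=200):
    """Propagate probability mass from seed nodes over a weighted graph.

    weights: node -> {neighbor: edge weight}; assumed symmetric, and every
    neighbor must also appear as a key. seeds: nodes carrying the label.
    """
    nodes = list(weights)
    strength = {u: sum(weights[u].values()) for u in nodes}
    start = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    scores = dict(start)
    for _ in range(n_iter):
        # With probability `restart`, the walker jumps back to a seed.
        nxt = {u: restart * start[u] for u in nodes}
        for u in nodes:
            moving = (1.0 - restart) * scores[u]
            if strength[u] == 0:
                nxt[u] += moving  # isolated node: mass stays in place
                continue
            for v, w in weights[u].items():
                nxt[v] += moving * w / strength[u]
        scores = nxt
    return scores


# Toy domain network (hypothetical nodes and similarity weights).
toy_weights = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 0.5},
    "C": {"B": 0.5},
    "D": {},
}
toy_scores = random_walk_with_restart(toy_weights, seeds={"A"})
```

Nodes that are closer to the seed, or connected to it by heavier edges, end up with higher scores, which is the sense in which the walk uses global rather than purely local network information.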
Affiliation(s)
- Hai Fang
- Department of Computer Science, University of Bristol, The Merchant Venturers Building, Bristol BS8 1UB, UK.
|
22
|
Wang Z, Cao R, Cheng J. Three-level prediction of protein function by combining profile-sequence search, profile-profile search, and domain co-occurrence networks. BMC Bioinformatics 2013; 14 Suppl 3:S3. [PMID: 23514381 PMCID: PMC3584933 DOI: 10.1186/1471-2105-14-s3-s3] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 12/27/2022] Open
Abstract
Predicting protein function from sequence is useful for biochemical experiment design, mutagenesis analysis, protein engineering, protein design, biological pathway analysis, drug design, disease diagnosis, and genome annotation as a vast number of protein sequences with unknown function are routinely being generated by DNA, RNA and protein sequencing in the genomic era. However, despite significant progress in the last several years, the accuracy of protein function prediction still needs to be improved in order to be used effectively in practice, particularly when little or no homology exists between a target protein and proteins with annotated function. Here, we developed a method that integrated profile-sequence alignment, profile-profile alignment, and Domain Co-Occurrence Networks (DCN) to predict protein function at different levels of complexity, ranging from obvious homology, to remote homology, to no homology. We tested the method blindly in the 2011 Critical Assessment of Function Annotation (CAFA). Our experiments demonstrated that our three-level prediction method effectively increased the recall of function prediction while maintaining a reasonable precision. Particularly, our method can predict function terms defined by the Gene Ontology more accurately than three standard baseline methods in most situations, handle multi-domain proteins naturally, and make ab initio function prediction when no homology exists. These results show that our approach can combine complementary strengths of most widely used BLAST-based function prediction methods, rarely used in function prediction but more sensitive profile-profile comparison-based homology detection methods, and non-homology-based domain co-occurrence networks, to effectively extend the power of function prediction from high homology, to low homology, to no homology (ab initio cases).
Affiliation(s)
- Zheng Wang
- Department of Computer Science, University of Missouri, Columbia, Missouri 65211, USA
|
23
|
Mangiola S, Young ND, Korhonen P, Mondal A, Scheerlinck JP, Sternberg PW, Cantacessi C, Hall RS, Jex AR, Gasser RB. Getting the most out of parasitic helminth transcriptomes using HelmDB: implications for biology and biotechnology. Biotechnol Adv 2012; 31:1109-19. [PMID: 23266393 DOI: 10.1016/j.biotechadv.2012.12.004] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Received: 10/11/2012] [Revised: 12/08/2012] [Accepted: 12/13/2012] [Indexed: 12/17/2022]
Abstract
Compounded by a massive global food shortage, many parasitic diseases have a devastating, long-term impact on animal and human health and welfare worldwide. Parasitic helminths (worms) affect the health of billions of animals. Unlocking the systems biology of these neglected pathogens will underpin the design of new and improved interventions against them. Currently, the functional annotation of genomic and transcriptomic sequence data for socio-economically important parasitic worms relies almost exclusively on comparative bioinformatic analyses using model organism- and other databases. However, many genes and gene products of parasitic helminths (often >50%) cannot be annotated using this approach, because they are specific to parasites and/or do not have identifiable homologs in other organisms for which sequence data are available. This inability to fully annotate transcriptomes and predicted proteomes is a major challenge and constrains our understanding of the biology of parasites, interactions with their hosts and of parasitism and the pathogenesis of disease on a molecular level. In the present article, we compiled transcriptomic data sets of key, socioeconomically important parasitic helminths, and constructed and validated a curated database, called HelmDB (www.helmdb.org). We demonstrate how this database can be used effectively for the improvement of functional annotation by employing data integration and clustering. Importantly, HelmDB provides a practical and user-friendly toolkit for sequence browsing and comparative analyses among divergent helminth groups (including nematodes and trematodes), and should be readily adaptable and applicable to a wide range of other organisms. This web-based, integrative database should assist 'systems biology' studies of parasitic helminths, and the discovery and prioritization of novel drug and vaccine targets. 
This focus provides a pathway toward developing new and improved approaches for the treatment and control of parasitic diseases, with the potential for important biotechnological outcomes.
Affiliation(s)
- Stefano Mangiola
- Faculty of Veterinary Science, The University of Melbourne, Victoria 3010, Australia
|
24
|
Zhu M, Deng X, Joshi T, Xu D, Stacey G, Cheng J. Reconstructing differentially co-expressed gene modules and regulatory networks of soybean cells. BMC Genomics 2012; 13:437. [PMID: 22938179 PMCID: PMC3563468 DOI: 10.1186/1471-2164-13-437] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 03/15/2012] [Accepted: 08/22/2012] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND Current experimental evidence indicates that functionally related genes show coordinated expression in order to perform their cellular functions. In this way, the cell transcriptional machinery can respond optimally to internal or external stimuli. This provides a research opportunity to identify and study co-expressed gene modules whose transcription is controlled by shared gene regulatory networks. RESULTS We developed and integrated a set of computational methods of differential gene expression analysis, gene clustering, gene network inference, gene function prediction, and DNA motif identification to automatically identify differentially co-expressed gene modules, reconstruct their regulatory networks, and validate their correctness. We tested the methods using microarray data derived from soybean cells grown under various stress conditions. Our methods were able to identify 42 coherent gene modules within which average gene expression correlation coefficients are greater than 0.8 and reconstruct their putative regulatory networks. A total of 32 modules and their regulatory networks were further validated by the coherence of predicted gene functions and the consistency of putative transcription factor binding motifs. Approximately half of the 32 modules were partially supported by the literature, which demonstrates that the bioinformatic methods used can help elucidate the molecular responses of soybean cells upon various environmental stresses. CONCLUSIONS The bioinformatics methods and genome-wide data sources for gene expression, clustering, regulation, and function analysis were integrated seamlessly into one modular protocol to systematically analyze and infer modules and networks from only differential expression genes in soybean cells grown under stress conditions. 
Our approach appears to effectively reduce the complexity of the problem, and is sufficiently robust and accurate to generate a rather complete and detailed view of putative soybean gene transcription logic potentially underlying the responses to the various environmental challenges. The same automated method can also be applied to reconstruct differentially co-expressed gene modules and their regulatory networks from gene expression data of any other transcriptome.
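The coherence criterion quoted in this abstract (average within-module expression correlation above 0.8) can be checked with a short sketch. The abstract does not name the correlation measure, so Pearson correlation is an assumption here, and the function names and toy expression profiles are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def mean_pairwise_correlation(profiles):
    """Average correlation over all gene pairs in one candidate module."""
    pairs = list(combinations(profiles.values(), 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


def is_coherent(profiles, threshold=0.8):
    """Apply the 0.8 cutoff mentioned in the abstract."""
    return mean_pairwise_correlation(profiles) > threshold


# Toy modules (hypothetical genes; one column per stress condition).
tight_module = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [0.9, 2.1, 2.9, 4.2],
}
loose_module = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [4.0, 3.0, 2.0, 1.0],
    "g3": [1.0, 3.0, 2.0, 4.0],
}
```

Here `is_coherent(tight_module)` holds because all three profiles rise together, while `loose_module` fails because one gene is anti-correlated with the others. Profiles with zero variance would need separate handling, which this sketch omits.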
Affiliation(s)
- Mingzhu Zhu
- Department of Computer Science, University of Missouri, Columbia, MO 65211, U.S.A
| | - Xin Deng
- Department of Computer Science, University of Missouri, Columbia, MO 65211, U.S.A
| | - Trupti Joshi
- Department of Computer Science, University of Missouri, Columbia, MO 65211, U.S.A
- Informatics Institute, University of Missouri, Columbia, MO 65211, U.S.A
- C.S. Bond Life Science Center, University of Missouri, Columbia, MO 65211, U.S.A
| | - Dong Xu
- Department of Computer Science, University of Missouri, Columbia, MO 65211, U.S.A
- Informatics Institute, University of Missouri, Columbia, MO 65211, U.S.A
- C.S. Bond Life Science Center, University of Missouri, Columbia, MO 65211, U.S.A
| | - Gary Stacey
- C.S. Bond Life Science Center, University of Missouri, Columbia, MO 65211, U.S.A
- Divisions of Plant Sciences and Biochemistry, University of Missouri, Columbia, MO 65211, U.S.A
| | - Jianlin Cheng
- Department of Computer Science, University of Missouri, Columbia, MO 65211, U.S.A
- Informatics Institute, University of Missouri, Columbia, MO 65211, U.S.A
- C.S. Bond Life Science Center, University of Missouri, Columbia, MO 65211, U.S.A
|
25
|
Zhang XC, Wang Z, Zhang X, Le MH, Sun J, Xu D, Cheng J, Stacey G. Evolutionary dynamics of protein domain architecture in plants. BMC Evol Biol 2012; 12:6. [PMID: 22252370 PMCID: PMC3310802 DOI: 10.1186/1471-2148-12-6] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Received: 06/23/2011] [Accepted: 01/17/2012] [Indexed: 12/17/2022] Open
Abstract
Background Protein domains are the structural, functional and evolutionary units of the protein. Protein domain architectures are the linear arrangements of domain(s) in individual proteins. Although the evolutionary history of protein domain architecture has been extensively studied in microorganisms, the evolutionary dynamics of domain architecture in the plant kingdom remains largely undefined. To address this question, we analyzed the lineage-based protein domain architecture content in 14 completed green plant genomes. Results Our analyses show that all 14 plant genomes maintain similar distributions of species-specific, single-domain, and multi-domain architectures. Approximately 65% of plant domain architectures are universally present in all plant lineages, while the remaining architectures are lineage-specific. Clear examples are seen of both the loss and gain of specific protein architectures in higher plants. There has been a dynamic, lineage-wise expansion of domain architectures during plant evolution. The data suggest that this expansion can be largely explained by changes in nuclear ploidy resulting from rounds of whole genome duplications. Indeed, there has been a decrease in the number of unique domain architectures when the genomes were normalized into a presumed ancestral genome that has not undergone whole genome duplications. Conclusions Our data show the conservation of universal domain architectures in all available plant genomes, indicating the presence of an evolutionarily conserved, core set of protein components. However, the occurrence of lineage-specific domain architectures indicates that domain architecture diversity has been maintained beyond these core components in plant genomes. 
Although several features of genome-wide domain architecture content are conserved in plants, the data clearly demonstrate lineage-wise, progressive changes and expansions of individual protein domain architectures, reinforcing the notion that plant genomes have undergone dynamic evolution.
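The distinction between universal and non-universal domain architectures drawn in this abstract can be made concrete with a small sketch. It assumes each architecture is encoded as a tuple of domain names in linear order; the function name, genome names, and domain labels are hypothetical, and the species-specific set is only one simple reading of "lineage-specific".

```python
def classify_architectures(genomes):
    """Split architectures into universal and species-specific sets.

    genomes: genome name -> set of architectures, where each architecture
    is a tuple of domain names in their linear order along the protein.
    """
    universal = set.intersection(*genomes.values())
    all_architectures = set().union(*genomes.values())
    species_specific = {}
    for name, architectures in genomes.items():
        others = set().union(
            *(a for other, a in genomes.items() if other != name)
        )
        species_specific[name] = architectures - others
    universal_fraction = len(universal) / len(all_architectures)
    return universal, species_specific, universal_fraction


# Toy genomes (hypothetical species and domain names).
toy_genomes = {
    "plantA": {("D1",), ("D2", "D1"), ("D3",)},
    "plantB": {("D1",), ("D2", "D1")},
    "plantC": {("D1",), ("D2", "D1"), ("D4", "D2")},
}
toy_universal, toy_specific, toy_fraction = classify_architectures(toy_genomes)
```

Because architectures are ordered tuples, `("D2", "D1")` and `("D1", "D2")` count as different architectures, matching the definition of an architecture as a linear arrangement of domains.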
Affiliation(s)
- Xue-Cheng Zhang
- Division of Plant Sciences, University of Missouri, Columbia, MO 65211, USA.
|