1. Khandia R, Garg R, Pandey MK, Khan AA, Dhanda SK, Malik A, Gurjar P. Determination of codon pattern and evolutionary forces acting on genes linked to inflammatory bowel disease. Int J Biol Macromol 2024; 278:134480. [PMID: 39116987 DOI: 10.1016/j.ijbiomac.2024.134480] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2024] [Revised: 06/25/2024] [Accepted: 07/31/2024] [Indexed: 08/10/2024]
Abstract
Inflammatory bowel disease (IBD) is an inflammatory disorder of the gastrointestinal tract. The present study attempted to understand the codon usage preferences in genes associated with IBD progression. Compositional analysis, codon usage bias (CUB), relative synonymous codon usage (RSCU), RNA structure, and expression analyses were performed to obtain a comprehensive picture of codon usage in IBD genes. Compositional analysis of 62 IBD-associated genes revealed that G and T are the most and least abundant nucleotides, respectively. ApG, CpA, and TpG dinucleotides were overrepresented or randomly used, while ApC, CpG, GpT, and TpA dinucleotides were either underrepresented or randomly used in genes related to IBD. The codons influencing codon usage the most in IBD genes were CGC and AGG. A comparison of codon usage between IBD and pancreatitis (a non-IBD inflammatory disease) indicated that only the usage of codon CTG differed significantly between the two. In contrast, the usage of codons ATA, ACA, CGT, CAA, GTA, CCT, ATT, GCT, CGG, TTG, and CAG differed significantly between the IBD and housekeeping gene sets. The results suggest similar codon usage in at least two inflammatory disorders, IBD and pancreatitis. The analysis helps in understanding the codon biology, the factors affecting expression of IBD-associated genes, and the evolution of these genes. The study helps reveal the molecular patterns associated with IBD.
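As a rough illustration of the RSCU measure named in this abstract, the sketch below computes it for a single coding sequence. It assumes the standard genetic code and the usual definition (observed codon count divided by the mean count over that amino acid's synonymous codons); the function name `rscu` and the toy sequence are hypothetical, and the study's own pipeline may differ.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for every sense codon whose amino acid occurs in cds."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group synonymous codons by amino acid, skipping stop codons.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent: RSCU is undefined for its codons
        for c in family:
            # observed count / (total / number of synonymous codons)
            values[c] = counts[c] * len(family) / total
    return values


example = rscu("CTGCTGCTGCTAATG")  # hypothetical toy sequence
```

Values above 1 indicate a codon used more often than expected under equal synonymous usage; single-codon amino acids (Met, Trp) always score 1.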
Affiliation(s)
- Rekha Khandia
  - Department of Biochemistry and Genetics, Barkatullah University, Bhopal 462026, MP, India
- Rajkumar Garg
  - Department of Biosciences, Barkatullah University, Bhopal 462026, MP, India
- Megha Katare Pandey
  - Translational Medicine Center, All India Institute of Medical Sciences, Bhopal 462020, MP, India
- Azmat Ali Khan
  - Pharmaceutical Biotechnology Laboratory, Department of Pharmaceutical Chemistry, College of Pharmacy, King Saud University, Riyadh 11451, Saudi Arabia
- Sandeep Kumar Dhanda
  - Department of Oncology, St Jude Children's Research Hospital, Memphis, TN 38105, USA
- Abdul Malik
  - Department of Pharmaceutics, College of Pharmacy, King Saud University, Riyadh 11451, Saudi Arabia
- Pankaj Gurjar
  - Centre for Global Health Research, Saveetha Medical College and Hospital, Saveetha Institute of Medical and Technical Sciences, Saveetha University, Chennai, Tamil Nadu, India
  - Department of Science and Engineering, Novel Global Community Educational Foundation, Hebersham, Australia
2. Malaina I, Gonzalez-Melero L, Martínez L, Salvador A, Sanchez-Diez A, Asumendi A, Margareto J, Carrasco-Pujante J, Legarreta L, García MA, Pérez-Pinilla MB, Izu R, Martínez de la Fuente I, Igartua M, Alonso S, Hernandez RM, Boyano MD. Computational and Experimental Evaluation of the Immune Response of Neoantigens for Personalized Vaccine Design. Int J Mol Sci 2023; 24:9024. [PMID: 37240369 PMCID: PMC10219310 DOI: 10.3390/ijms24109024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2023] [Revised: 05/16/2023] [Accepted: 05/17/2023] [Indexed: 05/28/2023]
Abstract
In the last few years, the importance of neoantigens in the development of personalized antitumor vaccines has increased remarkably. To study whether bioinformatic tools are effective in detecting neoantigens that generate an immune response, DNA samples were obtained from patients with cutaneous melanoma at different stages, yielding a total of 6048 potential neoantigens. Thereafter, the immunological responses generated ex vivo by some of those neoantigens were tested, using a vaccine designed by a new optimization approach and encapsulated in nanoparticles. Our bioinformatic analysis found no difference between the number of neoantigens and the number of non-mutated sequences detected as potential binders by IEDB tools. However, those tools were able to highlight neoantigens over non-mutated peptides in HLA-II recognition (p-value 0.03). In contrast, neither HLA-I binding affinity (p-value 0.08) nor Class I immunogenicity values (p-value 0.96) differed significantly. Subsequently, the new vaccine was designed using aggregative functions and combinatorial optimization. The six best neoantigens were selected and formulated into two nanoparticles, with which the immune response ex vivo was evaluated, demonstrating a specific activation of the immune response. This study reinforces the use of bioinformatic tools in vaccine development, as their usefulness is proven both in silico and ex vivo.
Affiliation(s)
- Iker Malaina
  - Department of Mathematics, Faculty of Science and Technology, University of the Basque Country (UPV/EHU), 48940 Leioa, Spain
- Lorena Gonzalez-Melero
  - NanoBioCel Research Group, Laboratory of Pharmaceutics, School of Pharmacy, University of the Basque Country (UPV/EHU), 01006 Vitoria-Gasteiz, Spain
  - Bioaraba, NanoBioCel Research Group, 01009 Vitoria-Gasteiz, Spain
- Luis Martínez
  - Department of Mathematics, Faculty of Science and Technology, University of the Basque Country (UPV/EHU), 48940 Leioa, Spain
  - Basque Center for Applied Mathematics BCAM, 48009 Bilbao, Spain
- Aiala Salvador
  - NanoBioCel Research Group, Laboratory of Pharmaceutics, School of Pharmacy, University of the Basque Country (UPV/EHU), 01006 Vitoria-Gasteiz, Spain
  - Bioaraba, NanoBioCel Research Group, 01009 Vitoria-Gasteiz, Spain
  - Biomedical Research Networking Centre in Bioengineering, Biomaterials and Nanomedicine (CIBER-BBN), Institute of Health Carlos III, 28029 Madrid, Spain
- Ana Sanchez-Diez
  - Department of Dermatology, Basurto University Hospital, 48013 Bilbao, Spain
  - Biocruces Bizkaia Health Research Institute, 48903 Barakaldo, Spain
- Aintzane Asumendi
  - Biocruces Bizkaia Health Research Institute, 48903 Barakaldo, Spain
  - Department of Cell Biology and Histology, Faculty of Medicine and Nursing, University of the Basque Country (UPV/EHU), 48940 Leioa, Spain
- Javier Margareto
  - Technological Services Division, Health and Quality of Life, TECNALIA, 01510 Miñano, Spain
- Jose Carrasco-Pujante
  - Department of Mathematics, Faculty of Science and Technology, University of the Basque Country (UPV/EHU), 48940 Leioa, Spain
  - Basque Center for Applied Mathematics BCAM, 48009 Bilbao, Spain
- Leire Legarreta
  - Department of Mathematics, Faculty of Science and Technology, University of the Basque Country (UPV/EHU), 48940 Leioa, Spain
  - Basque Center for Applied Mathematics BCAM, 48009 Bilbao, Spain
- María Asunción García
  - Department of Mathematics, Faculty of Science and Technology, University of the Basque Country (UPV/EHU), 48940 Leioa, Spain
  - Basque Center for Applied Mathematics BCAM, 48009 Bilbao, Spain
- Martín Blas Pérez-Pinilla
  - Department of Mathematics, Faculty of Science and Technology, University of the Basque Country (UPV/EHU), 48940 Leioa, Spain
  - Basque Center for Applied Mathematics BCAM, 48009 Bilbao, Spain
- Rosa Izu
  - Department of Dermatology, Basurto University Hospital, 48013 Bilbao, Spain
  - Biocruces Bizkaia Health Research Institute, 48903 Barakaldo, Spain
- Ildefonso Martínez de la Fuente
  - Department of Mathematics, Faculty of Science and Technology, University of the Basque Country (UPV/EHU), 48940 Leioa, Spain
  - Basque Center for Applied Mathematics BCAM, 48009 Bilbao, Spain
  - CEBAS-CSIC Institute, Department of Nutrition, 30100 Murcia, Spain
- Manoli Igartua
  - NanoBioCel Research Group, Laboratory of Pharmaceutics, School of Pharmacy, University of the Basque Country (UPV/EHU), 01006 Vitoria-Gasteiz, Spain
  - Bioaraba, NanoBioCel Research Group, 01009 Vitoria-Gasteiz, Spain
  - Biomedical Research Networking Centre in Bioengineering, Biomaterials and Nanomedicine (CIBER-BBN), Institute of Health Carlos III, 28029 Madrid, Spain
- Santos Alonso
  - Department of Genetics, Physical Anthropology and Animal Physiology, Faculty of Science and Technology, University of the Basque Country (UPV/EHU), 48940 Leioa, Spain
- Rosa Maria Hernandez
  - NanoBioCel Research Group, Laboratory of Pharmaceutics, School of Pharmacy, University of the Basque Country (UPV/EHU), 01006 Vitoria-Gasteiz, Spain
  - Bioaraba, NanoBioCel Research Group, 01009 Vitoria-Gasteiz, Spain
  - Biomedical Research Networking Centre in Bioengineering, Biomaterials and Nanomedicine (CIBER-BBN), Institute of Health Carlos III, 28029 Madrid, Spain
- María Dolores Boyano
  - Biocruces Bizkaia Health Research Institute, 48903 Barakaldo, Spain
  - Department of Cell Biology and Histology, Faculty of Medicine and Nursing, University of the Basque Country (UPV/EHU), 48940 Leioa, Spain
3. Jaiswal M, Singh A, Kumar S. PTPAMP: prediction tool for plant-derived antimicrobial peptides. Amino Acids 2023; 55:1-17. [PMID: 35864258 DOI: 10.1007/s00726-022-03190-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2022] [Accepted: 07/12/2022] [Indexed: 01/28/2023]
Abstract
The emergence of antimicrobial peptides (AMPs) as a potential alternative to conventional antibiotics has led to the development of efficient computational methods for predicting AMPs. Among all organisms, the presence of multiple genes encoding AMPs in plants demands the development of a plant-based prediction tool. To this end, we developed models based on multiple peptide features, such as amino acid composition, dipeptide composition, and physicochemical attributes, for predicting plant-derived AMPs. The selected compositional models are integrated into a web server termed PTPAMP. The web server can classify a query peptide sequence into four functional activities, i.e., antimicrobial (AMP), antibacterial (ABP), antifungal (AFP), and antiviral (AVP). Our models achieved an average area under the curve of 0.95, 0.91, 0.85, and 0.88 for AMP, ABP, AFP, and AVP, respectively, on benchmark datasets, which were ~6.75% higher than the state-of-the-art methods. Moreover, our analysis indicates the abundance of cysteine residues in plant-derived AMPs and the distribution of other residues like G, S, K, and R, which differ as per the peptide structural family. Finally, we have developed a user-friendly web server, available at the URL: http://www.nipgr.ac.in/PTPAMP/. We expect this predictor to contribute substantially to the high-throughput identification of plant-derived AMPs, followed by additional insights into their functions.
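The two compositional features mentioned in this abstract can be sketched as follows. This is a minimal illustration assuming the common percentage-based definitions over the 20 standard residues; the function names are hypothetical and are not taken from PTPAMP.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def amino_acid_composition(peptide):
    """20-dimensional vector: percentage of each standard residue."""
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("empty peptide")
    return [100.0 * peptide.count(aa) / len(peptide) for aa in AMINO_ACIDS]


def dipeptide_composition(peptide):
    """400-dimensional vector: percentage of each overlapping residue pair."""
    peptide = peptide.upper()
    if len(peptide) < 2:
        raise ValueError("need at least two residues")
    pairs = [peptide[i:i + 2] for i in range(len(peptide) - 1)]
    return [100.0 * pairs.count(dp) / len(pairs) for dp in DIPEPTIDES]


# Hypothetical toy peptide; real inputs would be curated AMP sequences.
aac = amino_acid_composition("ACCA")
dpc = dipeptide_composition("ACCA")
```

Fixed-length vectors like these let peptides of different lengths be fed to the same classifier, at the cost of discarding positional information.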
Affiliation(s)
- Mohini Jaiswal
  - Bioinformatics Laboratory, National Institute of Plant Genome Research (NIPGR), Aruna Asaf Ali Marg, New Delhi, 110067, India
- Ajeet Singh
  - Bioinformatics Laboratory, National Institute of Plant Genome Research (NIPGR), Aruna Asaf Ali Marg, New Delhi, 110067, India
- Shailesh Kumar
  - Bioinformatics Laboratory, National Institute of Plant Genome Research (NIPGR), Aruna Asaf Ali Marg, New Delhi, 110067, India
4. Agrawal P, Bhalla S, Chaudhary K, Kumar R, Sharma M, Raghava GPS. In Silico Approach for Prediction of Antifungal Peptides. Front Microbiol 2018. [PMID: 29535692 PMCID: PMC5834480 DOI: 10.3389/fmicb.2018.00323] [Citation(s) in RCA: 92] [Impact Index Per Article: 15.3] [Indexed: 12/14/2022]
Abstract
This paper describes in silico models developed using a wide range of peptide features for predicting antifungal peptides (AFPs). Our analyses indicate that certain types of residues (e.g., C, G, H, K, R, Y) are more abundant in AFPs. The positional residue preference analysis reveals the prominence of particular residues (e.g., R, V, K) at the N-terminus and of others (e.g., C, H) at the C-terminus. In this study, models have been developed for predicting AFPs using a wide range of peptide features (such as residue composition, binary profile, and terminal residues). The support vector machine-based model developed using compositional features of peptides achieved a maximum accuracy of 88.78% on the training dataset and 83.33% on the independent (validation) dataset. Our model developed using binary patterns of terminal residues of peptides achieved a maximum accuracy of 84.88% on the training dataset and 84.64% on the validation dataset. We benchmarked the models developed in this study and existing methods on a dataset containing compositionally similar AFPs and non-AFPs. The binary-based model developed in this study performed better than any other model or method. To assist the scientific community, we developed a mobile app, a standalone package, and a user-friendly web server, ‘Antifp’ (http://webs.iiitd.edu.in/raghava/antifp).
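A binary profile of terminal residues, as used by the second model in this abstract, is essentially a one-hot encoding of a fixed-size window at either end of the peptide. The sketch below is an assumption-laden illustration: the window size `n`, the residue ordering, and the function name `binary_profile` are all hypothetical choices, not the Antifp implementation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binary_profile(peptide, n=5, terminus="N"):
    """One-hot encode the first (N) or last (C) n residues: n * 20 values."""
    peptide = peptide.upper()
    if terminus not in ("N", "C"):
        raise ValueError("terminus must be 'N' or 'C'")
    if len(peptide) < n:
        raise ValueError("peptide is shorter than the requested window")
    window = peptide[:n] if terminus == "N" else peptide[-n:]

    vector = []
    for residue in window:
        one_hot = [0] * len(AMINO_ACIDS)
        # str.index raises ValueError for non-standard residues.
        one_hot[AMINO_ACIDS.index(residue)] = 1
        vector.extend(one_hot)
    return vector


# Hypothetical toy peptide.
n_term = binary_profile("ACDEFGHIK", n=2, terminus="N")
c_term = binary_profile("ACDEFGHIK", n=2, terminus="C")
```

Unlike composition features, this encoding preserves which residue sits at which position, which is why it can separate peptides with similar overall composition.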
Affiliation(s)
- Piyush Agrawal
  - Council of Scientific and Industrial Research, Institute of Microbial Technology, Chandigarh, India
- Sherry Bhalla
  - Council of Scientific and Industrial Research, Institute of Microbial Technology, Chandigarh, India
- Kumardeep Chaudhary
  - Council of Scientific and Industrial Research, Institute of Microbial Technology, Chandigarh, India
- Rajesh Kumar
  - Council of Scientific and Industrial Research, Institute of Microbial Technology, Chandigarh, India
- Meenu Sharma
  - Council of Scientific and Industrial Research, Institute of Microbial Technology, Chandigarh, India
- Gajendra P S Raghava
  - Council of Scientific and Industrial Research, Institute of Microbial Technology, Chandigarh, India
  - Center for Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
5. Bessière C, Taha M, Petitprez F, Vandel J, Marin JM, Bréhélin L, Lèbre S, Lecellier CH. Probing instructions for expression regulation in gene nucleotide compositions. PLoS Comput Biol 2018; 14:e1005921. [PMID: 29293496 PMCID: PMC5766238 DOI: 10.1371/journal.pcbi.1005921] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 07/11/2017] [Revised: 01/12/2018] [Accepted: 12/10/2017] [Indexed: 01/22/2023]
Abstract
Gene expression is orchestrated by distinct regulatory regions to ensure a wide variety of cell types and functions. A challenge is to identify which regulatory regions are active, what their associated features are, and how they work together in each cell type. Several approaches have tackled this problem by modeling gene expression based on epigenetic marks, with the ultimate goal of identifying driving regions and associated genomic variations that are clinically relevant, in particular in precision medicine. However, these models rely on experimental data, which are limited to specific samples (often even to cell lines) and cannot be generated for all regulators and all patients. In addition, we show here that, although these approaches are accurate in predicting gene expression, inference of TF combinations from this type of model is not straightforward. Furthermore, these methods are not designed to capture regulation instructions present at the sequence level, before the binding of regulators or the opening of the chromatin. Here, we probe sequence-level instructions for gene expression and develop a method to explain mRNA levels based solely on nucleotide features. Our method positions nucleotide composition as a critical component of gene expression. Moreover, our approach, which can rank regulatory regions according to their contribution, unveils a strong influence of the gene body sequence, in particular introns. We further provide evidence that the contribution of nucleotide content can be linked to co-regulations associated with genome 3D architecture and to associations of genes within topologically associated domains.
Affiliation(s)
- Chloé Bessière
  - IBC, Univ. Montpellier, CNRS, Montpellier, France
  - Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
- May Taha
  - IBC, Univ. Montpellier, CNRS, Montpellier, France
  - Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
  - IMAG, Univ. Montpellier, CNRS, Montpellier, France
- Florent Petitprez
  - IBC, Univ. Montpellier, CNRS, Montpellier, France
  - Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
- Jimmy Vandel
  - IBC, Univ. Montpellier, CNRS, Montpellier, France
  - LIRMM, Univ. Montpellier, CNRS, Montpellier, France
- Jean-Michel Marin
  - IBC, Univ. Montpellier, CNRS, Montpellier, France
  - IMAG, Univ. Montpellier, CNRS, Montpellier, France
- Laurent Bréhélin
  - IBC, Univ. Montpellier, CNRS, Montpellier, France
  - LIRMM, Univ. Montpellier, CNRS, Montpellier, France
- Sophie Lèbre
  - IBC, Univ. Montpellier, CNRS, Montpellier, France
  - IMAG, Univ. Montpellier, CNRS, Montpellier, France
  - Univ. Paul-Valéry-Montpellier 3, Montpellier, France
- Charles-Henri Lecellier
  - IBC, Univ. Montpellier, CNRS, Montpellier, France
  - Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
6. Codon usage and amino acid usage influence genes expression level. Genetica 2017; 146:53-63. [DOI: 10.1007/s10709-017-9996-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 04/08/2017] [Accepted: 10/09/2017] [Indexed: 11/30/2022]
7. Yerukala Sathipati S, Ho SY. Identifying the miRNA signature associated with survival time in patients with lung adenocarcinoma using miRNA expression profiles. Sci Rep 2017; 7:7507. [PMID: 28790336 PMCID: PMC5548864 DOI: 10.1038/s41598-017-07739-y] [Citation(s) in RCA: 57] [Impact Index Per Article: 8.1] [Received: 01/03/2017] [Accepted: 07/04/2017] [Indexed: 12/19/2022]
Abstract
Lung adenocarcinoma is a multifactorial disease. MicroRNA (miRNA) expression profiles are extensively used for discovering potential theranostic biomarkers of lung cancer. This work proposes an optimized support vector regression (SVR) method called SVR-LUAD to identify a set of miRNAs, referred to as the miRNA signature, for estimating the survival time of lung adenocarcinoma patients from their miRNA expression profiles. SVR-LUAD uses an inheritable bi-objective combinatorial genetic algorithm to identify a small set of informative miRNAs that cooperate with SVR by maximizing estimation accuracy. SVR-LUAD identified 18 out of 332 miRNAs using 10-fold cross-validation and achieved a correlation coefficient of 0.88 ± 0.01 and a mean absolute error of 0.56 ± 0.03 years between real and estimated survival time. SVR-LUAD performs well compared to some well-recognized regression methods. The miRNA signature consists of 18 miRNAs that strongly correlate with lung adenocarcinoma: hsa-let-7f-1, hsa-miR-16-1, hsa-miR-152, hsa-miR-217, hsa-miR-18a, hsa-miR-193b, hsa-miR-3136, hsa-let-7g, hsa-miR-155, hsa-miR-3199-1, hsa-miR-219-2, hsa-miR-1254, hsa-miR-1291, hsa-miR-192, hsa-miR-3653, hsa-miR-3934, hsa-miR-342, and hsa-miR-141. Gene ontology annotation and pathway analysis of the miRNA signature revealed its biological significance in cancer and cellular pathways. This miRNA signature could aid in the development of novel therapeutic approaches to the treatment of lung adenocarcinoma.
Affiliation(s)
- Shinn-Ying Ho
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
  - Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
8. Kang S, Odom OW, Thangamani S, Herrin DL. Toward mosquito control with a green alga: Expression of Cry toxins of Bacillus thuringiensis subsp. israelensis (Bti) in the chloroplast of Chlamydomonas. J Appl Phycol 2017; 29:1377-1389. [PMID: 28713202 PMCID: PMC5509220 DOI: 10.1007/s10811-016-1008-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 05/08/2023]
Abstract
We are developing Chlamydomonas strains that can be used for safe and sustainable control of mosquitoes, because they produce proteins from Bacillus thuringiensis subsp. israelensis (Bti) in the chloroplast. Chlamydomonas has a number of advantages for this approach, including genetic controls that are not generally available with industrial algae. The Bti toxin has been used for mosquito control for >30 years and does not engender resistance; it contains three Cry proteins, Cry4Aa (135 kDa), Cry4Ba (128 kDa) and Cry11Aa (72 kDa), and Cyt1Aa (25 kDa). To express the Cry proteins in the chloroplast, the three genes were resynthesized and cry4Aa was truncated to the first 700 amino acids (cry4Aa700); also, since the proteins can be toxic to host cells, the inducible Cyc6:Nac2-psbD expression system was used. Western blots of total protein from the chloroplast transformants showed accumulation of the intact polypeptides, and the relative expression level was Cry11Aa > Cry4Aa700 > Cry4Ba. Quantitative western blots with purified Cry11Aa as a standard showed that Cry11Aa accumulated to 0.35% of total cell protein. Live cell bioassays in dH2O demonstrated toxicity of the cry4Aa700 and cry11Aa transformants to larvae of Aedes aegypti and Culex quinquefasciatus. These results demonstrate that the Cry proteins that are most toxic to Aedes and Culex mosquitoes, Cry4Aa and Cry11Aa, can be successfully expressed in the chloroplast of Chlamydomonas.
Affiliation(s)
- Seongjoon Kang
  - Dept. of Molecular Biosciences, University of Texas at Austin, Austin, TX 78712, USA
  - Pond Life Technologies LLC, Cedar Park, TX 78613, USA
- Obed W. Odom
  - Dept. of Molecular Biosciences, University of Texas at Austin, Austin, TX 78712, USA
- Saravanan Thangamani
  - Dept. of Pathology, University of Texas Medical Branch, Galveston, TX 77555, USA
- David L. Herrin
  - Dept. of Molecular Biosciences, University of Texas at Austin, Austin, TX 78712, USA
  - Pond Life Technologies LLC, Cedar Park, TX 78613, USA
9. Bae YA. Codon Usage Patterns of Tyrosinase Genes in Clonorchis sinensis. Korean J Parasitol 2017; 55:175-183. [PMID: 28506040 PMCID: PMC5450960 DOI: 10.3347/kjp.2017.55.2.175] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/13/2017] [Revised: 04/05/2017] [Accepted: 04/06/2017] [Indexed: 11/28/2022]
Abstract
Codon usage bias (CUB) is a unique property of genomes and has contributed to a better understanding of the molecular features and evolutionary processes of particular genes. In this study, genetic indices associated with CUB, including relative synonymous codon usage and the effective number of codons, as well as nucleotide composition, were investigated in the Clonorchis sinensis tyrosinase genes and their platyhelminth orthologs, which play an important role in eggshell formation. The relative synonymous codon usage patterns differed substantially among the tyrosinase genes examined. In a neutrality analysis, the correlation between GC12 and GC3 was statistically significant, and the regression line had a relatively gradual slope (0.218). The NC-plot, i.e., GC3 vs effective number of codons (ENC), showed that most of the tyrosinase genes were below the expected curve. The codon adaptation index (CAI) values of the platyhelminth tyrosinases had a narrow distribution between 0.685/0.714 and 0.797/0.837, and were negatively correlated with their ENC. Taken together, these results suggest that CUB in the tyrosinase genes is governed mainly by selection pressure rather than mutational bias, although the latter provided an additional force in shaping the CUB of the C. sinensis and Opisthorchis viverrini genes. It was also apparent that the equilibrium point between selection pressure and mutational bias is much more inclined toward selection pressure in highly expressed C. sinensis genes than in poorly expressed genes.
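The "expected curve" in an NC-plot is commonly taken from Wright's formula, ENC = 2 + s + 29 / (s² + (1 − s)²), where s is the GC3 fraction; the abstract does not state which formula was used, so that choice is an assumption here. The sketch below also uses a simplified GC3 that counts every third codon position, whereas many tools exclude Met, Trp, and stop codons. Function names are hypothetical.

```python
def gc3(cds):
    """Fraction of G or C at third codon positions (simplified: all codons counted)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    third_bases = [cds[i + 2] for i in range(0, usable, 3)]
    if not third_bases:
        raise ValueError("sequence has no complete codon")
    return sum(base in "GC" for base in third_bases) / len(third_bases)


def expected_enc(gc3_fraction):
    """Expected ENC under mutational bias alone (Wright's curve, assumed here)."""
    s = gc3_fraction
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)


# Hypothetical toy sequence: codons ATA and GCC give GC3 = 0.5.
s = gc3("ATAGCC")
curve_value = expected_enc(s)
```

A gene whose observed ENC falls well below `expected_enc(gc3(...))` shows stronger codon bias than composition alone predicts, which is the pattern the abstract reads as evidence of selection.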
10. Baruah V, Satapathy SS, Powdel BR, Konwarh R, Buragohain AK, Ray SK. Comparative analysis of codon usage bias in Crenarchaea and Euryarchaea genome reveals differential preference of synonymous codons to encode highly expressed ribosomal and RNA polymerase proteins. J Genet 2016; 95:537-49. [PMID: 27659324 DOI: 10.1007/s12041-016-0667-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 01/15/2023]
11. Comparisons between Arabidopsis thaliana and Drosophila melanogaster in relation to Coding and Noncoding Sequence Length and Gene Expression. Int J Genomics 2015; 2015:269127. [PMID: 26114098 PMCID: PMC4465843 DOI: 10.1155/2015/269127] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 03/18/2015] [Accepted: 05/11/2015] [Indexed: 11/24/2022]
Abstract
There is a continuing interest in the analysis of gene architecture and gene expression to determine the relationship that may exist. Advances in high-quality sequencing technologies and large-scale resource datasets have increased the understanding of relationships and cross-referencing of expression data to the large genome data. Although a negative correlation between expression level and gene (especially transcript) length has been generally accepted, there have been some conflicting results arising from the literature concerning the impacts of different regions of genes, and the underlying reason is not well understood. The research aims to apply quantile regression techniques for statistical analysis of coding and noncoding sequence length and gene expression data in the plant, Arabidopsis thaliana, and fruit fly, Drosophila melanogaster, to determine if a relationship exists and if there is any variation or similarities between these species. The quantile regression analysis found that the coding sequence length and gene expression correlations varied, and similarities emerged for the noncoding sequence length (5′ and 3′ UTRs) between animal and plant species. In conclusion, the information described in this study provides the basis for further exploration into gene regulation with regard to coding and noncoding sequence length.
12. An unsupervised approach to predict functional relations between genes based on expression data. Biomed Res Int 2014; 2014:154594. [PMID: 24800208 PMCID: PMC3988973 DOI: 10.1155/2014/154594] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 11/01/2013] [Revised: 01/31/2014] [Accepted: 02/03/2014] [Indexed: 11/17/2022]
Abstract
This work presents a novel approach to predicting functional relations between genes using gene expression data. Genes may have various types of relations between them, for example, regulatory relations, or they may be involved in the same protein complex or metabolic/signaling pathway, and gene expression data should contain some clues to such relations. The present approach first digitizes the log-ratio type gene expression data of S. cerevisiae into a matrix consisting of 1, 0, and −1, indicating highly expressed, no major change, and highly suppressed conditions for genes, respectively. For each gene pair, a probability density mass function table is constructed containing nine joint probabilities. Gene pairs are then selected based on the linear and probabilistic relation between their profiles, indicated by the sum of probability density masses at selected points. The selected gene pairs share many Gene Ontology terms. Furthermore, a network is constructed by selecting a large number of gene pairs based on FDR analysis, and clustering of the network generates many modules rich in genes of similar function. Also, the promoters of the gene sets in many modules are rich in binding sites of known transcription factors, indicating the effectiveness of the proposed approach in predicting regulatory relations.
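The digitization step and the nine-cell joint table described in this abstract can be illustrated roughly as below. The cut-off of ±1.0 on the log ratio, the function names, and the toy profiles are all assumptions made for the example; the abstract does not give the thresholds or the selection rule actually used.

```python
from collections import Counter

STATES = (-1, 0, 1)  # suppressed, no major change, highly expressed


def digitize(log_ratios, threshold=1.0):
    """Map each log-ratio to 1, 0, or -1 using a symmetric (assumed) threshold."""
    return [1 if x >= threshold else -1 if x <= -threshold else 0
            for x in log_ratios]


def joint_table(profile_a, profile_b):
    """Joint probabilities for the nine (state_a, state_b) combinations."""
    if len(profile_a) != len(profile_b) or not profile_a:
        raise ValueError("profiles must be non-empty and of equal length")
    counts = Counter(zip(profile_a, profile_b))
    n = len(profile_a)
    return {(i, j): counts[(i, j)] / n for i in STATES for j in STATES}


# Hypothetical log-ratio profiles for two genes across four conditions.
gene_a = digitize([1.5, 0.2, -2.0, 1.1])
gene_b = digitize([2.0, -0.3, -1.4, -1.2])
table = joint_table(gene_a, gene_b)

# Mass on the diagonal cells hints at a positive (co-expression-like) relation.
diagonal_mass = sum(table[(s, s)] for s in STATES)
```

Summing the mass at chosen cells, such as the diagonal here, gives a single score per gene pair that could then be thresholded; which cells to sum is a design choice left open by the abstract.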
13. Hybrid approach for predicting coreceptor used by HIV-1 from its V3 loop amino acid sequence. PLoS One 2013; 8:e61437. [PMID: 23596523 PMCID: PMC3626595 DOI: 10.1371/journal.pone.0061437] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Received: 04/04/2012] [Accepted: 03/13/2013] [Indexed: 12/18/2022]
Abstract
Background HIV-1 infects the host cell by interacting with the primary receptor CD4 and a coreceptor CCR5 or CXCR4. Maraviroc, a CCR5 antagonist binds to CCR5 receptor. Thus, it is important to identify the coreceptor used by the HIV strains dominating in the patient. In past, a number of experimental assays and in-silico techniques have been developed for predicting the coreceptor tropism. The prediction accuracy of these methods is excellent when predicting CCR5(R5) tropic sequences but is relatively poor for CXCR4(X4) tropic sequences. Therefore, any new method for accurate determination of coreceptor usage would be of paramount importance to the successful management of HIV-infected individuals. Results The dataset used in this study comprised 1799 R5-tropic and 598 X4-tropic third variable (V3) sequences of HIV-1. We compared the amino acid composition of both types of V3 sequences and observed that certain types of residues, e.g., Asparagine and Isoleucine, were preferred in R5-tropic sequences whereas residues like Lysine, Arginine, and Tryptophan were preferred in X4-tropic sequences. Initially, Support Vector Machine-based models were developed using amino acid composition, dipeptide composition, and split amino acid composition, which achieved accuracy up to 90%. We used BLAST to discriminate R5- and X4-tropic sequences and correctly predicted 93.16% of R5- and 75.75% of X4-tropic sequences. In order to improve the prediction accuracy, a Hybrid model was developed that achieved 91.66% sensitivity, 81.77% specificity, 89.19% accuracy and 0.72 Matthews Correlation Coefficient. The performance of our models was also evaluated on an independent dataset (256 R5- and 81 X4-tropic sequences) and achieved maximum accuracy of 84.87% with Matthews Correlation Coefficient 0.63. Conclusion This study describes a highly efficient method for predicting HIV-1 coreceptor usage from V3 sequences. 
In order to provide a service to the scientific community, a webserver HIVcoPred was developed (http://www.imtech.res.in/raghava/hivcopred/) for predicting the coreceptor usage.
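The amino acid and dipeptide composition features named in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy sequence are hypothetical, and percent normalisation is an assumption about how the composition vectors are scaled.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Percent occurrence of each standard residue (20 features)."""
    seq = seq.upper()
    n = len(seq)
    return {aa: 100.0 * seq.count(aa) / n for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Percent occurrence of each ordered residue pair (400 features)."""
    seq = seq.upper()
    # Overlapping windows of length 2: a sequence of length L has L - 1 pairs.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for pair in pairs:
        if pair in counts:  # skip pairs containing non-standard symbols
            counts[pair] += 1
    return {dp: 100.0 * c / len(pairs) for dp, c in counts.items()}


# Hypothetical toy sequence, chosen only to keep the arithmetic visible.
toy = "ACDA"
aac = amino_acid_composition(toy)
dpc = dipeptide_composition(toy)
```

Either dictionary can be flattened in a fixed key order to give the fixed-length vector that a classifier such as an SVM expects, regardless of sequence length.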
|
14
|
Gautam A, Chaudhary K, Kumar R, Sharma A, Kapoor P, Tyagi A, Raghava GPS. In silico approaches for designing highly effective cell penetrating peptides. J Transl Med 2013; 11:74. [PMID: 23517638 PMCID: PMC3615965 DOI: 10.1186/1479-5876-11-74] [Citation(s) in RCA: 207] [Impact Index Per Article: 18.8] [Received: 10/09/2012] [Accepted: 03/11/2013] [Indexed: 11/23/2022] Open
Abstract
Background Cell penetrating peptides have gained much recognition as a versatile transport vehicle for the intracellular delivery of a wide range of cargoes (e.g. oligonucleotides, small molecules, proteins) that otherwise lack bioavailability, thus offering great potential as future therapeutics. Keeping in mind the therapeutic importance of these peptides, we have developed in silico methods for the prediction of cell penetrating peptides, which can be used for rapid screening of such peptides prior to their synthesis. Methods In the present study, support vector machine (SVM)-based models have been developed for predicting and designing highly effective cell penetrating peptides. Various features like amino acid composition, dipeptide composition, binary profile of patterns, and physicochemical properties have been used as input features. The main dataset used in this study consists of 708 peptides. In addition, we have identified various motifs in cell penetrating peptides, and used these motifs for developing a hybrid prediction model. Performance of our method was evaluated on an independent dataset and also compared with that of the existing methods. Results In cell penetrating peptides, certain residues (e.g. Arg, Lys, Pro, Trp, Leu, and Ala) are preferred at specific locations. Thus, it was possible to discriminate cell-penetrating peptides from non-cell penetrating peptides based on amino acid composition. All models were evaluated using a five-fold cross-validation technique. We have achieved a maximum accuracy of 97.40% using the hybrid model that combines motif information and binary profile of the peptides. On an independent dataset, we achieved a maximum accuracy of 81.31% with an MCC of 0.63. Conclusion The present study demonstrates that features like amino acid composition, binary profile of patterns and motifs can be used to train an SVM classifier that can predict cell penetrating peptides with higher accuracy.
The hybrid model described in this study achieved higher accuracy than the previous methods and thus may complement the existing methods. Based on the above study, a user-friendly web server, CellPPD, has been developed to help biologists predict and design CPPs with ease. The CellPPD web server is freely accessible at http://crdd.osdd.net/raghava/cellppd/.
Affiliation(s)
- Ankur Gautam
- Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh 160036, India
|
15
|
Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition. BMC Bioinformatics 2012; 13 Suppl 17:S3. [PMID: 23282103 PMCID: PMC3521471 DOI: 10.1186/1471-2105-13-s17-s3] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.9] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Existing methods for predicting protein solubility on overexpression in Escherichia coli advance performance by using ensemble classifiers such as two-stage support vector machine (SVM) based classifiers and a number of feature types such as physicochemical properties, amino acid and dipeptide composition, accompanied by feature selection. It is desirable to develop a simple and easily interpretable method for predicting protein solubility, compared to existing complex SVM-based methods. RESULTS This study proposes a novel scoring card method (SCM) that uses dipeptide composition only to estimate solubility scores of sequences for predicting protein solubility. SCM calculates the propensities of 400 individual dipeptides to be soluble using statistical discrimination between soluble and insoluble proteins of a training data set. Subsequently, the propensity scores of all dipeptides are further optimized using an intelligent genetic algorithm. The solubility score of a sequence is determined by the weighted sum of all propensity scores and dipeptide composition. To evaluate SCM by performance comparisons, four data sets with different sizes and variation degrees of experimental conditions were used. The results show that the simple method SCM with interpretable propensities of dipeptides has promising performance, compared with existing SVM-based ensemble methods with a number of feature types. Furthermore, the propensities of dipeptides and solubility scores of sequences can provide insights into protein solubility. For example, the analysis of dipeptide scores shows a high propensity of α-helix structure and thermophilic proteins to be soluble. CONCLUSIONS The propensities of individual dipeptides to be soluble vary for proteins under altered experimental conditions.
To predict protein solubility accurately using SCM, it is better to customize the score card of dipeptide propensities by using a training data set obtained under the same specified experimental conditions. The proposed method SCM with solubility scores and dipeptide propensities can be easily applied to protein function prediction problems in which dipeptide composition features play an important role. AVAILABILITY The datasets used, source codes of SCM, and supplementary files are available at http://iclab.life.nctu.edu.tw/SCM/.
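The scoring step described in this abstract, a weighted sum of dipeptide propensity scores and dipeptide composition, can be sketched as below. This is an illustration under stated assumptions rather than the published SCM implementation: the propensity values are invented toy numbers, the function names are hypothetical, and the statistical estimation and genetic-algorithm optimisation of the score card are not shown.

```python
def dipeptide_fractions(seq):
    """Fraction of each overlapping dipeptide observed in the sequence."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    fractions = {}
    for pair in pairs:
        fractions[pair] = fractions.get(pair, 0.0) + 1.0 / len(pairs)
    return fractions


def scm_score(seq, propensity, default=0.0):
    """Weighted sum: composition of each dipeptide times its propensity."""
    return sum(
        fraction * propensity.get(dipeptide, default)
        for dipeptide, fraction in dipeptide_fractions(seq).items()
    )


# Toy score card (assumed values, not taken from the paper).
toy_propensity = {"AK": 800.0, "KA": 600.0, "AA": 200.0}
score = scm_score("AKAA", toy_propensity)

# A sequence would then be called soluble when its score exceeds a threshold
# chosen on training data; 500.0 here is purely hypothetical.
predicted_soluble = score > 500.0
```

Because the score is linear in the propensities, each dipeptide's contribution to a prediction can be read off directly, which is the interpretability argument the abstract makes against multi-feature SVM ensembles.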
|
16
|
Song J, Tan H, Wang M, Webb GI, Akutsu T. TANGLE: two-level support vector regression approach for protein backbone torsion angle prediction from primary sequences. PLoS One 2012; 7:e30361. [PMID: 22319565 PMCID: PMC3271071 DOI: 10.1371/journal.pone.0030361] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Received: 10/05/2011] [Accepted: 12/14/2011] [Indexed: 12/29/2022] Open
Abstract
Protein backbone torsion angles (Phi) and (Psi) involve two rotation angles rotating around the Cα-N bond (Phi) and the Cα-C bond (Psi). Due to the planarity of the linked rigid peptide bonds, these two angles can essentially determine the backbone geometry of proteins. Accordingly, the accurate prediction of protein backbone torsion angles from sequence information can assist the prediction of protein structures. In this study, we develop a new approach called TANGLE (Torsion ANGLE predictor) to predict the protein backbone torsion angles from amino acid sequences. TANGLE uses a two-level support vector regression approach to perform real-value torsion angle prediction using a variety of features derived from amino acid sequences, including the evolutionary profiles in the form of position-specific scoring matrices, predicted secondary structure, solvent accessibility and natively disordered region as well as other global sequence features. When evaluated on a large benchmark dataset of 1,526 non-homologous proteins, the mean absolute errors (MAEs) of the Phi and Psi angle predictions are 27.8° and 44.6°, respectively, which are 1% and 3% lower, respectively, than those obtained using ANGLOR, one of the state-of-the-art prediction tools. Moreover, the prediction of TANGLE is significantly better than a random predictor that was built on an amino acid-specific basis, with p-values < 1.46e-147 and 7.97e-150, respectively, by the Wilcoxon signed rank test. As a complementary approach to the current torsion angle prediction algorithms, TANGLE should prove useful in predicting protein structural properties and assisting protein fold recognition by applying the predicted torsion angles as useful restraints. TANGLE is freely accessible at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/TANGLE/.
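The mean absolute error reported for Phi and Psi can be computed as in the sketch below. Treating the angles as periodic, so that 175° and -175° are 10° apart rather than 350°, is an assumption made here for illustration; the abstract does not say how the paper handles wrap-around. The function names and the sample angles are hypothetical.

```python
def angular_error(pred_deg, obs_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = abs(pred_deg - obs_deg) % 360.0
    return min(diff, 360.0 - diff)


def mean_absolute_error(pred, obs):
    """Average periodic error over paired predicted and observed angles."""
    if len(pred) != len(obs):
        raise ValueError("pred and obs must have the same length")
    return sum(angular_error(p, o) for p, o in zip(pred, obs)) / len(pred)


# Made-up Phi values for three residues; the last pair straddles +/-180 degrees.
pred_phi = [-60.0, -120.0, 175.0]
obs_phi = [-65.0, -110.0, -175.0]
mae = mean_absolute_error(pred_phi, obs_phi)
```

Without the wrap-around step the third residue would contribute an error of 350° instead of 10°, which would inflate the MAE for residues near the ±180° boundary.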
Affiliation(s)
- Jiangning Song
- Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, Victoria, Australia
- National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan
- * E-mail: (JS); (GIW); (TA)
- Hao Tan
- Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, Victoria, Australia
- Mingjun Wang
- National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
- Geoffrey I. Webb
- Faculty of Information Technology, Monash University, Melbourne, Victoria, Australia
- * E-mail: (JS); (GIW); (TA)
- Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan
- * E-mail: (JS); (GIW); (TA)
|
17
|
Identification of mannose interacting residues using local composition. PLoS One 2011; 6:e24039. [PMID: 21931639 PMCID: PMC3172211 DOI: 10.1371/journal.pone.0024039] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Received: 03/26/2011] [Accepted: 07/29/2011] [Indexed: 01/24/2023] Open
Abstract
Background Mannose binding proteins (MBPs) play a vital role in several biological functions such as defense mechanisms. These proteins bind to mannose on the surface of a wide range of pathogens and help in eliminating these pathogens from our body. Thus, it is important to identify mannose interacting residues (MIRs) in order to understand the mechanism of recognition of pathogens by MBPs. Results This paper describes modules developed for predicting MIRs in a protein. Support vector machine (SVM) based models have been developed on 120 mannose binding protein chains, where no two chains have more than 25% sequence similarity. SVM models were developed on two types of datasets: 1) the main dataset, consisting of 1029 mannose interacting and 1029 non-interacting residues, and 2) the realistic dataset, consisting of 1029 mannose interacting and 10320 non-interacting residues. In this study, firstly, we developed standard modules using binary and PSSM profiles of patterns and obtained a maximum MCC of around 0.32. Secondly, we developed SVM modules using composition profiles of patterns and achieved a maximum MCC of around 0.74 with an accuracy of 86.64% on the main dataset. Thirdly, we developed a model on the realistic dataset and achieved a maximum MCC of 0.62 with an accuracy of 93.08%. Based on this study, a standalone program and web server have been developed for predicting mannose interacting residues in proteins (http://www.imtech.res.in/raghava/premier/). Conclusions Compositional analysis of mannose interacting and non-interacting residues shows that certain types of residues are preferred in mannose interaction. It was also observed that residues around mannose interacting residues have a preference for certain types of residues. Composition of patterns/peptide/segment has been used for predicting MIRs and achieved reasonably high accuracy. It is possible that this novel strategy may be effective for predicting other types of interacting residues.
This study will be useful in annotating the function of protein as well as in understanding the role of mannose in the immune system.
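The abstract above reports higher accuracy but lower MCC on the imbalanced realistic dataset (93.08%, 0.62) than on the balanced main dataset (86.64%, 0.74). The sketch below shows how accuracy and the Matthews correlation coefficient are computed from a confusion matrix and why they can diverge. The counts are invented for illustration, roughly mimicking a 1:10 class ratio; they are not the paper's results.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts: 100 positives and 1000 negatives.
metrics = binary_metrics(tp=80, tn=900, fp=100, fn=20)
```

With these toy counts accuracy is about 0.89 while MCC is only about 0.54, because the many correctly rejected negatives dominate accuracy but the 100 false positives outnumber the 80 true positives.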
|
18
|
Van Damme P, Hole K, Pimenta-Marques A, Helsens K, Vandekerckhove J, Martinho RG, Gevaert K, Arnesen T. NatF contributes to an evolutionary shift in protein N-terminal acetylation and is important for normal chromosome segregation. PLoS Genet 2011; 7:e1002169. [PMID: 21750686 PMCID: PMC3131286 DOI: 10.1371/journal.pgen.1002169] [Citation(s) in RCA: 146] [Impact Index Per Article: 11.2] [Received: 01/17/2011] [Accepted: 05/20/2011] [Indexed: 01/31/2023] Open
Abstract
N-terminal acetylation (N-Ac) is a highly abundant eukaryotic protein modification. Proteomics revealed a significant increase in the occurrence of N-Ac from lower to higher eukaryotes, but evidence explaining the underlying molecular mechanism(s) is currently lacking. We first analysed protein N-termini and their acetylation degrees, suggesting that evolution of substrates is not a major cause for the evolutionary shift in N-Ac. Further, we investigated the presence of putative N-terminal acetyltransferases (NATs) in higher eukaryotes. The purified recombinant human and Drosophila homologues of a novel NAT candidate were subjected to in vitro peptide library acetylation assays. This provided evidence for its NAT activity targeting Met-Lys- and other Met-starting protein N-termini, and the enzyme was termed Naa60p and its activity NatF. Its in vivo activity was investigated by ectopically expressing human Naa60p in yeast followed by N-terminal COFRADIC analyses. hNaa60p acetylated distinct Met-starting yeast protein N-termini and increased general acetylation levels, thereby altering yeast in vivo acetylation patterns towards those of higher eukaryotes. Further, its activity in human cells was verified by overexpression and knockdown of hNAA60 followed by N-terminal COFRADIC. NatF's cellular impact was demonstrated in Drosophila cells where NAA60 knockdown induced chromosomal segregation defects. In summary, our study revealed a novel major protein modifier contributing to the evolution of N-Ac, redundancy among NATs, and an essential regulator of normal chromosome segregation. With the characterization of NatF, the co-translational N-Ac machinery appears complete since all the major substrate groups in eukaryotes are accounted for. Small chemical groups are commonly attached to proteins in order to control their activity, localization, and stability.
An abundant protein modification is N-terminal acetylation, in which an N-terminal acetyltransferase (NAT) catalyzes the transfer of an acetyl group to the very N-terminal amino acid of the protein. When going from lower to higher eukaryotes there is a significant increase in the occurrence of N-terminal acetylation. We demonstrate here that this is partly because higher eukaryotes uniquely express NatF, an enzyme capable of acetylating a large group of protein N-termini including those previously found to display an increased N-acetylation potential in higher eukaryotes. Thus, the current study has possibly identified the last major component of the eukaryotic machinery responsible for co-translational N-acetylation of proteins. All eukaryotic proteins start with methionine, which is co-translationally cleaved when the second amino acid is small. Thereafter, NatA may acetylate these newly exposed N-termini. Interestingly, NatF also has the potential to act on these types of N-termini where the methionine was not cleaved. At the cellular level, we further found that NatF is essential for normal chromosome segregation during cell division.
Affiliation(s)
- Petra Van Damme
- Department of Medical Protein Research, Ghent University, Ghent, Belgium
- Department of Biochemistry, Ghent University, Ghent, Belgium
- Kristine Hole
- Department of Molecular Biology, University of Bergen, Bergen, Norway
- Department of Surgical Sciences, University of Bergen, Bergen, Norway
- Kenny Helsens
- Department of Medical Protein Research, Ghent University, Ghent, Belgium
- Department of Biochemistry, Ghent University, Ghent, Belgium
- Joël Vandekerckhove
- Department of Medical Protein Research, Ghent University, Ghent, Belgium
- Department of Biochemistry, Ghent University, Ghent, Belgium
- Kris Gevaert
- Department of Medical Protein Research, Ghent University, Ghent, Belgium
- Department of Biochemistry, Ghent University, Ghent, Belgium
- Thomas Arnesen
- Department of Molecular Biology, University of Bergen, Bergen, Norway
- Department of Surgery, Haukeland University Hospital, Bergen, Norway
- * E-mail:
|
19
|
Ivanisenko VA, Demenkov PS, Ivanisenko TV, Kolchanov NA. [Protein Structure Discovery: software package to perform computational proteomics tasks]. RUSSIAN JOURNAL OF BIOORGANIC CHEMISTRY 2011; 37:22-35. [PMID: 21460878 DOI: 10.1134/s1068162011010080] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/23/2022]
Abstract
The software and information system Protein Structure Discovery was developed. The system can be used for a wide range of tasks in the field of computational proteomics, including prediction of the function, structure and immunological properties of proteins. A specially created section of the system allows the quantitative and qualitative effects of mutations on the structural and functional properties of proteins to be evaluated. There are 19 different programs integrated into the system, including the database of protein functional sites PDBSite, the PDBSiteScan program for the prediction of functional sites in three-dimensional structures of proteins, and the WebProAnalyst program for the quantitative analysis of the structure-activity relationship of proteins. Protein Structure Discovery has a Web interface and is available to users through the Internet (http://www-bionet.sscc.ru/psd/). For example, recognition of zinc ion and ADP binding sites in model structures showed that the PDBSiteScan method is highly robust to errors in the reconstruction of the spatial structures of proteins.
|
20
|
Panwar B, Raghava GPS. Predicting sub-cellular localization of tRNA synthetases from their primary structures. Amino Acids 2011; 42:1703-13. [DOI: 10.1007/s00726-011-0872-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 10/11/2010] [Accepted: 02/21/2011] [Indexed: 11/25/2022]
|
21
|
Misawa K, Kikuno RF. Relationship between amino acid composition and gene expression in the mouse genome. BMC Res Notes 2011; 4:20. [PMID: 21272306 PMCID: PMC3038927 DOI: 10.1186/1756-0500-4-20] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 06/25/2010] [Accepted: 01/27/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Codon bias is a phenomenon that refers to the differences in the frequencies of synonymous codons among different genes. In many organisms, natural selection is considered to be a cause of codon bias because codon usage in highly expressed genes is biased toward optimal codons. Methods have previously been developed to predict the expression level of genes from their nucleotide sequences, which is based on the observation that synonymous codon usage shows an overall bias toward a few codons called major codons. However, the relationship between codon bias and gene expression level, as proposed by the translation-selection model, is less evident in mammals. FINDINGS We investigated the correlations between the expression levels of 1,182 mouse genes and amino acid composition, as well as between gene expression and codon preference. We found that a weak but significant correlation exists between gene expression levels and amino acid composition in mouse. In total, less than 10% of variation of expression levels is explained by amino acid components. We found the effect of codon preference on gene expression was weaker than the effect of amino acid composition, because no significant correlations were observed with respect to codon preference. CONCLUSION These results suggest that it is difficult to predict expression level from amino acid components or from codon bias in mouse.
Affiliation(s)
- Kazuharu Misawa
- Research Program for Computational Science, Research and Development Group for Next-Generation Integrated Living Matter Simulation, Fusion of Data and Analysis Research and Development Team, RIKEN, 4-6-1 Shirokane-dai, Minato-ku, Tokyo 108-8639, Japan.
|
22
|
Panwar B, Raghava GPS. Prediction and classification of aminoacyl tRNA synthetases using PROSITE domains. BMC Genomics 2010; 11:507. [PMID: 20860794 PMCID: PMC2997003 DOI: 10.1186/1471-2164-11-507] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 01/15/2010] [Accepted: 09/22/2010] [Indexed: 12/02/2022] Open
Abstract
Background Aminoacyl tRNA synthetases (aaRSs) catalyse the first step of protein synthesis in all organisms. They are responsible for the precise attachment of amino acids to their cognate transfer RNAs. There are twenty different types of aaRSs, one for each amino acid. These aaRSs have been divided into two classes, each comprising ten enzymes. It is important to predict and classify aaRSs in order to understand protein synthesis. Results In this study, all models were developed on a non-redundant dataset containing 117 aaRSs and an equal number of non-aaRSs, in which no two sequences have more than 30% similarity. First, we applied the similarity search technique, BLAST, and achieved a maximum accuracy of 67.52%. We observed that 62% of tRNA synthetases contain one or more domains from amongst the following four PROSITE domains: PS50862, PS00178, PS50860 and PS50861. An SVM-based model was developed to discriminate between aaRSs and non-aaRSs, and achieved a maximum MCC of 0.68 with an accuracy of 83.73%, using selected dipeptide composition. We developed a hybrid approach and achieved a maximum MCC of 0.72 with an accuracy of 85.49%, where the SVM model was developed using selected dipeptide composition and information on the four PROSITE domains. We further developed an SVM-based model for classifying the aaRSs into class-1 and class-2, using selected dipeptide composition, and achieved an MCC of 0.79. We also observed that two domains (PS00178, PS50889) in class-1 and three domains (PS50862, PS50860, PS50861) in class-2 were preferred. A hybrid method was developed using these domains as descriptors, along with selected dipeptide composition, and achieved an MCC of 0.87 with a sensitivity of 94.55% and an accuracy of 93.19%. All models were evaluated using a five-fold cross-validation technique. Conclusions We have analyzed protein sequences of aaRSs (class-1 and class-2) and non-aaRSs and identified interesting patterns.
The high accuracy achieved by our SVM models using selected dipeptide composition demonstrates that certain types of dipeptide are preferred in aaRSs. We were able to identify PROSITE domains that are preferred in aaRSs and their classes, providing interesting insights into tRNA synthetases. The method developed in this study will be useful for researchers studying aaRS enzymes and tRNA biology. The web server based on this study is available at http://www.imtech.res.in/raghava/icaars/.
Affiliation(s)
- Bharat Panwar
- Bioinformatics Centre, Institute of Microbial Technology, Chandigarh, India
|
23
|
Metabolic flux distributions: genetic information, computational predictions, and experimental validation. Appl Microbiol Biotechnol 2010; 86:1243-55. [DOI: 10.1007/s00253-010-2506-6] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Received: 12/22/2009] [Revised: 02/10/2010] [Accepted: 02/11/2010] [Indexed: 01/15/2023]
|
24
|
Liu X, Zhang J, Ni F, Dong X, Han B, Han D, Ji Z, Zhao Y. Genome wide exploration of the origin and evolution of amino acids. BMC Evol Biol 2010; 10:77. [PMID: 20230639 PMCID: PMC2853539 DOI: 10.1186/1471-2148-10-77] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 02/15/2009] [Accepted: 03/15/2010] [Indexed: 11/10/2022] Open
Abstract
Background Even after years of exploration, the terrestrial origin of bio-molecules remains unsolved and controversial. Today, observation of amino acid composition in proteins has become an alternative way for a global understanding of the mystery encoded in whole genomes and seeking clues for the origin of amino acids. Results In this study, we statistically monitored the frequencies of 20 alpha-amino acids in 549 taxa from three kingdoms of life: archaebacteria, eubacteria, and eukaryotes. We found that the amino acids evolved independently in these three kingdoms; but, conserved linkages were observed in two groups of amino acids, (A, G, H, L, P, Q, R, and W) and (F, I, K, N, S, and Y). Moreover, the amino acids encoded by GC-poor codons (F, Y, N, K, I, and M) were found to "lose" their usage in the development from single cell eukaryotic organisms like S. cerevisiae to H. sapiens, while the amino acids encoded by GC-rich codons (P, A, G, and W) were found to gain usage. These findings further support the co-evolution hypothesis of amino acids and genetic codes. Conclusion We proposed a new chronological order of the appearance of amino acids (L, A, V/E/G, S, I, K, T, R/D, P, N, F, Q, Y, M, H, W, C). Two conserved evolutionary paths of amino acids were also suggested: A→G→R→P and K→Y.
Affiliation(s)
- Xiaoxia Liu
- The Key Laboratory for Chemical Biology of Fujian Province, Department of Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, Fujian, PR China
|
25
|
Song J, Tan H, Shen H, Mahmood K, Boyd SE, Webb GI, Akutsu T, Whisstock JC. Cascleave: towards more accurate prediction of caspase substrate cleavage sites. Bioinformatics 2010; 26:752-60. [PMID: 20130033 DOI: 10.1093/bioinformatics/btq043] [Citation(s) in RCA: 132] [Impact Index Per Article: 9.4] [Indexed: 01/21/2023]
Abstract
MOTIVATION The caspase family of cysteine proteases plays essential roles in key biological processes such as programmed cell death, differentiation, proliferation, necrosis and inflammation. The complete repertoire of caspase substrates remains to be fully characterized. Accordingly, systematic computational screening studies of caspase substrate cleavage sites may provide insight into the substrate specificity of caspases and further facilitate the discovery of putative novel substrates. RESULTS In this article we develop an approach (termed Cascleave) to predict both classical (i.e. following a P(1) Asp) and non-typical caspase cleavage sites. When using local sequence-derived profiles, Cascleave successfully predicted 82.2% of the known substrate cleavage sites, with a Matthews correlation coefficient (MCC) of 0.667. We found that prediction performance could be further improved by incorporating information such as predicted solvent accessibility and whether a cleavage sequence lies in a region that is most likely natively unstructured. Novel bi-profile Bayesian signatures were found to significantly improve the prediction performance and yielded the best performance with an overall accuracy of 87.6% and an MCC of 0.747, which is higher accuracy than published methods that essentially rely on amino acid sequence alone. It is anticipated that Cascleave will be a powerful tool for predicting novel substrate cleavage sites of caspases and shedding new light on the unknown caspase-substrate interactivity relationship. AVAILABILITY http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/Cascleave/ CONTACT jiangning.song@med.monash.edu.au; takutsu@kuicr.kyoto-u.ac.jp; james.whisstock@med.monash.edu.au SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jiangning Song
- Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.
|
26
|
Song J, Tan H, Mahmood K, Law RHP, Buckle AM, Webb GI, Akutsu T, Whisstock JC. Prodepth: predict residue depth by support vector regression approach from protein sequences only. PLoS One 2009; 4:e7072. [PMID: 19759917 PMCID: PMC2742725 DOI: 10.1371/journal.pone.0007072] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.3] [Received: 04/14/2009] [Accepted: 08/20/2009] [Indexed: 11/24/2022] Open
Abstract
Residue depth (RD) is a solvent exposure measure that complements the information provided by conventional accessible surface area (ASA) and describes to what extent a residue is buried in the protein structure space. Previous studies have established that RD is correlated with several protein properties, such as protein stability, residue conservation and amino acid types. Accurate prediction of RD has many potentially important applications in the field of structural bioinformatics, for example, facilitating the identification of functionally important residues, or residues in the folding nucleus, or enzyme active sites from sequence information. In this work, we introduce an efficient approach that uses support vector regression to quantify the relationship between RD and protein sequence. We systematically investigated eight different sequence encoding schemes including both local and global sequence characteristics and examined their respective prediction performances. For the objective evaluation of our approach, we used 5-fold cross-validation to assess the prediction accuracies and showed that the overall best performance could be achieved with a correlation coefficient (CC) of 0.71 between the observed and predicted RD values and a root mean square error (RMSE) of 1.74, after incorporating the relevant multiple sequence features. The results suggest that residue depth could be reliably predicted solely from protein primary sequences: local sequence environments are the major determinants, while global sequence features could influence the prediction performance marginally. We highlight two examples as a comparison in order to illustrate the applicability of this approach. We also discuss the potential implications of this new structural parameter in the field of protein structure prediction and homology modeling. This method might prove to be a powerful tool for sequence analysis.
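The two evaluation measures quoted in this abstract, the correlation coefficient between observed and predicted residue depth and the root mean square error, can be computed as in the sketch below. Reading "correlation coefficient" as Pearson's r is an assumption, and the depth values are invented for illustration.

```python
import math


def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def rmse(x, y):
    """Root mean square error between paired values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Made-up residue depths: every prediction is offset by the same amount.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
cc = pearson_cc(observed, predicted)
error = rmse(observed, predicted)
```

In this toy case the correlation is perfect while the RMSE is 0.5, which is why the two measures are reported together: correlation captures whether the ranking of buried versus exposed residues is right, and RMSE captures how far the predicted values are from the observed ones.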
Affiliation(s)
- Jiangning Song
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, Japan
- * E-mail: (JS); (JCW)
- Hao Tan
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- Khalid Mahmood
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- ARC Centre of Excellence for Structural and Functional Microbial Genomics, Monash University, Clayton, Melbourne, Victoria, Australia
- Ruby H. P. Law
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- Ashley M. Buckle
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- Geoffrey I. Webb
- Faculty of Information Technology, Monash University, Clayton, Melbourne, Victoria, Australia
- Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, Japan
- James C. Whisstock
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- ARC Centre of Excellence for Structural and Functional Microbial Genomics, Monash University, Clayton, Melbourne, Victoria, Australia
- * E-mail: (JS); (JCW)
Collapse
|
27
|
Tartaglia GG, Pechmann S, Dobson CM, Vendruscolo M. A relationship between mRNA expression levels and protein solubility in E. coli. J Mol Biol 2009; 388:381-9. [PMID: 19281824 DOI: 10.1016/j.jmb.2009.03.002] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2008] [Revised: 02/26/2009] [Accepted: 03/03/2009] [Indexed: 10/21/2022]
Abstract
Each step in the process of gene expression, from the transcription of DNA into mRNA to the folding and posttranslational modification of proteins, is regulated by complex cellular mechanisms. At the same time, stringent conditions on the physicochemical properties of proteins, and hence on the nature of their amino acids, are imposed by the need to avoid aggregation at the concentrations required for optimal cellular function. A relationship is therefore expected to exist between mRNA expression levels and protein solubility in the cell. By investigating such a relationship, we formulate a method that enables the prediction of the maximal levels of mRNA expression in Escherichia coli with an accuracy of 83% and of the solubility of recombinant human proteins expressed in E. coli with an accuracy of 86%.
Affiliation(s)
- Gian Gaetano Tartaglia
- Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK.
|
28
|
Miura F, Kawaguchi N, Yoshida M, Uematsu C, Kito K, Sakaki Y, Ito T. Absolute quantification of the budding yeast transcriptome by means of competitive PCR between genomic and complementary DNAs. BMC Genomics 2008; 9:574. [PMID: 19040753 PMCID: PMC2612024 DOI: 10.1186/1471-2164-9-574] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2008] [Accepted: 11/29/2008] [Indexed: 11/10/2022] Open
Abstract
Background An ideal format to describe transcriptome would be its composition measured on the scale of absolute numbers of individual mRNAs per cell. It would help not only to precisely grasp the structure of the transcriptome but also to accelerate data exchange and integration. Results We conceived an idea of competitive PCR between genomic DNA and cDNA. Since the former contains every gene exactly at the same copy number, it can serve as an ideal normalization standard for the latter to obtain stoichiometric composition data of the transcriptome. This data can then be easily converted to absolute quantification data provided with an appropriate calibration. To implement this idea, we improved adaptor-tagged competitive PCR, originally developed for relative quantification of the 3'-end restriction fragment of each cDNA, such that it can be applied to any restriction fragment. We demonstrated that this "generalized" adaptor-tagged competitive PCR (GATC-PCR) can be performed between genomic DNA and cDNA to accurately measure absolute expression level of each mRNA in the budding yeast Saccharomyces cerevisiae. Furthermore, we constructed a large-scale GATC-PCR system to measure absolute expression levels of 5,038 genes to show that the yeast contains more than 30,000 copies of mRNA molecules per cell. Conclusion We developed a GATC-PCR method to accurately measure absolute expression levels of mRNAs by means of competitive amplification of genomic and cDNA copies of each gene. A large-scale application of GATC-PCR to the budding yeast transcriptome revealed that it is twice or more as large as previously estimated. This method is flexibly applicable to both targeted and genome-wide analyses of absolute expression levels of mRNAs.
Affiliation(s)
- Fumihito Miura
- Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo, Kashiwa 277-8561, Japan.
|
29
|
Zhang H, Zhang T, Chen K, Shen S, Ruan J, Kurgan L. Sequence based residue depth prediction using evolutionary information and predicted secondary structure. BMC Bioinformatics 2008; 9:388. [PMID: 18803867 PMCID: PMC2567998 DOI: 10.1186/1471-2105-9-388] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2008] [Accepted: 09/20/2008] [Indexed: 11/29/2022] Open
Abstract
Background Residue depth allows determining how deeply a given residue is buried, in contrast to the solvent accessibility that differentiates between buried and solvent-exposed residues. When compared with the solvent accessibility, the depth allows studying deep-level structures and functional sites, and formation of the protein folding nucleus. Accurate prediction of residue depth would provide valuable information for fold recognition, prediction of functional sites, and protein design. Results A new method, RDPred, for the real-value depth prediction from protein sequence is proposed. RDPred combines information extracted from the sequence, PSI-BLAST scoring matrices, and secondary structure predicted with PSIPRED. Three-fold/ten-fold cross validation based tests performed on three independent, low-identity datasets show that the distance based depth (computed using MSMS) predicted by RDPred is characterized by 0.67/0.67, 0.66/0.67, and 0.64/0.65 correlation with the actual depth, by the mean absolute errors equal 0.56/0.56, 0.61/0.60, and 0.58/0.57, and by the mean relative errors equal 17.0%/16.9%, 18.2%/18.1%, and 17.7%/17.6%, respectively. The mean absolute and the mean relative errors are shown to be statistically significantly better when compared with a method recently proposed by Yuan and Wang [Proteins 2008; 70:509–516]. The results show that three-fold cross validation underestimates the variability of the prediction quality when compared with the results based on the ten-fold cross validation. We also show that the hydrophilic and flexible residues are predicted more accurately than hydrophobic and rigid residues. Similarly, the charged residues that include Lys, Glu, Asp, and Arg are the most accurately predicted. Our analysis reveals that evolutionary information encoded using PSSM is characterized by stronger correlation with the depth for hydrophilic amino acids (AAs) and aliphatic AAs when compared with hydrophobic AAs and aromatic AAs. 
Finally, we show that the secondary structure of coils and strands is useful in depth prediction, in contrast to helices that have relatively uniform distribution over the protein depth. Application of the predicted residue depth to prediction of buried/exposed residues shows consistent improvements in detection rates of both buried and exposed residues when compared with the competing method. Finally, we contrasted the prediction performance among distance based (MSMS and DPX) and volume based (SADIC) depth definitions. We found that the distance based indices are harder to predict due to the more complex nature of the corresponding depth profiles. Conclusion The proposed method, RDPred, provides statistically significantly better predictions of residue depth when compared with the competing method. The predicted depth can be used to provide improved prediction of both buried and exposed residues. The prediction of exposed residues has implications in characterization/prediction of interactions with ligands and other proteins, while the prediction of buried residues could be used in the context of folding predictions and simulations.
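The three-fold versus ten-fold comparison mentioned in this abstract rests on partitioning the dataset into k folds, each used once as the test set. A minimal sketch of such a split is given below; the helper name `k_fold_indices` is hypothetical, and the contiguous, unshuffled folds are a simplifying assumption (in practice the samples would usually be shuffled first).

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

With fewer folds each model is trained on less data and fewer test partitions are evaluated, which is one reason estimates from three-fold and ten-fold runs can differ in variability.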
Affiliation(s)
- Hua Zhang
- College of Mathematical Science and LPMC, Nankai University, Tianjin, PR China.
|
30
|
Song J, Tan H, Takemoto K, Akutsu T. HSEpred: predict half-sphere exposure from protein sequences. Bioinformatics 2008; 24:1489-97. [DOI: 10.1093/bioinformatics/btn222] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
|
31
|
Otaki JM, Gotoh T, Yamamoto H. Potential implications of availability of short amino acid sequences in proteins: an old and new approach to protein decoding and design. BIOTECHNOLOGY ANNUAL REVIEW 2008; 14:109-41. [PMID: 18606361 DOI: 10.1016/s1387-2656(08)00004-5] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
Abstract
Three-dimensional structure of a protein molecule is primarily determined by its amino acid sequence, and thus the elucidation of general rules embedded in amino acid sequences is of great importance in protein science and engineering. To extract valuable information from sequences, we propose an analytical method in which a protein sequence is considered to be constructed by serial superimpositions of short amino acid sequences of n amino acid sets, especially triplets (3-aa sets). Using the comprehensive nonredundant protein database, we first examined "availability" of all possible combinatorial sets of 8,000 triplet species. Availability score was mathematically defined as an indicator for the relative "preference" or "avoidance" for a given short constituent sequence to be used in protein chain. Availability scores of real proteins were clearly biased against those of randomly generated proteins. We found many triplet species that occurred in the database more than expected or less than expected. Such bias was extended to longer sets, and we found that some species of pentats (5-aa sets) that occurred reasonably frequently in the randomly generated protein population did not occur at all in any real proteins known today. Availability score was dependent on species, potentially serving as a phylogenetic indicator. Furthermore, we suggest possibilities of various biotechnological applications of characteristic short sequences such as human-specific and pathogen-specific short sequences obtained from availability analysis. Availability score was also dependent on secondary structures, potentially serving as a structural indicator. Availability analysis on triplets may be combined with a comprehensive data collection on the φ and ψ peptide-bond angles of the amino acid at the center of each triplet, i.e., a collection of Ramachandran plots for each triplet.
These triplet characters, together with other physicochemical data, will provide us with basic information between protein sequence and structure, by which structure prediction and engineering may be greatly facilitated. Availability analysis may also be useful in identifying word processing units in amino acid sequences based on an analogy to natural languages. Together with other approaches, availability analysis will elucidate general rules hidden in the primary sequences and eventually contributes to rebuilding the paradigm of protein science.
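The idea of triplets occurring "more than expected or less than expected" can be illustrated with a simple observed-to-expected ratio over overlapping 3-residue windows. The abstract does not give the mathematical definition of the availability score, so the ratio below is only an assumed stand-in: the expected count is taken from single-residue frequencies under independence, and the names `triplet_counts` and `observed_over_expected` are hypothetical.

```python
from collections import Counter

def triplet_counts(sequences):
    """Count overlapping 3-residue windows across a list of protein sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 2):
            counts[seq[i:i + 3]] += 1
    return counts

def observed_over_expected(sequences):
    """Ratio of observed triplet counts to counts expected from residue frequencies.

    Assumption: residues are treated as independent when computing the
    expectation; this is not the availability score defined in the paper.
    """
    residue_counts = Counter("".join(sequences))
    total_residues = sum(residue_counts.values())
    freq = {aa: c / total_residues for aa, c in residue_counts.items()}
    counts = triplet_counts(sequences)
    total_triplets = sum(counts.values())
    ratios = {}
    for triplet, observed in counts.items():
        expected = total_triplets * freq[triplet[0]] * freq[triplet[1]] * freq[triplet[2]]
        ratios[triplet] = observed / expected
    return ratios
```

Ratios above 1 would indicate triplets used more often than composition alone predicts, and ratios below 1 the opposite; with 20 amino acids there are 20^3 = 8,000 possible triplets, matching the figure in the abstract.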
Affiliation(s)
- Joji M Otaki
- Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Nishihara, Okinawa 903-0213, Japan.
|
32
|
Deschavanne P, Tufféry P. Exploring an alignment free approach for protein classification and structural class prediction. Biochimie 2007; 90:615-25. [PMID: 18067866 DOI: 10.1016/j.biochi.2007.11.004] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2007] [Accepted: 11/09/2007] [Indexed: 11/25/2022]
Abstract
Alignment free methods based on Chaos Game Representation (CGR), also known as sequence signature approaches, have proven of great interest for DNA sequence analysis. Indeed, they have been successfully applied for sequence comparison, phylogeny, detection of horizontal transfers or extraction of representative motifs in regulation sequences. Transposing such methods to proteins poses several fundamental questions related to representation space dimensionality. Several studies have tackled these points, but none has, so far, brought the application of CGRs to proteins to their fully expected potential. Yet, several studies have shown that techniques based on n-peptide frequencies can be relevant for proteins. Here, we investigate the effectiveness of a strategy based on the CGR approach using a fixed reverse encoding of amino acids into nucleic sequences. We first explore its relevance to protein classification into functional families. We then attempt to apply it to the prediction of protein structural classes. Our results suggest that the reverse encoding approach could be relevant in both cases. We show that it is able to classify functional families of proteins by extracting signatures close to the ProSite patterns. Applied to structural classification, the approach reaches scores of correct classification close to 84%, i.e. close to the scores of related methods in the field. Various optimizations of the approach are still possible, which open the door for future applications.
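For readers unfamiliar with the Chaos Game Representation used for DNA sequences, a minimal sketch follows. Each base is assigned a corner of the unit square, and every successive point lies halfway between the previous point and the corner of the current base. The corner assignment and the name `cgr_points` are assumptions for illustration; the reverse encoding of amino acids into nucleic sequences studied in the paper is not reproduced here.

```python
# Assumed corner layout; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(dna):
    """Return the CGR trajectory of a DNA string as a list of (x, y) points."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in dna.upper():
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points
```

Binning these points on a 2^n by 2^n grid yields the frequencies of all n-mers, which is why CGR images act as sequence signatures.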
Affiliation(s)
- P Deschavanne
- Equipe de Bioinformatique Génomique et Moléculaire, INSERM UMR-S 726, Université Paris 7, 75251 Paris Cedex 05, France
|
33
|
Adaptation of model proteins from cold to hot environments involves continuous and small adjustments of average parameters related to amino acid composition. J Theor Biol 2007; 250:156-71. [PMID: 17950361 DOI: 10.1016/j.jtbi.2007.09.006] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2007] [Revised: 08/29/2007] [Accepted: 09/01/2007] [Indexed: 10/22/2022]
Abstract
The growth temperature adaptation of six model proteins has been studied in 42 microorganisms belonging to eubacterial and archaeal kingdoms, covering optimum growth temperatures from 7 to 103 °C. The selected proteins include three elongation factors involved in translation, the enzymes glyceraldehyde-3-phosphate dehydrogenase and superoxide dismutase, and the cell division protein FtsZ. The common strategy of protein adaptation from cold to hot environments implies the occurrence of small changes in the amino acid composition, without altering the overall structure of the macromolecule. These continuous adjustments were investigated through parameters related to the amino acid composition of each protein. The average value per residue of mass, volume and accessible surface area allowed an evaluation of the usage of bulky residues, whereas the average hydrophobicity reflected that of hydrophobic residues. The specific proportion of bulky and hydrophobic residues in each protein almost linearly increased with the temperature of the host microorganism. This finding agrees with the structural and functional properties exhibited by proteins in differently adapted sources, thus explaining the great compactness or the high flexibility exhibited by (hyper)thermophilic or psychrophilic proteins, respectively. Indeed, heat-adapted proteins incline toward the usage of heavier-size and more hydrophobic residues with respect to mesophiles, whereas the cold-adapted macromolecules show the opposite behavior with a certain preference for smaller-size and less hydrophobic residues. An investigation on the different increase of bulky residues along with the growth temperature observed in the six model proteins suggests the relevance of the possible different role and/or structure organization played by protein domains.
The significance of the linear correlations between growth temperature and parameters related to the amino acid composition improved when the analysis was collectively carried out on all model proteins.
|
34
|
Wu G, Nie L, Freeland SJ. The effects of differential gene expression on coding sequence features: Analysis by one-way ANOVA. Biochem Biophys Res Commun 2007; 358:1108-13. [PMID: 17517370 DOI: 10.1016/j.bbrc.2007.05.043] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2007] [Accepted: 05/08/2007] [Indexed: 10/23/2022]
Abstract
It is well-established that non-random patterns in coding DNA sequence (CDS) features can be partially explained by translational selection. Recent extensions of microarray and proteomic expression data have stimulated many genome-wide investigations of the relationships between gene expression and various CDS features. However, only modest correlations have been found. Here we introduced the one-way ANOVA, a more powerful extension of previous grouping methods, to re-examine these relationships at the whole genome scale for Saccharomyces cerevisiae, where genome-wide protein abundance has been recently quantified. Our results clarify that coding sequence features are inappropriate for use as genome-wide estimators for protein expression levels. This analysis also demonstrates that one-way ANOVA is a powerful and simple method to explore the influence of gene expression on CDS features.
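The one-way ANOVA referred to in this abstract compares the variance between groups (for example, genes binned by expression level) with the variance within groups. A minimal sketch of the F statistic is shown below; the function name `one_way_anova_f` is hypothetical, and the grouping of genes into expression classes is assumed rather than taken from the paper.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group size times squared mean offset.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: squared deviations from each group's mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within
```

A large F value indicates that a coding sequence feature differs between expression classes by more than within-class scatter would explain; turning F into a p-value requires the F distribution with (k - 1, N - k) degrees of freedom.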
Affiliation(s)
- Gang Wu
- Department of Biological Sciences, University of Maryland at Baltimore County, Baltimore, MD 21250, USA.
|
35
|
Prediction of highly expressed genes in microbes based on chromatin accessibility. BMC Mol Biol 2007; 8:11. [PMID: 17295928 PMCID: PMC1805505 DOI: 10.1186/1471-2199-8-11] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2006] [Accepted: 02/13/2007] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND It is well known that gene expression is dependent on chromatin structure in eukaryotes and it is likely that chromatin can play a role in bacterial gene expression as well. Here, we use a nucleosomal position preference measure of anisotropic DNA flexibility to predict highly expressed genes in microbial genomes. We compare these predictions with those based on codon adaptation index (CAI) values, and also with experimental data for 6 different microbial genomes, with a particular interest in experimental data from Escherichia coli. Moreover, position preference is examined further in 328 sequenced microbial genomes. RESULTS We find that absolute gene expression levels are correlated with the position preference in many microbial genomes. It is postulated that in these regions, the DNA may be more accessible to the transcriptional machinery. Moreover, ribosomal proteins and ribosomal RNA are encoded by DNA having significantly lower position preference values than other genes in fast-replicating microbes. CONCLUSION This insight into DNA structure-dependent gene expression in microbes may be exploited for predicting the expression of non-translated genes such as non-coding RNAs that may not be predicted by any of the conventional codon usage bias approaches.
|
36
|
Machine learning techniques in disease forecasting: a case study on rice blast prediction. BMC Bioinformatics 2006; 7:485. [PMID: 17083731 PMCID: PMC1647291 DOI: 10.1186/1471-2105-7-485] [Citation(s) in RCA: 99] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2006] [Accepted: 11/03/2006] [Indexed: 11/22/2022] Open
Abstract
Background Diverse modeling approaches viz. neural networks and multiple regression have been followed to date for disease prediction in plant populations. However, due to their inability to predict value of unknown data points and longer training times, there is need for exploiting new prediction softwares for better understanding of plant-pathogen-environment relationships. Further, there is no online tool available which can help the plant researchers or farmers in timely application of control measures. This paper introduces a new prediction approach based on support vector machines for developing weather-based prediction models of plant diseases. Results Six significant weather variables were selected as predictor variables. Two series of models (cross-location and cross-year) were developed and validated using a five-fold cross validation procedure. For cross-year models, the conventional multiple regression (REG) approach achieved an average correlation coefficient (r) of 0.50, which increased to 0.60 and percent mean absolute error (%MAE) decreased from 65.42 to 52.24 when back-propagation neural network (BPNN) was used. With generalized regression neural network (GRNN), the r increased to 0.70 and %MAE also improved to 46.30, which further increased to r = 0.77 and %MAE = 36.66 when support vector machine (SVM) based method was used. Similarly, cross-location validation achieved r = 0.48, 0.56 and 0.66 using REG, BPNN and GRNN respectively, with their corresponding %MAE as 77.54, 66.11 and 58.26. The SVM-based method outperformed all the three approaches by further increasing r to 0.74 with improvement in %MAE to 44.12. Overall, this SVM-based prediction approach will open new vistas in the area of forecasting plant diseases of various crops. Conclusion Our case study demonstrated that SVM is better than existing machine learning techniques and conventional REG approaches in forecasting plant diseases. 
In this direction, we have also developed an SVM-based web server for rice blast prediction, a first of its kind worldwide, which can help the plant science community and farmers in their decision making process. The server is freely available online.
|
37
|
Song J, Burrage K. Predicting residue-wise contact orders in proteins by support vector regression. BMC Bioinformatics 2006; 7:425. [PMID: 17014735 PMCID: PMC1618864 DOI: 10.1186/1471-2105-7-425] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2006] [Accepted: 10/03/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The residue-wise contact order (RWCO) describes the sequence separations between the residues of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered as a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive and indispensable important information to reconstructing the protein three-dimensional structure from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and give deep insights into protein sequence-structure relationships. RESULTS We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance, including local sequence in the form of PSI-BLAST profiles, local sequence plus amino acid composition, local sequence plus molecular weight, local sequence plus secondary structure predicted by PSIPRED, local sequence plus molecular weight and amino acid composition, local sequence plus molecular weight and predicted secondary structure, and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55, and root mean square error (RMSE) of 0.82, based on a well-defined dataset with 680 protein sequences. 
Moreover, by incorporating global features such as molecular weight and amino acid composition we could further improve the prediction performance with the CC to 0.57 and an RMSE of 0.79. In addition, combining the predicted secondary structure by PSIPRED was found to significantly improve the prediction performance and could yield the best prediction accuracy with a CC of 0.60 and RMSE of 0.78, which provided at least comparable performance compared with the other existing methods. CONCLUSION The SVR method shows a prediction performance competitive with or at least comparable to the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the fact that support vector regression is a powerful tool in extracting the protein sequence-structure relationship and in estimating the protein structural profiles from amino acid sequences.
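The quantity being predicted here can be illustrated by computing a residue-wise contact order directly from coordinates: for each residue, sum the sequence separations to all residues it contacts. The abstract does not state the contact criterion, so the hard distance cutoff, the minimum sequence separation, and the normalisation by chain length below are assumptions, as is the name `residue_wise_contact_order`.

```python
import math

def residue_wise_contact_order(coords, cutoff=8.0, min_separation=3):
    """Per-residue sum of sequence separations to contacting residues.

    coords: one (x, y, z) tuple per residue (e.g. C-beta positions).
    cutoff, min_separation and the division by chain length are assumed
    conventions, not values taken from the paper.
    """
    n = len(coords)
    rwco = []
    for i in range(n):
        total = 0
        for j in range(n):
            separation = abs(i - j)
            if separation >= min_separation and math.dist(coords[i], coords[j]) <= cutoff:
                total += separation
        rwco.append(total / n)
    return rwco
```

Residues with many long-range contacts receive high values, which is what makes the profile informative about tertiary structure.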
Affiliation(s)
- Jiangning Song
- Advanced Computational Modelling Centre, The University of Queensland, Brisbane Qld 4072, Australia
- Kevin Burrage
- Advanced Computational Modelling Centre, The University of Queensland, Brisbane Qld 4072, Australia
|
38
|
Tuikkala J, Elo L, Nevalainen OS, Aittokallio T. Improving missing value estimation in microarray data with gene ontology. Bioinformatics 2005; 22:566-72. [PMID: 16377613 DOI: 10.1093/bioinformatics/btk019] [Citation(s) in RCA: 81] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Gene expression microarray experiments produce datasets with frequent missing expression values. Accurate estimation of missing values is an important prerequisite for efficient data analysis as many statistical and machine learning techniques either require a complete dataset or their results are significantly dependent on the quality of such estimates. A limitation of the existing estimation methods for microarray data is that they use no external information but the estimation is based solely on the expression data. We hypothesized that utilizing a priori information on functional similarities available from public databases facilitates the missing value estimation. RESULTS We investigated whether semantic similarity originating from gene ontology (GO) annotations could improve the selection of relevant genes for missing value estimation. The relative contribution of each information source was automatically estimated from the data using an adaptive weight selection procedure. Our experimental results in yeast cDNA microarray datasets indicated that by considering GO information in the k-nearest neighbor algorithm we can enhance its performance considerably, especially when the number of experimental conditions is small and the percentage of missing values is high. The increase of performance was less evident with a more sophisticated estimation method. We conclude that even a small proportion of annotated genes can provide improvements in data quality significant for the eventual interpretation of the microarray experiments. AVAILABILITY Java and Matlab codes are available on request from the authors. SUPPLEMENTARY MATERIAL Available online at http://users.utu.fi/jotatu/GOImpute.html.
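The baseline k-nearest neighbor estimation mentioned in this abstract can be sketched as follows: a missing value is replaced by the average of that condition's values in the k genes whose remaining expression profiles are closest. This sketch uses expression distance only; the GO-based semantic similarity and adaptive weighting described in the paper are not implemented, and the name `knn_impute` and the use of `None` for missing entries are assumptions.

```python
import math

def knn_impute(matrix, gene_idx, cond_idx, k=2):
    """Estimate matrix[gene_idx][cond_idx] from the k most similar genes.

    matrix: list of gene rows; missing values are represented by None.
    Distance is the root mean squared difference over shared conditions.
    """
    target = matrix[gene_idx]
    candidates = []
    for g, row in enumerate(matrix):
        if g == gene_idx or row[cond_idx] is None:
            continue  # a neighbour must have a value in the target condition
        shared = [(a, b) for c, (a, b) in enumerate(zip(target, row))
                  if c != cond_idx and a is not None and b is not None]
        if not shared:
            continue
        dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
        candidates.append((dist, row[cond_idx]))
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:k]
    if not nearest:
        raise ValueError("no neighbour has a value for this condition")
    return sum(value for _, value in nearest) / len(nearest)
```

An external similarity source such as GO annotations could, in principle, be blended into `dist` before sorting, which is the kind of extension the abstract describes.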
Affiliation(s)
- Johannes Tuikkala
- Department of Information Technology, University of Turku, Lemminkäisenkatu 14A, FIN-20520, Finland.
|
39
|
Arakawa K, Suzuki H, Fujishima K, Fujimoto K, Ueda S, Matsui M, Tomita M. A Comprehensive Software Suite for the Analysis of cDNAs. GENOMICS, PROTEOMICS & BIOINFORMATICS 2005; 3:179-88. [PMID: 16487083 PMCID: PMC5172547 DOI: 10.1016/s1672-0229(05)03023-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
We have developed a comprehensive software suite for bioinformatics research of cDNAs; it is aimed at rapid characterization of the features of genes and the proteins they code. Methods implemented include the detection of translation initiation and termination signals, statistical analysis of codon usage, comparative study of amino acid composition, comparative modeling of the structures of product proteins, prediction of alternative splice forms, and metabolic pathway reconstruction. The software package is freely available under the GNU General Public License at http://www.g-language.org/data/cdna/.
Affiliation(s)
- Kazuharu Arakawa
- Institute for Advanced Biosciences, Keio University, Fujisawa 252-8520, Japan.
|