1
Suárez T, Montaño DF, Suárez R. Construction of amino acids reduced alphabets from molecular descriptors for interpretation of N-carbamylase, luciferase and PI3K mutations. Biosystems 2024; 246:105331. [PMID: 39260761 DOI: 10.1016/j.biosystems.2024.105331] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2024] [Revised: 09/04/2024] [Accepted: 09/08/2024] [Indexed: 09/13/2024]
Abstract
The classification of amino acids has proven to be a useful tool for understanding the importance of sequence in protein function. Reduced amino acid alphabets are an example of these classifications; when built from physicochemical, structural and quantum characteristics of the amino acids, they simplify the representation of sequences, which is useful in the modelling, design and understanding of proteins. An objective selection of amino acid properties is therefore important, because the classes formed in a reduced alphabet depend on the descriptors used for classification. In this research, based on a careful selection of descriptors for the 20 amino acids, and through techniques such as the information content index and hierarchical cluster analysis with ties in proximity, 20,871,586 reduced amino acid alphabets were constructed. This large collection of reduced alphabets has been used to interpret alterations in the function of three proteins: N-carbamylase, Luciferase, and PI3K, caused by amino acid changes in their sequences. For this, the similar and different descriptors linked to these mutations were studied. Properties such as volume, hydrophobicity, charge and autocorrelation can be associated with variations in the behaviour of these proteins, while the frequency in specific secondary structures, the Gibbs free energy and some topological and quantum properties can be considered as the causes of preventing the deactivation of protein function. This work offers the most complete collection of reduced alphabets, which promises to be a useful tool for the interpretation of alterations caused by amino acid mutations in the protein sequence.
Affiliation(s)
- Tatiana Suárez
- CHIMA Grupo de Química Matemática, Universidad de Pamplona, Km 1 Vía Bucaramanga, Pamplona, Colombia
- Diego F Montaño
- Departamento de Química, Universidad de Pamplona, Km 1 Vía Bucaramanga, Pamplona, Colombia
- Rosana Suárez
- CHIMA Grupo de Química Matemática, Universidad de Pamplona, Km 1 Vía Bucaramanga, Pamplona, Colombia
2
Kang J, Zhang C, Wang Y, Peng J, Berger B, Perrimon N, Shen J. Lipophorin receptors genetically modulate neurodegeneration caused by reduction of Psn expression in the aging Drosophila brain. Genetics 2024; 226:iyad202. [PMID: 37996068 PMCID: PMC10763532 DOI: 10.1093/genetics/iyad202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Revised: 11/01/2023] [Accepted: 11/12/2023] [Indexed: 11/25/2023] Open
Abstract
Mutations in the Presenilin (PSEN) genes are the most common cause of early-onset familial Alzheimer's disease (FAD). Studies in cell culture, in vitro biochemical systems, and knockin mice showed that PSEN mutations are loss-of-function mutations, impairing γ-secretase activity. Mouse genetic analysis highlighted the importance of Presenilin (PS) in learning and memory, synaptic plasticity and neurotransmitter release, and neuronal survival, and Drosophila studies further demonstrated an evolutionarily conserved role of PS in neuronal survival during aging. However, molecular pathways that interact with PS in neuronal survival remain unclear. To identify genetic modifiers that modulate PS-dependent neuronal survival, we developed a new Drosophila Psn model that exhibits age-dependent neurodegeneration and increases of apoptosis. Following a bioinformatic analysis, we tested top ranked candidate genes by selective knockdown (KD) of each gene in neurons using two independent RNAi lines in Psn KD models. Interestingly, 4 of the 9 genes enhancing neurodegeneration in Psn KD flies are involved in lipid transport and metabolism. Specifically, neuron-specific KD of lipophorin receptors, lpr1 and lpr2, dramatically worsens neurodegeneration in Psn KD flies, and overexpression of lpr1 or lpr2 does not alleviate Psn KD-induced neurodegeneration. Furthermore, lpr1 or lpr2 KD alone also leads to neurodegeneration, increased apoptosis, climbing defects, and shortened lifespan. Lastly, heterozygotic deletions of lpr1 and lpr2 or homozygotic deletions of lpr1 or lpr2 similarly lead to age-dependent neurodegeneration and further exacerbate neurodegeneration in Psn KD flies. These findings show that LpRs modulate Psn-dependent neuronal survival and are critically important for neuronal integrity in the aging brain.
Affiliation(s)
- Jongkyun Kang
- Department of Neurology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA
- Chen Zhang
- Department of Neurology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA
- Yuhao Wang
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Jian Peng
- Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL 61801, USA
- Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Norbert Perrimon
- Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
- Howard Hughes Medical Institute, Boston, MA 02115, USA
- Jie Shen
- Department of Neurology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA
- Program in Neuroscience, Harvard Medical School, Boston, MA 02115, USA
3
Chang CH, Nelson WC, Jerger A, Wright AT, Egbert RG, McDermott JE. Snekmer: a scalable pipeline for protein sequence fingerprinting based on amino acid recoding. Bioinformatics Advances 2023; 3:vbad005. [PMID: 36789294 PMCID: PMC9913046 DOI: 10.1093/bioadv/vbad005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2022] [Revised: 12/16/2022] [Accepted: 02/01/2023] [Indexed: 02/04/2023]
Abstract
Motivation: The vast expansion of sequence data generated from single organisms and microbiomes has precipitated the need for faster and more sensitive methods to assess evolutionary and functional relationships between proteins. Representing proteins as sets of short peptide sequences (kmers) has been used for rapid, accurate classification of proteins into functional categories; however, this approach employs an exact-match methodology and thus may be limited in terms of sensitivity and coverage. We have previously used similarity groupings, based on the chemical properties of amino acids, to form reduced character sets and recode proteins. This amino acid recoding (AAR) approach simplifies the construction of protein representations in the form of kmer vectors, which can link sequences with distant sequence similarity and provide accurate classification of problematic protein families. Results: Here, we describe Snekmer, a software tool for recoding proteins into AAR kmer vectors and performing either (i) construction of supervised classification models trained on input protein families or (ii) clustering for de novo determination of protein families. We provide examples of the operation of the tool against a set of nitrogen cycling families originally collected using both standard hidden Markov models and a larger set of proteins from UniProt and demonstrate that our method accurately differentiates these sequences in both operation modes. Availability and implementation: Snekmer is written in Python using Snakemake. Code and data used in this article, along with tutorial notebooks, are available at http://github.com/PNNL-CompBio/Snekmer under an open-source BSD-3 license. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
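The recoding idea in the abstract above can be illustrated with a minimal sketch. It is not Snekmer's code or API: the three-group alphabet (`GROUPS`) and the helper names `recode` and `kmer_counts` are assumptions made up for this example, and the grouping by hydrophobic, polar and charged residues is only one of many possible reduced alphabets.

```python
from collections import Counter

# Hypothetical 3-letter reduced alphabet, for illustration only
# (not the scheme shipped with Snekmer):
# H = hydrophobic, P = polar/neutral, C = charged.
GROUPS = {
    "H": "AVLIMFWC",
    "P": "GSTYNQP",
    "C": "DEKRH",
}

# Invert the grouping into a per-residue lookup table.
RECODE = {aa: label for label, members in GROUPS.items() for aa in members}


def recode(seq):
    """Rewrite a protein sequence in the reduced alphabet.

    Raises KeyError for symbols outside the 20 standard residues.
    """
    return "".join(RECODE[aa] for aa in seq.upper())


def kmer_counts(seq, k):
    """Count overlapping kmers of the recoded sequence."""
    reduced = recode(seq)
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


if __name__ == "__main__":
    # Two sequences with no identical positions except the last
    # collapse to the same reduced string, so their kmer vectors match.
    print(recode("MKVLA"), recode("LRIVA"))
    print(kmer_counts("MKVLA", 2))
```

Because exact kmer matching happens after recoding, sequences that differ residue by residue can still share kmers, which is the sensitivity gain the abstract describes; the cost is that distinct residues within a group become indistinguishable.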
Affiliation(s)
- Christine H Chang
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA 99352, USA
- William C Nelson
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA 99352, USA
- Abby Jerger
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA 99352, USA
- Aaron T Wright
- Department of Biology, Baylor University, Waco, TX 76798, USA
- Robert G Egbert
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA 99352, USA
4
Liang Y, Yang S, Zheng L, Wang H, Zhou J, Huang S, Yang L, Zuo Y. Research progress of reduced amino acid alphabets in protein analysis and prediction. Comput Struct Biotechnol J 2022; 20:3503-3510. [PMID: 35860409 PMCID: PMC9284397 DOI: 10.1016/j.csbj.2022.07.001] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 03/14/2022] [Revised: 06/30/2022] [Accepted: 07/01/2022] [Indexed: 11/29/2022] Open
Abstract
A comprehensive summary of the literature on the reduced amino acid alphabets. A systematic review of the development history of reduced amino acid alphabets. Rich application cases of amino acid reduction alphabets are described in the article. A detailed analysis of the properties and uses of the reduced amino acid alphabets.
Proteins are the executors of cellular physiological activities, and accurate elucidation of their structure and function is crucial for the refined mapping of proteins. As a feature engineering method, the reduction of amino acid composition is not only an important method for protein structure and function analysis, but also opens a broad horizon for the complex field of machine learning. Representing sequences with fewer amino acid types greatly reduces the dimensional complexity and noise of traditional feature engineering, and provides more interpretable predictive models for machine learning to capture key features. In this paper, we systematically reviewed the strategies and methods for constructing reduced amino acid (RAA) alphabets, and summarized the main research applying them to protein sequence alignment, functional classification, and prediction of structural properties. Finally, we gave a comprehensive analysis of 672 RAA alphabets from 74 reduction methods.
Affiliation(s)
- Yuchao Liang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Siqi Yang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Lei Zheng
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Hao Wang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Jian Zhou
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Shenghui Huang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Lei Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Corresponding authors.
- Yongchun Zuo
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, China
- Corresponding authors.
5
Ou J, Liu H, Nirala NK, Stukalov A, Acharya U, Green MR, Zhu LJ. dagLogo: An R/Bioconductor package for identifying and visualizing differential amino acid group usage in proteomics data. PLoS One 2020; 15:e0242030. [PMID: 33156866 PMCID: PMC7647101 DOI: 10.1371/journal.pone.0242030] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/23/2020] [Accepted: 10/23/2020] [Indexed: 11/18/2022] Open
Abstract
Sequence logos have been widely used as graphical representations of conserved nucleic acid and protein motifs. Due to the complexity of the amino acid (AA) alphabet, rich post-translational modification, and diverse subcellular localization of proteins, few versatile tools are available for effective identification and visualization of protein motifs. In addition, various reduced AA alphabets based on physicochemical, structural, or functional properties have been valuable in the study of protein alignment, folding, structure prediction, and evolution. However, there is a lack of tools for applying reduced AA alphabets to the identification and visualization of statistically significant motifs. To fill this gap, we developed an R/Bioconductor package, dagLogo, which has several advantages over existing tools. First, dagLogo allows various formats for input sets and provides comprehensive options to build optimal background models. It implements different reduced AA alphabets to group AAs of similar properties. Furthermore, dagLogo provides statistical and visual solutions for differential AA (or AA group) usage analysis of both large and small data sets. Case studies showed that dagLogo can better identify and visualize conserved protein sequence patterns from different types of inputs and can potentially reveal the biological patterns that could be missed by other logo generators.
Affiliation(s)
- Jianhong Ou
- Department of Molecular, Cell, and Cancer Biology, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America
- Regeneration NEXT, Duke University School of Medicine, Duke University, Durham, North Carolina, United States of America
- Haibo Liu
- Department of Molecular, Cell, and Cancer Biology, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America
- Niraj K. Nirala
- Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America
- Alexey Stukalov
- Institute of Virology, Technical University of Munich, Munich, Germany
- Usha Acharya
- Department of Molecular, Cell, and Cancer Biology, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America
- Michael R. Green
- Department of Molecular, Cell, and Cancer Biology, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America
- Lihua Julie Zhu
- Department of Molecular, Cell, and Cancer Biology, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America
- Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America
6
Franco MA, Krasnogor N, Bacardit J. Automatic Tuning of Rule-Based Evolutionary Machine Learning via Problem Structure Identification. IEEE Comput Intell M 2020. [DOI: 10.1109/mci.2020.2998232] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/09/2022]
7
Oberti M, Vaisman II. cnnAlpha: Protein disordered regions prediction by reduced amino acid alphabets and convolutional neural networks. Proteins 2020; 88:1472-1481. [PMID: 32535960 DOI: 10.1002/prot.25966] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 01/03/2019] [Revised: 11/18/2019] [Accepted: 06/06/2020] [Indexed: 12/23/2022]
Abstract
Intrinsically disordered regions (IDRs) play an important role in key biological processes and are closely related to human diseases. IDRs have great potential to serve as targets for drug discovery, most notably in disordered binding regions. Accurate prediction of IDRs is challenging because their genome-wide occurrence and a low ratio of disordered residues make them difficult targets for traditional classification techniques. Existing computational methods mostly rely on sequence profiles to improve accuracy, which is time-consuming and computationally expensive. This article describes an ab initio sequence-only prediction method (which tries to overcome the challenge of accurate prediction posed by IDRs) based on reduced amino acid alphabets and convolutional neural networks (CNNs). We experiment with six different 3-letter reduced alphabets. We argue that the dimensional reduction in the input alphabet facilitates the detection of complex patterns within the sequence by the convolutional step. Experimental results show that our proposed IDR predictor performs at the same level or outperforms other state-of-the-art methods in the same class, achieving accuracy levels of 0.76 and AUC of 0.85 on the publicly available Critical Assessment of protein Structure Prediction dataset (CASP10). Therefore, our method is suitable for proteome-wide disorder prediction, yielding similar or better accuracy than existing approaches at a faster speed.
Affiliation(s)
- Mauricio Oberti
- School of Systems Biology, George Mason University, Manassas, Virginia, USA
- Novartis Institutes for BioMedical Research, Cambridge, Massachusetts, USA
- Iosif I Vaisman
- School of Systems Biology, George Mason University, Manassas, Virginia, USA
8
McDermott JE, Cort JR, Nakayasu ES, Pruneda JN, Overall C, Adkins JN. Prediction of bacterial E3 ubiquitin ligase effectors using reduced amino acid peptide fingerprinting. PeerJ 2019; 7:e7055. [PMID: 31211016 PMCID: PMC6557245 DOI: 10.7717/peerj.7055] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/11/2018] [Accepted: 05/02/2019] [Indexed: 11/20/2022] Open
Abstract
Background: Although pathogenic Gram-negative bacteria lack their own ubiquitination machinery, they have evolved or acquired virulence effectors that can manipulate the host ubiquitination process through structural and/or functional mimicry of host machinery. Many such effectors have been identified in a wide variety of bacterial pathogens that share little sequence similarity amongst themselves or with eukaryotic ubiquitin E3 ligases. Methods: To allow identification of novel bacterial E3 ubiquitin ligase effectors from protein sequences we have developed a machine learning approach, the SVM-based Identification and Evaluation of Virulence Effector Ubiquitin ligases (SIEVE-Ub). We extend the string kernel approach used previously to sequence classification by introducing reduced amino acid (RED) alphabet encoding for protein sequences. Results: We found that 14mer peptides with amino acids represented as simply either hydrophobic or hydrophilic provided the best models for discrimination of E3 ligases from other effector proteins with a receiver-operator characteristic area under the curve (AUC) of 0.90. When considering a subset of E3 ubiquitin ligase effectors that do not fall into known sequence based families we found that the AUC was 0.82, demonstrating the effectiveness of our method at identifying novel functional family members. Feature selection was used to identify a parsimonious set of 10 RED peptides that provided good discrimination, and these peptides were found to be located in functionally important regions of the proteins involved in E2 and host target protein binding. Our general approach enables construction of models based on other effector functions. We used SIEVE-Ub to predict nine potential novel E3 ligases from a large set of bacterial genomes. SIEVE-Ub is available for download at https://doi.org/10.6084/m9.figshare.7766984.v1 or https://github.com/biodataganache/SIEVE-Ub for the most current version.
Affiliation(s)
- Jason E McDermott
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, United States of America
- Department of Molecular Microbiology and Immunology, Oregon Health & Science University, Portland, OR, United States of America
- John R Cort
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, United States of America
- Ernesto S Nakayasu
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, United States of America
- Jonathan N Pruneda
- Department of Molecular Microbiology and Immunology, Oregon Health & Science University, Portland, OR, United States of America
- Christopher Overall
- Center for Brain Immunology and Glia, University of Virginia, Charlottesville, United States of America
- Joshua N Adkins
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, United States of America
9
A Data Adaptive Biological Sequence Representation for Supervised Learning. Journal of Healthcare Informatics Research 2018; 2:448-471. [DOI: 10.1007/s41666-018-0038-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2017] [Revised: 10/01/2018] [Accepted: 10/02/2018] [Indexed: 11/27/2022]
10
Eggeling R, Grosse I, Koivisto M. Algorithms for learning parsimonious context trees. Mach Learn 2018. [DOI: 10.1007/s10994-018-5770-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
11
Rubio-Largo A, Vanneschi L, Castelli M, Vega-Rodriguez MA. A Characteristic-Based Framework for Multiple Sequence Aligners. IEEE Transactions on Cybernetics 2018; 48:41-51. [PMID: 27831898 DOI: 10.1109/tcyb.2016.2621129] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 06/06/2023]
Abstract
Multiple sequence alignment is a well-known bioinformatics problem that consists of aligning three or more biological sequences (protein or nucleic acid). In the literature, a number of tools have been proposed for dealing with this biological sequence alignment problem, such as progressive methods, consistency-based methods, or iterative methods, among others. These aligners often use a default parameter configuration for all the input sequences to align. However, the default configuration is not always the best choice; the alignment accuracy of the tool may be greatly improved if specific parameter configurations are used, depending on the biological characteristics of the input sequences. In this paper, we propose a characteristic-based framework for multiple sequence aligners. The idea of the framework is, given an input set of unaligned sequences, to extract its characteristics and run the aligner with the best parameter configuration found for another set of unaligned sequences with similar characteristics. In order to test the framework, we have used the well-known multiple sequence comparison by log-expectation (MUSCLE) v3.8 aligner with different benchmarks, such as benchmark alignments database v3.0, protein reference alignment benchmark v4.0, and sequence alignment benchmark v1.65. The results show that the alignment accuracy and conservation of MUSCLE might be greatly improved with the proposed framework, especially in those scenarios with a low percentage of identity. The characteristic-based framework for multiple sequence aligners is freely available for downloading at http://arco.unex.es/arl/fwk-msa/cbf-msa.zip.
12
Abstract
Based on Shannon's information communication theory, the information content of the entire length of a polymeric macromolecule can be calculated in bits by adding the entropies of each building block. Proteins, DNA and RNA are such macromolecules. When only the variation of the building blocks is considered as the source of entropy, the protein appears to carry less information than the mRNA if this approach is applied directly to a protein of a specific size and to the coding sequence of the mRNA corresponding to that protein length. This decrease in the amount of information seems contradictory, but the apparent conflict is resolved by considering the conformational variations in proteins as a new variable in the calculation and balancing the approximated entropy of the coding part of the mRNA and the protein. Probabilities can change; therefore, we also assigned hypothetical probabilities to the conformational states, which represent the uneven distribution as the time spent in one conformation, providing the probability of the presence in either or one of the possible conformations. Results obtained by using hypothetical probabilities are in line with the experimental values of variations in the conformational state of protein populations. This equalization approach has further biological relevance in that it compensates for the degeneracy in codon usage during protein translation, and it leads to the conclusion that the alphabet size for the protein is rather optimal for proper protein functioning within the thermodynamic milieu of the cell. The findings are also discussed in relation to codon bias and have implications for the codon evolution concept. Ultimately, this work brings the fields of protein structural studies and molecular protein translation processes together with a novel approach.
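The per-building-block entropy sum described in the abstract above can be sketched as follows. This is only an illustration under a strong simplifying assumption that is not taken from the paper: every building block is treated as equally likely (uniform probabilities over 20 amino acids and over 4 nucleotides). The function names and the example length of 100 residues are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of one building block drawn from `probs`."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def sequence_information(length, probs):
    """Total information of a polymer: entropy per block times block count.

    Assumes independent, identically distributed building blocks.
    """
    return length * shannon_entropy(probs)


# Assumed example: a 100-residue protein and its 300-nucleotide coding sequence,
# both with uniform (equiprobable) building blocks.
n_residues = 100
protein_bits = sequence_information(n_residues, [1 / 20] * 20)
mrna_bits = sequence_information(3 * n_residues, [1 / 4] * 4)

if __name__ == "__main__":
    print(f"protein: {protein_bits:.1f} bits")  # 100 * log2(20), about 432.2
    print(f"mRNA:    {mrna_bits:.1f} bits")     # 300 * log2(4) = 600.0
    print(f"gap per residue: {(mrna_bits - protein_bits) / n_residues:.2f} bits")
```

Under these uniform assumptions the coding mRNA carries 6 bits per codon against roughly 4.32 bits per residue, which is the apparent information deficit that the abstract attributes to codon degeneracy and proposes to balance with conformational entropy.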
Affiliation(s)
- Y Adiguzel
- Biophysics Department, School of Medicine, Istanbul Kemerburgaz University, Istanbul, Turkey
13
Childs LM, Baskerville EB, Cobey S. Trade-offs in antibody repertoires to complex antigens. Philos Trans R Soc Lond B Biol Sci 2016; 370:rstb.2014.0245. [PMID: 26194759 PMCID: PMC4528422 DOI: 10.1098/rstb.2014.0245] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.1] [Indexed: 01/18/2023] Open
Abstract
Pathogens vary in their antigenic complexity. While some pathogens such as measles present a few relatively invariant targets to the immune system, others such as malaria display considerable antigenic diversity. How the immune response copes in the presence of multiple antigens, and whether a trade-off exists between the breadth and efficacy of antibody (Ab)-mediated immune responses, are unsolved problems. We present a theoretical model of affinity maturation of B-cell receptors (BCRs) during a primary infection and examine how variation in the number of accessible antigenic sites alters the Ab repertoire. Naive B cells with randomly generated receptor sequences initiate the germinal centre (GC) reaction. The binding affinity of a BCR to an antigen is quantified via a genotype-phenotype map, based on a random energy landscape, that combines local and distant interactions between residues. In the presence of numerous antigens or epitopes, B-cell clones with different specificities compete for stimulation during rounds of mutation within GCs. We find that the availability of many epitopes reduces the affinity and relative breadth of the Ab repertoire. Despite the stochasticity of somatic hypermutation, patterns of immunodominance are strongly shaped by chance selection of naive B cells with specificities for particular epitopes. Our model provides a mechanistic basis for the diversity of Ab repertoires and the evolutionary advantage of antigenically complex pathogens.
Affiliation(s)
- Lauren M Childs
- Center for Communicable Disease Dynamics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Sarah Cobey
- Ecology and Evolution, University of Chicago, Chicago, IL, USA
14
Franco MA, Bacardit J. Large-scale experimental evaluation of GPU strategies for evolutionary machine learning. Inf Sci (N Y) 2016. [DOI: 10.1016/j.ins.2015.10.025] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 01/15/2023]
15
Solis AD. Amino acid alphabet reduction preserves fold information contained in contact interactions in proteins. Proteins 2015; 83:2198-216. [DOI: 10.1002/prot.24936] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 05/06/2015] [Revised: 09/04/2015] [Accepted: 09/04/2015] [Indexed: 12/14/2022]
Affiliation(s)
- Armando D. Solis
- Biological Sciences Department, New York City College of Technology, the City University of New York (CUNY), Brooklyn, New York 11201
16
ROSEFW-RF: The winner algorithm for the ECBDL’14 big data competition: An extremely imbalanced big data bioinformatics problem. Knowl Based Syst 2015. [DOI: 10.1016/j.knosys.2015.05.027] [Citation(s) in RCA: 105] [Impact Index Per Article: 11.7] [Indexed: 11/21/2022]
17
Ofer D, Linial M. ProFET: Feature engineering captures high-level protein functions. Bioinformatics 2015; 31:3429-36. [DOI: 10.1093/bioinformatics/btv345] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.1] [Received: 03/04/2015] [Accepted: 05/29/2015] [Indexed: 11/13/2022] Open
18
Currin A, Swainston N, Day PJ, Kell DB. Synthetic biology for the directed evolution of protein biocatalysts: navigating sequence space intelligently. Chem Soc Rev 2015; 44:1172-239. [PMID: 25503938 PMCID: PMC4349129 DOI: 10.1039/c4cs00351a] [Citation(s) in RCA: 251] [Impact Index Per Article: 27.9] [Received: 10/20/2014] [Indexed: 12/21/2022]
Abstract
The amino acid sequence of a protein affects both its structure and its function. Thus, the ability to modify the sequence, and hence the structure and activity, of individual proteins in a systematic way, opens up many opportunities, both scientifically and (as we focus on here) for exploitation in biocatalysis. Modern methods of synthetic biology, whereby increasingly large sequences of DNA can be synthesised de novo, allow an unprecedented ability to engineer proteins with novel functions. However, the number of possible proteins is far too large to test individually, so we need means for navigating the 'search space' of possible protein sequences efficiently and reliably in order to find desirable activities and other properties. Enzymologists distinguish binding (Kd) and catalytic (kcat) steps. In a similar way, judicious strategies have blended design (for binding, specificity and active site modelling) with the more empirical methods of classical directed evolution (DE) for improving kcat (where natural evolution rarely seeks the highest values), especially with regard to residues distant from the active site and where the functional linkages underpinning enzyme dynamics are both unknown and hard to predict. Epistasis (where the 'best' amino acid at one site depends on that or those at others) is a notable feature of directed evolution. The aim of this review is to highlight some of the approaches that are being developed to allow us to use directed evolution to improve enzyme properties, often dramatically. We note that directed evolution differs in a number of ways from natural evolution, including in particular the available mechanisms and the likely selection pressures. Thus, we stress the opportunities afforded by techniques that enable one to map sequence to (structure and) activity in silico, as an effective means of modelling and exploring protein landscapes. 
Because known landscapes may be assessed and reasoned about as a whole, simultaneously, this offers opportunities for protein improvement not readily available to natural evolution on rapid timescales. Intelligent landscape navigation, informed by sequence-activity relationships and coupled to the emerging methods of synthetic biology, offers scope for the development of novel biocatalysts that are both highly active and robust.
Affiliation(s)
- Andrew Currin
- Manchester Institute of Biotechnology, The University of Manchester, 131, Princess St, Manchester M1 7DN, UK; http://dbkgroup.org/; @dbkell; Tel: +44 (0)161 306 4492
- School of Chemistry, The University of Manchester, Manchester M13 9PL, UK
- Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), The University of Manchester, 131, Princess St, Manchester M1 7DN, UK
- Neil Swainston
- Manchester Institute of Biotechnology, The University of Manchester, 131, Princess St, Manchester M1 7DN, UK; http://dbkgroup.org/; @dbkell; Tel: +44 (0)161 306 4492
- Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), The University of Manchester, 131, Princess St, Manchester M1 7DN, UK
- School of Computer Science, The University of Manchester, Manchester M13 9PL, UK
- Philip J. Day
- Manchester Institute of Biotechnology, The University of Manchester, 131, Princess St, Manchester M1 7DN, UK; http://dbkgroup.org/; @dbkell; Tel: +44 (0)161 306 4492
- Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), The University of Manchester, 131, Princess St, Manchester M1 7DN, UK
- Faculty of Medical and Human Sciences, The University of Manchester, Manchester M13 9PT, UK
- Douglas B. Kell
- Manchester Institute of Biotechnology, The University of Manchester, 131, Princess St, Manchester M1 7DN, UK; http://dbkgroup.org/; @dbkell; Tel: +44 (0)161 306 4492
- School of Chemistry, The University of Manchester, Manchester M13 9PL, UK
- Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), The University of Manchester, 131, Princess St, Manchester M1 7DN, UK
19
Abstract
MOTIVATION: Next-generation sequencing technologies produce unprecedented amounts of data, leading to completely new research fields. One of these is metagenomics, the study of large-size DNA samples containing a multitude of diverse organisms. A key problem in metagenomics is to functionally and taxonomically classify the sequenced DNA, to which end the well-known BLAST program is usually used. But BLAST has dramatic resource requirements at metagenomic scales of data, imposing a high financial or technical burden on the researcher. Multiple attempts have been made to overcome these limitations and present a viable alternative to BLAST.
RESULTS: In this work we present Lambda, our own alternative for BLAST in the context of sequence classification. In our tests, Lambda often outperforms the best tools at reproducing BLAST's results and is the fastest compared with the current state of the art at comparable levels of sensitivity.
AVAILABILITY AND IMPLEMENTATION: Lambda was implemented in the SeqAn open-source C++ library for sequence analysis and is publicly available for download at http://www.seqan.de/projects/lambda.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hannes Hauswedell
- Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195 Berlin, Germany
- Jochen Singer
- Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195 Berlin, Germany
- Knut Reinert
- Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195 Berlin, Germany
20
Bacardit J, Widera P, Lazzarini N, Krasnogor N. Hard Data Analytics Problems Make for Better Data Analysis Algorithms: Bioinformatics as an Example. BIG DATA 2014; 2:164-176. [PMID: 25276500 PMCID: PMC4174911 DOI: 10.1089/big.2014.0023] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 06/03/2023]
Abstract
Data mining and knowledge discovery techniques have greatly progressed in the last decade. They are now able to handle larger and larger datasets, process heterogeneous information, integrate complex metadata, and extract and visualize new knowledge. Often these advances were driven by new challenges arising from real-world domains, with biology and biotechnology a prime source of diverse and hard (e.g., high volume, high throughput, high variety, and high noise) data analytics problems. The aim of this article is to show the broad spectrum of data mining tasks and challenges present in biological data, and how these challenges have driven us over the years to design new data mining and knowledge discovery procedures for biodata. This is illustrated with the help of two kinds of case studies. The first kind is focused on the field of protein structure prediction, where we have contributed in several areas: by designing, through regression, functions that can distinguish between good and bad models of a protein's predicted structure; by creating new measures to characterize aspects of a protein's structure associated with individual positions in a protein's sequence, measures containing information that might be useful for protein structure prediction; and by creating accurate estimators of these structural aspects. The second kind of case study is focused on omics data analytics, a class of biological data characterized for having extremely high dimensionalities. Our methods were able not only to generate very accurate classification models, but also to discover new biological knowledge that was later ratified by experimentalists. 
Finally, we describe several strategies to tightly integrate knowledge extraction and data mining in order to create a new class of biodata mining algorithms that can natively embrace the complexity of biological data, efficiently generate accurate information in the form of classification/regression models, and extract valuable new knowledge. Thus, a complete data-to-information-to-knowledge pipeline is presented.
Affiliation(s)
- Jaume Bacardit
- Interdisciplinary Computing and Complex BioSystems Research Group, School of Computing Science, Newcastle University
- Paweł Widera
- Interdisciplinary Computing and Complex BioSystems Research Group, School of Computing Science, Newcastle University
- Nicola Lazzarini
- Interdisciplinary Computing and Complex BioSystems Research Group, School of Computing Science, Newcastle University
- Natalio Krasnogor
- Interdisciplinary Computing and Complex BioSystems Research Group, School of Computing Science, Newcastle University
21
An information-theoretic classification of amino acids for the assessment of interfaces in protein-protein docking. J Mol Model 2013; 19:3901-10. [PMID: 23828247 DOI: 10.1007/s00894-013-1916-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/04/2013] [Accepted: 06/09/2013] [Indexed: 12/28/2022]
Abstract
Docking represents a versatile and powerful method to predict the geometry of protein-protein complexes. However, despite significant methodical advances, the identification of good docking solutions among a large number of false solutions still remains a difficult task. We have previously demonstrated that the formalism of mutual information (MI) from information theory can be adapted to protein docking, and we have now extended this approach to enhance its robustness and applicability. A large dataset consisting of 22,934 docking decoys derived from 203 different protein-protein complexes was used for an MI-based optimization of reduced amino acid alphabets representing the protein-protein interfaces. This optimization relied on a clustering analysis that allows one to estimate the mutual information of whole amino acid alphabets by considering all structural features simultaneously, rather than by treating them individually. This clustering approach is fast and can be applied in a similar fashion to the generation of reduced alphabets for other biological problems like fold recognition, sequence data mining, or secondary structure prediction. The reduced alphabets derived from the present work were converted into a scoring function for the evaluation of docking solutions, which is available for public use via the web service score-MI: http://score-MI.biochem.uni-erlangen.de.
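The idea of scoring a reduced amino acid alphabet by mutual information can be illustrated with a minimal Python sketch. It is not the method of the cited paper: the five-class grouping, the function names `reduce_sequence` and `mutual_information`, and the use of a single label per residue are all assumptions chosen for illustration. It only shows how residues might be mapped onto classes and how the mutual information between two discrete variables can be estimated from counts.

```python
from collections import Counter
from math import log2

# Hypothetical five-class reduced alphabet (an assumption for illustration,
# not one of the alphabets derived in the cited work).
REDUCED = {
    "hydrophobic": "AVLIMC",
    "aromatic": "FWY",
    "polar": "STNQGP",
    "positive": "KRH",
    "negative": "DE",
}
AA_TO_CLASS = {aa: cls for cls, members in REDUCED.items() for aa in members}


def reduce_sequence(seq):
    """Map a one-letter amino acid sequence onto reduced-alphabet classes."""
    return [AA_TO_CLASS[aa] for aa in seq]


def mutual_information(xs, ys):
    """Estimate I(X;Y) in bits from paired observations of two discrete variables."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )
```

Under these assumptions, one could call `mutual_information(reduce_sequence(seq), labels)` with one label per residue (for example, interface versus non-interface) and compare the value across candidate alphabets; a grouping that preserves more information about the label yields a higher score.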
22
Franco MA, Krasnogor N, Bacardit J. GAssist vs. BioHEL: critical assessment of two paradigms of genetics-based machine learning. Soft comput 2013. [DOI: 10.1007/s00500-013-1016-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 10/27/2022]
23
24
Bacardit J, Widera P, Márquez-Chamorro A, Divina F, Aguilar-Ruiz JS, Krasnogor N. Contact map prediction using a large-scale ensemble of rule sets and the fusion of multiple predicted structural features. Bioinformatics 2012; 28:2441-8. [PMID: 22833524 DOI: 10.1093/bioinformatics/bts472] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.9] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: The prediction of a protein's contact map has become in recent years, a crucial stepping stone for the prediction of the complete 3D structure of a protein. In this article, we describe a methodology for this problem that was shown to be successful in CASP8 and CASP9. The methodology is based on (i) the fusion of the prediction of a variety of structural aspects of protein residues, (ii) an ensemble strategy used to facilitate the training process and (iii) a rule-based machine learning system from which we can extract human-readable explanations of the predictor and derive useful information about the contact map representation.
RESULTS: The main part of the evaluation is the comparison against the sequence-based contact prediction methods from CASP9, where our method presented the best rank in five out of the six evaluated metrics. We also assess the impact of the size of the ensemble used in our predictor to show the trade-off between performance and training time of our method. Finally, we also study the rule sets generated by our machine learning system. From this analysis, we are able to estimate the contribution of the attributes in our representation and how these interact to derive contact predictions.
AVAILABILITY: http://icos.cs.nott.ac.uk/servers/psp.html.
CONTACT: natalio.krasnogor@nottingham.ac.uk
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jaume Bacardit
- Interdisciplinary Computing and Complex Systems research group, School of Computer Science, University of Nottingham, Nottingham, NG8 1BB, UK
25
Glaab E, Bacardit J, Garibaldi JM, Krasnogor N. Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data. PLoS One 2012; 7:e39932. [PMID: 22808075 PMCID: PMC3394775 DOI: 10.1371/journal.pone.0039932] [Citation(s) in RCA: 82] [Impact Index Per Article: 6.8] [Received: 01/29/2012] [Accepted: 05/29/2012] [Indexed: 12/19/2022] Open
Abstract
Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analysis such as small sample sizes, a large attribute space and high noise levels still limit its scientific and clinical applications. Increasing the interpretability of prediction models while retaining a high accuracy would help to exploit the information content in microarray data more effectively. For this purpose, we evaluate our rule-based evolutionary machine learning systems, BioHEL and GAssist, on three public microarray cancer datasets, obtaining simple rule-based models for sample classification. A comparison with other benchmark microarray sample classifiers based on three diverse feature selection algorithms suggests that these evolutionary learning techniques can compete with state-of-the-art methods like support vector machines. The obtained models reach accuracies above 90% in two-level external cross-validation, with the added value of facilitating interpretation by using only combinations of simple if-then-else rules. As a further benefit, a literature mining analysis reveals that prioritizations of informative genes extracted from BioHEL's classification rule sets can outperform gene rankings obtained from a conventional ensemble feature selection in terms of the pointwise mutual information between relevant disease terms and the standardized names of top-ranked genes.
Affiliation(s)
- Enrico Glaab
- Interdisciplinary Computing and Complex Systems (ICOS) Research Group, University of Nottingham, Nottingham, United Kingdom
- Jaume Bacardit
- Interdisciplinary Computing and Complex Systems (ICOS) Research Group, University of Nottingham, Nottingham, United Kingdom
- Jonathan M. Garibaldi
- Intelligent Modeling and Analysis (IMA) Research Group, University of Nottingham, Nottingham, United Kingdom
- Natalio Krasnogor
- Interdisciplinary Computing and Complex Systems (ICOS) Research Group, University of Nottingham, Nottingham, United Kingdom
26
Franco MA, Krasnogor N, Bacardit J. Analysing BioHEL using challenging boolean functions. EVOLUTIONARY INTELLIGENCE 2012. [DOI: 10.1007/s12065-012-0080-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/28/2022]
27
McGregor S, Polani D, Dautenhahn K. Generation of tactile maps for artificial skin. PLoS One 2011; 6:e26561. [PMID: 22102863 PMCID: PMC3213097 DOI: 10.1371/journal.pone.0026561] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 01/19/2011] [Accepted: 09/29/2011] [Indexed: 11/19/2022] Open
Abstract
Prior research has shown that representations of retinal surfaces can be learned from the intrinsic structure of visual sensory data in neural simulations, in robots, as well as by animals. Furthermore, representations of cochlear (frequency) surfaces can be learned from auditory data in neural simulations. Advances in hardware technology have allowed the development of artificial skin for robots, realising a new sensory modality which differs in important respects from vision and audition in its sensorimotor characteristics. This provides an opportunity to further investigate ordered sensory map formation using computational tools. We show that it is possible to learn representations of non-trivial tactile surfaces, which require topologically and geometrically involved three-dimensional embeddings. Our method automatically constructs a somatotopic map corresponding to the configuration of tactile sensors on a rigid body, using only intrinsic properties of the tactile data. The additional complexities involved in processing the tactile modality require the development of a novel multi-dimensional scaling algorithm. This algorithm, ANISOMAP, extends previous methods and outperforms them, producing high-quality reconstructions of tactile surfaces in both simulation and hardware tests. In addition, the reconstruction turns out to be robust to unanticipated hardware failure.
Affiliation(s)
- Simon McGregor
- University of Hertfordshire, Hatfield, Herts, United Kingdom
- Daniel Polani
- University of Hertfordshire, Hatfield, Herts, United Kingdom
28
Bassel GW, Glaab E, Marquez J, Holdsworth MJ, Bacardit J. Functional network construction in Arabidopsis using rule-based machine learning on large-scale data sets. THE PLANT CELL 2011; 23:3101-16. [PMID: 21896882 PMCID: PMC3203449 DOI: 10.1105/tpc.111.088153] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Received: 06/14/2011] [Revised: 08/01/2011] [Accepted: 08/25/2011] [Indexed: 05/17/2023]
Abstract
The meta-analysis of large-scale postgenomics data sets within public databases promises to provide important novel biological knowledge. Statistical approaches including correlation analyses in coexpression studies of gene expression have emerged as tools to elucidate gene function using these data sets. Here, we present a powerful and novel alternative methodology to computationally identify functional relationships between genes from microarray data sets using rule-based machine learning. This approach, termed "coprediction," is based on the collective ability of groups of genes co-occurring within rules to accurately predict the developmental outcome of a biological system. We demonstrate the utility of coprediction as a powerful analytical tool using publicly available microarray data generated exclusively from Arabidopsis thaliana seeds to compute a functional gene interaction network, termed Seed Co-Prediction Network (SCoPNet). SCoPNet predicts functional associations between genes acting in the same developmental and signal transduction pathways irrespective of the similarity in their respective gene expression patterns. Using SCoPNet, we identified four novel regulators of seed germination (ALTERED SEED GERMINATION5, 6, 7, and 8), and predicted interactions at the level of transcript abundance between these novel and previously described factors influencing Arabidopsis seed germination. An online Web tool to query SCoPNet has been developed as a community resource to dissect seed biology and is available at http://www.vseed.nottingham.ac.uk/.
Affiliation(s)
- George W Bassel
- Division of Plant and Crop Sciences, University of Nottingham, Loughborough, Leicestershire, UK.
29
Tress ML, Valencia A. Predicted residue-residue contacts can help the scoring of 3D models. Proteins 2010; 78:1980-91. [PMID: 20408174 DOI: 10.1002/prot.22714] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Indexed: 11/06/2022]
Abstract
During the 7th Critical Assessment of Protein Structure Prediction (CASP7) experiment, it was suggested that the real value of predicted residue-residue contacts might lie in the scoring of 3D model structures. Here, we have carried out a detailed reassessment of the contact predictions made during the recent CASP8 experiment to determine whether predicted contacts might aid in the selection of close-to-native structures or be a useful tool for scoring 3D structural models. We used the contacts predicted by the CASP8 residue-residue contact prediction groups to select models for each target domain submitted to the experiment. We found that the information contained in the predicted residue-residue contacts would probably have helped in the selection of 3D models in the free modeling regime and over the harder comparative modeling targets. Indeed, in many cases, the models selected using just the predicted contacts had better GDT-TS scores than all but the best 3D prediction groups. Despite the well-known low accuracy of residue-residue contact predictions, it is clear that the predictive power of contacts can be useful in 3D model prediction strategies.
Affiliation(s)
- Michael L Tress
- Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain.
30
Liu X, Zhao YP. A scheme for multiple sequence alignment optimization--an improvement based on family representative mechanics features. J Theor Biol 2009; 261:593-7. [PMID: 19733185 DOI: 10.1016/j.jtbi.2009.08.028] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 04/22/2009] [Revised: 08/26/2009] [Accepted: 08/26/2009] [Indexed: 10/20/2022]
Abstract
As a basic tool of modern biology, sequence alignment can provide us useful information in fold, function, and active site of protein. For many cases, the increased quality of sequence alignment means a better performance. The motivation of present work is to increase ability of the existing scoring scheme/algorithm by considering residue-residue correlations better. Based on a coarse-grained approach, the hydrophobic force between each pair of residues is written out from protein sequence. It results in the construction of an intramolecular hydrophobic force network that describes the whole residue-residue interactions of each protein molecule, and characterizes protein's biological properties in the hydrophobic aspect. A former work has suggested that such network can characterize the top weighted feature regarding hydrophobicity. Moreover, for each homologous protein of a family, the corresponding network shares some common and representative family characters that eventually govern the conservation of biological properties during protein evolution. In present work, we score such family representative characters of a protein by the deviation of its intramolecular hydrophobic force network from that of background. Such score can assist the existing scoring schemes/algorithms, and boost up the ability of multiple sequences alignment, e.g. achieving a prominent increase (approximately 50%) in searching the structurally alike residue segments at a low identity level. As the theoretical basis is different, the present scheme can assist most existing algorithms, and improve their efficiency remarkably.
Affiliation(s)
- Xin Liu
- The State Key Laboratory of Nonlinear Mechanics, Institute of Mechanics, Chinese Academy of Sciences, No. 15 Beisihuanxi Road, Beijing 100190, China.
31
Folding by numbers: primary sequence statistics and their use in studying protein folding. Int J Mol Sci 2009; 10:1567-1589. [PMID: 19468326 PMCID: PMC2680634 DOI: 10.3390/ijms10041567] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 01/30/2009] [Revised: 03/30/2009] [Accepted: 04/02/2009] [Indexed: 11/16/2022] Open
Abstract
The exponential growth over the past several decades in the quantity of both primary sequence data available and the number of protein structures determined has provided a wealth of information describing the relationship between protein primary sequence and tertiary structure. This growing repository of data has served as a prime source for statistical analysis, where underlying relationships between patterns of amino acids and protein structure can be uncovered. Here, we survey the main statistical approaches that have been used for identifying patterns within protein sequences, and discuss sequence pattern research as it relates to both secondary and tertiary protein structure. Limitations to statistical analyses are discussed, and a context for their role within the field of protein folding is given. We conclude by describing a novel statistical study of residue patterning in β-strands, which finds that hydrophobic (i,i+2) pairing in β-strands occurs more often than expected at locations near strand termini. Interpretations involving β-sheet nucleation and growth are discussed.
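The (i,i+2) pairing statistic mentioned in the abstract can be made concrete with a short Python sketch. It does not reproduce the study's analysis: the hydrophobic residue set and the helper names `hydrophobic_i_i2_positions` and `distance_to_nearest_terminus` are assumptions for illustration. It only shows how such pairs could be located in a β-strand sequence and related to the strand termini.

```python
# Assumed hydrophobic residue set; the cited study may define it differently.
HYDROPHOBIC = set("AVLIMFWC")


def hydrophobic_i_i2_positions(strand):
    """Return every index i where residues i and i+2 are both hydrophobic."""
    return [
        i
        for i in range(len(strand) - 2)
        if strand[i] in HYDROPHOBIC and strand[i + 2] in HYDROPHOBIC
    ]


def distance_to_nearest_terminus(i, length):
    """Number of residues between the pair (i, i+2) and the closer strand end."""
    return min(i, length - 1 - (i + 2))
```

For a hypothetical strand such as `"VTVKV"`, the first function reports the pairs starting at positions 0 and 2, and the second places both of them directly at a strand terminus. Tallying these distances over many strands and comparing against a shuffled-sequence baseline would be one way to test whether pairing is enriched near termini.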