1
|
Mahmud S, Morehead A, Cheng J. Accurate prediction of protein tertiary structural changes induced by single-site mutations with equivariant graph neural networks. bioRxiv 2023:2023.10.03.560758. [PMID: 37873289 PMCID: PMC10592624 DOI: 10.1101/2023.10.03.560758] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2023]
Abstract
Predicting the change of protein tertiary structure caused by single-site mutations is important for studying protein structure, function, and interaction. Even though computational protein structure prediction methods such as AlphaFold can predict the overall tertiary structures of most proteins rather accurately, they are not sensitive enough to accurately predict the structural changes induced by single-site amino acid mutations on proteins. Specialized mutation prediction methods mostly focus on predicting the overall stability or function changes caused by mutations without attempting to predict the exact mutation-induced structural changes, limiting their use in protein mutation studies. In this work, we develop the first deep learning method based on equivariant graph neural networks (EGNN) to directly predict the tertiary structural changes caused by single-site mutations and the tertiary structure of any protein mutant from the structure of its wild-type counterpart. The results show that it performs substantially better in predicting the tertiary structures of protein mutants than the widely used protein structure prediction method AlphaFold.
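To illustrate what "equivariant" means for a coordinate-predicting graph network, here is a minimal sketch of one distance-weighted coordinate update. It is not the authors' model: the fixed `edge_weight` function stands in for a learned network, and the coordinates, `step` size, and function names are hypothetical. The point is only that an update built from difference vectors and distances moves with the input under translation (and rotation).

```python
def egnn_coordinate_update(coords, edge_weight=lambda d2: 1.0 / (1.0 + d2), step=0.1):
    """One simplified E(n)-equivariant coordinate update.

    Each point moves along the difference vectors to all other points,
    scaled by a function of the squared distance only, so translating or
    rotating the input translates or rotates the output in the same way.
    """
    updated = []
    for i, xi in enumerate(coords):
        shift = [0.0, 0.0, 0.0]
        for j, xj in enumerate(coords):
            if i == j:
                continue
            diff = [a - b for a, b in zip(xi, xj)]
            d2 = sum(c * c for c in diff)
            w = edge_weight(d2)  # assumption: stand-in for a learned edge network
            shift = [s + w * c for s, c in zip(shift, diff)]
        updated.append([a + step * s for a, s in zip(xi, shift)])
    return updated


# Toy "residue" coordinates (hypothetical values, not from any real structure).
wild_type = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0]]
translation = [10.0, -3.0, 2.0]
shifted = [[a + t for a, t in zip(p, translation)] for p in wild_type]

out = egnn_coordinate_update(wild_type)
out_shifted = egnn_coordinate_update(shifted)
```

Updating the translated input gives the translated output, which is the property that lets such a network predict coordinate changes without depending on how the wild-type structure happens to be placed in space.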
|
2
|
A user-guided Bayesian framework for ensemble feature selection in life science applications (UBayFS). Mach Learn 2022. [DOI: 10.1007/s10994-022-06221-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/15/2022]
Abstract
Feature selection reduces the complexity of high-dimensional datasets and helps to gain insights into systematic variation in the data. These aspects are essential in domains that rely on model interpretability, such as life sciences. We propose a (U)ser-Guided (Bay)esian Framework for (F)eature (S)election, UBayFS, an ensemble feature selection technique embedded in a Bayesian statistical framework. Our generic approach considers two sources of information: data and domain knowledge. From data, we build an ensemble of feature selectors, described by a multinomial likelihood model. Using domain knowledge, the user guides UBayFS by weighting features and penalizing feature blocks or combinations, implemented via a Dirichlet-type prior distribution. Hence, the framework combines three main aspects: ensemble feature selection, expert knowledge, and side constraints. Our experiments demonstrate that UBayFS (a) allows for a balanced trade-off between user knowledge and data observations and (b) achieves accurate and robust results.
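The combination of a multinomial likelihood with a Dirichlet prior can be sketched with the standard conjugate update: posterior parameters are the prior weights plus the ensemble's selection counts. The snippet below is a simplified illustration under that assumption only. It ignores the side constraints and block penalties mentioned in the abstract, and the function name, feature names, and weights are hypothetical rather than taken from the UBayFS software.

```python
from collections import Counter


def posterior_feature_weights(ensemble_selections, prior_weights):
    """Dirichlet-multinomial update: posterior parameter = prior weight + count.

    ensemble_selections: one set of selected features per ensemble member.
    prior_weights:       user-supplied Dirichlet parameters, one per feature.
    Returns the posterior mean importance of each feature (sums to 1).
    """
    counts = Counter(f for selection in ensemble_selections for f in selection)
    posterior = {f: prior_weights[f] + counts.get(f, 0) for f in prior_weights}
    total = sum(posterior.values())
    return {f: a / total for f, a in posterior.items()}


# Hypothetical example: the user up-weights g3, the data favour g1 and g2.
prior = {"g1": 1.0, "g2": 1.0, "g3": 5.0, "g4": 1.0}
selections = [{"g1", "g2"}, {"g1", "g3"}, {"g1", "g2"}]
weights = posterior_feature_weights(selections, prior)
top_two = sorted(weights, key=weights.get, reverse=True)[:2]
```

Here the strongly weighted prior keeps `g3` on top even though the ensemble selected it only once, which shows the trade-off between user knowledge and data that the abstract describes.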
|
3
|
|
4
|
Abstract
Summary
Statistics derived from the eigenvalues of sample covariance matrices are called spectral statistics, and they play a central role in multivariate testing. Although bootstrap methods are an established approach to approximating the laws of spectral statistics in low-dimensional problems, such methods are relatively unexplored in the high-dimensional setting. The aim of this article is to focus on linear spectral statistics as a class of prototypes for developing a new bootstrap in high dimensions, a method we refer to as the spectral bootstrap. In essence, the proposed method originates from the parametric bootstrap and is motivated by the fact that in high dimensions it is difficult to obtain a nonparametric approximation to the full data-generating distribution. From a practical standpoint, the method is easy to use and allows the user to circumvent the difficulties of complex asymptotic formulas for linear spectral statistics. In addition to proving the consistency of the proposed method, we present encouraging empirical results in a variety of settings. Lastly, and perhaps most interestingly, we show through simulations that the method can be applied successfully to statistics outside the class of linear spectral statistics, such as the largest sample eigenvalue and others.
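For orientation, a linear spectral statistic is usually written as an average of a test function over the sample eigenvalues. The notation below is assumed rather than taken from the abstract: $n$ observations $X_1, \dots, X_n$ in dimension $p$, sample covariance $\widehat{\Sigma}_n$ with eigenvalues $\lambda_1, \dots, \lambda_p$, and a fixed function $f$. Normalization conventions (for example $1/n$ versus $1/(n-1)$) vary between authors.

```latex
T_n(f) = \frac{1}{p} \sum_{j=1}^{p} f\bigl(\lambda_j(\widehat{\Sigma}_n)\bigr),
\qquad
\widehat{\Sigma}_n = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^{\top}
```

Taking $f(x) = \log x$ gives $\tfrac{1}{p}\log\det\widehat{\Sigma}_n$, and $f(x) = x$ gives the normalized trace. The largest sample eigenvalue is not of this form, which is why the abstract treats it as a statistic outside the class.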
Affiliation(s)
- Miles E Lopes
- Department of Statistics, University of California, One Shields Avenue, Davis, California 95616, USA
| | - Andrew Blandino
- Department of Statistics, University of California, One Shields Avenue, Davis, California 95616, USA
| | - Alexander Aue
- Department of Statistics, University of California, One Shields Avenue, Davis, California 95616, USA
|
5
|
2D compressed learning: support matrix machine with bilinear random projections. Mach Learn 2019. [DOI: 10.1007/s10994-019-05804-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
|
6
|
Unsupervised dimensionality reduction versus supervised regularization for classification from sparse data. Data Min Knowl Discov 2019. [DOI: 10.1007/s10618-019-00616-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 10/27/2022]
|
7
|
Mocanu DC, Mocanu E, Stone P, Nguyen PH, Gibescu M, Liotta A. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nat Commun 2018; 9:2383. [PMID: 29921910 PMCID: PMC6008460 DOI: 10.1038/s41467-018-04316-3] [Citation(s) in RCA: 67] [Impact Index Per Article: 9.6] [Received: 07/26/2017] [Accepted: 04/20/2018] [Indexed: 01/27/2023] Open
Abstract
Through the success of deep learning in various domains, artificial neural networks are currently among the most used artificial intelligence methods. Taking inspiration from the network properties of biological neural networks (e.g. sparsity, scale-freeness), we argue that (contrary to general practice) artificial neural networks, too, should not have fully-connected layers. Here we propose sparse evolutionary training of artificial neural networks, an algorithm which evolves an initial sparse topology (Erdős-Rényi random graph) of two consecutive layers of neurons into a scale-free topology, during learning. Our method replaces the fully-connected layers of artificial neural networks with sparse ones before training, quadratically reducing the number of parameters with no decrease in accuracy. We demonstrate our claims on restricted Boltzmann machines, multi-layer perceptrons, and convolutional neural networks for unsupervised and supervised learning on 15 datasets. Our approach has the potential to enable artificial neural networks to scale up beyond what is currently possible.
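The two ingredients named in the abstract, a sparse Erdős-Rényi starting topology and its evolution during learning, can be sketched as follows. This is a simplified illustration, not the authors' code. The parameter names `epsilon` (sparsity level) and `zeta` (fraction rewired), the connection probability, and the rule "drop the weights closest to zero, then regrow the same number at random" are assumptions of this sketch and are not stated in the abstract. Training between rewiring steps is omitted.

```python
import random


def init_sparse_connections(n_in, n_out, epsilon, rng):
    """Erdős-Rényi style sparse layer: keep each possible link with probability p."""
    p = min(1.0, epsilon * (n_in + n_out) / (n_in * n_out))
    return {
        (i, j): rng.gauss(0.0, 0.1)
        for i in range(n_in)
        for j in range(n_out)
        if rng.random() < p
    }


def prune_and_regrow(weights, n_in, n_out, zeta, rng):
    """Drop the fraction zeta of weights closest to zero, then add as many random links."""
    n_remove = int(zeta * len(weights))
    by_magnitude = sorted(weights, key=lambda k: abs(weights[k]))
    survivors = {k: weights[k] for k in by_magnitude[n_remove:]}
    # Candidate positions are all links not currently present (pruned ones included).
    free = [(i, j) for i in range(n_in) for j in range(n_out) if (i, j) not in survivors]
    for k in rng.sample(free, n_remove):
        survivors[k] = rng.gauss(0.0, 0.1)
    return survivors


rng = random.Random(0)
layer = init_sparse_connections(n_in=100, n_out=50, epsilon=10, rng=rng)
rewired = prune_and_regrow(layer, n_in=100, n_out=50, zeta=0.3, rng=rng)
```

With this choice of `p`, the expected number of links grows with `n_in + n_out` rather than `n_in * n_out`, which is the sense in which the parameter count is reduced relative to a dense layer. Rewiring keeps the number of links constant.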
Affiliation(s)
- Decebal Constantin Mocanu
- Department of Mathematics and Computer Science, Eindhoven University of Technology, De Rondom 70, 5612 AP, Eindhoven, The Netherlands
- Department of Electrical Engineering, Eindhoven University of Technology, De Rondom 70, 5612 AP, Eindhoven, The Netherlands
| | - Elena Mocanu
- Department of Electrical Engineering, Eindhoven University of Technology, De Rondom 70, 5612 AP, Eindhoven, The Netherlands
- Department of Mechanical Engineering, Eindhoven University of Technology, De Rondom 70, 5612 AP, Eindhoven, The Netherlands
| | - Peter Stone
- Department of Computer Science, The University of Texas at Austin, 2317 Speedway, Stop D9500, Austin, TX, 78712-1757, USA
| | - Phuong H Nguyen
- Department of Electrical Engineering, Eindhoven University of Technology, De Rondom 70, 5612 AP, Eindhoven, The Netherlands
| | - Madeleine Gibescu
- Department of Electrical Engineering, Eindhoven University of Technology, De Rondom 70, 5612 AP, Eindhoven, The Netherlands
| | - Antonio Liotta
- Data Science Centre, University of Derby, Lonsdale House, Quaker Way, Derby, DE1 3HD, UK
|
8
|
Abstract
Deleterious or 'disease-associated' mutations are mutations that lead to disease with high phenotype penetrance: they are inherited in a simple Mendelian manner, or, in the case of cancer, accumulate in somatic cells leading directly to disease. However, in some cases, the amino acid that is substituted resulting in disease is the wild-type native residue in the functionally equivalent protein in another species. Such examples are known as 'compensated pathogenic deviations' (CPDs) because, somewhere in the second species, there must be compensatory mutations that allow the protein to function normally despite having a residue which would cause disease in the first species. Depending on the nature of the mutations, compensation can occur in the same protein, or in a different protein with which it interacts. In principle, compensation can be achieved by a single mutation (most probably structurally close to the CPD), or by the cumulative effect of several mutations. Although it is clear that these effects occur in proteins, compensatory mutations are also important in RNA potentially having an impact on disease. As a much simpler molecule, RNA provides an interesting model for understanding mechanisms of compensatory effects, both by looking at naturally occurring RNA molecules and as a means of computational simulation. This review surveys the rather limited literature that has explored these effects. Understanding the nature of CPDs is important in understanding traversal along fitness landscape valleys in evolution. It could also have applications in treating diseases that result from such mutations.
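The definition of a compensated pathogenic deviation given above translates directly into a check against an aligned ortholog: a disease mutation is a CPD candidate when its mutant residue is the native residue at the same alignment column in another species. The sketch below assumes a gap-free, column-for-column alignment and uses made-up sequences and mutations; the function name is hypothetical.

```python
def find_cpds(disease_mutations, ortholog_aligned):
    """Return disease mutations whose mutant residue is native in the ortholog.

    disease_mutations: list of (position, wild_type, mutant) with 1-based
                       positions in the reference protein.
    ortholog_aligned:  ortholog sequence aligned column-for-column to the
                       reference (assumption: no gaps, same length).
    """
    cpds = []
    for pos, wt, mut in disease_mutations:
        if ortholog_aligned[pos - 1] == mut:
            cpds.append((pos, wt, mut))
    return cpds


# Hypothetical toy sequences and mutations, not real proteins.
reference = "MKTAYIAK"
ortholog = "MKSAYIVK"
mutations = [(3, "T", "S"), (7, "A", "P")]
cpd_hits = find_cpds(mutations, ortholog)
```

In this toy case only T3S is flagged, because serine is the ortholog's native residue at position 3. Finding the compensating change elsewhere in the ortholog is the harder problem the review discusses.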
|
9
|
Ashworth J, Bernard B, Reynolds S, Plaisier CL, Shmulevich I, Baliga NS. Structure-based predictions broadly link transcription factor mutations to gene expression changes in cancers. Nucleic Acids Res 2014; 42:12973-83. [PMID: 25378323 PMCID: PMC4245936 DOI: 10.1093/nar/gku1031] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 08/27/2014] [Revised: 10/09/2014] [Accepted: 10/10/2014] [Indexed: 12/24/2022] Open
Abstract
Thousands of unique mutations in transcription factors (TFs) arise in cancers, and the functional and biological roles of relatively few of these have been characterized. Here, we used structure-based methods developed specifically for DNA-binding proteins to systematically predict the consequences of mutations in several TFs that are frequently mutated in cancers. The explicit consideration of protein-DNA interactions was crucial to explain the roles and prevalence of mutations in TP53 and RUNX1 in cancers, and resulted in a higher specificity of detection for known p53-regulated genes among genetic associations between TP53 genotypes and genome-wide expression in The Cancer Genome Atlas, compared to existing methods of mutation assessment. Biophysical predictions also indicated that the relative prevalence of TP53 missense mutations in cancer is proportional to their thermodynamic impacts on protein stability and DNA binding, which is consistent with the selection for the loss of p53 transcriptional function in cancers. Structure and thermodynamics-based predictions of the impacts of missense mutations that focus on specific molecular functions may be increasingly useful for the precise and large-scale inference of aberrant molecular phenotypes in cancer and other complex diseases.
Affiliation(s)
| | - Brady Bernard
- Institute for Systems Biology, Seattle, WA 98109, USA
|
10
|
Zeng XQ, Li GZ. Dimension reduction for p53 protein recognition by using incremental partial least squares. IEEE Trans Nanobioscience 2014; 13:73-9. [PMID: 24893361 DOI: 10.1109/tnb.2014.2319234] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
Abstract
Mutated p53, an important tumor suppressor protein, is found in many kinds of human cancers, and restoring active p53 leads to tumor regression. In recent years, more and more data have been extracted from biophysical simulations, so modelling mutant p53 transcriptional activity suffers from a huge number of instances and a high feature dimension. Incremental feature extraction is effective for facilitating the analysis of large-scale data. However, most current incremental feature extraction methods are not suitable for processing big data with high feature dimension. Partial Least Squares (PLS) has been demonstrated to be an effective dimension reduction technique for classification. In this paper, we design a highly efficient and powerful algorithm named Incremental Partial Least Squares (IPLS), which conducts a two-stage extraction process. In the first stage, the PLS target function is adapted to be incremental by updating the historical mean to extract the leading projection direction. In the second stage, the other projection directions are calculated through the equivalence between the PLS vectors and the Krylov sequence. We compare IPLS with state-of-the-art incremental feature extraction methods such as Incremental Principal Component Analysis, Incremental Maximum Margin Criterion and Incremental Inter-class Scatter on real p53 protein data. Empirical results show that IPLS performs better than the other methods in terms of balanced classification accuracy.
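The first stage can be pictured with a standard fact about single-response PLS: the leading direction is proportional to the cross-covariance between the centered features and the centered response. The sketch below keeps that cross-covariance up to date one sample at a time using running means. It is an illustration of the idea only, not the IPLS algorithm itself; the class name and toy data are hypothetical, and the second (Krylov) stage is not shown.

```python
import math


class IncrementalLeadingPLS:
    """Running estimate of the leading PLS direction w1, proportional to Xc^T yc."""

    def __init__(self, n_features):
        self.n = 0
        self.mean_x = [0.0] * n_features
        self.mean_y = 0.0
        self.cross = [0.0] * n_features  # running sum of (x - mean_x) * (y - mean_y)

    def update(self, x, y):
        self.n += 1
        dx = [xi - m for xi, m in zip(x, self.mean_x)]  # deviation from the OLD mean
        self.mean_x = [m + d / self.n for m, d in zip(self.mean_x, dx)]
        self.mean_y += (y - self.mean_y) / self.n
        # Welford-style co-moment update: old-mean deviation in x, new-mean deviation in y.
        self.cross = [c + d * (y - self.mean_y) for c, d in zip(self.cross, dx)]

    def direction(self):
        norm = math.sqrt(sum(c * c for c in self.cross))
        return [c / norm for c in self.cross]


# Hypothetical toy stream of (features, activity) pairs.
stream = [([1.0, 2.0], 1.0), ([2.0, 1.0], 0.0), ([3.0, 5.0], 3.0), ([4.0, 3.0], 2.0)]
model = IncrementalLeadingPLS(n_features=2)
for features, target in stream:
    model.update(features, target)
w1 = model.direction()
```

After the four updates, `model.cross` matches the cross-product computed from the full centered batch, so no past samples need to be stored.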
|
11
|
Geetha Ramani R, Jacob SG. Prediction of P53 mutants (multiple sites) transcriptional activity based on structural (2D&3D) properties. PLoS One 2013; 8:e55401. [PMID: 23468845 PMCID: PMC3572112 DOI: 10.1371/journal.pone.0055401] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Received: 11/19/2012] [Accepted: 12/21/2012] [Indexed: 01/05/2023] Open
Abstract
Prediction of secondary site mutations that reinstate mutated p53 to normalcy has been the focus of intense research in the recent past, owing to the fact that p53 mutants have been implicated in more than half of all human cancers and restoration of p53 causes tumor regression. However, laboratory investigations are often laborious and resource intensive, whereas computational techniques could well surmount these drawbacks. In view of this, we formulated a novel approach utilizing computational techniques to predict the transcriptional activity of multiple site (one-site to five-site) p53 mutants. The optimal MCC values obtained by the proposed approach on prediction of one-site, two-site, three-site, four-site and five-site mutants were 0.775, 0.341, 0.784, 0.916 and 0.655 respectively, the highest reported thus far in the literature. We have also demonstrated that 2D and 3D features generate higher prediction accuracy of p53 activity, and our findings revealed the optimal results for prediction of p53 status reported to date. We believe detection of the secondary site mutations that suppress tumor growth may facilitate better understanding of the relationship between p53 structure and function and further knowledge on the molecular mechanisms and biological activity of p53, a targeted source for cancer therapy. We expect that our prediction methods and reported results may provide useful insights on p53 functional mechanisms and generate more avenues for utilizing computational techniques in biological data analysis.
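The MCC values quoted above are Matthews correlation coefficients, computed from the four cells of a binary confusion matrix. A minimal implementation of the standard formula is shown below; the counts in the example are hypothetical and unrelated to the paper's data.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: active mutants are "positive", inactive are "negative".
example_mcc = matthews_corrcoef(tp=40, tn=45, fp=5, fn=10)
```

The coefficient ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (perfect), and unlike plain accuracy it stays informative when active and inactive mutants are unevenly represented.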
Affiliation(s)
- R. Geetha Ramani
- Department of Information Science and Technology, College of Engineering, Guindy, Anna University, Chennai, Tamilnadu, India
| | - Shomona Gracia Jacob
- Faculty of Information and Communication Engineering, Anna University, Chennai, Tamilnadu, India
|
12
|
Wassman CD, Baronio R, Demir Ö, Wallentine BD, Chen CK, Hall LV, Salehi F, Lin DW, Chung BP, Wesley Hatfield G, Richard Chamberlin A, Luecke H, Lathrop RH, Kaiser P, Amaro RE. Computational identification of a transiently open L1/S3 pocket for reactivation of mutant p53. Nat Commun 2013; 4:1407. [PMID: 23360998 PMCID: PMC3562459 DOI: 10.1038/ncomms2361] [Citation(s) in RCA: 173] [Impact Index Per Article: 14.4] [Received: 06/06/2012] [Accepted: 12/06/2012] [Indexed: 12/22/2022] Open
Abstract
The tumour suppressor p53 is the most frequently mutated gene in human cancer. Reactivation of mutant p53 by small molecules is an exciting potential cancer therapy. Although several compounds restore wild-type function to mutant p53, their binding sites and mechanisms of action are elusive. Here computational methods identify a transiently open binding pocket between loop L1 and sheet S3 of the p53 core domain. Mutation of residue Cys124, located at the centre of the pocket, abolishes p53 reactivation of mutant R175H by PRIMA-1, a known reactivation compound. Ensemble-based virtual screening against this newly revealed pocket selects stictic acid as a potential p53 reactivation compound. In human osteosarcoma cells, stictic acid exhibits dose-dependent reactivation of p21 expression for mutant R175H more strongly than does PRIMA-1. These results indicate the L1/S3 pocket as a target for pharmaceutical reactivation of p53 mutants.
Affiliation(s)
- Christopher D. Wassman
- Department of Computer Science, University of California, Irvine, Irvine, California 92697, USA
- Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697, USA
- These authors contributed equally to this work
- Present address: Google Inc., 1600 Amphitheatre Parkway Mountain View, California 94043, USA
| | - Roberta Baronio
- Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697, USA
- Department of Biological Chemistry, University of California, Irvine, Irvine, California 92697, USA
- These authors contributed equally to this work
| | - Özlem Demir
- Department of Pharmaceutical Sciences, University of California, Irvine, Irvine, California 92697, USA
- These authors contributed equally to this work
- Present addresses: Department of Chemistry and Biochemistry, University of California, San Diego; La Jolla, California 92093, USA
| | - Brad D. Wallentine
- Department of Molecular Biology and Biochemistry, University of California, Irvine, Irvine, California 92697, USA
| | - Chiung-Kuang Chen
- Department of Molecular Biology and Biochemistry, University of California, Irvine, Irvine, California 92697, USA
| | - Linda V. Hall
- Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697, USA
- Department of Biological Chemistry, University of California, Irvine, Irvine, California 92697, USA
| | - Faezeh Salehi
- Department of Computer Science, University of California, Irvine, Irvine, California 92697, USA
- Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697, USA
| | - Da-Wei Lin
- Department of Biological Chemistry, University of California, Irvine, Irvine, California 92697, USA
| | - Benjamin P. Chung
- Department of Biological Chemistry, University of California, Irvine, Irvine, California 92697, USA
| | - G. Wesley Hatfield
- Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697, USA
- Department of Microbiology and Molecular Genetics, University of California, Irvine, Irvine, California 92697, USA
- Department of Chemical Engineering and Materials Science, University of California, Irvine, Irvine, California 92697, USA
| | - A. Richard Chamberlin
- Department of Pharmaceutical Sciences, University of California, Irvine, Irvine, California 92697, USA
- Department of Chemistry, University of California, Irvine, Irvine, California 92697, USA
- Chao Family Comprehensive Cancer Center, University of California, Irvine, Irvine, California 92697, USA
| | - Hartmut Luecke
- Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697, USA
- Department of Molecular Biology and Biochemistry, University of California, Irvine, Irvine, California 92697, USA
- Chao Family Comprehensive Cancer Center, University of California, Irvine, Irvine, California 92697, USA
- Department of Physiology and Biophysics, University of California, Irvine, Irvine, California 92697, USA
- Center for Biomembrane Systems, University of California, Irvine, Irvine, California 92697, USA
| | - Richard H. Lathrop
- Department of Computer Science, University of California, Irvine, Irvine, California 92697, USA
- Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697, USA
- Chao Family Comprehensive Cancer Center, University of California, Irvine, Irvine, California 92697, USA
- Department of Biomedical Engineering, University of California, Irvine, Irvine, California 92697, USA
| | - Peter Kaiser
- Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697, USA
- Department of Biological Chemistry, University of California, Irvine, Irvine, California 92697, USA
- Chao Family Comprehensive Cancer Center, University of California, Irvine, Irvine, California 92697, USA
| | - Rommie E. Amaro
- Department of Computer Science, University of California, Irvine, Irvine, California 92697, USA
- Department of Pharmaceutical Sciences, University of California, Irvine, Irvine, California 92697, USA
- Department of Chemistry, University of California, Irvine, Irvine, California 92697, USA
- Present addresses: Department of Chemistry and Biochemistry, University of California, San Diego; La Jolla, California 92093, USA
|
13
|
Huang T, Niu S, Xu Z, Huang Y, Kong X, Cai YD, Chou KC. Predicting transcriptional activity of multiple site p53 mutants based on hybrid properties. PLoS One 2011; 6:e22940. [PMID: 21857971 PMCID: PMC3152557 DOI: 10.1371/journal.pone.0022940] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Received: 04/23/2011] [Accepted: 07/01/2011] [Indexed: 11/26/2022] Open
Abstract
Mutated p53, an important tumor suppressor protein, is found in many kinds of human cancers, and restoring active p53 leads to tumor regression. In this work, we developed a new computational method to predict the transcriptional activity for one-, two-, three- and four-site p53 mutants, respectively. With the approach from the general form of pseudo amino acid composition, we used eight types of features to represent the mutation and then selected the optimal prediction features based on the maximum relevance, minimum redundancy, and incremental feature selection methods. The Matthews correlation coefficients (MCC) obtained by using the nearest neighbor algorithm and jackknife cross validation for one-, two-, three- and four-site p53 mutants were 0.678, 0.314, 0.705, and 0.907, respectively. Further analysis of the optimal feature set revealed that the 2D (two-dimensional) structure features composed the largest part of the optimal feature set and may have played the most important roles in all four types of p53 mutant active status prediction. The optimal feature sets, especially those at the top level, also demonstrated that the 3D structure features, conservation, and physicochemical and biochemical properties of amino acids near the mutation site played quite important roles in p53 mutant active status prediction. Our study has provided a new and promising approach for finding functionally important sites and the relevant features for in-depth study of p53 protein and its action mechanism.
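The evaluation protocol mentioned above, a nearest neighbor classifier scored by jackknife (leave-one-out) cross validation, can be sketched in a few lines. The two-dimensional feature vectors and labels below are hypothetical stand-ins for the paper's selected features, and squared Euclidean distance is an assumption of this sketch.

```python
def jackknife_1nn(features, labels):
    """Leave-one-out predictions from a 1-nearest-neighbour classifier."""
    predictions = []
    for i, query in enumerate(features):
        best_label, best_dist = None, float("inf")
        for j, other in enumerate(features):
            if i == j:
                continue  # the held-out sample never votes for itself
            dist = sum((a - b) ** 2 for a, b in zip(query, other))
            if dist < best_dist:
                best_label, best_dist = labels[j], dist
        predictions.append(best_label)
    return predictions


# Hypothetical encodings of six mutants (1 = active, 0 = inactive).
mutant_features = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.9, 0.8], [0.8, 0.9], [0.85, 0.75]]
mutant_labels = [1, 1, 1, 0, 0, 0]
loo_predictions = jackknife_1nn(mutant_features, mutant_labels)
```

Each mutant is predicted from all the others, so every sample serves once as test data. The resulting predictions can then be compared with the true labels to obtain a score such as MCC.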
Affiliation(s)
- Tao Huang
- Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, People's Republic of China
- Shanghai Center for Bioinformation Technology, Shanghai, People's Republic of China
| | - Shen Niu
- Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, People's Republic of China
| | - Zhongping Xu
- Key Laboratory of Stem Cell Biology, Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences and Shanghai Jiao Tong University School of Medicine, Shanghai, People's Republic of China
| | - Yun Huang
- Key Laboratory of Stem Cell Biology, Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences and Shanghai Jiao Tong University School of Medicine, Shanghai, People's Republic of China
| | - Xiangyin Kong
- Key Laboratory of Stem Cell Biology, Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences and Shanghai Jiao Tong University School of Medicine, Shanghai, People's Republic of China
- State Key Laboratory of Medical Genomics, Ruijin Hospital, Shanghai Jiaotong University, Shanghai, People's Republic of China
| | - Yu-Dong Cai
- Institute of Systems Biology, Shanghai University, Shanghai, People's Republic of China
- Centre for Computational Systems Biology, Fudan University, Shanghai, People's Republic of China
- Gordon Life Science Institute, San Diego, California, United States of America
| | - Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, California, United States of America
|
14
|
Predicting positive p53 cancer rescue regions using Most Informative Positive (MIP) active learning. PLoS Comput Biol 2009; 5:e1000498. [PMID: 19756158 PMCID: PMC2742196 DOI: 10.1371/journal.pcbi.1000498] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.7] [Received: 04/21/2009] [Accepted: 08/04/2009] [Indexed: 11/19/2022] Open
Abstract
Many protein engineering problems involve finding mutations that produce proteins with a particular function. Computational active learning is an attractive approach to discover desired biological activities. Traditional active learning techniques have been optimized to iteratively improve classifier accuracy, not to quickly discover biologically significant results. We report here a novel active learning technique, Most Informative Positive (MIP), which is tailored to biological problems because it seeks novel and informative positive results. MIP active learning differs from traditional active learning methods in two ways: (1) it preferentially seeks Positive (functionally active) examples; and (2) it may be effectively extended to select gene regions suitable for high throughput combinatorial mutagenesis. We applied MIP to discover mutations in the tumor suppressor protein p53 that reactivate mutated p53 found in human cancers. This is an important biomedical goal because p53 mutants have been implicated in half of all human cancers, and restoring active p53 in tumors leads to tumor regression. MIP found Positive (cancer rescue) p53 mutants in silico using 33% fewer experiments than traditional non-MIP active learning, with only a minor decrease in classifier accuracy. Applying MIP to in vivo experimentation yielded immediate Positive results. Ten different p53 mutations found in human cancers were paired in silico with all possible single amino acid rescue mutations, from which MIP was used to select a Positive Region predicted to be enriched for p53 cancer rescue mutants. In vivo assays showed that the predicted Positive Region: (1) had significantly more (p<0.01) new strong cancer rescue mutants than control regions (Negative, and non-MIP active learning); (2) had slightly more new strong cancer rescue mutants than an Expert region selected for purely biological considerations; and (3) rescued for the first time the previously unrescuable p53 cancer mutant P152L.
Engineering proteins to acquire or enhance a particular useful function is at the core of many biomedical problems. This paper presents Most Informative Positive (MIP) active learning, a novel integrated computational/biological approach designed to help guide biological discovery of novel and informative positive mutants. A classifier, together with modeled structure-based features, helps guide biological experiments and so accelerates protein engineering studies. MIP reduces the number of expensive biological experiments needed to achieve novel and informative positive results. We used the MIP method to discover novel p53 cancer rescue mutants. p53 is a tumor suppressor protein, and destructive p53 mutations have been implicated in half of all human cancers. Second-site cancer rescue mutations restore p53 activity and eventually may facilitate rational design of better cancer drugs. This paper shows that, even in the first round of in vivo experiments, MIP significantly increased the discovery rate of novel and informative positive mutants.
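The contrast the abstract draws between accuracy-driven active learning and a positive-seeking strategy can be pictured with two toy selection rules. The abstract does not give MIP's actual scoring function, so the second rule below is only a hypothetical stand-in for "preferentially seeks Positive examples"; the mutant names and predicted probabilities are invented for illustration.

```python
def pick_most_uncertain(candidates):
    """Classic uncertainty sampling: choose the prediction closest to 0.5."""
    return min(candidates, key=lambda name: abs(candidates[name] - 0.5))


def pick_most_likely_positive(candidates):
    """Positive-seeking rule (assumption): choose the candidate most likely to be active."""
    return max(candidates, key=lambda name: candidates[name])


# Hypothetical classifier outputs: predicted probability that a mutant rescues p53.
predicted_p_active = {"mutant_a": 0.52, "mutant_b": 0.91, "mutant_c": 0.08}
next_for_accuracy = pick_most_uncertain(predicted_p_active)
next_for_discovery = pick_most_likely_positive(predicted_p_active)
```

The first rule spends the next experiment on the mutant the classifier is least sure about, which helps the classifier; the second spends it on the mutant most likely to be a rescue, which helps discovery. A method described as both "informative" and "positive" would have to balance the two, which this sketch does not attempt.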
|
15
|
Cai Z, Shi Y, Song M, Goebel R, Lin G. Smoothing blemished gene expression microarray data via missing value imputation. Annu Int Conf IEEE Eng Med Biol Soc 2008; 2008:5688-5691. [PMID: 19164008 DOI: 10.1109/iembs.2008.4650505] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2023]
Abstract
Gene expression microarray technology has enabled advanced biological and medical research, but the data are widely recognized as noisy and must be used with caution, since they are greatly affected by many experimental factors such as RNA concentration, spot typing, hybridization condition, and image analysis. It is highly desirable that the inaccurate data entries ('stains') can be identified and subsequently curated. In this paper, we propose a novel computational method, based on feature gene selection and sample classification, to efficiently discover the stains and apply imputation methods to estimate their values. Extensive experimental results on three Affymetrix platforms for human cancer diagnosis showed that by picking only 1-4% of data entries as the most likely stains, the smoothed datasets could be used for better downstream data analyses such as robust biomarker identification and disease diagnosis.
Affiliation(s)
- Zhipeng Cai
- Department of Computing Science, University of Alberta. Edmonton, T6G 2E8, Canada.
|
16
|
Danziger SA, Zeng J, Wang Y, Brachmann RK, Lathrop RH. Choosing where to look next in a mutation sequence space: Active Learning of informative p53 cancer rescue mutants. Bioinformatics 2007; 23:i104-14. [PMID: 17646286 PMCID: PMC2811495 DOI: 10.1093/bioinformatics/btm166] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.8] [Indexed: 01/08/2023]
Abstract
MOTIVATION: Many biomedical projects would benefit from reducing the time and expense of in vitro experimentation by using computer models for in silico predictions. These models may help determine which expensive biological data are most useful to acquire next. Active Learning techniques for choosing the most informative data enable biologists and computer scientists to optimize experimental data choices for rapid discovery of biological function. To explore design choices that affect this desirable behavior, five novel and five existing Active Learning techniques, together with three control methods, were tested on 57 previously unknown p53 cancer rescue mutants for their ability to build classifiers that predict protein function. The best of these techniques, Maximum Curiosity, improved the baseline accuracy from 56% to 77%. This article shows that Active Learning is a useful tool for biomedical research, and provides a case study of interest to others facing similar discovery challenges.
Collapse
Affiliation(s)
- Samuel A Danziger
- Department of Biomedical Engineering, University of California, Irvine, California 92697, USA
|
17
|
Saigo H, Uno T, Tsuda K. Mining complex genotypic features for predicting HIV-1 drug resistance. Bioinformatics 2007; 23:2455-62. [PMID: 17698858 DOI: 10.1093/bioinformatics/btm353] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Human immunodeficiency virus type 1 (HIV-1) evolves in the human body, and its exposure to a drug often causes mutations that enhance resistance to the drug. To design an effective pharmacotherapy for an individual patient, it is important to accurately predict drug resistance from genotype data. Notably, resistance is not simply the sum of the effects of all mutations. Structural biological studies suggest that the association of mutations is crucial: even if mutation A or B alone does not affect resistance, a significant change might happen when the two mutations occur together. Linear regression methods cannot take these associations into account, while decision tree methods can reveal only limited associations. Kernel methods and neural networks implicitly use all possible associations for prediction, but cannot select salient associations explicitly. RESULTS: Our method, itemset boosting, performs linear regression in the complete space of power sets of mutations. It implements a forward feature selection procedure where, in each iteration, one mutation combination is found by an efficient branch-and-bound search. This method uses all possible combinations, and salient associations are explicitly shown. In experiments, our method worked particularly well for predicting the resistance of nucleotide reverse transcriptase inhibitors (NRTIs). Furthermore, it successfully recovered many mutation associations known in the biological literature. AVAILABILITY: http://www.kyb.mpg.de/bs/people/hiroto/iboost/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
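As a rough illustration of the forward selection over mutation combinations described in this abstract, the following is a minimal sketch, not the authors' implementation. It assumes a simple least-squares gain criterion and replaces the branch-and-bound search with exhaustive enumeration of small itemsets, which is only feasible for toy inputs. The function names, the `max_size` and `n_rounds` parameters, and the toy data are all hypothetical.

```python
from itertools import combinations


def itemset_feature(sample, itemset):
    """Return 1.0 if the sample carries every mutation in the itemset, else 0.0."""
    return 1.0 if itemset <= sample else 0.0


def forward_itemset_regression(samples, targets, max_size=2, n_rounds=3):
    """Greedy forward selection of mutation combinations for a linear model.

    Assumption: each round adds the itemset whose 0/1 indicator gives the
    largest drop in squared residual error. Candidates are enumerated
    exhaustively here as a stand-in for a branch-and-bound search.
    """
    mutations = sorted(set().union(*samples))
    candidates = [
        frozenset(combo)
        for size in range(1, max_size + 1)
        for combo in combinations(mutations, size)
    ]
    bias = sum(targets) / len(targets)
    residual = [y - bias for y in targets]
    model = []
    for _ in range(n_rounds):
        best = None
        for itemset in candidates:
            x = [itemset_feature(s, itemset) for s in samples]
            support = sum(x)
            if support == 0:
                continue
            # Least-squares coefficient for a single 0/1 feature:
            # the mean residual over the samples that carry the itemset.
            coef = sum(r * xi for r, xi in zip(residual, x)) / support
            gain = coef * coef * support  # reduction in squared error
            if best is None or gain > best[0]:
                best = (gain, itemset, coef, x)
        if best is None or best[0] < 1e-12:
            break
        _, itemset, coef, x = best
        model.append((itemset, coef))
        residual = [r - coef * xi for r, xi in zip(residual, x)]
    return bias, model


def predict(bias, model, sample):
    """Linear prediction: bias plus the weights of all matching itemsets."""
    return bias + sum(coef * itemset_feature(sample, it) for it, coef in model)


# Toy data (made up): resistance appears only when A and B co-occur.
samples = [set(), {"A"}, {"B"}, {"A", "B"}, {"A", "B", "C"}, {"C"}]
targets = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0]
bias, model = forward_itemset_regression(samples, targets, n_rounds=1)
```

On this toy input the first selected feature is the pair {A, B} rather than either single mutation, which mirrors the abstract's point that an association can matter even when each mutation alone does not.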
Affiliation(s)
- Hiroto Saigo
- Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany
|
18
|
Bichutskiy VY, Colman R, Brachmann RK, Lathrop RH. Heterogeneous biomedical database integration using a hybrid strategy: a p53 cancer research database. Cancer Inform 2007; 2:277-87. [PMID: 19458771 PMCID: PMC2675489] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022] Open
Abstract
Complex problems in life science research give rise to multidisciplinary collaboration, and hence, to the need for heterogeneous database integration. The tumor suppressor p53 is mutated in close to 50% of human cancers, and a small drug-like molecule with the ability to restore native function to cancerous p53 mutants is a long-held medical goal of cancer treatment. The Cancer Research DataBase (CRDB) was designed in support of a project to find such small molecules. As a cancer informatics project, the CRDB involved small molecule data, computational docking results, functional assays, and protein structure data. As an example of the hybrid strategy for data integration, it combined the mediation and data warehousing approaches. This paper uses the CRDB to illustrate the hybrid strategy as a viable approach to heterogeneous data integration in biomedicine, and provides a design method for those considering similar systems. More efficient data sharing implies increased productivity, and, hopefully, improved chances of success in cancer research. (Code and database schemas are freely downloadable, http://www.igb.uci.edu/research/research.html.)
Affiliation(s)
- Vadim Y. Bichutskiy
- Department of Computer Science
- Institute for Genomics and Bioinformatics, University of California, Irvine, California 92697, U.S.A.
- Richard Colman
- Institute for Genomics and Bioinformatics, University of California, Irvine, California 92697, U.S.A.
- Rainer K. Brachmann
- Department of Medicine
- Department of Biological Chemistry
- Department of Pathology
- Division of Hematology/Oncology
- Institute for Genomics and Bioinformatics, University of California, Irvine, California 92697, U.S.A.
- Richard H. Lathrop
- Department of Computer Science
- Department of Biomedical Engineering
- Institute for Genomics and Bioinformatics, University of California, Irvine, California 92697, U.S.A.
|
19
|
Aeling KA, Steffen NR, Johnson M, Hatfield GW, Lathrop RH, Senear DF. DNA deformation energy as an indirect recognition mechanism in protein-DNA interactions. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2007; 4:117-25. [PMID: 17277419 DOI: 10.1109/tcbb.2007.1000] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Indexed: 05/13/2023]
Abstract
Proteins that bind to specific locations in genomic DNA control many basic cellular functions. Proteins detect their binding sites using both direct and indirect recognition mechanisms. Deformation energy, which models the energy required to bend DNA from its native shape to its shape when bound to a protein, has been shown to be an indirect recognition mechanism for one particular protein, Integration Host Factor (IHF). This work extends the analysis of deformation to two other DNA-binding proteins, CRP and SRF, and two endonucleases, I-CreI and I-PpoI. Known binding sites for all five proteins showed statistically significant differences in mean deformation energy as compared to random sequences. Binding sites for the three DNA-binding proteins and one of the endonucleases had mean deformation energies lower than random sequences. Binding sites for I-PpoI had mean deformation energy higher than random sequences. Classifiers that were trained using the deformation energy at each base pair step showed good cross-validated accuracy when classifying unseen sequences as binders or nonbinders. These results support DNA deformation energy as an indirect recognition mechanism across a wider range of DNA-binding proteins. Deformation energy may also have a predictive capacity for the underlying catalytic mechanism of DNA-binding enzymes.
MESH Headings
- Algorithms
- Animals
- Base Sequence
- Binding Sites
- Cyclic AMP Receptor Protein/chemistry
- Cyclic AMP Receptor Protein/metabolism
- DNA/chemistry
- DNA/genetics
- DNA/metabolism
- DNA Restriction Enzymes/chemistry
- DNA Restriction Enzymes/metabolism
- DNA, Algal/chemistry
- DNA, Algal/genetics
- DNA, Algal/metabolism
- DNA, Bacterial/chemistry
- DNA, Bacterial/genetics
- DNA, Bacterial/metabolism
- DNA, Protozoan/chemistry
- DNA, Protozoan/genetics
- DNA, Protozoan/metabolism
- DNA-Binding Proteins/chemistry
- DNA-Binding Proteins/metabolism
- Endodeoxyribonucleases/chemistry
- Endodeoxyribonucleases/metabolism
- Humans
- Integration Host Factors/chemistry
- Integration Host Factors/metabolism
- Models, Chemical
- Models, Molecular
- Protein Binding
- Serum Response Factor/chemistry
- Serum Response Factor/metabolism
- Thermodynamics
Affiliation(s)
- Kimberly A Aeling
- Department of Microbiology and Molecular Genetics, School of Medicine, University of California, Irvine, 92697-3425, USA.
|
20
|
Bichutskiy VY, Colman R, Brachmann RK, Lathrop RH. Heterogeneous Biomedical Database Integration using a Hybrid Strategy: A P53 Cancer Research Database. Cancer Inform 2006. [DOI: 10.1177/117693510600200021] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/16/2022] Open
Abstract
Complex problems in life science research give rise to multidisciplinary collaboration, and hence, to the need for heterogeneous database integration. The tumor suppressor p53 is mutated in close to 50% of human cancers, and a small drug-like molecule with the ability to restore native function to cancerous p53 mutants is a long-held medical goal of cancer treatment. The Cancer Research DataBase (CRDB) was designed in support of a project to find such small molecules. As a cancer informatics project, the CRDB involved small molecule data, computational docking results, functional assays, and protein structure data. As an example of the hybrid strategy for data integration, it combined the mediation and data warehousing approaches. This paper uses the CRDB to illustrate the hybrid strategy as a viable approach to heterogeneous data integration in biomedicine, and provides a design method for those considering similar systems. More efficient data sharing implies increased productivity, and, hopefully, improved chances of success in cancer research. (Code and database schemas are freely downloadable, http://www.igb.uci.edu/research/research.html.)
Affiliation(s)
- Vadim Y. Bichutskiy
- Department of Computer Science, University of California, Irvine, California 92697, U.S.A.
- Institute for Genomics and Bioinformatics, University of California, Irvine, California 92697, U.S.A.
- Richard Colman
- Institute for Genomics and Bioinformatics, University of California, Irvine, California 92697, U.S.A.
- Rainer K. Brachmann
- Department of Medicine, University of California, Irvine, California 92697, U.S.A.
- Department of Biological Chemistry, University of California, Irvine, California 92697, U.S.A.
- Department of Pathology, University of California, Irvine, California 92697, U.S.A.
- Division of Hematology/Oncology, University of California, Irvine, California 92697, U.S.A.
- Institute for Genomics and Bioinformatics, University of California, Irvine, California 92697, U.S.A.
- Richard H. Lathrop
- Department of Computer Science, University of California, Irvine, California 92697, U.S.A.
- Department of Biomedical Engineering, University of California, Irvine, California 92697, U.S.A.
- Institute for Genomics and Bioinformatics, University of California, Irvine, California 92697, U.S.A.
|