1
|
Kamat G, Shan M, Gutman R. Bayesian record linkage with variables in one file. Stat Med 2023; 42:4931-4951. [PMID: 37652076 DOI: 10.1002/sim.9894] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2022] [Revised: 06/12/2023] [Accepted: 08/21/2023] [Indexed: 09/02/2023]
Abstract
In many healthcare and social science applications, information about units is dispersed across multiple data files. Linking records across files is necessary to estimate the associations of interest. Common record linkage algorithms only rely on similarities between linking variables that appear in all the files. Moreover, analysis of linked files often ignores errors that may arise from incorrect or missed links. Bayesian record linking methods allow for natural propagation of linkage error, by jointly sampling the linkage structure and the model parameters. We extend an existing Bayesian record linkage method to integrate associations between variables exclusive to each file being linked. We show analytically, and using simulations, that the proposed method can improve the linking process, and can result in accurate inferences. We apply the method to link Meals on Wheels recipients to Medicare enrollment records.
Affiliation(s)
- Gauri Kamat
- Department of Biostatistics, Brown University, Providence, Rhode Island, USA
- Roee Gutman
- Department of Biostatistics, Brown University, Providence, Rhode Island, USA
|
2
|
Andreella A, De Santis R, Vesely A, Finos L. Procrustes-based distances for exploring between-matrices similarity. STAT METHOD APPL-GER 2023. [DOI: 10.1007/s10260-023-00689-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/05/2023]
Abstract
The statistical shape analysis called Procrustes analysis minimizes the Frobenius distance between matrices by similarity transformations. The method returns a set of optimal orthogonal matrices, which project each matrix into a common space. This manuscript presents two types of distances derived from Procrustes analysis for exploring between-matrices similarity. The first one focuses on the residuals from the Procrustes analysis, i.e., the residual-based distance metric. In contrast, the second one exploits the fitted orthogonal matrices, i.e., the rotational-based distance metric. Thanks to these distances, similarity-based techniques such as the multidimensional scaling method can be applied to visualize and explore patterns and similarities among observations. The proposed distances prove helpful in functional magnetic resonance imaging (fMRI) data analysis. The brain activation measured over space and time can be represented by a matrix. The proposed distances applied to a sample of subjects (i.e., matrices) revealed groups of individuals sharing patterns of neural brain activation. Finally, the proposed method is useful in several contexts when the aim is to analyze the similarity between high-dimensional matrices affected by functional misalignment.
|
3
|
Andreella A, Finos L. Procrustes Analysis for High-Dimensional Data. PSYCHOMETRIKA 2022; 87:1422-1438. [PMID: 35583747 PMCID: PMC9636303 DOI: 10.1007/s11336-022-09859-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/05/2021] [Revised: 03/01/2022] [Indexed: 05/31/2023]
Abstract
The Procrustes-based perturbation model (Goodall in J R Stat Soc Ser B Methodol 53(2):285-321, 1991) allows minimization of the Frobenius distance between matrices by similarity transformation. However, it suffers from non-identifiability, critical interpretation of the transformed matrices, and inapplicability in high-dimensional data. We provide an extension of the perturbation model focused on the high-dimensional data framework, called the ProMises (Procrustes von Mises-Fisher) model. The ill-posed and interpretability problems are solved by imposing a proper prior distribution for the orthogonal matrix parameter (i.e., the von Mises-Fisher distribution) which is a conjugate prior, resulting in a fast estimation process. Furthermore, we present the Efficient ProMises model for the high-dimensional framework, useful in neuroimaging, where the problem has much more than three dimensions. We found a great improvement in functional magnetic resonance imaging connectivity analysis because the ProMises model permits incorporation of topological brain information in the alignment's estimation process.
Affiliation(s)
- Angela Andreella
- Department of Economics, Ca' Foscari University of Venice, San Giobbe - Cannaregio 873, Fondamenta San Giobbe, 30121 Venice, Italy
- Livio Finos
- Department of Developmental Psychology and Socialization, University of Padova, Via Venezia, 8, Padua, Italy
|
4
|
Improving Wildlife Population Inference Using Aerial Imagery and Entity Resolution. JOURNAL OF AGRICULTURAL, BIOLOGICAL AND ENVIRONMENTAL STATISTICS 2022. [DOI: 10.1007/s13253-021-00484-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 10/19/2022]
|
5
|
Fallaize CJ, Green PJ, Mardia KV, Barber S. Bayesian protein sequence and structure alignment. J R Stat Soc Ser C Appl Stat 2020. [DOI: 10.1111/rssc.12394] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
Affiliation(s)
- Peter J. Green
- University of Bristol, UK
- University of Technology Sydney, Australia
|
6
|
Kasmi Y, Khataby K, Souiri A, Ennaji MM. Coronaviridae: 100,000 Years of Emergence and Reemergence. EMERGING AND REEMERGING VIRAL PATHOGENS 2020. [PMCID: PMC7149750 DOI: 10.1016/b978-0-12-819400-3.00007-7] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 02/05/2023]
Abstract
The coronavirus family (Coronaviridae) comprises positive-sense single-stranded RNA viruses with a genome size of 27 kb. These viruses exhibit potential species specificity and interspecies transmission. The interspecies transmission of viruses from one host species to another is a major factor responsible for the majority of emerging and reemerging infections. Coronaviridae is one of the most prominent emerging viral families that threaten public health.
|
7
|
Affiliation(s)
- Giacomo Zanella
- Department of Decision Sciences, BIDSA and IGIER, Bocconi University, Milan, Italy
|
8
|
Kent JT, Ganeiber AM, Mardia KV. A New Unified Approach for the Simulation of a Wide Class of Directional Distributions. J Comput Graph Stat 2018. [DOI: 10.1080/10618600.2017.1390468] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 10/18/2022]
Affiliation(s)
- John T. Kent
- Department of Statistics, University of Leeds, Leeds, United Kingdom
- Kanti V. Mardia
- Department of Statistics, University of Leeds, Leeds, United Kingdom
- Department of Statistics, University of Oxford, Oxford, United Kingdom
|
9
|
Eltzner B, Huckemann S, Mardia KV. Torus principal component analysis with applications to RNA structure. Ann Appl Stat 2018. [DOI: 10.1214/17-aoas1115] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Indexed: 11/19/2022]
|
10
|
Sikaroudi AE, Welch DA, Woehl TJ, Faller R, Evans JE, Browning ND, Park C. Directional Statistics of Preferential Orientations of Two Shapes in Their Aggregate and Its Application to Nanoparticle Aggregation. Technometrics 2018. [DOI: 10.1080/00401706.2017.1366949] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/19/2022]
Affiliation(s)
- Taylor J. Woehl
- Department of Chemical and Biomolecular Engineering, University of Maryland, College Park, MD
- Roland Faller
- Department of Chemical Engineering, University of California at Davis, Davis, CA
- James E. Evans
- Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA
- Nigel D. Browning
- Fundamental Computational Sciences Directorate, Pacific Northwest National Laboratory, Richland, WA
- Chiwoo Park
- Department of Industrial and Manufacturing Engineering, Florida State University, Tallahassee, FL
|
11
|
Ejlali N, Faghihi MR, Sadeghi M. Bayesian comparison of protein structures using partial Procrustes distance. Stat Appl Genet Mol Biol 2017; 16:243-257. [PMID: 28862992 DOI: 10.1515/sagmb-2016-0014] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022]
Abstract
An important topic in bioinformatics is the protein structure alignment. Some statistical methods have been proposed for this problem, but most of them align two protein structures based on the global geometric information without considering the effect of neighbourhood in the structures. In this paper, we provide a Bayesian model to align protein structures, by considering the effect of both local and global geometric information of protein structures. Local geometric information is incorporated to the model through the partial Procrustes distance of small substructures. These substructures are composed of β-carbon atoms from the side chains. Parameters are estimated using a Markov chain Monte Carlo (MCMC) approach. We evaluate the performance of our model through some simulation studies. Furthermore, we apply our model to a real dataset and assess the accuracy and convergence rate. Results show that our model is much more efficient than previous approaches.
|
12
|
Affiliation(s)
- Mauricio Sadinle
- Department of Statistical Science, Duke University, Durham, NC, and the National Institute of Statistical Sciences—NISS, Research Triangle Park, NC
|
13
|
|
14
|
Herman JL, Novák Á, Lyngsø R, Szabó A, Miklós I, Hein J. Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs. BMC Bioinformatics 2015; 16:108. [PMID: 25888064 PMCID: PMC4395974 DOI: 10.1186/s12859-015-0516-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 03/24/2014] [Accepted: 02/24/2015] [Indexed: 11/30/2022]
Abstract
BACKGROUND: A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. RESULTS: In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. CONCLUSIONS: The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference.
Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign.
Affiliation(s)
- Joseph L Herman
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
- Division of Mathematical Biology, National Institute of Medical Research, The Ridgeway, London, NW7 1AA, UK.
- Ádám Novák
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
- Rune Lyngsø
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
- Adrienn Szabó
- Institute of Computer Science and Control, Hungarian Academy of Sciences, Lagymanyosi u. 11., Budapest, 1111, Hungary.
- István Miklós
- Institute of Computer Science and Control, Hungarian Academy of Sciences, Lagymanyosi u. 11., Budapest, 1111, Hungary.
- Department of Stochastics, Rényi Institute, Reáltanoda u. 13-15, Budapest, 1053, Hungary.
- Jotun Hein
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
|
15
|
Najibi S, Faghihi M, Golalizadeh M, Arab S. Bayesian alignment of proteins via Delaunay tetrahedralization. J Appl Stat 2015. [DOI: 10.1080/02664763.2014.995605] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/24/2022]
|
16
|
Herman JL, Challis CJ, Novák Á, Hein J, Schmidler SC. Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol 2014; 31:2251-66. [PMID: 24899668 PMCID: PMC4137710 DOI: 10.1093/molbev/msu184] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Indexed: 12/29/2022]
Abstract
For sequences that are highly divergent, there is often insufficient information to infer accurate alignments, and phylogenetic uncertainty may be high. One way to address this issue is to make use of protein structural information, since structures generally diverge more slowly than sequences. In this work, we extend a recently developed stochastic model of pairwise structural evolution to multiple structures on a tree, analytically integrating over ancestral structures to permit efficient likelihood computations under the resulting joint sequence-structure model. We observe that the inclusion of structural information significantly reduces alignment and topology uncertainty, and reduces the number of topology and alignment errors in cases where the true trees and alignments are known. In some cases, the inclusion of structure results in changes to the consensus topology, indicating that structure may contain additional information beyond that which can be obtained from sequences. We use the model to investigate the order of divergence of cytoglobins, myoglobins, and hemoglobins and observe a stabilization of phylogenetic inference: although a sequence-based inference assigns significant posterior probability to several different topologies, the structural model strongly favors one of these over the others and is more robust to the choice of data set.
Affiliation(s)
- Joseph L Herman
- Department of Statistics, University of Oxford, Oxford, United Kingdom
- Division of Mathematical Biology, National Institute of Medical Research, London, United Kingdom
- Ádám Novák
- Department of Statistics, University of Oxford, Oxford, United Kingdom
- Jotun Hein
- Department of Statistics, University of Oxford, Oxford, United Kingdom
- Scott C Schmidler
- Department of Statistical Science, Duke University
- Department of Computer Science, Duke University
|
17
|
Shape and object data analysis. Biom J 2014; 56:758-60. [DOI: 10.1002/bimj.201300220] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 10/05/2013] [Revised: 01/23/2014] [Accepted: 01/23/2014] [Indexed: 11/07/2022]
|
18
|
Kent JT. Contribution to the Discussion of the Paper Geodesic Monte Carlo on Embedded Manifolds. Scand Stat Theory Appl 2014. [DOI: 10.1111/sjos.12068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
|
19
|
Abstract
The analysis of the three-dimensional structure of proteins is an important topic in molecular biochemistry. Structure plays a critical role in defining the function of proteins and is more strongly conserved than amino acid sequence over evolutionary timescales. A key challenge is the identification and evaluation of structural similarity between proteins; such analysis can aid in understanding the role of newly discovered proteins and help elucidate evolutionary relationships between organisms. Computational biologists have developed many clever algorithmic techniques for comparing protein structures, however, all are based on heuristic optimization criteria, making statistical interpretation somewhat difficult. Here we present a fully probabilistic framework for pairwise structural alignment of proteins. Our approach has several advantages, including the ability to capture alignment uncertainty and to estimate key "gap" parameters which critically affect the quality of the alignment. We show that several existing alignment methods arise as maximum a posteriori estimates under specific choices of prior distributions and error models. Our probabilistic framework is also easily extended to incorporate additional information, which we demonstrate by including primary sequence information to generate simultaneous sequence-structure alignments that can resolve ambiguities obtained using structure alone. This combined model also provides a natural approach for the difficult task of estimating evolutionary distance based on structural alignments. The model is illustrated by comparison with well-established methods on several challenging protein alignment examples.
Affiliation(s)
- Abel Rodriguez
- University of California, Santa Cruz and Duke University
|
20
|
|
21
|
Su J, Srivastava A, Huffer F. Detection, classification and estimation of individual shapes in 2D and 3D point clouds. Comput Stat Data Anal 2013. [DOI: 10.1016/j.csda.2012.09.008] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 11/25/2022]
|
22
|
Mardia KV. Statistical approaches to three key challenges in protein structural bioinformatics. J R Stat Soc Ser C Appl Stat 2013. [DOI: 10.1111/rssc.12003] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Indexed: 11/30/2022]
|
23
|
Mardia KV, Fallaize CJ, Barber S, Jackson RM, Theobald DL. Bayesian alignment of similarity shapes. Ann Appl Stat 2013; 7:989-1009. [PMID: 24052809 PMCID: PMC3774796 DOI: 10.1214/12-aoas615] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
Abstract
We develop a Bayesian model for the alignment of two point configurations under the full similarity transformations of rotation, translation and scaling. Other work in this area has concentrated on rigid body transformations, where scale information is preserved, motivated by problems involving molecular data; this is known as form analysis. We concentrate on a Bayesian formulation for statistical shape analysis. We generalize the model introduced by Green and Mardia for the pairwise alignment of two unlabeled configurations to full similarity transformations by introducing a scaling factor to the model. The generalization is not straight-forward, since the model needs to be reformulated to give good performance when scaling is included. We illustrate our method on the alignment of rat growth profiles and a novel application to the alignment of protein domains. Here, scaling is applied to secondary structure elements when comparing protein folds; additionally, we find that one global scaling factor is not in general sufficient to model these data and, hence, we develop a model in which multiple scale factors can be included to handle different scalings of shape components.
Affiliation(s)
- Kanti V. Mardia
- Department of Statistics, University of Leeds, Leeds, LS2 9JT, United Kingdom
- Christopher J. Fallaize
- School of Mathematical Sciences, University of Nottingham, Nottingham, NG7 2RD, United Kingdom
- Stuart Barber
- Department of Statistics, University of Leeds, Leeds, LS2 9JT, United Kingdom
- Richard M. Jackson
- Institute of Molecular and Cellular Biology, University of Leeds, Leeds, LS2 9JT, United Kingdom
- Douglas L. Theobald
- Department of Biochemistry, Brandeis University, 415 South St, Waltham, Massachusetts 02454-9110, USA
|
24
|
Mardia KV, Petty EM, Taylor CC. Matching markers and unlabeled configurations in protein gels. Ann Appl Stat 2012. [DOI: 10.1214/12-aoas544] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/19/2022]
|
25
|
Challis CJ, Schmidler SC. A stochastic evolutionary model for protein structure alignment and phylogeny. Mol Biol Evol 2012; 29:3575-87. [PMID: 22723302 DOI: 10.1093/molbev/mss167] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 11/14/2022]
Abstract
We present a stochastic process model for the joint evolution of protein primary and tertiary structure, suitable for use in alignment and estimation of phylogeny. Indels arise from a classic Links model, and mutations follow a standard substitution matrix, whereas backbone atoms diffuse in three-dimensional space according to an Ornstein-Uhlenbeck process. The model allows for simultaneous estimation of evolutionary distances, indel rates, structural drift rates, and alignments, while fully accounting for uncertainty. The inclusion of structural information enables phylogenetic inference on time scales not previously attainable with sequence evolution models. The model also provides a tool for testing evolutionary hypotheses and improving our understanding of protein structural evolution.
|
26
|
Czogiel I, Dryden IL, Brignell CJ. Bayesian matching of unlabeled marked point sets using random fields, with an application to molecular alignment. Ann Appl Stat 2011. [DOI: 10.1214/11-aoas486] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
|
27
|
Melnykov V, Maitra R, Nettleton D. Accounting for spot matching uncertainty in the analysis of proteomics data from two-dimensional gel electrophoresis. SANKHYA B 2011. [DOI: 10.1007/s13571-011-0016-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/24/2022]
|
28
|
Tancredi A, Liseo B. A hierarchical Bayesian approach to record linkage and population size problems. Ann Appl Stat 2011. [DOI: 10.1214/10-aoas447] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.3] [Indexed: 11/19/2022]
|
29
|
Xia H, Ding Y, Mallick BK. Bayesian hierarchical model for combining misaligned two-resolution metrology data. IIE Trans 2011. [DOI: 10.1080/0740817x.2010.521804] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.9] [Indexed: 10/18/2022]
|
30
|
Mardia KV, Nyirongo VB, Fallaize CJ, Barber S, Jackson RM. Hierarchical Bayesian modeling of pharmacophores in bioinformatics. Biometrics 2010; 67:611-9. [PMID: 20618307 DOI: 10.1111/j.1541-0420.2010.01460.x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/24/2022]
Abstract
One of the key ingredients in drug discovery is the derivation of conceptual templates called pharmacophores. A pharmacophore model characterizes the physicochemical properties common to all active molecules, called ligands, bound to a particular protein receptor, together with their relative spatial arrangement. Motivated by this important application, we develop a Bayesian hierarchical model for the derivation of pharmacophore templates from multiple configurations of point sets, partially labeled by the atom type of each point. The model is implemented through a multistage template hunting algorithm that produces a series of templates that capture the geometrical relationship of atoms matched across multiple configurations. Chemical information is incorporated by distinguishing between atoms of different elements, whereby different elements are less likely to be matched than atoms of the same element. We illustrate our method through examples of deriving templates from sets of ligands that all bind structurally related protein active sites and show that the model is able to retrieve the key pharmacophore features in two test cases.
Affiliation(s)
- Kanti V Mardia
- Department of Statistics, The University of Leeds, Leeds LS2 9JT, UK.
|
31
|
|
32
|
|
33
|
Kayano M, Konishi S. Functional principal component analysis via regularized Gaussian basis expansions and its application to unbalanced data. J Stat Plan Inference 2009. [DOI: 10.1016/j.jspi.2008.11.002] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 11/29/2022]
|
34
|
Xie L, Xie L, Bourne PE. A unified statistical model to support local sequence order independent similarity searching for ligand-binding sites and its application to genome-based drug discovery. Bioinformatics 2009; 25:i305-12. [PMID: 19478004 PMCID: PMC2687974 DOI: 10.1093/bioinformatics/btp220] [Citation(s) in RCA: 70] [Impact Index Per Article: 4.7] [Indexed: 11/13/2022]
Abstract
Functional relationships between proteins that do not share global structure similarity can be established by detecting their ligand-binding-site similarity. For a large-scale comparison, it is critical to accurately and efficiently assess the statistical significance of this similarity. Here, we report an efficient statistical model that supports local sequence order independent ligand-binding-site similarity searching. Most existing statistical models only take into account the matching vertices between two sites that are defined by a fixed number of points. In reality, the boundary of the binding site is not known or is dependent on the bound ligand making these approaches limited. To address these shortcomings and to perform binding-site mapping on a genome-wide scale, we developed a sequence-order independent profile-profile alignment (SOIPPA) algorithm that is able to detect local similarity between unknown binding sites a priori. The SOIPPA scoring integrates geometric, evolutionary and physical information into a unified framework. However, this imposes a significant challenge in assessing the statistical significance of the similarity because the conventional probability model that is based on fixed-point matching cannot be applied. Here we find that scores for binding-site matching by SOIPPA follow an extreme value distribution (EVD). Benchmark studies show that the EVD model performs at least two-orders faster and is more accurate than the non-parametric statistical method in the previous SOIPPA version. Efficient statistical analysis makes it possible to apply SOIPPA to genome-based drug discovery. Consequently, we have applied the approach to the structural genome of Mycobacterium tuberculosis to construct a protein-ligand interaction network. The network reveals highly connected proteins, which represent suitable targets for promiscuous drugs.
Affiliation(s)
- Lei Xie
- San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA 92093, USA.
|
35
|
Habeck M. Generation of three-dimensional random rotations in fitting and matching problems. Comput Stat 2009. [DOI: 10.1007/s00180-009-0156-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/20/2022]
|
36
|
Hamelryck T. Probabilistic models and machine learning in structural bioinformatics. Stat Methods Med Res 2009; 18:505-26. [PMID: 19153168 DOI: 10.1177/0962280208099492] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 11/16/2022]
Abstract
Structural bioinformatics is concerned with the molecular structure of biomacromolecules on a genomic scale, using computational methods. Classic problems in structural bioinformatics include the prediction of protein and RNA structure from sequence, the design of artificial proteins or enzymes, and the automated analysis and comparison of biomacromolecules in atomic detail. The determination of macromolecular structure from experimental data (for example coming from nuclear magnetic resonance, X-ray crystallography or small angle X-ray scattering) has close ties with the field of structural bioinformatics. Recently, probabilistic models and machine learning methods based on Bayesian principles are providing efficient and rigorous solutions to challenging problems that were long regarded as intractable. In this review, I will highlight some important recent developments in the prediction, analysis and experimental determination of macromolecular structure that are based on such methods. These developments include generative models of protein structure, the estimation of the parameters of energy functions that are used in structure prediction, the superposition of macromolecules and structure determination methods that are based on inference. Although this review is not exhaustive, I believe the selected topics give a good impression of the exciting new, probabilistic road the field of structural bioinformatics is taking.
Affiliation(s)
- Thomas Hamelryck
- Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen N, Denmark.
|
37
|
Ruffieux Y, Green PJ. Alignment of Multiple Configurations Using Hierarchical Models. J Comput Graph Stat 2009. [DOI: 10.1198/jcgs.2009.07048] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 11/21/2022]
|
38
|
Mardia KV, Nyirongo VB. Simulating virtual protein Calpha traces with applications. J Comput Biol 2008; 15:1209-20. [PMID: 18973436 DOI: 10.1089/cmb.2007.0092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
We propose a simple procedure for generating virtual protein C(alpha) traces. One of the key ingredients of our method, to build a three-dimensional structure from a random sequence of amino acids, is to work directly on torsional angles of the chain which we sample from a von Mises distribution. With simple modeling of the hydrophobic effect in protein folding, the procedure produces compact and globular structures. Some characteristics of real proteins (i.e., compactness and globularity) are well mimicked by this procedure. These virtual traces are used to assess algorithms for matching protein structures or functional sites.
Affiliation(s)
- Kanti V Mardia
- Department of Statistics, University of Leeds, Leeds, United Kingdom.
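The abstract above mentions sampling chain torsional angles from a von Mises distribution. As a minimal sketch of that single step only, the snippet below draws angles with Python's standard library; the function name `sample_torsion_angles` and the values of `mu` and `kappa` are illustrative assumptions, not parameters from the paper, and the construction of three-dimensional coordinates and the hydrophobic modeling are not reproduced.

```python
import math
import random


def sample_torsion_angles(n_angles, mu=math.pi, kappa=2.0, seed=None):
    """Draw n_angles torsion angles (radians) from a von Mises distribution.

    mu is the mean direction and kappa the concentration; both defaults are
    placeholder values chosen for illustration, not taken from the paper.
    """
    rng = random.Random(seed)
    # random.vonmisesvariate returns an angle between 0 and 2*pi.
    return [rng.vonmisesvariate(mu, kappa) for _ in range(n_angles)]


if __name__ == "__main__":
    angles = sample_torsion_angles(5, seed=0)
    print([round(math.degrees(a), 1) for a in angles])
```

A larger `kappa` concentrates the sampled angles more tightly around `mu`, while `kappa = 0` gives a uniform distribution on the circle.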
|
39
|
Marín JM, Nieto C. Spatial Matching of Multiple Configurations of Points with a Bioinformatics Application. COMMUN STAT-THEOR M 2008. [DOI: 10.1080/03610920701759669] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
|
40
|
Liu J, Yu W, Wu B, Zhao H. Bayesian Mass Spectra Peak Alignment from Mass Charge Ratios. Cancer Inform 2008. [DOI: 10.1177/117693510800600006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022] Open
Abstract
Proteomics studies based on mass spectrometry (MS) are gaining popular applications in biomedical research for protein identification/quantification and biomarker discovery, especially for potential early diagnosis and prognosis of severe disease before the occurrence of symptoms. However, MS data collected using current technologies are very noisy and appropriate data preprocessing is critical for successful applications of MS-based approaches. Among various data preprocessing steps, peak alignment from multiple spectra based on detected peak sample locations presents special statistical challenges when effective experimental calibration is not feasible due to relatively large peak location variation. To avoid intensive tuning parameter optimization, we propose a simple novel Bayesian algorithm “random grafting-pruning Markov chain Monte Carlo (RGPMCMC)” that can be applied to global MS peak alignment and to follow certain model-based sample classification criterion for using aligned peaks to classify spectrum samples. The usefulness of our approach is demonstrated through simulation study by making extensive comparison with other algorithms in the literature. Its application to an ovarian cancer MALDI-MS data set achieves a smaller 10-fold cross validation error rate than other current large scale methodologies.
Affiliation(s)
- Junfeng Liu
- Department of Statistics, West Virginia University, Morgantown, WV 26506, U.S.A
- Weichuan Yu
- Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Sai Kung Kowloon, Hong Kong
- Baolin Wu
- Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, U.S.A
- Hongyu Zhao
- Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520-8034, U.S.A
|
41
|
Mardia KV. Comment. J Am Stat Assoc 2007. [DOI: 10.1198/016214507000001210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
|
42
|
Davies JR, Jackson RM, Mardia KV, Taylor CC. The Poisson Index: a new probabilistic model for protein–ligand binding site similarity. Bioinformatics 2007; 23:3001-8. [PMID: 17893083 DOI: 10.1093/bioinformatics/btm470] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: The large-scale comparison of protein-ligand binding sites is problematic, in that measures of structural similarity are difficult to quantify and are not easily understood in terms of statistical similarity that can ultimately be related to structure and function. We present a binding site matching score, the Poisson Index (PI), based upon a well-defined statistical model. PI requires only the number of matching atoms between two sites and the size of the two sites, the same information used by the Tanimoto Index (TI), a comparable and widely used measure for molecular similarity. We apply PI and TI to a previously automatically extracted set of binding sites to determine the robustness and usefulness of both scores. RESULTS: We found that PI outperforms TI; moreover, site similarity is poorly defined for TI at values around the 99.5% confidence level for which PI is well defined. A difference map at this confidence level shows that PI gives much more meaningful information than TI. We show individual examples where TI fails to distinguish either a false or a true site pairing in contrast to PI, which performs much better. TI cannot handle large or small sites very well, or the comparison of large and small sites, in contrast to PI that is shown to be much more robust. Despite the difficulty of determining a biological 'ground truth' for binding site similarity we conclude that PI is a suitable measure of binding site similarity and could form the basis for a binding site classification scheme comparable to existing protein domain classification schema.
Affiliation(s)
- J R Davies
- School of Mathematics and Institute of Molecular and Cellular Biology, University of Leeds, Leeds LS2 9JT, UK
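The abstract above notes that the Tanimoto Index (TI) uses only the number of matching atoms and the sizes of the two sites. A minimal sketch of the standard TI formula (matches divided by the size of the union) is given below; the function name `tanimoto_index` and the example numbers are illustrative assumptions. The Poisson Index itself is not reproduced here because its statistical model is not stated in the abstract.

```python
def tanimoto_index(n_matched, size_a, size_b):
    """Standard Tanimoto index: matched atoms over the union of both sites."""
    if n_matched < 0 or n_matched > min(size_a, size_b):
        raise ValueError("n_matched must lie between 0 and the smaller site size")
    union = size_a + size_b - n_matched
    if union == 0:
        raise ValueError("at least one site must contain atoms")
    return n_matched / union


if __name__ == "__main__":
    # Two equal-sized sites sharing half of their atoms.
    print(tanimoto_index(5, 10, 10))
    # A small site fully matched inside a much larger site.
    print(tanimoto_index(5, 5, 50))
```

In the second call every atom of the small site is matched, yet the score is only 5/50, which illustrates one way a size mismatch between sites can depress TI.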
|
43
|
Dryden IL, Hirst JD, Melville JL. Statistical analysis of unlabeled point sets: comparing molecules in chemoinformatics. Biometrics 2007; 63:237-51. [PMID: 17447950 DOI: 10.1111/j.1541-0420.2006.00622.x] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.5] [Indexed: 11/30/2022]
Abstract
We consider Bayesian methodology for comparing two or more unlabeled point sets. Application of the technique to a set of steroid molecules illustrates its potential utility involving the comparison of molecules in chemoinformatics and bioinformatics. We initially match a pair of molecules, where one molecule is regarded as random and the other fixed. A type of mixture model is proposed for the point set coordinates, and the parameters of the distribution are a labeling matrix (indicating which pairs of points match) and a concentration parameter. An important property of the likelihood is that it is invariant under rotations and translations of the data. Bayesian inference for the parameters is carried out using Markov chain Monte Carlo simulation, and it is demonstrated that the procedure works well on the steroid data. The posterior distribution is difficult to simulate from, due to multiple local modes, and we also use additional data (partial charges on atoms) to help with this task. An approximation is considered for speeding up the simulation algorithm, and the approximating fast algorithm leads to essentially identical inference to that under the exact method for our data. Extensions to multiple molecule alignment are also introduced, and an algorithm is described which also works well on the steroid data set. After all the steroid molecules have been matched, exploratory data analysis is carried out to examine which molecules are similar. Also, further Bayesian inference for the multiple alignment problem is considered.
Affiliation(s)
- Ian L Dryden
- School of Mathematical Sciences, University of Nottingham, University Park, Nottingham NG7 2RD, UK.
|
44
|
Bayesian refinement of protein functional site matching. BMC Bioinformatics 2007; 8:257. [PMID: 17640336 PMCID: PMC1940029 DOI: 10.1186/1471-2105-8-257] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 11/24/2006] [Accepted: 07/17/2007] [Indexed: 11/21/2022] Open
Abstract
Background: Matching functional sites is a key problem for the understanding of protein function and evolution. The commonly used graph theoretic approach, and other related approaches, require adjustment of a matching distance threshold a priori according to the noise in atomic positions. This is difficult to pre-determine when matching sites related by varying evolutionary distances and crystallographic precision. Furthermore, sometimes the graph method is unable to identify alternative but important solutions in the neighbourhood of the distance based solution because of strict distance constraints. We consider the Bayesian approach to improve graph based solutions. In principle this approach applies to other methods with strict distance matching constraints. The Bayesian method can flexibly incorporate all types of prior information on specific binding sites (e.g. amino acid types) in contrast to combinatorial formulations. Results: We present a new meta-algorithm for matching protein functional sites (active sites and ligand binding sites) based on an initial graph matching followed by refinement using a Markov chain Monte Carlo (MCMC) procedure. This procedure is an innovative extension to our recent work. The method accounts for the 3-dimensional structure of the site as well as the physico-chemical properties of the constituent amino acids. The MCMC procedure can lead to a significant increase in the number of significant matches compared to the graph method as measured independently by rigorously derived p-values. Conclusion: The MCMC refinement step is able to significantly improve graph based matches. We apply the method to matching NAD(P)(H) binding sites within single Rossmann fold families, between different families in the same superfamily, and in different folds.
Within families, sites are often well conserved, but there are examples where significant shape based matches do not retain similar amino acid chemistry, indicating that even within families the same ligand may be bound using substantially different physico-chemistry. We also show that the procedure finds significant matches between binding sites for the same co-factor in different families and different folds.
|