1
de Jager M, Kolbeck PJ, Vanderlinden W, Lipfert J, Filion L. Exploring protein-mediated compaction of DNA by coarse-grained simulations and unsupervised learning. Biophys J 2024; 123:3231-3241. [PMID: 39044429 PMCID: PMC11427786 DOI: 10.1016/j.bpj.2024.07.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2024] [Revised: 06/18/2024] [Accepted: 07/18/2024] [Indexed: 07/25/2024]
Abstract
Protein-DNA interactions and protein-mediated DNA compaction play key roles in a range of biological processes. The length scales typically involved in DNA bending, bridging, looping, and compaction (≥1 kbp) are challenging to address experimentally or by all-atom molecular dynamics simulations, making coarse-grained simulations a natural approach. Here, we present a simple and generic coarse-grained model for DNA-protein and protein-protein interactions and investigate the role of the latter in the protein-induced compaction of DNA. Our approach models the DNA as a discrete worm-like chain. The proteins are treated in the grand canonical ensemble, and the protein-DNA binding strength is taken from experimental measurements. Protein-DNA interactions are modeled as an isotropic binding potential with an imposed binding valency without specific assumptions about the binding geometry. To systematically and quantitatively classify DNA-protein complexes, we present an unsupervised machine learning pipeline that receives a large set of structural order parameters as input, reduces the dimensionality via principal-component analysis, and groups the results using a Gaussian mixture model. We apply our method to recent data on the compaction of viral genome-length DNA by HIV integrase and find that protein-protein interactions are critical to the formation of looped intermediate structures seen experimentally. Our methodology is broadly applicable to DNA-binding proteins and protein-induced DNA compaction and provides a systematic and semi-quantitative approach for analyzing their mesoscale complexes.
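The classification pipeline described in this abstract (order parameters as input, PCA for dimensionality reduction, a Gaussian mixture model for grouping) can be sketched roughly as follows. This is a minimal illustration with scikit-learn, not the authors' code: the function names, the standardisation step, the number of retained components, the number of mixture components, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def classify_structures(order_params, n_components=2, n_clusters=3, seed=0):
    """Cluster configurations, one row of order parameters per configuration.

    All defaults are illustrative assumptions, not values from the paper.
    """
    # Put every order parameter on a comparable scale before PCA (assumed step).
    scaled = StandardScaler().fit_transform(order_params)
    # Keep only the leading principal components.
    reduced = PCA(n_components=n_components).fit_transform(scaled)
    # Fit a Gaussian mixture in the reduced space and label each row.
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed)
    return gmm.fit_predict(reduced)


def make_toy_order_parameters(n_per_class=100, n_features=10, seed=0):
    """Synthetic stand-in data: three well-separated groups of rows."""
    rng = np.random.default_rng(seed)
    centers = (-10.0, 0.0, 10.0)
    blocks = [c + rng.normal(size=(n_per_class, n_features)) for c in centers]
    truth = np.repeat(np.arange(len(centers)), n_per_class)
    return np.vstack(blocks), truth
```

Calling `classify_structures(X)` on the array returned by `make_toy_order_parameters()` yields one integer label per row. The label values themselves are arbitrary; only the grouping is meaningful.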
Affiliation(s)
- Marjolein de Jager
  - Soft Condensed Matter and Biophysics, Debye Institute for Nanomaterials Science, Utrecht University, Utrecht, the Netherlands
- Pauline J Kolbeck
  - Soft Condensed Matter and Biophysics, Debye Institute for Nanomaterials Science, Utrecht University, Utrecht, the Netherlands; Department of Physics and Center for NanoScience, LMU, Munich, Germany
- Willem Vanderlinden
  - Soft Condensed Matter and Biophysics, Debye Institute for Nanomaterials Science, Utrecht University, Utrecht, the Netherlands; Department of Physics and Center for NanoScience, LMU, Munich, Germany; School of Physics and Astronomy, University of Edinburgh, Scotland, United Kingdom
- Jan Lipfert
  - Soft Condensed Matter and Biophysics, Debye Institute for Nanomaterials Science, Utrecht University, Utrecht, the Netherlands; Department of Physics and Center for NanoScience, LMU, Munich, Germany
- Laura Filion
  - Soft Condensed Matter and Biophysics, Debye Institute for Nanomaterials Science, Utrecht University, Utrecht, the Netherlands
2
Varghese A, Santos-Fernandez E, Denti F, Mira A, Mengersen K. A global perspective on the intrinsic dimensionality of COVID-19 data. Sci Rep 2023; 13:9761. [PMID: 37328523 PMCID: PMC10276009 DOI: 10.1038/s41598-023-36116-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2022] [Accepted: 05/30/2023] [Indexed: 06/18/2023]
Abstract
We develop a novel global perspective on the complexity of the relationships among three COVID-19 datasets: the standardised per-capita growth rates of COVID-19 cases and of deaths, and the Oxford Coronavirus Government Response Tracker COVID-19 Stringency Index (CSI), a measure of the stringency of a country's lockdown policies. We use a state-of-the-art heterogeneous intrinsic dimension estimator, called Hidalgo, which is implemented as a Bayesian mixture model. Our findings suggest that these highly popular COVID-19 statistics may project onto two low-dimensional manifolds without significant information loss, which indicates that COVID-19 data dynamics are generated by a latent mechanism characterised by a few important variables. The low dimensionality implies a strong dependency among the standardised growth rates of cases and deaths per capita and the CSI for countries over 2020-2021. Importantly, we identify spatial autocorrelation in the intrinsic dimension distribution worldwide. The results show that high-income countries are more prone to lie on low-dimensional manifolds, likely arising from aging populations, comorbidities, and increased per capita mortality burden from COVID-19. Finally, the temporal stratification of the dataset allows the examination of the intrinsic dimension at a more granular level throughout the pandemic.
Affiliation(s)
- Abhishek Varghese
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Australia
  - Centre for Data Science (CDS), Queensland University of Technology (QUT), Brisbane, Australia
- Edgar Santos-Fernandez
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Australia
  - Centre for Data Science (CDS), Queensland University of Technology (QUT), Brisbane, Australia
- Francesco Denti
  - Department of Statistics, Università Cattolica del Sacro Cuore, Milan, Italy
- Antonietta Mira
  - Data Science Lab, Università della Svizzera italiana, Lugano, Switzerland
  - Department of Science and High Technology, Università degli Studi dell'Insubria, Como, Italy
- Kerrie Mengersen
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Australia
  - Centre for Data Science (CDS), Queensland University of Technology (QUT), Brisbane, Australia
3
Lysov M, Pukhkiy K, Vasiliev E, Getmanskaya A, Turlapov V. Ensuring Explainability and Dimensionality Reduction in a Multidimensional HSI World for Early XAI-Diagnostics of Plant Stress. Entropy (Basel) 2023; 25:e25050801. [PMID: 37238556 DOI: 10.3390/e25050801] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2023] [Revised: 05/08/2023] [Accepted: 05/08/2023] [Indexed: 05/28/2023]
Abstract
This work is devoted mainly to the search for effective solutions, based on explainable artificial intelligence (XAI), to the problem of early diagnosis of plant stress, using wheat and its drought stress as an example. The main idea is to combine the benefits of two of the most popular agricultural data sources, hyperspectral images (HSI) and thermal infrared images (TIR), in a single XAI model. We used our own dataset from a 25-day experiment, acquired with (1) a Specim IQ HSI camera (400-1000 nm, 204 channels, 512 × 512 pixels) and (2) a Testo 885-2 TIR camera (320 × 240 pixels, resolution 0.1 °C). The HSI were the source of the k-dimensional high-level features of plants (k ≤ K, where K is the number of HSI channels) for the learning process. The combination was implemented as a single-layer perceptron (SLP) regressor, which is the main feature of the XAI model: it receives as input an HSI pixel signature belonging to the plant mask, and that pixel is automatically labelled, through the mask, with a value from the TIR image. The correlation of HSI channels with the TIR image on the plant mask was studied for each day of the experiment. HSI channel 143 (820 nm) was found to be the most correlated with TIR. The XAI model was trained to map the HSI signatures of plants to their corresponding temperature values. The RMSE of plant temperature prediction is 0.2-0.3 °C, which is acceptable for early diagnostics. Each HSI pixel was represented in training by k channels (k ≤ K = 204 in our case). The number of channels used for training was reduced by a factor of 25-30, from 204 to eight or seven, while maintaining the RMSE value. The model is computationally efficient in training; the average training time was much less than one minute (Intel Core i3-8130U, 2.2 GHz, 4 cores, 4 GB). This XAI model can be considered a research-aimed model (R-XAI), which allows the transfer of knowledge about plants from the TIR domain to the HSI domain by contrasting them on only a few of the hundreds of HSI channels.
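The two steps named in this abstract, ranking HSI channels by their correlation with temperature and then regressing temperature on a handful of channels, can be illustrated with a small sketch. It rests on several assumptions: the data are synthetic, the planted channel indices (42 and 10) are arbitrary, and a single linear unit fitted by least squares stands in for the single-layer perceptron, whose activation and training procedure the abstract does not specify.

```python
import numpy as np


def rank_channels(signatures, temperatures):
    """Return channel indices ordered by |Pearson r| with temperature."""
    x = signatures - signatures.mean(axis=0)
    y = temperatures - temperatures.mean()
    r = (x * y[:, None]).sum(axis=0) / (
        np.sqrt((x ** 2).sum(axis=0)) * np.sqrt((y ** 2).sum())
    )
    return np.argsort(-np.abs(r))


def fit_linear_unit(signatures, temperatures, channels):
    """Least-squares fit of one linear unit (weights plus bias).

    Assumption: this replaces the paper's SLP regressor for illustration.
    """
    design = np.column_stack([signatures[:, channels], np.ones(len(signatures))])
    weights, *_ = np.linalg.lstsq(design, temperatures, rcond=None)
    rmse = float(np.sqrt(np.mean((design @ weights - temperatures) ** 2)))
    return weights, rmse


def make_toy_pixels(n_pixels=500, n_channels=204, seed=0):
    """Synthetic pixel signatures; temperature depends on two planted channels."""
    rng = np.random.default_rng(seed)
    signatures = rng.normal(size=(n_pixels, n_channels))
    temperatures = (
        20.0
        + 2.0 * signatures[:, 42]   # hypothetical dominant channel
        + 1.0 * signatures[:, 10]   # hypothetical secondary channel
        + 0.05 * rng.normal(size=n_pixels)
    )
    return signatures, temperatures
```

With the toy data, `rank_channels` puts the planted dominant channel first, and fitting on the eight top-ranked channels with `fit_linear_unit` gives an RMSE close to the injected noise level. Real HSI channels are strongly correlated with one another, so channel selection on real data is harder than this sketch suggests.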
Affiliation(s)
- Maxim Lysov
  - Department of Mathematical Software and Supercomputing Technologies, Lobachevsky University, 603950 Nizhny Novgorod, Russia
- Konstantin Pukhkiy
  - Department of Mathematical Software and Supercomputing Technologies, Lobachevsky University, 603950 Nizhny Novgorod, Russia
- Evgeny Vasiliev
  - Department of Mathematical Software and Supercomputing Technologies, Lobachevsky University, 603950 Nizhny Novgorod, Russia
- Alexandra Getmanskaya
  - Department of Mathematical Software and Supercomputing Technologies, Lobachevsky University, 603950 Nizhny Novgorod, Russia
- Vadim Turlapov
  - Department of Mathematical Software and Supercomputing Technologies, Lobachevsky University, 603950 Nizhny Novgorod, Russia
4
The generalized ratios intrinsic dimension estimator. Sci Rep 2022; 12:20005. [PMID: 36411305 PMCID: PMC9678878 DOI: 10.1038/s41598-022-20991-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2021] [Accepted: 09/21/2022] [Indexed: 11/23/2022]
Abstract
Modern datasets are characterized by numerous features related by complex dependency structures. To deal with these data, dimensionality reduction techniques are essential. Many of these techniques rely on the concept of intrinsic dimension (id), a measure of the complexity of the dataset. However, the estimation of this quantity is not trivial: often, the id depends rather dramatically on the scale of the distances among data points. At short distances, the id can be grossly overestimated due to the presence of noise, becoming smaller and approximately scale-independent only at large distances. An immediate approach to examining the scale dependence consists in decimating the dataset, which unavoidably induces non-negligible statistical errors at large scale. This article introduces a novel statistical method, Gride, that allows estimating the id as an explicit function of the scale without performing any decimation. Our approach is based on rigorous distributional results that enable the quantification of uncertainty of the estimates. Moreover, our method is simple and computationally efficient since it relies only on the distances among data points. Through simulation studies, we show that Gride is asymptotically unbiased, provides comparable estimates to other state-of-the-art methods, and is more robust to short-scale noise than other likelihood-based approaches.
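To make concrete the idea of estimating the id from distances alone, here is a sketch of the simplest distance-ratio estimate, which uses only the first and second nearest neighbours of each point (often called TwoNN). It is not the Gride method itself: the abstract does not give Gride's formulas, and handling larger neighbour orders to probe scale dependence is not attempted here. The function names and test data are assumptions for illustration.

```python
import numpy as np


def neighbour_distances(points, k):
    """Distances from each point to its 1st..k-th nearest neighbours (brute force)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    return np.sort(dists, axis=1)[:, :k]


def ratio_id_estimate(points):
    """Maximum-likelihood id estimate from the ratio mu = r2 / r1.

    Assumes locally uniform density and no duplicate points (r1 > 0).
    """
    r = neighbour_distances(points, 2)
    mu = r[:, 1] / r[:, 0]
    return float(len(mu) / np.log(mu).sum())
```

For points drawn uniformly from a three-dimensional cube the estimate should come out near 3, and for a two-dimensional sheet embedded in a higher-dimensional space near 2. The brute-force distance matrix limits this sketch to a few thousand points.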
5
Santos-Fernandez E, Denti F, Mengersen K, Mira A. The role of intrinsic dimension in high-resolution player tracking data—Insights in basketball. Ann Appl Stat 2022. [DOI: 10.1214/21-aoas1506] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Kerrie Mengersen
  - School of Mathematical Sciences, Queensland University of Technology
6
Benkő Z, Stippinger M, Rehus R, Bencze A, Fabó D, Hajnal B, Eröss LG, Telcs A, Somogyvári Z. Manifold-adaptive dimension estimation revisited. PeerJ Comput Sci 2022; 8:e790. [PMID: 35111907 PMCID: PMC8771813 DOI: 10.7717/peerj-cs.790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2021] [Accepted: 11/01/2021] [Indexed: 06/14/2023]
Abstract
Data dimensionality informs us about data complexity and sets limits on the structure of successful signal processing pipelines. In this work we revisit and improve the manifold-adaptive Farahmand-Szepesvári-Audibert (FSA) dimension estimator, making it one of the best nearest neighbor-based dimension estimators available. We compute the probability density function of local FSA estimates, assuming that the local manifold density is uniform. Based on the probability density function, we propose to use the median of local estimates as a basic global measure of intrinsic dimensionality, and we demonstrate the advantages of this asymptotically unbiased estimator over the previously proposed statistics: the mode and the mean. Additionally, from the probability density function, we derive the maximum likelihood formula for global intrinsic dimensionality under the i.i.d. assumption. We tackle edge and finite-sample effects with an exponential correction formula, calibrated on hypercube datasets. We compare the performance of the corrected median-FSA estimator with kNN estimators: maximum likelihood (Levina-Bickel), the 2NN, and two implementations of DANCo (R and MATLAB). We show that the corrected median-FSA estimator beats the maximum likelihood estimator and is on an equal footing with DANCo for standard synthetic benchmarks according to mean percentage error and error rate metrics. With the median-FSA algorithm, we reveal diverse changes in the neural dynamics during resting state and during epileptic seizures. We identify brain areas with lower-dimensional dynamics that are possible causal sources and candidates for being seizure onset zones.
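The median-of-local-estimates idea can be sketched as follows. The local formula used here, ln 2 / ln(R_2k / R_k) with R_k the distance to the k-th nearest neighbour, is the commonly quoted form of the FSA estimate and is an assumption, since the abstract does not state it. The exponential edge and finite-sample correction mentioned in the abstract is not included, so this is an uncorrected illustration only.

```python
import numpy as np


def median_fsa(points, k=1):
    """Uncorrected median of local FSA-style dimension estimates.

    Assumed local estimate: ln 2 / ln(R_2k / R_k). No edge correction.
    Returns (global_median, local_estimates).
    """
    n = len(points)
    if n <= 2 * k:
        raise ValueError("need more than 2*k points")
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # exclude each point from its own neighbours
    ordered = np.sort(dists, axis=1)
    r_k = ordered[:, k - 1]
    r_2k = ordered[:, 2 * k - 1]
    # Duplicate points (r_k == 0 or r_2k == r_k) would break the ratio; not handled.
    local = np.log(2.0) / np.log(r_2k / r_k)
    return float(np.median(local)), local
```

Because the median is unaffected by the occasional very large local value (which occurs when R_2k is only slightly larger than R_k), it is a more stable summary than the mean of the same local estimates. Larger k smooths the local estimates but makes edge effects stronger, which is presumably where a correction of the kind described in the abstract becomes important.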
Affiliation(s)
- Zsigmond Benkő
  - Department of Computational Sciences, Wigner Research Centre for Physics, Budapest, Hungary
  - János Szentágothai Doctoral School of Neurosciences, Semmelweis University, Budapest, Hungary
- Marcell Stippinger
  - Department of Computational Sciences, Wigner Research Centre for Physics, Budapest, Hungary
- Roberta Rehus
  - Department of Computational Sciences, Wigner Research Centre for Physics, Budapest, Hungary
- Attila Bencze
  - Department of Computational Sciences, Wigner Research Centre for Physics, Budapest, Hungary
- Dániel Fabó
  - Epilepsy Center, Department of Neurology, National Institute of Clinical Neurosciences, Budapest, Hungary
- Boglárka Hajnal
  - János Szentágothai Doctoral School of Neurosciences, Semmelweis University, Budapest, Hungary
  - Epilepsy Center, Department of Neurology, National Institute of Clinical Neurosciences, Budapest, Hungary
- Loránd G. Eröss
  - Department of Functional Neurosurgery, National Institute of Clinical Neurosciences, Budapest, Hungary
  - Faculty of Information Technology and Bionics, Péter Pázmány Catholic University, Budapest, Hungary
- András Telcs
  - Department of Computational Sciences, Wigner Research Centre for Physics, Budapest, Hungary
  - Department of Computer Science and Information Theory, Faculty of Electrical Engineering and Informatics, Budapest University of Technology and Economics, Budapest, Hungary
  - Department of Quantitative Methods, Faculty of Business and Economics, University of Pannonia, Veszprém, Hungary
- Zoltán Somogyvári
  - Department of Computational Sciences, Wigner Research Centre for Physics, Budapest, Hungary
  - Neuromicrosystems ltd., Budapest, Hungary
7
Canducci M, Tiño P, Mastropietro M. Probabilistic modelling of general noisy multi-manifold data sets. Artif Intell 2022. [DOI: 10.1016/j.artint.2021.103579] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/02/2022]
8
Bac J, Mirkes EM, Gorban AN, Tyukin I, Zinovyev A. Scikit-Dimension: A Python Package for Intrinsic Dimension Estimation. Entropy (Basel) 2021; 23:1368. [PMID: 34682092 PMCID: PMC8534554 DOI: 10.3390/e23101368] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 09/17/2021] [Revised: 10/10/2021] [Accepted: 10/16/2021] [Indexed: 02/07/2023]
Abstract
Dealing with uncertainty in applications of machine learning to real-life data critically depends on the knowledge of intrinsic dimensionality (ID). A number of methods have been suggested for the purpose of estimating ID, but no standard package to easily apply them one by one or all at once has been implemented in Python. This technical note introduces scikit-dimension, an open-source Python package for intrinsic dimension estimation. The scikit-dimension package provides a uniform implementation of most of the known ID estimators based on the scikit-learn application programming interface to evaluate the global and local intrinsic dimension, as well as generators of synthetic toy and benchmark datasets widespread in the literature. The package is developed with tools assessing the code quality, coverage, unit testing and continuous integration. We briefly describe the package and demonstrate its use in a large-scale (more than 500 datasets) benchmarking of methods for ID estimation for real-life and synthetic data.
Affiliation(s)
- Jonathan Bac
  - Institut Curie, PSL Research University, 75248 Paris, France
  - INSERM, U900, 75248 Paris, France
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75272 Paris, France
- Evgeny M. Mirkes
  - Department of Mathematics, University of Leicester, Leicester LE1 7RH, UK
  - Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603105 Nizhniy Novgorod, Russia
- Alexander N. Gorban
  - Department of Mathematics, University of Leicester, Leicester LE1 7RH, UK
  - Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603105 Nizhniy Novgorod, Russia
- Ivan Tyukin
  - Department of Mathematics, University of Leicester, Leicester LE1 7RH, UK
  - Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603105 Nizhniy Novgorod, Russia
- Andrei Zinovyev
  - Institut Curie, PSL Research University, 75248 Paris, France
  - INSERM, U900, 75248 Paris, France
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75272 Paris, France
  - Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603105 Nizhniy Novgorod, Russia
9
Arella D, Dilucca M, Giansanti A. Codon usage bias and environmental adaptation in microbial organisms. Mol Genet Genomics 2021; 296:751-762. [PMID: 33818631 PMCID: PMC8144148 DOI: 10.1007/s00438-021-01771-4] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 05/14/2020] [Accepted: 02/22/2021] [Indexed: 01/01/2023]
Abstract
In each genome, synonymous codons are used with different frequencies; this general phenomenon is known as codon usage bias. It has been previously recognised that codon usage bias could affect the cellular fitness and might be associated with the ecology of microbial organisms. In this exploratory study, we investigated the relationship between codon usage bias, lifestyles (thermophiles vs. mesophiles; pathogenic vs. non-pathogenic; halophilic vs. non-halophilic; aerobic vs. anaerobic and facultative) and habitats (aquatic, terrestrial, host-associated, specialised, multiple) of 615 microbial organisms (544 bacteria and 71 archaea). Principal component analysis revealed that species with given phenotypic traits and living in similar environmental conditions have similar codon preferences, as represented by the relative synonymous codon usage (RSCU) index, and similar spectra of tRNA availability, as gauged by the tRNA gene copy number (tGCN). Moreover, by measuring the average tRNA adaptation index (tAI) for each genome, an index that can be associated with translational efficiency, we observed that organisms able to live in multiple habitats, including facultative organisms, mesophiles and pathogenic bacteria, are characterised by a reduced translational efficiency, consistent with their need to adapt to different environments. Our results show that synonymous codon choices might be under strong translational selection, which modulates the choice of the codons to differently match tRNA availability, depending on the organism's lifestyle needs. To our knowledge, this is the first large-scale study that examines the role of codon bias and translational efficiency in the adaptation of microbial organisms to the environment in which they live.
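The RSCU index mentioned in this abstract is conventionally defined, for each codon, as its observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below implements that conventional definition with the standard genetic code; it is an illustration rather than the authors' pipeline, and the function name and the choice to skip amino acids that do not occur in the sequence are assumptions.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" marks stops).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(coding_sequence):
    """Relative synonymous codon usage for one in-frame coding sequence.

    Returns {codon: value} for amino acids that occur at least once.
    """
    seq = coding_sequence.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TO_AA)

    # Group sense codons into synonymous families by amino acid.
    families = {}
    for codon, aa in CODON_TO_AA.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent: RSCU is undefined, so it is skipped
        expected = total / len(family)  # count under uniform synonymous usage
        for c in family:
            values[c] = counts[c] / expected
    return values
```

For example, in the sequence `GCTGCTGCTGCC` alanine is encoded four times, three times by GCT, so `rscu` assigns GCT a value of 3.0, GCC 1.0, and the unused alanine codons 0.0. A value of 1.0 means no bias for that codon.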
Affiliation(s)
- Davide Arella
  - Department of Physics, Sapienza University of Rome, 00185, Rome, Italy
- Maddalena Dilucca
  - Department of Physics, Sapienza University of Rome, 00185, Rome, Italy
- Andrea Giansanti
  - Department of Physics, Sapienza University of Rome, 00185, Rome, Italy
  - INFN, Roma1 Unit, 00185, Rome, Italy