1
Bernaola-Galván P, Carpena P, Gómez-Martín C, Oliver JL. Compositional Structure of the Genome: A Review. Biology (Basel) 2023; 12:849. [PMID: 37372134 DOI: 10.3390/biology12060849] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2023] [Revised: 06/06/2023] [Accepted: 06/07/2023] [Indexed: 06/29/2023]
Abstract
As the genome carries the historical information of a species' biotic and environmental interactions, analyzing changes in genome structure over time by using powerful statistical physics methods (such as entropic segmentation algorithms, fluctuation analysis in DNA walks, or measures of compositional complexity) provides valuable insights into genome evolution. Nucleotide frequencies tend to vary along the DNA chain, resulting in a hierarchically patchy chromosome structure with heterogeneities at different length scales that range from a few nucleotides to tens of millions of them. Fluctuation analysis reveals that these compositional structures can be classified into three main categories: (1) short-range heterogeneities (below a few kilobase pairs (Kbp)) primarily attributed to the alternation of coding and noncoding regions, interspersed or tandem repeats densities, etc.; (2) isochores, spanning tens to hundreds of tens of Kbp; and (3) superstructures, reaching sizes of tens of megabase pairs (Mbp) or even larger. The obtained isochore and superstructure coordinates in the first complete T2T human sequence are now shared in a public database. In this way, interested researchers can use T2T isochore data, as well as the annotations for different genome elements, to check a specific hypothesis about genome structure. Similarly to other levels of biological organization, a hierarchical compositional structure is prevalent in the genome. Once the compositional structure of a genome is identified, various measures can be derived to quantify the heterogeneity of such structure. The distribution of segment G+C content has recently been proposed as a new genome signature that proves to be useful for comparing complete genomes. Another meaningful measure is the sequence compositional complexity (SCC), which has been used for genome structure comparisons. 
Lastly, we review the recent genome comparisons in species of the ancient phylum Cyanobacteria, conducted by phylogenetic regression of SCC against time, which have revealed positive trends towards higher genome complexity. These findings provide the first evidence for a driven progressive evolution of genome compositional structure.
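The entropic segmentation mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes one common formulation: scan every cut point of a symbolic sequence and keep the one that maximizes the Jensen-Shannon divergence between the left and right parts. The function names and the S/W (strong/weak) alphabet are illustrative assumptions, and the statistical significance test that a full segmentation algorithm applies before accepting a cut is omitted.

```python
import math
from collections import Counter


def shannon_entropy(counts, total):
    """Shannon entropy (in bits) of a table of symbol counts."""
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


def best_split(seq):
    """Return (position, divergence) of the cut that maximizes the
    Jensen-Shannon divergence between seq[:position] and seq[position:]."""
    n = len(seq)
    total_counts = Counter(seq)
    h_total = shannon_entropy(total_counts, n)
    left = Counter()
    best_pos, best_js = None, 0.0
    for i in range(1, n):
        left[seq[i - 1]] += 1
        right = total_counts - left  # Counter subtraction drops zero counts
        js = (h_total
              - (i / n) * shannon_entropy(left, i)
              - ((n - i) / n) * shannon_entropy(right, n - i))
        if js > best_js:
            best_pos, best_js = i, js
    return best_pos, best_js


# Toy sequence: an AT-rich (W) block followed by a GC-rich (S) block.
position, divergence = best_split("W" * 50 + "S" * 50)
```

Applying `best_split` recursively to each resulting part, and stopping when a cut is no longer significant, would yield a set of compositionally homogeneous segments.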
Affiliation(s)
- Pedro Bernaola-Galván
  - Department of Applied Physics II and Institute Carlos I for Theoretical and Computational Physics, University of Málaga, 29071 Málaga, Spain
- Pedro Carpena
  - Department of Applied Physics II and Institute Carlos I for Theoretical and Computational Physics, University of Málaga, 29071 Málaga, Spain
- Cristina Gómez-Martín
  - Department of Pathology, Cancer Center Amsterdam, Amsterdam UMC, Vrije Universiteit Amsterdam, 1081 HV Amsterdam, The Netherlands
  - Department of Genetics, Faculty of Sciences, 18071 and Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, University of Granada, 18100 Granada, Spain
- Jose L Oliver
  - Department of Genetics, Faculty of Sciences, 18071 and Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, University of Granada, 18100 Granada, Spain
2
Moya A, Oliver JL, Verdú M, Delaye L, Arnau V, Bernaola-Galván P, de la Fuente R, Díaz W, Gómez-Martín C, González FM, Latorre A, Lebrón R, Román-Roldán R. Driven progressive evolution of genome sequence complexity in Cyanobacteria. Sci Rep 2020; 10:19073. [PMID: 33149190 PMCID: PMC7643063 DOI: 10.1038/s41598-020-76014-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/09/2020] [Accepted: 10/22/2020] [Indexed: 02/07/2023] Open
Abstract
Progressive evolution, or the tendency towards increasing complexity, is a controversial issue in biology, whose resolution entails a proper measurement of complexity. Genomes are the best entities to address this challenge, as they encode the historical information of a species' biotic and environmental interactions. As a case study, we have measured genome sequence complexity in the ancient phylum Cyanobacteria. To arrive at an appropriate measure of genome sequence complexity, we have chosen metrics that do not decipher biological functionality but that show strong phylogenetic signal. Using a ridge regression of those metrics against root-to-tip distance, we detected positive trends towards higher complexity in three of them. Lastly, we applied three standard tests to detect whether progressive evolution is passive or driven: the minimum, ancestor-descendant, and sub-clade tests. These results provide evidence for driven progressive evolution at the genome level in the phylum Cyanobacteria.
Affiliation(s)
- Andrés Moya
  - Institute of Integrative Systems Biology (I2Sysbio), University of València and Consejo Superior de Investigaciones Científicas (CSIC), 46980, Valencia, Spain
  - Foundation for the Promotion of Sanitary and Biomedical Research of Valencian Community (FISABIO), 46020, Valencia, Spain
  - CIBER in Epidemiology and Public Health, 28029, Madrid, Spain
- José L Oliver
  - Department of Genetics, Faculty of Sciences, University of Granada, 18071, Granada, Spain
  - Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, 18100, Granada, Spain
- Miguel Verdú
  - Centro de Investigaciones sobre Desertificación, Consejo Superior de Investigaciones Científicas (CSIC), University of València and Generalitat Valenciana, 46113, Valencia, Spain
- Luis Delaye
  - Department of Genetic Engineering, CINVESTAV, 36821, Irapuato, Mexico
- Vicente Arnau
  - Institute of Integrative Systems Biology (I2Sysbio), University of València and Consejo Superior de Investigaciones Científicas (CSIC), 46980, Valencia, Spain
- Pedro Bernaola-Galván
  - Department of Applied Physics II and Institute Carlos I for Theoretical and Computational Physics, University of Málaga, 29071, Málaga, Spain
- Rebeca de la Fuente
  - Institute for Cross-Disciplinary Physics and Complex Systems (IFISC), Consejo Superior de Investigaciones Científicas (CSIC) and University of Balearic Islands, 07122, Palma de Mallorca, Spain
- Wladimiro Díaz
  - Institute of Integrative Systems Biology (I2Sysbio), University of València and Consejo Superior de Investigaciones Científicas (CSIC), 46980, Valencia, Spain
- Cristina Gómez-Martín
  - Department of Genetics, Faculty of Sciences, University of Granada, 18071, Granada, Spain
  - Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, 18100, Granada, Spain
- Amparo Latorre
  - Institute of Integrative Systems Biology (I2Sysbio), University of València and Consejo Superior de Investigaciones Científicas (CSIC), 46980, Valencia, Spain
  - Foundation for the Promotion of Sanitary and Biomedical Research of Valencian Community (FISABIO), 46020, Valencia, Spain
  - CIBER in Epidemiology and Public Health, 28029, Madrid, Spain
- Ricardo Lebrón
  - Department of Genetics, Faculty of Sciences, University of Granada, 18071, Granada, Spain
  - Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, 18100, Granada, Spain
- Ramón Román-Roldán
  - Department of Applied Physics, University of Granada, 18071, Granada, Spain
3
Faes L, Gómez-Extremera M, Pernice R, Carpena P, Nollo G, Porta A, Bernaola-Galván P. Comparison of methods for the assessment of nonlinearity in short-term heart rate variability under different physiopathological states. Chaos 2019; 29:123114. [PMID: 31893647 DOI: 10.1063/1.5115506] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 06/18/2019] [Accepted: 11/19/2019] [Indexed: 06/10/2023]
Abstract
Despite the widespread diffusion of nonlinear methods for heart rate variability (HRV) analysis, the presence and the extent to which nonlinear dynamics contribute to short-term HRV are still controversial. This work aims at testing the hypothesis that different types of nonlinearity can be observed in HRV depending on the method adopted and on the physiopathological state. Two entropy-based measures of time series complexity (normalized complexity index, NCI) and regularity (information storage, IS), and a measure quantifying deviations from linear correlations in a time series (Gaussian linear contrast, GLC), are applied to short HRV recordings obtained in young (Y) and old (O) healthy subjects and in myocardial infarction (MI) patients monitored in the resting supine position and in the upright position reached through head-up tilt. The method of surrogate data is employed to detect the presence and quantify the contribution of nonlinear dynamics to HRV. We find that the three measures differ both in their variations across groups and conditions and in the percentage and strength of nonlinear HRV dynamics. NCI and IS displayed opposite variations, suggesting more complex dynamics in O and MI compared to Y and less complex dynamics during tilt. The strength of nonlinear dynamics is reduced by tilt using all measures in Y, while only GLC detects a significant strengthening of such dynamics in MI. A large percentage of detected nonlinear dynamics is revealed only by the IS measure in the Y group at rest, with a decrease in O and MI and during T, while NCI and GLC detect lower percentages in all groups and conditions. While these results suggest that distinct dynamic structures may lie beneath short-term HRV in different physiological states and pathological conditions, the strong dependence on the measure adopted and on their implementation suggests that physiological interpretations should be provided with caution.
Affiliation(s)
- Luca Faes
  - Department of Engineering, University of Palermo, 90128 Palermo, Italy
- Manuel Gómez-Extremera
  - Dpto. de Física Aplicada II, ETSI de Telecomunicación, University of Málaga, 29071 Málaga, Spain
- Riccardo Pernice
  - Department of Engineering, University of Palermo, 90128 Palermo, Italy
- Pedro Carpena
  - Dpto. de Física Aplicada II, ETSI de Telecomunicación, University of Málaga, 29071 Málaga, Spain
- Giandomenico Nollo
  - Department of Industrial Engineering, University of Trento, 38123 Trento, Italy
- Alberto Porta
  - Department of Biomedical Sciences for Health, University of Milan, 20122 Milan, Italy
- Pedro Bernaola-Galván
  - Dpto. de Física Aplicada II, ETSI de Telecomunicación, University of Málaga, 29071 Málaga, Spain
4
Lebrón R, Gómez-Martín C, Carpena P, Bernaola-Galván P, Barturen G, Hackenberg M, Oliver JL. NGSmethDB 2017: enhanced methylomes and differential methylation. Nucleic Acids Res 2017; 45:D97-D103. [PMID: 27794041 PMCID: PMC5210667 DOI: 10.1093/nar/gkw996] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 09/12/2016] [Revised: 10/08/2016] [Accepted: 10/14/2016] [Indexed: 12/27/2022] Open
Abstract
The 2017 update of NGSmethDB stores whole genome methylomes generated from short-read data sets obtained by bisulfite sequencing (WGBS) technology. To generate high-quality methylomes, stringent quality controls were integrated with third-party software, and a two-step mapping process was added to exploit the advantages of the new genome assembly models. The samples were all profiled under constant parameter settings, thus enabling comparative downstream analyses. Besides a significant increase in the number of samples, NGSmethDB now includes two additional data types, which are a valuable resource for the discovery of methylation epigenetic biomarkers: (i) differentially methylated single cytosines; and (ii) methylation segments (i.e. genome regions of homogeneous methylation). The NGSmethDB back-end is now based on MongoDB, a NoSQL hierarchical database using JSON-formatted documents and dynamic schemas, thus accelerating sample comparative analyses. Besides conventional database dumps, track hubs were implemented, which improved database access, visualization in genome browsers and comparative analyses against third-party annotations. In addition, the database can also be accessed through a RESTful API. Lastly, a Python client and a multiplatform virtual machine allow for program-driven access from the user's desktop. This way, private methylation data can be compared to NGSmethDB without the need to upload them to public servers. Database website: http://bioinfo2.ugr.es/NGSmethDB.
Affiliation(s)
- Ricardo Lebrón
  - Department of Genetics, Faculty of Science, University of Granada, Campus de Fuentenueva s/n, 18071-Granada, Spain
  - Laboratory of Bioinformatics, Centro de Investigación Biomédica, PTS, Avda. del Conocimiento s/n, 18100-Granada, Spain
- Cristina Gómez-Martín
  - Department of Genetics, Faculty of Science, University of Granada, Campus de Fuentenueva s/n, 18071-Granada, Spain
  - Laboratory of Bioinformatics, Centro de Investigación Biomédica, PTS, Avda. del Conocimiento s/n, 18100-Granada, Spain
- Pedro Carpena
  - Department of Applied Physics II, Universidad de Málaga, 29071 Málaga, Spain
- Guillermo Barturen
  - Genetics of Complex Diseases Group, GENyO, Pfizer-University of Granada-Junta de Andalucía Center for Genomics and Oncological Research, 18100-Granada, Spain
- Michael Hackenberg
  - Department of Genetics, Faculty of Science, University of Granada, Campus de Fuentenueva s/n, 18071-Granada, Spain
  - Laboratory of Bioinformatics, Centro de Investigación Biomédica, PTS, Avda. del Conocimiento s/n, 18100-Granada, Spain
- José L Oliver
  - Department of Genetics, Faculty of Science, University of Granada, Campus de Fuentenueva s/n, 18071-Granada, Spain
  - Laboratory of Bioinformatics, Centro de Investigación Biomédica, PTS, Avda. del Conocimiento s/n, 18100-Granada, Spain
5
Bernaola-Galván P, Oliver J, Hackenberg M, Coronado A, Ivanov P, Carpena P. Segmentation of time series with long-range fractal correlations. Eur Phys J B 2012; 85:211. [PMID: 23645997 PMCID: PMC3643524 DOI: 10.1140/epjb/e2012-20969-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 06/02/2023]
Abstract
Segmentation is a standard method of data analysis to identify change-points dividing a nonstationary time series into homogeneous segments. However, for long-range fractal correlated series, most of the segmentation techniques detect spurious change-points which are simply due to the heterogeneities induced by the correlations and not to real nonstationarities. To avoid this oversegmentation, we present a segmentation algorithm which takes as a reference for homogeneity, instead of a random i.i.d. series, a correlated series modeled by a fractional noise with the same degree of correlations as the series to be segmented. We apply our algorithm to artificial series with long-range correlations and show that it systematically detects only the change-points produced by real nonstationarities and not those created by the correlations of the signal. Further, we apply the method to the sequence of the long arm of human chromosome 21, which is known to have long-range fractal correlations. We obtain only three segments that clearly correspond to the three regions of different G + C composition revealed by means of a multi-scale wavelet plot. Similar results have been obtained when segmenting all human chromosome sequences, showing the existence of previously unknown huge compositional superstructures in the human genome.
Affiliation(s)
- J.L. Oliver
  - Dpto. de Genética, Inst. de Biotecnología, Universidad de Granada, 18071 Granada, Spain
- M. Hackenberg
  - Dpto. de Genética, Inst. de Biotecnología, Universidad de Granada, 18071 Granada, Spain
- A.V. Coronado
  - Dpto. de Física Aplicada II, Universidad de Málaga, 29071 Málaga, Spain
- P.Ch. Ivanov
  - Harvard Medical School, Division of Sleep Medicine, Brigham & Women’s Hospital, 02115 Boston, MA, USA
  - Department of Physics and Center for Polymer Studies, Boston University, 2215 Boston, MA, USA
  - Institute of Solid State Physics, Bulgarian Academy of Sciences, 1784 Sofia, Bulgaria
- P. Carpena
  - Dpto. de Física Aplicada II, Universidad de Málaga, 29071 Málaga, Spain
6
Carretero-Campos C, Bernaola-Galván P, Ch. Ivanov P, Carpena P. Phase transitions in the first-passage time of scale-invariant correlated processes. Phys Rev E Stat Nonlin Soft Matter Phys 2012; 85:011139. [PMID: 22400544 PMCID: PMC3518899 DOI: 10.1103/physreve.85.011139] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 03/29/2011] [Revised: 11/30/2011] [Indexed: 05/31/2023]
Abstract
A key quantity describing the dynamics of complex systems is the first-passage time (FPT). The statistical properties of FPT depend on the specifics of the underlying system dynamics. We present a unified approach to account for the diversity of statistical behaviors of FPT observed in real-world systems. We find three distinct regimes, separated by two transition points, with fundamentally different behavior for FPT as a function of increasing strength of the correlations in the system dynamics: stretched exponential, power-law, and saturation regimes. In the saturation regime, the average length of FPT diverges proportionally to the system size, with important implications for understanding electronic delocalization in one-dimensional correlated-disordered systems.
Affiliation(s)
- Plamen Ch. Ivanov
  - Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts 02212, USA
  - Harvard Medical School and Division of Sleep Medicine, Brigham and Women’s Hospital, Boston, Massachusetts 02115, USA
  - Institute of Solid State Physics, Bulgarian Academy of Sciences, 1784 Sofia, Bulgaria
- Pedro Carpena
  - Departamento de Física Aplicada II, Universidad de Málaga, E-29071 Málaga, Spain
7
Xu Y, Ma QD, Schmitt DT, Bernaola-Galván P, Ivanov PC. Effects of coarse-graining on the scaling behavior of long-range correlated and anti-correlated signals. Physica A 2011; 390:4057-4072. [PMID: 25392599 PMCID: PMC4226277 DOI: 10.1016/j.physa.2011.05.015] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 05/31/2023]
Abstract
We investigate how various coarse-graining (signal quantization) methods affect the scaling properties of long-range power-law correlated and anti-correlated signals, quantified by the detrended fluctuation analysis. Specifically, for coarse-graining in the magnitude of a signal, we consider (i) the Floor, (ii) the Symmetry and (iii) the Centro-Symmetry coarse-graining methods. We find that for anti-correlated signals coarse-graining in the magnitude leads to a crossover to random behavior at large scales, and that with increasing the width of the coarse-graining partition interval Δ, this crossover moves to intermediate and small scales. In contrast, the scaling of positively correlated signals is less affected by the coarse-graining, with no observable changes when Δ < 1, while for Δ > 1 a crossover appears at small scales and moves to intermediate and large scales with increasing Δ. For very rough coarse-graining (Δ > 3) based on the Floor and Symmetry methods, the position of the crossover stabilizes, in contrast to the Centro-Symmetry method where the crossover continuously moves across scales and leads to a random behavior at all scales; thus indicating a much stronger effect of the Centro-Symmetry compared to the Floor and the Symmetry method. For coarse-graining in time, where data points are averaged in non-overlapping time windows, we find that the scaling for both anti-correlated and positively correlated signals is practically preserved. The results of our simulations are useful for the correct interpretation of the correlation and scaling properties of symbolic sequences.
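The two families of coarse-graining described in this abstract can be sketched in a few lines of Python. The time version follows the abstract directly (averaging in non-overlapping windows). The magnitude version is only an assumption: the abstract names the Floor method without defining it, so the sketch takes it to mean rounding each value down to a multiple of the partition width Δ. Function and parameter names are hypothetical.

```python
import math


def coarse_grain_magnitude(signal, delta):
    """Quantize each value down to a multiple of delta.
    Assumed reading of a floor-style coarse-graining in magnitude."""
    return [math.floor(x / delta) * delta for x in signal]


def coarse_grain_time(signal, window):
    """Average the signal in non-overlapping windows of `window` points.
    An incomplete window at the end of the signal is dropped."""
    n_windows = len(signal) // window
    return [sum(signal[k * window:(k + 1) * window]) / window
            for k in range(n_windows)]


quantized = coarse_grain_magnitude([0.2, 1.7, -0.3, 2.0], delta=1.0)
averaged = coarse_grain_time([1, 2, 3, 4, 5], window=2)
```

Either output could then be passed to a fluctuation analysis to compare its scaling with that of the original signal.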
Affiliation(s)
- Yinlin Xu
  - Center for Polymer Studies and Department of Physics, Boston University, Boston, MA 02215, USA
  - College of Physics Science and Technology, Nanjing Normal University, Nanjing 210097, China
- Qianli D.Y. Ma
  - Harvard Medical School and Division of Sleep Medicine, Brigham & Women’s Hospital, Boston, MA 02215, USA
  - College of Geography and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
- Daniel T. Schmitt
  - Center for Polymer Studies and Department of Physics, Boston University, Boston, MA 02215, USA
- Plamen Ch. Ivanov
  - Center for Polymer Studies and Department of Physics, Boston University, Boston, MA 02215, USA
  - Harvard Medical School and Division of Sleep Medicine, Brigham & Women’s Hospital, Boston, MA 02215, USA
  - Departamento de Física Aplicada II, Universidad de Málaga, 29071 Málaga, Spain
8
Carpena P, Oliver JL, Hackenberg M, Coronado AV, Barturen G, Bernaola-Galván P. High-level organization of isochores into gigantic superstructures in the human genome. Phys Rev E Stat Nonlin Soft Matter Phys 2011; 83:031908. [PMID: 21517526 DOI: 10.1103/physreve.83.031908] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 05/28/2010] [Revised: 01/10/2011] [Indexed: 05/30/2023]
Abstract
Human DNA shows a complex structure with compositional features at many scales; the isochores--long DNA segments (~10⁵ bp) of relatively homogeneous guanine-cytosine (G + C) content--are the largest well-documented and well-analyzed compositional structures. However, we report here on the existence of a high-level compositional organization of isochores in the human genome. By using a segmentation algorithm incorporating the long-range correlations existing in human DNA, we find that every chromosome is composed of a few huge segments (~ 10⁷ bp) of relatively homogeneous G + C content, which become the largest compositional organization of the genome. Finally, we show evidence of the biological relevance of these superstructures, pointing to a large-scale functional organization of the human genome.
Affiliation(s)
- P Carpena
  - Departamento de Física Aplicada II, Universidad de Málaga, ES-29071, Málaga, Spain
9
Hackenberg M, Carpena P, Bernaola-Galván P, Barturen G, Alganza ÁM, Oliver JL. WordCluster: detecting clusters of DNA words and genomic elements. Algorithms Mol Biol 2011; 6:2. [PMID: 21261981 PMCID: PMC3037320 DOI: 10.1186/1748-7188-6-2] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Received: 08/30/2010] [Accepted: 01/24/2011] [Indexed: 01/26/2023] Open
Abstract
Background: Many k-mers (or DNA words) and genomic elements are known to be spatially clustered in the genome. Well-established examples are the genes, TFBSs, CpG dinucleotides, microRNA genes and ultra-conserved non-coding regions. Currently, no algorithm exists to find these clusters in a statistically comprehensible way. The detection of clustering often relies on densities and sliding-window approaches or arbitrarily chosen distance thresholds. Results: We introduce here an algorithm to detect clusters of DNA words (k-mers), or any other genomic element, based on the distance between consecutive copies and an assigned statistical significance. We implemented the method in a web server connected to a MySQL backend, which also determines the co-localization with gene annotations. We demonstrate the usefulness of this approach by detecting the clusters of CAG/CTG (cytosine contexts that can be methylated in undifferentiated cells), showing that the degree of methylation varies drastically between the inside and the outside of the clusters. As another example, we used WordCluster to search for statistically significant clusters of olfactory receptor (OR) genes in the human genome. Conclusions: WordCluster seems to predict biologically meaningful clusters of DNA words (k-mers) and genomic entities. The implementation of the method in a web server is available at http://bioinfo2.ugr.es/wordCluster/wordCluster.php, including additional features such as the detection of co-localization with gene regions and the annotation enrichment tool for functional analysis of overlapped genes.
10
Ma QDY, Bartsch RP, Bernaola-Galván P, Yoneyama M, Ivanov PC. Effect of extreme data loss on long-range correlated and anticorrelated signals quantified by detrended fluctuation analysis. Phys Rev E Stat Nonlin Soft Matter Phys 2010; 81:031101. [PMID: 20365691 PMCID: PMC3534784 DOI: 10.1103/physreve.81.031101] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Received: 11/19/2009] [Indexed: 05/29/2023]
Abstract
Detrended fluctuation analysis (DFA) is an improved method of classical fluctuation analysis for nonstationary signals where embedded polynomial trends mask the intrinsic correlation properties of the fluctuations. To better identify the intrinsic correlation properties of real-world signals where a large amount of data is missing or removed due to artifacts, we investigate how extreme data loss affects the scaling behavior of long-range power-law correlated and anticorrelated signals. We introduce a segmentation approach to generate surrogate signals by randomly removing data segments from stationary signals with different types of long-range correlations. The surrogate signals we generate are characterized by four parameters: (i) the DFA scaling exponent alpha of the original correlated signal u(i) , (ii) the percentage p of the data removed from u(i) , (iii) the average length mu of the removed (or remaining) data segments, and (iv) the functional form P(l) of the distribution of the length l of the removed (or remaining) data segments. We find that the global scaling exponent of positively correlated signals remains practically unchanged even for extreme data loss of up to 90%. In contrast, the global scaling of anticorrelated signals changes to uncorrelated behavior even when a very small fraction of the data is lost. These observations are confirmed on two examples of real-world signals: human gait and commodity price fluctuations. We further systematically study the local scaling behavior of surrogate signals with missing data to reveal subtle deviations across scales. We find that for anticorrelated signals even 10% of data loss leads to significant monotonic deviations in the local scaling at large scales from the original anticorrelated to uncorrelated behavior. 
In contrast, positively correlated signals show no observable changes in the local scaling for up to 65% of data loss, while for larger percentages of data loss, the local scaling shows overestimated regions (with higher local exponent) at small scales, followed by underestimated regions (with lower local exponent) at large scales. Finally, we investigate how the scaling is affected by the average length, probability distribution, and percentage of the remaining data segments in comparison to the removed segments. We find that the average length mu_{r} of the remaining segments is the key parameter which determines the scales at which the local scaling exponent has a maximum deviation from its original value. Interestingly, the scales where the maximum deviation occurs follow a power-law relationship with mu_{r}, whereas the percentage of data loss determines the extent of the deviation. The results presented in this paper are useful to correctly interpret the scaling properties obtained from signals with extreme data loss.
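For readers unfamiliar with DFA, the following is a minimal pure-Python sketch of first-order detrended fluctuation analysis. It is a textbook-style illustration under simple assumptions (linear detrending, non-overlapping boxes, a single global exponent), not the implementation used in the paper, and all function names are hypothetical.

```python
import math
import random


def linear_fit_residual_ss(y):
    """Sum of squared residuals of a least-squares line fitted to y
    against x = 0, 1, ..., len(y) - 1."""
    n = len(y)
    mean_x = (n - 1) / 2
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in enumerate(y))
    slope = sxy / sxx
    return sum((v - mean_y - slope * (x - mean_x)) ** 2
               for x, v in enumerate(y))


def dfa_fluctuation(signal, scale):
    """Root-mean-square fluctuation F(scale) of the integrated signal
    after removing a linear trend in each non-overlapping box."""
    mean = sum(signal) / len(signal)
    profile, acc = [], 0.0
    for v in signal:
        acc += v - mean
        profile.append(acc)
    n_boxes = len(profile) // scale
    total = sum(linear_fit_residual_ss(profile[b * scale:(b + 1) * scale])
                for b in range(n_boxes))
    return math.sqrt(total / (n_boxes * scale))


def dfa_exponent(signal, scales):
    """Slope of log F(n) versus log n, i.e. the scaling exponent alpha."""
    xs = [math.log(s) for s in scales]
    ys = [math.log(dfa_fluctuation(signal, s)) for s in scales]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))


# Uncorrelated Gaussian noise should give an exponent close to 0.5.
random.seed(0)
white_noise = [random.gauss(0.0, 1.0) for _ in range(4096)]
alpha = dfa_exponent(white_noise, [8, 16, 32, 64, 128])
```

A data-loss experiment in the spirit of the abstract would remove segments from `white_noise` (or from a correlated signal), concatenate the remainder, and compare the resulting exponent with the original one.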
Affiliation(s)
- Qianli D. Y. Ma
  - Harvard Medical School and Division of Sleep Medicine, Brigham and Women’s Hospital, Boston, Massachusetts 02115, USA
  - College of Geography and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
- Ronny P. Bartsch
  - Harvard Medical School and Division of Sleep Medicine, Brigham and Women’s Hospital, Boston, Massachusetts 02115, USA
- Mitsuru Yoneyama
  - Mitsubishi Chemical Group, Science and Technology Research Center Inc., Yokohama 227-8502, Japan
- Plamen Ch. Ivanov
  - Harvard Medical School and Division of Sleep Medicine, Brigham and Women’s Hospital, Boston, Massachusetts 02115, USA
  - Departamento de Física Aplicada II, Universidad de Málaga, 29071 Málaga, Spain
  - Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts 02215, USA
11
Carpena P, Bernaola-Galván P, Hackenberg M, Coronado AV, Oliver JL. Level statistics of words: finding keywords in literary texts and symbolic sequences. Phys Rev E Stat Nonlin Soft Matter Phys 2009; 79:035102. [PMID: 19392005 DOI: 10.1103/physreve.79.035102] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Received: 05/21/2008] [Indexed: 05/27/2023]
Abstract
Using a generalization of the level statistics analysis of quantum disordered systems, we present an approach able to extract automatically keywords in literary texts. Our approach takes into account not only the frequencies of the words present in the text but also their spatial distribution along the text, and is based on the fact that relevant words are significantly clustered (i.e., they self-attract each other), while irrelevant words are distributed randomly in the text. Since a reference corpus is not needed, our approach is especially suitable for single documents for which no a priori information is available. In addition, we show that our method works also in generic symbolic sequences (continuous texts without spaces), thus suggesting its general applicability.
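The idea that relevant words cluster while irrelevant ones are spread at random can be illustrated with a small Python sketch. It assumes one simple clustering statistic: the standard deviation of the distances between consecutive occurrences of a word, divided by their mean. For a word placed at random this ratio is expected to be close to 1, and clustered words give larger values. The function name is hypothetical, and any refinement the paper applies on top of such a statistic is not reproduced here.

```python
import statistics


def spacing_sigma(words, target):
    """Normalized standard deviation of the distances between consecutive
    occurrences of `target` in `words`; None if fewer than two distances."""
    positions = [i for i, w in enumerate(words) if w == target]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return None
    return statistics.pstdev(gaps) / statistics.mean(gaps)


# Two toy texts of 60 tokens, each with six occurrences of "key".
clustered_text = ["x"] * 60
for p in (0, 1, 2, 50, 51, 52):
    clustered_text[p] = "key"
even_text = ["key" if i % 10 == 0 else "x" for i in range(60)]

sigma_clustered = spacing_sigma(clustered_text, "key")
sigma_even = spacing_sigma(even_text, "key")
```

Ranking all words of a document by such a statistic needs no reference corpus, which matches the single-document setting described in the abstract.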
Affiliation(s)
- P Carpena
  - Departamento de Física Aplicada II, Universidad de Málaga, 29071 Málaga, Spain
12
Oliver JL, Bernaola-Galván P, Hackenberg M, Carpena P. Phylogenetic distribution of large-scale genome patchiness. BMC Evol Biol 2008; 8:107. [PMID: 18405379 PMCID: PMC2397391 DOI: 10.1186/1471-2148-8-107] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 11/15/2007] [Accepted: 04/11/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The phylogenetic distribution of large-scale genome structure (i.e. mosaic compositional patchiness) has been explored mainly by analytical ultracentrifugation of bulk DNA. However, with the availability of large, good-quality chromosome sequences, and the recently developed computational methods to directly analyze patchiness on the genome sequence, an evolutionary comparative analysis can be carried out at the sequence level. RESULTS The local variations in the scaling exponent of the Detrended Fluctuation Analysis are used here to analyze large-scale genome structure and directly uncover the characteristic scales present in genome sequences. Furthermore, through shuffling experiments of selected genome regions, computationally-identified, isochore-like regions were identified as the biological source for the uncovered large-scale genome structure. The phylogenetic distribution of short- and large-scale patchiness was determined in the best-sequenced genome assemblies from eleven eukaryotic genomes: mammals (Homo sapiens, Pan troglodytes, Mus musculus, Rattus norvegicus, and Canis familiaris), birds (Gallus gallus), fishes (Danio rerio), invertebrates (Drosophila melanogaster and Caenorhabditis elegans), plants (Arabidopsis thaliana) and yeasts (Saccharomyces cerevisiae). We found large-scale patchiness of genome structure, associated with in silico determined, isochore-like regions, throughout this wide phylogenetic range. CONCLUSION Large-scale genome structure is detected by directly analyzing DNA sequences in a wide range of eukaryotic chromosome sequences, from human to yeast. In all these genomes, large-scale patchiness can be associated with the isochore-like regions, as directly detected in silico at the sequence level.
Affiliation(s)
- José L Oliver
- Dpto de Genética, Facultad de Ciencias, Universidad de Granada, Spain.
|
13
|
Carpena P, Bernaola-Galván P, Coronado AV, Hackenberg M, Oliver JL. Identifying characteristic scales in the human genome. Phys Rev E Stat Nonlin Soft Matter Phys 2007; 75:032903. [PMID: 17500745 DOI: 10.1103/physreve.75.032903] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Received: 06/01/2006] [Indexed: 05/15/2023]
Abstract
The scale-free, long-range correlations detected in DNA sequences contrast with characteristic lengths of genomic elements, being particularly incompatible with the isochores (long, homogeneous DNA segments). By computing the local behavior of the scaling exponent alpha of detrended fluctuation analysis (DFA), we discriminate between sequences with and without true scaling, and we find that no single scaling exists in the human genome. Instead, human chromosomes show a common compositional structure with two characteristic scales, the large one corresponding to the isochores and the other to small and medium scale genomic elements.
Affiliation(s)
- P Carpena
- Departamento de Física Aplicada II, Universidad de Málaga, 29071 Málaga, Spain
|
14
|
Hackenberg M, Bernaola-Galván P, Carpena P, Oliver JL. The Biased Distribution of Alus in Human Isochores Might Be Driven by Recombination. J Mol Evol 2005; 60:365-77. [PMID: 15871047 DOI: 10.1007/s00239-004-0197-2] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.9] [Received: 06/28/2004] [Accepted: 10/01/2004] [Indexed: 11/30/2022]
Abstract
Alu retrotransposons do not show a homogeneous distribution over the human genome but have a higher density in GC-rich (H) than in AT-rich (L) isochores. However, since they preferentially insert into the L isochores, the question arises: What is the evolutionary mechanism that shifts the Alu density maximum from L to H isochores? To disclose the role played by each of the potential mechanisms involved in such biased distribution, we carried out a genome-wide analysis of the density of the Alus as a function of their evolutionary age, isochore membership, and intron vs. intergene location. Since Alus depend on the retrotransposase encoded by the LINE1 elements, we also studied the distribution of LINE1 to provide a complete evolutionary scenario. We consecutively check, and discard, the contributions of the Alu/LINE1 competition for retrotransposase, compositional matching pressure, and Alu overrepresentation in introns. In analyzing the role played by unequal recombination, we scan the genome for Alu trimers, a direct product of Alu-Alu recombination. Through computer simulations, we show that such trimers are much more frequent than expected, the observed/expected ratio being higher in L than in H isochores. This result, together with the known higher selective disadvantage of recombination products in H isochores, points to Alu-Alu recombination as the main agent provoking the density shift of Alus toward the GC-rich parts of the genome. Two independent pieces of evidence (the lower evolutionary divergence shown by recently inserted Alu subfamilies and the higher frequency of old stand-alone Alus in L isochores) support such a conclusion. Other evolutionary factors, such as population bottlenecks during primate speciation, may have accelerated the fast accumulation of Alus in GC-rich isochores.
Affiliation(s)
- Michael Hackenberg
- Departamento de Genética, Facultad de Ciencias, Universidad de Granada, Spain
|
15
|
Carpena P, Bernaola-Galván P, Ivanov PC. New class of level statistics in correlated disordered chains. Phys Rev Lett 2004; 93:176804. [PMID: 15525105 DOI: 10.1103/physrevlett.93.176804] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 11/24/2003] [Indexed: 05/24/2023]
Abstract
We study the properties of the level statistics of 1D disordered systems with long-range spatial correlations. We find a threshold value in the degree of correlations below which in the limit of large system size the level statistics follows a Poisson distribution (as expected for 1D uncorrelated-disordered systems), and above which the level statistics is described by a new class of distribution functions. At the threshold, we find that with increasing system size, the standard deviation of the function describing the level statistics converges to the standard deviation of the Poissonian distribution as a power law. Above the threshold we find that the level statistics is characterized by different functional forms for different degrees of correlations.
Affiliation(s)
- Pedro Carpena
- Departamento de Física Aplicada II. E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071, Málaga, Spain
|
16
|
Bernaola-Galván P, Oliver JL, Carpena P, Clay O, Bernardi G. Quantifying intrachromosomal GC heterogeneity in prokaryotic genomes. Gene 2004; 333:121-33. [PMID: 15177687 DOI: 10.1016/j.gene.2004.02.042] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.3] [Received: 08/29/2003] [Revised: 11/14/2003] [Accepted: 02/10/2004] [Indexed: 11/15/2022]
Abstract
The sequencing of prokaryotic genomes covering a wide taxonomic range has sparked renewed interest in intrachromosomal compositional (GC) heterogeneity, largely in view of lateral transfers. We present here a brief overview of some methods for visualizing and quantifying GC variation in prokaryotes. We used these methods to examine heterogeneity levels in sequenced prokaryotes, for a range of scales or stringencies. Some species are consistently homogeneous, whereas others are markedly heterogeneous in comparison, in particular Aeropyrum pernix, Xylella fastidiosa, Mycoplasma genitalium, Enterococcus faecalis, Bacillus subtilis, Pyrobaculum aerophilum, Vibrio vulnificus chromosome I, Deinococcus radiodurans chromosome II and Halobacterium. As we discuss here, the wide range of heterogeneities calls for reexamination of an accepted belief, namely that the endogenous DNA of bacteria and archaea should typically exhibit low intrachromosomal GC contrasts. Supplementary results for all species analyzed are available at our website: http://bioinfo2.ugr.es/prok.
|
17
|
Oliver JL, Carpena P, Hackenberg M, Bernaola-Galván P. IsoFinder: computational prediction of isochores in genome sequences. Nucleic Acids Res 2004; 32:W287-92. [PMID: 15215396 PMCID: PMC441537 DOI: 10.1093/nar/gkh399] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.0] [Received: 01/22/2004] [Revised: 03/04/2004] [Accepted: 03/25/2004] [Indexed: 11/13/2022]
Abstract
Isochores are long genome segments homogeneous in G+C. Here, we describe an algorithm (IsoFinder) running on the web (http://bioinfo2.ugr.es/IsoF/isofinder.html) able to predict isochores at the sequence level. We move a sliding pointer from left to right along the DNA sequence. At each position of the pointer, we compute the mean G+C values to the left and to the right of the pointer. We then determine the position of the pointer for which the difference between left and right mean values (as measured by the t-statistic) reaches its maximum. Next, we determine the statistical significance of this potential cutting point, after filtering out short-scale heterogeneities below 3 kb by applying a coarse-graining technique. Finally, the program checks whether this significance exceeds a probability threshold. If so, the sequence is cut at this point into two subsequences; otherwise, the sequence remains undivided. The procedure continues recursively for each of the two resulting subsequences created by each cut. This leads to the decomposition of a chromosome sequence into long homogeneous genome regions (LHGRs) with well-defined mean G+C contents, each significantly different from the G+C contents of the adjacent LHGRs. Most LHGRs can be identified with Bernardi's isochores, given their correlation with biological features such as gene density, SINE and LINE (short, long interspersed repetitive elements) densities, recombination rate or single nucleotide polymorphism variability. The resulting isochore maps are available at our web site (http://bioinfo2.ugr.es/isochores/), and also at the UCSC Genome Browser (http://genome.cse.ucsc.edu/).
Affiliation(s)
- José L Oliver
- Departamento de Genética, Instituto de Biotecnología, Facultad de Ciencias, Universidad de Granada, Spain.
|
18
|
|
19
|
Abstract
The isochore concept in the human genome sequence was challenged in an analysis by the International Human Genome Sequencing Consortium (IHGSC). We argue here that a statement in the IHGSC's analysis concerning the existence of isochores is misleading, because the homogeneity was not examined at a large enough length scale and consequently an inappropriate statistical test was applied. A test of the existence of isochores should be equivalent to a test of homogeneity or equality of windowed GC%. The statistical test applied in the IHGSC's analysis, the binomial test, is a test of whether individual bases are independent and identically-distributed (iid). For testing the existence of isochores, or homogeneity in windowed GC%, we propose to use another statistical test: the analysis of variance (ANOVA). It can be shown that DNA sequences that are rejected by the binomial test may not be rejected by the ANOVA test.
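As a rough illustration of testing homogeneity in windowed GC%, the following Python sketch computes G+C fractions in non-overlapping windows and a one-way ANOVA F statistic across groups of windows. How windows are grouped, and the names `gc_windows` and `anova_f`, are assumptions made for this example; the abstract does not specify them, and no p-value is computed here (that would require the F distribution with k-1 and n-k degrees of freedom).

```python
def gc_windows(seq, window):
    """G+C fraction of consecutive, non-overlapping windows.

    A trailing partial window is dropped.
    """
    seq = seq.upper()
    return [
        sum(base in "GC" for base in seq[i:i + window]) / window
        for i in range(0, len(seq) - window + 1, window)
    ]

def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers.

    Raises ZeroDivisionError if there is no within-group variation.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Each group would hold the windowed GC values of one larger region (assumed grouping).
example_f = anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

A large F indicates that mean GC% differs between groups by more than the window-to-window scatter within them, which is the kind of heterogeneity the abstract proposes to test for.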
Affiliation(s)
- Wentian Li
- Center for Genomics and Human Genetics, North Shore LIJ Research Institute, 350 Community Drive, Manhasset, NY 11030, USA.
|
20
|
Abstract
We present a coding measure which is based on the statistical properties of the stop codons, and that is able to estimate accurately the variation of coding content along an anonymous sequence. As the stop codons play the same role in all the genomes (with very few exceptions) the measure turns out to be species-independent. We show results both for prokaryotic and for eukaryotic genomes, indicating, first, the accuracy of the measure, and, second, that better prediction is achieved if the measure is applied to homogeneous, isochore-like sequences than if it is applied following the standard moving window approach. Finally, we discuss some of the possible applications of the measure.
Affiliation(s)
- P Carpena
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, Malaga, Spain.
|
21
|
Abstract
Here we present a study of statistical correlations among different positions in DNA sequences and their implications by directly using the autocorrelation function. Such an analysis is possible now because of the availability of large sequences or even complete genomes of many organisms. After describing the way in which the autocorrelation function can be applied to DNA-sequence analysis, we show that long-range correlations, implying scale independence, appear in several bacterial genomes as well as in long human chromosome contigs. The source for such correlations in bacteria, which may extend up to 60 kb in Bacillus subtilis, may be related to massive lateral transfer of compositionally biased genes from other genomes. In the human genome, correlations extend for more than five decades and may be related to the evolution of the 'neogenome', a modern evolutionary acquisition composed of GC-rich isochores displaying long-range correlations and scale invariance.
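A minimal Python sketch of an autocorrelation function for a DNA sequence is given below. It assumes one particular numeric mapping (1 for G or C, 0 otherwise), which the abstract does not specify, and the function name `gc_autocorrelation` is hypothetical.

```python
def gc_autocorrelation(seq, lag):
    """C(lag) = <x_i * x_{i+lag}> - <x>^2 for the G/C indicator x_i of `seq`.

    The 0/1 mapping is an assumption made for illustration.
    """
    x = [1.0 if base in "GC" else 0.0 for base in seq.upper()]
    n = len(x)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(seq)")
    mean = sum(x) / n
    pair_mean = sum(x[i] * x[i + lag] for i in range(n - lag)) / (n - lag)
    return pair_mean - mean * mean

# A strictly alternating toy sequence is anti-correlated at lag 1
# and positively correlated at lag 2.
toy = "GA" * 50
c1 = gc_autocorrelation(toy, 1)
c2 = gc_autocorrelation(toy, 2)
```

Plotting C(lag) against lag on log-log axes is the usual way to look for the power-law decay that long-range correlations would produce.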
Affiliation(s)
- P Bernaola-Galván
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, Málaga, Spain.
|
22
|
Abstract
The human genome is a mosaic of isochores, which are long DNA segments (>>300 kbp) relatively homogeneous in G+C. Human isochores were first identified by density-gradient ultracentrifugation of bulk DNA, and differ in important features, e.g. genes are found predominantly in the GC-richest isochores. Here, we use a reliable segmentation method to partition the longest contigs in the human genome draft sequence into long homogeneous genome regions (LHGRs), thereby revealing the isochore structure of the human genome. The advantages of the isochore maps presented here are: (1) sequence heterogeneities at different scales are shown in the same plot; (2) pair-wise compositional differences between adjacent regions are all statistically significant; (3) isochore boundaries are accurately defined to single base pair resolution; and (4) both gradual and abrupt isochore boundaries are simultaneously revealed. Taking advantage of the wide sample of genome sequence analyzed, we investigate the correspondence between LHGRs and true human isochores revealed through DNA centrifugation. LHGRs show many of the typical isochore features, mainly size distribution, G+C range, and proportions of the isochore classes. The relative density of genes, Alu and long interspersed nuclear element repeats and the different types of single nucleotide polymorphisms on LHGRs also coincide with expectations in true isochores. Potential applications of isochore maps range from the improvement of gene-finding algorithms to the prediction of linkage disequilibrium levels in association studies between marker genes and complex traits. The coordinates for the LHGRs identified in all the contigs longer than 2 Mb in the human genome sequence are available at the online resource on isochore mapping: http://bioinfo2.ugr.es/isochores.
Affiliation(s)
- José L Oliver
- Departamento de Genética, Instituto de Biotecnología, Universidad de Granada, Granada, Spain.
|
23
|
Abstract
According to Bloch's theorem, electronic wavefunctions in perfectly ordered crystals are extended, which implies that the probability of finding an electron is the same over the entire crystal. Such extended states can lead to metallic behaviour. But when disorder is introduced in the crystal, electron states can become localized, and the system can undergo a metal-insulator transition (also known as an Anderson transition). Here we theoretically investigate the effect on the physical properties of the electron wavefunctions of introducing long-range correlations in the disorder in one-dimensional binary solids, and find a correlation-induced metal-insulator transition. We perform numerical simulations using a one-dimensional tight-binding model, and find a threshold value for the exponent characterizing the long-range correlations of the system. Above this threshold, and in the thermodynamic limit, the system behaves as a conductor within a broad energy band; below threshold, the system behaves as an insulator. We discuss the possible relevance of this result for electronic transport in DNA, which displays long-range correlations and has recently been reported to be a one-dimensional disordered conductor.
Affiliation(s)
- Pedro Carpena
- Departamento de Física Aplicada II, ETSI de Telecomunicación, Universidad de Málaga, 29071 Málaga, Spain.
|
24
|
Abstract
Recursive segmentation is a procedure that partitions a DNA sequence into domains with a homogeneous composition of the four nucleotides A, C, G and T. This procedure can also be applied to any sequence converted from a DNA sequence, such as to a binary strong(G + C)/weak(A + T) sequence, to a binary sequence indicating the presence or absence of the dinucleotide CpG, or to a sequence indicating both the base and the codon position information. We apply various conversion schemes in order to address the following five DNA sequence analysis problems: isochore mapping, CpG island detection, locating the origin and terminus of replication in bacterial genomes, finding complex repeats in telomere sequences, and delineating coding and noncoding regions. We find that the recursive segmentation procedure can successfully detect isochore borders, CpG islands, and the origin and terminus of replication, but it needs improvement for detecting complex repeats as well as borders between coding and noncoding regions.
Affiliation(s)
- Wentian Li
- Center for Genomics and Human Genetics, North Shore-LIJ Research Institute, Manhasset, NY 11030, USA.
|
25
|
Bernaola-Galván P, Carpena P. Comment on "Factorial moments analyses show a characteristic length scale in DNA sequences". Phys Rev Lett 2002; 88:219803-219804. [PMID: 12059508 DOI: 10.1103/physrevlett.88.219803] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2001] [Indexed: 05/23/2023]
Affiliation(s)
- P Bernaola-Galván
- Departamento de Física Aplicada II E.T.S.I. de Telecomunicación, Universidad de Málaga, Málaga 29071, Spain
|
26
|
Azad RK, Bernaola-Galván P, Ramaswamy R, Rao JS. Segmentation of genomic DNA through entropic divergence: power laws and scaling. Phys Rev E Stat Nonlin Soft Matter Phys 2002; 65:051909. [PMID: 12059595 DOI: 10.1103/physreve.65.051909] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Received: 10/08/2001] [Revised: 01/22/2002] [Indexed: 11/07/2022]
Abstract
Genomic DNA is fragmented into segments using the Jensen-Shannon divergence. Use of this criterion results in the fragments being entropically homogeneous to within a predefined level of statistical significance. Application of this procedure is made to complete genomes of organisms from archaebacteria, eubacteria, and eukaryotes. The distribution of fragment lengths in bacterial and primitive eukaryotic DNAs shows two distinct regimes of power-law scaling. The characteristic length separating these two regimes appears to be an intrinsic property of the sequence rather than a finite-size artifact, and is independent of the significance level used in segmenting a given genome. Fragment length distributions obtained in the segmentation of the genomes of more highly evolved eukaryotes do not have such distinct regimes of power-law behavior.
Affiliation(s)
- Rajeev K Azad
- School of Environmental Sciences, Jawaharlal Nehru University, New Delhi 110 067, India.
|
27
|
Grosse I, Bernaola-Galván P, Carpena P, Román-Roldán R, Oliver J, Stanley HE. Analysis of symbolic sequences using the Jensen-Shannon divergence. Phys Rev E Stat Nonlin Soft Matter Phys 2002; 65:041905. [PMID: 12005871 DOI: 10.1103/physreve.65.041905] [Citation(s) in RCA: 108] [Impact Index Per Article: 4.9] [Received: 12/22/2000] [Revised: 08/08/2001] [Indexed: 05/23/2023]
Abstract
We study statistical properties of the Jensen-Shannon divergence D, which quantifies the difference between probability distributions, and which has been widely applied to analyses of symbolic sequences. We present three interpretations of D in the framework of statistical physics, information theory, and mathematical statistics, and obtain approximations of the mean, the variance, and the probability distribution of D in random, uncorrelated sequences. We present a segmentation method based on D that is able to segment a nonstationary symbolic sequence into stationary subsequences, and apply this method to DNA sequences, which are known to be nonstationary on a wide range of different length scales.
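For readers unfamiliar with the quantity, the following Python sketch computes a length-weighted Jensen-Shannon divergence between two adjacent subsequences, D = H(whole) - (n1/n) H(left) - (n2/n) H(right), and scans for the cut that maximizes it. This is a brute-force illustration under that standard definition; the function names are hypothetical, and the significance assessment and recursion that a full segmentation method needs are omitted.

```python
from collections import Counter
import math

def shannon_entropy(counts, n):
    """Shannon entropy (bits) of the symbol frequencies counts[s] / n."""
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

def js_divergence(left, right):
    """Length-weighted Jensen-Shannon divergence between two symbol sequences."""
    n1, n2 = len(left), len(right)
    n = n1 + n2
    c1, c2 = Counter(left), Counter(right)
    whole = shannon_entropy(c1 + c2, n)
    return whole - (n1 / n) * shannon_entropy(c1, n1) - (n2 / n) * shannon_entropy(c2, n2)

def best_js_cut(seq):
    """Return (position, D) of the cut maximizing D; O(n^2), for illustration only."""
    best_pos, best_d = None, 0.0
    for i in range(1, len(seq)):
        d = js_divergence(seq[:i], seq[i:])
        if d > best_d:
            best_pos, best_d = i, d
    return best_pos, best_d

# Two compositionally distinct halves: the best cut falls at their border.
cut, d_max = best_js_cut("A" * 20 + "T" * 20)
```

In a segmentation setting, the cut would be accepted only if the maximal D is statistically significant, and the procedure would then be repeated on each resulting subsequence.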
Affiliation(s)
- Ivo Grosse
- Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
|
28
|
Bernaola-Galván P, Ivanov PC, Nunes Amaral LA, Stanley HE. Scale invariance in the nonstationarity of human heart rate. Phys Rev Lett 2001; 87:168105. [PMID: 11690251 DOI: 10.1103/physrevlett.87.168105] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.1] [Received: 05/17/2000] [Indexed: 05/23/2023]
Abstract
We introduce a segmentation algorithm to probe the temporal organization of heterogeneities in human heartbeat interval time series. We find that the lengths of segments with different local mean heart rates follow a power-law distribution and show that this scale-invariant structure is not a simple consequence of the long-range correlations present in the data. The differences in mean heart rates between consecutive segments display a common functional form, but with different parameters for healthy individuals and for heart-failure patients. These findings suggest that there is relevant physiological information hidden in the heterogeneities of the heartbeat time series.
Affiliation(s)
- P Bernaola-Galván
- Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts 02215, USA
|
29
|
Abstract
Analytical DNA ultracentrifugation revealed that eukaryotic genomes are mosaics of isochores: long DNA segments (>>300 kb on average) relatively homogeneous in G+C. Important genome features are dependent on this isochore structure, e.g. genes are found predominantly in the GC-richest isochore classes. However, no reliable method is available to rigorously partition the genome sequence into relatively homogeneous regions of different composition, thereby revealing the isochore structure of chromosomes at the sequence level. Homogeneous regions are currently ascertained by plain statistics on moving windows of arbitrary length, or simply by eye on G+C plots. On the contrary, the entropic segmentation method is able to divide a DNA sequence into relatively homogeneous, statistically significant domains. An early version of this algorithm only produced domains having an average length far below the typical isochore size. Here we show that an improved segmentation method, specifically intended to determine the most statistically significant partition of the sequence at each scale, is able to identify the boundaries between long homogeneous genome regions displaying the typical features of isochores. The algorithm precisely locates classes II and III of the human major histocompatibility complex region, two well-characterized isochores at the sequence level, the boundary between them being the first isochore boundary experimentally characterized at the sequence level. The analysis is then extended to a collection of human large contigs. The relatively homogeneous regions we find show many of the features (G+C range, relative proportion of isochore classes, size distribution, and relationship with gene density) of the isochores identified through DNA centrifugation. Isochore chromosome maps, with many potential applications in genomics, are then drawn for all the completely sequenced eukaryotic genomes available.
Affiliation(s)
- J L Oliver
- Departamento de Genética, Instituto de Biotecnología, Universidad de Granada, E-18071, Granada, Spain.
|
30
|
Bernaola-Galván P, Grosse I, Carpena P, Oliver JL, Román-Roldán R, Stanley HE. Finding borders between coding and noncoding DNA regions by an entropic segmentation method. Phys Rev Lett 2000; 85:1342-1345. [PMID: 10991547 DOI: 10.1103/physrevlett.85.1342] [Citation(s) in RCA: 43] [Impact Index Per Article: 1.8] [Received: 08/02/1999] [Indexed: 05/23/2023]
Abstract
We present a new computational approach to finding borders between coding and noncoding DNA. This approach has two features: (i) DNA sequences are described by a 12-letter alphabet that captures the differential base composition at each codon position, and (ii) the search for the borders is carried out by means of an entropic segmentation method which uses only the general statistical properties of coding DNA. We find that this method is highly accurate in finding borders between coding and noncoding regions and requires no "prior training" on known data sets. Our results appear to be more accurate than those obtained with moving windows in the discrimination of coding from noncoding DNA.
Affiliation(s)
- P Bernaola-Galván
- Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts 02215, USA
|
31
|
Abstract
MOTIVATION DNA sequences are formed by patches or domains of different nucleotide composition. In a few simple sequences, domains can simply be identified by eye; however, most DNA sequences show a complex compositional heterogeneity (fractal structure), which cannot be properly detected by current methods. Recently, a computationally efficient segmentation method to analyse such nonstationary sequence structures, based on the Jensen-Shannon entropic divergence, has been described. Specific algorithms implementing this method are now needed. RESULTS Here we describe a heuristic segmentation algorithm for DNA sequences, which was implemented on a Windows program (SEGMENT). The program divides a DNA sequence into compositionally homogeneous domains by iterating a local optimization procedure at a given statistical significance. Once a sequence is partitioned into domains, a global measure of sequence compositional complexity (SCC), accounting for both the sizes and compositional biases of all the domains in the sequence, is derived. SEGMENT computes SCC as a function of the significance level, which provides a multiscale view of sequence complexity.
Affiliation(s)
- J L Oliver
- Department of Genetics, Faculty of Sciences, University of Granada, Spain.
|
32
|
Abstract
The heterogeneity within, and similarities between, yeast chromosomes are studied. For the former, we show by the size distribution of domains, coding density, size distribution of open reading frames, spatial power spectra, and deviation from binomial distribution for C + G% in large moving windows that there is a strong deviation of the yeast sequences from random sequences. For the latter, not only do we graphically illustrate the similarity for the above mentioned statistics, but we also carry out a rigorous analysis of variance (ANOVA) test. The hypothesis that all yeast chromosomes are similar cannot be rejected by this test. We examine the two possible explanations of this interchromosomal uniformity: a common origin, such as genome-wide duplication (polyploidization), and a concerted evolutionary process.
Affiliation(s)
- W Li
- Laboratory of Statistical Genetics, Rockefeller University, New York, New York 10021 USA.
|
33
|
Bernaola-Galván P, Román-Roldán R, Oliver JL. Compositional segmentation and long-range fractal correlations in DNA sequences. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics 1996; 53:5181-5189. [PMID: 9964850 DOI: 10.1103/physreve.53.5181] [Citation(s) in RCA: 97] [Impact Index Per Article: 3.5] [Indexed: 05/22/2023]
|
34
|
Abstract
A new method to determine entropic profiles in DNA sequences is presented. It is based on the chaos-game representation (CGR) of gene structure, a technique which produces a fractal-like picture of DNA sequences. First, the CGR image was divided into squares 4^(-m) in size (m being the desired resolution), and the point density counted. Second, appropriate intervals were adjusted, and then a histogram of densities was prepared. Third, Shannon's formula was applied to the probability-distribution histogram, thus obtaining a new entropic estimate for DNA sequences, the histogram entropy, a measurement that goes with the level of constraints on the DNA sequence. Lastly, the entropic profile for the sequence was drawn, by considering the entropies at each resolution level, thus providing a way to summarize the complexity of large genomic regions or even entire genomes at different resolution levels. The application of the method to DNA sequences reveals that entropic profiles obtained in this way, as opposed to previously published ones, clearly discriminate between random and natural DNA sequences. Entropic profiles also show a different degree of variability within and between genomes. The results of these analyses are discussed in relation both to the genome compartmentalization in vertebrates and to the differential action of compositional and/or functional constraints on DNA sequences.
Affiliation(s)
- J L Oliver
- Department of Genetics, Faculty of Sciences, University of Granada, Spain
|