1
Díaz-Domínguez D, Leinonen M, Salmela L. Space-efficient computation of k-mer dictionaries for large values of k. Algorithms Mol Biol 2024; 19:14. [PMID: 38581000 PMCID: PMC10996146 DOI: 10.1186/s13015-024-00259-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2023] [Accepted: 03/02/2024] [Indexed: 04/07/2024] Open
Abstract
Computing k-mer frequencies in a collection of reads is a common procedure in many genomic applications. Several state-of-the-art k-mer counters rely on hash tables to carry out this task, but they are often optimised for small k, as a hash table keeping keys explicitly (i.e., k-mer sequences) takes O(Nk/w) computer words, N being the number of distinct k-mers and w the computer word size, which is impractical for large values of k. This space usage is an important limitation, as analysis of long and accurate HiFi sequencing reads can require larger values of k. We propose Kaarme, a space-efficient hash table for k-mers using O(N + uk/w) words of space, where u is the number of reads. Our framework exploits the fact that consecutive k-mers overlap by k - 1 symbols. Thus, we only store the last symbol of a k-mer and a pointer within the hash table to a previous one, which we can use to recover the remaining k - 1 symbols. We adapt Kaarme to compute canonical k-mers as well. This variant also uses pointers within the hash table to save space but requires more work to decode the k-mers. Specifically, it takes O(σ^k) time in the worst case, σ being the DNA alphabet, but our experiments show this is hardly ever the case. The canonical variant does not improve our theoretical results but greatly reduces space usage in practice while keeping a competitive performance to get the k-mers and their frequencies. We compare canonical Kaarme to a regular hash table storing canonical k-mers explicitly as keys and show that our method uses up to five times less space while being less than 1.5 times slower. We also show that canonical Kaarme uses significantly less memory than state-of-the-art k-mer counters when they do not resort to disk to keep intermediate results.
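The pointer idea in this abstract can be illustrated with a minimal Python sketch. It is not the Kaarme implementation: it handles a single read, the names `build` and `decode` are hypothetical, and a plain dict keyed by the full k-mer stands in for the hash table (which defeats the space saving and is used here only to keep the example short). What it shows is how a k-mer can be recovered from last symbols and predecessor pointers alone.

```python
def build(read, k):
    """Store the first k-mer explicitly; every later new k-mer keeps only
    its last symbol and the id of the k-mer that preceded it in the read."""
    index = {}    # k-mer -> entry id (stand-in for the hash table; assumption)
    entries = []  # ("full", kmer) or ("link", last_symbol, predecessor_id)
    counts = []   # counts[i] is the frequency of entry i
    prev = None
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in index:
            eid = index[kmer]
            counts[eid] += 1
        else:
            eid = len(entries)
            index[kmer] = eid
            if prev is None:
                entries.append(("full", kmer))
            else:
                entries.append(("link", kmer[-1], prev))
            counts.append(1)
        prev = eid
    return entries, counts


def decode(entries, eid, k):
    """Recover the k-mer of entry `eid` by walking predecessor pointers."""
    suffix = []  # last symbols collected while walking backwards
    while len(suffix) < k:
        entry = entries[eid]
        if entry[0] == "full":
            # After j backward steps the explicit k-mer contributes
            # everything except its first j symbols.
            return entry[1][len(suffix):] + "".join(reversed(suffix))
        suffix.append(entry[1])
        eid = entry[2]
    return "".join(reversed(suffix))


read, k = "ACGTACGTTG", 4
entries, counts = build(read, k)
decoded = {decode(entries, e, k): counts[e] for e in range(len(entries))}
```

Each link entry costs one symbol plus one pointer regardless of k, which is where the O(N) term in the stated bound comes from; the explicitly stored k-mer per read accounts for the uk/w term.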
Affiliation(s)
- Diego Díaz-Domínguez
  - Department of Computer Science, University of Helsinki, Pietari Kalmin katu 5, 00014, Helsinki, Finland
- Miika Leinonen
  - Department of Computer Science, University of Helsinki, Pietari Kalmin katu 5, 00014, Helsinki, Finland
- Leena Salmela
  - Department of Computer Science, University of Helsinki, Pietari Kalmin katu 5, 00014, Helsinki, Finland
2
Leinonen M, Salmela L. SAKE: Strobemer-assisted k-mer extraction. PLoS One 2023; 18:e0294415. [PMID: 38019768 PMCID: PMC10686461 DOI: 10.1371/journal.pone.0294415] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Accepted: 10/30/2023] [Indexed: 12/01/2023] Open
Abstract
K-mer-based analysis plays an important role in many bioinformatics applications, such as de novo assembly, sequencing error correction, and genotyping. To take full advantage of such methods, the k-mer content of a read set must be captured as accurately as possible. Often the use of long k-mers is preferred because they can be uniquely associated with a specific genomic region. Unfortunately, it is not possible to reliably extract long k-mers in high error rate reads with standard exact k-mer counting methods. We propose SAKE, a method to extract long k-mers from high error rate reads by utilizing strobemers and consensus k-mer generation through partial order alignment. Our experiments show that on simulated data with up to 6% error rate, SAKE can extract 97-mers with over 90% recall. Conversely, the recall of DSK, an exact k-mer counter, drops to less than 20%. Furthermore, the precision of SAKE remains similar to DSK. On real bacterial data, SAKE retrieves 97-mers with a recall of over 90% and slightly lower precision than DSK, while the recall of DSK already drops to 50%. We show that SAKE can extract more k-mers from uncorrected high error rate reads compared to exact k-mer counting. However, exact k-mer counters run on corrected reads can extract slightly more k-mers than SAKE run on uncorrected reads.
Affiliation(s)
- Miika Leinonen
  - Department of Computer Science, University of Helsinki, Helsinki, Finland
- Leena Salmela
  - Department of Computer Science, University of Helsinki, Helsinki, Finland
3
Leinonen M, Salmela L. Extraction of long k-mers using spaced seeds. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; PP:1-1. [PMID: 34529572 DOI: 10.1109/tcbb.2021.3113131] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 06/13/2023]
Abstract
The extraction of k-mers from reads is an important task in many bioinformatics applications, such as all DNA sequence analysis methods based on de Bruijn graphs. These methods tend to be more accurate when the used k-mers are unique in the analyzed DNA, and thus the use of longer k-mers is preferred. When the read lengths of short read sequencing technologies increase, the error rate will become the determining factor for the largest possible value of k. Here we propose LoMeX which uses spaced seeds to extract long k-mers accurately even in the presence of sequencing errors. Our experiments show that LoMeX can extract long k-mers from current Illumina reads with a similar or higher recall than a standard k-mer counting tool. Furthermore, our experiments on simulated data show that when the read length further increases enabling even longer k-mers, the performance of standard k-mer counters declines, whereas LoMeX still extracts long k-mers successfully.
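As a rough illustration of why spaced seeds tolerate sequencing errors, the sketch below projects each k-mer onto the "care" positions of a binary pattern and groups k-mers that share the same projection. The pattern, the function names `project` and `group_by_seed`, and the grouping step are assumptions made for this example; the abstract does not describe how LoMeX itself processes the groups.

```python
from collections import defaultdict


def project(kmer, pattern):
    """Keep only the symbols at positions where the pattern has a '1'."""
    return "".join(c for c, p in zip(kmer, pattern) if p == "1")


def group_by_seed(reads, pattern):
    """Group every k-mer (k = len(pattern)) by its spaced-seed projection."""
    k = len(pattern)
    groups = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            groups[project(kmer, pattern)].append(kmer)
    return groups


# Two reads that differ only at the don't-care position (index 2).
groups = group_by_seed(["ACGTA", "ACCTA"], "11011")
```

Because the mismatch falls on a don't-care position, both k-mers land in the same group, whereas an exact k-mer counter would treat them as unrelated keys.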
4
Kundu S, Ray MD, Sharma A. Interplay between genome organization and epigenomic alterations of pericentromeric DNA in cancer. J Genet Genomics 2021; 48:184-197. [PMID: 33840602 DOI: 10.1016/j.jgg.2021.02.004] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 09/06/2020] [Revised: 02/07/2021] [Accepted: 02/20/2021] [Indexed: 12/16/2022]
Abstract
In eukaryotic genome biology, the genomic organization inside the three-dimensional (3D) nucleus is highly complex, and whether this organization governs gene expression is poorly understood. The nuclear lamina (NL) is a filamentous meshwork of proteins lining the inner nuclear membrane that serves as an anchoring platform for genome organization. Large chromatin domains termed lamina-associated domains (LADs) play a major role in silencing genes at the nuclear periphery. The interaction of the NL and genome is dynamic and stochastic. Furthermore, many genes change their positions during developmental processes or under disease conditions such as cancer, to activate certain genes and/or silence others. Pericentromeric heterochromatin (PCH) is mostly a silenced region of the genome that localizes at the nuclear periphery. Studies show that several genes located at the PCH are aberrantly expressed in cancer. The interesting question is how these genes, despite being localized in the pericentromeric region, still manage to overcome pericentromeric repression. Although epigenetic mechanisms control the expression of the pericentromeric region, recent studies about genome organization and genome-nuclear lamina interaction have shed light on a new aspect of pericentromeric gene regulation through a complex and coordinated interplay between epigenomic remodeling and genomic organization in cancer.
Affiliation(s)
- Subhadip Kundu
  - Laboratory of Chromatin and Cancer Epigenetics, Department of Biochemistry, All India Institute of Medical Sciences, Ansari Nagar, New Delhi 110029, India
- M D Ray
  - Department of Surgical Oncology, IRCH, All India Institute of Medical Sciences, Ansari Nagar, New Delhi 110029, India
- Ashok Sharma
  - Laboratory of Chromatin and Cancer Epigenetics, Department of Biochemistry, All India Institute of Medical Sciences, Ansari Nagar, New Delhi 110029, India
5
Feng C, Dai M, Liu Y, Chen M. Sequence repetitiveness quantification and de novo repeat detection by weighted k-mer coverage. Brief Bioinform 2020; 22:5855256. [PMID: 32591772 DOI: 10.1093/bib/bbaa086] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/23/2020] [Revised: 04/10/2020] [Accepted: 04/22/2020] [Indexed: 11/12/2022] Open
Abstract
DNA repeats are abundant in eukaryotic genomes and have been proved to play a vital role in genome evolution and regulation. A large number of approaches have been proposed to identify various repeats in the genome. Some de novo repeat identification tools can efficiently generate sequence repetitive scores based on k-mer counting for repeat detection. However, we noticed that these tools can still be improved in terms of repetitive score calculation, sensitivity to segmental duplications and detection specificity. Therefore, here, we present a new computational approach named Repeat Locator (RepLoc), which is based on weighted k-mer coverage to quantify the genome sequence repetitiveness and locate the repetitive sequences. According to the repetitiveness map of the human genome generated by RepLoc, we found that there may be relationships between sequence repetitiveness and genome structures. A comprehensive benchmark shows that RepLoc is a more efficient k-mer counting based tool for de novo repeat detection. The RepLoc software is freely available at http://bis.zju.edu.cn/reploc.
Affiliation(s)
- Cong Feng
  - Ming Chen's laboratory in Zhejiang University
- Min Dai
  - Key Laboratory of Genetic Network Biology, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences
- Ming Chen
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University
6
O'Neill RJ. Seq'ing identity and function in a repeat-derived noncoding RNA world. Chromosome Res 2020; 28:111-127. [PMID: 32146545 PMCID: PMC7393779 DOI: 10.1007/s10577-020-09628-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/11/2019] [Revised: 02/07/2020] [Accepted: 02/14/2020] [Indexed: 01/06/2023]
Abstract
Innovations in high-throughput sequencing approaches are being marshaled to both reveal the composition of the abundant and heterogeneous noncoding RNAs that populate cell nuclei and lend insight to the mechanisms by which noncoding RNAs influence chromosome biology and gene expression. This review focuses on some of the recent technological developments that have enabled the isolation of nascent transcripts and chromatin-associated and DNA-interacting RNAs. Coupled with emerging genome assembly and analytical approaches, the field is poised to achieve a comprehensive catalog of nuclear noncoding RNAs, including those derived from repetitive regions within eukaryotic genomes. Herein, particular attention is paid to the challenges and advances in the sequence analyses of repeat and transposable element-derived noncoding RNAs and in ascribing specific function(s) to such RNAs.
Affiliation(s)
- Rachel J O'Neill
  - Institute for Systems Genomics, University of Connecticut, Storrs, CT, 06269, USA
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, 06269, USA
  - Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, CT, 06030, USA
7
Orozco-Arias S, Isaza G, Guyot R. Retrotransposons in Plant Genomes: Structure, Identification, and Classification through Bioinformatics and Machine Learning. Int J Mol Sci 2019; 20:E3837. [PMID: 31390781 PMCID: PMC6696364 DOI: 10.3390/ijms20153837] [Citation(s) in RCA: 47] [Impact Index Per Article: 7.8] [Received: 06/21/2019] [Revised: 07/31/2019] [Accepted: 08/02/2019] [Indexed: 01/26/2023] Open
Abstract
Transposable elements (TEs) are genomic units able to move within the genome of virtually all organisms. Due to their natural repetitive numbers and their high structural diversity, the identification and classification of TEs remain a challenge in sequenced genomes. Although TEs were initially regarded as "junk DNA", it has been demonstrated that they play key roles in chromosome structures, gene expression, and regulation, as well as adaptation and evolution. A highly reliable annotation of these elements is, therefore, crucial to better understand genome functions and their evolution. To date, much bioinformatics software has been developed to address TE detection and classification processes, but many problematic aspects remain, such as the reliability, precision, and speed of the analyses. Machine learning and deep learning are algorithms that can make automatic predictions and decisions in a wide variety of scientific applications. They have been tested in bioinformatics and, more specifically for TEs, classification with encouraging results. In this review, we will discuss important aspects of TEs, such as their structure, importance in the evolution and architecture of the host, and their current classifications and nomenclatures. We will also address current methods and their limitations in identifying and classifying TEs.
Affiliation(s)
- Simon Orozco-Arias
  - Department of Computer Science, Universidad Autónoma de Manizales, Manizales 170001, Colombia
  - Department of Systems and Informatics, Universidad de Caldas, Manizales 170001, Colombia
- Gustavo Isaza
  - Department of Systems and Informatics, Universidad de Caldas, Manizales 170001, Colombia
- Romain Guyot
  - Department of Electronics and Automatization, Universidad Autónoma de Manizales, Manizales 170001, Colombia
  - Institut de Recherche pour le Développement, CIRAD, University Montpellier, 34000 Montpellier, France
8
Manekar SC, Sathe SR. Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art. Curr Genomics 2019; 20:2-15. [PMID: 31015787 PMCID: PMC6446480 DOI: 10.2174/1389202919666181026101326] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 07/23/2018] [Revised: 10/05/2018] [Accepted: 10/24/2018] [Indexed: 12/24/2022] Open
Abstract
Background: In bioinformatics, estimating k-mer abundance histograms, or just enumerating the number of unique k-mers and the number of singletons, is desirable in many genome sequence analysis applications. The applications include predicting genome sizes, data pre-processing for de Bruijn graph assembly methods (tuning runtime parameters for analysis tools), repeat detection, sequencing coverage estimation, measuring sequencing error rates, etc. Different methods for cardinality estimation in sequencing data have been developed in recent years. Objective: In this article, we present a comparative assessment of the different k-mer frequency estimation programs (ntCard, KmerGenie, KmerStream and Khmer (abundance-dist-single.py and unique-kmers.py)) to assess their relative merits and demerits. Methods: Principally, the miscounts/error-rates of these tools are analyzed by rigorous experimental analysis for a varied range of k. We also present experimental results on runtime, scalability for larger datasets, memory, CPU utilization as well as parallelism of k-mer frequency estimation methods. Results: The results indicate that ntCard is more accurate in estimating F0, f1 and full k-mer abundance histograms compared with other methods. ntCard is the fastest but it has more memory requirements compared to KmerGenie. Conclusion: The results of this evaluation may serve as a roadmap to potential users and practitioners of streaming algorithms for estimating k-mer coverage frequencies, to assist them in identifying an appropriate method. Such analysis of results also helps researchers to discover remaining open research questions, effective combinations of existing techniques and possible avenues for future research.
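For readers unfamiliar with the quantities named in this abstract, the following Python sketch computes them exactly on a toy input: F0 (number of distinct k-mers), f1 (number of k-mers seen exactly once) and the full abundance histogram. The function name `kmer_histogram` is hypothetical, and the exact counting used here is only a reference definition; the tools compared in the article estimate these values with streaming algorithms rather than by storing every k-mer.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Return (F0, f1, histogram) computed by exact counting.

    histogram maps an abundance a to the number of distinct k-mers
    that occur exactly a times across all reads.
    """
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    histogram = Counter(counts.values())
    f0 = len(counts)            # distinct k-mers
    f1 = histogram.get(1, 0)    # singletons
    return f0, f1, dict(sorted(histogram.items()))


f0, f1, histogram = kmer_histogram(["ACGTACGT"], 4)
```

In the toy read, `ACGT` occurs twice and the three other 4-mers occur once, so the histogram has three k-mers at abundance 1 and one at abundance 2.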
Affiliation(s)
- Swati C Manekar
  - Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur, India
- Shailesh R Sathe
  - Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur, India
9
Transposable Elements: Classification, Identification, and Their Use As a Tool For Comparative Genomics. Methods Mol Biol 2019; 1910:177-207. [PMID: 31278665 DOI: 10.1007/978-1-4939-9074-0_6] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Indexed: 12/16/2022]
Abstract
Most genomes are populated by hundreds of thousands of sequences originated from mobile elements. On the one hand, these sequences present a real challenge in the process of genome analysis and annotation. On the other hand, they are very interesting biological subjects involved in many cellular processes. Here we present an overview of transposable elements biodiversity, and we discuss different approaches to transposable elements detection and analyses.
10
Manekar SC, Sathe SR. A benchmark study of k-mer counting methods for high-throughput sequencing. Gigascience 2018; 7:5140149. [PMID: 30346548 PMCID: PMC6280066 DOI: 10.1093/gigascience/giy125] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 09/25/2017] [Accepted: 10/16/2018] [Indexed: 11/25/2022] Open
Abstract
The rapid development of high-throughput sequencing technologies means that hundreds of gigabytes of sequencing data can be produced in a single study. Many bioinformatics tools require counts of substrings of length k in DNA/RNA sequencing reads obtained for applications such as genome and transcriptome assembly, error correction, multiple sequence alignment, and repeat detection. Recently, several techniques have been developed to count k-mers in large sequencing datasets, with a trade-off between the time and memory required to perform this function. We assessed several k-mer counting programs and evaluated their relative performance, primarily on the basis of runtime and memory usage. We also considered additional parameters such as disk usage, accuracy, parallelism, the impact of compressed input, performance in terms of counting large k values and the scalability of the application to larger datasets. We make specific recommendations for the setup of a current state-of-the-art program and suggestions for further development.
Affiliation(s)
- Swati C Manekar
  - Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur 440 010, India
- Shailesh R Sathe
  - Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur 440 010, India
11
Touyar N, Schbath S, Cellier D, Dauchel H. Poisson Approximation for the Number of Repeats in a Stationary Markov Chain. J Appl Probab 2016. [DOI: 10.1239/jap/1214950359] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
Detection of repeated sequences within complete genomes is a powerful tool to help understand genome dynamics and species evolutionary history. To distinguish significant repeats from those that can be obtained just by chance, statistical methods have to be developed. In this paper we show that the distribution of the number of long repeats in long sequences generated by stationary Markov chains can be approximated by a Poisson distribution with explicit parameter. Thanks to the Chen-Stein method we provide a bound for the approximation error; this bound converges to 0 as soon as the length n of the sequence tends to ∞ and the length t of the repeats satisfies n^2 ρ^t = O(1) for some 0 < ρ < 1. Using this Poisson approximation, p-values can then be easily calculated to determine if a given genome is significantly enriched in repeats of length t.
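The p-value step mentioned at the end of this abstract amounts to an upper-tail Poisson probability. The sketch below computes P(X >= observed) for X ~ Poisson(lam). The abstract does not state the explicit parameter, so `lam` is simply an input here, and the function name `poisson_tail` is hypothetical.

```python
import math


def poisson_tail(observed, lam):
    """Return P(X >= observed) for X ~ Poisson(lam).

    observed: number of repeats of length t counted in the genome.
    lam: Poisson parameter (assumed to be supplied by the model).
    """
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(observed):
        cdf += term            # add P(X = i)
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


p_value = poisson_tail(10, 1.0)
```

A small p-value indicates that the observed number of repeats is unlikely under the Markov chain model, i.e. that the genome is enriched in repeats of that length. For very small tail probabilities the subtraction from 1 loses precision, so a production version would sum the upper tail directly.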
12
Taillefer E, Miller J. Exhaustive computation of exact duplications via super and non-nested local maximal repeats. J Bioinform Comput Biol 2013; 12:1350018. [PMID: 24467757 DOI: 10.1142/s0219720013500182] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/05/2023]
Abstract
We propose and implement a method to obtain all duplicated sequences (repeats) from a chromosome or whole genome. Unlike existing approaches, our method makes it possible to simultaneously identify and classify repeats into super, local, and non-nested local maximal repeats. Computational verification demonstrates that maximal repeats for a genome of several gigabases can be identified in a reasonable time, enabling us to identify these maximal repeats for any sequenced genome. The algorithm used for the identification relies on an enhanced suffix array data structure to achieve practical space and time efficiency, to identify and classify the maximal repeats, and to perform further post-processing on the identified duplicated sequences. The simplicity and effectiveness of the implementation make the method readily extendible to more sophisticated computations. Maxmers can be exhaustively accounted for in a few minutes for genome sequences of a dozen megabases in length and in less than a day or two for genome sequences of a few gigabases in length. One application of duplicated sequence identification is the study of duplicated sequence length distributions, which are found to exhibit a persistent power-law behavior for large lengths. Variation of the estimated exponents of this power law is studied among different species and successive assembly release versions of the same species. This makes the characterization of the power-law regime of sequenced genomes, via maximal repeat identification and classification, an important task for the derivation of models that would help us to elucidate sequence duplication and genome evolution.
Affiliation(s)
- Eddy Taillefer
  - Physics and Biology Unit, Okinawa Institute of Science and Technology, 1919-1 Tancha, Onna-son, Kunigami-gun 904-0412, Japan
13
Abstract
The availability of a large amount of genomic sequences has provided unique opportunities for understanding the composition and dynamics of transposable elements (TEs) in plants. As the cost of sequencing declines, the genomic sequences of most crop plants will be available within the next few years. Thus, the annotation of genomic sequences, rather than sequence availability, will become the "bottleneck" for genome study. Since TEs are the largest component of most plant genomes, the automation of TE identification and classification is essential for future genome annotation as well as characterization of TEs. In this chapter, the functions and mechanisms of different repeat finding tools are reviewed, with a focus on de novo repeat identification programs. In addition, this chapter covers the further processing of results from de novo identification programs and the construction of repeat libraries for downstream genome analyses.
Affiliation(s)
- Ning Jiang
  - Department of Horticulture, Michigan State University, East Lansing, MI, USA
14
To detect and analyze sequence repeats whatever be their origin. Methods Mol Biol 2012; 859:69-90. [PMID: 22367866 DOI: 10.1007/978-1-61779-603-6_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/11/2023]
Abstract
The development of numerous programs for the identification of mobile elements raises the issue of the founding concepts that are shared in their design. This is necessary for at least three reasons. First, the cost of designing, developing, debugging, and maintaining software could present a danger of distracting biologists from their main bioanalysis tasks that require a lot of energy. Some key concepts on exact repeats are always underlying the search for genomic repeats and we recall the most important ones. All along the chapter, we try to select practical tools that may help the design of new identification pipelines. Second, the huge increase of sequence production capacities requires to use the most efficient data structures and algorithms to scale up tools in front of the data deluge. This paper provides an up-to-date glimpse on the art of string indexing and string matching. Third, there exists a growing knowledge on the architecture of mobile elements built from literature and the analysis of results generated by these pipelines. Besides data management which has led to the discovery of new families or new elements of a family, the community has an increasing need in knowledge management tools in order to compare, validate, or simply keep trace of mobile element models. We end the paper with first considerations on what could help the near future of such research on models.
15
Janicki M, Rooke R, Yang G. Bioinformatics and genomic analysis of transposable elements in eukaryotic genomes. Chromosome Res 2012; 19:787-808. [PMID: 21850457 DOI: 10.1007/s10577-011-9230-7] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Indexed: 12/18/2022]
Abstract
A major portion of most eukaryotic genomes are transposable elements (TEs). During evolution, TEs have introduced profound changes to genome size, structure, and function. As integral parts of genomes, the dynamic presence of TEs will continue to be a major force in reshaping genomes. Early computational analyses of TEs in genome sequences focused on filtering out "junk" sequences to facilitate gene annotation. When the high abundance and diversity of TEs in eukaryotic genomes were recognized, these early efforts transformed into the systematic genome-wide categorization and classification of TEs. The availability of genomic sequence data reversed the classical genetic approaches to discovering new TE families and superfamilies. Curated TE databases and their accurate annotation of genome sequences in turn facilitated the studies on TEs in a number of frontiers including: (1) TE-mediated changes of genome size and structure, (2) the influence of TEs on genome and gene functions, (3) TE regulation by host, (4) the evolution of TEs and their population dynamics, and (5) genomic scale studies of TE activity. Bioinformatics and genomic approaches have become an integral part of large-scale studies on TEs to extract information with pure in silico analyses or to assist wet lab experimental studies. The current revolution in genome sequencing technology facilitates further progress in the existing frontiers of research and emergence of new initiatives. The rapid generation of large-sequence datasets at record low costs on a routine basis is challenging the computing industry on storage capacity and manipulation speed and the bioinformatics community for improvement in algorithms and their implementations.
Affiliation(s)
- Mateusz Janicki
  - Department of Biology, University of Toronto at Mississauga, 3359 Mississauga Road, Mississauga, ON L5L1C6, Canada
16
Abstract
Most genomes are populated by thousands of sequences that originated from mobile elements. On the one hand, these sequences present a real challenge in the process of genome analysis and annotation. On the other hand, they are very interesting biological subjects involved in many cellular processes. Here, we present an overview of transposable element (TE) biodiversity and its impact on genomic evolution. Finally, we discuss different approaches to TE detection and analysis.
17
Külekci MO, Vitter JS, Xu B. Efficient maximal repeat finding using the Burrows-Wheeler transform and wavelet tree. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011; 9:421-429. [PMID: 21968959 DOI: 10.1109/tcbb.2011.127] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 05/31/2023]
Abstract
Finding repetitive structures in genomes and proteins is important to understand their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Their space usage is 19--50 times the text size with the best engineering efforts, prohibiting their usability on massive data. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage is less than double the sequence size. Our method is also orders of magnitude faster than the prior methods for processing massive texts, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8GB internal memory to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.
18
Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 2011; 27:764-70. [PMID: 21217122 DOI: 10.1093/bioinformatics/btr011] [Citation(s) in RCA: 2744] [Impact Index Per Article: 196.0] [Indexed: 11/13/2022]
Abstract
MOTIVATION Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm. RESULTS We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution. AVAILABILITY The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.
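The limit of 31 bases mentioned in this abstract is consistent with packing each base into 2 bits so that a whole k-mer fits in one 64-bit word. The sketch below illustrates that encoding with a rolling update and a plain `Counter` as the table. It is a single-threaded illustration only: it is not Jellyfish's lock-free hash table, and the names `count_kmers` and `unpack` are hypothetical.

```python
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(seq, k):
    """Count k-mers of seq, keyed by their 2-bit packed integer value."""
    mask = (1 << (2 * k)) - 1  # keeps only the last k bases
    counts = Counter()
    value = 0   # packed value of the most recent bases
    valid = 0   # number of consecutive unambiguous bases seen
    for ch in seq:
        code = CODE.get(ch)
        if code is None:       # e.g. 'N': restart the window
            value, valid = 0, 0
            continue
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            counts[value] += 1
    return counts


def unpack(value, k):
    """Turn a packed integer back into its k-mer string."""
    return "".join("ACGT"[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))


counts = count_kmers("ACGTACGT", 4)
table = {unpack(v, 4): c for v, c in counts.items()}
```

Updating the packed value costs a shift, an OR and a mask per base, so each k-mer is produced in constant time rather than by re-reading k symbols.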
Affiliation(s)
- Guillaume Marçais
- Department of Computer Science, University of Maryland, College Park, MD 20742, USA.
|
19
|
Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity (Edinb) 2009; 104:520-33. [PMID: 19935826 DOI: 10.1038/hdy.2009.165] [Citation(s) in RCA: 143] [Impact Index Per Article: 8.9] [Indexed: 12/25/2022] Open
Abstract
The production of genome sequences has led to another important advance in their annotation, which is closely linked to the exact determination of their content in terms of repeats, among which are transposable elements (TEs). The evolutionary implications and the presence of coding regions in some TEs can confuse gene annotation, and also hinder the process of genome assembly, making it particularly crucial to be able to annotate and classify them correctly in genome sequences. This review is intended to provide an overview as comprehensive as possible of the automated methods currently used to annotate and classify TEs in sequenced genomes. Different categories of programs exist according to their methodology and the repeats they can identify. I describe here the main characteristics of the programs, their main goals and the difficulties they can entail. The drawbacks of the different methods are also highlighted to help biologists who are unfamiliar with algorithmic methods to understand this methodology better. Globally, using several different programs and carrying out a cross comparison of their results has a better chance of finding reliable results than any single program. However, this makes it essential to verify the results provided by each program independently. The ideal solution would be to test all programs against the same data set to obtain a true comparison of their actual performance.
|
20
|
Becher V, Deymonnaz A, Heiber P. Efficient computation of all perfect repeats in genomic sequences of up to half a gigabyte, with a case study on the human genome. Bioinformatics 2009; 25:1746-53. [DOI: 10.1093/bioinformatics/btp321] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Indexed: 01/23/2023] Open
|
21
|
Scheibye-Alsing K, Hoffmann S, Frankel A, Jensen P, Stadler PF, Mang Y, Tommerup N, Gilchrist MJ, Nygård AB, Cirera S, Jørgensen CB, Fredholm M, Gorodkin J. Sequence assembly. Comput Biol Chem 2008; 33:121-36. [PMID: 19152793 DOI: 10.1016/j.compbiolchem.2008.11.003] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.1] [Received: 10/24/2008] [Revised: 11/28/2008] [Accepted: 11/28/2008] [Indexed: 01/20/2023]
Abstract
Despite the rapidly increasing number of sequenced and re-sequenced genomes, many issues regarding the computational assembly of large-scale sequencing data remain unresolved. Computational assembly is crucial in large genome projects as well as for the evolving high-throughput technologies and plays an important role in processing the information generated by these methods. Here, we provide a comprehensive overview of the current publicly available sequence assembly programs. We describe the basic principles of computational assembly along with the main concerns, such as repetitive sequences in genomic DNA, highly expressed genes and alternative transcripts in EST sequences. We summarize existing comparisons of different assemblers and provide detailed descriptions and directions for download of assembly programs at: http://genome.ku.dk/resources/assembly/methods.html.
Affiliation(s)
- K Scheibye-Alsing
- Division of Genetics and Bioinformatics, IBHV, University of Copenhagen, Grønnegårdsvej 3, 1870 Frederiksberg C, Denmark
|
22
|
Kurtz S, Narechania A, Stein JC, Ware D. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics 2008; 9:517. [PMID: 18976482 PMCID: PMC2613927 DOI: 10.1186/1471-2164-9-517] [Citation(s) in RCA: 177] [Impact Index Per Article: 10.4] [Received: 07/21/2008] [Accepted: 10/31/2008] [Indexed: 12/02/2022] Open
Abstract
Background The challenges of accurate gene prediction and enumeration are further aggravated in large genomes that contain highly repetitive transposable elements (TEs). Yet TEs play a substantial role in genome evolution and are themselves an important subject of study. Repeat annotation, based on counting occurrences of k-mers, has been previously used to distinguish TEs from low-copy genic regions; but currently available software solutions are impractical due to high memory requirements or specialization for specific user-tasks. Results Here we introduce the Tallymer software, a flexible and memory-efficient collection of programs for k-mer counting and indexing of large sequence sets. Unlike previous methods, Tallymer is based on enhanced suffix arrays. This gives a much larger flexibility concerning the choice of the k-mer size. Tallymer can process large data sizes of several billion bases. We used it in a variety of applications to study the genomes of maize and other plant species. In particular, Tallymer was used to index a set of whole genome shotgun sequences from maize (B73) (total size 10^9 bp). We analyzed k-mer frequencies for a wide range of k. At this low genome coverage (≈ 0.45×) highly repetitive 20-mers constituted 44% of the genome but represented only 1% of all possible k-mers. Similar low-complexity was seen in the repeat fractions of sorghum and rice. When applying our method to other maize data sets, High-C0t derived sequences showed the greatest enrichment for low-copy sequences. Among annotated TEs, the most highly repetitive were of the Ty3/gypsy class of retrotransposons, followed by the Ty1/copia class, and DNA transposons. Among expressed sequence tags (EST), a notable fraction contained high-copy k-mers, suggesting that transposons are still active in maize. Retrotransposons in Mo17 and McC cultivars were readily detected using the B73 20-mer frequency index, indicating their conservation despite extensive rearrangement across cultivars. Among one hundred annotated bacterial artificial chromosomes (BACs), k-mer frequency could be used to detect transposon-encoded genes with 92% sensitivity, compared to 96% using alignment-based repeat masking, while both methods showed 92% specificity. Conclusion The Tallymer software was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available. For more information on the software, see .
Affiliation(s)
- Stefan Kurtz
- Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146 Hamburg, Germany.
|
23
|
Poisson Approximation for the Number of Repeats in a Stationary Markov Chain. J Appl Probab 2008. [DOI: 10.1017/s0021900200004344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Abstract
Detection of repeated sequences within complete genomes is a powerful tool to help understand genome dynamics and species evolutionary history. To distinguish significant repeats from those that can be obtained just by chance, statistical methods have to be developed. In this paper we show that the distribution of the number of long repeats in long sequences generated by stationary Markov chains can be approximated by a Poisson distribution with explicit parameter. Thanks to the Chen-Stein method we provide a bound for the approximation error; this bound converges to 0 as soon as the length n of the sequence tends to ∞ and the length t of the repeats satisfies n^2 ρ^t = O(1) for some 0 < ρ < 1. Using this Poisson approximation, p-values can then be easily calculated to determine if a given genome is significantly enriched in repeats of length t.
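The p-value calculation mentioned at the end of this abstract can be sketched as an upper-tail probability of a Poisson distribution. The sketch below takes the Poisson parameter as an input and does not reproduce the paper's explicit formula for it; the function name `poisson_upper_tail` and the numbers in the usage line are hypothetical.

```python
import math


def poisson_upper_tail(observed: int, lam: float) -> float:
    """Return P(X >= observed) for X ~ Poisson(lam)."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    if observed <= 0:
        return 1.0
    # Accumulate P(X = i) for i = 0 .. observed - 1, then take the complement.
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, observed):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


# Hypothetical example: 12 repeats observed where 4.0 are expected by chance.
p_value = poisson_upper_tail(12, 4.0)
print(p_value)
```

A small p-value here would suggest more repeats of the given length than the chance model predicts, subject to the approximation conditions stated in the abstract.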
|
24
|
Kleffe J, Möller F, Wittig B. Simultaneous identification of long similar substrings in large sets of sequences. BMC Bioinformatics 2007; 8 Suppl 5:S7. [PMID: 17570866 PMCID: PMC1892095 DOI: 10.1186/1471-2105-8-s5-s7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/23/2022] Open
Abstract
Background Sequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known. Gene annotation pipelines match these sequences in order to identify genes and their alternative splice forms. However, the software currently available cannot simultaneously compare sets of sequences as large as necessary especially if errors must be considered. Results We therefore present a new algorithm for the identification of almost perfectly matching substrings in very large sets of sequences. Its implementation, called ClustDB, is considerably faster and can handle 16 times more data than VMATCH, the most memory efficient exact program known today. ClustDB simultaneously generates large sets of exactly matching substrings of a given minimum length as seeds for a novel method of match extension with errors. It generates alignments of maximum length with a considered maximum number of errors within each overlapping window of a given size. Such alignments are not optimal in the usual sense but faster to calculate and often more appropriate than traditional alignments for genomic sequence comparisons, EST and full-length cDNA matching, and genomic sequence assembly. The method is used to check the overlaps and to reveal possible assembly errors for 1377 Medicago truncatula BAC-size sequences published at . Conclusion The program ClustDB proves that window alignment is an efficient way to find long sequence sections of homogenous alignment quality, as expected in case of random errors, and to detect systematic errors resulting from sequence contaminations. Such inserts are systematically overlooked in long alignments controlled by only tuning penalties for mismatches and gaps. ClustDB is freely available for academic use.
Affiliation(s)
- Jürgen Kleffe
- Institut für Molekularbiologie und Bioinformatik, Charite-Universitätsmedizin Berlin, Arnimallee 22, 14195 Berlin, Germany
- Friedrich Möller
- Institut für Molekularbiologie und Bioinformatik, Charite-Universitätsmedizin Berlin, Arnimallee 22, 14195 Berlin, Germany
- Burghardt Wittig
- Institut für Molekularbiologie und Bioinformatik, Charite-Universitätsmedizin Berlin, Arnimallee 22, 14195 Berlin, Germany
|
25
|
Zhang S, Xiao Y. Quasiperiodic property in Alu repeats. Phys Rev E Stat Nonlin Soft Matter Phys 2006; 74:022901. [PMID: 17025492 DOI: 10.1103/physreve.74.022901] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/21/2005] [Revised: 04/06/2006] [Indexed: 05/12/2023]
Abstract
We investigate the possible quasiperiodic property in the sequences of Alu repeats, one of the typical noncoding DNA sequences. We calculated the quasiperiods of the right and left monomers of Alu repeats of different families with the quasiperiodic matrix algorithm. It is interesting that the right monomers of all families show a significant quasiperiod of 8 in their sequences, while the left monomers show quasiperiods of 8 or 5. Our results indicate that there exist common quasiperiods in most Alu repeats. This may be helpful to further explore possible functions of Alu repeats.
Affiliation(s)
- Shihua Zhang
- Biomolecular Physics and Modeling Group, Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China
|
26
|
Morgulis A, Gertz EM, Schäffer AA, Agarwala R. WindowMasker: window-based masker for sequenced genomes. Bioinformatics 2005; 22:134-41. [PMID: 16287941 DOI: 10.1093/bioinformatics/bti774] [Citation(s) in RCA: 202] [Impact Index Per Article: 10.1] [Indexed: 11/15/2022] Open
Abstract
MOTIVATION Matches to repetitive sequences are usually undesirable in the output of DNA database searches. Repetitive sequences need not be matched to a query, if they can be masked in the database. RepeatMasker/Maskeraid (RM), currently the most widely used software for DNA sequence masking, is slow and requires a library of repetitive template sequences, such as a manually curated RepBase library, that may not exist for newly sequenced genomes. RESULTS We have developed a software tool called WindowMasker (WM) that identifies and masks highly repetitive DNA sequences in a genome, using only the sequence of the genome itself. WM is orders of magnitude faster than RM because WM uses a few linear-time scans of the genome sequence, rather than local alignment methods that compare each library sequence with each piece of the genome. We validate WM by comparing BLAST outputs from large sets of queries applied to two versions of the same genome, one masked by WM, and the other masked by RM. Even for genomes such as the human genome, where a good RepBase library is available, searching the database as masked with WM yields more matches that are apparently non-repetitive and fewer matches to repetitive sequences. We show that these results hold for transcribed regions as well. WM also performs well on genomes for which much of the sequence was in draft form at the time of the analysis. AVAILABILITY WM is included in the NCBI C++ toolkit. The source code for the entire toolkit is available at ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/. Once the toolkit source is unpacked, the instructions for building WindowMasker application in the UNIX environment can be found in file src/app/winmasker/README.build. SUPPLEMENTARY INFORMATION Supplementary data are available at ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/windowmasker_suppl.pdf
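The idea of masking repeats using only the genome itself can be illustrated with a deliberately simplified sketch: one scan counts k-mers, and a second scan soft-masks (lowercases) every base covered by a frequent k-mer. This is not WindowMasker's actual window scoring or threshold selection; the function name `mask_repeats` and the parameters `k` and `min_count` are assumptions made for this example.

```python
from collections import Counter


def mask_repeats(genome: str, k: int, min_count: int) -> str:
    """Lowercase every base covered by a k-mer seen at least `min_count` times."""
    genome = genome.upper()
    n_kmers = len(genome) - k + 1
    if k <= 0 or n_kmers <= 0:
        return genome
    # Scan 1: count all k-mers of the genome.
    counts = Counter(genome[i:i + k] for i in range(n_kmers))
    # Scan 2: flag the positions covered by high-count k-mers.
    masked = [False] * len(genome)
    for i in range(n_kmers):
        if counts[genome[i:i + k]] >= min_count:
            for j in range(i, i + k):
                masked[j] = True
    return "".join(b.lower() if m else b for b, m in zip(genome, masked))


print(mask_repeats("ACGTTTTTTTGCA", 3, 3))  # prints ACGtttttttGCA
```

As written, the second scan costs time proportional to the genome length times k; a production tool would avoid re-marking overlapping positions.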
Affiliation(s)
- Aleksandr Morgulis
- National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services Building 38A, Room 1003N, 8600 Rockville Pike, Bethesda, MD 20894, USA
|
27
|
Darling ACE, Mau B, Blattner FR, Perna NT. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 2004; 14:1394-403. [PMID: 15231754 PMCID: PMC442156 DOI: 10.1101/gr.2289704] [Citation(s) in RCA: 3508] [Impact Index Per Article: 167.0] [Indexed: 11/25/2022]
Abstract
As genomes evolve, they undergo large-scale evolutionary processes that present a challenge to sequence comparison not posed by short sequences. Recombination causes frequent genome rearrangements, horizontal transfer introduces new sequences into bacterial chromosomes, and deletions remove segments of the genome. Consequently, each genome is a mosaic of unique lineage-specific segments, regions shared with a subset of other genomes and segments conserved among all the genomes under consideration. Furthermore, the linear order of these segments may be shuffled among genomes. We present methods for identification and alignment of conserved genomic DNA in the presence of rearrangements and horizontal transfer. Our methods have been implemented in a software package called Mauve. Mauve has been applied to align nine enterobacterial genomes and to determine global rearrangement structure in three mammalian genomes. We have evaluated the quality of Mauve alignments and drawn comparison to other methods through extensive simulations of genome evolution.
Affiliation(s)
- Aaron C E Darling
- Department of Computer Science, University of Wisconsin-Madison, Madison, Wisconsin 53706, USA
|
28
|
Mizuta S, Munakata H, Aimaiti A, Oosawa K, Shimizu T. Evaluation of the color-coding method for searching tandem repeats in prokaryotic genomes. Chem-Bio Informatics Journal 2004. [DOI: 10.1273/cbij.4.133] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 12/27/2022]
Affiliation(s)
- Satoshi Mizuta
- Department of Electronic and Information System Engineering, Faculty of Science and Technology, Hirosaki University
- Hikaru Munakata
- Department of Electronic and Information System Engineering, Faculty of Science and Technology, Hirosaki University
- Abulimiti Aimaiti
- Department of Electronic and Information System Engineering, Faculty of Science and Technology, Hirosaki University
- Kenji Oosawa
- Department of Nano-Material Systems, Graduate School of Engineering, Gunma University
- Toshio Shimizu
- Department of Electronic and Information System Engineering, Faculty of Science and Technology, Hirosaki University
|
29
|
Cannon SB, Kozik A, Chan B, Michelmore R, Young ND. DiagHunter and GenoPix2D: programs for genomic comparisons, large-scale homology discovery and visualization. Genome Biol 2003; 4:R68. [PMID: 14519203 PMCID: PMC328457 DOI: 10.1186/gb-2003-4-10-r68] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.7] [Received: 05/01/2003] [Revised: 06/19/2003] [Accepted: 08/08/2003] [Indexed: 11/10/2022] Open
Abstract
The DiagHunter and GenoPix2D applications work together to enable genomic comparisons and exploration at both genome-wide and single-gene scales. DiagHunter identifies homologous regions (synteny blocks) within or between genomes. DiagHunter works efficiently with diverse, large datasets to predict extended and interrupted synteny blocks and to generate graphical and text output quickly. GenoPix2D allows interactive display of synteny blocks and other genomic features, as well as querying by annotation and by sequence similarity.
Affiliation(s)
- Steven B Cannon
- Plant Biology Department, University of Minnesota, St Paul, MN 55108, USA.
|
30
|
Cannon SB, Young ND. OrthoParaMap: distinguishing orthologs from paralogs by integrating comparative genome data and gene phylogenies. BMC Bioinformatics 2003; 4:35. [PMID: 12952558 PMCID: PMC200972 DOI: 10.1186/1471-2105-4-35] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.7] [Received: 06/27/2003] [Accepted: 09/02/2003] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In eukaryotic genomes, most genes are members of gene families. When comparing genes from two species, therefore, most genes in one species will be homologous to multiple genes in the second. This often makes it difficult to distinguish orthologs (separated through speciation) from paralogs (separated by other types of gene duplication). Combining phylogenetic relationships and genomic position in both genomes helps to distinguish between these scenarios. This kind of comparison can also help to describe how gene families have evolved within a single genome that has undergone polyploidy or other large-scale duplications, as in the case of Arabidopsis thaliana - and probably most plant genomes. RESULTS We describe a suite of programs called OrthoParaMap (OPM) that makes genomic comparisons, identifies syntenic regions, determines whether sets of genes in a gene family are related through speciation or internal chromosomal duplications, maps this information onto phylogenetic trees, and infers internal nodes within the phylogenetic tree that may represent local - as opposed to speciation or segmental - duplication. We describe the application of the software using three examples: the melanoma-associated antigen (MAGE) gene family on the X chromosomes of mouse and human; the 20S proteasome subunit gene family in Arabidopsis, and the major latex protein gene family in Arabidopsis. CONCLUSION OPM combines comparative genomic positional information and phylogenetic reconstructions to identify which gene duplications are likely to have arisen through internal genomic duplications (such as polyploidy), through speciation, or through local duplications (such as unequal crossing-over). The software is freely available at http://www.tc.umn.edu/~cann0010/.
Affiliation(s)
- Steven B Cannon
- Plant Biology Department, University of Minnesota, St. Paul, MN 55108, USA
- Nevin D Young
- Plant Biology Department, University of Minnesota, St. Paul, MN 55108, USA
- Plant Pathology Department, University of Minnesota, St. Paul, MN 55108, USA
|
31
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2003. [PMCID: PMC2447368 DOI: 10.1002/cfg.229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/04/2022] Open
|