1
Shrestha AMS, B Guiao JE, R Santiago KC. Assembly-free rapid differential gene expression analysis in non-model organisms using DNA-protein alignment. BMC Genomics 2022; 23:97. [PMID: 35120462 PMCID: PMC8815227 DOI: 10.1186/s12864-021-08278-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2021] [Accepted: 12/22/2021] [Indexed: 11/16/2022]
Abstract
BACKGROUND RNA-seq is being increasingly adopted for gene expression studies in a panoply of non-model organisms, with applications spanning the fields of agriculture, aquaculture, ecology, and environment. For organisms that lack a well-annotated reference genome or transcriptome, a conventional RNA-seq data analysis workflow requires constructing a de-novo transcriptome assembly and annotating it against a high-confidence protein database. The assembly serves as a reference for read mapping, and the annotation is necessary for functional analysis of genes found to be differentially expressed. However, assembly is computationally expensive. It is also prone to errors that impact expression analysis, especially since sequencing depth is typically much lower for expression studies than for transcript discovery. RESULTS We propose a shortcut, in which we obtain counts for differential expression analysis by directly aligning RNA-seq reads to the high-confidence proteome that would have been otherwise used for annotation. By avoiding assembly, we drastically cut down computational costs - the running time on a typical dataset improves from the order of tens of hours to under half an hour, and the memory requirement is reduced from the order of tens of Gbytes to tens of Mbytes. We show through experiments on simulated and real data that our pipeline not only reduces computational costs, but has higher sensitivity and precision than a typical assembly-based pipeline. A Snakemake implementation of our workflow is available at: https://bitbucket.org/project_samar/samar . CONCLUSIONS The flip side of RNA-seq becoming accessible to even modestly resourced labs has been that the time, labor, and infrastructure cost of bioinformatics analysis has become a bottleneck. Assembly is one such resource-hungry process, and we show here that it can be avoided for quick and easy, yet more sensitive and precise, differential gene expression analysis in non-model organisms.
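The counting step described in the abstract can be illustrated with a minimal sketch. It assumes alignments are already available as `(read_id, protein_id, score)` tuples and that each read is assigned to its single best-scoring protein. The function name and the tie-handling rule are hypothetical, and this is not the authors' SAMAR implementation, which may treat multi-mapping reads differently.

```python
from collections import Counter


def count_reads_per_protein(alignments):
    """Count reads per reference protein from DNA-protein alignments.

    `alignments` is an iterable of (read_id, protein_id, score) tuples
    (an assumed, simplified input format). Each read is assigned to the
    protein with its highest alignment score; on ties, the first
    alignment seen wins.
    """
    best = {}  # read_id -> (protein_id, score)
    for read_id, protein_id, score in alignments:
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (protein_id, score)
    # Tally how many reads were assigned to each protein.
    return Counter(protein_id for protein_id, _ in best.values())


example = [
    ("r1", "P1", 50),
    ("r1", "P2", 60),  # r1 aligns better to P2
    ("r2", "P2", 40),
    ("r3", "P1", 30),
]
counts = count_reads_per_protein(example)
```

A per-sample table of such counts is the kind of input that count-based differential expression tools expect.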
Affiliation(s)
- Anish M S Shrestha
  - Bioinformatics Lab, Advanced Research Institute for Informatics, Computing, and Networking (AdRIC), De La Salle University, Manila, Philippines
  - Department of Software Technology, College of Computer Studies, De La Salle University, Manila, Philippines
- Joyce Emlyn B Guiao
  - Bioinformatics Lab, Advanced Research Institute for Informatics, Computing, and Networking (AdRIC), De La Salle University, Manila, Philippines
  - Department of Mathematics and Statistics, College of Science, De La Salle University, Manila, Philippines
- Kyle Christian R Santiago
  - Bioinformatics Lab, Advanced Research Institute for Informatics, Computing, and Networking (AdRIC), De La Salle University, Manila, Philippines
  - Department of Software Technology, College of Computer Studies, De La Salle University, Manila, Philippines
2
Huang T, Li J, Jia B, Sang H. CNV-MEANN: A Neural Network and Mind Evolutionary Algorithm-Based Detection of Copy Number Variations From Next-Generation Sequencing Data. Front Genet 2021; 12:700874. [PMID: 34484298 PMCID: PMC8415314 DOI: 10.3389/fgene.2021.700874] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/27/2021] [Accepted: 07/19/2021] [Indexed: 11/20/2022]
Abstract
Copy number variation (CNV) is defined as repetitions or deletions of genomic segments of 1 Kb to 5 Mb, and is a major trigger for human disease. The high-throughput and low-cost characteristics of next-generation sequencing (NGS) technology make it possible to detect CNVs across the whole genome, and also greatly improve the clinical practicability of NGS testing. However, current methods for the detection of CNVs are easily affected by sequencing and mapping errors, and by the uneven distribution of reads. In this paper, we propose an improved approach, CNV-MEANN, for the detection of CNVs, which changes the structure of the neural network used in the MFCNV method. This method has three differences relative to the MFCNV method: (1) it utilizes a new feature, mapping quality, to replace two features in MFCNV, (2) it considers the influence of the loss categories of CNV on disease prediction, and refines the output structure, and (3) it uses a mind evolutionary algorithm to optimize the backpropagation neural network model, and calculates individual scores for each genome bin to predict CNVs. Using both simulated and real datasets, we tested the performance of CNV-MEANN and compared it with that of seven widely used CNV detection methods. Experimental results demonstrated that the CNV-MEANN approach outperformed other methods with respect to sensitivity, precision, and F1-score. The proposed method was able to detect many CNVs that other approaches could not, and it reduced the boundary bias. CNV-MEANN is expected to be an effective method for the analysis of changes in CNVs in the genome.
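For readers unfamiliar with the three evaluation metrics named in the abstract, the sketch below computes them from counts of true positives, false positives, and false negatives. These are the standard textbook definitions; the function name is hypothetical, and how the authors match predicted CNVs to true CNVs is not specified here.

```python
def precision_sensitivity_f1(tp, fp, fn):
    """Return (precision, sensitivity, F1) from confusion counts.

    tp: predicted CNVs that are real; fp: predicted CNVs that are not;
    fn: real CNVs that were missed. Zero denominators yield 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + sensitivity
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return precision, sensitivity, f1


p, s, f1 = precision_sensitivity_f1(tp=8, fp=2, fn=2)
```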
Affiliation(s)
- Tihao Huang
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, China
- Junqing Li
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, China
- Baoxian Jia
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, China
- Hongyan Sang
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, China
3
Kuo TCY, Hatakeyama M, Tameshige T, Shimizu KK, Sese J. Homeolog expression quantification methods for allopolyploids. Brief Bioinform 2021; 21:395-407. [PMID: 30590436 PMCID: PMC7299288 DOI: 10.1093/bib/bby121] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 09/24/2018] [Revised: 11/06/2018] [Accepted: 11/21/2018] [Indexed: 12/19/2022]
Abstract
Genome duplication with hybridization, or allopolyploidization, occurs in animals, fungi and plants, and is especially common in crop plants. There is an increasing interest in the study of allopolyploids because of advances in polyploid genome assembly; however, the high level of sequence similarity in duplicated gene copies (homeologs) poses many challenges. Here we compared standard RNA-seq expression quantification approaches currently used for diploid species against subgenome-classification approaches, which map reads to each subgenome separately. We examined mapping error using our previous and new RNA-seq data in which a subgenome is experimentally added (synthetic allotetraploid Arabidopsis kamchatica) or reduced (allohexaploid wheat Triticum aestivum versus extracted allotetraploid) as ground truth. The error rates in the two species were very similar. The standard approaches showed higher error rates (>10% using pseudo-alignment with Kallisto) while subgenome-classification approaches showed much lower error rates (<1% using EAGLE-RC, <2% using HomeoRoq). Although downstream analysis may partly mitigate mapping errors, the difference in methods was substantial in hexaploid wheat, where Kallisto appeared to have systematic differences relative to other methods. Only approximately half of the differentially expressed homeologs detected using Kallisto overlapped with those detected by any other method in wheat. In general, disagreement in low-expression genes was responsible for most of the discordance between methods, which is consistent with known biases in Kallisto. We also observed that there are uncertainties in genome sequences and annotation that can affect each method differently. Overall, subgenome-classification approaches tend to perform better than standard approaches, with EAGLE-RC having the highest precision.
Affiliation(s)
- Tony C Y Kuo
  - Artificial Intelligence Research Center, AIST, 2-3-26 Aomi, Koto-ku, Tokyo 135-0064, Japan
  - AIST-Tokyo Tech RWBC-OIL, 2-12-1 Okayama, Meguro-ku, Tokyo 152-8550, Japan
- Masaomi Hatakeyama
  - Department of Evolutionary Biology and Environmental Studies, University of Zurich, Winterthurerstrasse 190, Zurich CH-8057, Switzerland
  - Functional Genomics Center Zurich, Winterthurerstrasse 190, Zurich CH-8057, Switzerland
  - Swiss Institute of Bioinformatics, Quartier Sorge - Batiment Genopode, Lausanne 1015, Switzerland
- Toshiaki Tameshige
  - Kihara Institute for Biological Research, Yokohama City University, 641-12, Maioka, Totsuka-ku, Yokohama 244-0813, Japan
- Kentaro K Shimizu
  - Department of Evolutionary Biology and Environmental Studies, University of Zurich, Winterthurerstrasse 190, Zurich CH-8057, Switzerland
  - Kihara Institute for Biological Research, Yokohama City University, 641-12, Maioka, Totsuka-ku, Yokohama 244-0813, Japan
- Jun Sese
  - Artificial Intelligence Research Center, AIST, 2-3-26 Aomi, Koto-ku, Tokyo 135-0064, Japan
  - AIST-Tokyo Tech RWBC-OIL, 2-12-1 Okayama, Meguro-ku, Tokyo 152-8550, Japan
4
Corbett-Detig RB, Russell SL, Nielsen R, Losos J. Phenotypic Convergence Is Not Mirrored at the Protein Level in a Lizard Adaptive Radiation. Mol Biol Evol 2021; 37:1604-1614. [PMID: 32027369 DOI: 10.1093/molbev/msaa028] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Indexed: 12/28/2022]
Abstract
There are many compelling examples of molecular convergence at individual genes. However, the prevalence and the relative importance of adaptive genome-wide convergence remain largely unknown. Many recent works have reported striking examples of excess genome-wide convergence, but some of these studies have been called into question because of the use of inappropriate null models. Here, we sequenced and compared the genomes of 12 species of anole lizards that have independently converged on suites of adaptive behavioral and morphological traits. Despite extensive searches for a genome-wide signature of molecular convergence, we found no evidence supporting molecular convergence at specific amino acids either at individual genes or at genome-wide comparisons; we also uncovered no evidence supporting an excess of adaptive convergence in the rates of amino acid substitutions within genes. Our findings indicate that comprehensive phenotypic convergence is not mirrored at genome-wide protein-coding levels in anoles, and therefore, that adaptive phenotypic convergence is likely not constrained by the evolution of many specific protein sequences or structures.
Affiliation(s)
- Russell B Corbett-Detig
  - Department of Integrative Biology, University of California, Berkeley, Berkeley, CA
  - Department of Biomolecular Engineering and Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA
- Shelbi L Russell
  - Department of Molecular, Cellular and Developmental Biology, University of California, Santa Cruz, Santa Cruz, CA
- Rasmus Nielsen
  - Department of Integrative Biology, University of California, Berkeley, Berkeley, CA
  - Centre for GeoGenetics, Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Jonathan Losos
  - Department of Biology and Living Earth Collaborative, Washington University, Saint Louis, MO
5
Johansson T, Yohannes DA, Koskela S, Partanen J, Saavalainen P. HLA RNA Sequencing With Unique Molecular Identifiers Reveals High Allele-Specific Variability in mRNA Expression. Front Immunol 2021; 12:629059. [PMID: 33717155 PMCID: PMC7949471 DOI: 10.3389/fimmu.2021.629059] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 11/13/2020] [Accepted: 01/18/2021] [Indexed: 11/13/2022]
Abstract
The HLA gene complex is the most important single genetic factor in susceptibility to most diseases with autoimmune or autoinflammatory origin and in transplantation matching. Most studies have focused on the vast allelic variation in these genes; only a few studies have explored differences in the expression levels of HLA alleles. In this study, we quantified mRNA expression levels of HLA class I and II genes from peripheral blood samples of 50 healthy individuals. The gene- and allele-specific mRNA expression was assessed using unique molecular identifiers, which enabled PCR bias removal and calculation of the number of original mRNA transcripts. We identified differences in mRNA expression between different HLA genes and alleles. Our results suggest that HLA alleles are differentially expressed and these differences in expression levels are quantifiable using RNA sequencing technology. Our method provides novel insights into HLA research, and it can be applied to quantify expression differences of HLA alleles in various tissues and to evaluate the role of this type of variation in transplantation matching and susceptibility to autoimmune diseases.
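The idea behind using unique molecular identifiers (UMIs) to remove PCR bias can be shown with a small sketch: reads that share both an allele assignment and a UMI are treated as PCR copies of one original transcript, so the transcript count is the number of distinct UMIs per allele. The input format and function name are assumptions for illustration, and the sketch ignores sequencing errors within UMIs, which a real pipeline would likely need to handle.

```python
from collections import defaultdict


def count_transcripts_by_umi(reads):
    """Estimate original transcript counts per allele from UMI-tagged reads.

    `reads` is an iterable of (allele, umi) pairs (an assumed format).
    Reads with the same allele and UMI are collapsed into one molecule.
    """
    umis_per_allele = defaultdict(set)
    for allele, umi in reads:
        umis_per_allele[allele].add(umi)
    return {allele: len(umis) for allele, umis in umis_per_allele.items()}


reads = [
    ("A*02:01", "ACGT"),
    ("A*02:01", "ACGT"),  # PCR duplicate of the read above
    ("A*02:01", "TTGA"),
    ("A*03:01", "ACGT"),  # same UMI, different allele: separate molecule
]
transcripts = count_transcripts_by_umi(reads)
```

Here four reads collapse to two molecules for the first allele and one for the second.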
Affiliation(s)
- Tiira Johansson
  - Research Programs Unit, Translational Immunology Program, University of Helsinki, Helsinki, Finland
  - Research and Development, Finnish Red Cross Blood Service, Helsinki, Finland
- Dawit A. Yohannes
  - Research Programs Unit, Translational Immunology Program, University of Helsinki, Helsinki, Finland
- Satu Koskela
  - Research and Development, Finnish Red Cross Blood Service, Helsinki, Finland
- Jukka Partanen
  - Research and Development, Finnish Red Cross Blood Service, Helsinki, Finland
- Päivi Saavalainen
  - Research Programs Unit, Translational Immunology Program, University of Helsinki, Helsinki, Finland
  - Research and Development, Finnish Red Cross Blood Service, Helsinki, Finland
6
Diaz Caballero J, Clark ST, Wang PW, Donaldson SL, Coburn B, Tullis DE, Yau YCW, Waters VJ, Hwang DM, Guttman DS. A genome-wide association analysis reveals a potential role for recombination in the evolution of antimicrobial resistance in Burkholderia multivorans. PLoS Pathog 2018; 14:e1007453. [PMID: 30532201 PMCID: PMC6300292 DOI: 10.1371/journal.ppat.1007453] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 06/11/2018] [Revised: 12/19/2018] [Accepted: 11/02/2018] [Indexed: 01/05/2023]
Abstract
Cystic fibrosis (CF) lung infections caused by members of the Burkholderia cepacia complex, such as Burkholderia multivorans, are associated with high rates of mortality and morbidity. We performed a population genomics study of 111 B. multivorans sputum isolates from one CF patient through three stages of infection including an early incident isolate, deep sampling of a one-year period of chronic infection occurring weeks before a lung transplant, and deep sampling of a post-transplant infection. We reconstructed the evolutionary history of the population and used a lineage-controlled genome-wide association study (GWAS) approach to identify genetic variants associated with antibiotic resistance. We found the incident isolate was basally related to the rest of the strains and more susceptible to antibiotics from three classes (β-lactams, aminoglycosides, quinolones). The chronic infection isolates diversified into multiple, distinct genetic lineages and showed reduced antimicrobial susceptibility to the same antibiotics. The post-transplant reinfection isolates derived from the same source as the incident isolate and were genetically distinct from the chronic isolates. They also had a level of susceptibility in between that of the incident and chronic isolates. We identified numerous examples of potential parallel pathoadaptation, in which multiple mutations were found in the same locus or even codon. The set of parallel pathoadaptive loci was enriched for functions associated with virulence and resistance. Our GWAS analysis identified statistical associations between a polymorphism in the ampD locus and resistance to β-lactams, and between polymorphisms in an araC transcriptional regulator and an outer membrane porin and resistance to both aminoglycosides and quinolones. Additionally, these three loci were independently mutated four, three and two times, respectively, providing further support for parallel pathoadaptation. Finally, we identified a minimum of 14 recombination events, and observed that loci carrying putative parallel pathoadaptations and polymorphisms statistically associated with β-lactam resistance were over-represented in these recombinogenic regions.
Cystic fibrosis (CF) is the most common lethal genetic disorder affecting individuals of European descent. Most CF patients die at a young age due to chronic lung infections. Among the organisms involved in these infections are bacteria from the Burkholderia cepacia complex (BCC), which are strongly associated with poor clinical prognosis. This study examines how the most prevalent BCC species among CF patients, B. multivorans, evolves within a single CF patient by studying the first B. multivorans isolate recovered from the patient, one hundred isolates recovered over a one-year period during the chronic infection phase, and an additional ten isolates recovered after the reinfection of the transplanted lungs. We found that B. multivorans diversifies phenotypically and genetically within the CF lung over the course of the infection, and evolves into a complex population during the chronic infection phase. We found that isolates collected from the post-transplant reinfection were more closely related to descendants of the original isolate than to those recovered in the chronic infection. We identified genetic variants statistically associated with resistance to the antibiotics, and showed that some of these variants were found in regions that show patterns of recombination (genetic exchange) between strains. We also found that genes that were mutated multiple times during the overall infection were more likely to be found in regions showing signals consistent with recombination. The presence of multiple independent mutations in a gene is a very strong signal that the gene helps bacteria adapt to their environment. Overall, this study provides insight into how pathogens adapt to the host during long-term infections, specific genes associated with antibiotic resistance, and the origin of new and recurrent infections.
Affiliation(s)
- Julio Diaz Caballero
  - Department of Cell and Systems Biology, University of Toronto, Toronto, Ontario, Canada
- Shawn T. Clark
  - Latner Thoracic Surgery Laboratories, University Health Network, University of Toronto, Toronto, Ontario, Canada
  - Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada
- Pauline W. Wang
  - Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, Ontario, Canada
- Sylva L. Donaldson
  - Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, Ontario, Canada
- Bryan Coburn
  - Division of Infectious Diseases, Department of Medicine, University Health Network, University of Toronto, Toronto, Ontario, Canada
- D. Elizabeth Tullis
  - Adult Cystic Fibrosis Clinic, St. Michael's Hospital, Toronto, Ontario, Canada
- Yvonne C. W. Yau
  - Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada
  - Department of Pediatric Laboratory Medicine, Division of Microbiology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Valerie J. Waters
  - Department of Pediatrics, Division of Infectious Diseases, The Hospital for Sick Children, University of Toronto, Toronto, Ontario, Canada
- David M. Hwang
  - Latner Thoracic Surgery Laboratories, University Health Network, University of Toronto, Toronto, Ontario, Canada
  - Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada
  - Department of Pathology, University Health Network, Toronto, Ontario, Canada
- David S. Guttman
  - Department of Cell and Systems Biology, University of Toronto, Toronto, Ontario, Canada
  - Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, Ontario, Canada
7
Abstract
Background Homology search is still a significant step in functional analysis for genomic data. Profile Hidden Markov Model-based homology search has been widely used in protein domain analysis in many different species. In particular, with the fast accumulation of transcriptomic data of non-model species and metagenomic data, profile homology search is widely adopted in integrated pipelines for functional analysis. While the state-of-the-art tool HMMER has achieved high sensitivity and accuracy in domain annotation, the sensitivity of HMMER on short reads declines rapidly. The low sensitivity on short read homology search can lead to inaccurate domain composition and abundance computation. Our experimental results showed that half of the reads were missed by HMMER for an RNA-Seq dataset. Thus, there is a need for better methods to improve the homology search performance for short reads. Results We introduce a profile homology search tool named Short-Pair that is designed for short paired-end reads. By using an approximate Bayesian approach employing distribution of fragment lengths and alignment scores, Short-Pair can retrieve the missing end and determine true domains. In particular, Short-Pair increases the accuracy in aligning short reads that are part of remote homologs. We applied Short-Pair to an RNA-Seq dataset and a metagenomic dataset and quantified its sensitivity and accuracy on homology search. The experimental results show that Short-Pair can achieve better overall performance than the state-of-the-art methodology of profile homology search. Conclusions Short-Pair is best used for next-generation sequencing (NGS) data that lack reference genomes. It provides a complementary paired-end read homology search tool to HMMER. The source code is freely available at https://sourceforge.net/projects/short-pair/.
8
Tsuji J, Weng Z. Evaluation of preprocessing, mapping and postprocessing algorithms for analyzing whole genome bisulfite sequencing data. Brief Bioinform 2016; 17:938-952. [PMID: 26628557 PMCID: PMC5142012 DOI: 10.1093/bib/bbv103] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 08/14/2015] [Revised: 10/02/2015] [Indexed: 01/03/2023]
Abstract
Cytosine methylation regulates many biological processes such as gene expression, chromatin structure and chromosome stability. The whole genome bisulfite sequencing (WGBS) technique measures the methylation level at each cytosine throughout the genome. There are an increasing number of publicly available pipelines for analyzing WGBS data, reflecting many choices of read mapping algorithms as well as preprocessing and postprocessing methods. We simulated single-end and paired-end reads based on three experimental data sets, and comprehensively evaluated 192 combinations of three preprocessing, five postprocessing and five widely used read mapping algorithms. We also compared paired-end data with single-end data at the same sequencing depth for performance of read mapping and methylation level estimation. Bismark and LAST were the most robust mapping algorithms. We found that Mott trimming and quality filtering individually improved the performance of both read mapping and methylation level estimation, but combining them did not lead to further improvement. Furthermore, we confirmed that paired-end sequencing reduced error rate and enhanced sensitivity for both read mapping and methylation level estimation, especially for short reads and in repetitive regions of the human genome.
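The abstract does not describe Mott trimming itself, so the following is only a sketch of one common formulation, assumed here rather than taken from the paper: each base contributes `cutoff - p_error` to a running score, and the read is trimmed to the contiguous segment with the highest total. The function name and the default cutoff of 0.05 are hypothetical.

```python
def mott_trim(phred_scores, cutoff=0.05):
    """Return (start, end) of the segment to keep, as a half-open range.

    Each base scores cutoff - p_error, where p_error = 10 ** (-Q / 10).
    The kept segment is the contiguous run with the maximum summed score;
    (0, 0) is returned when no base scores above zero.
    """
    best_sum, best_range = 0.0, (0, 0)
    run_sum, run_start = 0.0, 0
    for i, q in enumerate(phred_scores):
        p_error = 10 ** (-q / 10)
        run_sum += cutoff - p_error
        if run_sum <= 0:
            # A non-positive running score can never help: restart after i.
            run_sum, run_start = 0.0, i + 1
        elif run_sum > best_sum:
            best_sum, best_range = run_sum, (run_start, i + 1)
    return best_range


# Low-quality ends (Q2) around a high-quality core (Q30).
kept = mott_trim([2, 2, 30, 30, 30, 2, 2])
```

With these scores the low-quality bases at both ends are dropped and only the three-base core is kept.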
9
Selective Sweeps and Parallel Pathoadaptation Drive Pseudomonas aeruginosa Evolution in the Cystic Fibrosis Lung. mBio 2015; 6:e00981-15. [PMID: 26330513 PMCID: PMC4556809 DOI: 10.1128/mbio.00981-15] [Citation(s) in RCA: 87] [Impact Index Per Article: 9.7] [Indexed: 12/16/2022]
Abstract
Pulmonary infections caused by Pseudomonas aeruginosa are a recalcitrant problem in cystic fibrosis (CF) patients. While the clinical implications and long-term evolutionary patterns of these infections are well studied, we know little about the short-term population dynamics that enable this pathogen to persist despite aggressive antimicrobial therapy. Here, we describe a short-term population genomic analysis of 233 P. aeruginosa isolates collected from 12 sputum specimens obtained over a 1-year period from a single patient. Whole-genome sequencing and antimicrobial susceptibility profiling identified the expansion of two clonal lineages. The first lineage originated from the coalescence of the entire sample less than 3 years before the end of the study and gave rise to a high-diversity ancestral population. The second expansion occurred 2 years later and gave rise to a derived population with a strong signal of positive selection. These events show characteristics consistent with recurrent selective sweeps. While we cannot identify the specific mutations responsible for the origins of the clonal lineages, we find that the majority of mutations occur in loci previously associated with virulence and resistance. Additionally, approximately one-third of all mutations occur in loci that are mutated multiple times, highlighting the importance of parallel pathoadaptation. One such locus is the gene encoding penicillin-binding protein 3, which received three independent mutations. Our functional analysis of these alleles shows that they provide differential fitness benefits dependent on the antibiotic under selection. These data reveal that bacterial populations can undergo extensive and dramatic changes that are not revealed by lower-resolution analyses.
IMPORTANCE Pseudomonas aeruginosa is a bacterial opportunistic pathogen responsible for significant morbidity and mortality in cystic fibrosis (CF) patients. Once it has colonized the lung in CF, it is highly resilient and rarely eradicated. This study presents a deep sampling examination of the fine-scale evolutionary dynamics of P. aeruginosa in the lungs of a chronically infected CF patient. We show that diversity of P. aeruginosa is driven by recurrent clonal emergence and expansion within this patient and identify potential adaptive variants associated with these events. This high-resolution sequencing strategy thus reveals important intraspecies dynamics that explain a clinically important phenomenon not evident at a lower-resolution analysis of community structure.
10
Killcoyne S, del Sol A. FIGG: simulating populations of whole genome sequences for heterogeneous data analyses. BMC Bioinformatics 2014; 15:149. [PMID: 24885193 PMCID: PMC4039316 DOI: 10.1186/1471-2105-15-149] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 07/30/2013] [Accepted: 05/09/2014] [Indexed: 12/15/2022]
Abstract
Background High-throughput sequencing has become one of the primary tools for investigation of the molecular basis of disease. The increasing use of sequencing in investigations that aim to understand both individuals and populations is challenging our ability to develop analysis tools that scale with the data. This issue is of particular concern in studies that exhibit a wide degree of heterogeneity or deviation from the standard reference genome. The advent of population scale sequencing studies requires analysis tools that are developed and tested against matching quantities of heterogeneous data. Results We developed a large-scale whole genome simulation tool, FIGG, which generates large numbers of whole genomes with known sequence characteristics based on direct sampling of experimentally known or theorized variations. For normal variations we used publicly available data to determine the frequency of different mutation classes across the genome. FIGG then uses this information as a background to generate new sequences from a parent sequence with matching frequencies, but different actual mutations. The background can be normal variations, known disease variations, or a theoretical frequency distribution of variations. Conclusion In order to enable the creation of large numbers of genomes, FIGG generates simulated sequences from known genomic variation and iteratively mutates each genome separately. The result is multiple whole genome sequences with unique variations that can primarily be used to provide different reference genomes, model heterogeneous populations, and can offer a standard test environment for new analysis algorithms or bioinformatics tools.
Affiliation(s)
- Antonio del Sol
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, 7, avenue des Hauts fourneaux, Esch/Alzette L-4362, Luxembourg
11
Abstract
Many bioinformatics problems, such as sequence alignment, gene prediction, phylogenetic tree estimation and RNA secondary structure prediction, are often affected by the 'uncertainty' of a solution, that is, the probability of the solution is extremely small. This situation arises for estimation problems on high-dimensional discrete spaces in which the number of possible discrete solutions is immense. In the analysis of biological data or the development of prediction algorithms, this uncertainty should be handled carefully and appropriately. In this review, I will explain several methods to combat this uncertainty, presenting a number of examples in bioinformatics. The methods include (i) avoiding point estimation, (ii) maximum expected accuracy (MEA) estimations and (iii) several strategies to design a pipeline involving several prediction methods. I believe that the basic concepts and ideas described in this review will be generally useful for estimation problems in various areas of bioinformatics.
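As a point of reference for method (ii), maximum expected accuracy estimation is commonly written as choosing the prediction that maximizes the expected value of a gain function under the posterior distribution of solutions, rather than choosing the single most probable solution. The notation below is an assumption for illustration and is not taken from the review: $Y$ is the discrete solution space, $x$ the observed data, $p(\theta \mid x)$ the posterior probability of solution $\theta$, and $G(\theta, y)$ a gain function measuring the accuracy of prediction $y$ when $\theta$ is the true solution.

```latex
\hat{y}
  = \operatorname*{arg\,max}_{y \in Y} \; \mathbb{E}_{\theta \mid x}\!\left[ G(\theta, y) \right]
  = \operatorname*{arg\,max}_{y \in Y} \sum_{\theta \in Y} G(\theta, y)\, p(\theta \mid x)
```

When $G(\theta, y)$ is 1 for $y = \theta$ and 0 otherwise, this reduces to picking the most probable single solution, which is the point estimate whose probability the abstract describes as extremely small in high-dimensional spaces.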