1
Merino DM, McShane LM, Fabrizio D, Funari V, Chen SJ, White JR, Wenz P, Baden J, Barrett JC, Chaudhary R, Chen L, Chen WS, Cheng JH, Cyanam D, Dickey JS, Gupta V, Hellmann M, Helman E, Li Y, Maas J, Papin A, Patidar R, Quinn KJ, Rizvi N, Tae H, Ward C, Xie M, Zehir A, Zhao C, Dietel M, Stenzinger A, Stewart M, Allen J. Establishing guidelines to harmonize tumor mutational burden (TMB): in silico assessment of variation in TMB quantification across diagnostic platforms: phase I of the Friends of Cancer Research TMB Harmonization Project. J Immunother Cancer 2021; 8:jitc-2019-000147. [PMID: 32217756 PMCID: PMC7174078 DOI: 10.1136/jitc-2019-000147] [Citation(s) in RCA: 281] [Impact Index Per Article: 93.7] [Accepted: 02/11/2020] [Indexed: 12/13/2022]
Abstract
Background Tumor mutational burden (TMB), defined as the number of somatic mutations per megabase of interrogated genomic sequence, demonstrates predictive biomarker potential for the identification of patients with cancer most likely to respond to immune checkpoint inhibitors. TMB is optimally calculated by whole exome sequencing (WES), but next-generation sequencing targeted panels provide TMB estimates in a time-effective and cost-effective manner. However, differences in panel size and gene coverage, in addition to the underlying bioinformatics pipelines, are known drivers of variability in TMB estimates across laboratories. By directly comparing panel-based TMB estimates from participating laboratories, this study aims to characterize the theoretical variability of panel-based TMB estimates, and provides guidelines on TMB reporting, analytic validation requirements and reference standard alignment in order to maintain consistency of TMB estimation across platforms. Methods Eleven laboratories used WES data from The Cancer Genome Atlas Multi-Center Mutation calling in Multiple Cancers (MC3) samples and calculated TMB from the subset of the exome restricted to the genes covered by their targeted panel using their own bioinformatics pipeline (panel TMB). A reference TMB value was calculated from the entire exome using a uniform bioinformatics pipeline all members agreed on (WES TMB). Linear regression analyses were performed to investigate the relationship between WES and panel TMB for all 32 cancer types combined and separately. Variability in panel TMB values at various WES TMB values was also quantified using 95% prediction limits. Results Study results demonstrated that variability within and between panel TMB values increases as the WES TMB values increase. 
For each panel, prediction limits based on linear regression analyses that modeled panel TMB as a function of WES TMB were calculated and found to approximately capture the intended 95% of observed panel TMB values. Certain cancer types, such as uterine, bladder and colon cancers exhibited greater variability in panel TMB values, compared with lung and head and neck cancers. Conclusions Increasing uptake of TMB as a predictive biomarker in the clinic creates an urgent need to bring stakeholders together to agree on the harmonization of key aspects of panel-based TMB estimation, such as the standardization of TMB reporting, standardization of analytical validation studies and the alignment of panel-based TMB values with a reference standard. These harmonization efforts should improve consistency and reliability of panel TMB estimates and aid in clinical decision-making.
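The definition of TMB used in this abstract (somatic mutations per megabase of interrogated sequence) and the idea of restricting exome-wide calls to the genes covered by a panel can be illustrated with a minimal sketch. The function names, gene names, mutation counts, and panel size below are hypothetical and chosen only for illustration; real pipelines also filter by variant type, germline status, and coverage, which this sketch ignores.

```python
from typing import Iterable, Set


def tmb_per_megabase(mutation_count: int, interrogated_bases: int) -> float:
    """Somatic mutations per megabase (Mb) of interrogated sequence."""
    if interrogated_bases <= 0:
        raise ValueError("interrogated_bases must be positive")
    return mutation_count * 1_000_000 / interrogated_bases


def panel_tmb(mutated_genes: Iterable[str], panel_genes: Set[str],
              panel_bases: int) -> float:
    """TMB from only those mutations that fall in genes covered by a panel.

    Each element of mutated_genes names the gene hit by one somatic mutation.
    """
    in_panel = sum(1 for gene in mutated_genes if gene in panel_genes)
    return tmb_per_megabase(in_panel, panel_bases)


# Hypothetical sample: 300 exome-wide mutations over a 30 Mb exome.
wes_tmb = tmb_per_megabase(300, 30_000_000)

# Hypothetical 1.2 Mb panel covering two genes; 12 mutations fall inside it.
calls = ["TP53"] * 7 + ["KRAS"] * 5 + ["TTN"] * 288
panel_estimate = panel_tmb(calls, {"TP53", "KRAS"}, 1_200_000)
print(wes_tmb, panel_estimate)
```

In this toy case the two estimates agree; the abstract's point is that with real panels and pipelines they diverge, increasingly so at high WES TMB.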
Affiliation(s)
- Paul Wenz
  - Clinical Genomics, Illumina Inc, San Diego, California, USA
- J Carl Barrett
  - Translational Medicine, Oncology Research and Early Development, AstraZeneca Pharmaceuticals LP, Boston, Massachusetts, USA
- Ruchi Chaudhary
  - Clinical Sequencing Division, Thermo Fisher Scientific, Ann Arbor, Michigan, USA
- Li Chen
  - Molecular Characterization Laboratory, Frederick National Laboratory for Cancer Research, Frederick, Maryland, USA
- Dinesh Cyanam
  - Clinical Sequencing Division, Thermo Fisher Scientific, Ann Arbor, Michigan, USA
- Elena Helman
  - Bioinformatics, Guardant Health Inc, Redwood City, California, USA
- Yali Li
  - Foundation Medicine Inc, Cambridge, Massachusetts, USA
- Joerg Maas
  - Quality in Pathology (QuIP), Berlin, Germany
- Rajesh Patidar
  - Molecular Characterization Laboratory, Frederick National Laboratory for Cancer Research, Frederick, Maryland, USA
- Katie J Quinn
  - Bioinformatics, Guardant Health Inc, Redwood City, California, USA
- Naiyer Rizvi
  - Division of Hematology/Oncology, Department of Medicine, Columbia University, New York, New York, USA
- Mingchao Xie
  - AstraZeneca Pharmaceuticals LP, Waltham, Massachusetts, USA
- Ahmet Zehir
  - Memorial Sloan Kettering Cancer Center, New York, New York, USA
- Chen Zhao
  - Clinical Genomics, Illumina Inc, San Diego, California, USA
- Albrecht Stenzinger
  - Institute of Pathology, University Hospital Heidelberg, Heidelberg, Baden-Württemberg, Germany
- Jeff Allen
  - Friends of Cancer Research, Washington, DC, USA
2
Saul M, Poorman K, Tae H, Vanderwalde A, Stafford P, Spetzler D, Korn WM, Gatalica Z, Swensen J. Population bias in somatic measurement of microsatellite instability status. Cancer Med 2020; 9:6452-6460. [PMID: 32644297 PMCID: PMC7476819 DOI: 10.1002/cam4.3294] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/10/2020] [Revised: 06/17/2020] [Accepted: 06/18/2020] [Indexed: 11/23/2022]
Abstract
Microsatellite instability (MSI) is a key secondary effect of a defective DNA mismatch repair mechanism resulting in incorrectly replicated microsatellites in many malignant tumors. Historically, MSI detection has been performed by fragment analysis (FA) on a panel of representative genomic markers. More recently, using next-generation sequencing (NGS) to analyze thousands of microsatellites has been shown to improve the robustness and sensitivity of MSI detection. However, NGS-based MSI tests can be prone to population biases if NGS results are aligned to a reference genome instead of patient-matched normal tissue. We observed an increased rate of false positives in patients of African ancestry with an NGS-based diagnostic for MSI status utilizing 7317 microsatellite loci. We then minimized this bias by training a modified calling model that utilized 2011 microsatellite loci. With these adjustments, 100% (95% CI: 89.1% to 100%) of African ancestry patients in an independent validation test were called correctly using the updated model. This is not only a significant technical improvement but also has an important clinical impact on directing immune checkpoint inhibitor therapy.
Affiliation(s)
- Ari Vanderwalde
  - The University of Tennessee Health Science Center and West Cancer Center, Memphis, TN, USA
3
Stein MK, Pandey M, Xiu J, Tae H, Swensen J, Mittal S, Brenner AJ, Korn WM, Heimberger AB, Martin MG. Tumor Mutational Burden Is Site Specific in Non–Small-Cell Lung Cancer and Is Highest in Lung Adenocarcinoma Brain Metastases. JCO Precis Oncol 2019; 3:1-13. [DOI: 10.1200/po.18.00376] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 01/10/2023]
Abstract
PURPOSE Tumor mutational burden (TMB) is a developing biomarker in non–small-cell lung cancer (NSCLC). Little is known about how TMB varies with sample location, histology, or other biomarkers. METHODS A total of 3,424 unmatched NSCLC samples, including 2,351 lung adenocarcinomas (LUADs) and 1,073 lung squamous cell carcinomas (LUSCs), underwent profiling, including next-generation sequencing of 592 cancer-related genes, programmed death ligand 1 immunohistochemistry, and TMB. The rate of TMB of 10 mutations per megabase (Mb) or greater was compared between primary and metastatic LUAD and LUSC. Molecular alteration frequency was compared at a cutoff of 10 mutations/Mb. RESULTS LUAD metastases were more likely to have a TMB of 10 mutations/Mb or greater compared with primary LUADs (38% v 25%; P < .001), and this difference was most pronounced with brain metastases (61% v 35% for other metastases; P < .001). The median TMB for LUAD brain metastases was 13 mutations/Mb compared with six mutations/Mb for primary LUADs. Variability existed for other LUAD metastasis sites, with adrenal metastases most likely to meet the cutoff of 10 mutations/Mb (51%) and bone metastases least likely to meet the cutoff (19%). TMB was more commonly 10 mutations/Mb or greater for LUSC primary tumors than for LUAD primary tumors (35% v 25%, respectively; P < .001). LUSC metastases were more likely to have a TMB of 10 mutations/Mb or greater than LUSC primary tumors. Poorly differentiated disease was more likely to have a TMB of 10 mutations/Mb or greater when stratified by histology and primary tumor or metastasis. Site-specific molecular differences existed at this TMB cutoff, including programmed death ligand 1 positivity and STK11 and KRAS mutation rate. CONCLUSION TMB is a site-specific biomarker in NSCLC with important spatial and histologic differences. TMB is more frequently 10 mutations/Mb or greater in LUAD and LUSC metastases and highest in LUAD brain metastases. At this TMB cutoff, clinically informative distinctions exist in other tumor profiling characteristics. Further investigation is needed to expand on these findings.
Affiliation(s)
- Matthew K. Stein
  - West Cancer Center, University of Tennessee Health Science Center, Memphis, TN
- Manjari Pandey
  - West Cancer Center, University of Tennessee Health Science Center, Memphis, TN
- Sandeep Mittal
  - Wayne State University, Detroit, MI
  - Carilion Clinic and Virginia Tech Carilion School of Medicine and Research Institute, Roanoke, VA
- Andrew J. Brenner
  - Mays Cancer Center, The University of Texas Health Science Center at San Antonio, San Antonio, TX
- Mike G. Martin
  - West Cancer Center, University of Tennessee Health Science Center, Memphis, TN
4
Kang L, Settlage R, McMahon W, Michalak K, Tae H, Garner HR, Stacy EA, Price DK, Michalak P. Genomic Signatures of Speciation in Sympatric and Allopatric Hawaiian Picture-Winged Drosophila. Genome Biol Evol 2016; 8:1482-8. [PMID: 27189993 PMCID: PMC4898809 DOI: 10.1093/gbe/evw095] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Indexed: 01/23/2023]
Abstract
The Hawaiian archipelago provides a natural arena for understanding adaptive radiation and speciation. The Hawaiian Drosophila are one of the most diverse endemic groups in Hawaiʻi with up to 1,000 species. We sequenced and analyzed entire genomes of recently diverged species of Hawaiian picture-winged Drosophila, Drosophila silvestris and Drosophila heteroneura from Hawaiʻi Island, in comparison with Drosophila planitibia, their sister species from Maui, a neighboring island where a common ancestor of all three had likely occurred. Genome-wide single nucleotide polymorphism patterns suggest the more recent origin of D. silvestris and D. heteroneura, as well as a pervasive influence of positive selection on divergence of the three species, with the signatures of positive selection more prominent in sympatry than allopatry. Positively selected genes were significantly enriched for functional terms related to sensory detection and mating, suggesting that sexual selection played an important role in speciation of these species. In particular, sequence variation in Olfactory receptor and Gustatory receptor genes seems to play a major role in adaptive radiation in Hawaiian picture-winged Drosophila.
Affiliation(s)
- Lin Kang
  - Biocomplexity Institute, Virginia Tech, Blacksburg, Virginia
- Robert Settlage
  - Biocomplexity Institute, Virginia Tech, Blacksburg, Virginia
- Wyatt McMahon
  - Howard Hughes Medical Institute, Johns Hopkins Medical Institutes, Baltimore, Maryland
- Hongseok Tae
  - Biocomplexity Institute, Virginia Tech, Blacksburg, Virginia
- Harold R Garner
  - Primary Care Research Network and the Center for Bioinformatics and Genetics, Edward Via College of Osteopathic Medicine, Blacksburg, Virginia
- Elizabeth A Stacy
  - Tropical Conservation Biology and Environmental Science Graduate Program, University of Hawaiʻi at Hilo
- Donald K Price
  - Tropical Conservation Biology and Environmental Science Graduate Program, University of Hawaiʻi at Hilo
- Pawel Michalak
  - Biocomplexity Institute, Virginia Tech, Blacksburg, Virginia
5
6
Tae H, Karunasena E, Bavarva JH, McIver LJ, Garner HR. Large scale comparison of non-human sequences in human sequencing data. Genomics 2014; 104:453-8. [PMID: 25173571 DOI: 10.1016/j.ygeno.2014.08.009] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Received: 08/21/2013] [Revised: 08/17/2014] [Accepted: 08/19/2014] [Indexed: 11/19/2022]
Abstract
Several studies have demonstrated that unmapped reads in next generation sequencing data could be used to identify infectious agents or structural variants, but there has been no intensive effort to analyze and classify all non-human sequences found in individual large data sets. To identify commonality in non-human sequences by infectious agents and putative contamination events, we analyzed non-human sequences in 150 genomic sequencing data files from the 1000 Genomes Project and observed that 0.13% of reads on average showed similarities to non-human genomes. We compared results among different sample groups divided based on ethnicities, sequencing centers and enrichment methods (whole genome sequencing vs. exome sequencing) and found that sequencing centers had specific signatures of contaminating genomes as 'time stamps'. We also observed many unmapped reads that falsely indicated contamination because of the high similarity of human sequences to sequences in non-human genome assemblies such as mouse and Nicotiana.
Affiliation(s)
- Hongseok Tae
  - Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA, USA
- Enusha Karunasena
  - Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA, USA
- Jasmin H Bavarva
  - Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA, USA
- Lauren J McIver
  - Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA, USA
- Harold R Garner
  - Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA, USA
7
Bavarva JH, Tae H, McIver L, Karunasena E, Garner HR. The dynamic exome: acquired variants as individuals age. Aging (Albany NY) 2014; 6:511-521. [PMID: 25063753 PMCID: PMC4100812 DOI: 10.18632/aging.100674] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 04/24/2014] [Accepted: 06/14/2014] [Indexed: 06/03/2023]
Abstract
Using a single genome for inference in population-based studies is a standard method in genomics. Recent studies show that spontaneous genomic variants can propagate into new generations and that these changes can contribute to individual cell aging, with environmental and evolutionary elements contributing to cumulative genomic variation. However, the contribution of aging to genomic changes in tissue samples remains uncharacterized. Here, we report the impact of aging on individual human exomes and their implications. We found the human genome to be dynamic, acquiring a varying number of mutations with age (5,000 to 50,000 in 9 to 16 years). This equates to a variation rate of 9.6 × 10⁻⁷ to 8.4 × 10⁻⁶ bp⁻¹ year⁻¹ for nonsynonymous single nucleotide variants and 2.0 × 10⁻⁴ to 1.0 × 10⁻³ locus⁻¹ year⁻¹ for microsatellite loci in these individuals. These mutations span 3,000 to 13,000 genes, which commonly showed association with Wnt signaling and gonadotropin-releasing hormone receptor pathways, and indicated for individuals a specific and significant enrichment for increased risk for diabetes, kidney failure, cancer, rheumatoid arthritis, and Alzheimer's disease, conditions usually associated with aging. The results suggest that "age" is an important variable when analyzing an individual human genome to extract the individual-specific, clinically significant information necessary for personalized genomics.
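The per-site, per-year rates quoted in this abstract have the form "new variants divided by (sites examined × years elapsed)". A minimal sketch of that arithmetic follows; the function name and the input numbers are hypothetical, picked only to show the units, and are not taken from the study's data.

```python
def variation_rate(new_variants: int, sites: int, years: float) -> float:
    """Acquired variants per site per year."""
    if sites <= 0 or years <= 0:
        raise ValueError("sites and years must be positive")
    return new_variants / (sites * years)


# Hypothetical example: 300 new variants across 30 million examined
# base pairs over a 10-year interval.
rate = variation_rate(new_variants=300, sites=30_000_000, years=10)
print(f"{rate:.1e} per bp per year")
```

The same function applies to microsatellites if `sites` is taken as the number of loci examined rather than the number of base pairs.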
Affiliation(s)
- Jasmin H Bavarva
  - Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24061, USA
8
Tae H, Kim DY, McCormick J, Settlage RE, Garner HR. Discretized Gaussian mixture for genotyping of microsatellite loci containing homopolymer runs. Bioinformatics 2013; 30:652-9. [PMID: 24135263 DOI: 10.1093/bioinformatics/btt595] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 01/21/2023]
Abstract
MOTIVATION Inferring lengths of inherited microsatellite alleles with single base pair resolution from short sequence reads is challenging due to several sources of noise caused by the repetitive nature of microsatellites and the technologies used to generate raw sequence data. RESULTS We have developed a program, GenoTan, using a discretized Gaussian mixture model combined with a rules-based approach to identify inherited variation of microsatellite loci from short sequence reads without paired-end information. It effectively distinguishes length variants from noise including insertion/deletion errors in homopolymer runs by addressing the bidirectional aspect of insertion and deletion errors in sequence reads. Here we first introduce a homopolymer decomposition method which estimates error bias toward insertion or deletion in homopolymer sequence runs. Combining these approaches, GenoTan was able to genotype 94.9% of microsatellite loci accurately from simulated data with 40x sequence coverage quickly while the other programs showed <90% correct calls for the same data and required 5∼30× more computational time than GenoTan. It also showed the highest true-positive rate for real data using mixed sequence data of two Drosophila inbred lines, which was a novel validation approach for genotyping. AVAILABILITY GenoTan is open-source software available at http://genotan.sourceforge.net.
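To make the phrase "discretized Gaussian mixture" concrete, the sketch below shows one common way to discretize a Gaussian over integer allele lengths: assign each integer length the probability mass of the interval half a base pair either side of it. This is an assumption chosen for illustration, not GenoTan's documented implementation, and the weights, means, and standard deviations are hypothetical.

```python
import math
from typing import List, Tuple


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def discretized_gaussian(k: int, mu: float, sigma: float) -> float:
    """Probability mass a Gaussian assigns to the interval [k - 0.5, k + 0.5]."""
    return normal_cdf(k + 0.5, mu, sigma) - normal_cdf(k - 0.5, mu, sigma)


def mixture_pmf(k: int, components: List[Tuple[float, float, float]]) -> float:
    """Mixture probability of observing length k.

    components holds (weight, mean, standard deviation) triples whose
    weights are assumed to sum to 1.
    """
    return sum(w * discretized_gaussian(k, mu, sigma)
               for w, mu, sigma in components)


# Hypothetical heterozygous locus: alleles of 20 bp and 24 bp, each
# blurred by read-level length noise with a standard deviation of 1 bp.
alleles = [(0.5, 20.0, 1.0), (0.5, 24.0, 1.0)]
for length in range(18, 27):
    print(length, round(mixture_pmf(length, alleles), 4))
```

Under this toy model, observed read lengths cluster around the two allele lengths, and lengths between them receive lower probability.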
Affiliation(s)
- Hongseok Tae
  - Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24061 and Office of Biostatistics Research, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD 20892, USA
9
Radakovits R, Jinkerson RE, Fuerstenberg SI, Tae H, Settlage RE, Boore JL, Posewitz MC. Erratum: Corrigendum: Draft genome sequence and genetic transformation of the oleaginous alga Nannochloropsis gaditana. Nat Commun 2013. [PMCID: PMC3868315 DOI: 10.1038/ncomms3356] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Indexed: 11/28/2022]
10
Bavarva JH, Tae H, Settlage RE, Garner HR. Characterizing the Genetic Basis for Nicotine Induced Cancer Development: A Transcriptome Sequencing Study. PLoS One 2013; 8:e67252. [PMID: 23825647 PMCID: PMC3688980 DOI: 10.1371/journal.pone.0067252] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 02/05/2013] [Accepted: 05/15/2013] [Indexed: 12/31/2022]
Abstract
Nicotine is a known risk factor for cancer development and has been shown to alter gene expression in cells and tissue upon exposure. We used Illumina® Next Generation Sequencing (NGS) technology to gain unbiased biological insight into the transcriptome response of normal epithelial cells (MCF-10A) to nicotine exposure. We generated expression data from 54,699 transcripts using triplicates of control and nicotine-stressed cells. As a result, we identified 138 differentially expressed transcripts, including 39 uncharacterized genes. Additionally, 173 transcripts that are primarily associated with DNA replication, recombination, and repair showed evidence for alternative splicing. We discovered the greatest nicotine stress response in HPCAL4 (up-regulated 4.71-fold) and NPAS3 (down-regulated 2.73-fold); both are genes that have not been previously implicated in nicotine exposure but are linked to cancer. We also discovered significant down-regulation (2.3-fold) and alternative splicing of NEAT1 (lncRNA), which may have an important, yet undiscovered, regulatory role. Gene ontology analysis revealed that nicotine exposure influenced genes involved in cellular and metabolic processes. This study reveals previously unknown consequences of nicotine stress on the transcriptome of normal breast epithelial cells and provides insight into the underlying biological influence of nicotine on normal cells, marking the foundation for future studies.
Affiliation(s)
- Jasmin H. Bavarva
  - Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, United States of America
- Hongseok Tae
  - Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, United States of America
- Robert E. Settlage
  - Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, United States of America
- Harold R. Garner
  - Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, United States of America
11
Tae H, McMahon KW, Settlage RE, Bavarva JH, Garner HR. ReviSTER: an automated pipeline to revise misaligned reads to simple tandem repeats. Bioinformatics 2013; 29:1734-41. [PMID: 23677944 DOI: 10.1093/bioinformatics/btt277] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Indexed: 01/22/2023]
Abstract
MOTIVATION Simple tandem repeats are highly variable genetic elements and widespread in genomes of many organisms. Next-generation sequencing technologies have enabled a robust comparison of large numbers of simple tandem repeat loci; however, analysis of their variation using traditional sequence analysis approaches remains limited and problematic, because variants occurring in repeat sequences can cause alignment programs to map sequence reads to incorrect loci when the reads differ significantly from the reference sequence. RESULTS We have developed a program, ReviSTER, which is an automated pipeline using a 'local mapping reference reconstruction method' to revise mismapped or partially misaligned reads at simple tandem repeat loci. ReviSTER estimates alleles of repeat loci using a local alignment method, creates temporary local mapping reference sequences, and finally remaps reads to the local mapping references. Using this approach, ReviSTER was able to successfully revise reads misaligned to repeat loci from both simulated data and real data. AVAILABILITY ReviSTER is open-source software available at http://revister.sourceforge.net. CONTACT garner@vbi.vt.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hongseok Tae
  - Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24061, USA
12
Bavarva JH, Tae H, Michalak P, Garner HR. Life cycle of an n-globin pseudogene microsatellite locus. Front Genet 2013; 4:267. [PMID: 24363661 PMCID: PMC3849843 DOI: 10.3389/fgene.2013.00267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2013] [Accepted: 11/16/2013] [Indexed: 11/13/2022]
13
Tae H, Ryu D, Sureshchandra S, Choi JH. ESTclean: a cleaning tool for next-gen transcriptome shotgun sequencing. BMC Bioinformatics 2012; 13:247. [PMID: 23009593 PMCID: PMC3630001 DOI: 10.1186/1471-2105-13-247] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 07/09/2012] [Accepted: 09/22/2012] [Indexed: 11/10/2022]
Abstract
BACKGROUND With the advent of next-generation sequencing (NGS) technologies, full cDNA shotgun sequencing has become a major approach in the study of transcriptomes, and several different protocols in 454 sequencing have been invented. As each protocol uses its own short DNA tags or adapters attached to the ends of cDNA fragments for labeling or sequencing, different contaminants may lead to mis-assembly and inaccurate sequence products. RESULTS We have designed and implemented a new program for raw sequence cleaning in a graphical user interface and a batch script. The cleaning process consists of several modules including barcode trimming, sequencing adapter trimming, amplification primer trimming, poly-A tail trimming, vector screening and low quality region trimming. These modules can be combined based on various sequencing applications. CONCLUSIONS ESTclean is a software package not only for cleaning cDNA sequences, but also for helping to develop sequencing protocols by providing summary tables and figures for sequencing quality control in a graphical user interface. It outperforms in cleaning read sequences from complicated sequencing protocols which use barcodes and multiple amplification primers.
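The idea of independent cleaning modules that can be combined per sequencing application can be sketched as a chain of small functions. This is a simplified illustration, not ESTclean's actual code: the function names, the exact-match barcode rule, and the 10-base poly-A threshold are all assumptions.

```python
from typing import Callable, List

Trimmer = Callable[[str], str]


def make_prefix_trimmer(tag: str) -> Trimmer:
    """Build a module that removes an exact barcode or adapter at the read start."""
    def trim(read: str) -> str:
        return read[len(tag):] if read.startswith(tag) else read
    return trim


def trim_poly_a(read: str, min_tail: int = 10) -> str:
    """Remove a trailing run of A bases if it is at least min_tail long."""
    stripped = read.rstrip("Aa")
    tail_length = len(read) - len(stripped)
    return stripped if tail_length >= min_tail else read


def clean(read: str, modules: List[Trimmer]) -> str:
    """Apply the selected cleaning modules in order."""
    for module in modules:
        read = module(read)
    return read


# Hypothetical read: 4-base barcode, 5-base insert, 12-base poly-A tail.
raw = "ACGT" + "TTGCC" + "A" * 12
pipeline = [make_prefix_trimmer("ACGT"), trim_poly_a]
print(clean(raw, pipeline))
```

Because each module has the same signature, a different protocol can be handled by changing only the contents and order of the `pipeline` list.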
Affiliation(s)
- Hongseok Tae
  - The Center for Genomics and Bioinformatics, Indiana University, Bloomington, IN 47401, USA
14
Tae H, Settlage RE, Shallom S, Bavarva JH, Preston D, Hawkins GN, Adams LG, Garner HR. Improved variation calling via an iterative backbone remapping and local assembly method for bacterial genomes. Genomics 2012; 100:271-6. [PMID: 22967795 DOI: 10.1016/j.ygeno.2012.07.015] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 05/08/2012] [Revised: 07/30/2012] [Accepted: 07/31/2012] [Indexed: 12/26/2022]
Abstract
Sequencing data analysis remains limiting and problematic, especially for low complexity repeat sequences and transposon elements due to inherent sequencing errors and short sequence read lengths. We have developed a program, ReviSeq, which uses a hybrid method composed of iterative remapping and local assembly upon a bacterial sequence backbone. Application of this method to six Brucella suis field isolates compared to the newly revised B. suis 1330 reference genome identified on average 13, 15, 19 and 9 more variants per sample than STAMPY/SAMtools, BWA/SAMtools, iCORN and BWA/PINDEL pipelines, and excluded on average 4, 2, 3 and 19 variants per sample, respectively. In total, using this iterative approach, we identified on average 87 variants including SNVs, short INDELs and long INDELs per strain when compared to the reference. Our program outperforms other methods especially for long INDEL calling. The program is available at http://reviseq.sourceforge.net.
Affiliation(s)
- Hongseok Tae
  - Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA, USA
15
Pei L, Choi JH, Liu J, Lee EJ, McCarthy B, Wilson JM, Speir E, Awan F, Tae H, Arthur G, Schnabel JL, Taylor KH, Wang X, Xu D, Ding HF, Munn DH, Caldwell C, Shi H. Genome-wide DNA methylation analysis reveals novel epigenetic changes in chronic lymphocytic leukemia. Epigenetics 2012; 7:567-78. [PMID: 22534504 DOI: 10.4161/epi.20237] [Citation(s) in RCA: 74] [Impact Index Per Article: 6.2] [Indexed: 11/19/2022]
Abstract
We conducted a genome-wide DNA methylation analysis in CD19+ B-cells from chronic lymphocytic leukemia (CLL) patients and normal control samples using reduced representation bisulfite sequencing (RRBS). The methylation status of 1.8-2.3 million CpGs in the CLL genome was determined; about 45% of these CpGs were located in more than 23,000 CpG islands (CGIs). While global CpG methylation was similar between CLL and normal B-cells, 1764 gene promoters were identified as being differentially methylated in at least one CLL sample when compared with normal B-cell samples. Nineteen percent of the differentially methylated genes were involved in transcriptional regulation. Aberrant hypermethylation was found in all HOX gene clusters and a significant number of WNT signaling pathway genes. Hypomethylation occurred more frequently in the gene body, including introns, exons, and 3'-UTRs, in CLL. The NFATc1 P2 promoter and first intron were found to be hypomethylated, and this correlated with upregulation of both NFATc1 RNA and protein expression levels in CLL, suggesting that an epigenetic mechanism is involved in the constitutive activation of NFAT activity in CLL cells. This comprehensive DNA methylation analysis will further our understanding of the epigenetic contribution to cellular dysfunction in CLL.
Affiliation(s)
- Lirong Pei
  - GHSU Cancer Center; Georgia Health Sciences University; Augusta, GA, USA
16
Biedler JK, Hu W, Tae H, Tu Z. Identification of early zygotic genes in the yellow fever mosquito Aedes aegypti and discovery of a motif involved in early zygotic genome activation. PLoS One 2012; 7:e33933. [PMID: 22457801 PMCID: PMC3311545 DOI: 10.1371/journal.pone.0033933] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Received: 07/22/2011] [Accepted: 02/20/2012] [Indexed: 11/19/2022]
Abstract
During early embryogenesis the zygotic genome is transcriptionally silent and all mRNAs present are of maternal origin. The maternal-zygotic transition marks the time over which embryogenesis changes its dependence from maternal RNAs to zygotically transcribed RNAs. Here we present the first systematic investigation of early zygotic genes (EZGs) in a mosquito species and focus on genes involved in the onset of transcription during 2–4 hr. We used transcriptome sequencing to identify the “pure” (without maternal expression) EZGs by analyzing transcripts from four embryonic time ranges of 0–2, 2–4, 4–8, and 8–12 hr, which includes the time of cellular blastoderm formation and up to the start of gastrulation. Blast of 16,789 annotated transcripts vs. the transcriptome reads revealed evidence for 63 (P<0.001) and 143 (P<0.05) nonmaternally derived transcripts having a significant increase in expression at 2–4 hr. One third of the 63 EZG transcripts do not have predicted introns compared to 10% of all Ae. aegypti genes. We have confirmed by RT-PCR that zygotic transcription starts as early as 2–3 hours. A degenerate motif VBRGGTA was found to be overrepresented in the upstream sequences of the identified EZGs using a motif identification software called SCOPE. We find evidence for homology between this motif and the TAGteam motif found in Drosophila that has been implicated in EZG activation. A 38 bp sequence in the proximal upstream sequence of a kinesin light chain EZG (KLC2.1) contains two copies of the mosquito motif. This sequence was shown to support EZG transcription by luciferase reporter assays performed on injected early embryos, and confers early zygotic activity to a heterologous promoter from a divergent mosquito species. The results of these studies are consistent with the model of early zygotic genome activation via transcriptional activators, similar to what has been found recently in Drosophila.
Affiliation(s)
- James K. Biedler
- Department of Biochemistry, Virginia Tech, Blacksburg, Virginia, United States of America
- * E-mail: (JKB); (ZT)
- Wanqi Hu
- Department of Biochemistry, Virginia Tech, Blacksburg, Virginia, United States of America
- Hongseok Tae
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, United States of America
- Zhijian Tu
- Department of Biochemistry, Virginia Tech, Blacksburg, Virginia, United States of America
- * E-mail: (JKB); (ZT)
|
17
|
Radakovits R, Jinkerson RE, Fuerstenberg SI, Tae H, Settlage RE, Boore JL, Posewitz MC. Draft genome sequence and genetic transformation of the oleaginous alga Nannochloropis gaditana. Nat Commun 2012; 3:686. [PMID: 22353717 PMCID: PMC3293424 DOI: 10.1038/ncomms1688] [Citation(s) in RCA: 395] [Impact Index Per Article: 32.9] [Received: 10/27/2011] [Accepted: 01/17/2012] [Indexed: 11/09/2022] Open
Abstract
The potential use of algae in biofuels applications is receiving significant attention. However, none of the current algal model species are competitive production strains. Here we present a draft genome sequence and a genetic transformation method for the marine microalga Nannochloropsis gaditana CCMP526. We show that N. gaditana has highly favourable lipid yields, and is a promising production organism. The genome assembly includes nuclear (~29 Mb) and organellar genomes, and contains 9,052 gene models. We define the genes required for glycerolipid biogenesis and detail the differential regulation of genes during nitrogen-limited lipid biosynthesis. Phylogenomic analysis identifies genetic attributes of this organism, including unique stramenopile photosynthesis genes and gene expansions that may explain the distinguishing photoautotrophic phenotypes observed. The availability of a genome sequence and transformation methods will facilitate investigations into N. gaditana lipid biosynthesis and permit genetic engineering strategies to further improve this naturally productive alga.
Affiliation(s)
- Randor Radakovits
- Department of Chemistry and Geochemistry, Colorado School of Mines, Golden, Colorado 80401, USA
- These authors contributed equally to this work
- Robert E. Jinkerson
- Department of Chemistry and Geochemistry, Colorado School of Mines, Golden, Colorado 80401, USA
- These authors contributed equally to this work
- Hongseok Tae
- Data Analysis Core, Virginia Bioinformatics Institute, Virginia Tech, 1 Washington Street, Blacksburg, Virginia 24060, USA
- Robert E. Settlage
- Data Analysis Core, Virginia Bioinformatics Institute, Virginia Tech, 1 Washington Street, Blacksburg, Virginia 24060, USA
- Jeffrey L. Boore
- Genome Project Solutions, 1024 Promenade Street, Hercules, California 94547, USA
- Department of Integrative Biology, University of California, Berkeley, California 94720, USA
- Matthew C. Posewitz
- Department of Chemistry and Geochemistry, Colorado School of Mines, Golden, Colorado 80401, USA
|
18
|
Galindo CL, McIver LJ, Tae H, McCormick JF, Skinner MA, Hoeschele I, Lewis CM, Minna JD, Boothman DA, Garner HR. Sporadic breast cancer patients' germline DNA exhibit an AT-rich microsatellite signature. Genes Chromosomes Cancer 2011; 50:275-83. [PMID: 21319262 DOI: 10.1002/gcc.20853] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 11/08/2010] [Accepted: 12/13/2010] [Indexed: 11/11/2022] Open
Abstract
Using a custom CGH-like oligonucleotide array to measure the global microsatellite content in the genomes of 72 cancer, cancer-free, and high risk patient and cell line samples (56 germline DNA and 16 in tumor or tumor cell line DNA) we found a unique, reproducible, and statistically significant pattern of 18 motif-specific microsatellite families (out of 962 possible 1-6 mer repeats) in breast cancer patient germline and tumor DNA, but not in germline DNA of cancer-free volunteer controls or in breast cancer patients with BRCA1/2 mutations. These high-similarity A/T rich repetitive motifs were also more pronounced in the germlines and tumors of colon cancer tumor patients (3/6 samples) and microsatellite unstable colon cancer cell lines; however, germline DNA of sporadic breast cancer patients exhibited the largest global content shift for those motifs with extreme AT/GC ratios. These results indicate that global microsatellite variability is complex, suggest the existence of a previously unknown genomic destabilization mechanism in breast cancer patients' germline DNA, and warrant further testing of such microsatellite variability as a predictor of future breast cancer development.
Affiliation(s)
- Cristi L Galindo
- Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061-0477, USA
|
19
|
Choi JH, Kijimoto T, Snell-Rood E, Tae H, Yang Y, Moczek AP, Andrews J. Gene discovery in the horned beetle Onthophagus taurus. BMC Genomics 2010; 11:703. [PMID: 21156066 PMCID: PMC3019233 DOI: 10.1186/1471-2164-11-703] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.6] [Received: 05/24/2010] [Accepted: 12/14/2010] [Indexed: 01/03/2023] Open
Abstract
Background Horned beetles, in particular in the genus Onthophagus, are important models for studies on sexual selection, biological radiations, the origin of novel traits, developmental plasticity, biocontrol, conservation, and forensic biology. Despite their growing prominence as models for studying both basic and applied questions in biology, little genomic or transcriptomic data are available for this genus. We used massively parallel pyrosequencing (Roche 454-FLX platform) to produce a comprehensive EST dataset for the horned beetle Onthophagus taurus. To maximize sequence diversity, we pooled RNA extracted from a normalized library encompassing diverse developmental stages and both sexes. Results We used 454 pyrosequencing to sequence ESTs from all post-embryonic stages of O. taurus. Approximately 1.36 million reads assembled into 50,080 non-redundant sequences encompassing a total of 26.5 Mbp. The non-redundant sequences match over half of the genes in Tribolium castaneum, the most closely related species with a sequenced genome. Analyses of Gene Ontology annotations and biochemical pathways indicate that the O. taurus sequences reflect a wide and representative sampling of biological functions and biochemical processes. An analysis of sequence polymorphisms revealed that SNP frequency was negatively related to overall expression level and the number of tissue types in which a given gene is expressed. The most variable genes were enriched for a limited number of GO annotations whereas the least variable genes were enriched for a wide range of GO terms directly related to fitness. Conclusions This study provides the first large-scale EST database for horned beetles, a much-needed resource for advancing the study of these organisms. Furthermore, we identified instances of gene duplications and alternative splicing, useful for future study of gene regulation, and a large number of SNP markers that could be used in population-genetic studies of O. taurus and possibly other horned beetles.
Affiliation(s)
- Jeong-Hyeon Choi
- Department of Biology, Indiana University, Bloomington, Indiana 47405, USA
|
20
|
Tae H, Sohng JK, Park K. Development of an analysis program of type I polyketide synthase gene clusters using homology search and profile hidden Markov model. J Microbiol Biotechnol 2009; 19:140-6. [PMID: 19307762 DOI: 10.4014/jmb.0809.554] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/01/2022]
Abstract
MAPSI (Management and Analysis for Polyketide Synthase Type I) has been developed to offer computational analysis methods to detect type I PKS (polyketide synthase) gene clusters in genome sequences. MAPSI provides a genome analysis component, which detects PKS gene clusters by identifying domains in proteins of a genome. MAPSI also contains databases on polyketides and genome annotation data, as well as analytic components such as new PKS assembly and domain analysis. The polyketide data and analysis component are accessible through Web interfaces and are displayed with diverse information. MAPSI, which was developed to aid researchers studying type I polyketides, provides diverse components to access and analyze polyketide information and should become a very powerful computational tool for polyketide research. The system can be extended through further studies of factors related to the biological activities of polyketides.
Affiliation(s)
- Hongseok Tae
- Information Technology Institute, SmallSoft Co., Ltd., Daejeon 305-343, Korea
|
21
|
Abstract
Polyketides have diverse biological activities, including pharmacological functions such as antibiotic, antitumor and agrochemical properties. They are biosynthesized from short carboxylic acid precursors by polyketide synthases (PKSs). As natural polyketide products include many clinically important drugs and the volume of data on polyketides is rapidly increasing, the development of a database system to manage polyketide data is essential. MapsiDB is an integrated web database formulated to contain data on type I polyketides and their PKSs, including domain and module composition and related genome information. Data on polyketides were collected from journals and online resources and processed with analysis programs. Web interfaces were utilized to construct and to access this database, allowing polyketide researchers to add their data to this database and to use it easily.
Affiliation(s)
- Hongseok Tae
- SmallSoft Co, Ltd, Jang-Dong 59-5, Yusung-Gu, Daejeon 305-343, South Korea.
|
22
|
Abstract
BACKGROUND Polyketides are secondary metabolites of microorganisms with diverse biological activities, including pharmacological functions such as antibiotic, antitumor and agrochemical properties. Polyketides are synthesized by serialized reactions of a set of enzymes called polyketide synthases (PKSs), which coordinate the elongation of carbon skeletons by the stepwise condensation of short carbon precursors. Due to their importance as drugs, the volume of data on polyketides is rapidly increasing and creating a need for computational analysis methods for efficient polyketide research. Moreover, the increasing use of genetic engineering to research new kinds of polyketides requires genome-wide analysis. RESULTS We describe a system named ASMPKS (Analysis System for Modular Polyketide Synthesis) for computational analysis of PKSs against genome sequences. It also provides overall management of information on modular PKSs, including polyketide database construction, new PKS assembly, and chain visualization. ASMPKS operates on a web interface to construct the database and to analyze PKSs, allowing polyketide researchers to add their data to this database and to use it easily. In addition, ASMPKS can predict functional modules for a protein sequence submitted by users, estimate the chemical composition of a polyketide synthesized from the modules, and display the carbon chain structure on the web interface. CONCLUSION ASMPKS has powerful computation features to aid modular PKS research. As various factors, such as starter units and post-processing, are related to polyketide biosynthesis, ASMPKS will be improved through further development to study these factors.
Affiliation(s)
- Hongseok Tae
- Information Technology Institute, SmallSoft Co., Ltd., Jang-Dong 59-5, Yusung-Gu, Daejeon 305-343, South Korea
- Department of Computer Engineering, Chungnam National University, 220 Gung-dong, Daejeon 305-764, South Korea
- Eun-Bae Kong
- Department of Computer Engineering, Chungnam National University, 220 Gung-dong, Daejeon 305-764, South Korea
- Kiejung Park
- Information Technology Institute, SmallSoft Co., Ltd., Jang-Dong 59-5, Yusung-Gu, Daejeon 305-343, South Korea
|