1
Antunes J, Walichiewicz P, Forouzmand E, Barta R, Didier M, Han Y, Perez JC, Snedecor J, Zlatkov C, Padmabandu G, Devesse L, Radecke S, Holt CL, Kumar SA, Budowle B, Stephens KM. Developmental validation of the ForenSeq® Kintelligence kit, MiSeq FGx® sequencing system and ForenSeq Universal Analysis Software. Forensic Sci Int Genet 2024; 71:103055. [PMID: 38762965 DOI: 10.1016/j.fsigen.2024.103055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Revised: 04/23/2024] [Accepted: 04/25/2024] [Indexed: 05/21/2024]
Abstract
Forensic Investigative Genetic Genealogy, a recent subdiscipline of forensic genomics, leverages the high throughput and sensitivity of detection of next-generation sequencing and established genetic and genealogical approaches to support the identification of human remains from missing persons investigations and investigative lead generation in violent crimes. To facilitate forensic DNA evidence analysis, the ForenSeq® Kintelligence multiplex, consisting of 10,230 SNPs, was developed. Design of the ForenSeq Kintelligence Kit, the MiSeq FGx® Sequencing System and the ForenSeq Universal Analysis Software is described. Developmental validation in accordance with SWGDAM guidelines and forensic quality assurance standards, using single source samples, is reported for the end-to-end workflow from library preparation to data interpretation. Performance metrics support the conclusion that more genetic information can be obtained from challenging samples compared to other commercially available forensic targeted DNA assays developed for capillary electrophoresis (CE) or other current next-generation sequencing (NGS) kits due to the higher number of markers, the overall shorter amplicon sizes (97.8% <150 bp), and kit design. Data indicate that the multiplex is robust and fit for purpose for samples spanning a wide range of quantity and quality. The ForenSeq Kintelligence Kit and the Universal Analysis Software allow transfer of the genetic component of forensic investigative genetic genealogy to the operational forensic laboratory.
Affiliation(s)
- Joana Antunes
  - Verogen, Inc., now a QIAGEN company, 11111 Flintkote Ave., San Diego, CA 92121, USA
- Paulina Walichiewicz
  - Verogen, Inc., now a QIAGEN company, 11111 Flintkote Ave., San Diego, CA 92121, USA
- Elmira Forouzmand
  - Verogen, Inc., now a QIAGEN company, 11111 Flintkote Ave., San Diego, CA 92121, USA
- Richelle Barta
  - Verogen, Inc., now a QIAGEN company, 11111 Flintkote Ave., San Diego, CA 92121, USA
- Meghan Didier
  - Verogen, Inc., now a QIAGEN company, 11111 Flintkote Ave., San Diego, CA 92121, USA
- Yonmee Han
  - Verogen, Inc., now a QIAGEN company, 11111 Flintkote Ave., San Diego, CA 92121, USA
- Juan Carlos Perez
  - Verogen, Inc., now a QIAGEN company, 11111 Flintkote Ave., San Diego, CA 92121, USA
- June Snedecor
  - Verogen, Inc., now a QIAGEN company, 11111 Flintkote Ave., San Diego, CA 92121, USA
- Clare Zlatkov
  - Verogen, Inc., now a QIAGEN company, 11111 Flintkote Ave., San Diego, CA 92121, USA
- Gothami Padmabandu
  - Verogen, Inc., now a QIAGEN company, 11111 Flintkote Ave., San Diego, CA 92121, USA
- Laurence Devesse
  - Verogen, Inc., now a QIAGEN company, 11111 Flintkote Ave., San Diego, CA 92121, USA
- Sarah Radecke
  - Verogen, Inc., now a QIAGEN company, 11111 Flintkote Ave., San Diego, CA 92121, USA
- Cydne L Holt
  - Verogen, Inc., now a QIAGEN company, 11111 Flintkote Ave., San Diego, CA 92121, USA
- Swathi A Kumar
  - Verogen, Inc., now a QIAGEN company, 11111 Flintkote Ave., San Diego, CA 92121, USA
- Bruce Budowle
  - University of Helsinki, Department of Forensic Medicine, Haartmaninkatu 8, P.O. Box 63, Helsinki 00014, Finland; Forensic Science Institute, Radford University, Radford, VA 24142, USA
- Kathryn M Stephens
  - Verogen, Inc., now a QIAGEN company, 11111 Flintkote Ave., San Diego, CA 92121, USA
2
hgtseq: A Standard Pipeline to Study Horizontal Gene Transfer. Int J Mol Sci 2022; 23:ijms232314512. [PMID: 36498841 PMCID: PMC9738810 DOI: 10.3390/ijms232314512] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/24/2022] [Revised: 11/14/2022] [Accepted: 11/18/2022] [Indexed: 11/23/2022] Open
Abstract
Horizontal gene transfer (HGT) is well described in prokaryotes: it plays a crucial role in evolution, and has functional consequences in insects and plants. However, less is known about HGT in humans. Studies have reported bacterial integrations in cancer patients, and microbial sequences have been detected in data from well-known human sequencing projects. Few of the existing tools for investigating HGT are highly automated. Thanks to the adoption of Nextflow for life sciences workflows, and to the standards and best practices curated by communities such as nf-core, fully automated, portable, and scalable pipelines can now be developed. Here we present nf-core/hgtseq to facilitate the analysis of HGT from sequencing data in different organisms. We showcase its performance by analysing six exome datasets from five mammals. Hgtseq can be run seamlessly in any computing environment and accepts data generated by existing exome and whole-genome sequencing projects; this will enable researchers to expand their analyses into this area. Fundamental questions are still open about the mechanisms and the extent or role of horizontal gene transfer: by releasing hgtseq we provide a standardised tool which will enable a systematic investigation of this phenomenon, thus paving the way for a better understanding of HGT.
3
Whole-Genome Sequencing Reveals Age-Specific Changes in the Human Blood Microbiota. J Pers Med 2022; 12:jpm12060939. [PMID: 35743724 PMCID: PMC9225573 DOI: 10.3390/jpm12060939] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/12/2022] [Revised: 06/03/2022] [Accepted: 06/06/2022] [Indexed: 11/17/2022] Open
Abstract
Based on several reports that indicate the presence of blood microbiota in patients with diseases, we became interested in identifying the presence of bacteria in the blood of healthy individuals. Using 37 samples from 5 families, we extracted sequences that were not mapped to the human reference genome and mapped them to the bacterial reference genome for characterization. Proteobacteria account for more than 95% of the blood microbiota. The results of clustering by means of principal component analysis showed similar patterns for each age group. We observed that the class Gammaproteobacteria was significantly higher in the elderly group (over 60 years old), whereas the arcsine square root-transformed relative abundance of the classes Alphaproteobacteria, Deltaproteobacteria, and Clostridia was significantly lower (p < 0.05). In addition, the diversity among the groups showed a significant difference (p < 0.05) in the elderly group. This result provides meaningful evidence of a consistent phenomenon that chronic diseases associated with aging are accompanied by metabolic endotoxemia and chronic inflammation.
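The arcsine square root transformation mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and the read counts are hypothetical, chosen only to show how class-level counts become relative abundances and how those proportions are then transformed before statistical testing.

```python
import math
from typing import Dict


def relative_abundances(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-taxon read counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no classified reads")
    return {taxon: n / total for taxon, n in counts.items()}


def arcsine_sqrt(proportion: float) -> float:
    """Arcsine square root transform of a proportion in [0, 1] (result in radians)."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    return math.asin(math.sqrt(proportion))


# Hypothetical class-level read counts for one sample (illustration only).
sample_counts = {
    "Gammaproteobacteria": 960,
    "Alphaproteobacteria": 25,
    "Deltaproteobacteria": 10,
    "Clostridia": 5,
}

abundances = relative_abundances(sample_counts)
transformed = {taxon: arcsine_sqrt(p) for taxon, p in abundances.items()}

for taxon, value in transformed.items():
    print(f"{taxon}: {abundances[taxon]:.3f} -> {value:.4f}")
```

The transform stretches values near 0 and 1, which is why it is commonly applied to proportions before tests that assume roughly constant variance.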
4
Bovo S, Schiavo G, Bolner M, Ballan M, Fontanesi L. Mining livestock genome datasets for an unconventional characterization of animal DNA viromes. Genomics 2022; 114:110312. [DOI: 10.1016/j.ygeno.2022.110312] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/26/2021] [Revised: 01/16/2022] [Accepted: 02/06/2022] [Indexed: 11/04/2022]
5
de Villiers EM, zur Hausen H. Bovine Meat and Milk Factors (BMMFs): Their Proposed Role in Common Human Cancers and Type 2 Diabetes Mellitus. Cancers (Basel) 2021; 13:cancers13215407. [PMID: 34771570 PMCID: PMC8582480 DOI: 10.3390/cancers13215407] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/23/2021] [Revised: 10/24/2021] [Accepted: 10/25/2021] [Indexed: 12/16/2022] Open
Abstract
Simple Summary: This manuscript emphasizes the mechanistic differences of infectious agents contributing to human cancers either by “direct” or “indirect” interactions. The epidemiology of cancers linked to direct carcinogens differs (e.g., response to immunosuppression) from those cancers linked with indirect infectious interactions. We discuss their role in colon, breast, and prostate cancers and type II diabetes mellitus. A brief discussion covers the potential role of BMMF (bovine meat and milk factor) infections in acute myeloid leukemia.
Abstract: Exemplified by infections with bovine meat and milk factors (BMMFs), this manuscript emphasizes the different mechanistic aspects of infectious agents contributing to human cancers by “direct” or “indirect” interactions. The epidemiology of cancers linked to direct carcinogens (e.g., response to immunosuppression) differs from those cancers linked with indirect infectious interactions. Cancers induced by direct infectious carcinogens commonly increase under immunosuppression, whereas the cancer risk by indirect carcinogens is reduced. This influences their responses to preventive and therapeutic interferences. In addition, we discuss their role in colon, breast and prostate cancers and type II diabetes mellitus. A brief discussion covers the potential role of BMMF infections in acute myeloid leukemia.
Affiliation(s)
- Ethel-Michele de Villiers
  - Correspondence: (E.-M.d.V.); (H.z.H.); Tel.: +49-151-4312-3085 (E.-M.d.V.); +49-6221-423850 (H.z.H.)
- Harald zur Hausen
  - Correspondence: (E.-M.d.V.); (H.z.H.); Tel.: +49-151-4312-3085 (E.-M.d.V.); +49-6221-423850 (H.z.H.)
6
Adekanmbi F, McNeely I, Omeler S, Kalalah A, Poudel A, Merner N, Wang C. Absence of bovine leukemia virus in the buffy coats of breast cancer cases from Alabama, USA. Microb Pathog 2021; 161:105238. [PMID: 34653545 DOI: 10.1016/j.micpath.2021.105238] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/23/2021] [Revised: 10/04/2021] [Accepted: 10/06/2021] [Indexed: 11/19/2022]
Abstract
Breast cancer is reported as one of the most common and deadly cancers among females. Recent findings have suggested that bovine leukemia virus (BLV), a highly prevalent bovine virus worldwide, might be linked to human breast cancer. However, the involvement of BLV as a risk factor for breast cancer remains controversial. In this study, BLV FRET-PCR was carried out on 238 blood-derived DNA samples from breast cancer patients from the Alabama Hereditary Cancer Cohort. In addition, randomly selected samples (n = 20) were evaluated by WGS for the presence of the BLV genome. No BLV proviral DNA was detected in any of the 238 samples assayed by FRET-qPCR in this study. Similarly, the WGS analysis did not detect the presence of the BLV genome in the DNA of the buffy coats from 20 randomly selected patients with breast cancer. This study did not support earlier findings suggesting an association between BLV and breast cancer. Notably, nearly all the studies using in situ PCR and immunohistochemistry demonstrated positive associations, while other studies using whole-genome sequencing and other methods failed to identify an association between BLV and breast cancer. Further studies applying all reported BLV detection techniques to the same breast cancer sample sets would appear to be the most likely way of resolving the current contradictory evidence.
Affiliation(s)
- Isaac McNeely
  - Department of Pathobiology, Auburn University, AL, USA
- Anwar Kalalah
  - Department of Pathobiology, Auburn University, AL, USA
- Anil Poudel
  - Department of Pathobiology, Auburn University, AL, USA
- Nancy Merner
  - Department of Pathobiology, Auburn University, AL, USA
7
Rodriguez RM, Menor M, Hernandez BY, Deng Y, Khadka VS. Bacterial Diversity Correlates with Overall Survival in Cancers of the Head and Neck, Liver, and Stomach. Molecules 2021; 26:5659. [PMID: 34577130 PMCID: PMC8468759 DOI: 10.3390/molecules26185659] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/30/2021] [Revised: 09/15/2021] [Accepted: 09/15/2021] [Indexed: 11/16/2022] Open
Abstract
One in five cancers is attributed to infectious agents, and the extent of the impact on the initiation, progression, and disease outcomes may be underestimated. Infection-associated cancers are commonly attributed to viral, and to a lesser extent, parasitic and bacterial etiologies. There is growing evidence that microbial community variation rather than a single agent can influence cancer development, progression, response to therapy, and outcome. We evaluated microbial sequences from a subset of infection-associated cancers, namely head and neck squamous cell carcinoma (HNSC), liver hepatocellular carcinoma (LIHC), and stomach adenocarcinoma (STAD), from The Cancer Genome Atlas (TCGA). A total of 470 paired tumor and adjacent normal samples were analyzed. In STAD, concurrent presence of EBV and Selenomonas sputigena with a high diversity index were associated with poorer survival (HR: 2.23, 95% CI 1.26-3.94, p = 0.006 and HR: 2.31, 95% CI 1.1-4.9, p = 0.03, respectively). In LIHC, lower microbial diversity was associated with poorer overall survival (HR: 2.57, 95% CI: 1.2, 5.5, p = 0.14). Bacterial within-sample diversity correlates with overall survival in infection-associated cancers in a subset of TCGA cohorts.
Affiliation(s)
- Rebecca M. Rodriguez
  - Bioinformatics Core, Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii Mānoa, Honolulu, HI 96813, USA
  - Population Sciences in the Pacific Program-Cancer Epidemiology, University of Hawaii Cancer Center, Honolulu, HI 96813, USA
  - National Institute of Diabetes and Digestive and Kidney Diseases, NIH, Bethesda, MD 20892, USA
- Mark Menor
  - Bioinformatics Core, Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii Mānoa, Honolulu, HI 96813, USA
- Brenda Y. Hernandez
  - Population Sciences in the Pacific Program-Cancer Epidemiology, University of Hawaii Cancer Center, Honolulu, HI 96813, USA
- Youping Deng
  - Bioinformatics Core, Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii Mānoa, Honolulu, HI 96813, USA
- Vedbar S. Khadka
  - Bioinformatics Core, Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii Mānoa, Honolulu, HI 96813, USA
8
Abstract
Whole-genome sequencing (WGS) is becoming the de facto standard for bacterial typing and outbreak surveillance of resistant bacterial pathogens. However, interoperability for WGS of bacterial outbreaks is poorly understood. We hypothesized that harmonization of WGS for outbreak surveillance is achievable through the use of identical protocols for both data generation and data analysis. A set of 30 bacterial isolates, comprising various species belonging to the Enterobacteriaceae family and the Enterococcus genus, were selected and sequenced using the same protocol on the Illumina MiSeq platform in each individual centre. All generated sequencing data were analysed by one centre using BioNumerics (6.7.3) for (i) genotyping origins of replication and antimicrobial resistance genes, (ii) core-genome multi-locus sequence typing (cgMLST) for Escherichia coli and Klebsiella pneumoniae and whole-genome multi-locus sequence typing (wgMLST) for all species. Additionally, a split k-mer analysis was performed to determine the number of SNPs between samples. A precision of 99.0% and an accuracy of 99.2% were achieved for genotyping. Based on cgMLST, a discrepant allele was called only in 2/27 and 3/15 comparisons between two genomes, for E. coli and K. pneumoniae, respectively. Based on wgMLST, the number of discrepant alleles ranged from 0 to 7 (average 1.6). For SNPs, this ranged from 0 to 11 SNPs (average 3.4). Furthermore, we demonstrate that using different de novo assemblers to analyse the same dataset introduces up to 150 SNPs, which surpasses most thresholds for bacterial outbreaks. This shows the importance of harmonizing data processing in the surveillance of bacterial outbreaks. In summary, multi-centre WGS for bacterial surveillance is achievable, but only if protocols are harmonized.
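The "discrepant allele" counts reported in this abstract can be pictured with a small Python sketch. This is not the BioNumerics algorithm: the function name, the locus names, and the allele numbers are hypothetical, and skipping loci that were not called in both profiles is an assumption made for illustration.

```python
from typing import Dict, Optional

# An allelic profile maps a locus name to an allele number, or None if uncalled.
Profile = Dict[str, Optional[int]]


def discrepant_alleles(profile_a: Profile, profile_b: Profile) -> int:
    """Count loci called in both profiles whose allele numbers differ.

    Loci that are absent or uncalled (None) in either profile are skipped,
    which is an illustrative assumption rather than a documented rule.
    """
    differences = 0
    for locus in profile_a.keys() & profile_b.keys():
        allele_a = profile_a[locus]
        allele_b = profile_b[locus]
        if allele_a is None or allele_b is None:
            continue
        if allele_a != allele_b:
            differences += 1
    return differences


# Hypothetical cgMLST profiles of the same isolate sequenced at two centres.
centre_1: Profile = {"locus_001": 4, "locus_002": 17, "locus_003": 2, "locus_004": None}
centre_2: Profile = {"locus_001": 4, "locus_002": 18, "locus_003": 2, "locus_004": 9}

print(discrepant_alleles(centre_1, centre_2))
```

In a harmonized workflow the same isolate sequenced at two sites should give a count at or near zero, so a pairwise count like this is a simple way to express between-centre reproducibility.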
9
Cordoba J, Perez E, Van Vlierberghe M, Bertrand AR, Lupo V, Cardol P, Baurain D. De Novo Transcriptome Meta-Assembly of the Mixotrophic Freshwater Microalga Euglena gracilis. Genes (Basel) 2021; 12:842. [PMID: 34072576 PMCID: PMC8227486 DOI: 10.3390/genes12060842] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/08/2021] [Revised: 05/24/2021] [Accepted: 05/27/2021] [Indexed: 01/01/2023] Open
Abstract
Euglena gracilis is a well-known photosynthetic microeukaryote considered to be the product of a secondary endosymbiosis between a green alga and a phagotrophic unicellular organism belonging to the same eukaryotic phylum as the parasitic trypanosomatids. As its nuclear genome has proven difficult to sequence, reliable transcriptomes are important for functional studies. In this work, we assembled a new consensus transcriptome by combining sequencing reads from five independent studies. Based on a detailed comparison with two previously released transcriptomes, our consensus transcriptome appears to be the most complete so far. Remapping the reads on it allowed us to compare the expression of the transcripts across multiple culture conditions at once and to infer a functionally annotated network of co-expressed genes. Although the emergence of meaningful gene clusters indicates that some biological signal lies in gene expression levels, our analyses confirm that gene regulation in euglenozoans is not primarily controlled at the transcriptional level. Regarding the origin of E. gracilis, we observe a heavily mixed gene ancestry, as previously reported, and rule out sequence contamination as a possible explanation for these observations. Instead, they indicate that this complex alga has evolved through a convoluted process involving much more than two partners.
Affiliation(s)
- Javier Cordoba
  - InBioS—PhytoSYSTEMS, Laboratoire de Génétique et Physiologie des Microalgues, ULiège, B-4000 Liège, Belgium
- Emilie Perez
  - InBioS—PhytoSYSTEMS, Laboratoire de Génétique et Physiologie des Microalgues, ULiège, B-4000 Liège, Belgium
  - InBioS—PhytoSYSTEMS, Unit of Eukaryotic Phylogenomics, ULiège, B-4000 Liège, Belgium
- Mick Van Vlierberghe
  - InBioS—PhytoSYSTEMS, Unit of Eukaryotic Phylogenomics, ULiège, B-4000 Liège, Belgium
- Amandine R. Bertrand
  - InBioS—PhytoSYSTEMS, Unit of Eukaryotic Phylogenomics, ULiège, B-4000 Liège, Belgium
- Valérian Lupo
  - InBioS—PhytoSYSTEMS, Unit of Eukaryotic Phylogenomics, ULiège, B-4000 Liège, Belgium
- Pierre Cardol
  - InBioS—PhytoSYSTEMS, Laboratoire de Génétique et Physiologie des Microalgues, ULiège, B-4000 Liège, Belgium
- Denis Baurain
  - InBioS—PhytoSYSTEMS, Unit of Eukaryotic Phylogenomics, ULiège, B-4000 Liège, Belgium
10
Mwesigwa S, Williams L, Retshabile G, Katagirya E, Mboowa G, Mlotshwa B, Kyobe S, Kateete DP, Wampande EM, Wayengera M, Mpoloka SW, Mirembe AN, Kasvosve I, Morapedi K, Kisitu GP, Kekitiinwa AR, Anabwani G, Joloba ML, Matovu E, Mulindwa J, Noyes H, Botha G, Brown CW, Mardon G, Matshaba M, Hanchard NA. Unmapped exome reads implicate a role for Anelloviridae in childhood HIV-1 long-term non-progression. NPJ Genom Med 2021; 6:24. [PMID: 33741997 PMCID: PMC7979878 DOI: 10.1038/s41525-021-00185-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/28/2020] [Accepted: 01/25/2021] [Indexed: 01/31/2023] Open
Abstract
Human immunodeficiency virus (HIV) infection remains a significant public health burden globally. The role of viral co-infection in the rate of progression of HIV infection has been suggested but not empirically tested, particularly among children. We extracted and classified 42 viral species from whole-exome sequencing (WES) data of 813 HIV-infected children in Botswana and Uganda categorised as either long-term non-progressors (LTNPs) or rapid progressors (RPs). The Ugandan participants had a higher viral community diversity index compared to Batswana (p = 4.6 × 10⁻¹³), and viral sequences were more frequently detected among LTNPs than RPs (24% vs 16%; p = 0.008; OR, 1.9; 95% CI, 1.6-2.3), with Anelloviridae showing strong association with LTNP status (p = 3 × 10⁻⁴; q = 0.004, OR, 3.99; 95% CI, 1.74-10.25). This trend was still evident when stratified by country, sex, and sequencing platform, and after a logistic regression analysis adjusting for age, sex, country, and the sequencing platform (p = 0.02; q = 0.03; OR, 7.3; 95% CI, 1.6-40.5). Torque teno virus (TTV), which made up 95% of the Anelloviridae reads, has been associated with reduced immune activation. We identify an association between viral co-infection and prolonged AIDS-free survival status that may have utility as a biomarker of LTNP and could provide mechanistic insights into HIV progression in children, demonstrating the added value of interrogating off-target WES reads in cohort studies.
Affiliation(s)
- Eric Katagirya
  - College of Health Sciences, Makerere University, Kampala, Uganda
- Gerald Mboowa
  - College of Health Sciences, Makerere University, Kampala, Uganda
- Samuel Kyobe
  - College of Health Sciences, Makerere University, Kampala, Uganda
- David P Kateete
  - College of Health Sciences, Makerere University, Kampala, Uganda
- Misaki Wayengera
  - College of Health Sciences, Makerere University, Kampala, Uganda
- Angella N Mirembe
  - Baylor College of Medicine Children's Foundation Uganda (Baylor Uganda), Kampala, Uganda
- Grace P Kisitu
  - Baylor College of Medicine Children's Foundation Uganda (Baylor Uganda), Kampala, Uganda
- Adeodata R Kekitiinwa
  - Baylor College of Medicine Children's Foundation Uganda (Baylor Uganda), Kampala, Uganda
- Gabriel Anabwani
  - Botswana-Baylor Children's Clinical Centre of Excellence, Gaborone, Botswana
- Moses L Joloba
  - College of Health Sciences, Makerere University, Kampala, Uganda
- Enock Matovu
  - College of Veterinary Medicine, Animal Resources and Biosecurity, Makerere University, Kampala, Uganda
- Julius Mulindwa
  - College of Veterinary Medicine, Animal Resources and Biosecurity, Makerere University, Kampala, Uganda
- Harry Noyes
  - Institute of Integrative Biology, University of Liverpool, Liverpool, UK
- Gerrit Botha
  - Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Cape Town, South Africa
- Chester W Brown
  - University of Tennessee Health Science Center, Le Bonheur Children's Hospital, Memphis, TN, USA
- Graeme Mardon
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Mogomotsi Matshaba
  - Botswana-Baylor Children's Clinical Centre of Excellence, Gaborone, Botswana
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Neil A Hanchard
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
11
Chen X, Li D. Sequencing facility and DNA source associated patterns of virus-mappable reads in whole-genome sequencing data. Genomics 2021; 113:1189-1198. [PMID: 33301893 PMCID: PMC7856238 DOI: 10.1016/j.ygeno.2020.12.004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/25/2020] [Revised: 11/25/2020] [Accepted: 12/04/2020] [Indexed: 12/12/2022]
Abstract
Numerous viral sequences have been reported in the whole-genome sequencing (WGS) data of human blood. However, it is not clear to what degree the virus-mappable reads represent true viral sequences rather than random-mapping or noise originating from sample preparation, sequencing processes, or other sources. Identification of patterns of virus-mappable reads may generate novel indicators for evaluating the origins of these viral sequences. We characterized paired-end unmapped reads and reads aligned to viral references in human WGS datasets, then compared patterns of the virus-mappable reads among DNA sources and sequencing facilities which produced these datasets. We then examined potential origins of the source- and facility-associated viral reads. The proportions of clean unmapped reads among the seven sequencing facilities were significantly different (P < 2 × 10⁻¹⁶). We identified 260,339 reads that were mappable to a total of 99 viral references in 2535 samples. The majority (86.7%) of these virus-mappable reads (corresponding to 47 viral references), which can be classified into four groups based on their distinct patterns, were strongly associated with sequencing facility or DNA source (adjusted P value <0.01). Possible origins of these reads include artificial sequences in library preparation, recombinant vectors in cell culture, and phages co-contaminated with their host bacteria. The sequencing facility-associated virus-mappable reads and patterns were repeatedly observed in other datasets produced in the same facilities. We have constructed an analytic framework and profiled the unmapped reads mappable to viral references. The results provide a new understanding of sequencing facility- and DNA source-associated batch effects in deep sequencing data and may facilitate improved bioinformatics filtering of reads.
Affiliation(s)
- Xun Chen
  - Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, VT 05405, USA
- Dawei Li
  - Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, VT 05405, USA; Department of Computer Science, University of Vermont, Burlington, VT 05405, USA; Neuroscience, Behavior, Health Initiative, University of Vermont, Burlington, VT 05405, USA
12
Rodriguez RM, Khadka VS, Menor M, Hernandez BY, Deng Y. Tissue-associated microbial detection in cancer using human sequencing data. BMC Bioinformatics 2020; 21:523. [PMID: 33272199 PMCID: PMC7713026 DOI: 10.1186/s12859-020-03831-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/19/2020] [Accepted: 10/21/2020] [Indexed: 12/19/2022] Open
Abstract
Cancer is one of the leading causes of morbidity and mortality worldwide. Microbiological infections account for up to 20% of the total global cancer burden. The human microbiota within each organ system is distinct, and its compositional variation and interactions with the human host are known to have both detrimental and beneficial effects on tumor progression. With the advent of next-generation sequencing (NGS) technologies, data generated from NGS are being used for pathogen detection in cancer. Numerous bioinformatics computational frameworks have been developed to study viral information from host-sequencing data and can be adapted to bacterial studies. This review highlights existing popular computational frameworks that take NGS data as input to decipher microbial composition, and whose output can predict functional compositional differences with clinically relevant applicability in the development of treatment and prevention strategies.
Affiliation(s)
- Rebecca M. Rodriguez
  - Bioinformatics Core, Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii, Mānoa, Honolulu, HI USA
  - Population Sciences in the Pacific Program-Cancer Epidemiology, Honolulu, HI USA
  - NIDDK Central Repository, National Institute of Diabetes and Digestive and Kidney Diseases, NIH, Bethesda, USA
- Vedbar S. Khadka
  - Bioinformatics Core, Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii, Mānoa, Honolulu, HI USA
- Mark Menor
  - Bioinformatics Core, Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii, Mānoa, Honolulu, HI USA
- Brenda Y. Hernandez
  - Epidemiology, University of Hawaii Cancer Center, University of Hawaii, Honolulu, HI USA
  - Population Sciences in the Pacific Program-Cancer Epidemiology, Honolulu, HI USA
- Youping Deng
  - Bioinformatics Core, Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii, Mānoa, Honolulu, HI USA
13
Rodriguez RM, Hernandez BY, Menor M, Deng Y, Khadka VS. The landscape of bacterial presence in tumor and adjacent normal tissue across 9 major cancer types using TCGA exome sequencing. Comput Struct Biotechnol J 2020; 18:631-641. [PMID: 32257046 PMCID: PMC7109368 DOI: 10.1016/j.csbj.2020.03.003] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 10/01/2019] [Revised: 03/02/2020] [Accepted: 03/06/2020] [Indexed: 12/26/2022] Open
Abstract
Identification of microbial composition directly from tumor tissue permits studying the relationship between microbial changes and cancer pathogenesis. We interrogated bacterial presence in tumor and adjacent normal tissue strictly in pairs utilizing human whole exome sequencing to generate microbial profiles. Profiles were generated for 813 cases from stomach, liver, colon, rectal, lung, head & neck, cervical and bladder TCGA cohorts. Core microbiota examination revealed twelve taxa to be common across the nine cancer types at all classification levels. Paired analyses demonstrated significant differences in bacterial shifts between tumor and adjacent normal tissue across stomach, colon, lung squamous cell, and head & neck cohorts, whereas little or no differences were evident in liver, rectal, lung adenocarcinoma, cervical and bladder cancer cohorts in adjusted models. Helicobacter pylori in stomach and Bacteroides vulgatus in colon were found to be significantly higher in adjacent normal compared to tumor tissue after false discovery rate correction. Computational results were validated with tissue from an independent population by species-specific qPCR showing similar patterns of co-occurrence among Fusobacterium nucleatum and Selenomonas sputigena in gastric samples. This study demonstrates the ability to identify bacteria differential composition derived from human tissue whole exome sequences. Taken together our results suggest the microbial profiles shift with advanced disease and that the microbial composition of the adjacent tissue can be indicative of cancer stage disease progression.
Key Words
- BLCA, bladder carcinoma
- CESC, cervical & endocervical squamous cell carcinomas
- COAD, colon adenocarcinoma
- COREAD, colon and rectal adenocarcinoma TCGA cohorts
- Cancer microbiome
- Exome sequencing
- HNSC, head & neck squamous cell carcinoma
- L2FC, log 2 fold change
- LIHC, liver hepatocellular carcinoma
- LUAD, lung adenocarcinoma
- LUSC, lung squamous cell carcinoma
- Microbial landscape
- READ, rectal adenocarcinoma
- STAD, stomach adenocarcinoma
- TCGA
- TCGA, The Cancer Genome Atlas
Affiliation(s)
- Rebecca M. Rodriguez
  - Bioinformatics Core, Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii Mānoa, Honolulu, HI, United States
  - Population Sciences in the Pacific Program-Cancer Epidemiology, University of Hawaii Cancer Center, Honolulu, HI, United States
- Brenda Y. Hernandez
  - Epidemiology, University of Hawaii Cancer Center, University of Hawaii, Honolulu, HI, United States
  - Population Sciences in the Pacific Program-Cancer Epidemiology, University of Hawaii Cancer Center, Honolulu, HI, United States
- Mark Menor
  - Bioinformatics Core, Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii Mānoa, Honolulu, HI, United States
- Youping Deng
  - Bioinformatics Core, Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii Mānoa, Honolulu, HI, United States
- Vedbar S. Khadka
  - Bioinformatics Core, Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii Mānoa, Honolulu, HI, United States
|
14
|
Swanson GM, Moskovtsev S, Librach C, Pilsner JR, Goodrich R, Krawetz SA. What human sperm RNA-Seq tells us about the microbiome. J Assist Reprod Genet 2020; 37:359-368. [PMID: 31902104 PMCID: PMC7056791 DOI: 10.1007/s10815-019-01672-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 10/15/2019] [Accepted: 12/19/2019] [Indexed: 10/25/2022] Open
Abstract
PURPOSE: The study was designed to assess the capacity of human sperm RNA-seq data to gauge the diversity of the associated microbiome within the ejaculate. METHODS: Semen samples were collected, and semen parameters evaluated at time of collection. Sperm RNA was isolated and subjected to RNA-seq. Microbial composition was determined by aligning sequencing reads not mapped to the human genome to the NCBI RefSeq bacterial, viral and archaeal genomes following RNA-Seq. Analysis of microbial assignments utilized phyloseq and vegan. RESULTS: Microbial composition within each sample was characterized as a function of microbial associated RNAs. Bacteria known to be associated with the male reproductive tract were present at similar levels in all samples representing 11 genera from four phyla with one exception, an outlier. Shannon diversity index (p < 0.001) and beta diversity (unweighted UniFrac distances, p = 9.99e-4; beta dispersion, p = 0.006) indicated the outlier was significantly different from all other samples. The outlier sample exhibited a dramatic increase in Streptococcus. Multiple testing indicated two operational taxonomic units, S. agalactiae and S. dysgalactiae (p = 0.009), were present. CONCLUSION: These results provide a first look at the microbiome as a component of human sperm RNA sequencing that has sufficient sensitivity to identify contamination or potential pathogenic bacterial colonization at least among the known contributors.
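The Shannon diversity index mentioned in this abstract can be illustrated with a minimal Python sketch. This is not the authors' analysis (the abstract states that phyloseq and vegan were used); the function name `shannon_index` and the per-genus read counts are hypothetical and chosen only to show why a sample dominated by one genus scores lower than a balanced one.

```python
import math

def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Hypothetical read counts per genus (illustrative values, not data from the study).
balanced_sample = [250, 250, 250, 250]
dominated_sample = [970, 10, 10, 10]  # one genus dominates, as in an outlier sample

h_balanced = shannon_index(balanced_sample)
h_dominated = shannon_index(dominated_sample)
print(f"balanced:  H = {h_balanced:.3f}")
print(f"dominated: H = {h_dominated:.3f}")
```

With equal abundances the index reaches its maximum, ln(number of taxa); a single dominant genus pulls it toward zero.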
Affiliation(s)
- Grace M Swanson
  - Department of Obstetrics and Gynecology, Center for Molecular Medicine and Genetics, Wayne State University School of Medicine, 275 E. Hancock, Detroit, MI, 48202, USA
- J Richard Pilsner
  - Department of Environmental Health Sciences, University of Massachusetts Amherst School of Public Health and Health Sciences, Amherst, MA, 01003, USA
- Robert Goodrich
  - Department of Obstetrics and Gynecology, Center for Molecular Medicine and Genetics, Wayne State University School of Medicine, 275 E. Hancock, Detroit, MI, 48202, USA
- Stephen A Krawetz
  - Department of Obstetrics and Gynecology, Center for Molecular Medicine and Genetics, Wayne State University School of Medicine, 275 E. Hancock, Detroit, MI, 48202, USA
|
15
|
Sangiovanni M, Granata I, Thind AS, Guarracino MR. From trash to treasure: detecting unexpected contamination in unmapped NGS data. BMC Bioinformatics 2019; 20:168. [PMID: 30999839 PMCID: PMC6472186 DOI: 10.1186/s12859-019-2684-x] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Indexed: 02/08/2023] Open
Abstract
Background: Next Generation Sequencing (NGS) experiments produce millions of short sequences that, mapped to a reference genome, provide biological insights at genomic, transcriptomic and epigenomic level. Typically, the fraction of reads that maps correctly to the reference genome ranges between 70% and 90%, leaving in some cases a substantial fraction of unmapped sequences. This 'misalignment' can be ascribed to low quality bases or sequence differences between the sample reads and the reference genome. Investigating the source of the unmapped reads is important to better assess the quality of the whole experiment and to check for possible downstream or upstream 'contamination' from exogenous nucleic acids. Results: Here we propose DecontaMiner, a tool to unravel the presence of contaminating sequences among the unmapped reads. It uses a subtraction approach to identify bacterial, fungal and viral genome contamination. DecontaMiner generates several output files to track all the processed reads, and to provide a complete report of their characteristics. Good-quality matches to microorganism genomes are counted and compared among samples. DecontaMiner builds an offline HTML page containing summary statistics and plots. The latter are obtained using the state-of-the-art D3 javascript libraries. DecontaMiner has been mainly used to detect contamination in human RNA-Seq data. The software is freely available at http://www-labgtp.na.icar.cnr.it/decontaminer. Conclusions: DecontaMiner is a tool designed and developed to investigate the presence of contaminating sequences in unmapped NGS data. It can suggest the presence of contaminating organisms in sequenced samples, which might derive either from laboratory contamination or from their biological source, and in both cases can be considered worthy of further investigation and experimental validation. The novelty of DecontaMiner is mainly represented by its easy integration with the standard procedures of NGS data analysis, while providing a complete, reliable, and automatic pipeline.
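The first step that any analysis of unmapped reads relies on, pulling the unmapped records out of an alignment file, can be sketched in a few lines of Python. This is not DecontaMiner's code; it only assumes the SAM text format, in which bit 0x4 of the FLAG field marks an unmapped segment. The function name `unmapped_reads` and the toy records are hypothetical.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"

def unmapped_reads(sam_lines):
    """Yield (read_name, sequence) for SAM records whose FLAG has bit 0x4 set."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):  # skip blank and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM alignment record has 11 mandatory fields
            continue
        if int(fields[1]) & UNMAPPED_FLAG:
            yield fields[0], fields[9]  # QNAME and SEQ

# Toy SAM content (hypothetical reads, for illustration only).
toy_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGACCGA\tIIIIIIII",
    "read3\t77\t*\t0\t0\t*\t*\t0\t0\tGGCATTAC\tIIIIIIII",
]

for name, seq in unmapped_reads(toy_sam):
    print(name, seq)
```

The extracted sequences would then be quality-filtered and compared against microbial references; those later steps are outside this sketch.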
Affiliation(s)
- Mara Sangiovanni
  - Stazione Zoologica Anton Dohrn, Villa Comunale, Napoli, 80121, Italy
- Ilaria Granata
  - High Performance Computing and Networking Institute, National Research Council of Italy, Via P. Castellino, 111, Napoli, 80131, Italy
- Amarinder Singh Thind
  - High Performance Computing and Networking Institute, National Research Council of Italy, Via P. Castellino, 111, Napoli, 80131, Italy
- Mario Rosario Guarracino
  - High Performance Computing and Networking Institute, National Research Council of Italy, Via P. Castellino, 111, Napoli, 80131, Italy
|
16
|
Chen X, Kost J, Sulovari A, Wong N, Liang WS, Cao J, Li D. A virome-wide clonal integration analysis platform for discovering cancer viral etiology. Genome Res 2019; 29:819-830. [PMID: 30872350 PMCID: PMC6499315 DOI: 10.1101/gr.242529.118] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Received: 07/31/2018] [Accepted: 03/11/2019] [Indexed: 12/31/2022]
Abstract
Oncoviral infection is responsible for 12%–15% of cancer in humans. Convergent evidence from epidemiology, pathology, and oncology suggests that new viral etiologies for cancers remain to be discovered. Oncoviral profiles can be obtained from cancer genome sequencing data; however, widespread viral sequence contamination and noncausal viruses complicate the process of identifying genuine oncoviruses. Here, we propose a novel strategy to address these challenges by performing virome-wide screening of early-stage clonal viral integrations. To implement this strategy, we developed VIcaller, a novel platform for identifying viral integrations that are derived from any characterized viruses and shared by a large proportion of tumor cells using whole-genome sequencing (WGS) data. The sensitivity and precision were confirmed with simulated and benchmark cancer data sets. By applying this platform to cancer WGS data sets with proven or speculated viral etiology, we newly identified or confirmed clonal integrations of hepatitis B virus (HBV), human papillomavirus (HPV), Epstein-Barr virus (EBV), and BK Virus (BKV), suggesting the involvement of these viruses in early stages of tumorigenesis in affected tumors, such as HBV in TERT and KMT2B (also known as MLL4) gene loci in liver cancer, HPV and BKV in bladder cancer, and EBV in non-Hodgkin's lymphoma. We also showed the capacity of VIcaller to identify integrations from some uncharacterized viruses. This is the first study to systematically investigate the strategy and method of virome-wide screening of clonal integrations to identify oncoviruses. Searching clonal viral integrations with our platform has the capacity to identify virus-caused cancers and discover cancer viral etiologies.
Affiliation(s)
- Xun Chen
  - Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, Vermont 05405, USA
- Jason Kost
  - Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, Vermont 05405, USA
- Arvis Sulovari
  - Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, Vermont 05405, USA
- Nathalie Wong
  - Department of Anatomical and Cellular Pathology, Chinese University of Hong Kong, Prince of Wales Hospital, Shatin, NT, Hong Kong 999077, P.R. China
- Winnie S Liang
  - Translational Genomics Research Institute, Phoenix, Arizona 85004, USA
- Jian Cao
  - Division of Medical Oncology, Rutgers Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, New Jersey 08903, USA
  - Department of Medicine, Rutgers Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, New Brunswick, New Jersey 08903, USA
- Dawei Li
  - Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, Vermont 05405, USA
  - Neuroscience, Behavior, and Health Initiative, University of Vermont, Burlington, Vermont 05405, USA
  - Department of Computer Science, University of Vermont, Burlington, Vermont 05405, USA
|
17
|
Uphoff CC, Pommerenke C, Denkmann SA, Drexler HG. Screening human cell lines for viral infections applying RNA-Seq data analysis. PLoS One 2019; 14:e0210404. [PMID: 30629668 PMCID: PMC6328144 DOI: 10.1371/journal.pone.0210404] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 09/28/2018] [Accepted: 12/21/2018] [Indexed: 01/09/2023] Open
Abstract
Monitoring viral infections of cell cultures is largely neglected although the viruses may have an impact on the physiology of cells and may constitute a biohazard regarding laboratory safety and safety of bioactive agents produced by cell cultures. PCR, immunological assays, and enzyme activity tests represent common methods to detect virus infections. We have screened more than 300 Cancer Cell Line Encyclopedia RNA sequencing and 60 whole exome sequencing human cell lines data sets for specific viral sequences and general viral nucleotide and protein sequence assessment applying the Taxonomer bioinformatics tool developed by IDbyDNA. The results were compared with our previous findings from virus specific PCR analyses. The results obtained from both the direct alignment method and the Taxonomer alignment method revealed a complete concordance with the PCR results: twenty cell lines were found to be infected with five virus species. Taxonomer further uncovered a bovine polyomavirus infection in the breast cancer cell line SK-BR-3 most likely introduced by contaminated fetal bovine serum. RNA-Seq data sets were more sensitive for virus detection although a significant proportion of cell lines revealed low numbers of virus specific alignments attributable to low level nucleotide contamination during RNA preparation or sequencing procedure. Low quality reads leading to Taxonomer false positive results can be eliminated by trimming the sequence data before analysis. One further important result is that no viruses were detected that had never been shown to occur in cell cultures. The results prove that the currently applied testing of cell cultures is adequate for the detection of contamination and for the risk assessment of cell cultures. The results emphasize that next generation sequencing is an efficient tool to determine the viral infection status of human cells.
Affiliation(s)
- Cord C. Uphoff
  - Department of Human and Animal Cell Lines, Leibniz Institute DSMZ—German Collection of Microorganisms and Cell Cultures, Braunschweig, Germany
- Claudia Pommerenke
  - Department of Human and Animal Cell Lines, Leibniz Institute DSMZ—German Collection of Microorganisms and Cell Cultures, Braunschweig, Germany
- Sabine A. Denkmann
  - Department of Human and Animal Cell Lines, Leibniz Institute DSMZ—German Collection of Microorganisms and Cell Cultures, Braunschweig, Germany
- Hans G. Drexler
  - Department of Human and Animal Cell Lines, Leibniz Institute DSMZ—German Collection of Microorganisms and Cell Cultures, Braunschweig, Germany
|
18
|
Laine VN, Gossmann TI, van Oers K, Visser ME, Groenen MAM. Exploring the unmapped DNA and RNA reads in a songbird genome. BMC Genomics 2019; 20:19. [PMID: 30621573 PMCID: PMC6323668 DOI: 10.1186/s12864-018-5378-2] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 07/19/2018] [Accepted: 12/16/2018] [Indexed: 01/19/2023] Open
Abstract
BACKGROUND: A widely used approach in next-generation sequencing projects is the alignment of reads to a reference genome. Despite methodological and hardware improvements which have enhanced the efficiency and accuracy of alignments, a significant percentage of reads frequently remain unmapped. Usually, unmapped reads are discarded from the analysis process, but significant biological information and insights can be uncovered from these data. We explored the unmapped DNA (normal and bisulfite treated) and RNA sequence reads of the great tit (Parus major) reference genome individual. From the unmapped reads we generated de novo assemblies, after which the generated sequence contigs were aligned to the NCBI non-redundant nucleotide database using BLAST, identifying the closest known matching sequence. RESULTS: Many of the aligned contigs showed sequence similarity to different bird species and genes that were absent in the great tit reference assembly. Furthermore, there were also contigs that represented known P. major pathogenic species. Most interesting were several species of blood parasites such as Plasmodium and Trypanosoma. CONCLUSIONS: Our analyses revealed that meaningful biological information can be found when further exploring unmapped reads. For instance, it is possible to discover sequences that are either absent or misassembled in the reference genome, and sequences that indicate infection or sample contamination. In this study we also propose strategies to aid the capture and interpretation of this information from unmapped reads.
Affiliation(s)
- Veronika N Laine
  - Department of Animal Ecology, NIOO-KNAW, Wageningen, The Netherlands
- Toni I Gossmann
  - Department of Animal and Plant Sciences, The University of Sheffield, Sheffield, UK
- Kees van Oers
  - Department of Animal Ecology, NIOO-KNAW, Wageningen, The Netherlands
- Marcel E Visser
  - Department of Animal Ecology, NIOO-KNAW, Wageningen, The Netherlands
  - Department of Animal Sciences, Wageningen University, Wageningen, The Netherlands
- Martien A M Groenen
  - Department of Animal Sciences, Wageningen University, Wageningen, The Netherlands
|
19
|
Dolle DD, Liu Z, Cotten M, Simpson JT, Iqbal Z, Durbin R, McCarthy SA, Keane TM. Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes. Genome Res 2016; 27:300-309. [PMID: 27986821 PMCID: PMC5287235 DOI: 10.1101/gr.211748.116] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 06/22/2016] [Accepted: 12/14/2016] [Indexed: 01/04/2023]
Abstract
We are rapidly approaching the point where we have sequenced millions of human genomes. There is a pressing need for new data structures to store raw sequencing data and efficient algorithms for population scale analysis. Current reference-based data formats do not fully exploit the redundancy in population sequencing nor take advantage of shared genetic variation. In recent years, the Burrows–Wheeler transform (BWT) and FM-index have been widely employed as a full-text searchable index for read alignment and de novo assembly. We introduce the concept of a population BWT and use it to store and index the sequencing reads of 2705 samples from the 1000 Genomes Project. A key feature is that, as more genomes are added, identical read sequences are increasingly observed, and compression becomes more efficient. We assess the support in the 1000 Genomes read data for every base position of two human reference assembly versions, identifying that 3.2 Mbp with population support was lost in the transition from GRCh37 with 13.7 Mbp added to GRCh38. We show that the vast majority of variant alleles can be uniquely described by overlapping 31-mers and show how rapid and accurate SNP and indel genotyping can be carried out across the genomes in the population BWT. We use the population BWT to carry out nonreference queries to search for the presence of all known viral genomes and discover human T-lymphotropic virus 1 integrations in six samples in a recognized epidemiological distribution.
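To make the BWT and FM-index idea in this abstract concrete, here is a toy Python sketch of a Burrows-Wheeler transform with backward search for counting pattern occurrences. It indexes a single short string rather than a read collection and is not the population BWT implementation described by the authors; the function names, the `$` sentinel convention and the example sequence are assumptions made for illustration.

```python
from collections import Counter

def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (quadratic; toy use only)."""
    text += "$"  # sentinel: assumed absent from the input and sorting before A/C/G/T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def build_fm_index(bwt_string):
    """Return (first_row, occ): first_row[c] counts symbols smaller than c,
    occ[c][i] counts occurrences of c in bwt_string[:i]."""
    counts = Counter(bwt_string)
    first_row, total = {}, 0
    for symbol in sorted(counts):
        first_row[symbol] = total
        total += counts[symbol]
    occ = {symbol: [0] for symbol in counts}
    for ch in bwt_string:
        for symbol in occ:
            occ[symbol].append(occ[symbol][-1] + (symbol == ch))
    return first_row, occ

def count_occurrences(pattern, first_row, occ):
    """Backward search: how many times pattern occurs in the indexed text."""
    lo, hi = 0, len(next(iter(occ.values()))) - 1  # half-open range over all rows
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + occ[ch][lo]
        hi = first_row[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

# Hypothetical sequence used only to exercise the index.
first_row, occ = build_fm_index(bwt("GATTACAGATTACA"))
for query in ("GATTACA", "CAG", "GG"):
    print(query, count_occurrences(query, first_row, occ))
```

The search never touches the original text: each pattern character narrows a range of rows in the sorted rotation matrix, which is what lets a compressed index answer presence and count queries directly.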
Affiliation(s)
- Dirk D Dolle
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
- Zhicheng Liu
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
  - European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, United Kingdom
- Matthew Cotten
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
- Jared T Simpson
  - Ontario Institute for Cancer Research, Toronto, Ontario M5G 0A3, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3G4, Canada
- Zamin Iqbal
  - Wellcome Trust Centre for Human Genetics, Oxford OX3 7BN, United Kingdom
- Richard Durbin
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
- Shane A McCarthy
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
- Thomas M Keane
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
  - European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, United Kingdom
|
20
|
Taylor JF, Whitacre LK, Hoff JL, Tizioto PC, Kim J, Decker JE, Schnabel RD. Lessons for livestock genomics from genome and transcriptome sequencing in cattle and other mammals. Genet Sel Evol 2016; 48:59. [PMID: 27534529 PMCID: PMC4989351 DOI: 10.1186/s12711-016-0237-6] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 12/16/2015] [Accepted: 08/02/2016] [Indexed: 12/31/2022] Open
Abstract
Background: Decreasing sequencing costs and development of new protocols for characterizing global methylation, gene expression patterns and regulatory regions have stimulated the generation of large livestock datasets. Here, we discuss experiences in the analysis of whole-genome and transcriptome sequence data. Methods: We analyzed whole-genome sequence (WGS) data from 132 individuals from five canid species (Canis familiaris, C. latrans, C. dingo, C. aureus and C. lupus) and 61 breeds, three bison (Bison bison), 64 water buffalo (Bubalus bubalis) and 297 bovines from 17 breeds. By individual, data vary in extent of reference genome depth of coverage from 4.9X to 64.0X. We have also analyzed RNA-seq data for 580 samples representing 159 Bos taurus and Rattus norvegicus animals and 98 tissues. By aligning reads to a reference assembly and calling variants, we assessed effects of average depth of coverage on the actual coverage and on the number of called variants. We examined the identity of unmapped reads by assembling them and querying produced contigs against the non-redundant nucleic acids database. By imputing high-density single nucleotide polymorphism data on 4010 US registered Angus animals to WGS using Run4 of the 1000 Bull Genomes Project and assessing the accuracy of imputation, we identified misassembled reference sequence regions. Results: We estimate that a 24X depth of coverage is required to achieve 99.5% coverage of the reference assembly and identify 95% of the variants within an individual’s genome. Genomes sequenced to low average coverage (e.g., <10X) may fail to cover 10% of the reference genome and identify <75% of variants. About 10% of genomic DNA or transcriptome sequence reads fail to align to the reference assembly. These reads include loci missing from the reference assembly and misassembled genes and interesting symbionts, commensal and pathogenic organisms.
Conclusions: Assembly errors and a lack of annotation of functional elements significantly limit the utility of the current draft livestock reference assemblies. The Functional Annotation of Animal Genomes initiative seeks to annotate functional elements, while a 70X Pac-Bio assembly for cow is underway and may result in a significantly improved reference assembly.
Affiliation(s)
- Jeremy F Taylor
  - Division of Animal Sciences, University of Missouri, Columbia, MO, USA
- Lynsey K Whitacre
  - Division of Animal Sciences, University of Missouri, Columbia, MO, USA
  - Informatics Institute, University of Missouri, Columbia, MO, USA
- Jesse L Hoff
  - Division of Animal Sciences, University of Missouri, Columbia, MO, USA
- Polyana C Tizioto
  - Division of Animal Sciences, University of Missouri, Columbia, MO, USA
  - Embrapa Southeast Livestock, São Carlos, SP, Brazil
- JaeWoo Kim
  - Division of Animal Sciences, University of Missouri, Columbia, MO, USA
- Jared E Decker
  - Division of Animal Sciences, University of Missouri, Columbia, MO, USA
  - Informatics Institute, University of Missouri, Columbia, MO, USA
- Robert D Schnabel
  - Division of Animal Sciences, University of Missouri, Columbia, MO, USA
  - Informatics Institute, University of Missouri, Columbia, MO, USA
|
21
|
Genomic leftovers: identifying novel microsatellites, over-represented motifs and functional elements in the human genome. Sci Rep 2016; 6:27722. [PMID: 27278669 PMCID: PMC4899811 DOI: 10.1038/srep27722] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 07/15/2015] [Accepted: 05/23/2016] [Indexed: 01/29/2023] Open
Abstract
The human genome is 99% complete. This study contributes to filling the 1% gap by enriching previously unknown repeat regions called microsatellites (MST). We devised a Global MST Enrichment (GME) kit to enrich and next-generation sequence 2 colorectal cell lines and 16 normal human samples to illustrate its utility in identifying contigs from reads that do not map to the genome reference. The analysis of these samples yielded 790 novel extra-referential concordant contigs that are observed in more than one sample. We searched for evidence of functional elements in the concordant contigs in two ways: (1) BLAST-ing each contig against normal RNA-Seq samples, (2) checking for predicted functional elements using GlimmerHMM. Of the 790 concordant contigs, 37 had an exact match to at least one RNA-Seq read; 15 aligned to more than 100 RNA-Seq reads. Of the 249 concordant contigs predicted by GlimmerHMM to have functional elements, 6 had at least one exact RNA-Seq match. BLAST-ing these novel contigs against all publicly available sequences confirmed that they were found in human and chimpanzee BAC and FOSMID clones sequenced as part of the original human genome project. These extra-referential contigs predominantly contained pentameric repeats, especially two motifs: AATGG and GTGGA.
|
22
|
Small RNA-Based Antiviral Defense in the Phytopathogenic Fungus Colletotrichum higginsianum. PLoS Pathog 2016; 12:e1005640. [PMID: 27253323 PMCID: PMC4890784 DOI: 10.1371/journal.ppat.1005640] [Citation(s) in RCA: 61] [Impact Index Per Article: 7.6] [Received: 01/06/2016] [Accepted: 04/26/2016] [Indexed: 12/21/2022] Open
Abstract
Even though the fungal kingdom contains more than 3 million species, little is known about the biological roles of RNA silencing in fungi. The Colletotrichum genus comprises fungal species that are pathogenic for a wide range of crop species worldwide. To investigate the role of RNA silencing in the ascomycete fungus Colletotrichum higginsianum, knock-out mutants affecting genes for three RNA-dependent RNA polymerase (RDR), two Dicer-like (DCL), and two Argonaute (AGO) proteins were generated by targeted gene replacement. No effects were observed on vegetative growth for any mutant strain when grown on complex or minimal media. However, Δdcl1, Δdcl1Δdcl2 double mutant, and Δago1 strains showed severe defects in conidiation and conidia morphology. Total RNA transcripts and small RNA populations were analyzed in parental and mutant strains. The greatest effects on both RNA populations were observed in the Δdcl1, Δdcl1Δdcl2, and Δago1 strains, in which a previously uncharacterized dsRNA mycovirus [termed Colletotrichum higginsianum non-segmented dsRNA virus 1 (ChNRV1)] was derepressed. Phylogenetic analyses clearly showed a close relationship between ChNRV1 and members of the segmented Partitiviridae family, despite the non-segmented nature of the genome. Immunoprecipitation of small RNAs associated with AGO1 showed abundant loading of 5’U-containing viral siRNA. C. higginsianum parental and Δdcl1 mutant strains cured of ChNRV1 revealed that the conidiation and spore morphology defects were primarily caused by ChNRV1. Based on these results, RNA silencing involving ChDCL1 and ChAGO1 in C. higginsianum is proposed to function as an antiviral mechanism. Colletotrichum sp. comprises a diverse group of fungal pathogens that attack over 3000 plant species worldwide. Understanding the underlying mechanisms that govern fungal development and pathogenicity may enable more effective and sustainable approaches to crop disease management and control.
In most organisms, RNA silencing is an important mechanism to control endogenous and exogenous RNA. RNA silencing utilizes small regulatory molecules (small RNAs) that are produced by proteins called Dicer (DCL) and exert their function through effector proteins named Argonaute (AGO). Here, we investigated the role of RNA silencing machinery in the fungus Colletotrichum higginsianum, by generating deletions in genes encoding RNA silencing components. Severe defects were observed in both conidiation and conidia morphology in the Δdcl1, Δdcl1Δdcl2, and Δago1 strains. Analysis of transcripts and small RNAs revealed an uncharacterized dsRNA virus persistently infecting C. higginsianum. The virus was shown (1) to be de-repressed in the Δdcl1, Δdcl1Δdcl2 and Δago1 strains, and (2) to cause the conidiation and spore mutant phenotypes. Our results indicate that C. higginsianum employs RNA silencing as an antiviral mechanism to suppress viruses and their debilitating effects.
|
23
|
Whitacre LK, Tizioto PC, Kim J, Sonstegard TS, Schroeder SG, Alexander LJ, Medrano JF, Schnabel RD, Taylor JF, Decker JE. What's in your next-generation sequence data? An exploration of unmapped DNA and RNA sequence reads from the bovine reference individual. BMC Genomics 2015; 16:1114. [PMID: 26714747 PMCID: PMC4696311 DOI: 10.1186/s12864-015-2313-7] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 08/26/2015] [Accepted: 12/15/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Next-generation sequencing projects commonly commence by aligning reads to a reference genome assembly. While improvements in alignment algorithms and computational hardware have greatly enhanced the efficiency and accuracy of alignments, a significant percentage of reads often remain unmapped. RESULTS: We generated de novo assemblies of unmapped reads from the DNA and RNA sequencing of the Bos taurus reference individual and identified the closest matching sequence to each contig by alignment to the NCBI non-redundant nucleotide database using BLAST. As expected, many of these contigs represent vertebrate sequence that is absent, incomplete, or misassembled in the UMD3.1 reference assembly. However, numerous additional contigs represent invertebrate species. Most prominent were several species of Spirurid nematodes and a blood-borne parasite, Babesia bigemina. These species are either not present in the US or are not known to infect taurine cattle and the reference animal appears to have been host to unsequenced sister species. CONCLUSIONS: We demonstrate the importance of exploring unmapped reads to ascertain sequences that are either absent or misassembled in the reference assembly and for detecting sequences indicative of parasitic or commensal organisms.
Affiliation(s)
- Lynsey K Whitacre
  - Informatics Institute, University of Missouri, Columbia, MO, 65211, USA
  - Division of Animal Sciences, University of Missouri, Columbia, MO, 65211, USA
- Polyana C Tizioto
  - Division of Animal Sciences, University of Missouri, Columbia, MO, 65211, USA
  - Embrapa Southeast Livestock, São Carlos, São Paulo, 13560-970, Brazil
- JaeWoo Kim
  - Division of Animal Sciences, University of Missouri, Columbia, MO, 65211, USA
- Tad S Sonstegard
  - Animal Genomics and Improvement Laboratory, USDA-ARS, Beltsville, MD, 20705, USA
  - Recombinetics Inc., 1246 University Ave W #301, St Paul, MN, 55104, USA
- Steven G Schroeder
  - Animal Genomics and Improvement Laboratory, USDA-ARS, Beltsville, MD, 20705, USA
- Juan F Medrano
  - Department of Animal Science, University of California-Davis, Davis, CA, 95616, USA
- Robert D Schnabel
  - Informatics Institute, University of Missouri, Columbia, MO, 65211, USA
  - Division of Animal Sciences, University of Missouri, Columbia, MO, 65211, USA
- Jeremy F Taylor
  - Division of Animal Sciences, University of Missouri, Columbia, MO, 65211, USA
- Jared E Decker
  - Informatics Institute, University of Missouri, Columbia, MO, 65211, USA
  - Division of Animal Sciences, University of Missouri, Columbia, MO, 65211, USA
|