1
Lo T, Coombe L, Gagalova KK, Marr A, Warren RL, Kirk H, Pandoh P, Zhao Y, Moore RA, Mungall AJ, Ritland C, Pavy N, Jones SJM, Bohlmann J, Bousquet J, Birol I, Thomson A. Assembly and annotation of the black spruce genome provide insights on spruce phylogeny and evolution of stress response. G3 (Bethesda) 2023; 14:jkad247. [PMID: 37875130 PMCID: PMC10755193 DOI: 10.1093/g3journal/jkad247] [Received: 05/17/2023] [Revised: 05/17/2023] [Accepted: 10/09/2023] [Indexed: 10/26/2023]
Abstract
Black spruce (Picea mariana [Mill.] B.S.P.) is a dominant conifer species in the North American boreal forest that plays important ecological and economic roles. Here, we present the first genome assembly of P. mariana with a reconstructed genome size of 18.3 Gbp and NG50 scaffold length of 36.0 kbp. A total of 66,332 protein-coding sequences were predicted in silico and annotated based on sequence homology. We analyzed the evolutionary relationships between P. mariana and 5 other spruces for which complete nuclear and organelle genome sequences were available. The phylogenetic tree estimated from mitochondrial genome sequences agrees with biogeography; specifically, P. mariana was strongly supported as a sister lineage to P. glauca and 3 other taxa found in western North America, followed by the European Picea abies. We obtained mixed topologies with weaker statistical support in phylogenetic trees estimated from nuclear and chloroplast genome sequences, indicative of ancient reticulate evolution affecting these 2 genomes. Clustering of protein-coding sequences from the 6 Picea taxa and 2 Pinus species resulted in 34,776 orthogroups, 560 of which appeared to be specific to P. mariana. Analysis of these specific orthogroups and dN/dS analysis of positive selection signatures for 497 single-copy orthogroups identified gene functions mostly related to plant development and stress response. The P. mariana genome assembly and annotation provides a valuable resource for forest genetics research and applications in this broadly distributed species, especially in relation to climate adaptation.
Affiliation(s)
- Theodora Lo
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Lauren Coombe
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Kristina K Gagalova
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Alex Marr
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- René L Warren
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Heather Kirk
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Pawan Pandoh
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Yongjun Zhao
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Richard A Moore
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Andrew J Mungall
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Carol Ritland
  - Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
  - Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Nathalie Pavy
  - Canada Research Chair in Forest Genomics, Laval University, Quebec City, QC G1V 0A6, Canada
- Steven J M Jones
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Joerg Bohlmann
  - Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
  - Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
  - Department of Botany, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Jean Bousquet
  - Canada Research Chair in Forest Genomics, Laval University, Quebec City, QC G1V 0A6, Canada
- Inanç Birol
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Ashley Thomson
  - Faculty of Natural Resources Management, Lakehead University, Thunder Bay, ON P7B 5E1, Canada
2
Wong J, Kazemi P, Coombe L, Warren RL, Birol I. aaHash: recursive amino acid sequence hashing. Bioinform Adv 2023; 3:vbad162. [PMID: 38023332 PMCID: PMC10660294 DOI: 10.1093/bioadv/vbad162] [Received: 07/14/2023] [Revised: 10/13/2023] [Accepted: 11/08/2023] [Indexed: 12/01/2023]
Abstract
Motivation: K-mer hashing is a common operation in many foundational bioinformatics problems. However, generic string hashing algorithms are not optimized for this application. Strings in bioinformatics use specific alphabets, a trait leveraged for nucleic acid sequences in earlier work. We note that amino acid sequences, with complexities and context that cannot be captured by generic hashing algorithms, can also benefit from a domain-specific hashing algorithm. Such a hashing algorithm can accelerate and improve the sensitivity of bioinformatics applications developed for protein sequences. Results: Here, we present aaHash, a recursive hashing algorithm tailored for amino acid sequences. This algorithm utilizes multiple hash levels to represent biochemical similarities between amino acids. aaHash performs ∼10× faster than generic string hashing algorithms in hashing adjacent k-mers. Availability and implementation: aaHash is available online at https://github.com/bcgsc/btllib and is free for academic use.
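The recursive idea in this abstract can be illustrated with a minimal sketch: once the first k-mer is hashed, the hash of each adjacent k-mer is derived from the previous one in constant time rather than by rehashing all k residues. The Python below is not aaHash's implementation and does not use the btllib API; the per-residue seed table (`SEED`), the rotate-and-XOR scheme, and every function name are assumptions chosen for illustration, and the multi-level treatment of biochemically similar residues is not reproduced.

```python
import hashlib

MASK64 = (1 << 64) - 1
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical per-residue 64-bit seeds (derived from SHA-256 so they are
# reproducible); these are NOT the constants used by aaHash.
SEED = {
    aa: int.from_bytes(hashlib.sha256(aa.encode()).digest()[:8], "big")
    for aa in AMINO_ACIDS
}


def rol(x, r):
    """Rotate a 64-bit integer left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64


def hash_kmer(kmer):
    """Hash a k-mer from scratch in O(k)."""
    h = 0
    for residue in kmer:
        h = rol(h, 1) ^ SEED[residue]
    return h


def rolling_hashes(seq, k):
    """Yield one hash per k-mer, updating in O(1) after the first k-mer."""
    if len(seq) < k:
        return
    h = hash_kmer(seq[:k])
    yield h
    for i in range(k, len(seq)):
        outgoing, incoming = seq[i - k], seq[i]
        # Shift the window: rotate, cancel the outgoing residue, add the incoming one.
        h = rol(h, 1) ^ rol(SEED[outgoing], k) ^ SEED[incoming]
        yield h


if __name__ == "__main__":
    protein = "MKTAYIAKQRQISFVKSHFSRQ"
    k = 5
    rolled = list(rolling_hashes(protein, k))
    scratch = [hash_kmer(protein[i:i + k]) for i in range(len(protein) - k + 1)]
    print(rolled == scratch)
```

The rolling update is where the speed-up for adjacent k-mers would come from in a scheme like this: the cost per k-mer no longer depends on k.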
Affiliation(s)
- Johnathan Wong
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Parham Kazemi
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Lauren Coombe
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- René L Warren
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Inanç Birol
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
3
Wong J, Coombe L, Nikolić V, Zhang E, Nip KM, Sidhu P, Warren RL, Birol I. Linear time complexity de novo long read genome assembly with GoldRush. Nat Commun 2023; 14:2906. [PMID: 37217507 DOI: 10.1038/s41467-023-38716-x] [Received: 11/10/2022] [Accepted: 05/11/2023] [Indexed: 05/24/2023]
Abstract
Current state-of-the-art de novo long read genome assemblers follow the Overlap-Layout-Consensus paradigm. While read-to-read overlap - its most costly step - was improved in modern long read genome assemblers, these tools still often require excessive RAM when assembling a typical human dataset. Our work departs from this paradigm, foregoing all-vs-all sequence alignments in favor of a dynamic data structure implemented in GoldRush, a de novo long read genome assembly algorithm with linear time complexity. We tested GoldRush on Oxford Nanopore Technologies long sequencing read datasets with different base error profiles sourced from three human cell lines, rice, and tomato. Here, we show that GoldRush achieves assembly scaffold NGA50 lengths of 18.3-22.2, 0.3 and 2.6 Mbp, for the genomes of human, rice, and tomato, respectively, and assembles each genome within a day, using at most 54.5 GB of random-access memory, demonstrating the scalability of our genome assembly paradigm and its implementation.
Affiliation(s)
- Johnathan Wong
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Lauren Coombe
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Vladimir Nikolić
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Emily Zhang
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Ka Ming Nip
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Puneet Sidhu
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- René L Warren
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Inanç Birol
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
4
Wong J, Kazemi P, Coombe L, Warren RL, Birol I. aaHash: recursive amino acid sequence hashing. bioRxiv 2023:2023.05.08.539909. [PMID: 37214907 PMCID: PMC10197579 DOI: 10.1101/2023.05.08.539909] [Indexed: 05/24/2023]
Abstract
Motivation: K-mer hashing is a common operation in many foundational bioinformatics problems. However, generic string hashing algorithms are not optimized for this application. Strings in bioinformatics use specific alphabets, a trait leveraged for nucleic acid sequences in earlier work. We note that amino acid sequences, with complexities and context that cannot be captured by generic hashing algorithms, can also benefit from a domain-specific hashing algorithm. Such a hashing algorithm can accelerate and improve the sensitivity of bioinformatics applications developed for protein sequences. Results: Here, we present aaHash, a recursive hashing algorithm tailored for amino acid sequences. This algorithm utilizes multiple hash levels to represent biochemical similarities between amino acids. aaHash performs ~10× faster than generic string hashing algorithms in hashing adjacent k-mers. Availability and implementation: aaHash is available online at https://github.com/bcgsc/btllib and is free for academic use.
Affiliation(s)
- Johnathan Wong
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Parham Kazemi
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Lauren Coombe
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- René L. Warren
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Inanç Birol
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
5
Yoo S, Garg E, Elliott LT, Hung RJ, Halevy AR, Brooks JD, Bull SB, Gagnon F, Greenwood C, Lawless JF, Paterson AD, Sun L, Zawati MH, Lerner-Ellis J, Abraham R, Birol I, Bourque G, Garant JM, Gosselin C, Li J, Whitney J, Thiruvahindrapuram B, Herbrick JA, Lorenti M, Reuter MS, Adeoye OO, Liu S, Allen U, Bernier FP, Biggs CM, Cheung AM, Cowan J, Herridge M, Maslove DM, Modi BP, Mooser V, Morris SK, Ostrowski M, Parekh RS, Pfeffer G, Suchowersky O, Taher J, Upton J, Warren RL, Yeung R, Aziz N, Turvey SE, Knoppers BM, Lathrop M, Jones S, Scherer SW, Strug LJ. HostSeq: a Canadian whole genome sequencing and clinical data resource. BMC Genom Data 2023; 24:26. [PMID: 37131148 PMCID: PMC10152008 DOI: 10.1186/s12863-023-01128-3] [Received: 06/15/2022] [Accepted: 02/22/2023] [Indexed: 05/04/2023]
Abstract
HostSeq was launched in April 2020 as a national initiative to integrate whole genome sequencing data from 10,000 Canadians infected with SARS-CoV-2 with clinical information related to their disease experience. The mandate of HostSeq is to support the Canadian and international research communities in their efforts to understand the risk factors for disease and associated health outcomes and support the development of interventions such as vaccines and therapeutics. HostSeq is a collaboration among 13 independent epidemiological studies of SARS-CoV-2 across five provinces in Canada. Aggregated data collected by HostSeq are made available to the public through two data portals: a phenotype portal showing summaries of major variables and their distributions, and a variant search portal enabling queries in a genomic region. Individual-level data is available to the global research community for health research through a Data Access Agreement and Data Access Compliance Office approval. Here we provide an overview of the collective project design along with summary level information for HostSeq. We highlight several statistical considerations for researchers using the HostSeq platform regarding data aggregation, sampling mechanism, covariate adjustment, and X chromosome analysis. In addition to serving as a rich data source, the diversity of study designs, sample sizes, and research objectives among the participating studies provides unique opportunities for the research community.
Affiliation(s)
- S Yoo
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Ottawa, Ottawa, ON, Canada
- E Garg
  - Simon Fraser University, Burnaby, BC, Canada
- L T Elliott
  - Simon Fraser University, Burnaby, BC, Canada
- R J Hung
  - University of Toronto, Toronto, ON, Canada
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
- A R Halevy
  - The Hospital for Sick Children, Toronto, ON, Canada
- J D Brooks
  - University of Toronto, Toronto, ON, Canada
- S B Bull
  - University of Toronto, Toronto, ON, Canada
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
- F Gagnon
  - University of Toronto, Toronto, ON, Canada
- C M T Greenwood
  - McGill University, Montreal, QC, Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
- J F Lawless
  - University of Waterloo, Waterloo, ON, Canada
- A D Paterson
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- L Sun
  - University of Toronto, Toronto, ON, Canada
- J Lerner-Ellis
  - University of Toronto, Toronto, ON, Canada
  - Sinai Health System, Toronto, ON, Canada
- R J S Abraham
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- I Birol
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- G Bourque
  - McGill University, Montreal, QC, Canada
- J-M Garant
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- C Gosselin
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- J Li
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- J Whitney
  - The Hospital for Sick Children, Toronto, ON, Canada
- J-A Herbrick
  - The Hospital for Sick Children, Toronto, ON, Canada
- M Lorenti
  - The Hospital for Sick Children, Toronto, ON, Canada
- M S Reuter
  - The Hospital for Sick Children, Toronto, ON, Canada
- O O Adeoye
  - The Hospital for Sick Children, Toronto, ON, Canada
- S Liu
  - The Hospital for Sick Children, Toronto, ON, Canada
- U Allen
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- F P Bernier
  - University of Calgary, Calgary, AB, Canada
  - Alberta Children's Hospital, Calgary, AB, Canada
- C M Biggs
  - University of British Columbia, Vancouver, BC, Canada
  - BC Children's Hospital, Vancouver, BC, Canada
  - St. Paul's Hospital, Vancouver, BC, Canada
- A M Cheung
  - University Health Network, Toronto, ON, Canada
- J Cowan
  - University of Ottawa, Ottawa, ON, Canada
  - The Ottawa Hospital Research Institute, Ottawa, ON, Canada
- M Herridge
  - University Health Network, Toronto, ON, Canada
- B P Modi
  - BC Children's Hospital, Vancouver, BC, Canada
- V Mooser
  - McGill University, Montreal, QC, Canada
- S K Morris
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- M Ostrowski
  - University of Toronto, Toronto, ON, Canada
  - St. Michael's Hospital, Unity Health, Toronto, ON, Canada
- R S Parekh
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
  - Women's College Hospital, Toronto, ON, Canada
- G Pfeffer
  - University of Calgary, Calgary, AB, Canada
- J Taher
  - University of Toronto, Toronto, ON, Canada
  - Sinai Health System, Toronto, ON, Canada
- J Upton
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- R L Warren
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- R S M Yeung
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- N Aziz
  - The Hospital for Sick Children, Toronto, ON, Canada
- S E Turvey
  - University of British Columbia, Vancouver, BC, Canada
  - BC Children's Hospital, Vancouver, BC, Canada
- M Lathrop
  - McGill University, Montreal, QC, Canada
- S J M Jones
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- S W Scherer
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- L J Strug
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
6
Kazemi P, Wong J, Nikolić V, Mohamadi H, Warren RL, Birol I. ntHash2: recursive spaced seed hashing for nucleotide sequences. Bioinformatics 2022; 38:4812-4813. [PMID: 36000872 PMCID: PMC9563681 DOI: 10.1093/bioinformatics/btac564] [Received: 05/11/2022] [Revised: 07/21/2022] [Indexed: 11/29/2022]
Abstract
Motivation: Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various bioinformatics tools for efficient sequence analysis with applications in genome research. Results: ntHash2 is up to 2.1× faster at hashing various spaced seeds than the previous version and 3.8× faster than conventional hashing algorithms with naïve adaptation. Additionally, we reduced the collision rate of ntHash for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism. Availability and implementation: ntHash2 is freely available online at github.com/bcgsc/ntHash under an MIT license. Supplementary information: Supplementary data are available at Bioinformatics online.
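To make the notion of a spaced seed concrete, here is a small sketch in which a pattern such as `1101011` marks "care" positions with `1` and "don't care" positions with `0`, so two windows that differ only at a `0` position hash identically. The code is not ntHash2's algorithm or API: the seed constants, the rotate-and-XOR hash, and the function names are all assumptions for illustration. It shows one simple way to obtain a spaced seed hash from a rolling k-mer hash, by cancelling the contribution of the don't-care positions.

```python
MASK64 = (1 << 64) - 1

# Arbitrary 64-bit constants, one per base, used here for illustration only.
SEED = {
    "A": 0x3C8BFBB395C60474,
    "C": 0x3193C18562A02B4C,
    "G": 0x20323ED082572324,
    "T": 0x295549F54BE24456,
}


def rol(x, r):
    """Rotate a 64-bit integer left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64


def kmer_hash(kmer):
    """Hash every position of a k-mer in O(k)."""
    h = 0
    for base in kmer:
        h = rol(h, 1) ^ SEED[base]
    return h


def spaced_hash_direct(window, pattern):
    """Reference version: hash only the care ('1') positions of the window."""
    k = len(pattern)
    h = 0
    for j, (base, flag) in enumerate(zip(window, pattern)):
        if flag == "1":
            h ^= rol(SEED[base], k - 1 - j)
    return h


def spaced_seed_hashes(seq, pattern):
    """Roll a k-mer hash along seq and cancel the don't-care positions."""
    k = len(pattern)
    dont_care = [j for j, flag in enumerate(pattern) if flag == "0"]
    if len(seq) < k:
        return
    h = kmer_hash(seq[:k])
    for start in range(len(seq) - k + 1):
        if start > 0:
            # O(1) window shift: drop seq[start - 1], take in seq[start + k - 1].
            h = rol(h, 1) ^ rol(SEED[seq[start - 1]], k) ^ SEED[seq[start + k - 1]]
        spaced = h
        for j in dont_care:
            spaced ^= rol(SEED[seq[start + j]], k - 1 - j)
        yield spaced


if __name__ == "__main__":
    pattern = "1101011"
    # The two windows differ only at index 2, a don't-care position.
    print(spaced_hash_direct("ACGTACG", pattern) == spaced_hash_direct("ACTTACG", pattern))
```

Tolerating mismatches at the `0` positions is what makes spaced seeds attractive for sequences with high base mismatch rates.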
Affiliation(s)
- Parham Kazemi
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
  - Faculty of Science, University of British Columbia, Vancouver, Canada
- Johnathan Wong
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
- Vladimir Nikolić
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
- René L Warren
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
- Inanç Birol
  - Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
  - Department of Medical Genetics, University of British Columbia, Vancouver, Canada
7
Nikolić V, Afshinfard A, Chu J, Wong J, Coombe L, Nip KM, Warren RL, Birol I. RResolver: efficient short-read repeat resolution within ABySS. BMC Bioinformatics 2022; 23:246. [PMID: 35729491 PMCID: PMC9215042 DOI: 10.1186/s12859-022-04790-z] [Received: 03/07/2022] [Accepted: 06/09/2022] [Indexed: 11/26/2022]
Abstract
BACKGROUND: De novo genome assembly is essential to modern genomics studies. As it is not biased by a reference, it is also a useful method for studying genomes with high variation, such as cancer genomes. De novo short-read assemblers commonly use de Bruijn graphs, where nodes are sequences of equal length k, also known as k-mers. Edges in this graph are established between nodes that overlap by k − 1 bases, and nodes along unambiguous walks in the graph are subsequently merged. The selection of k is influenced by multiple factors, and optimizing this value results in a trade-off between graph connectivity and sequence contiguity. Ideally, multiple k sizes should be used, so lower values can provide good connectivity in lesser covered regions and higher values can increase contiguity in well-covered regions. However, current approaches that use multiple k values do not address the scalability issues inherent to the assembly of large genomes. RESULTS: Here we present RResolver, a scalable algorithm that takes a short-read de Bruijn graph assembly with a starting k as input and uses a k value closer to that of the read length to resolve repeats. RResolver builds a Bloom filter of sequencing reads which is used to evaluate the assembly graph path support at branching points and removes paths with insufficient support. RResolver runs efficiently, taking only 26 min on average for an ABySS human assembly with 48 threads and 60 GiB memory. Across all experiments, compared to a baseline assembly, RResolver improves scaffold contiguity (NGA50) by up to 15% and reduces misassemblies by up to 12%. CONCLUSIONS: RResolver adds a missing component to scalable de Bruijn graph genome assembly. By improving the initial and fundamental graph traversal outcome, all downstream ABySS algorithms greatly benefit by working with a more accurate and less complex representation of the genome.
The RResolver code is integrated into ABySS and is available at https://github.com/bcgsc/abyss/tree/master/RResolver.
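The path-support idea described in this abstract can be sketched in a few lines: k-mers from the reads, taken at a k larger than the repeat, are stored in a Bloom filter, and each candidate path through a repeat is kept only if enough of its k-mers are found there. This is a simplified illustration, not the RResolver code: the `BloomFilter` class, the `path_support` and `supported_paths` helpers, the threshold, and the toy sequences are all assumptions.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; the size and hash count are arbitrary for this sketch."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def path_support(path_seq, read_filter, k):
    """Count how many k-mers of a candidate path are present in the read filter."""
    return sum(1 for kmer in kmers(path_seq, k) if kmer in read_filter)


def supported_paths(candidates, read_filter, k, min_support):
    """Keep candidate paths whose k-mer support reaches the threshold."""
    return [p for p in candidates if path_support(p, read_filter, k) >= min_support]


if __name__ == "__main__":
    # Toy example: a 4-base repeat "GATC" with two possible left and right flanks.
    reads = ["ACCAGATCCAAC", "TGGTGATCGTTG"]
    k = 6  # longer than the repeat, so k-mers can span it
    read_filter = BloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            read_filter.add(kmer)

    candidates = [left + "GATC" + right
                  for left in ("ACCA", "TGGT")
                  for right in ("CAAC", "GTTG")]
    # Require every k-mer of a 12-base path (7 k-mers) to be supported.
    print(supported_paths(candidates, read_filter, k, min_support=7))
```

Because a Bloom filter can return false positives but never false negatives, a scheme like this may occasionally keep an unsupported path, but it will not discard a path whose k-mers were all inserted.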
Affiliation(s)
- Vladimir Nikolić
  - Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6 Canada
  - The University of British Columbia, 2329 West Mall, Vancouver, V6T 1Z4 Canada
- Amirhossein Afshinfard
  - Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6 Canada
  - The University of British Columbia, 2329 West Mall, Vancouver, V6T 1Z4 Canada
- Justin Chu
  - Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6 Canada
  - The University of British Columbia, 2329 West Mall, Vancouver, V6T 1Z4 Canada
- Johnathan Wong
  - Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6 Canada
- Lauren Coombe
  - Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6 Canada
- Ka Ming Nip
  - Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6 Canada
  - The University of British Columbia, 2329 West Mall, Vancouver, V6T 1Z4 Canada
- René L. Warren
  - Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6 Canada
- Inanç Birol
  - Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6 Canada
  - The University of British Columbia, 2329 West Mall, Vancouver, V6T 1Z4 Canada
8
Li JX, Coombe L, Wong J, Birol I, Warren RL. ntEdit+Sealer: Efficient Targeted Error Resolution and Automated Finishing of Long-Read Genome Assemblies. Curr Protoc 2022; 2:e442. [PMID: 35567771 PMCID: PMC9196995 DOI: 10.1002/cpz1.442] [Indexed: 11/11/2022]
Abstract
High‐quality genome assemblies are crucial to many biological studies, and utilizing long sequencing reads can help achieve higher assembly contiguity. While long reads can resolve complex and repetitive regions of a genome, their relatively high associated error rates are still a major limitation. Long reads generally produce draft genome assemblies with lower base quality, which must be corrected with a genome polishing step. Hybrid genome polishing solutions can greatly improve the quality of long‐read genome assemblies by utilizing more accurate short reads to validate bases and correct errors. Currently available hybrid polishing methods rely on read alignments, and are therefore memory‐intensive and do not scale well to large genomes. Here we describe ntEdit+Sealer, an alignment‐free, k‐mer‐based genome finishing protocol that employs memory‐efficient Bloom filters. The protocol includes ntEdit for correcting base errors and small indels, and for marking potentially problematic regions, then Sealer for filling both assembly gaps and problematic regions flagged by ntEdit. ntEdit+Sealer produces highly accurate, error‐corrected genome assemblies, and is available as a Makefile pipeline from https://github.com/bcgsc/ntedit_sealer_protocol. © 2022 The Authors. Current Protocols published by Wiley Periodicals LLC. Basic Protocol: Automated long‐read genome finishing with short reads Support Protocol: Selecting optimal values for k‐mer lengths (k) and Bloom filter size (b)
Affiliation(s)
- Janet X Li
  - Canada's Michael Smith Genome Sciences Center, Vancouver, BC, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, Canada
- Lauren Coombe
  - Canada's Michael Smith Genome Sciences Center, Vancouver, BC, Canada
- Johnathan Wong
  - Canada's Michael Smith Genome Sciences Center, Vancouver, BC, Canada
- Inanç Birol
  - Canada's Michael Smith Genome Sciences Center, Vancouver, BC, Canada
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
- René L Warren
  - Canada's Michael Smith Genome Sciences Center, Vancouver, BC, Canada
9
Gagalova KK, Whitehill JGA, Culibrk L, Lin D, Lévesque-Tremblay V, Keeling CI, Coombe L, Yuen MMS, Birol I, Bohlmann J, Jones SJM. The genome of the forest insect pest Pissodes strobi reveals genome expansion and evidence of a Wolbachia endosymbiont. G3 Genes|Genomes|Genetics 2022; 12:6529542. [PMID: 35171977 PMCID: PMC8982425 DOI: 10.1093/g3journal/jkac038] [Received: 09/17/2021] [Accepted: 01/23/2022] [Indexed: 12/11/2022]
Abstract
The highly diverse insect family of true weevils, Curculionidae, includes many agricultural and forest pests. Pissodes strobi, commonly known as the spruce weevil or white pine weevil, is a major pest of spruce and pine forests in North America. Pissodes strobi larvae feed on the apical shoots of young trees, stunting their growth, and can destroy regenerating spruce or pine forests. Here, we describe the nuclear and mitochondrial Pissodes strobi genomes and their annotations, as well as the genome of an apparent Wolbachia endosymbiont. We report a substantial expansion of the weevil nuclear genome, relative to other Curculionidae species, possibly driven by an abundance of class II DNA transposons. The endosymbiont observed belongs to a group (supergroup A) of Wolbachia species that generally form parasitic relationships with their arthropod host.
Affiliation(s)
- Kristina K Gagalova
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC V6T1Z4, Canada
- Justin G A Whitehill
  - Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T1Z4, Canada
  - Department of Forestry and Environmental Resources, North Carolina State University, Raleigh, NC 27695, USA
- Luka Culibrk
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC V6T1Z4, Canada
- Diana Lin
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC V6T1Z4, Canada
- Christopher I Keeling
  - Laurentian Forestry Centre, Canadian Forest Service, Natural Resources Canada, QC G1V4C7, Canada
  - Département de Biochimie, De Microbiologie et de Bio-informatique, Université Laval, Laval, QC G1V0A6, Canada
- Lauren Coombe
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z4S6, Canada
- Macaire M S Yuen
  - Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T1Z4, Canada
- Inanç Birol
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z4S6, Canada
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T1Z4, Canada
- Jörg Bohlmann
  - Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T1Z4, Canada
  - Department of Botany, University of British Columbia, Vancouver, BC V6T1Z4, Canada
  - Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, BC V6T1Z4, Canada
- Steven J M Jones
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z4S6, Canada
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T1Z4, Canada
10
Warren RL, Birol I. HLA predictions from the bronchoalveolar lavage fluid and blood samples of eight COVID-19 patients at the pandemic onset. Bioinformatics 2021; 36:5271-5273. [PMID: 32853340 PMCID: PMC7540287 DOI: 10.1093/bioinformatics/btaa756] [Received: 04/01/2020] [Revised: 08/18/2020] [Accepted: 08/20/2020] [Indexed: 12/16/2022]
Affiliation(s)
- René L Warren
  - Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Inanç Birol
  - Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
11
|
Warren RL, Birol I. Interactive SARS-CoV-2 mutation timemaps. ArXiv 2020:2012.15697. [PMID: 33398246 PMCID: PMC7781321] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
As the year 2020 draws to an end, several new strains have been reported for the SARS-CoV-2 coronavirus, the agent responsible for the COVID-19 pandemic that has afflicted us all this past year. However, it is difficult to comprehend the scale, in sequence space, geographical location and time, at which SARS-CoV-2 mutates and evolves in its human hosts. To get an appreciation for the rapid evolution of the coronavirus, we built interactive scalable vector graphics maps that show daily nucleotide variations in genomes from the six most populated continents compared to that of the initial, ground-zero SARS-CoV-2 isolate sequenced at the beginning of the year. Availability: Mutation time maps are available from https://bcgsc.github.io/SARS2/.

12
Abstract
BACKGROUND: The Human Leukocyte Antigen (HLA) gene locus plays a fundamental role in human immunity, and it is established that certain HLA alleles are disease determinants. METHODS: By combining the predictive power of multiple in silico HLA predictors, we have previously identified prevalent HLA class I and class II alleles, including DPA1*02:02, in two small cohorts at the COVID-19 pandemic onset. Since then, newer and larger patient cohorts with controls and associated demographic and clinical data have been deposited in public repositories. Here, we report on HLA-I and HLA-II alleles, along with their associated risk significance in one such cohort of 126 patients, including COVID-19 positive (n=100) and negative patients (n=26). RESULTS: We recapitulate an enrichment of DPA1*02:02 in the COVID-19 positive cohort (29%) when compared to the COVID-negative control group (Fisher's exact test [FET] p=0.0174). Having this allele, however, does not appear to put this cohort's patients at an increased risk of hospitalization. Inspection of COVID-19 disease severity outcomes reveals nominally significant risk associations with A*11:01 (FET p=0.0078), C*04:01 (FET p=0.0087) and DQA1*01:02 (FET p=0.0121). CONCLUSIONS: While enrichment of these alleles falls below statistical significance after Bonferroni correction, COVID-19 patients with the latter three alleles tend to fare worse overall. This is especially evident for patients with C*04:01, where disease prognosis measured by mechanical ventilation-free days was statistically significant after multiple hypothesis correction (Bonferroni p = 0.0023), and may hold potential clinical value.
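The allele enrichment reported in this abstract rests on Fisher's exact test for a 2×2 table (carriers and non-carriers in the positive and negative groups). The following is a minimal standard-library sketch of a two-sided test, not the authors' code. The function name is illustrative, and because the abstract does not give the carrier count in the negative group, the value used below is purely hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# 29 of 100 positive patients carry the allele (29%, from the abstract).
# The single carrier among the 26 negative patients is HYPOTHETICAL.
carriers_pos, n_pos = 29, 100
carriers_neg, n_neg = 1, 26
p = fisher_exact_two_sided(carriers_pos, n_pos - carriers_pos,
                           carriers_neg, n_neg - carriers_neg)
print(f"two-sided p = {p:.4f}")
```

Because the negative-group count is invented, the printed value is not expected to reproduce the p-value quoted in the abstract.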
Affiliation(s)
- René L Warren
- Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada

- Inanç Birol
- Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada

13
Warren RL, Coombe L, Mohamadi H, Zhang J, Jaquish B, Isabel N, Jones SJM, Bousquet J, Bohlmann J, Birol I. ntEdit: scalable genome sequence polishing. Bioinformatics 2020; 35:4430-4432. [PMID: 31095290 PMCID: PMC6821332 DOI: 10.1093/bioinformatics/btz400] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Received: 12/05/2018] [Revised: 03/04/2019] [Accepted: 05/07/2019] [Indexed: 02/05/2023]
Abstract
Motivation: In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes. Results: We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled Escherichia coli and Caenorhabditis elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20×), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14 s and <3 m, on average, on E. coli and C. elegans, respectively. We performed similar benchmarks on a sub-20× coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and whole genome. ntEdit scaled linearly, executing in 30–40 m on those sequences. We show how ntEdit ran in <2 h 20 m to improve upon long and linked read human genome assemblies of NA12878, using high-coverage (54×) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo haploid assemblies of the 20 Gb interior and white spruce genomes in <4 and <5 h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024. Availability and implementation: https://github.com/bcgsc/ntedit Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- René L Warren
- Genome Sciences Centre, BC Cancer, Vancouver, Canada

- Lauren Coombe
- Genome Sciences Centre, BC Cancer, Vancouver, Canada

- Jessica Zhang
- Genome Sciences Centre, BC Cancer, Vancouver, Canada

- Barry Jaquish
- BC Ministry of Forests, Lands, and Natural Resource Operations, Victoria, Canada

- Nathalie Isabel
- Laurentian Forestry Centre, Natural Resources Canada, Québec, Canada

- Jean Bousquet
- Canada Research Chair in Forest Genomics, Université Laval, Québec, Canada

- Joerg Bohlmann
- Michael Smith Laboratories, University of British Columbia, Vancouver, Canada

- Inanç Birol
- Genome Sciences Centre, BC Cancer, Vancouver, Canada

14
Warren RL, Birol I. HLA predictions from the bronchoalveolar lavage fluid samples of five patients at the early stage of the Wuhan seafood market COVID-19 outbreak. ArXiv 2020:arXiv:2004.07108v3. [PMID: 32550246 PMCID: PMC7280900] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
Abstract
We are in the midst of a global viral pandemic, one with no cure and a high mortality rate. The Human Leukocyte Antigen (HLA) gene complex plays a critical role in host immunity. We predicted HLA class I and II alleles from the transcriptome sequencing data prepared from the bronchoalveolar lavage fluid samples of five patients at the early stage of the COVID-19 outbreak. We identified the HLA-I allele A*24:02 in four out of five patients, which is higher than the expected frequency (17.2%) in the South Han Chinese population. The difference is statistically significant with a p-value less than 10⁻⁴. Our analysis results may help provide future insights on disease susceptibility.
Affiliation(s)
- René L Warren
- Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada

- Inanç Birol
- Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada

15
Helbing CC, Hammond SA, Jackman SH, Houston S, Warren RL, Cameron CE, Birol I. Antimicrobial peptides from Rana [Lithobates] catesbeiana: Gene structure and bioinformatic identification of novel forms from tadpoles. Sci Rep 2019; 9:1529. [PMID: 30728430 PMCID: PMC6365531 DOI: 10.1038/s41598-018-38442-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 09/17/2018] [Accepted: 12/28/2018] [Indexed: 01/21/2023]
Abstract
Antimicrobial peptides (AMPs) exhibit broad-spectrum antimicrobial activity, and have promise as new therapeutic agents. While the adult North American bullfrog (Rana [Lithobates] catesbeiana) is a prolific source of high-potency AMPs, the aquatic tadpole represents a relatively untapped source for new AMP discovery. The recent publication of the bullfrog genome and transcriptomic resources provides an opportune bridge between known AMPs and bioinformatics-based AMP discovery. The objective of the present study was to identify novel AMPs with therapeutic potential using a combined bioinformatics and wet lab-based approach. In the present study, we identified seven novel AMP precursor-encoding transcripts expressed in the tadpole. Comparison of their amino acid sequences with known AMPs revealed evidence of mature peptide sequence conservation with variation in the prepro sequence. Two mature peptide sequences were unique and demonstrated bacteriostatic and bactericidal activity against Mycobacteria but not Gram-negative or Gram-positive bacteria. Nine known and seven novel AMP-encoding transcripts were detected in premetamorphic tadpole back skin, olfactory epithelium, liver, and/or tail fin. Treatment of tadpoles with 10 nM 3,5,3'-triiodothyronine for 48 h did not affect transcript abundance in the back skin, and had limited impact on these transcripts in the other three tissues. Gene mapping revealed considerable diversity in size (1.6-15 kbp) and exon number (one to four) of AMP-encoding genes with clear evidence of alternative splicing leading to both prepro and mature amino acid sequence diversity. These findings verify the accuracy and utility of the bullfrog genome assembly, and set a firm foundation for bioinformatics-based AMP discovery.
Affiliation(s)
- Caren C Helbing
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, British Columbia, V8P 5C2, Canada

- S Austin Hammond
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, V5Z 4S6, Canada

- Shireen H Jackman
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, British Columbia, V8P 5C2, Canada

- Simon Houston
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, British Columbia, V8P 5C2, Canada

- René L Warren
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, V5Z 4S6, Canada

- Caroline E Cameron
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, British Columbia, V8P 5C2, Canada

- Inanç Birol
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, V5Z 4S6, Canada

16
Abstract
Motivation: Sequencing of human genomes is now routine, and assembly of shotgun reads is increasingly feasible. However, assemblies often fail to inform about chromosome-scale structure due to a lack of linkage information over long stretches of DNA, a shortcoming that is being addressed by new sequencing protocols, such as the GemCode and Chromium linked reads from 10x Genomics. Results: Here, we present ARCS, an application that utilizes the barcoding information contained in linked reads to further organize draft genomes into highly contiguous assemblies. We show how the contiguity of an ABySS H. sapiens genome assembly can be increased over six-fold, using moderate coverage (25-fold) Chromium data. We expect ARCS to have broad utility in harnessing the barcoding information contained in linked read data for connecting high-quality sequences in genome assembly drafts. Availability and implementation: https://github.com/bcgsc/ARCS/ Supplementary information: Supplementary data are available at Bioinformatics online.
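To illustrate the general idea of using linked-read barcodes as scaffolding evidence, the sketch below counts barcodes shared between the ends of different contigs. It is a toy illustration under stated assumptions, not the ARCS implementation: the function name, the `end_size` and `min_shared` parameters, and the example alignments are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def link_contig_ends(alignments, contig_lengths, end_size=3, min_shared=2):
    """Count barcodes shared between ends of different contigs.

    alignments: iterable of (barcode, contig, position) tuples.
    Returns {((contigA, endA), (contigB, endB)): shared_barcode_count}
    for pairs supported by at least min_shared barcodes.
    """
    # Record which contig ends each barcode touches.
    ends_by_barcode = defaultdict(set)
    for barcode, contig, pos in alignments:
        if pos < end_size:
            ends_by_barcode[barcode].add((contig, "head"))
        if pos >= contig_lengths[contig] - end_size:
            ends_by_barcode[barcode].add((contig, "tail"))

    # Every barcode seen on two different contigs votes for that pair of ends.
    shared = defaultdict(int)
    for ends in ends_by_barcode.values():
        for e1, e2 in combinations(sorted(ends), 2):
            if e1[0] != e2[0]:
                shared[(e1, e2)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}

# Hypothetical toy data: three contigs of length 10 and three barcodes.
contig_lengths = {"c1": 10, "c2": 10, "c3": 10}
alignments = [
    ("bx1", "c1", 9), ("bx1", "c2", 0),
    ("bx2", "c1", 8), ("bx2", "c2", 1),
    ("bx3", "c1", 9), ("bx3", "c3", 0),
]
links = link_contig_ends(alignments, contig_lengths)
print(links)
```

In this toy input, two barcodes span the tail of `c1` and the head of `c2`, so that pair passes the threshold, while the single barcode shared with `c3` does not.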

17
Chu J, Mohamadi H, Warren RL, Yang C, Birol I. Innovations and challenges in detecting long read overlaps: an evaluation of the state-of-the-art. Bioinformatics 2017; 33:1261-1270. [PMID: 28003261 PMCID: PMC5408847 DOI: 10.1093/bioinformatics/btw811] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 09/03/2016] [Accepted: 12/16/2016] [Indexed: 01/23/2023]
Abstract
Identifying overlaps between error-prone long reads, specifically those from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PB), is essential for certain downstream applications, including error correction and de novo assembly. Though akin to the read-to-reference alignment problem, read-to-read overlap detection is a distinct problem that can benefit from specialized algorithms that perform efficiently and robustly on high error rate long reads. Here, we review the current state-of-the-art read-to-read overlap tools for error-prone long reads, including BLASR, DALIGNER, MHAP, GraphMap and Minimap. These specialized bioinformatics tools differ not just in their algorithmic designs and methodology, but also in their robustness of performance on a variety of datasets, time and memory efficiency and scalability. We highlight the algorithmic features of these tools, as well as their potential issues and biases when utilizing any particular method. To supplement our review of the algorithms, we benchmarked these tools, tracking their resource needs and computational performance, and assessed the specificity and precision of each. In the versions of the tools tested, we observed that Minimap is the most computationally efficient, specific and sensitive method on the ONT datasets tested; whereas GraphMap and DALIGNER are the most specific and sensitive methods on the tested PB datasets. The concepts surveyed may apply to future sequencing technologies, as scalability is becoming more relevant with increased sequencing throughput. Contact: cjustin@bcgsc.ca, ibirol@bcgsc.ca. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Justin Chu
- University of British Columbia, Vancouver, BC, Canada
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
- To whom correspondence should be addressed.

- Hamid Mohamadi
- University of British Columbia, Vancouver, BC, Canada
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada

- René L Warren
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada

- Chen Yang
- University of British Columbia, Vancouver, BC, Canada
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada

- Inanç Birol
- University of British Columbia, Vancouver, BC, Canada
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
- Simon Fraser University, Burnaby, BC, Canada
- To whom correspondence should be addressed.

18
Brown TM, Hammond SA, Behsaz B, Veldhoen N, Birol I, Helbing CC. De novo assembly of the ringed seal (Pusa hispida) blubber transcriptome: A tool that enables identification of molecular health indicators associated with PCB exposure. Aquat Toxicol 2017; 185:48-57. [PMID: 28187360 DOI: 10.1016/j.aquatox.2017.02.004] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 11/05/2016] [Revised: 02/02/2017] [Accepted: 02/03/2017] [Indexed: 06/06/2023]
Abstract
The ringed seal, Pusa hispida, is a keystone species in the Arctic marine ecosystem, and is proving a useful marine mammal for linking polychlorinated biphenyl (PCB) exposure to toxic injury. We report here the first de novo assembled transcriptome for the ringed seal (342,863 transcripts, of which 53% were annotated), which we then applied to a population of ringed seals exposed to a local PCB source in Arctic Labrador, Canada. We found an indication of energy metabolism imbalance in local ringed seals (n=4), and identified five significant gene transcript targets: plasminogen receptor (Plg-R(KT)), solute carrier family 25 member 43 receptor (Slc25a43), ankyrin repeat domain-containing protein 26-like receptor (Ankrd26), HIS30 (not yet annotated) and HIS16 (not yet annotated) that may represent indicators of PCB exposure and effects in marine mammals. The abundance profiles of these five gene targets were validated in blubber samples collected from 43 ringed seals using a qPCR assay. The mRNA transcript levels for all five gene targets, (Plg-R(KT), r²=0.43), (Slc25a43, r²=0.51), (Ankrd26, r²=0.43), (HIS30, r²=0.39) and (HIS16, r²=0.31) correlated with increasing levels of blubber PCBs. Results from the present study contribute to our understanding of PCB associated effects in marine mammals, and provide new tools for future molecular and toxicology work in pinnipeds.
Affiliation(s)
- Tanya M Brown
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, British Columbia V8W 3P6, Canada; Memorial University, St. John's, Newfoundland A1B 3X9, Canada

- S Austin Hammond
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, British Columbia V8W 3P6, Canada; Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada

- Bahar Behsaz
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada

- Nik Veldhoen
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, British Columbia V8W 3P6, Canada

- Inanç Birol
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada

- Caren C Helbing
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, British Columbia V8W 3P6, Canada

19
Pavy N, Lamothe M, Pelgas B, Gagnon F, Birol I, Bohlmann J, Mackay J, Isabel N, Bousquet J. A high-resolution reference genetic map positioning 8.8 K genes for the conifer white spruce: structural genomics implications and correspondence with physical distance. Plant J 2017; 90:189-203. [PMID: 28090692 DOI: 10.1111/tpj.13478] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 10/27/2016] [Revised: 12/23/2016] [Accepted: 01/03/2017] [Indexed: 05/21/2023]
Abstract
Over the last decade, extensive genetic and genomic resources have been developed for the conifer white spruce (Picea glauca, Pinaceae), which has one of the largest plant genomes (20 Gbp). Draft genome sequences of white spruce and other conifers have recently been produced, but dense genetic maps are needed to comprehend genome macrostructure, delineate regions involved in quantitative traits, complement functional genomic investigations, and assist the assembly of fragmented genomic sequences. A greatly expanded P. glauca composite linkage map was generated from a set of 1976 full-sib progeny, with the positioning of 8793 expressed genes. Regions with significant low or high gene density were identified. Gene family members tended to be mapped on the same chromosomes, with tandemly arrayed genes significantly biased towards specific functional classes. The map was integrated with transcriptome data surveyed across eight tissues. In total, 69 clusters of co-expressed and co-localising genes were identified. A high level of synteny was found with pine genetic maps, which should facilitate the transfer of structural information in the Pinaceae. Although the current white spruce genome sequence remains highly fragmented, dozens of scaffolds encompassing more than one mapped gene were identified. From these, the relationship between genetic and physical distances was examined and the genome-wide recombination rate was found to be much smaller than most estimates reported for angiosperm genomes. This gene linkage map shall assist the large-scale assembly of the next-generation white spruce genome sequence and provide a reference resource for the conifer genomics community.
Affiliation(s)
- Nathalie Pavy
- Canada Research Chair in Forest Genomics, Forest Research Centre and Institute for Systems and Integrative Biology, Université Laval, Québec, QC, G1V 0A6, Canada

- Manuel Lamothe
- Natural Resources Canada, Canadian Forest Service, Laurentian Forestry Centre, 1055 du P.E.P.S., P.O. Box 10380, Stn. Sainte-Foy, Québec, QC, G1V 4C7, Canada

- Betty Pelgas
- Canada Research Chair in Forest Genomics, Forest Research Centre and Institute for Systems and Integrative Biology, Université Laval, Québec, QC, G1V 0A6, Canada
- Natural Resources Canada, Canadian Forest Service, Laurentian Forestry Centre, 1055 du P.E.P.S., P.O. Box 10380, Stn. Sainte-Foy, Québec, QC, G1V 4C7, Canada

- France Gagnon
- Canada Research Chair in Forest Genomics, Forest Research Centre and Institute for Systems and Integrative Biology, Université Laval, Québec, QC, G1V 0A6, Canada

- Inanç Birol
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada

- Joerg Bohlmann
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada

- John Mackay
- Canada Research Chair in Forest Genomics, Forest Research Centre and Institute for Systems and Integrative Biology, Université Laval, Québec, QC, G1V 0A6, Canada
- Department of Plant Sciences, University of Oxford, South Parks Road, Oxford, OX1 3RB, UK

- Nathalie Isabel
- Canada Research Chair in Forest Genomics, Forest Research Centre and Institute for Systems and Integrative Biology, Université Laval, Québec, QC, G1V 0A6, Canada
- Natural Resources Canada, Canadian Forest Service, Laurentian Forestry Centre, 1055 du P.E.P.S., P.O. Box 10380, Stn. Sainte-Foy, Québec, QC, G1V 4C7, Canada

- Jean Bousquet
- Canada Research Chair in Forest Genomics, Forest Research Centre and Institute for Systems and Integrative Biology, Université Laval, Québec, QC, G1V 0A6, Canada

20
Yang C, Chu J, Warren RL, Birol I. NanoSim: nanopore sequence read simulator based on statistical characterization. Gigascience 2017; 6:1-6. [PMID: 28327957 PMCID: PMC5530317 DOI: 10.1093/gigascience/gix010] [Citation(s) in RCA: 106] [Impact Index Per Article: 15.1] [Received: 10/26/2016] [Revised: 01/12/2017] [Accepted: 02/21/2017] [Indexed: 01/19/2023]
Abstract
Background The MinION sequencing instrument from Oxford Nanopore Technologies (ONT) produces long read lengths from single-molecule sequencing - valuable features for detailed genome characterization. To realize the potential of this platform, a number of groups are developing bioinformatics tools tuned for the unique characteristics of its data. We note that these development efforts would benefit from a simulator software, the output of which could be used to benchmark analysis tools. Results Here, we introduce NanoSim, a fast and scalable read simulator that captures the technology-specific features of ONT data and allows for adjustments upon improvement of nanopore sequencing technology. The first step of NanoSim is read characterization, which provides a comprehensive alignment-based analysis and generates a set of read profiles serving as the input to the next step, the simulation stage. The simulation stage uses the model built in the previous step to produce in silico reads for a given reference genome. NanoSim is written in Python and R. The source files and manual are available at the Genome Sciences Centre website: http://www.bcgsc.ca/platform/bioinfo/software/nanosim. Conclusion In this work, we model the base-calling errors of ONT reads to inform the simulation of sequences with similar characteristics. We showcase the performance of NanoSim on publicly available datasets generated using the R7 and R7.3 chemistries and different sequencing kits and compare the resulting synthetic reads to those of other long-sequence simulators and experimental ONT reads. We expect NanoSim to have an enabling role in the field and benefit the development of scalable next-generation sequencing technologies for the long nanopore reads, including genome assembly, mutation detection, and even metagenomic analysis software.
Affiliation(s)
- Chen Yang
- Canada’s Michael Smith Genome Science Centre, British Columbia Cancer Agency, 570 W 7th Avenue, V5Z 4S6 Vancouver, Canada
- Faculty of Science, University of British Columbia, Vancouver, Canada

- Justin Chu
- Canada’s Michael Smith Genome Science Centre, British Columbia Cancer Agency, 570 W 7th Avenue, V5Z 4S6 Vancouver, Canada
- Faculty of Science, University of British Columbia, Vancouver, Canada

- René L Warren
- Canada’s Michael Smith Genome Science Centre, British Columbia Cancer Agency, 570 W 7th Avenue, V5Z 4S6 Vancouver, Canada

- Inanç Birol
- Canada’s Michael Smith Genome Science Centre, British Columbia Cancer Agency, 570 W 7th Avenue, V5Z 4S6 Vancouver, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, Canada
- School of Computer Science, Simon Fraser University, Burnaby, Canada

21
Feau N, Taylor G, Dale AL, Dhillon B, Bilodeau GJ, Birol I, Jones SJ, Hamelin RC. Genome sequences of six Phytophthora species threatening forest ecosystems. Genom Data 2016; 10:85-88. [PMID: 27752469 PMCID: PMC5061060 DOI: 10.1016/j.gdata.2016.09.013] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 09/14/2016] [Revised: 09/26/2016] [Accepted: 09/29/2016] [Indexed: 01/25/2023]
Abstract
The Phytophthora genus comprises some of the most destructive plant pathogens, which attack a wide range of hosts including economically valuable tree species, both angiosperm and gymnosperm. Many known species of Phytophthora are invasive and have been introduced through nursery and agricultural trade. As part of a larger project aimed at utilizing genomic data for forest disease diagnostics, pathogen detection and monitoring (The TAIGA project: Tree Aggressors Identification using Genomic Approaches; http://taigaforesthealth.com/), we sequenced the genomes of six Phytophthora species that are important invasive pathogens of trees and a serious threat to the international trade of forest products. These genomic data were used to develop highly sensitive and specific detection assays, to compare genomes and to make evolutionary inferences, and will be useful to the broader plant and tree health community. These WGS data have been deposited in the International Nucleotide Sequence Database Collaboration (DDBJ/ENA/GenBank) under the accession numbers AUPN01000000, AUVH01000000, AUWJ02000000, AUUF02000000, AWVV02000000 and AWVW02000000.
Affiliation(s)
- Nicolas Feau
- Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, British Columbia, Canada

- Greg Taylor
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada

- Angela L. Dale
- Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, British Columbia, Canada
- FPInnovations, Vancouver, British Columbia, Canada

- Braham Dhillon
- Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, British Columbia, Canada

- Inanç Birol
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada

- Steven J.M. Jones
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Vancouver, BC, Canada

- Richard C. Hamelin
- Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, British Columbia, Canada
- Institut de Biologie Intégrative des Systèmes, Université Laval, Québec, Canada

22
Jackman SD, Warren RL, Gibb EA, Vandervalk BP, Mohamadi H, Chu J, Raymond A, Pleasance S, Coope R, Wildung MR, Ritland CE, Bousquet J, Jones SJM, Bohlmann J, Birol I. Organellar Genomes of White Spruce (Picea glauca): Assembly and Annotation. Genome Biol Evol 2015; 8:29-41. [PMID: 26645680 PMCID: PMC4758241 DOI: 10.1093/gbe/evv244] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Indexed: 11/12/2022]
Abstract
The genome sequences of the plastid and mitochondrion of white spruce (Picea glauca) were assembled from whole-genome shotgun sequencing data using ABySS. The sequencing data contained reads from both the nuclear and organellar genomes, and reads of the organellar genomes were abundant in the data as each cell harbors hundreds of mitochondria and plastids. Hence, assembly of the 123-kb plastid and 5.9-Mb mitochondrial genomes was accomplished by analyzing data sets primarily representing low coverage of the nuclear genome. The assembled organellar genomes were annotated for their coding genes, ribosomal RNA, and transfer RNA. Transcript abundances of the mitochondrial genes were quantified in three developmental tissues and five mature tissues using data from RNA-seq experiments. C-to-U RNA editing was observed in the majority of mitochondrial genes, and in four genes, editing events were noted to modify ACG codons to create cryptic AUG start codons. The informatics methodology presented in this study should prove useful to assemble organellar genomes of other plant species using whole-genome shotgun sequencing data.
Affiliation(s)
- Shaun D Jackman
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada

- René L Warren
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada

- Ewan A Gibb
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada

- Benjamin P Vandervalk
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada

- Hamid Mohamadi
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada

- Justin Chu
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada

- Anthony Raymond
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada

- Stephen Pleasance
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada

- Robin Coope
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada

- Mark R Wildung
- School of Molecular Biosciences, Washington State University

- Carol E Ritland
- Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, BC, Canada

- Jean Bousquet
- Department of Forest and Environmental Genomics, Université Laval, Québec, QC, Canada

- Steven J M Jones
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
- School of Computing Science, Simon Fraser University, Burnaby, BC, Canada

- Joerg Bohlmann
- Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, BC, Canada
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada
- Department of Botany, University of British Columbia, Vancouver, BC, Canada

- Inanç Birol
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
- School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
- Department of Computer Science, University of British Columbia, Vancouver, BC, Canada

23
Vandervalk BP, Yang C, Xue Z, Raghavan K, Chu J, Mohamadi H, Jackman SD, Chiu R, Warren RL, Birol I. Konnector v2.0: pseudo-long reads from paired-end sequencing data. BMC Med Genomics 2015; 8 Suppl 3:S1. [PMID: 26399504 PMCID: PMC4582294 DOI: 10.1186/1755-8794-8-s3-s1] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 01/20/2023] Open
Abstract
Background Reading the nucleotides from two ends of a DNA fragment is called paired-end tag (PET) sequencing. When the fragment length is longer than the combined read length, there remains a gap of unsequenced nucleotides between read pairs. If the target in such experiments is sequenced at a level to provide redundant coverage, it may be possible to bridge these gaps using bioinformatics methods. Konnector is a local de novo assembly tool that addresses this problem. Here we report on version 2.0 of our tool. Results Konnector uses a probabilistic and memory-efficient data structure called a Bloom filter to represent a k-mer spectrum - all possible sequences of length k in an input file, such as the collection of reads in a PET sequencing experiment. It performs look-ups to this data structure to construct an implicit de Bruijn graph, which describes (k-1) base pair overlaps between adjacent k-mers. It traverses this graph to bridge the gap between a given pair of flanking sequences. Conclusions Here we report the performance of Konnector v2.0 on simulated and experimental datasets, and compare it against other tools with similar functionality. We note that, representing k-mers with 1.5 bytes of memory on average, Konnector can scale to very large genomes. With our parallel implementation, it can also process over a billion bases on commodity hardware.
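The idea described in the Results (a Bloom filter holding the k-mer spectrum, queried to walk an implicit de Bruijn graph between two flanking sequences) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Konnector's implementation: the class and function names, the SHA-256-based hashing, the filter size, and the breadth-first search are all choices made for this example.

```python
import hashlib
from collections import deque


class BloomFilter:
    """Bit array with several hash functions; lookups may give false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def load_kmers(reads, k):
    """Insert every k-mer of every read into a Bloom filter."""
    bf = BloomFilter()
    for read in reads:
        for i in range(len(read) - k + 1):
            bf.add(read[i:i + k])
    return bf


def successors(kmer, bf):
    """Neighbours in the implicit graph: k-mers overlapping by k-1 bases."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in bf]


def bridge(start_kmer, end_kmer, bf, max_len=200):
    """Breadth-first search for a sequence from start_kmer to end_kmer."""
    queue = deque([(start_kmer, start_kmer)])
    seen = {start_kmer}
    while queue:
        kmer, seq = queue.popleft()
        if kmer == end_kmer:
            return seq
        if len(seq) >= max_len:
            continue  # give up on paths longer than the allowed gap
        for nxt in successors(kmer, bf):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + nxt[-1]))
    return None
```

A call such as `bridge(left_flank[-k:], right_flank[:k], bf)` returns one connecting sequence or `None`. No explicit graph is stored: edges are discovered by querying the four possible one-base extensions, which is what keeps the memory footprint tied to the filter size rather than to the number of edges.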
|
24
|
Warren RL, Yang C, Vandervalk BP, Behsaz B, Lagman A, Jones SJM, Birol I. LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. Gigascience 2015; 4:35. [PMID: 26244089 PMCID: PMC4524009 DOI: 10.1186/s13742-015-0076-3] [Citation(s) in RCA: 121] [Impact Index Per Article: 13.4] [Received: 05/28/2015] [Accepted: 07/29/2015] [Indexed: 12/05/2022] Open
Abstract
Background Owing to the complexity of the assembly problem, we do not yet have complete genome sequences. The difficulty in assembling reads into finished genomes is exacerbated by sequence repeats and the inability of short reads to capture sufficient genomic information to resolve those problematic regions. In this regard, established and emerging long read technologies show great promise, but their current associated higher error rates typically require computational base correction and/or additional bioinformatics pre-processing before they can be of value. Results We present LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a method that makes use of the sequence properties of nanopore sequence data and other error-containing sequence data, to scaffold high-quality genome assemblies, without the need for read alignment or base correction. Here, we show how the contiguity of an ABySS Escherichia coli K-12 genome assembly can be increased greater than five-fold by the use of beta-released Oxford Nanopore Technologies Ltd. long reads and how LINKS leverages long-range information in Saccharomyces cerevisiae W303 nanopore reads to yield assemblies whose resulting contiguity and correctness are on par with or better than that of competing applications. We also present the re-scaffolding of the colossal white spruce (Picea glauca) draft assembly (PG29, 20 Gbp) and demonstrate how LINKS scales to larger genomes. Conclusions This study highlights the present utility of nanopore reads for genome scaffolding in spite of their current limitations, which are expected to diminish as the nanopore sequencing technology advances. We expect LINKS to have broad utility in harnessing the potential of long reads in connecting high-quality sequences of small and large genome assembly drafts. Electronic supplementary material The online version of this article (doi:10.1186/s13742-015-0076-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- René L Warren: BC Cancer Agency, Michael Smith Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6 Canada
- Chen Yang: BC Cancer Agency, Michael Smith Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6 Canada
- Benjamin P Vandervalk: BC Cancer Agency, Michael Smith Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6 Canada
- Bahar Behsaz: BC Cancer Agency, Michael Smith Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6 Canada
- Albert Lagman: BC Cancer Agency, Michael Smith Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6 Canada
- Steven J M Jones: BC Cancer Agency, Michael Smith Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6 Canada
- Inanç Birol: BC Cancer Agency, Michael Smith Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6 Canada
|
25
|
Abstract
BACKGROUND While next-generation sequencing technologies have made sequencing genomes faster and more affordable, deciphering the complete genome sequence of an organism remains a significant bioinformatics challenge, especially for large genomes. Low sequence coverage, repetitive elements and short read length make de novo genome assembly difficult, often resulting in sequence and/or fragment "gaps" - uncharacterized nucleotide (N) stretches of unknown or estimated lengths. Some of these gaps can be closed by re-processing latent information in the raw reads. Even though there are several tools for closing gaps, they do not easily scale up to processing billion base pair genomes. RESULTS Here we describe Sealer, a tool designed to close gaps within assembly scaffolds by navigating de Bruijn graphs represented by space-efficient Bloom filter data structures. We demonstrate how it scales to successfully close 50.8% and 13.8% of gaps in human (3 Gbp) and white spruce (20 Gbp) draft assemblies in under 30 and 27 h, respectively - a feat that is not possible with other leading tools with the breadth of data used in our study. CONCLUSION Sealer is an automated finishing application that uses the succinct Bloom filter representation of a de Bruijn graph to close gaps in draft assemblies, including that of very large genomes. We expect Sealer to have broad utility for finishing genomes across the tree of life, from bacterial genomes to large plant genomes and beyond. Sealer is available for download at https://github.com/bcgsc/abyss/tree/sealer-release.
Affiliation(s)
- Daniel Paulino: Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada.
- René L Warren: Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada.
- Benjamin P Vandervalk: Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada.
- Anthony Raymond: Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada.
- Shaun D Jackman: Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada.
- Inanç Birol: Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada; Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6H 3N1, Canada.
|
26
|
Birol I, Raymond A, Chiu R, Nip KM, Jackman SD, Kreitzman M, Docking TR, Ennis CA, Robertson AG, Karsan A. Kleat: cleavage site analysis of transcriptomes. Pac Symp Biocomput 2015:347-358. [PMID: 25592595 PMCID: PMC4350765] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2023]
Abstract
In eukaryotic cells, alternative cleavage of 3' untranslated regions (UTRs) can affect transcript stability, transport and translation. For polyadenylated (poly(A)) transcripts, cleavage sites can be characterized with short-read sequencing using specialized library construction methods. However, for large-scale cohort studies as well as for clinical sequencing applications, it is desirable to characterize such events using RNA-seq data, as the latter are already widely applied to identify other relevant information, such as mutations, alternative splicing and chimeric transcripts. Here we describe KLEAT, an analysis tool that uses de novo assembly of RNA-seq data to characterize cleavage sites on 3' UTRs. We demonstrate the performance of KLEAT on three cell line RNA-seq libraries constructed and sequenced by the ENCODE project, and assembled using Trans-ABySS. Validating the KLEAT predictions with matched ENCODE RNA-seq and RNA-PET libraries, we show that the tool has over 90% positive predictive value when there are at least three RNA-seq reads supporting a poly(A) tail and requiring at least three RNA-PET reads mapping within 100 nucleotides as validation. We also compare the performance of KLEAT with other popular RNA-seq analysis pipelines that reconstruct 3' UTR ends, and show that it performs favourably, based on an ROC-like curve.
Affiliation(s)
- Inanç Birol: Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada.
|
27
|
Chu J, Sadeghi S, Raymond A, Jackman SD, Nip KM, Mar R, Mohamadi H, Butterfield YS, Robertson AG, Birol I. BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters. Bioinformatics 2014; 30:3402-4. [PMID: 25143290 PMCID: PMC4816029 DOI: 10.1093/bioinformatics/btu558] [Citation(s) in RCA: 71] [Impact Index Per Article: 7.1] [Indexed: 12/31/2022]
Abstract
Large datasets can be screened for sequences from a specific organism, quickly and with low memory requirements, by a data structure that supports time- and memory-efficient set membership queries. Bloom filters offer such queries but require that false positives be controlled. We present BioBloom Tools, a Bloom filter-based sequence-screening tool that is faster than BWA, Bowtie 2 (popular alignment algorithms) and FACS (a membership query algorithm). It delivers accuracies comparable with these tools, controls false positives and has low memory requirements. Availability and implementation: www.bcgsc.ca/platform/bioinfo/software/biobloomtools Contact: cjustin@bcgsc.ca or ibirol@bcgsc.ca Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Justin Chu: Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada
- Sara Sadeghi: Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada
- Anthony Raymond: Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada
- Shaun D Jackman: Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada
- Ka Ming Nip: Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada
- Richard Mar: Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada
- Hamid Mohamadi: Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada
- Yaron S Butterfield: Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada
- A Gordon Robertson: Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada
- Inanç Birol: Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4S6, Canada
|
28
|
Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I, Boisvert S, Chapman JA, Chapuis G, Chikhi R, Chitsaz H, Chou WC, Corbeil J, Del Fabbro C, Docking TR, Durbin R, Earl D, Emrich S, Fedotov P, Fonseca NA, Ganapathy G, Gibbs RA, Gnerre S, Godzaridis E, Goldstein S, Haimel M, Hall G, Haussler D, Hiatt JB, Ho IY, Howard J, Hunt M, Jackman SD, Jaffe DB, Jarvis ED, Jiang H, Kazakov S, Kersey PJ, Kitzman JO, Knight JR, Koren S, Lam TW, Lavenier D, Laviolette F, Li Y, Li Z, Liu B, Liu Y, Luo R, Maccallum I, Macmanes MD, Maillet N, Melnikov S, Naquin D, Ning Z, Otto TD, Paten B, Paulo OS, Phillippy AM, Pina-Martins F, Place M, Przybylski D, Qin X, Qu C, Ribeiro FJ, Richards S, Rokhsar DS, Ruby JG, Scalabrin S, Schatz MC, Schwartz DC, Sergushichev A, Sharpe T, Shaw TI, Shendure J, Shi Y, Simpson JT, Song H, Tsarev F, Vezzi F, Vicedomini R, Vieira BM, Wang J, Worley KC, Yin S, Yiu SM, Yuan J, Zhang G, Zhang H, Zhou S, Korf IF. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2013; 2:10. [PMID: 23870653 PMCID: PMC3844414 DOI: 10.1186/2047-217x-2-10] [Citation(s) in RCA: 415] [Impact Index Per Article: 37.7] [Received: 01/23/2013] [Accepted: 07/15/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. RESULTS In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. CONCLUSIONS Many current genome assemblers produced useful assemblies, containing a significant representation of their genes and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.
|
29
|
Atanur SS, Birol I, Guryev V, Hirst M, Hummel O, Morrissey C, Behmoaras J, Fernandez-Suarez XM, Johnson MD, McLaren WM, Patone G, Petretto E, Plessy C, Rockland KS, Rockland C, Saar K, Zhao Y, Carninci P, Flicek P, Kurtz T, Cuppen E, Pravenec M, Hubner N, Jones SJM, Birney E, Aitman TJ. The genome sequence of the spontaneously hypertensive rat: Analysis and functional significance. Genome Res 2010; 20:791-803. [PMID: 20430781 DOI: 10.1101/gr.103499.109] [Citation(s) in RCA: 81] [Impact Index Per Article: 5.8] [Indexed: 11/24/2022]
Abstract
The spontaneously hypertensive rat (SHR) is the most widely studied animal model of hypertension. Scores of SHR quantitative trait loci (QTLs) have been mapped for hypertension and other phenotypes. We have sequenced the SHR/OlaIpcv genome at 10.7-fold coverage by paired-end sequencing on the Illumina platform. We identified 3.6 million high-quality single nucleotide polymorphisms (SNPs) between the SHR/OlaIpcv and Brown Norway (BN) reference genome, with a high rate of validation (sensitivity 96.3%-98.0% and specificity 99%-100%). We also identified 343,243 short indels between the SHR/OlaIpcv and reference genomes. These SNPs and indels resulted in 161 gain or loss of stop codons and 629 frameshifts compared with the BN reference sequence. We also identified 13,438 larger deletions that result in complete or partial absence of 107 genes in the SHR/OlaIpcv genome compared with the BN reference and 588 copy number variants (CNVs) that overlap with the gene regions of 688 genes. Genomic regions containing genes whose expression had been previously mapped as cis-regulated expression quantitative trait loci (eQTLs) were significantly enriched with SNPs, short indels, and larger deletions, suggesting that some of these variants have functional effects on gene expression. Genes that were affected by major alterations in their coding sequence were highly enriched for genes related to ion transport, transport, and plasma membrane localization, providing insights into the likely molecular and cellular basis of hypertension and other phenotypes specific to the SHR strain. This near complete catalog of genomic differences between two extensively studied rat strains provides the starting point for complete elucidation, at the molecular level, of the physiological and pathophysiological phenotypic differences between individuals from these strains.
Affiliation(s)
- Santosh S Atanur: Physiological Genomics and Medicine Group, Medical Research Council Clinical Sciences Centre, Faculty of Medicine, Imperial College London, Hammersmith Hospital, London W12 0NN, United Kingdom
|
30
|
Abstract
Gigabase-scale genome assemblies are now feasible using short-read sequencing technology, bringing the cost of such projects below the million-dollar mark.
Affiliation(s)
- Shaun D Jackman: Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia V5Z 4E6, Canada
|
31
|
Abstract
One bottleneck in large-scale genome sequencing projects is reconstructing the full genome sequence from the short subsequences produced by current technologies. The final stages of the genome assembly process inevitably require manual inspection of data inconsistencies and could be greatly aided by visualization. This paper presents our design decisions in translating key data features identified through discussions with analysts into a concise visual encoding. Current visualization tools in this domain focus on local sequence errors making high-level inspection of the assembly difficult if not impossible. We present a novel interactive graph display, ABySS-Explorer, that emphasizes the global assembly structure while also integrating salient data features such as sequence length. Our tool replaces manual and in some cases pen-and-paper based analysis tasks, and we discuss how user feedback was incorporated into iterative design refinements. Finally, we touch on applications of this representation not initially considered in our design phase, suggesting the generality of this encoding for DNA sequence data.
|
32
|
Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA. Circos: an information aesthetic for comparative genomics. Genome Res 2009; 19:1639-45. [PMID: 19541911 DOI: 10.1101/gr.092759.109] [Citation(s) in RCA: 6757] [Impact Index Per Article: 450.5] [Indexed: 11/24/2022]
Abstract
We created a visualization tool called Circos to facilitate the identification and analysis of similarities and differences arising from comparisons of genomes. Our tool is effective in displaying variation in genome structure and, generally, any other kind of positional relationships between genomic intervals. Such data are routinely produced by sequence alignments, hybridization arrays, genome mapping, and genotyping studies. Circos uses a circular ideogram layout to facilitate the display of relationships between pairs of positions by the use of ribbons, which encode the position, size, and orientation of related genomic elements. Circos is capable of displaying data as scatter, line, and histogram plots, heat maps, tiles, connectors, and text. Bitmap or vector images can be created from GFF-style data inputs and hierarchical configuration files, which can be easily generated by automated tools, making Circos suitable for rapid deployment in data analysis and reporting pipelines.
Affiliation(s)
- Martin Krzywinski: Canada's Michael Smith Genome Sciences Center, Vancouver, British Columbia V5Z 4S6, Canada.
|
33
|
Birol I, Jackman SD, Nielsen CB, Qian JQ, Varhol R, Stazyk G, Morin RD, Zhao Y, Hirst M, Schein JE, Horsman DE, Connors JM, Gascoyne RD, Marra MA, Jones SJM. De novo transcriptome assembly with ABySS. Bioinformatics 2009; 25:2872-7. [PMID: 19528083 DOI: 10.1093/bioinformatics/btp367] [Citation(s) in RCA: 295] [Impact Index Per Article: 19.7] [Indexed: 12/26/2022] Open
Abstract
MOTIVATION Whole transcriptome shotgun sequencing data from non-normalized samples offer unique opportunities to study the metabolic states of organisms. One can deduce gene expression levels using sequence coverage as a surrogate, identify coding changes or discover novel isoforms or transcripts. Especially for discovery of novel events, de novo assembly of transcriptomes is desirable. RESULTS Transcriptome from tumor tissue of a patient with follicular lymphoma was sequenced with 36 base pair (bp) single- and paired-end reads on the Illumina Genome Analyzer II platform. We assembled approximately 194 million reads using ABySS into 66 921 contigs 100 bp or longer, with a maximum contig length of 10 951 bp, representing over 30 million base pairs of unique transcriptome sequence, or roughly 1% of the genome. AVAILABILITY AND IMPLEMENTATION Source code and binaries of ABySS are freely available for download at http://www.bcgsc.ca/platform/bioinfo/software/abyss. Assembler tool is implemented in C++. The parallel version uses Open MPI. ABySS-Explorer tool is implemented in Java using the Java universal network/graph framework. CONTACT ibirol@bcgsc.ca.
Affiliation(s)
- Inanç Birol: Genome Sciences Centre, Vancouver, BC V5Z 4S6, Canada.
|
34
|
Abstract
Widespread adoption of massively parallel deoxyribonucleic acid (DNA) sequencing instruments has prompted the recent development of de novo short read assembly algorithms. A common shortcoming of the available tools is their inability to efficiently assemble vast amounts of data generated from large-scale sequencing projects, such as the sequencing of individual human genomes to catalog natural genetic variation. To address this limitation, we developed ABySS (Assembly By Short Sequences), a parallelized sequence assembler. As a demonstration of the capability of our software, we assembled 3.5 billion paired-end reads from the genome of an African male publicly released by Illumina, Inc. Approximately 2.76 million contigs ≥100 base pairs (bp) in length were created with an N50 size of 1499 bp, representing 68% of the reference human genome. Analysis of these contigs identified polymorphic and novel sequences not present in the human reference assembly, which were validated by alignment to alternate human assemblies and to other primate genomes.
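The N50 statistic quoted in this abstract is the contig length at which contigs of that length or longer account for at least half of the total assembled bases. A minimal Python sketch follows; the function name and the `min_length` parameter (mirroring the 100 bp cutoff mentioned above) are assumptions of this example and are not part of ABySS.

```python
def n50(lengths, min_length=0):
    """Return the N50 of a collection of contig lengths.

    Contigs shorter than min_length are ignored. Walking from the longest
    contig down, N50 is the length of the contig at which the running total
    first reaches half of the summed length.
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        raise ValueError("no contigs at or above min_length")
    half_total = sum(kept) / 2
    running = 0
    for length in kept:
        running += length
        if running >= half_total:
            return length


# Toy example: total length 32, so the halfway point is 16 bases.
example = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

In the toy example the two 8 bp contigs already cover half of the 32 assembled bases, so the N50 is 8. Because the statistic depends on which contigs are counted, reporting the length cutoff alongside it, as the abstract does, makes values comparable across assemblies.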
Affiliation(s)
- Jared T Simpson: Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia V5Z 4E6, Canada
|
35
|
Abstract
Information about the enzyme kinetics in a metabolic network will enable understanding of the function of the network and quantitative prediction of the network responses to genetic and environmental perturbations. Despite recent advances in experimental techniques, such information is limited and existing experimental data show extensive variation and they are based on in vitro experiments. In this article, we present a computational framework based on the well-established (log)linear formalism of metabolic control analysis. The framework employs a Monte Carlo sampling procedure to simulate the uncertainty in the kinetic data and applies statistical tools for the identification of the rate-limiting steps in metabolic networks. We applied the proposed framework to a branched biosynthetic pathway and the yeast glycolysis pathway. Analysis of the results allowed us to interpret and predict the responses of metabolic networks to genetic and environmental changes, and to gain insights on how uncertainty in the kinetic mechanisms and kinetic parameters propagate into the uncertainty in predicting network responses. Some of the practical applications of the proposed approach include the identification of drug targets for metabolic diseases and the guidance for design strategies in metabolic engineering for the purposeful manipulation of the metabolism of industrial organisms.
Affiliation(s)
- Liqing Wang: Department of Chemical and Biological Engineering, Northwestern University, Evanston, Illinois 60616, USA
|
36
|
Birol I, Parulekar SJ, Teymour F. Effect of environment partitioning on the survival and coexistence of autocatalytic replicators. Phys Rev E Stat Nonlin Soft Matter Phys 2002; 66:051916. [PMID: 12513532 DOI: 10.1103/physreve.66.051916] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Received: 03/14/2002] [Indexed: 05/24/2023]
Abstract
The paradigm of cubic autocatalytic replicators with decay in coupled isothermal continuous stirred tank reactors is selected as a model to study complex behavior in population dynamics of sexually reproducing species in a heterogeneous environment. It is shown that even a setup with a single species in two coupled environments may have regions in parameter space that result in chaotic behavior; hence, segregation in the environment causes complexity in the system dynamics. Furthermore, partitioning is found to lead to emergence phenomena exemplified by steady states not obtainable in the equivalent homogeneous system. These phenomena are illustrated through case studies involving single or multiple species. Results show that the coupled environments can host species that would not survive should the coupling be removed.
Affiliation(s)
- Inanç Birol: Department of Chemical Engineering, Northwestern University, 2145 Sheridan Road, Evanston, Illinois 60208, USA.
|
37
|
Birol G, Birol I, Kirdar B, Onsan ZI. Modeling of recombinant yeast cells: reduction of phase space. Biomed Sci Instrum 1998; 34:163-8. [PMID: 9603032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/07/2023]
Abstract
The mechanism of starch fermentation by recombinant Saccharomyces cerevisiae in a batch reactor is studied. Experiments were carried out in the presence and absence of oxygen, with different initial starch concentrations. A variety of data concerning biotic and abiotic phases are collected. Nonlinear data analysis techniques are used to determine the block diagram of the system under study. The data analysis and processing reported here are believed to form a basis for further work in structured modeling of biological systems, recombinant yeast cultures in particular.
Affiliation(s)
- G Birol: Boğaziçi University, Dept. of Chemical Eng., Istanbul, Turkey
|
38
|
Birol I, Hacinliyan A. Approximately conserved quantity in the Hénon-Heiles problem. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics 1995; 52:4750-4753. [PMID: 9963971 DOI: 10.1103/physreve.52.4750] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Indexed: 11/07/2022]
|