1
Seah BKB, Singh A, Vetter DE, Emmerich C, Peters M, Soltys V, Huettel B, Swart EC. Nuclear dualism without extensive DNA elimination in the ciliate Loxodes magnus. Proc Natl Acad Sci U S A 2024; 121:e2400503121. [PMID: 39298487 PMCID: PMC11441545 DOI: 10.1073/pnas.2400503121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2024] [Accepted: 08/08/2024] [Indexed: 09/21/2024] Open
Abstract
Most eukaryotes have one nucleus and nuclear genome per cell. Ciliates have instead evolved distinct nuclei that coexist in each cell: a silent germline vs. transcriptionally active somatic nuclei. In the best-studied model species, both nuclei can divide asexually, but only germline nuclei undergo meiosis and karyogamy during sex. Thereafter, thousands of DNA segments, called internally eliminated sequences (IESs), are excised from copies of the germline genomes to produce the streamlined somatic genome. In Loxodes, however, somatic nuclei cannot divide but instead develop from germline copies even during asexual cell division, which would incur a huge overhead cost if genome editing were required. Here, we purified and sequenced both genomes in Loxodes magnus to see whether their nondividing somatic nuclei are associated with differences in genome architecture. Unlike in other ciliates studied to date, we did not find canonical germline-limited IESs, implying Loxodes does not extensively edit its genomes. Instead, both genomes appear large and equivalent, replete with retrotransposons and repetitive sequences, unlike the compact, gene-rich somatic genomes of other ciliates. Two other hallmarks of nuclear development in ciliates (domesticated DDE-family transposases and editing-associated small RNAs) were also not found. Thus, among the ciliates, Loxodes genomes most resemble those of conventional eukaryotes. Nonetheless, base modifications, histone marks, and nucleosome positioning of vegetative Loxodes nuclei are consistent with functional differentiation between actively transcribed somatic vs. inactive germline nuclei. Given their phylogenetic position, it is likely that editing was present in the ancestral ciliate but secondarily lost in the Loxodes lineage.
Affiliation(s)
- Brandon K B Seah
  - Max Planck Institute for Biology, Tübingen 72076, Germany
  - Thünen Institute for Biodiversity, Braunschweig 38116, Germany
- Aditi Singh
  - Max Planck Institute for Biology, Tübingen 72076, Germany
- David E Vetter
  - Max Planck Institute for Biology, Tübingen 72076, Germany
  - Faculty of Science, Eberhard Karls Universität Tübingen, Tübingen 72076, Germany
- Moritz Peters
  - Max Planck Institute for Biology, Tübingen 72076, Germany
  - Friedrich Miescher Laboratory, Tübingen 72076, Germany
- Volker Soltys
  - Max Planck Institute for Biology, Tübingen 72076, Germany
  - Friedrich Miescher Laboratory, Tübingen 72076, Germany
- Bruno Huettel
  - Max Planck Genome Centre Cologne, Max Planck Institute for Plant Breeding Research, Cologne 50829, Germany
2
Stankovic S, Shekari S, Huang QQ, Gardner EJ, Ivarsdottir EV, Owens NDL, Mavaddat N, Azad A, Hawkes G, Kentistou KA, Beaumont RN, Day FR, Zhao Y, Jonsson H, Rafnar T, Tragante V, Sveinbjornsson G, Oddsson A, Styrkarsdottir U, Gudmundsson J, Stacey SN, Gudbjartsson DF, Kennedy K, Wood AR, Weedon MN, Ong KK, Wright CF, Hoffmann ER, Sulem P, Hurles ME, Ruth KS, Martin HC, Stefansson K, Perry JRB, Murray A. Genetic links between ovarian ageing, cancer risk and de novo mutation rates. Nature 2024; 633:608-614. [PMID: 39261734 PMCID: PMC11410666 DOI: 10.1038/s41586-024-07931-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/20/2022] [Accepted: 08/08/2024] [Indexed: 09/13/2024]
Abstract
Human genetic studies of common variants have provided substantial insight into the biological mechanisms that govern ovarian ageing [1]. Here we report analyses of rare protein-coding variants in 106,973 women from the UK Biobank study, implicating genes with effects around five times larger than previously found for common variants (ETAA1, ZNF518A, PNPLA8, PALB2 and SAMHD1). The SAMHD1 association reinforces the link between ovarian ageing and cancer susceptibility [1], with damaging germline variants being associated with extended reproductive lifespan and increased all-cause cancer risk in both men and women. Protein-truncating variants in ZNF518A are associated with shorter reproductive lifespan, that is, earlier age at menopause (by 5.61 years) and later age at menarche (by 0.56 years). Finally, using 8,089 sequenced trios from the 100,000 Genomes Project (100kGP), we observe that common genetic variants associated with earlier ovarian ageing associate with an increased rate of maternally derived de novo mutations. Although we were unable to replicate the finding in independent samples from the deCODE study, it is consistent with the expected role of DNA damage response genes in maintaining the genetic integrity of germ cells. This study provides evidence of genetic links between age of menopause and cancer risk.
Affiliation(s)
- Stasa Stankovic
  - MRC Epidemiology Unit, Wellcome-MRC Institute of Metabolic Science, University of Cambridge, Cambridge, UK
- Saleh Shekari
  - University of Exeter Medical School, University of Exeter, Exeter, UK
  - School of Public Health, Faculty of Medicine, University of Queensland, Brisbane, Queensland, Australia
- Qin Qin Huang
  - Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, UK
- Eugene J Gardner
  - MRC Epidemiology Unit, Wellcome-MRC Institute of Metabolic Science, University of Cambridge, Cambridge, UK
- Nick D L Owens
  - University of Exeter Medical School, University of Exeter, Exeter, UK
- Nasim Mavaddat
  - Centre for Cancer Genetic Epidemiology, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Ajuna Azad
  - DNRF Center for Chromosome Stability, Department of Cellular and Molecular Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Gareth Hawkes
  - University of Exeter Medical School, University of Exeter, Exeter, UK
- Katherine A Kentistou
  - MRC Epidemiology Unit, Wellcome-MRC Institute of Metabolic Science, University of Cambridge, Cambridge, UK
- Robin N Beaumont
  - University of Exeter Medical School, University of Exeter, Exeter, UK
- Felix R Day
  - MRC Epidemiology Unit, Wellcome-MRC Institute of Metabolic Science, University of Cambridge, Cambridge, UK
- Yajie Zhao
  - MRC Epidemiology Unit, Wellcome-MRC Institute of Metabolic Science, University of Cambridge, Cambridge, UK
- Kitale Kennedy
  - University of Exeter Medical School, University of Exeter, Exeter, UK
- Andrew R Wood
  - University of Exeter Medical School, University of Exeter, Exeter, UK
- Michael N Weedon
  - University of Exeter Medical School, University of Exeter, Exeter, UK
- Ken K Ong
  - MRC Epidemiology Unit, Wellcome-MRC Institute of Metabolic Science, University of Cambridge, Cambridge, UK
  - Department of Paediatrics, University of Cambridge, Cambridge, UK
- Caroline F Wright
  - University of Exeter Medical School, University of Exeter, Exeter, UK
- Eva R Hoffmann
  - DNRF Center for Chromosome Stability, Department of Cellular and Molecular Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Matthew E Hurles
  - Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, UK
- Katherine S Ruth
  - University of Exeter Medical School, University of Exeter, Exeter, UK
- Hilary C Martin
  - Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, UK
- John R B Perry
  - MRC Epidemiology Unit, Wellcome-MRC Institute of Metabolic Science, University of Cambridge, Cambridge, UK
  - Metabolic Research Laboratory, Wellcome-MRC Institute of Metabolic Science, University of Cambridge, Cambridge, UK
- Anna Murray
  - University of Exeter Medical School, University of Exeter, Exeter, UK
3
Yang C, Trivedi V, Dyson K, Gu T, Candelario KM, Yegorov O, Mitchell DA. Identification of tumor rejection antigens and the immunologic landscape of medulloblastoma. Genome Med 2024; 16:102. [PMID: 39160595 PMCID: PMC11331754 DOI: 10.1186/s13073-024-01363-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2023] [Accepted: 07/12/2024] [Indexed: 08/21/2024] Open
Abstract
BACKGROUND The current standard of care treatments for medulloblastoma are insufficient as these do not take tumor heterogeneity into account. Newer, safer, patient-specific treatment approaches are required to treat high-risk medulloblastoma patients who are not cured by the standard therapies. Immunotherapy is a promising treatment modality that could be key to improving survival and avoiding morbidity. For an effective immune response, appropriate tumor antigens must be targeted. While medulloblastoma patients with subgroup-specific genetic substitutions have been previously reported, the immunogenicity of these genetic alterations remains unknown. The aim of this study is to identify potential tumor rejection antigens for the development of antigen-directed cellular therapies for medulloblastoma. METHODS We developed a cancer immunogenomics pipeline and performed a comprehensive analysis of medulloblastoma subgroup-specific transcription profiles (n = 170, 18 WNT, 46 SHH, 41 Group 3, and 65 Group 4 patient tumors) available through International Cancer Genome Consortium (ICGC) and European Genome-Phenome Archive (EGA). We performed in silico antigen prediction across a broad array of antigen classes including neoantigens, tumor-associated antigens (TAAs), and fusion proteins. Furthermore, we evaluated the antigen processing and presentation pathway in tumor cells and the immune infiltrating cell landscape using the latest computational deconvolution methods. RESULTS Medulloblastoma patients were found to express multiple private and shared immunogenic antigens. The proportion of predicted TAAs was higher than neoantigens and gene fusions for all molecular subgroups, except for sonic hedgehog (SHH), which had a higher neoantigen burden. Importantly, cancer-testis antigens, as well as previously unappreciated neurodevelopmental antigens, were found to be expressed by most patients across all medulloblastoma subgroups. Despite being immunologically cold, medulloblastoma subgroups were found to have distinct immune cell gene signatures. CONCLUSIONS Using a custom antigen prediction pipeline, we identified potential tumor rejection antigens with important implications for the development of immunotherapy for medulloblastoma.
Affiliation(s)
- Changlin Yang
  - UF Brain Tumor Immunotherapy Program, Preston A. Wells Center for Brain Tumor Therapy, Lillian S. Wells Department of Neurosurgery, University of Florida, 1333 Center Drive, BSB B1-118, Gainesville, FL, 32610, USA
- Vrunda Trivedi
  - UF Brain Tumor Immunotherapy Program, Preston A. Wells Center for Brain Tumor Therapy, Lillian S. Wells Department of Neurosurgery, University of Florida, 1333 Center Drive, BSB B1-118, Gainesville, FL, 32610, USA
- Kyle Dyson
  - UF Brain Tumor Immunotherapy Program, Preston A. Wells Center for Brain Tumor Therapy, Lillian S. Wells Department of Neurosurgery, University of Florida, 1333 Center Drive, BSB B1-118, Gainesville, FL, 32610, USA
- Tongjun Gu
  - Department of Biostatistics, University of Florida, Gainesville, FL, USA
- Kate M Candelario
  - UF Brain Tumor Immunotherapy Program, Preston A. Wells Center for Brain Tumor Therapy, Lillian S. Wells Department of Neurosurgery, University of Florida, 1333 Center Drive, BSB B1-118, Gainesville, FL, 32610, USA
- Oleg Yegorov
  - UF Brain Tumor Immunotherapy Program, Preston A. Wells Center for Brain Tumor Therapy, Lillian S. Wells Department of Neurosurgery, University of Florida, 1333 Center Drive, BSB B1-118, Gainesville, FL, 32610, USA
- Duane A Mitchell
  - UF Brain Tumor Immunotherapy Program, Preston A. Wells Center for Brain Tumor Therapy, Lillian S. Wells Department of Neurosurgery, University of Florida, 1333 Center Drive, BSB B1-118, Gainesville, FL, 32610, USA
4
Rossini R, Oshaghi M, Nekrasov M, Bellanger A, Domaschenz R, Dijkwel Y, Abdelhalim M, Collas P, Tremethick D, Paulsen J. Loss of multi-level 3D genome organization during breast cancer progression. bioRxiv 2024:2023.11.26.568711. [PMID: 38076897 PMCID: PMC10705249 DOI: 10.1101/2023.11.26.568711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2023]
Abstract
Breast cancer entails intricate alterations in genome organization and expression. However, how three-dimensional (3D) chromatin structure changes in the progression from a normal to a breast cancer malignant state remains unknown. To address this, we conducted an analysis combining Hi-C data with lamina-associated domains (LADs), epigenomic marks, and gene expression in an in vitro model of breast cancer progression. Our results reveal that while the fundamental properties of topologically associating domains (TADs) are overall maintained, significant changes occur in the organization of compartments and subcompartments. These changes are closely correlated with alterations in the expression of oncogenic genes. We also observe a restructuring of TAD-TAD interactions, coinciding with a loss of spatial compartmentalization and radial positioning of the 3D genome. Notably, we identify a previously unrecognized interchromosomal insertion event, wherein a locus on chromosome 8 housing the MYC oncogene is inserted into a highly active subcompartment on chromosome 10. This insertion is accompanied by the formation of de novo enhancer contacts and activation of MYC, illustrating how structural genomic variants can alter the 3D genome to drive oncogenic states. In summary, our findings provide evidence for the loss of genome organization at multiple scales during breast cancer progression, revealing novel relationships between genome 3D structure and oncogenic processes.
Affiliation(s)
- Roberto Rossini
  - Department of Biosciences, Faculty of Mathematics and Natural Sciences, University of Oslo, 0316 Oslo, Norway
- Mohammadsaleh Oshaghi
  - Department of Biosciences, Faculty of Mathematics and Natural Sciences, University of Oslo, 0316 Oslo, Norway
- Maxim Nekrasov
  - Department of Genome Sciences, The John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia
- Aurélie Bellanger
  - Department of Molecular Medicine, Institute of Basic Medical Sciences, Faculty of Medicine, University of Oslo, 0317 Oslo, Norway
- Renae Domaschenz
  - Department of Genome Sciences, The John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia
- Yasmin Dijkwel
  - Department of Genome Sciences, The John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia
- Mohamed Abdelhalim
  - Department of Molecular Medicine, Institute of Basic Medical Sciences, Faculty of Medicine, University of Oslo, 0317 Oslo, Norway
- Philippe Collas
  - Department of Molecular Medicine, Institute of Basic Medical Sciences, Faculty of Medicine, University of Oslo, 0317 Oslo, Norway
  - Department of Immunology and Transfusion Medicine, Oslo University Hospital, 0424 Oslo, Norway
- David Tremethick
  - Department of Genome Sciences, The John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia
- Jonas Paulsen
  - Department of Biosciences, Faculty of Mathematics and Natural Sciences, University of Oslo, 0316 Oslo, Norway
  - Centre for Bioinformatics, Department of Informatics, University of Oslo, 0316 Oslo, Norway
5
Pan B, Bruno M, Macfarlan TS, Akera T. Meiosis-specific decoupling of the pericentromere from the kinetochore. bioRxiv 2024:2024.07.21.604490. [PMID: 39091844 PMCID: PMC11291024 DOI: 10.1101/2024.07.21.604490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/04/2024]
Abstract
The primary constriction site of the M-phase chromosome is an established marker for the kinetochore position, often used to determine the karyotype of each species. Underlying this observation is the concept that the kinetochore is spatially linked with the pericentromere where sister chromatids are most tightly cohered. Here, we found an unconventional pericentromere specification with sister chromatids mainly cohered at a chromosome end, spatially separated from the kinetochore in Peromyscus mouse oocytes. This distal locus enriched cohesin protectors, such as the Chromosomal Passenger Complex (CPC) and PP2A, at a higher level compared to its centromere/kinetochore region, acting as the primary site for sister-chromatid cohesion. Chromosomes with the distal cohesion site exhibited enhanced cohesin protection at anaphase I compared to those without it, implying that these distal cohesion sites may have evolved to ensure sister-chromatid cohesion during meiosis. In contrast, mitotic cells enriched CPC only near the kinetochore and the distal locus was not cohered between sister chromatids, suggesting a meiosis-specific mechanism to protect cohesin at this distal locus. We found that this distal locus corresponds to an additional centromeric satellite block, located far apart from the centromeric satellite block that builds the kinetochore. Several Peromyscus species carry chromosomes with two such centromeric satellite blocks. Analyses of three Peromyscus species revealed that the internal satellite consistently assembles the kinetochore in both mitosis and meiosis, whereas the distal satellite selectively enriches cohesin protectors in meiosis to promote sister-chromatid cohesion at that site. Thus, our study demonstrates that pericentromere specification is remarkably flexible and can control chromosome segregation in a cell-type- and context-dependent manner.
Affiliation(s)
- Bo Pan
  - Cell and Developmental Biology Center, National Heart, Lung, and Blood Institute, National Institutes of Health; Bethesda, Maryland 20894, USA
- Melania Bruno
  - The Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health; Bethesda, Maryland 20894, USA
- Todd S Macfarlan
  - The Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health; Bethesda, Maryland 20894, USA
- Takashi Akera
  - Cell and Developmental Biology Center, National Heart, Lung, and Blood Institute, National Institutes of Health; Bethesda, Maryland 20894, USA
6
Aguado-Puig Q, Doblas M, Matzoros C, Espinosa A, Moure JC, Marco-Sola S, Moreto M. WFA-GPU: gap-affine pairwise read-alignment using GPUs. Bioinformatics 2023; 39:btad701. [PMID: 37975878 PMCID: PMC10697739 DOI: 10.1093/bioinformatics/btad701] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2023] [Revised: 11/09/2023] [Accepted: 11/16/2023] [Indexed: 11/19/2023] Open
Abstract
MOTIVATION Advances in genomics and sequencing technologies demand faster and more scalable analysis methods that can process longer sequences with higher accuracy. However, classical pairwise alignment methods, based on dynamic programming (DP), impose impractical computational requirements to align long and noisy sequences like those produced by PacBio and Nanopore technologies. The recently proposed wavefront alignment (WFA) algorithm paves the way for more efficient alignment tools, improving time and memory complexity over previous methods. However, high-performance computing (HPC) platforms require efficient parallel algorithms and tools to exploit the computing resources available on modern accelerator-based architectures. RESULTS This paper presents WFA-GPU, a GPU (graphics processing unit)-accelerated tool to compute exact gap-affine alignments based on the WFA algorithm. We present the algorithmic adaptations and performance optimizations that allow exploiting the massively parallel capabilities of modern GPU devices to accelerate the alignment computations. In particular, we propose a CPU-GPU co-design capable of performing inter-sequence and intra-sequence parallel sequence alignment, combining a succinct WFA-data representation with an efficient GPU implementation. As a result, we demonstrate that our implementation outperforms the original multi-threaded WFA implementation by up to 4.3× and up to 18.2× when using heuristic methods on long and noisy sequences. Compared to other state-of-the-art tools and libraries, the WFA-GPU is up to 29× faster than other GPU implementations and up to four orders of magnitude faster than other CPU implementations. Furthermore, WFA-GPU is the only GPU solution capable of correctly aligning long reads using a commodity GPU. AVAILABILITY AND IMPLEMENTATION WFA-GPU code and documentation are publicly available at https://github.com/quim0/WFA-GPU.
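The wavefront idea named in this abstract can be illustrated with a minimal CPU sketch. It assumes unit-cost edit distance rather than the gap-affine scoring that WFA-GPU computes, and it makes no attempt at GPU parallelism; the function name `wavefront_edit_distance` is hypothetical and not part of the WFA-GPU code base. For each score, the sketch keeps only the furthest-reaching position on every diagonal and extends it through matching characters at no cost.

```python
def wavefront_edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance computed with furthest-reaching wavefronts."""
    n, m = len(a), len(b)
    target_diag = m - n  # diagonal k = j - i that contains the end cell (n, m)

    def extend(k: int, j: int) -> int:
        # Slide along diagonal k for free while characters match.
        i = j - k
        while i < n and j < m and a[i] == b[j]:
            i += 1
            j += 1
        return j

    # wavefront[k] = furthest position j in b reachable on diagonal k
    wavefront = {0: extend(0, 0)}
    score = 0
    while wavefront.get(target_diag, -1) < m:
        score += 1
        nxt = {}
        for k in range(min(wavefront) - 1, max(wavefront) + 2):
            candidates = []
            if k - 1 in wavefront:   # insertion: consume one character of b
                candidates.append(wavefront[k - 1] + 1)
            if k in wavefront:       # mismatch: consume one character of each
                candidates.append(wavefront[k] + 1)
            if k + 1 in wavefront:   # deletion: consume one character of a
                candidates.append(wavefront[k + 1])
            # Keep only cells (i, j) that lie inside the alignment matrix.
            valid = [j for j in candidates if 0 <= j <= m and 0 <= j - k <= n]
            if valid:
                nxt[k] = extend(k, max(valid))
        wavefront = nxt
    return score
```

The work done grows with the alignment score rather than with the full matrix size, which is why similar sequences align quickly under this scheme.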
Affiliation(s)
- Quim Aguado-Puig
  - Departament d’Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona 08193, Spain
- Max Doblas
  - Computer Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain
- Christos Matzoros
  - Computer Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain
- Antonio Espinosa
  - Departament d’Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona 08193, Spain
- Juan Carlos Moure
  - Departament d’Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona 08193, Spain
- Santiago Marco-Sola
  - Computer Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain
  - Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona 08034, Spain
- Miquel Moreto
  - Computer Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain
  - Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona 08034, Spain
7
Wei ZG, Bu PY, Zhang XD, Liu F, Qian Y, Wu FX. invMap: a sensitive mapping tool for long noisy reads with inversion structural variants. Bioinformatics 2023; 39:btad726. [PMID: 38058196 PMCID: PMC11320709 DOI: 10.1093/bioinformatics/btad726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Revised: 11/02/2023] [Accepted: 12/05/2023] [Indexed: 12/08/2023] Open
Abstract
MOTIVATION Longer reads produced by PacBio or Oxford Nanopore sequencers are more likely than shorter reads to span the breakpoints of structural variations (SVs). Therefore, existing long-read mapping methods often generate wrong alignments and variant calls. Compared to deletions and insertions, inversion events are more difficult to detect, since the anchors in inversion regions are nonlinear to those in SV-free regions. To address this issue, this study presents a novel long-read mapping algorithm named invMap. RESULTS For each long noisy read, invMap first locates the aligned region with a specifically designed scoring method for chaining, then checks the remaining anchors in the aligned region to discover potential inversions. We benchmark invMap on simulated datasets across different genomes and sequencing coverages; experimental results demonstrate that invMap locates aligned regions and calls inversion SVs more accurately than the competing methods. The real human genome sequencing dataset of NA12878 illustrates that invMap can effectively find more candidate variant calls for inversions than the competing methods. AVAILABILITY AND IMPLEMENTATION The invMap software is available at https://github.com/zhang134/invMap.git.
Affiliation(s)
- Ze-Gang Wei
  - School of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji 721016, China
  - Division of Biomedical Engineering, Department of Computer Science and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
- Peng-Yu Bu
  - School of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji 721016, China
- Xiao-Dan Zhang
  - School of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji 721016, China
- Fei Liu
  - School of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji 721016, China
- Yu Qian
  - School of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji 721016, China
- Fang-Xiang Wu
  - Division of Biomedical Engineering, Department of Computer Science and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
8
Borozan L, Rojas Ringeling F, Kao SY, Nikonova E, Monteagudo-Mesas P, Matijević D, Spletter ML, Canzar S. Counting pseudoalignments to novel splicing events. Bioinformatics 2023; 39:btad419. [PMID: 37432342 PMCID: PMC10348833 DOI: 10.1093/bioinformatics/btad419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2022] [Revised: 04/21/2023] [Accepted: 07/10/2023] [Indexed: 07/12/2023] Open
Abstract
MOTIVATION Alternative splicing (AS) of introns from pre-mRNA produces diverse sets of transcripts across cell types and tissues, but is also dysregulated in many diseases. Alignment-free computational methods have greatly accelerated the quantification of mRNA transcripts from short RNA-seq reads, but they inherently rely on a catalog of known transcripts and might miss novel, disease-specific splicing events. By contrast, alignment of reads to the genome can effectively identify novel exonic segments and introns. Event-based methods then count how many reads align to predefined features. However, an alignment is more expensive to compute and constitutes a bottleneck in many AS analysis methods. RESULTS Here, we propose fortuna, a method that guesses novel combinations of annotated splice sites to create transcript fragments. It then pseudoaligns reads to fragments using kallisto and efficiently derives counts of the most elementary splicing units from kallisto's equivalence classes. These counts can be directly used for AS analysis or summarized to larger units as used by other widely applied methods. In experiments on synthetic and real data, fortuna was around 7× faster than traditional align and count approaches, and was able to analyze almost 300 million reads in just 15 min when using four threads. It mapped reads containing mismatches more accurately across novel junctions and found more reads supporting aberrant splicing events in patients with autism spectrum disorder than existing methods. We further used fortuna to identify novel, tissue-specific splicing events in Drosophila. AVAILABILITY AND IMPLEMENTATION fortuna source code is available at https://github.com/canzarlab/fortuna.
Affiliation(s)
- Luka Borozan
  - Department of Mathematics, Josip Juraj Strossmayer University of Osijek, Osijek 31000, Croatia
- Francisca Rojas Ringeling
  - Gene Center, Ludwig-Maximilians-Universität München, Munich 81377, Germany
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, United States
- Shao-Yen Kao
  - Biomedical Center, Department of Physiological Chemistry, Ludwig-Maximilians-Universität München, Planegg-Martinsried 82152, Germany
- Elena Nikonova
  - Biomedical Center, Department of Physiological Chemistry, Ludwig-Maximilians-Universität München, Planegg-Martinsried 82152, Germany
- Domagoj Matijević
  - Department of Mathematics, Josip Juraj Strossmayer University of Osijek, Osijek 31000, Croatia
- Maria L Spletter
  - Biomedical Center, Department of Physiological Chemistry, Ludwig-Maximilians-Universität München, Planegg-Martinsried 82152, Germany
  - School of Science and Engineering, Division of Biological & Biomedical Systems, University of Missouri Kansas City, Kansas City, MO 64110, United States
- Stefan Canzar
  - Gene Center, Ludwig-Maximilians-Universität München, Munich 81377, Germany
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, United States
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, United States
9
Berger B, Yu YW. Navigating bottlenecks and trade-offs in genomic data analysis. Nat Rev Genet 2023; 24:235-250. [PMID: 36476810 PMCID: PMC10204111 DOI: 10.1038/s41576-022-00551-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 10/27/2022] [Indexed: 12/12/2022]
Abstract
Genome sequencing and analysis allow researchers to decode the functional information hidden in DNA sequences as well as to study cell to cell variation within a cell population. Traditionally, the primary bottleneck in genomic analysis pipelines has been the sequencing itself, which has been much more expensive than the computational analyses that follow. However, an important consequence of the continued drive to expand the throughput of sequencing platforms at lower cost is that often the analytical pipelines are struggling to keep up with the sheer amount of raw data produced. Computational cost and efficiency have thus become of ever increasing importance. Recent methodological advances, such as data sketching, accelerators and domain-specific libraries/languages, promise to address these modern computational challenges. However, despite being more efficient, these innovations come with a new set of trade-offs, both expected, such as accuracy versus memory and expense versus time, and more subtle, including the human expertise needed to use non-standard programming interfaces and set up complex infrastructure. In this Review, we discuss how to navigate these new methodological advances and their trade-offs.
Affiliation(s)
- Bonnie Berger
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Yun William Yu
  - Department of Computer and Mathematical Sciences, University of Toronto Scarborough, Toronto, Ontario, Canada
  - Tri-Campus Department of Mathematics, University of Toronto, Toronto, Ontario, Canada
10
Yu C, Zhao Y, Zhao C, Ma H, Wang G. DiagAF: A More Accurate and Efficient Pre-Alignment Filter for Sequence Alignment. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:3404-3415. [PMID: 34780330 DOI: 10.1109/tcbb.2021.3127879] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/13/2023]
Abstract
Sequence alignment is an essential step in computational genomics. More accurate and efficient pre-alignment methods, which run before the expensive computation required for final verification, are still urgently needed. In this article, we propose a more accurate and efficient pre-alignment algorithm for sequence alignment, called DiagAF. First, DiagAF uses a new lower bound of edit distance based on shift Hamming masks. The new lower bound uses fewer shift Hamming masks than state-of-the-art algorithms such as SHD and MAGNET. Moreover, it takes into account how the edit distance path switches between shift Hamming masks. Second, DiagAF can handle sequence pairs of unequal length, whereas state-of-the-art methods handle only pairs of equal length. Third, DiagAF can align sequences with early termination for true alignments. In our experiments, we compared DiagAF with state-of-the-art methods. DiagAF achieves a much lower error rate while also taking less time. We believe that the DiagAF algorithm can further improve the performance of state-of-the-art sequence alignment software. The source code of DiagAF can be downloaded from https://github.com/BioLab-cz/DiagAF.
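The general idea of filtering with shifted Hamming masks can be sketched in a few lines of Python. This is a deliberately simplified illustration in the spirit of SHD-style filters, not DiagAF's lower bound: the function names and the plain per-position AND are assumptions made here. A read base that mismatches on every diagonal within the edit threshold cannot be a matched base in any global alignment with at most that many edits, so counting such bases gives a cheap lower bound on the edit distance.

```python
def hamming_mask(read, ref, shift):
    """mask[i] is 0 if read[i] equals ref[i + shift], else 1.

    Positions that fall outside the reference count as mismatches.
    """
    mask = []
    for i, base in enumerate(read):
        j = i + shift
        mask.append(0 if 0 <= j < len(ref) and ref[j] == base else 1)
    return mask

def unmatched_lower_bound(read, ref, max_edits):
    """Count read positions that mismatch on every diagonal in [-max_edits, max_edits]."""
    combined = [1] * len(read)
    for shift in range(-max_edits, max_edits + 1):
        mask = hamming_mask(read, ref, shift)
        combined = [c & m for c, m in zip(combined, mask)]
    return sum(combined)

def passes_filter(read, ref, max_edits):
    """Return False when the pair can be rejected without full alignment."""
    return unmatched_lower_bound(read, ref, max_edits) <= max_edits
```

Pairs that pass still need verification by a full alignment; the filter only removes pairs that provably exceed the threshold, and this naive bound is much weaker than the ones used by published filters.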
|
11
|
Swain MT, Vickers M. Interpreting alignment-free sequence comparison: what makes a score a good score? NAR Genom Bioinform 2022; 4:lqac062. [PMID: 36071721 PMCID: PMC9442500 DOI: 10.1093/nargab/lqac062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2022] [Revised: 07/01/2022] [Accepted: 08/16/2022] [Indexed: 11/13/2022]
Abstract
Alignment-free methods are alternatives to alignment-based methods when searching sequence data sets. The output from an alignment-free sequence comparison is a similarity score, the interpretation of which is not straightforward. We propose objective functions to interpret and calibrate outputs from alignment-free searches, noting that different objective functions are necessary for different biological contexts. This leads to advantages: visualising and comparing score distributions, including those from true positives, may be a relatively simple method to gain insight into the performance of different metrics. Using an empirical approach with both DNA and protein sequences, we characterise different similarity score distributions generated under different parameters. In particular, we demonstrate how sequence length can affect the scores. We show that scores of true positive sequence pairs may correlate significantly with their mean length; and even if the correlation is weak, the relative difference in length of the sequence pair may significantly reduce the effectiveness of alignment-free metrics. Importantly, we show how objective functions can be used with test data to accurately estimate the probability of true positives. This can significantly increase the utility of alignment-free approaches. Finally, we have developed a general-purpose software tool called KAST for use in high-throughput workflows on Linux clusters.
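To make the length effect concrete, here is a small Python sketch of one common alignment-free score, the D2 statistic (the inner product of two k-mer count vectors), together with a cosine-normalised variant. D2 is used here only as a familiar example; it is an assumption of this illustration and not necessarily one of the metrics evaluated in the paper, and the function names are hypothetical.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(a, b, k):
    """Raw D2 score: sum over k-mers of count_a * count_b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * cb[km] for km, n in ca.items())

def d2_cosine(a, b, k):
    """D2 divided by the norms of both count vectors, which removes the raw length scaling."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    norm = (sum(n * n for n in ca.values()) * sum(n * n for n in cb.values())) ** 0.5
    return d2(a, b, k) / norm if norm else 0.0
```

Comparing `ACGT` with itself at k = 2 gives a raw score of 3, whereas comparing the doubled sequence `ACGTACGT` with itself gives 13, even though both pairs are identical matches. The normalised variant returns 1.0 in both cases, which is one simple way to see why raw scores need calibration against sequence length.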
Affiliation(s)
- Martin T Swain
- Department of Life Sciences, Aberystwyth University, Penglais, Aberystwyth, Ceredigion, SY23 3DA, UK
- Martin Vickers
- The John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
|
12
|
Galaxy Dnpatterntools for Computational Analysis of Nucleosome Positioning Sequence Patterns. Int J Mol Sci 2022; 23:ijms23094869. [PMID: 35563261 PMCID: PMC9102330 DOI: 10.3390/ijms23094869] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2022] [Revised: 04/25/2022] [Accepted: 04/26/2022] [Indexed: 01/25/2023]
Abstract
Nucleosomes are the basic units of DNA packing in eukaryotes. Their structure is well conserved from yeast to human and consists of the histone octamer core and 147 bp of DNA wrapped around it. Nucleosomes are bound to a majority of eukaryotic genomic DNA, including its regulatory regions. Hence, they also play a major role in gene regulation, for which their precise positioning on DNA is essential. In the present paper, we describe Galaxy dnpatterntools, a software package for nucleosome DNA sequence analysis and mapping. This software will be useful for computational biology practitioners conducting more in-depth studies of gene regulatory mechanisms.
|
13
|
Patel RS, Romero R, Watson EV, Liang AC, Burger M, Westcott PMK, Mercer KL, Bronson RT, Wooten EC, Bhutkar A, Jacks T, Elledge SJ. A GATA4-regulated secretory program suppresses tumors through recruitment of cytotoxic CD8 T cells. Nat Commun 2022; 13:256. [PMID: 35017504 PMCID: PMC8752777 DOI: 10.1038/s41467-021-27731-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 07/22/2021] [Accepted: 12/06/2021] [Indexed: 12/11/2022]
Abstract
The GATA4 transcription factor acts as a master regulator of development of multiple tissues. GATA4 also acts in a distinct capacity to control a stress-inducible pro-inflammatory secretory program that is associated with senescence, a potent tumor suppression mechanism, but also operates in non-senescent contexts such as tumorigenesis. This secretory pathway is composed of chemokines, cytokines, growth factors, and proteases. Since GATA4 is deleted or epigenetically silenced in cancer, here we examine the role of GATA4 in tumorigenesis in mouse models through both loss-of-function and overexpression experiments. We find that GATA4 promotes non-cell autonomous tumor suppression in multiple model systems. Mechanistically, we show that Gata4-dependent tumor suppression requires cytotoxic CD8 T cells and partially requires the secreted chemokine CCL2. Analysis of transcriptome data in human tumors reveals reduced lymphocyte infiltration in GATA4-deficient tumors, consistent with our murine data. Notably, activation of the GATA4-dependent secretory program combined with an anti-PD-1 antibody robustly abrogates tumor growth in vivo.
Affiliation(s)
- Rupesh S Patel
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Scripps Green Hospital, San Diego, CA, USA
- Rodrigo Romero
- David H. Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
- Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Emma V Watson
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Anthony C Liang
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Megan Burger
- David H. Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- Peter M K Westcott
- David H. Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- Kim L Mercer
- David H. Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- Eric C Wooten
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Arjun Bhutkar
- David H. Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- Tyler Jacks
- David H. Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
- Stephen J Elledge
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
|
14
|
Abstract
Ancient polyploidy events are widely distributed across the evolutionary history of eukaryotes. Here, we describe a likelihood-based tool, POInT (the Polyploidy Orthology Inference Tool), for modeling ancient whole genome duplications and triplications, assigning homoeologous genes to subgenomes and inferring gene losses across different parental subgenomes after polyploidy.
Affiliation(s)
- Yue Hao
- Biodesign Center for Mechanisms of Evolution, Arizona State University, Tempe, AZ, USA
- Gavin C Conant
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA.
- Program in Genetics, North Carolina State University, Raleigh, NC, USA.
- Department of Biological Sciences, North Carolina State University, Raleigh, NC, USA.
|
15
|
Flores SC, Alexiou A, Glaros A. Mining the Protein Data Bank to improve prediction of changes in protein-protein binding. PLoS One 2021; 16:e0257614. [PMID: 34727109 PMCID: PMC8562805 DOI: 10.1371/journal.pone.0257614] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/21/2021] [Accepted: 09/05/2021] [Indexed: 12/23/2022]
Abstract
Predicting the effect of mutations on protein-protein interactions is important for relating structure to function, as well as for in silico affinity maturation. The effect of mutations on protein-protein binding energy (ΔΔG) can be predicted by a variety of atomic simulation methods involving full or limited flexibility, and explicit or implicit solvent. Methods that consider only limited flexibility are naturally more economical, and many of them are quite accurate; however, their results depend on the atomic coordinate set used. In this work we perform a sequence- and structure-based search of the Protein Data Bank to find additional coordinate sets and repeat the calculation on each. The method increases precision and Positive Predictive Value, and decreases Root Mean Square Error, compared to using single structures. Given the ongoing growth of near-redundant structures in the Protein Data Bank, our method will only increase in applicability and accuracy.
Affiliation(s)
- Athanasios Alexiou
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Volos, Greece
- Anastasios Glaros
- Eukaryotic Single Cell Genomics Facility, Science For Life Laboratory, Stockholm, Sweden
|
16
|
Uhlitz F, Bischoff P, Peidli S, Sieber A, Trinks A, Lüthen M, Obermayer B, Blanc E, Ruchiy Y, Sell T, Mamlouk S, Arsie R, Wei T, Klotz‐Noack K, Schwarz RF, Sawitzki B, Kamphues C, Beule D, Landthaler M, Sers C, Horst D, Blüthgen N, Morkel M. Mitogen-activated protein kinase activity drives cell trajectories in colorectal cancer. EMBO Mol Med 2021; 13:e14123. [PMID: 34409732 PMCID: PMC8495451 DOI: 10.15252/emmm.202114123] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Received: 02/11/2021] [Revised: 07/27/2021] [Accepted: 07/30/2021] [Indexed: 01/07/2023]
Abstract
In colorectal cancer, oncogenic mutations transform a hierarchically organized and homeostatic epithelium into invasive cancer tissue lacking visible organization. We sought to define transcriptional states of colorectal cancer cells and signals controlling their development by performing single-cell transcriptome analysis of tumors and matched non-cancerous tissues of twelve colorectal cancer patients. We defined patient-overarching colorectal cancer cell clusters characterized by differential activities of oncogenic signaling pathways such as mitogen-activated protein kinase and oncogenic traits such as replication stress. RNA metabolic labeling and assessment of RNA velocity in patient-derived organoids revealed developmental trajectories of colorectal cancer cells organized along a mitogen-activated protein kinase activity gradient. This was in contrast to normal colon organoid cells developing along graded Wnt activity. Experimental targeting of EGFR-BRAF-MEK in cancer organoids affected signaling and gene expression contingent on predictive KRAS/BRAF mutations and induced cell plasticity overriding default developmental trajectories. Our results highlight directional cancer cell development as a driver of non-genetic cancer cell heterogeneity and re-routing of trajectories as a response to targeted therapy.
Affiliation(s)
- Florian Uhlitz
- Institute of Pathology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- IRI Life Sciences, Humboldt University of Berlin, Berlin, Germany
- German Cancer Consortium (DKTK) Partner Site Berlin, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Philip Bischoff
- Institute of Pathology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- Stefan Peidli
- Institute of Pathology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- IRI Life Sciences, Humboldt University of Berlin, Berlin, Germany
- Anja Sieber
- Institute of Pathology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- IRI Life Sciences, Humboldt University of Berlin, Berlin, Germany
- Alexandra Trinks
- Institute of Pathology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- BIH Bioportal Single Cells, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
- Mareen Lüthen
- Institute of Pathology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- German Cancer Consortium (DKTK) Partner Site Berlin, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Benedikt Obermayer
- Core Unit Bioinformatics (CUBI), Berlin Institute of Health at Charité Universitätsmedizin – Berlin, Berlin, Germany
- Eric Blanc
- Core Unit Bioinformatics (CUBI), Berlin Institute of Health at Charité Universitätsmedizin – Berlin, Berlin, Germany
- Yana Ruchiy
- Institute of Pathology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- Thomas Sell
- Institute of Pathology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- IRI Life Sciences, Humboldt University of Berlin, Berlin, Germany
- Soulafa Mamlouk
- Institute of Pathology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- German Cancer Consortium (DKTK) Partner Site Berlin, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Roberto Arsie
- Max Delbrück Center for Molecular Medicine, Berlin Institute for Medical Systems Biology (BIMSB), Berlin, Germany
- Tzu‐Ting Wei
- Institute of Pathology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- Max Delbrück Center for Molecular Medicine, Berlin Institute for Medical Systems Biology (BIMSB), Berlin, Germany
- Kathleen Klotz‐Noack
- Institute of Pathology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- Institute of Medical Immunology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- Roland F Schwarz
- Max Delbrück Center for Molecular Medicine, Berlin Institute for Medical Systems Biology (BIMSB), Berlin, Germany
- BIFOLD – Berlin Institute for the Foundations of Learning and Data, Berlin, Germany
- Birgit Sawitzki
- Institute of Medical Immunology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- Carsten Kamphues
- German Cancer Consortium (DKTK) Partner Site Berlin, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Department of Surgery, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- Dieter Beule
- Core Unit Bioinformatics (CUBI), Berlin Institute of Health at Charité Universitätsmedizin – Berlin, Berlin, Germany
- Markus Landthaler
- Max Delbrück Center for Molecular Medicine, Berlin Institute for Medical Systems Biology (BIMSB), Berlin, Germany
- Christine Sers
- Institute of Pathology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- German Cancer Consortium (DKTK) Partner Site Berlin, German Cancer Research Center (DKFZ), Heidelberg, Germany
- David Horst
- Institute of Pathology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- German Cancer Consortium (DKTK) Partner Site Berlin, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Nils Blüthgen
- Institute of Pathology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- IRI Life Sciences, Humboldt University of Berlin, Berlin, Germany
- German Cancer Consortium (DKTK) Partner Site Berlin, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Markus Morkel
- Institute of Pathology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt‐Universität zu Berlin, Berlin, Germany
- German Cancer Consortium (DKTK) Partner Site Berlin, German Cancer Research Center (DKFZ), Heidelberg, Germany
- BIH Bioportal Single Cells, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
|
17
|
Shajii A, Numanagić I, Leighton AT, Greenyer H, Amarasinghe S, Berger B. A Python-based programming language for high-performance computational genomics. Nat Biotechnol 2021; 39:1062-1064. [PMID: 34282326 PMCID: PMC8542382 DOI: 10.1038/s41587-021-00985-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 01/07/2023]
Affiliation(s)
- Ariya Shajii
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Ibrahim Numanagić
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Computer Science, University of Victoria, Victoria, British Columbia, Canada
- Alexander T Leighton
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
- Haley Greenyer
- Department of Computer Science, University of Victoria, Victoria, British Columbia, Canada
- Saman Amarasinghe
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA.
|
18
|
Eggertsson HP, Halldorsson BV. read_haps: using read haplotypes to detect same species contamination in DNA sequences. Bioinformatics 2021; 37:2215-2217. [PMID: 33135043 DOI: 10.1093/bioinformatics/btaa936] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/10/2020] [Revised: 09/10/2020] [Accepted: 10/22/2020] [Indexed: 11/12/2022]
Abstract
MOTIVATION Data analysis requires reliable data. In genetics, this includes verifying that a sample is not contaminated with another, a problem ubiquitous in biology. RESULTS In humans and other diploid species, DNA contamination from the same species can be detected by the presence of three haplotypes between polymorphic SNPs. read_haps is a tool that detects sample contamination from short-read whole-genome sequencing data. AVAILABILITY AND IMPLEMENTATION github.com/DecodeGenetics/read_haps.
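The three-haplotype signal can be illustrated with a toy Python sketch. A diploid sample carries at most two haplotypes across any pair of SNPs, so reads that support a third combination of alleles hint at DNA from another individual. The input format (one dict per read, mapping SNP position to observed allele), the function names, and the `min_support` threshold are assumptions made for this illustration and do not describe how read_haps itself is implemented.

```python
from collections import Counter

def haplotypes_at_snp_pair(reads, pos_a, pos_b, min_support=2):
    """Return allele pairs seen on at least `min_support` reads spanning both SNPs.

    reads: list of dicts mapping SNP position -> observed allele (hypothetical format).
    """
    counts = Counter(
        (read[pos_a], read[pos_b])
        for read in reads
        if pos_a in read and pos_b in read
    )
    return {hap for hap, n in counts.items() if n >= min_support}

def looks_contaminated(reads, pos_a, pos_b, min_support=2):
    """Flag a SNP pair when three or more haplotypes are well supported."""
    return len(haplotypes_at_snp_pair(reads, pos_a, pos_b, min_support)) >= 3
```

Requiring several supporting reads per haplotype is a crude guard against sequencing errors, which would otherwise create spurious third haplotypes; a real tool would also have to aggregate evidence over many SNP pairs.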
Affiliation(s)
- Bjarni V Halldorsson
- deCODE Genetics, Reykjavík 102, Iceland
- Department of Engineering, School of Technology, Reykjavík University, Reykjavík 102, Iceland
|
19
|
Smith SR, Normandeau E, Djambazian H, Nawarathna PM, Berube P, Muir AM, Ragoussis J, Penney CM, Scribner KT, Luikart G, Wilson CC, Bernatchez L. A chromosome-anchored genome assembly for Lake Trout (Salvelinus namaycush). Mol Ecol Resour 2021; 22:679-694. [PMID: 34351050 PMCID: PMC9291852 DOI: 10.1111/1755-0998.13483] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/23/2021] [Revised: 07/25/2021] [Accepted: 07/28/2021] [Indexed: 01/23/2023]
Abstract
Here, we present an annotated, chromosome‐anchored, genome assembly for Lake Trout (Salvelinus namaycush) – a highly diverse salmonid species of notable conservation concern and an excellent model for research on adaptation and speciation. We leveraged Pacific Biosciences long‐read sequencing, paired‐end Illumina sequencing, proximity ligation (Hi‐C) sequencing, and a previously published linkage map to produce a highly contiguous assembly composed of 7378 contigs (contig N50 = 1.8 Mb) assigned to 4120 scaffolds (scaffold N50 = 44.975 Mb). Long read sequencing data were generated using DNA from a female double haploid individual. 84.7% of the genome was assigned to 42 chromosome‐sized scaffolds and 93.2% of Benchmarking Universal Single Copy Orthologues were recovered, putting this assembly on par with the best currently available salmonid genomes. Estimates of genome size based on k‐mer frequency analysis were highly similar to the total size of the finished genome, suggesting that the entirety of the genome was recovered. A mitochondrial genome assembly was also produced. Self‐versus‐self synteny analysis allowed us to identify homeologs resulting from the salmonid specific autotetraploid event (Ss4R) as well as regions exhibiting delayed rediploidization. Alignment with three other salmonid genomes and the Northern Pike (Esox lucius) genome also allowed us to identify homologous chromosomes in related taxa. We also generated multiple resources useful for future genomic research on Lake Trout, including a repeat library and a sex‐averaged recombination map. A novel RNA sequencing data set for liver tissue was also generated in order to produce a publicly available set of annotations for 49,668 genes and pseudogenes. Potential applications of these resources to population genetics and the conservation of native populations are discussed.
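The contig and scaffold N50 values quoted above follow the standard definition: N50 is the length L such that sequences of length at least L together contain at least half of all assembled bases. A minimal Python sketch of that computation is given below; the function name and the toy lengths used to exercise it are illustrative and are not data from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")
```

For toy lengths 2, 3, 4, 5 and 6 (20 bases in total), the two longest sequences already cover 11 bases, so the N50 is 5.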
Affiliation(s)
- Seth R Smith
- Department of Integrative Biology, Michigan State University, East Lansing, MI, USA
- Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, MI, USA
- Eric Normandeau
- Institut de Biologie Intégrative et des Systèmes, Université Laval, Quebec, QC, Canada
- Haig Djambazian
- McGill Genome Centre, Department of Human Genetics, Montreal, QC, Canada
- Pubudu M Nawarathna
- Department of Human Genetics, Canadian Centre for Computational Genomics (C3G), McGill University, Montréal, QC, Canada
- Pierre Berube
- McGill Genome Centre, Department of Human Genetics, Montreal, QC, Canada
- Jiannis Ragoussis
- McGill Genome Centre, Department of Human Genetics, Montreal, QC, Canada
- Chantelle M Penney
- Environmental and Life Sciences Graduate Program, Trent University, Peterborough, ON, Canada
- Kim T Scribner
- Department of Integrative Biology, Michigan State University, East Lansing, MI, USA
- Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, MI, USA
- Department of Fisheries and Wildlife, Michigan State University, East Lansing, MI, USA
- Gordon Luikart
- Fish and Wildlife Genomics Group, University of Montana, Missoula, MT, USA
- Flathead Lake Biological Station, Division of Biological Sciences, University of Montana, Polson, MT, USA
- Chris C Wilson
- Aquatic Research and Monitoring Section, Ontario Ministry of Natural Resources and Forestry, Peterborough, ON, Canada
- Louis Bernatchez
- Institut de Biologie Intégrative et des Systèmes, Université Laval, Quebec, QC, Canada
|
20
|
Wang Q, Boenigk S, Boehm V, Gehring NH, Altmueller J, Dieterich C. Single cell transcriptome sequencing on the Nanopore platform with ScNapBar. RNA (New York, N.Y.) 2021; 27:rna.078154.120. [PMID: 33906975 PMCID: PMC8208055 DOI: 10.1261/rna.078154.120] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 10/16/2020] [Accepted: 04/20/2021] [Indexed: 06/12/2023]
Abstract
The current ecosystem of single cell RNA-seq platforms is rapidly expanding, but robust solutions for single cell and single molecule full-length RNA sequencing are virtually absent. A high-throughput solution that covers all aspects is necessary to study the complex life of mRNA on the single cell level. The Nanopore platform offers long read sequencing and can be integrated with the popular single cell sequencing method on the 10x Chromium platform. However, the high error rate of Nanopore reads poses a challenge in downstream processing (e.g. for cell barcode assignment). We propose a solution to this particular problem by using a hybrid sequencing approach on Nanopore and Illumina platforms. Our software ScNapBar enables cell barcode assignment with high accuracy, especially if sequencing saturation is low. ScNapBar uses unique molecular identifier (UMI) or Naïve Bayes probabilistic approaches in the barcode assignment, depending on the available Illumina sequencing depth. We have benchmarked the two approaches on simulated and real Nanopore datasets. We further applied ScNapBar to pools of cells with an active or a silenced nonsense-mediated RNA decay pathway. Our Nanopore read assignment distinguishes the respective cell populations and reveals characteristic nonsense-mediated mRNA decay events depending on cell status.
Affiliation(s)
- Qi Wang
- Klaus Tschira Institute for Integrative Computational Cardiology, University Hospital Heidelberg
- Sven Boenigk
- Klaus Tschira Institute for Integrative Computational Cardiology, University Hospital Heidelberg
|
21
|
Broseus L, Thomas A, Oldfield AJ, Severac D, Dubois E, Ritchie W. TALC: Transcript-level Aware Long-read Correction. Bioinformatics 2021; 36:5000-5006. [PMID: 32910174 DOI: 10.1093/bioinformatics/btaa634] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/03/2020] [Revised: 05/08/2020] [Accepted: 07/09/2020] [Indexed: 02/06/2023]
Abstract
MOTIVATION Long-read sequencing technologies are invaluable for determining complex RNA transcript architectures but are error-prone. Numerous 'hybrid correction' algorithms have been developed for genomic data that correct long reads by exploiting the accuracy and depth of short reads sequenced from the same sample. These algorithms are not suited for correcting more complex transcriptome sequencing data. RESULTS We have created a novel reference-free algorithm called Transcript-level Aware Long-Read Correction (TALC) which models changes in RNA expression and isoform representation in a weighted De Bruijn graph to correct long reads from transcriptome studies. We show that transcript-level aware correction by TALC improves the accuracy of the whole spectrum of downstream RNA-seq applications and is thus necessary for transcriptome analyses that use long read technology. AVAILABILITY AND IMPLEMENTATION TALC is implemented in C++ and available at https://github.com/lbroseus/TALC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
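The data structure named in the abstract, a weighted De Bruijn graph, can be sketched in Python as follows. Nodes are (k-1)-mers, each k-mer seen in the short reads adds one unit of weight to the edge between its prefix and suffix, and a path through high-weight edges spells out a well-supported sequence. Using plain k-mer counts as weights and following the heaviest edge greedily are simplifying assumptions of this illustration; TALC's actual weighting and path search, which account for expression and isoform structure, are not reproduced here.

```python
from collections import defaultdict

def weighted_de_bruijn(reads, k):
    """Build a graph whose edge weights count how often each k-mer occurs in reads."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph

def heaviest_path(graph, start, steps):
    """Greedily follow the highest-weight outgoing edge for at most `steps` steps."""
    node = start
    path = start
    for _ in range(steps):
        successors = graph.get(node)
        if not successors:
            break
        node = max(successors, key=successors.get)
        path += node[-1]
    return path
```

With the toy reads `ACGTT`, `ACGTT` and `ACGTA` and k = 3, the edge from `GT` to `TT` has weight 2 and the edge from `GT` to `TA` has weight 1, so the walk starting at `AC` spells `ACGTT`, the variant supported by more reads.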
Affiliation(s)
- Lucile Broseus
- Department of Genome Dynamics, Institut de Génétique Humaine, Centre National de la Recherche Scientifique (CNRS), Université de Montpellier, Montpellier 34396, France
- Aubin Thomas
- Department of Genome Dynamics, Institut de Génétique Humaine, Centre National de la Recherche Scientifique (CNRS), Université de Montpellier, Montpellier 34396, France
- Andrew J Oldfield
- Department of Genome Dynamics, Institut de Génétique Humaine, Centre National de la Recherche Scientifique (CNRS), Université de Montpellier, Montpellier 34396, France
- Dany Severac
- MGX-Montpellier GenomiX, c/o Institut de Génomique Fonctionnelle, Montpellier Cedex 5 34094, France
- Emeric Dubois
- MGX-Montpellier GenomiX, c/o Institut de Génomique Fonctionnelle, Montpellier Cedex 5 34094, France
- William Ritchie
- Department of Genome Dynamics, Institut de Génétique Humaine, Centre National de la Recherche Scientifique (CNRS), Université de Montpellier, Montpellier 34396, France
|
22
|
Firtina C, Kim JS, Alser M, Senol Cali D, Cicek AE, Alkan C, Mutlu O. Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm. Bioinformatics 2020; 36:3669-3679. [PMID: 32167530 DOI: 10.1093/bioinformatics/btaa179] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 02/13/2019] [Revised: 12/16/2019] [Accepted: 03/11/2020] [Indexed: 11/13/2022]
Abstract
MOTIVATION Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject's genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology-dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms and (ii) use small chunks of a large genome to use all available readsets and polish large genomes, respectively. RESULTS We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Forward-Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. 
Our experiments with real readsets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts. AVAILABILITY AND IMPLEMENTATION Source code is available at https://github.com/CMU-SAFARI/Apollo. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
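The decoding step named in the abstract can be illustrated with a minimal sketch of the Viterbi algorithm on a generic two-state hidden Markov model. This is not Apollo's profile HMM: the state names, probabilities, and the `viterbi` function are hypothetical and chosen only to show how the most likely state path is recovered.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for an observation sequence."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximizes the path score into s
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy model (assumed values): state A mostly emits "x", state B mostly emits "y"
states = ("A", "B")
start_p = {"A": 0.5, "B": 0.5}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.3, "B": 0.7}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}
print(viterbi("xxyy", states, start_p, trans_p, emit_p))  # ['A', 'A', 'B', 'B']
```

Log-probabilities are used to avoid numerical underflow on long sequences; all probabilities in the toy model are non-zero so that every logarithm is defined.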
Affiliation(s)
- Can Firtina
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
| | - Jeremie S Kim
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland.,Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Mohammed Alser
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
| | - Damla Senol Cali
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - A Ercument Cicek
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
| | - Can Alkan
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
| | - Onur Mutlu
- Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland.,Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA.,Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
|
23
|
Schmartz GP, Kern F, Fehlmann T, Wagner V, Fromm B, Keller A. Encyclopedia of tools for the analysis of miRNA isoforms. Brief Bioinform 2020; 22:6032629. [PMID: 33313643 DOI: 10.1093/bib/bbaa346] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/02/2020] [Revised: 10/15/2020] [Accepted: 10/29/2020] [Indexed: 12/14/2022] Open
Abstract
RNA sequencing data sets are rapidly increasing in quantity. For microRNAs (miRNAs), dozens to hundreds of billions of reads are frequently generated per study. The quantification of annotated miRNAs and the prediction of new miRNAs are leading computational tasks. The increased depth of coverage now allows deeper insights into the variability of miRNAs. The analysis of isoforms of miRNAs (isomiRs) is a trending topic, and a range of computational tools for the analysis of isomiRs has been developed. We provide an overview of 27 available computational solutions for the analysis of isomiRs. These include both stand-alone programs (17 tools) and web-based solutions (10 tools) and span publication dates from 2010 to 2020. Seven of the tools were published in 2019 and 2020, confirming the rising importance of the topic. While most of the analyzed tools work for a broad range of organisms or are completely independent of a reference organism, several tools have been tailored for the analysis of human miRNA data or for plants. While 14 of the tools are general miRNA analysis tools with isomiR analysis as one of their features, the remaining 13 tools have been developed specifically for isomiR analysis. A direct comparison of selected tools on 20 deep sequencing data sets provides insights into the heterogeneity of results. With our work, we provide users with a comprehensive overview of the landscape of isomiR analysis tools and thereby support the selection of the most appropriate tool for their respective research task.
Affiliation(s)
- Bastian Fromm
- Science for Life Laboratory, Department of Molecular Biosciences, The Wenner-Gren Institute, Stockholm University, Stockholm, Sweden
| | - Andreas Keller
- Saarland Center for Bioinformatics and Chair for Clinical Bioinformatics, Saarland University Building E2.1, 66123 Saarbrücken, Germany
|
24
|
Yang H, Wang Y, Zhang Z, Li H. Identification of KIF18B as a Hub Candidate Gene in the Metastasis of Clear Cell Renal Cell Carcinoma by Weighted Gene Co-expression Network Analysis. Front Genet 2020; 11:905. [PMID: 32973873 PMCID: PMC7468490 DOI: 10.3389/fgene.2020.00905] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/17/2019] [Accepted: 07/21/2020] [Indexed: 12/13/2022] Open
Abstract
Background Clear cell renal cell carcinoma (ccRCC) is a common type of fatal malignancy of the urinary system. Because therapeutic strategies for ccRCC are severely limited at present, the prognosis of patients with metastatic carcinoma is usually poor. Revealing the pathogenesis and identifying hub candidate genes for prognosis prediction and precise treatment are urgently needed in metastatic ccRCC. Methods In the present study, we conducted a series of bioinformatics analyses of gene expression profiles of ccRCC samples from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) databases to identify and validate the hub gene of metastatic ccRCC. We constructed a co-expression network, divided genes into co-expression modules, and identified ccRCC-related modules by weighted gene co-expression network analysis (WGCNA) with data from GEO. Then, we investigated the functions of genes in the ccRCC-related modules by enrichment analyses and built a sub-network accordingly. A hub candidate gene of metastatic ccRCC was identified by the maximal clique centrality (MCC) method. We validated the hub gene by differentially expressed gene analysis, overall survival analysis, and correlation analysis with clinical traits in the external dataset (TCGA). Finally, we explored the function of the hub gene by correlation analysis with targets of precise therapies and single-gene gene set enrichment analysis. Results We conducted WGCNA with the expression profiles of GSE73731 from GEO and divided all genes into 8 meaningful co-expression modules. One module proved to be positively correlated with pathological stage and tumor grade of ccRCC. Genes in the ccRCC-related module were mainly enriched in functions of mitotic cell division and several well-known tumor-related signaling pathways. We then identified KIF18B as a hub gene of the metastasis of ccRCC. Validation analyses in the external dataset showed up-regulation of KIF18B in ccRCC and its correlation with worse outcomes. Further analyses found that the expression of KIF18B is related to that of targets of precise therapies. Conclusion Our study proposes KIF18B as a hub candidate gene of ccRCC for the first time. Our conclusion may provide a new clue for prognosis evaluation and precise treatment of ccRCC in the future.
Affiliation(s)
- Huiying Yang
- Department of Nephrology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Yukun Wang
- Department of Urology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Ziyi Zhang
- Department of Endocrinology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Hua Li
- Department of Nephrology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
|
25
|
Lees JA, Mai TT, Galardini M, Wheeler NE, Horsfield ST, Parkhill J, Corander J. Improved Prediction of Bacterial Genotype-Phenotype Associations Using Interpretable Pangenome-Spanning Regressions. mBio 2020; 11:e01344-20. [PMID: 32636251 PMCID: PMC7343994 DOI: 10.1128/mbio.01344-20] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Received: 05/22/2020] [Accepted: 06/05/2020] [Indexed: 12/19/2022] Open
Abstract
Discovery of genetic variants underlying bacterial phenotypes and the prediction of phenotypes such as antibiotic resistance are fundamental tasks in bacterial genomics. Genome-wide association study (GWAS) methods have been applied to study these relations, but the plastic nature of bacterial genomes and the clonal structure of bacterial populations create challenges. We introduce an alignment-free method which finds sets of loci associated with bacterial phenotypes, quantifies the total effect of genetics on the phenotype, and allows accurate phenotype prediction, all within a single computationally scalable joint modeling framework. Genetic variants covering the entire pangenome are compactly represented by extended DNA sequence words known as unitigs, and model fitting is achieved using elastic net penalization, an extension of standard multiple regression. Using an extensive set of state-of-the-art bacterial population genomic data sets, we demonstrate that our approach performs accurate phenotype prediction, comparable to popular machine learning methods, while retaining both interpretability and computational efficiency. Compared to those of previous approaches, which test each genotype-phenotype association separately for each variant and apply a significance threshold, the variants selected by our joint modeling approach overlap substantially.
IMPORTANCE Being able to identify the genetic variants responsible for specific bacterial phenotypes has been the goal of bacterial genetics since its inception and is fundamental to our current level of understanding of bacteria. This identification has been based primarily on painstaking experimentation, but the availability of large data sets of whole genomes with associated phenotype metadata promises to revolutionize this approach, not least for important clinical phenotypes that are not amenable to laboratory analysis. These models of phenotype-genotype association can in the future be used for rapid prediction of clinically important phenotypes such as antibiotic resistance and virulence by rapid-turnaround or point-of-care tests. However, despite much effort being put into adapting genome-wide association study (GWAS) approaches to cope with bacterium-specific problems, such as strong population structure and horizontal gene exchange, current approaches are not yet optimal. We describe a method that advances methodology for both association and generation of portable prediction models.
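For reference, elastic net penalization is commonly written in the following glmnet-style form for a continuous phenotype. The notation is an assumption made for illustration, since the abstract does not give the authors' exact parameterization: $x_i$ is the unitig presence/absence vector of isolate $i$, $y_i$ its phenotype, $\beta$ the per-unitig effect sizes, $\lambda \ge 0$ the penalty strength, and $\alpha \in [0, 1]$ the mixing parameter.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top} \beta \right)^2
  + \lambda \left( \alpha \lVert \beta \rVert_1
  + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right)
```

Setting $\alpha = 1$ recovers the lasso and $\alpha = 0$ recovers ridge regression. The $\ell_1$ term drives most coefficients to exactly zero, which is what makes the selected set of variants interpretable.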
Affiliation(s)
- John A Lees
- MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, Imperial College London, London, United Kingdom
| | - T Tien Mai
- Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Oslo, Norway
| | - Marco Galardini
- Biological Design Center, Boston University, Boston, Massachusetts, USA
| | - Nicole E Wheeler
- Centre for Genomic Pathogen Surveillance, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
| | - Samuel T Horsfield
- MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, Imperial College London, London, United Kingdom
| | - Julian Parkhill
- Department of Veterinary Medicine, University of Cambridge, Cambridge, United Kingdom
| | - Jukka Corander
- Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Oslo, Norway
- Centre for Genomic Pathogen Surveillance, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
- Helsinki Institute of Information Technology, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
|
26
|
Wylie DC, Hofmann HA, Zemelman BV. SArKS: de novo discovery of gene expression regulatory motif sites and domains by suffix array kernel smoothing. Bioinformatics 2020; 35:3944-3952. [PMID: 30903136 DOI: 10.1093/bioinformatics/btz198] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/12/2018] [Revised: 03/04/2019] [Accepted: 03/20/2019] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION We set out to develop an algorithm that can mine differential gene expression data to identify candidate cell type-specific DNA regulatory sequences. Differential expression is usually quantified as a continuous score (fold-change, test-statistic, P-value) comparing biological classes. Unlike existing approaches, our de novo strategy, termed SArKS, applies non-parametric kernel smoothing to uncover promoter motif sites that correlate with elevated differential expression scores. SArKS detects motif k-mers by smoothing sequence scores over sequence similarity. A second round of smoothing over spatial proximity reveals multi-motif domains (MMDs). Discovered motif sites can then be merged or extended based on adjacency within MMDs. False positive rates are estimated and controlled by permutation testing. RESULTS We applied SArKS to published gene expression data representing distinct neocortical neuron classes in Mus musculus and interneuron developmental states in Homo sapiens. When benchmarked against several existing algorithms using a cross-validation procedure, SArKS identified larger motif sets that formed the basis for regression models with higher correlative power. AVAILABILITY AND IMPLEMENTATION https://github.com/denniscwylie/sarks. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dennis C Wylie
- Center for Computational Biology and Bioinformatics, University of Texas at Austin, Austin, TX, USA
| | - Hans A Hofmann
- Center for Computational Biology and Bioinformatics, University of Texas at Austin, Austin, TX, USA.,Institute for Cellular and Molecular Biology, University of Texas at Austin, Austin, TX, USA.,Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA.,Institute for Neuroscience, University of Texas at Austin, Austin, TX, USA
| | - Boris V Zemelman
- Institute for Cellular and Molecular Biology, University of Texas at Austin, Austin, TX, USA.,Institute for Neuroscience, University of Texas at Austin, Austin, TX, USA.,Department of Neuroscience, University of Texas at Austin, Austin, TX, USA.,Center for Learning and Memory, University of Texas at Austin, Austin, TX, USA
|
27
|
Abstract
Motivation Sequence graphs are versatile data structures that are, for instance, able to represent the genetic variation found in a population and to facilitate genome assembly. Read mapping to sequence graphs constitutes an important step for many applications and is usually done by first finding exact seed matches, which are then extended by alignment. Existing methods for finding seed hits prune the graph in complex regions, leading to a loss of information especially in highly polymorphic regions of the genome. While such complex graph structures can indeed lead to a combinatorial explosion of possible alleles, the query set of reads from a diploid individual realizes only two alleles per locus, a property that is not exploited by extant methods. Results We present the Pan-genome Seed Index (PSI), a fully-sensitive hybrid method for seed finding, which takes full advantage of this property by combining an index over selected paths in the graph with an index over the query reads. This enables PSI to find all seeds while eliminating the need to prune the graph. We demonstrate its performance with different parameter settings on both simulated data and on a whole human genome graph constructed from variants in the 1000 Genomes Project dataset. On this graph, PSI outperforms GCSA2 in terms of index size, query time and sensitivity. Availability and implementation The C++ implementation is publicly available at: https://github.com/cartoonist/psi.
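The idea of indexing the query reads rather than only the reference can be illustrated with a much-simplified sketch. It assumes a single linear path string in place of a sequence graph and fixed-length k-mer seeds. The function names `index_reads` and `find_seeds` are hypothetical, and this is not the PSI implementation.

```python
from collections import defaultdict


def index_reads(reads, k):
    """Map each k-mer to the (read_id, offset) positions where it occurs."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for offset in range(len(read) - k + 1):
            index[read[offset:offset + k]].append((read_id, offset))
    return index


def find_seeds(path_seq, index, k):
    """Return (path_pos, read_id, read_offset) for every exact k-mer match."""
    hits = []
    for pos in range(len(path_seq) - k + 1):
        # .get avoids inserting empty entries into the defaultdict
        for read_id, offset in index.get(path_seq[pos:pos + k], ()):
            hits.append((pos, read_id, offset))
    return hits


reads = ["ACGT", "GTTA"]
index = index_reads(reads, k=3)
print(find_seeds("TACGTT", index, k=3))  # [(1, 0, 0), (2, 0, 1), (3, 1, 0)]
```

In a graph setting the single `path_seq` scan would be replaced by a traversal over paths in the graph, which is where the combinatorial difficulty described in the abstract arises.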
Affiliation(s)
- Ali Ghaffaari
- Center for Bioinformatics, Saarland University, Saarbrücken, Germany.,Max Planck Institute for Informatics, Saarbrücken, Germany
| | - Tobias Marschall
- Center for Bioinformatics, Saarland University, Saarbrücken, Germany.,Max Planck Institute for Informatics, Saarbrücken, Germany
|
28
|
Rautiainen M, Mäkinen V, Marschall T. Bit-parallel sequence-to-graph alignment. Bioinformatics 2020; 35:3599-3607. [PMID: 30851095 PMCID: PMC6761980 DOI: 10.1093/bioinformatics/btz162] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 10/23/2018] [Revised: 01/19/2019] [Accepted: 03/07/2019] [Indexed: 01/16/2023] Open
Abstract
Motivation Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction and variant calling with respect to a variation graph. Results We generalize two linear sequence-to-sequence algorithms to graphs: the Shift-And algorithm for exact matching and Myers’ bitvector algorithm for semi-global alignment. These linear algorithms are both based on processing w sequence characters with a constant number of operations, where w is the word size of the machine (commonly 64), and achieve a speedup of up to w over naive algorithms. For a graph with |V| nodes and |E| edges and a sequence of length m, our bitvector-based graph alignment algorithm reaches a worst case runtime of O(|V| + ⌈m/w⌉|E| log w) for acyclic graphs and O(|V| + m|E| log w) for arbitrary cyclic graphs. We apply it to five different types of graphs and observe a speedup between 3-fold and 20-fold compared with a previous (asymptotically optimal) alignment algorithm. Availability and implementation https://github.com/maickrau/GraphAligner Supplementary information Supplementary data are available at Bioinformatics online.
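As background for the bit-parallel idea, the classic linear Shift-And algorithm for exact matching can be sketched as follows. This is the sequence-to-sequence version only, not the graph generalization the paper contributes. The function name `shift_and` is chosen for illustration, and Python's arbitrary-precision integers stand in for a machine word.

```python
def shift_and(pattern, text):
    """Return the start positions of all exact occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # mask[c] has bit i set exactly when pattern[i] == c
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit that signals a full match
    state = 0              # bit i set: pattern[0..i] matches a suffix of the text read so far
    hits = []
    for j, c in enumerate(text):
        # Extend every partial match by one character and start a new one at bit 0
        state = ((state << 1) | 1) & mask.get(c, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits


print(shift_and("ACA", "GACACA"))  # [1, 3]
```

Each text character is processed with one shift, one OR, and one AND regardless of how many partial matches are active, which is the source of the up-to-w speedup when the pattern fits in a w-bit word.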
Affiliation(s)
- Mikko Rautiainen
- Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, 66123 Saarbrücken, Germany.,Saarbrücken Graduate School of Computer Science, Saarland University, Saarland Informatics Campus E1.3, 66123 Saarbrücken, Germany
| | - Veli Mäkinen
- Department of Computer Science, University of Helsinki, Helsinki, Finland
| | - Tobias Marschall
- Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, 66123 Saarbrücken, Germany.,Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, 66123 Saarbrücken, Germany
|
29
|
Romero R, Sánchez-Rivera FJ, Westcott PMK, Mercer KL, Bhutkar A, Muir A, González Robles TJ, Lamboy Rodríguez S, Liao LZ, Ng SR, Li L, Colón CI, Naranjo S, Beytagh MC, Lewis CA, Hsu PP, Bronson RT, Vander Heiden MG, Jacks T. Keap1 mutation renders lung adenocarcinomas dependent on Slc33a1. Nature Cancer 2020; 1:589-602. [PMID: 34414377 PMCID: PMC8373048 DOI: 10.1038/s43018-020-0071-1] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 07/18/2019] [Accepted: 05/01/2020] [Indexed: 12/13/2022]
Abstract
Approximately 20-30% of human lung adenocarcinomas (LUAD) harbor loss-of-function (LOF) mutations in Kelch-like ECH Associated-Protein 1 (KEAP1), which lead to hyperactivation of the nuclear factor, erythroid 2-like 2 (NRF2) antioxidant pathway and correlate with poor prognosis [1-3]. We previously showed that Keap1 mutation accelerates KRAS-driven LUAD and produces a marked dependency on glutaminolysis [4]. To extend the investigation of genetic dependencies in the context of Keap1 mutation, we performed a druggable genome CRISPR-Cas9 screen in Keap1-mutant cells. This analysis uncovered a profound Keap1 mutant-specific dependency on solute carrier family 33 member 1 (Slc33a1), an endomembrane-associated protein with roles in autophagy regulation [5], as well as a series of functionally-related genes implicated in the unfolded protein response. Targeted genetic and biochemical experiments using mouse and human Keap1-mutant tumor lines, as well as preclinical genetically-engineered mouse models (GEMMs) of LUAD, validate Slc33a1 as a robust Keap1-mutant-specific dependency. Furthermore, unbiased genome-wide CRISPR screening identified additional genes related to Slc33a1 dependency. Overall, our study provides a strong rationale for stratification of patients harboring KEAP1-mutant or NRF2-hyperactivated tumors as likely responders to targeted SLC33A1 inhibition and underscores the value of integrating functional genetic approaches with GEMMs to identify and validate genotype-specific therapeutic targets.
Affiliation(s)
- Rodrigo Romero
- Koch Institute for Integrative Cancer Research, Cambridge, MA, USA
- Massachusetts Institute of Technology Department of Biology, Cambridge, MA, USA
| | - Francisco J Sánchez-Rivera
- Koch Institute for Integrative Cancer Research, Cambridge, MA, USA
- Massachusetts Institute of Technology Department of Biology, Cambridge, MA, USA
- Department of Cancer Biology and Genetics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
| | - Kim L Mercer
- Koch Institute for Integrative Cancer Research, Cambridge, MA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - Arjun Bhutkar
- Koch Institute for Integrative Cancer Research, Cambridge, MA, USA
| | - Alexander Muir
- Koch Institute for Integrative Cancer Research, Cambridge, MA, USA
- Ben May Department for Cancer Research, University of Chicago, Chicago, IL, USA
| | - Laura Z Liao
- Massachusetts Institute of Technology Department of Biology, Cambridge, MA, USA
| | - Sheng Rong Ng
- Koch Institute for Integrative Cancer Research, Cambridge, MA, USA
- Massachusetts Institute of Technology Department of Biology, Cambridge, MA, USA
| | - Leanne Li
- Koch Institute for Integrative Cancer Research, Cambridge, MA, USA
| | - Caterina I Colón
- Koch Institute for Integrative Cancer Research, Cambridge, MA, USA
| | - Santiago Naranjo
- Koch Institute for Integrative Cancer Research, Cambridge, MA, USA
- Massachusetts Institute of Technology Department of Biology, Cambridge, MA, USA
| | - Mary Clare Beytagh
- Massachusetts Institute of Technology Department of Biology, Cambridge, MA, USA
| | - Caroline A Lewis
- Whitehead Institute, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Peggy P Hsu
- Koch Institute for Integrative Cancer Research, Cambridge, MA, USA
- Massachusetts General Hospital Cancer Center, Boston, MA, USA
- Dana-Farber Cancer Institute, Boston, MA, USA
| | - Roderick T Bronson
- Tufts University, Boston, MA, USA
- Harvard Medical School, Boston, MA, USA
| | - Matthew G Vander Heiden
- Koch Institute for Integrative Cancer Research, Cambridge, MA, USA
- Massachusetts Institute of Technology Department of Biology, Cambridge, MA, USA
- Dana-Farber Cancer Institute, Boston, MA, USA
| | - Tyler Jacks
- Koch Institute for Integrative Cancer Research, Cambridge, MA, USA.
- Massachusetts Institute of Technology Department of Biology, Cambridge, MA, USA.
- Howard Hughes Medical Institute, Chevy Chase, MD, USA.
|
30
|
Urgese G, Parisi E, Scicolone O, Di Cataldo S, Ficarra E. BioSeqZip: a collapser of NGS redundant reads for the optimization of sequence analysis. Bioinformatics 2020; 36:2705-2711. [PMID: 31999333 PMCID: PMC7203750 DOI: 10.1093/bioinformatics/btaa051] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/05/2019] [Revised: 12/20/2019] [Accepted: 01/22/2020] [Indexed: 01/08/2023] Open
Abstract
MOTIVATION High-throughput next-generation sequencing can generate huge sequence files, whose analysis requires alignment algorithms that are typically very demanding in terms of memory and computational resources. This is a significant issue, especially for machines with limited hardware capabilities. As the redundancy of the sequences typically increases with coverage, collapsing such files into compact sets of non-redundant reads has the 2-fold advantage of reducing file size and speeding up the alignment by avoiding mapping the same sequence multiple times. METHOD BioSeqZip generates compact and sorted lists of alignment-ready non-redundant sequences, keeping track of their occurrences in the raw files as well as of their quality score information. By exploiting a memory-constrained external sorting algorithm, it can be executed on either single- or multi-sample datasets even on computers with medium computational capabilities. On request, it can even re-expand the compacted files to their original state. RESULTS Our extensive experiments on RNA-Seq data show that BioSeqZip considerably brings down the computational costs of a standard sequence analysis pipeline, with particular benefits for the alignment procedures that typically have the highest requirements in terms of memory and execution time. In our tests, BioSeqZip was able to compact 2.7 billion reads into 963 million unique tags, reducing the size of sequence files by up to 70% and speeding up the alignment by at least 50%. AVAILABILITY AND IMPLEMENTATION BioSeqZip is available at https://github.com/bioinformatics-polito/BioSeqZip. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
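The collapsing idea can be illustrated with a small in-memory sketch. It is only an illustration of the concept: BioSeqZip itself relies on memory-constrained external sorting and also tracks quality scores, neither of which is modeled here. The function names `collapse_reads` and `expand_reads` are hypothetical.

```python
from collections import Counter


def collapse_reads(reads):
    """Collapse duplicate reads into a sorted list of (sequence, count) pairs."""
    return sorted(Counter(reads).items())


def expand_reads(collapsed):
    """Re-expand collapsed pairs into a sorted list of individual reads."""
    return [seq for seq, count in collapsed for _ in range(count)]


reads = ["ACGT", "TTGA", "ACGT", "ACGT"]
collapsed = collapse_reads(reads)
print(collapsed)                # [('ACGT', 3), ('TTGA', 1)]
print(expand_reads(collapsed))  # ['ACGT', 'ACGT', 'ACGT', 'TTGA']
```

An aligner then needs to map each unique sequence only once, and the stored counts allow per-read statistics to be recovered afterwards. Note that this sketch restores the reads in sorted order, not in their original file order.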
Affiliation(s)
- Gianvito Urgese
- Interuniversity Department of Regional and Urban Studies and Planning, Politecnico di Torino, Torino, Italy
| | - Emanuele Parisi
- Department of Control and Computer Engineering, Politecnico di Torino, Torino, Italy
| | - Orazio Scicolone
- Department of Control and Computer Engineering, Politecnico di Torino, Torino, Italy
| | - Santa Di Cataldo
- Department of Control and Computer Engineering, Politecnico di Torino, Torino, Italy
| | - Elisa Ficarra
- Department of Control and Computer Engineering, Politecnico di Torino, Torino, Italy
|
31
|
Meyer F, Bagchi S, Chaterji S, Gerlach W, Grama A, Harrison T, Paczian T, Trimble WL, Wilke A. MG-RAST version 4-lessons learned from a decade of low-budget ultra-high-throughput metagenome analysis. Brief Bioinform 2020; 20:1151-1159. [PMID: 29028869 DOI: 10.1093/bib/bbx105] [Citation(s) in RCA: 77] [Impact Index Per Article: 19.3] [Received: 05/25/2017] [Revised: 07/21/2017] [Indexed: 11/12/2022] Open
Abstract
As technologies change, MG-RAST is adapting. Newly available software is being included to improve accuracy and performance. As a computational service constantly running large-volume scientific workflows, MG-RAST is the right location to perform benchmarking and implement algorithmic or platform improvements, in many cases involving trade-offs between specificity, sensitivity and run-time cost. The work in [Glass EM, Dribinsky Y, Yilmaz P, et al. ISME J 2014;8:1-3] is an example; we use existing well-studied data sets as gold standards representing different environments and different technologies to evaluate any changes to the pipeline. Currently, we use well-understood data sets in MG-RAST as a platform for benchmarking. The use of artificial data sets for pipeline performance optimization has not added value, as these data sets do not present the same challenges as real-world data sets. In addition, the MG-RAST team welcomes suggestions for improvements of the workflow. We are currently working on versions 4.02 and 4.1, both of which contain significant input from the community and our partners that will enable double barcoding and stronger inferences supported by longer-read technologies, and will increase throughput while maintaining sensitivity by using Diamond and SortMeRNA. On the technical platform side, the MG-RAST team intends to support the Common Workflow Language as a standard to specify bioinformatics workflows, to facilitate both development and efficient high-performance implementation of the community's data analysis tasks.
|
32
|
Dietz C, Rueden CT, Helfrich S, Dobson ETA, Horn M, Eglinger J, Evans EL, McLean DT, Novitskaya T, Ricke WA, Sherer NM, Zijlstra A, Berthold MR, Eliceiri KW. Integration of the ImageJ Ecosystem in the KNIME Analytics Platform. Frontiers in Computer Science 2020; 2. [PMID: 32905440 DOI: 10.3389/fcomp.2020.00008] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Indexed: 01/19/2023] Open
Abstract
Open-source software tools are often used for analysis of scientific image data due to their flexibility and transparency in dealing with rapidly evolving imaging technologies. The complex nature of image analysis problems frequently requires many tools to be used in conjunction, including image processing and analysis, data processing, machine learning and deep learning, statistical analysis of the results, visualization, correlation to heterogeneous but related data, and more. However, the development, and therefore application, of these computational tools is impeded by a lack of integration across platforms. Integration of tools goes beyond convenience, as it is impractical for one tool to anticipate and accommodate the current and future needs of every user. This problem is emphasized in the field of bioimage analysis, where various rapidly emerging methods are quickly being adopted by researchers. ImageJ is a popular open-source image analysis platform, with contributions from a global community resulting in hundreds of specialized routines for a wide array of scientific tasks. ImageJ's strength lies in its accessibility and extensibility, allowing researchers to easily improve the software to solve their image analysis tasks. However, ImageJ is not designed for development of complex end-to-end image analysis workflows. Scientists are often forced to create highly specialized and hard-to-reproduce scripts to orchestrate individual software fragments and cover the entire life-cycle of an analysis of an image dataset. KNIME Analytics Platform, a user-friendly data integration, analysis, and exploration workflow system, was designed to handle huge amounts of heterogeneous data in a platform-agnostic, computing environment and has been successful in meeting complex end-to-end demands in several communities, such as cheminformatics and mass spectrometry. 
Similar needs within the bioimage analysis community led to the creation of the KNIME Image Processing extension which integrates ImageJ into KNIME Analytics Platform, enabling researchers to develop reproducible and scalable workflows, integrating a diverse range of analysis tools. Here we present how users and developers alike can leverage the ImageJ ecosystem via the KNIME Image Processing extension to provide robust and extensible image analysis within KNIME workflows. We illustrate the benefits of this integration with examples, as well as representative scientific use cases.
Affiliation(s)
- Curtis T Rueden
- Laboratory for Optical and Computational Instrumentation (LOCI), Laboratory of Cell and Molecular Biology, University of Wisconsin-Madison, Madison, WI, USA
| | - Ellen T A Dobson
- Laboratory for Optical and Computational Instrumentation (LOCI), Laboratory of Cell and Molecular Biology, University of Wisconsin-Madison, Madison, WI, USA
| | | | - Jan Eglinger
- Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
| | - Edward L Evans
- McArdle Laboratory for Cancer Research, Institute for Molecular Virology, and Carbone Cancer Center, University of Wisconsin-Madison, Madison, WI, USA
| | - Dalton T McLean
- George M. O'Brien Center of Research Excellence, University of Wisconsin Madison, WI, USA
| | - Tatiana Novitskaya
- Department of Pathology, Microbiology and Immunology, Vanderbilt University Medical Center, Nashville, TN, USA
| | - William A Ricke
- George M. O'Brien Center of Research Excellence, University of Wisconsin Madison, WI, USA
| | - Nathan M Sherer
- McArdle Laboratory for Cancer Research, Institute for Molecular Virology, and Carbone Cancer Center, University of Wisconsin-Madison, Madison, WI, USA
| | - Andries Zijlstra
- Department of Pathology, Microbiology and Immunology, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Michael R Berthold
- KNIME GmbH, Konstanz, Germany.,University of Konstanz, Konstanz, Germany
| | - Kevin W Eliceiri
- Laboratory for Optical and Computational Instrumentation (LOCI), Laboratory of Cell and Molecular Biology, University of Wisconsin-Madison, Madison, WI, USA.,Morgridge Institute for Research, Madison, WI, USA
| |
Collapse
|
33
|
Li R, He X, Dai C, Zhu H, Lang X, Chen W, Li X, Zhao D, Zhang Y, Han X, Niu T, Zhao Y, Cao R, He R, Lu Z, Chi X, Li W, Niu B. Gclust: A Parallel Clustering Tool for Microbial Genomic Data. Genomics Proteomics & Bioinformatics 2020; 17:496-502. [PMID: 31917259 PMCID: PMC7056916 DOI: 10.1016/j.gpb.2018.10.008] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/07/2018] [Revised: 05/29/2018] [Accepted: 10/23/2018] [Indexed: 11/12/2022]
Abstract
The accelerating growth of public microbial genomic data imposes a substantial burden on the research community that uses such resources. Building databases of non-redundant reference sequences from massive microbial genomic data through clustering analysis is therefore essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, where clustering is accelerated with a novel parallelization strategy and a fast sequence comparison algorithm using sparse suffix arrays (SSAs). Moreover, genome identity measures between two sequences are calculated based on their maximal exact matches (MEMs). We demonstrate the high speed and clustering quality of Gclust by examining four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.
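As a rough illustration of an identity measure built on maximal exact matches, the sketch below finds MEMs between two short sequences by brute force and reports the fraction of the shorter sequence they cover. It does not reproduce Gclust's sparse suffix array search or its actual identity formula, neither of which is spelled out in the abstract; the function names `maximal_exact_matches` and `mem_identity`, the coverage-based measure, and the `min_len` threshold are all assumptions made for this example.

```python
def maximal_exact_matches(a, b, min_len=4):
    """Return (start_a, start_b, length) for each maximal exact match.

    Brute force, O(len(a) * len(b)); real tools use suffix-array indexes.
    """
    mems = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # A match that can be extended to the left is not maximal here.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= min_len:
                mems.append((i, j, k))
    return mems


def mem_identity(a, b, min_len=4):
    """Fraction of the shorter sequence covered by MEMs (an assumed measure)."""
    if not a or not b:
        return 0.0
    query, target = (a, b) if len(a) <= len(b) else (b, a)
    covered = [False] * len(query)
    for start_q, _, length in maximal_exact_matches(query, target, min_len):
        for p in range(start_q, start_q + length):
            covered[p] = True
    return sum(covered) / len(query)
```

A clustering tool could compare such a score against a user-chosen threshold to decide whether two genomes belong to the same cluster; the threshold and the decision rule are likewise outside what the abstract specifies.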
Affiliation(s)
- Ruilin Li
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Xiaoyu He
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Chuangchuang Dai
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Haidong Zhu
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Xianyu Lang
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
- Wei Chen
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Xiaodong Li
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Dan Zhao
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Yu Zhang
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Xinyin Han
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China
- Tie Niu
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
- Yi Zhao
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
- Rongqiang Cao
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
- Rong He
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
- Zhonghua Lu
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
- Xuebin Chi
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China; Center of Scientific Computing Applications & Research, Chinese Academy of Sciences, Beijing 100190, China
- Weizhong Li
- J. Craig Venter Institute, La Jolla, CA 92037, USA.
- Beifang Niu
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China; Guizhou University School of Medicine, Guiyang 550025, China.
|
34
|
Guo H, Liu B, Guan D, Fu Y, Wang Y. Fast read alignment with incorporation of known genomic variants. BMC Med Inform Decis Mak 2019; 19:265. [PMID: 31856811 PMCID: PMC6921400 DOI: 10.1186/s12911-019-0960-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022] Open
Abstract
Background Many genetic variants have been reported from sequencing projects due to decreasing experimental costs. Compared to the current typical paradigm, read mapping incorporating existing variants can improve the performance of subsequent analysis. This method is supposed to map sequencing reads efficiently to a graphical index with a reference genome and known variation to increase alignment quality and variant calling accuracy. However, storing and indexing various types of variation require costly RAM space. Methods Aligning reads to a graph model-based index including the whole set of variants is ultimately an NP-hard problem in theory. Here, we propose a variation-aware read alignment algorithm (VARA), which generates the alignment between read and multiple genomic sequences simultaneously utilizing the schema of the Landau-Vishkin algorithm. VARA dynamically extracts regional variants to construct a pseudo tree-based structure on-the-fly for seed extension without loading the whole genome variation into memory space. Results We developed the novel high-throughput sequencing read aligner deBGA-VARA by integrating VARA into deBGA. The deBGA-VARA is benchmarked both on simulated reads and the NA12878 sequencing dataset. The experimental results demonstrate that read alignment incorporating genetic variation knowledge can achieve high sensitivity and accuracy. Conclusions Due to its efficiency, VARA provides a promising solution for further improvement of variant calling while maintaining small memory footprints. The deBGA-VARA is available at: https://github.com/hitbc/deBGA-VARA.
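For readers unfamiliar with the Landau-Vishkin schema mentioned in the Methods, the following is a minimal sketch of its core idea: for each error count e, track the furthest row reachable on every diagonal, then slide along exact matches. This is not VARA and handles only two plain sequences, with no variant tree; it also extends matches character by character rather than with constant-time longest-common-extension queries. The function name `diagonal_edit_distance` and its `max_k` parameter are hypothetical.

```python
def diagonal_edit_distance(a, b, max_k):
    """Edit distance between a and b if it is <= max_k, otherwise None."""
    n, m = len(a), len(b)

    def extend(i, d):
        # Slide along diagonal d (column j = i + d) while characters match.
        while i < n and i + d < m and a[i] == b[i + d]:
            i += 1
        return i

    target = m - n  # diagonal on which both sequences end together
    prev = {0: extend(0, 0)}  # prev[d] = furthest row with e - 1 errors
    if target == 0 and prev[0] == n:
        return 0
    for e in range(1, max_k + 1):
        cur = {}
        for d in range(-e, e + 1):
            candidates = []
            if d in prev:
                candidates.append(prev[d] + 1)      # substitution
            if d + 1 in prev:
                candidates.append(prev[d + 1] + 1)  # skip a character of a
            if d - 1 in prev:
                candidates.append(prev[d - 1])      # skip a character of b
            if not candidates:
                continue
            # Clamp to the matrix; drop diagonals that lie outside it.
            i = min(max(candidates), n, m - d)
            if i < 0 or i + d < 0:
                continue
            cur[d] = extend(i, d)
        if cur.get(target) == n:
            return e
        prev = cur
    return None
```

Only O(max_k) diagonals are touched per error level, which is why this family of methods suits seed extension when few differences are expected.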
Affiliation(s)
- Hongzhe Guo
- Center for Bioinformatics, Harbin Institute of Technology, 92 West Dazhi Street, Harbin, 150001, China
- Bo Liu
- Center for Bioinformatics, Harbin Institute of Technology, 92 West Dazhi Street, Harbin, 150001, China
- Dengfeng Guan
- Center for Bioinformatics, Harbin Institute of Technology, 92 West Dazhi Street, Harbin, 150001, China
- Yilei Fu
- Center for Bioinformatics, Harbin Institute of Technology, 92 West Dazhi Street, Harbin, 150001, China
- Yadong Wang
- Center for Bioinformatics, Harbin Institute of Technology, 92 West Dazhi Street, Harbin, 150001, China.
|
35
|
Liao Y, Smyth GK, Shi W. The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Res 2019; 47:e47. [PMID: 30783653 PMCID: PMC6486549 DOI: 10.1093/nar/gkz114] [Citation(s) in RCA: 1424] [Impact Index Per Article: 284.8] [Received: 07/25/2018] [Revised: 01/02/2019] [Accepted: 02/13/2019] [Indexed: 11/29/2022] Open
Abstract
We present Rsubread, a Bioconductor software package that provides high-performance alignment and read counting functions for RNA-seq reads. Rsubread is based on the successful Subread suite with the added ease-of-use of the R programming environment, creating a matrix of read counts directly as an R object ready for downstream analysis. It integrates read mapping and quantification in a single package and has no software dependencies other than R itself. We demonstrate Rsubread’s ability to detect exon–exon junctions de novo and to quantify expression at the level of either genes, exons or exon junctions. The resulting read counts can be input directly into a wide range of downstream statistical analyses using other Bioconductor packages. Using SEQC data and simulations, we compare Rsubread to TopHat2, STAR and HTSeq as well as to counting functions in the Bioconductor infrastructure packages. We consider the performance of these tools on the combined quantification task starting from raw sequence reads through to summary counts, and in particular evaluate the performance of different combinations of alignment and counting algorithms. We show that Rsubread is faster and uses less memory than competitor tools and produces read count summaries that more accurately correlate with true values.
Affiliation(s)
- Yang Liao
- Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia; Department of Medical Biology, The University of Melbourne, Parkville, Victoria 3010, Australia
- Gordon K Smyth
- Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia; School of Mathematics and Statistics, The University of Melbourne, Parkville, Victoria 3010, Australia
- Wei Shi
- Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia; School of Computing and Information Systems, The University of Melbourne, Parkville, Victoria 3010, Australia
|
36
|
Loka TP, Tausch SH, Renard BY. Reliable variant calling during runtime of Illumina sequencing. Sci Rep 2019; 9:16502. [PMID: 31712740 PMCID: PMC6848508 DOI: 10.1038/s41598-019-52991-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 12/28/2018] [Accepted: 10/16/2019] [Indexed: 02/03/2023] Open
Abstract
The sequential paradigm of data acquisition and analysis in next-generation sequencing leads to high turnaround times for the generation of interpretable results. We combined a novel real-time read mapping algorithm with fast variant calling to obtain reliable variant calls while sequencing is still in progress. Our new algorithm provides accurate read mapping results for intermediate cycles and supports large reference genomes such as the complete human reference. This enables the combination of real-time read mapping results with complex follow-up analysis. In this study, we showed the accuracy and scalability of our approach by applying real-time read mapping and variant calling to seven publicly available human whole exome sequencing datasets. Up to 89% of all detected SNPs were already identified after 40 sequencing cycles, with precision similar to that at the end of sequencing. Final results showed similar accuracy to those of conventional post-hoc analysis methods. When compared to standard routines, our live approach enables considerably faster interventions in clinical applications and infectious disease outbreaks. Besides variant calling, our approach can be adapted for a plethora of other mapping-based analyses.
Affiliation(s)
- Tobias P Loka
- Bioinformatics Division (MF 1), Department for Methodology and Research Infrastructure, Robert Koch Institute, Berlin, Germany
- Simon H Tausch
- Bioinformatics Division (MF 1), Department for Methodology and Research Infrastructure, Robert Koch Institute, Berlin, Germany
- Centre for Biological Threats and Special Pathogens: Highly Pathogenic Viruses (ZBS 1), Robert Koch Institute, Berlin, Germany
- German Federal Institute for Risk Assessment (BfR), Department of Biological Safety, Berlin, Germany
- Bernhard Y Renard
- Bioinformatics Division (MF 1), Department for Methodology and Research Infrastructure, Robert Koch Institute, Berlin, Germany.
|
37
|
Leimeister CA, Dencker T, Morgenstern B. Accurate multiple alignment of distantly related genome sequences using filtered spaced word matches as anchor points. Bioinformatics 2019; 35:211-218. [PMID: 29992260 PMCID: PMC6330006 DOI: 10.1093/bioinformatics/bty592] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 05/23/2017] [Accepted: 07/09/2018] [Indexed: 01/30/2023] Open
Abstract
Motivation Most methods for pairwise and multiple genome alignment use fast local homology search tools to identify anchor points, i.e. high-scoring local alignments of the input sequences. Sequence segments between those anchor points are then aligned with slower, more sensitive methods. Finding suitable anchor points is therefore crucial for genome sequence comparison; speed and sensitivity of genome alignment depend on the underlying anchoring methods. Results In this article, we use filtered spaced word matches to generate anchor points for genome alignment. For a given binary pattern representing match and don't-care positions, we first search for spaced-word matches, i.e. ungapped local pairwise alignments with matching nucleotides at the match positions of the pattern and possible mismatches at the don't-care positions. Those spaced-word matches that have similarity scores above some threshold value are then extended using a standard X-drop algorithm; the resulting local alignments are used as anchor points. To evaluate this approach, we used the popular multiple-genome-alignment pipeline Mugsy and replaced the exact word matches that Mugsy uses as anchor points with our spaced-word-based anchor points. For closely related genome sequences, the two anchoring procedures lead to multiple alignments of similar quality. For distantly related genomes, however, alignments calculated with our filtered-spaced-word matches are superior to alignments produced with the original Mugsy program where exact word matches are used to find anchor points. Availability and implementation http://spacedanchor.gobics.de. Supplementary information Supplementary data are available at Bioinformatics online.
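The spaced-word search described in the Results can be sketched as follows: index one sequence by the characters at the match positions of the pattern, look up windows of the other sequence, and keep matches whose score over all positions exceeds a threshold. This is an illustrative sketch only; the simple `match`/`mismatch` scoring, the default `threshold`, and the function name `spaced_word_matches` are assumptions, and the subsequent X-drop extension is omitted.

```python
from collections import defaultdict


def spaced_word_matches(seq1, seq2, pattern, match=1, mismatch=-1, threshold=0):
    """Return (pos1, pos2, score) for spaced-word matches scoring above threshold.

    pattern is a string of '1' (match position) and '0' (don't-care position).
    """
    care = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)

    def key(seq, pos):
        # Characters at the match positions of the window starting at pos.
        return "".join(seq[pos + i] for i in care)

    # Index every window of seq1 by its characters at the match positions.
    index = defaultdict(list)
    for p in range(len(seq1) - span + 1):
        index[key(seq1, p)].append(p)

    hits = []
    for q in range(len(seq2) - span + 1):
        for p in index.get(key(seq2, q), []):
            # Score every position of the window, don't-care ones included.
            score = sum(
                match if seq1[p + i] == seq2[q + i] else mismatch
                for i in range(span)
            )
            if score > threshold:
                hits.append((p, q, score))
    return hits
```

With pattern `"1101"`, the windows `ACGT` and `ACCT` agree at the three match positions and differ at the don't-care position, so they form a spaced-word match even though no exact 4-mer match exists.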
Affiliation(s)
- Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics
- Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics; Center for Computational Sciences, University of Goettingen, Goettingen, Germany
|
38
|
Shajii A, Numanagić I, Baghdadi R, Berger B, Amarasinghe S. Seq: A High-Performance Language for Bioinformatics. Proceedings of the ACM on Programming Languages 2019; 3:125. [PMID: 35775031 PMCID: PMC9241673 DOI: 10.1145/3360551] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 06/15/2023]
Abstract
The scope and scale of biological data are increasing at an exponential rate, as technologies like next-generation sequencing are becoming radically cheaper and more prevalent. Over the last two decades, the cost of sequencing a genome has dropped from $100 million to nearly $100 (a factor of over 10^6), and the amount of data to be analyzed has increased proportionally. Yet, as Moore's Law continues to slow, computational biologists can no longer rely on computing hardware to compensate for the ever-increasing size of biological datasets. In a field where many researchers are primarily focused on biological analysis over computational optimization, the unfortunate solution to this problem is often to simply buy larger and faster machines. Here, we introduce Seq, the first language tailored specifically to bioinformatics, which marries the ease and productivity of Python with C-like performance. Seq starts with a subset of Python (and is in many cases a drop-in replacement) yet also incorporates novel bioinformatics- and computational genomics-oriented data types, language constructs and optimizations. Seq enables users to write high-level, Pythonic code without having to worry about low-level or domain-specific optimizations, and allows for the seamless expression of the algorithms, idioms and patterns found in many genomics or bioinformatics applications. We evaluated Seq on several standard computational genomics tasks like reverse complementation, k-mer manipulation, sequence pattern matching and large genomic index queries. On equivalent CPython code, Seq attains a performance improvement of up to two orders of magnitude, and a 160× improvement once domain-specific language features and optimizations are used. With parallelism, we demonstrate up to a 650× improvement. Compared to optimized C++ code, which is already difficult for most biologists to produce, Seq frequently attains up to a 2× improvement, and with shorter, cleaner code. Thus, Seq opens the door to an age of democratization of highly optimized bioinformatics software.
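To make the benchmark tasks concrete, here is a small plain-Python sketch of two of them, reverse complementation and k-mer manipulation. It is ordinary CPython of the kind the abstract uses as a baseline, not Seq code; Seq's own sequence and k-mer types are not shown because the abstract does not describe their syntax. The helper names are chosen for this example.

```python
# Translation table mapping each base to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def canonical_kmer_counts(seq, k):
    """Count k-mers, merging each k-mer with its reverse complement."""
    counts = {}
    for kmer in kmers(seq, k):
        canon = min(kmer, reverse_complement(kmer))
        counts[canon] = counts.get(canon, 0) + 1
    return counts
```

Loops like these run once per base or per k-mer, which is where an interpreted implementation tends to be slow and where a compiled, domain-aware language has room to gain.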
Affiliation(s)
- Ariya Shajii
- MIT CSAIL, 77 Massachusetts Ave, Cambridge, MA, 02139, USA
- Bonnie Berger
- MIT CSAIL, 77 Massachusetts Ave, Cambridge, MA, 02139, USA
|
39
|
Afik S, Raulet G, Yosef N. Reconstructing B-cell receptor sequences from short-read single-cell RNA sequencing with BRAPeS. Life Sci Alliance 2019; 2:e201900371. [PMID: 31451449 PMCID: PMC6709718 DOI: 10.26508/lsa.201900371] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 03/08/2019] [Revised: 08/13/2019] [Accepted: 08/14/2019] [Indexed: 12/17/2022] Open
Abstract
BRAPeS is software for B-cell receptor reconstruction in single cells from very short (25–30 bp) reads, which achieves success rates and accuracy similar to those of other methods applied to long reads. RNA sequencing of single B cells provides simultaneous measurements of the cell state and its antigen specificity as determined by the B-cell receptor (BCR). However, to uncover the latter, further reconstruction of the BCR sequence is needed. We present BRAPeS (“BCR Reconstruction Algorithm for Paired-end Single cells”), an algorithm for reconstructing BCRs from short-read paired-end single-cell RNA sequencing. BRAPeS is accurate and achieves a high success rate even at very short (25 bp) read length, which can decrease the cost and increase the number of cells that can be analyzed compared with long reads. BRAPeS is publicly available at https://github.com/YosefLab/BRAPeS.
Affiliation(s)
- Shaked Afik
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Gabriel Raulet
- Department of Computer Science, University of California, Davis, Davis, CA, USA
- Nir Yosef
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA; Department of Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley, CA, USA; Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology and Harvard, Cambridge, MA, USA; Chan Zuckerberg Biohub, San Francisco, CA, USA
|
40
|
Challenges of big data integration in the life sciences. Anal Bioanal Chem 2019; 411:6791-6800. [PMID: 31463515 DOI: 10.1007/s00216-019-02074-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/21/2019] [Revised: 07/08/2019] [Accepted: 08/06/2019] [Indexed: 10/26/2022]
Abstract
Big data has been reported to be revolutionizing many areas of life, including science. It summarizes data that is unprecedentedly large, rapidly generated, heterogeneous, and hard to accurately interpret. This availability has also brought new challenges: How to properly annotate data to make it searchable? What are the legal and ethical hurdles when sharing data? How to store data securely, preventing loss and corruption? The life sciences are not the only disciplines that must align themselves with big data requirements to keep up with the latest developments. The large hadron collider, for instance, generates research data at a pace beyond any current biomedical research center. There are three recent major coinciding events that explain the emergence of big data in the context of research: the technological revolution for data generation, the development of tools for data analysis, and a conceptual change towards open science and data. The true potential of big data lies in pattern discovery in large datasets, as well as the formulation of new models and hypotheses. Confirmation of the existence of the Higgs boson, for instance, is one of the most recent triumphs of big data analysis in physics. Digital representations of biological systems have become more comprehensive. This, in combination with advances in machine learning, creates exciting new research possibilities. In this paper, we review the state of big data in bioanalytical research and provide an overview of the guidelines for its proper usage.
|
41
|
Siragusa E, Haiminen N, Utro F, Parida L. Linear Time Algorithms to Construct Populations Fitting Multiple Constraint Distributions at Genomic Scales. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019; 16:1132-1142. [PMID: 28991752 DOI: 10.1109/tcbb.2017.2760879] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 06/07/2023]
Abstract
Computer simulations can be used to study population genetic methods, models, and parameters, as well as to predict potential outcomes. For example, in plant populations, the outcome of breeding operations can be predicted using simulations. In-silico construction of populations with pre-specified characteristics is an important task in breeding optimization and other population genetic studies. We present two linear time Simulation using Best-fit Algorithms (SimBA) for two classes of problems where each co-fits two distributions: SimBA-LD fits linkage disequilibrium and minimum allele frequency distributions, while SimBA-hap fits founder-haplotype and polyploid allele dosage distributions. An incremental gap-filling version of the previously introduced SimBA-LD is demonstrated here to accurately fit the target distributions, allowing efficient large scale simulations. SimBA-hap accuracy and efficiency are demonstrated by simulating tetraploid populations with varying numbers of founder haplotypes; we evaluate both a linear time greedy algorithm and an optimal solution based on mixed-integer programming. SimBA is available at http://researcher.watson.ibm.com/project/5669.
|
42
|
Firtina C, Bar-Joseph Z, Alkan C, Cicek AE. Hercules: a profile HMM-based hybrid error correction algorithm for long reads. Nucleic Acids Res 2019; 46:e125. [PMID: 30124947 PMCID: PMC6265270 DOI: 10.1093/nar/gky724] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 01/24/2018] [Accepted: 08/07/2018] [Indexed: 01/15/2023] Open
Abstract
Choosing whether to use second or third generation sequencing platforms can lead to trade-offs between accuracy and read length. Several types of studies require long and accurate reads. In such cases researchers often combine both technologies and the erroneous long reads are corrected using the short reads. Current approaches rely on various graph or alignment based techniques and do not take the error profile of the underlying technology into account. Efficient machine learning algorithms that address these shortcomings have the potential to achieve more accurate integration of these two technologies. We propose Hercules, the first machine learning-based long read error correction algorithm. Hercules models every long read as a profile Hidden Markov Model with respect to the underlying platform’s error profile. The algorithm learns a posterior transition/emission probability distribution for each long read to correct errors in these reads. We show on two DNA-seq BAC clones (CH17-157L1 and CH17-227A2) that Hercules-corrected reads have the highest mapping rate among all competing algorithms and have the highest accuracy when the breadth of coverage is high. On a large human CHM1 cell line WGS data set, Hercules is one of the few scalable algorithms; and among those, it achieves the highest accuracy.
Affiliation(s)
- Can Firtina
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
- Ziv Bar-Joseph
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Can Alkan
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
- A Ercument Cicek
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey; Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
|
43
|
Allmer J. Towards an Internet of Science. J Integr Bioinform 2019; 16:/j/jib.ahead-of-print/jib-2019-0024/jib-2019-0024.xml. [PMID: 31145694 PMCID: PMC6798852 DOI: 10.1515/jib-2019-0024] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 04/03/2019] [Accepted: 04/25/2019] [Indexed: 11/15/2022] Open
Abstract
Big data and complex analysis workflows (pipelines) are common issues in data-driven sciences such as bioinformatics. A large number of computational tools are available for data analysis. Additionally, many workflow management systems have been developed to piece such tools together into data analysis pipelines. For example, more than 50 computational tools for read mapping are available, representing a large amount of duplicated effort. Furthermore, it is unclear whether these tools are correct, and only a few have a user base large enough to have encountered and reported most of the potential problems. Bringing together many largely untested tools in a computational pipeline must lead to unpredictable results. Yet, this is the current state. While data analysis is presently performed on personal computers, workstations, and clusters, the future will see development and analysis shift to the cloud. None of the workflow management systems is ready for this transition. This presents the opportunity to build a new system, which will overcome current duplications of effort, introduce proper testing, allow for development and analysis in public and private clouds, and include reporting features leading to interactive documents.
Affiliation(s)
- Jens Allmer
- Hochschule Ruhr West, University of Applied Sciences, Medical Informatics and Bioinformatics, 45407 Mülheim an der Ruhr, Germany
|
44
|
Bayat A, Gaëta B, Ignjatovic A, Parameswaran S. Pairwise alignment of nucleotide sequences using maximal exact matches. BMC Bioinformatics 2019; 20:261. [PMID: 31113356 PMCID: PMC6528274 DOI: 10.1186/s12859-019-2827-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 02/02/2018] [Accepted: 04/17/2019] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Pairwise alignment of short DNA sequences with affine-gap scoring is a common processing step performed in a range of bioinformatics analyses. Dynamic programming (i.e. the Smith-Waterman algorithm) is widely used for this purpose. Even with data-level parallelisation, pairwise alignment consumes much time. There are faster alignment algorithms, but they suffer from a lack of accuracy. RESULTS In this paper, we present MEM-Align, a fast semi-global alignment algorithm for short DNA sequences that allows for affine-gap scoring and exploits sequence similarity. In contrast to traditional alignment methods (such as Smith-Waterman) where individual symbols are aligned, MEM-Align extracts Maximal Exact Matches (MEMs) using a bit-level parallel method and then looks for a subset of MEMs that forms the alignment using a novel dynamic programming method. MEM-Align tries to mimic the alignment produced by Smith-Waterman. As a result, for 99.9% of input sequence pairs, the computed alignment score is identical to the alignment score computed by Smith-Waterman. Yet MEM-Align is up to 14.5 times faster than the Smith-Waterman algorithm. Fast run-time is achieved by: (a) using a bit-level parallel method to extract MEMs; (b) processing MEMs rather than individual symbols; and (c) applying heuristics. CONCLUSIONS MEM-Align is a potential candidate to replace other pairwise alignment algorithms used in processes such as DNA read-mapping and variant calling.
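For reference, the symbol-by-symbol baseline that MEM-Align is compared against can be sketched as a Smith-Waterman score computation with affine gaps (the Gotoh recurrences). This is the conventional local-alignment dynamic program, not MEM-Align itself, which is semi-global and operates on MEMs; the scoring values and the function name `smith_waterman_affine` are assumptions chosen for illustration.

```python
def smith_waterman_affine(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score with affine gaps (Gotoh recurrences).

    A gap of length L costs gap_open + (L - 1) * gap_extend.
    """
    neg = float("-inf")
    n, m = len(a), len(b)
    # H: best score ending at (i, j); E, F: best score ending in a gap
    # that consumes a character of b (E) or of a (F).
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg] * (m + 1) for _ in range(n + 1)]
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

Every cell of the n-by-m matrix is visited, so the cost grows with the product of the sequence lengths; working on MEMs instead of individual symbols is the abstract's stated way of reducing that work.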
Affiliation(s)
- Arash Bayat
- School of Computer Science and Engineering, University of New South Wales (UNSW), Sydney, 2052 Australia
- Health and Biosecurity, CSIRO, 53/11 Julius Ave, North Ryde, Sydney, 2113 Australia
- Bruno Gaëta
- School of Computer Science and Engineering, University of New South Wales (UNSW), Sydney, 2052 Australia
- Aleksandar Ignjatovic
- School of Computer Science and Engineering, University of New South Wales (UNSW), Sydney, 2052 Australia
- Sri Parameswaran
- School of Computer Science and Engineering, University of New South Wales (UNSW), Sydney, 2052 Australia
|
45
|
Du N, Chen J, Sun Y. Improving the sensitivity of long read overlap detection using grouped short k-mer matches. BMC Genomics 2019; 20:190. [PMID: 30967123 PMCID: PMC6456931 DOI: 10.1186/s12864-019-5475-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 12/26/2022] Open
Abstract
Background Single-molecule real-time (SMRT) sequencing developed by Pacific Biosciences produces longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and characterize intra-species variations. It also holds the promise of deciphering the structure of complex microbial communities, because long reads help metagenomic assembly. One key step in genome assembly using long reads is to quickly identify reads that overlap. Because PacBio data have a higher sequencing error rate and lower coverage than popular short read sequencing technologies (such as Illumina), efficient detection of true overlaps requires specially designed algorithms. In particular, there is still a need to improve the sensitivity of detecting small overlaps or overlaps with high error rates in both reads. Addressing this need will enable better assembly of metagenomic data produced by third-generation sequencing technologies. Results In this work, we designed and implemented an overlap detection program named GroupK for third-generation sequencing reads, based on grouped k-mer hits. Although several existing programs use k-mer hits to detect read overlaps, our method uses a group of short k-mer hits satisfying statistically derived distance constraints to increase the sensitivity of small overlap detection. Grouped k-mer hits were originally designed for homology search. We are the first to apply grouped hits to long read overlap detection. The experimental results of applying our pipeline to both simulated and real third-generation sequencing data showed that GroupK enables more sensitive overlap detection, especially for datasets with low sequencing coverage. Conclusions GroupK is best used for detecting small overlaps in third-generation sequencing data. It provides a useful supplement to existing tools for more sensitive and accurate overlap detection. The source code is freely available at https://github.com/Strideradu/GroupK.
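The idea of grouping short k-mer hits under distance constraints can be sketched as follows. This is a simplified Python illustration, not GroupK's implementation: the abstract says the constraints are statistically derived, whereas the fixed thresholds `max_gap` and `max_diag_shift` below, like the function names, are hypothetical placeholders chosen for the example.

```python
from collections import defaultdict


def kmer_hits(read_a, read_b, k):
    """Return all (pos_a, pos_b) pairs where the two reads share a k-mer."""
    index = defaultdict(list)
    for j in range(len(read_b) - k + 1):
        index[read_b[j:j + k]].append(j)
    hits = []
    for i in range(len(read_a) - k + 1):
        for j in index.get(read_a[i:i + k], ()):
            hits.append((i, j))
    return hits


def group_hits(hits, max_gap, max_diag_shift):
    """Greedily chain hits that are close in position and on nearby diagonals.

    max_gap and max_diag_shift are hypothetical fixed thresholds; GroupK
    derives its distance constraints statistically.
    """
    groups = []
    for hit in sorted(hits):
        for group in groups:
            last_i, last_j = group[-1]
            close = 0 < hit[0] - last_i <= max_gap
            # The diagonal (i - j) drifts when indel errors occur between hits.
            same_band = abs((hit[0] - hit[1]) - (last_i - last_j)) <= max_diag_shift
            if close and same_band:
                group.append(hit)
                break
        else:
            groups.append([hit])
    return groups


hits = kmer_hits("AAACCCGGG", "TTAAACCCGGG", k=3)
groups = group_hits(hits, max_gap=2, max_diag_shift=1)
```

A candidate overlap would then be reported when a group contains enough hits; the threshold for "enough" is another parameter the abstract does not specify.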
Affiliation(s)
- Nan Du
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, 48824, MI, USA
- Jiao Chen
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, 48824, MI, USA
- Yanni Sun
  - Electronic Engineering Department, City University of Hong Kong, Hong Kong SAR, China
46
Halldorsson BV, Palsson G, Stefansson OA, Jonsson H, Hardarson MT, Eggertsson HP, Gunnarsson B, Oddsson A, Halldorsson GH, Zink F, Gudjonsson SA, Frigge ML, Thorleifsson G, Sigurdsson A, Stacey SN, Sulem P, Masson G, Helgason A, Gudbjartsson DF, Thorsteinsdottir U, Stefansson K. Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science 2019; 363:eaau1043. [DOI: 10.1126/science.aau1043] [Citation(s) in RCA: 156] [Impact Index Per Article: 31.2] [Received: 05/08/2018] [Revised: 05/16/2018] [Accepted: 12/07/2018] [Indexed: 12/14/2022]
Abstract
Genetic diversity arises from recombination and de novo mutation (DNM). Using a combination of microarray genotype and whole-genome sequence data on parent-child pairs, we identified 4,531,535 crossover recombinations and 200,435 DNMs. The resulting genetic map has a resolution of 682 base pairs. Crossovers exhibit a mutagenic effect, with overrepresentation of DNMs within 1 kilobase of crossovers in males and females. In females, a higher mutation rate is observed up to 40 kilobases from crossovers, particularly for complex crossovers, which increase with maternal age. We identified 35 loci associated with the recombination rate or the location of crossovers, demonstrating extensive genetic control of meiotic recombination, and our results highlight genes linked to the formation of the synaptonemal complex as determinants of crossovers.
47
Pericard P, Dufresne Y, Couderc L, Blanquart S, Touzet H. MATAM: reconstruction of phylogenetic marker genes from short sequencing reads in metagenomes. Bioinformatics 2018; 34:585-591. [PMID: 29040406 DOI: 10.1093/bioinformatics/btx644] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 05/23/2017] [Accepted: 10/10/2017] [Indexed: 01/18/2023] Open
Abstract
Motivation Advances in the sequencing of uncultured environmental samples, dubbed metagenomics, raise a growing need for accurate taxonomic assignment. Accurate identification of the organisms present within a community is essential to understanding even the most elementary ecosystems. However, current high-throughput sequencing technologies generate short reads that only partially cover full-length marker genes, which poses difficult bioinformatic challenges for taxonomic identification at high resolution. Results We designed MATAM, software dedicated to the fast and accurate targeted assembly of short reads sequenced from a genomic marker of interest. The method implements a stepwise process based on the construction and analysis of a read overlap graph. It is applied to the assembly of 16S rRNA markers and is validated on simulated, synthetic and genuine metagenomes. We show that MATAM outperforms other available methods in terms of low error rates and recovered fractions, and is suitable for providing improved assemblies for precise taxonomic assignments. Availability and implementation https://github.com/bonsai-team/matam. Contact pierre.pericard@gmail.com or helene.touzet@univ-lille1.fr. Supplementary information Supplementary data are available at Bioinformatics online.
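As a rough illustration of what a read overlap graph is, the Python sketch below links reads whose suffix exactly matches the prefix of another read. It is a toy under stated assumptions: the abstract does not describe how MATAM computes overlaps, so exact suffix-prefix matching, the `min_overlap` threshold, and both function names are choices made only for this example.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` equal to a prefix of `right`.

    Returns 0 when no overlap of at least min_overlap (>= 1) exists.
    Exact matching is an assumption; real reads contain errors.
    """
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def build_overlap_graph(reads, min_overlap):
    """Return {read_name: {successor_name: overlap_length}} as a directed graph."""
    graph = {name: {} for name in reads}
    for name_a, seq_a in reads.items():
        for name_b, seq_b in reads.items():
            if name_a != name_b:
                n = overlap_length(seq_a, seq_b, min_overlap)
                if n:
                    graph[name_a][name_b] = n
    return graph


reads = {"r1": "ACGTAC", "r2": "GTACGG", "r3": "CGGTTT"}
graph = build_overlap_graph(reads, min_overlap=3)
```

Walking the path r1, r2, r3 in this graph and merging the overlapping ends would spell out a longer contig, which is the intuition behind assembling a full-length marker from short reads.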
Affiliation(s)
- Pierre Pericard
  - CRIStAL (UMR CNRS 9189, Université Lille 1); Inria Lille Nord-Europe
- Yoann Dufresne
  - CRIStAL (UMR CNRS 9189, Université Lille 1); Inria Lille Nord-Europe
- Loïc Couderc
  - CRIStAL (UMR CNRS 9189, Université Lille 1); Bilille, 59650 Villeneuve d'Ascq, France
- Samuel Blanquart
  - CRIStAL (UMR CNRS 9189, Université Lille 1); Inria Lille Nord-Europe
- Hélène Touzet
  - CRIStAL (UMR CNRS 9189, Université Lille 1); Inria Lille Nord-Europe
48
Single-cell mutation identification via phylogenetic inference. Nat Commun 2018; 9:5144. [PMID: 30514897 PMCID: PMC6279798 DOI: 10.1038/s41467-018-07627-7] [Citation(s) in RCA: 52] [Impact Index Per Article: 8.7] [Received: 05/07/2018] [Accepted: 11/15/2018] [Indexed: 12/25/2022] Open
Abstract
Reconstructing the evolution of tumors is a key step towards the identification of appropriate cancer therapies. The task is challenging because tumors evolve as heterogeneous cell populations. Single-cell sequencing holds the promise of resolving the heterogeneity of tumors; however, it has its own challenges, including elevated error rates, allelic drop-out, and uneven coverage. Here, we develop a new approach to mutation detection in individual tumor cells by leveraging the evolutionary relationship among cells. Our method, called SCIΦ, jointly calls mutations in individual cells and estimates the tumor phylogeny among these cells. Employing a Markov Chain Monte Carlo scheme enables us to reliably call mutations in each single cell even in experiments with high drop-out rates and missing data. We show that SCIΦ outperforms existing methods on simulated data and apply it to different real-world datasets, namely a whole-exome breast cancer dataset and a panel acute lymphoblastic leukemia dataset. Cross-cell heterogeneity of genotypes can be revealed by analyzing single-cell sequencing data. Here the authors develop a tool for single-cell variant calling via phylogenetic inference, and use it to analyze cancer genomics datasets.
49
Patil RD, Ellison MJ, Wolff SM, Shearer C, Wright AM, Cockrum RR, Austin KJ, Lamberson WR, Cammack KM, Conant GC. Poor feed efficiency in sheep is associated with several structural abnormalities in the community metabolic network of their ruminal microbes. J Anim Sci 2018; 96:2113-2124. [PMID: 29788417 DOI: 10.1093/jas/sky096] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 09/26/2017] [Accepted: 03/14/2018] [Indexed: 12/19/2022] Open
Abstract
Ruminant animals have a symbiotic relationship with the microorganisms in their rumens. In this relationship, rumen microbes efficiently degrade complex plant-derived compounds into smaller digestible compounds, a process that is very likely associated with host animal feed efficiency. The resulting simpler metabolites can then be absorbed by the host and converted into other compounds by host enzymes. We used a microbial community metabolic network inferred from shotgun metagenomics data to assess how this metabolic system differs between animals that are able to turn ingested feedstuffs into body mass with high efficiency and those that are not. We conducted shotgun sequencing of microbial DNA from the rumen contents of 16 sheep that differed in their residual feed intake (RFI), a measure of feed efficiency. Metagenomic reads from each sheep were mapped onto a database-derived microbial metabolic network, which was linked to the sheep metabolic network by interface metabolites (metabolites transferred from microbes to host). No single enzyme was identified as being significantly different in abundance between the low and high RFI animals (P > 0.05, Wilcoxon test). However, when we analyzed the metabolic network as a whole, we found several differences between efficient and inefficient animals. Microbes from low RFI (efficient) animals use a suite of enzymes closer in network space to the host's reactions than those of the high RFI (inefficient) animals. Similarly, low RFI animals have microbial metabolic networks that, on average, contain reactions using shorter carbon chains than do those of high RFI animals, potentially allowing the host animals to extract metabolites more efficiently. Finally, the efficient animals possess community networks with greater Shannon diversity among their enzymes than do inefficient ones. 
Thus, our systems approach to the ruminal microbiome identified differences attributable to feed efficiency in the structure of the microbes' community metabolic network that were undetected at the level of individual microbial taxa or reactions.
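The Shannon diversity mentioned above can be computed from a vector of enzyme abundances as H = -sum(p_i * ln p_i), where p_i is the fraction of the total attributed to enzyme i. The Python sketch below shows that calculation; the use of the natural logarithm and of raw counts as the abundance measure are assumptions, since the abstract does not state how the authors computed the index, and the function name is introduced only for this example.

```python
import math


def shannon_diversity(abundances):
    """Shannon index H = -sum(p_i * ln(p_i)) over non-zero abundances.

    The natural log base is an assumption; the abstract does not give one.
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    h = 0.0
    for count in abundances:
        if count > 0:  # zero-abundance enzymes contribute nothing
            p = count / total
            h -= p * math.log(p)
    return h


# Hypothetical read counts per enzyme for two animals.
even_community = shannon_diversity([5, 5, 5, 5])
skewed_community = shannon_diversity([17, 1, 1, 1])
```

With the same number of enzymes, the evenly distributed profile yields the larger index, which is the sense in which a community network can be said to be more diverse.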
Affiliation(s)
- Rocky D Patil
  - Division of Animal Sciences, University of Missouri-Columbia, Columbia, MO
- Melinda J Ellison
  - Animal and Veterinary Science, Nancy M. Cummings Research, Extension, and Education Center, University of Idaho, Carmen, ID
- Sara M Wolff
  - Division of Animal Sciences, University of Missouri-Columbia, Columbia, MO
- Anna M Wright
  - Department of Psychological Sciences, University of Missouri-Columbia, Columbia, MO
- Rebecca R Cockrum
  - Department of Dairy Science, Virginia Polytechnic Institute and State University, Blacksburg, VA
- Kathy J Austin
  - Department of Animal Science, University of Wyoming, Laramie, WY
- Kristi M Cammack
  - Department of Animal Science, South Dakota State University, Brookings, SD
- Gavin C Conant
  - Division of Animal Sciences, University of Missouri-Columbia, Columbia, MO; Program in Genetics, North Carolina State University, Raleigh, NC; Bioinformatics Research Center, North Carolina State University, Raleigh, NC; Department of Biological Sciences, North Carolina State University, Raleigh, NC
50
Sambo F, Finotello F, Lavezzo E, Baruzzo G, Masi G, Peta E, Falda M, Toppo S, Barzon L, Di Camillo B. Optimizing PCR primers targeting the bacterial 16S ribosomal RNA gene. BMC Bioinformatics 2018; 19:343. [PMID: 30268091 PMCID: PMC6162885 DOI: 10.1186/s12859-018-2360-6] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 07/22/2017] [Accepted: 09/09/2018] [Indexed: 02/01/2023] Open
Abstract
BACKGROUND Targeted amplicon sequencing of the 16S ribosomal RNA gene is one of the key tools for studying microbial diversity. The accuracy of this approach strongly depends on the choice of primer pairs and, in particular, on the balance between efficiency, specificity and sensitivity in the amplification of the different bacterial 16S sequences contained in a sample. There is thus a need for computational methods that design optimal bacterial 16S primers while taking into account the knowledge provided by new sequencing technologies. RESULTS We propose here a computational method for optimizing the choice of primer sets, based on multi-objective optimization, which simultaneously: 1) maximizes efficiency and specificity of target amplification; 2) maximizes the number of different bacterial 16S sequences matched by at least one primer; 3) minimizes the differences in the number of primers matching each bacterial 16S sequence. Our algorithm can be applied to any desired amplicon length without affecting computational performance. The source code of the developed algorithm is released as the mopo16S software tool (Multi-Objective Primer Optimization for 16S experiments) under the GNU General Public License and is available at http://sysbiobig.dei.unipd.it/?q=Software#mopo16S. CONCLUSIONS Results show that our strategy is able to find better primer pairs than those available in the literature according to all three optimization criteria. We also experimentally validated three of the primer pairs identified by our method on multiple bacterial species belonging to different genera and phyla. Results confirm the predicted efficiency and the ability to maximize the number of different bacterial 16S sequences matched by primers.
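Multi-objective optimization of this kind is usually framed in terms of Pareto dominance: one candidate is preferred over another only if it is at least as good on every criterion and strictly better on at least one. The Python sketch below filters a set of scored candidates down to the non-dominated ones. It is not the mopo16S search algorithm; the candidate names and scores are invented for illustration, and every score is assumed to be oriented so that higher is better (a criterion to be minimized, such as the third one above, would be negated first).

```python
def dominates(a, b):
    """True if score tuple `a` Pareto-dominates `b` (higher is better)."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
    }


# Hypothetical primer sets scored as (efficiency, coverage, -variability).
candidates = {
    "set_A": (1, 1, 1),
    "set_B": (2, 2, 2),
    "set_C": (3, 1, 0),
    "set_D": (0, 0, 0),
}
front = pareto_front(candidates)
```

Here `set_B` and `set_C` survive because each beats the other on at least one criterion, which shows why such a method returns a set of trade-off solutions rather than a single optimum.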
Affiliation(s)
- Francesco Sambo
  - Department of Information Engineering, University of Padova, Padova, Italy
- Francesca Finotello
  - Biocenter, Division of Bioinformatics, Medical University of Innsbruck, Innsbruck, Austria
- Enrico Lavezzo
  - Department of Molecular Medicine, University of Padova, Padova, Italy
- Giacomo Baruzzo
  - Department of Information Engineering, University of Padova, Padova, Italy
- Giulia Masi
  - Department of Molecular Medicine, University of Padova, Padova, Italy
- Elektra Peta
  - Department of Molecular Medicine, University of Padova, Padova, Italy
- Marco Falda
  - Department of Molecular Medicine, University of Padova, Padova, Italy
- Stefano Toppo
  - Department of Molecular Medicine, University of Padova, Padova, Italy
- Luisa Barzon
  - Department of Molecular Medicine, University of Padova, Padova, Italy
- Barbara Di Camillo
  - Department of Information Engineering, University of Padova, Padova, Italy