1
|
Brooks TG, Lahens NF, Mrčela A, Grant GR. Challenges and best practices in omics benchmarking. Nat Rev Genet 2024; 25:326-339. [PMID: 38216661 DOI: 10.1038/s41576-023-00679-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/14/2023] [Indexed: 01/14/2024]
Abstract
Technological advances enabling massively parallel measurement of biological features - such as microarrays, high-throughput sequencing and mass spectrometry - have ushered in the omics era, now in its third decade. The resulting complex landscape of analytical methods has naturally fostered the growth of an omics benchmarking industry. Benchmarking refers to the process of objectively comparing and evaluating the performance of different computational or analytical techniques when processing and analysing large-scale biological data sets, such as transcriptomics, proteomics and metabolomics. With thousands of omics benchmarking studies published over the past 25 years, the field has matured to the point where the foundations of benchmarking have been established and well described. However, generating meaningful benchmarking data and properly evaluating performance in this complex domain remains challenging. In this Review, we highlight some common oversights and pitfalls in omics benchmarking. We also establish a methodology to bring the issues that can be addressed into focus and to be transparent about those that cannot: this takes the form of a spreadsheet template of guidelines for comprehensive reporting, intended to accompany publications. In addition, a survey of recent developments in benchmarking is provided as well as specific guidance for commonly encountered difficulties.
Collapse
Affiliation(s)
- Thomas G Brooks
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Nicholas F Lahens
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Antonijo Mrčela
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Gregory R Grant
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA.
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA.
| |
Collapse
|
2
|
Brooks TG, Lahens NF, Mrčela A, Sarantopoulou D, Nayak S, Naik A, Sengupta S, Choi PS, Grant GR. BEERS2: RNA-Seq simulation through high fidelity in silico modeling. Brief Bioinform 2024; 25:bbae164. [PMID: 38605641 PMCID: PMC11009461 DOI: 10.1093/bib/bbae164] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2023] [Revised: 01/26/2024] [Accepted: 03/26/2024] [Indexed: 04/13/2024] Open
Abstract
Simulation of RNA-seq reads is critical in the assessment, comparison, benchmarking and development of bioinformatics tools. Yet the field of RNA-seq simulators has progressed little in the last decade. To address this need we have developed BEERS2, which combines a flexible and highly configurable design with detailed simulation of the entire library preparation and sequencing pipeline. BEERS2 takes input transcripts (typically fully length messenger RNA transcripts with polyA tails) from either customizable input or from CAMPAREE simulated RNA samples. It produces realistic reads of these transcripts as FASTQ, SAM or BAM formats with the SAM or BAM formats containing the true alignment to the reference genome. It also produces true transcript-level quantification values. BEERS2 combines a flexible and highly configurable design with detailed simulation of the entire library preparation and sequencing pipeline and is designed to include the effects of polyA selection and RiboZero for ribosomal depletion, hexamer priming sequence biases, GC-content biases in polymerase chain reaction (PCR) amplification, barcode read errors and errors during PCR amplification. These characteristics combine to make BEERS2 the most complete simulation of RNA-seq to date. Finally, we demonstrate the use of BEERS2 by measuring the effect of several settings on the popular Salmon pseudoalignment algorithm.
Collapse
Affiliation(s)
- Thomas G Brooks
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
| | - Nicholas F Lahens
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
| | - Antonijo Mrčela
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
| | - Dimitra Sarantopoulou
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Current address: National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
| | - Soumyashant Nayak
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Current address: Statistics and Mathematics Unit, Indian Statistical Institute, Bengaluru, Karnataka, India
| | - Amruta Naik
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Children’s Hospital of Philadelphia, Philadelphia, PA, USA
| | - Shaon Sengupta
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Children’s Hospital of Philadelphia, Philadelphia, PA, USA
- Department of Pediatrics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
| | - Peter S Choi
- Division of Cancer Pathobiology, Children’s Hospital of Philadelphia, Philadelphia, PA, USA
- Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
| | - Gregory R Grant
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
| |
Collapse
|
3
|
Vicente-Garcés C, Maynou J, Fernández G, Esperanza-Cebollada E, Torrebadell M, Català A, Rives S, Camós M, Vega-García N. Fusion InPipe, an integrative pipeline for gene fusion detection from RNA-seq data in acute pediatric leukemia. Front Mol Biosci 2023; 10:1141310. [PMID: 37363396 PMCID: PMC10288994 DOI: 10.3389/fmolb.2023.1141310] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2023] [Accepted: 05/30/2023] [Indexed: 06/28/2023] Open
Abstract
RNA sequencing (RNA-seq) is a reliable tool for detecting gene fusions in acute leukemia. Multiple bioinformatics pipelines have been developed to analyze RNA-seq data, but an agreed gold standard has not been established. This study aimed to compare the applicability of 5 fusion calling pipelines (Arriba, deFuse, CICERO, FusionCatcher, and STAR-Fusion), as well as to define and develop an integrative bioinformatics pipeline (Fusion InPipe) to detect clinically relevant gene fusions in acute pediatric leukemia. We analyzed RNA-seq data by each pipeline individually and by Fusion InPipe. Each algorithm individually called most of the fusions with similar sensitivity and precision. However, not all rearrangements were called, suggesting that choosing a single pipeline might cause missing important fusions. To improve this, we integrated the results of the five algorithms in just one pipeline, Fusion InPipe, comparing the output from the agreement of 5/5, 4/5, and 3/5 algorithms. The maximum sensitivity was achieved with the agreement of 3/5 algorithms, with a global sensitivity of 95%, achieving a 100% in patients' data. Furthermore, we showed the necessity of filtering steps to reduce the false positive detection rate. Here, we demonstrate that Fusion InPipe is an excellent tool for fusion detection in pediatric acute leukemia with the best performance when selecting those fusions called by at least 3/5 pipelines.
Collapse
Affiliation(s)
- Clara Vicente-Garcés
- Pediatric Cancer Center Barcelona (PCCB), Institut de Recerca Sant Joan de Déu, Leukemia and Pediatric Hematology Disorders, Developmental Tumors Biology Group, Esplugues de Llobregat, Spain
| | - Joan Maynou
- Hospital Sant Joan de Déu Barcelona, Genetics Medicine Section, Esplugues de Llobregat, Spain
- Institut de Recerca Hospital Sant Joan de Déu, Neurogenetics and Molecular Medicine, Esplugues de Llobregat, Spain
| | - Guerau Fernández
- Hospital Sant Joan de Déu Barcelona, Genetics Medicine Section, Esplugues de Llobregat, Spain
- Institut de Recerca Hospital Sant Joan de Déu, Neurogenetics and Molecular Medicine, Esplugues de Llobregat, Spain
| | - Elena Esperanza-Cebollada
- Pediatric Cancer Center Barcelona (PCCB), Institut de Recerca Sant Joan de Déu, Leukemia and Pediatric Hematology Disorders, Developmental Tumors Biology Group, Esplugues de Llobregat, Spain
| | - Montserrat Torrebadell
- Pediatric Cancer Center Barcelona (PCCB), Institut de Recerca Sant Joan de Déu, Leukemia and Pediatric Hematology Disorders, Developmental Tumors Biology Group, Esplugues de Llobregat, Spain
- Hospital Sant Joan de Déu Barcelona, Hematology Laboratory, Esplugues de Llobregat, Spain
- Instituto de Salud Carlos III, Centro de Investigación Biomédica en Red De Enfermedades Raras (CIBERER), Madrid, Spain
| | - Albert Català
- Pediatric Cancer Center Barcelona (PCCB), Institut de Recerca Sant Joan de Déu, Leukemia and Pediatric Hematology Disorders, Developmental Tumors Biology Group, Esplugues de Llobregat, Spain
- Instituto de Salud Carlos III, Centro de Investigación Biomédica en Red De Enfermedades Raras (CIBERER), Madrid, Spain
- Pediatric Cancer Center Barcelona (PCCB), Hospital Sant Joan De Déu Barcelona, Leukemia and Lymphoma Unit, Barcelona, Spain
| | - Susana Rives
- Pediatric Cancer Center Barcelona (PCCB), Institut de Recerca Sant Joan de Déu, Leukemia and Pediatric Hematology Disorders, Developmental Tumors Biology Group, Esplugues de Llobregat, Spain
- Instituto de Salud Carlos III, Centro de Investigación Biomédica en Red De Enfermedades Raras (CIBERER), Madrid, Spain
- Pediatric Cancer Center Barcelona (PCCB), Hospital Sant Joan De Déu Barcelona, Leukemia and Lymphoma Unit, Barcelona, Spain
| | - Mireia Camós
- Pediatric Cancer Center Barcelona (PCCB), Institut de Recerca Sant Joan de Déu, Leukemia and Pediatric Hematology Disorders, Developmental Tumors Biology Group, Esplugues de Llobregat, Spain
- Hospital Sant Joan de Déu Barcelona, Hematology Laboratory, Esplugues de Llobregat, Spain
- Instituto de Salud Carlos III, Centro de Investigación Biomédica en Red De Enfermedades Raras (CIBERER), Madrid, Spain
| | - Nerea Vega-García
- Pediatric Cancer Center Barcelona (PCCB), Institut de Recerca Sant Joan de Déu, Leukemia and Pediatric Hematology Disorders, Developmental Tumors Biology Group, Esplugues de Llobregat, Spain
- Hospital Sant Joan de Déu Barcelona, Hematology Laboratory, Esplugues de Llobregat, Spain
| |
Collapse
|
4
|
Brooks TG, Lahens NF, Mrčela A, Sarantopoulou D, Nayak S, Naik A, Sengupta S, Choi PS, Grant GR. BEERS2: RNA-Seq simulation through high fidelity in silico modeling. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.04.21.537847. [PMID: 37162982 PMCID: PMC10168222 DOI: 10.1101/2023.04.21.537847] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/11/2023]
Abstract
Simulation of RNA-seq reads is critical in the assessment, comparison, benchmarking, and development of bioinformatics tools. Yet the field of RNA-seq simulators has progressed little in the last decade. To address this need we have developed BEERS2, which combines a flexible and highly configurable design with detailed simulation of the entire library preparation and sequencing pipeline. BEERS2 takes input transcripts (typically fully-length mRNA transcripts with polyA tails) from either customizable input or from CAMPAREE simulated RNA samples. It produces realistic reads of these transcripts as FASTQ, SAM, or BAM formats with the SAM or BAM formats containing the true alignment to the reference genome. It also produces true transcript-level quantification values. BEERS2 combines a flexible and highly configurable design with detailed simulation of the entire library preparation and sequencing pipeline and is designed to include the effects of polyA selection and RiboZero for ribosomal depletion, hexamer priming sequence biases, GC-content biases in PCR amplification, barcode read errors, and errors during PCR amplification. These characteristics combine to make BEERS2 the most complete simulation of RNA-seq to date. Finally, we demonstrate the use of BEERS2 by measuring the effect of several settings on the popular Salmon pseudoalignment algorithm.
Collapse
Affiliation(s)
- Thomas G Brooks
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
| | - Nicholas F Lahens
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
| | - Antonijo Mrčela
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
| | - Dimitra Sarantopoulou
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Current address: National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
| | - Soumyashant Nayak
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Current address: Statistics and Mathematics Unit, Indian Statistical Institute, Bengaluru, Karnataka, India
| | - Amruta Naik
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Children's Hospital of Philadelphia, Philadelphia, PA, USA
| | - Shaon Sengupta
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Department of Pediatrics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
| | - Peter S Choi
- Division of Cancer Pathobiology, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
| | - Gregory R Grant
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
| |
Collapse
|
5
|
Yang A, Tang JYS, Troup M, Ho JWK. Scavenger: A pipeline for recovery of unaligned reads utilising similarity with aligned reads. F1000Res 2022; 8:1587. [PMID: 32913631 PMCID: PMC7459848 DOI: 10.12688/f1000research.19426.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 08/08/2022] [Indexed: 11/25/2022] Open
Abstract
Read alignment is an important step in RNA-seq analysis as the result of alignment forms the basis for downstream analyses. However, recent studies have shown that published alignment tools have variable mapping sensitivity and do not necessarily align all the reads which should have been aligned, a problem we termed as the false-negative non-alignment problem. Here we present Scavenger, a python-based bioinformatics pipeline for recovering unaligned reads using a novel mechanism in which a putative alignment location is discovered based on sequence similarity between aligned and unaligned reads. We showed that Scavenger could recover unaligned reads in a range of simulated and real RNA-seq datasets, including single-cell RNA-seq data. We found that recovered reads tend to contain more genetic variants with respect to the reference genome compared to previously aligned reads, indicating that divergence between personal and reference genomes plays a role in the false-negative non-alignment problem. Even when the number of recovered reads is relatively small compared to the total number of reads, the addition of these recovered reads can impact downstream analyses, especially in terms of estimating the expression and differential expression of lowly expressed genes, such as pseudogenes.
Collapse
Affiliation(s)
- Andrian Yang
- Victor Chang Cardiac Research Institute, Sydney, NSW, 2010, Australia
- St. Vincent’s Clinical School, University of New South Wales, Sydney, NSW, 2052, Australia
| | - Joshua Y. S. Tang
- Victor Chang Cardiac Research Institute, Sydney, NSW, 2010, Australia
- St. Vincent’s Clinical School, University of New South Wales, Sydney, NSW, 2052, Australia
| | - Michael Troup
- Victor Chang Cardiac Research Institute, Sydney, NSW, 2010, Australia
| | - Joshua W. K. Ho
- Victor Chang Cardiac Research Institute, Sydney, NSW, 2010, Australia
- St. Vincent’s Clinical School, University of New South Wales, Sydney, NSW, 2052, Australia
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, University of Hong Kong, Hong Kong, China
| |
Collapse
|
6
|
Kühl MA, Stich B, Ries DC. Mutation-Simulator: fine-grained simulation of random mutations in any genome. Bioinformatics 2021; 37:568-569. [PMID: 32780803 PMCID: PMC8088320 DOI: 10.1093/bioinformatics/btaa716] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2020] [Revised: 06/12/2020] [Accepted: 08/05/2020] [Indexed: 01/11/2023] Open
Abstract
Summary Mutation-Simulator allows the introduction of various types of sequence alterations in reference sequences, with reasonable compute-time even for large eukaryotic genomes. Its intuitive system for fine-grained control over mutation rates along the sequence enables the mimicking of natural mutation patterns. Using standard file formats for input and output data, it can easily be integrated into any development and benchmarking workflow for high-throughput sequencing applications. Availability and implementation Mutation-Simulator is written in Python 3 and the source code, documentation, help and use cases are available on the Github page at https://github.com/mkpython3/Mutation-Simulator. It is free for use under the GPL 3 license.
Collapse
Affiliation(s)
- M A Kühl
- Quantitative Genetics and Genomics of Plants, Heinrich Heine University, Düsseldorf 40225, Germany
| | - B Stich
- Quantitative Genetics and Genomics of Plants, Heinrich Heine University, Düsseldorf 40225, Germany
| | - D C Ries
- Quantitative Genetics and Genomics of Plants, Heinrich Heine University, Düsseldorf 40225, Germany
| |
Collapse
|
7
|
Egert J, Brombacher E, Warscheid B, Kreutz C. DIMA: Data-Driven Selection of an Imputation Algorithm. J Proteome Res 2021; 20:3489-3496. [PMID: 34062065 DOI: 10.1021/acs.jproteome.1c00119] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Imputation is a prominent strategy when dealing with missing values (MVs) in proteomics data analysis pipelines. However, it is difficult to assess the performance of different imputation methods and varies strongly depending on data characteristics. To overcome this issue, we present the concept of a data-driven selection of an imputation algorithm (DIMA). The performance and broad applicability of DIMA are demonstrated on 142 quantitative proteomics data sets from the PRoteomics IDEntifications (PRIDE) database and on simulated data consisting of 5-50% MVs with different proportions of missing not at random and missing completely at random values. DIMA reliably suggests a high-performing imputation algorithm, which is always among the three best algorithms and results in a root mean square error difference (ΔRMSE) ≤ 10% in 80% of the cases. DIMA implementation is available in MATLAB at github.com/kreutz-lab/OmicsData and in R at github.com/kreutz-lab/DIMAR.
Collapse
Affiliation(s)
- Janine Egert
- Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany.,Centre for Integrative Biological Signalling Studies (CIBSS), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany
| | - Eva Brombacher
- Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany.,Centre for Integrative Biological Signalling Studies (CIBSS), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany.,Spemann Graduate School of Biology and Medicine (SGBM), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg, Germany.,Faculty of Biology, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
| | - Bettina Warscheid
- Biochemistry and Functional Proteomics, Institute of Biology II, Faculty of Biology, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany.,Signalling Research Centres BIOSS and CIBSS, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
| | - Clemens Kreutz
- Institute of Medical Biometry and Statistics (IMBI), Institute of Medicine and Medical Center Freiburg, 79104 Freiburg im Breisgau, Germany.,Signalling Research Centres BIOSS and CIBSS, Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany.,Center for Data Analysis and Modeling (FDM), Albert-Ludwigs-Universität Freiburg, 79104 Freiburg im Breisgau, Germany
| |
Collapse
|
8
|
Das JK, Roy S. A study on non-synonymous mutational patterns in structural proteins of SARS-CoV-2. Genome 2021; 64:665-678. [PMID: 33788636 DOI: 10.1139/gen-2020-0157] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]
Abstract
SARS-CoV-2 is mutating and creating divergent variants across the world. An in-depth investigation of the amino acid substitutions in the genomic signature of SARS-CoV-2 proteins is highly essential for understanding its host adaptation and infection biology. A total of 9587 SARS-CoV-2 structural protein sequences collected from 49 different countries are used to characterize protein-wise variants, substitution patterns (type and location), and major substitution changes. The majority of the substitutions are distinct, mostly in a particular location, and lead to a change in an amino acid's biochemical properties. In terms of mutational changes, envelope (E) and membrane (M) proteins are relatively more stable than nucleocapsid (N) and spike (S) proteins. Several co-occurrence substitutions are observed, particularly in S and N proteins. Substitution specific to active sub-domains reveals that heptapeptide repeat, fusion peptides, transmembrane in S protein, and N-terminal and C-terminal domains in the N protein are remarkably mutated. We also observe a few deleterious mutations in the above domains. The overall study on non-synonymous mutation in structural proteins of SARS-CoV-2 at the start of the pandemic indicates a diversity amongst virus sequences.
Collapse
Affiliation(s)
- Jayanta Kumar Das
- Department of Pediatrics, Johns Hopkins University School of Medicine, Maryland, USA
| | - Swarup Roy
- Network Reconstruction & Analysis (NetRA) Lab, Department of Computer Applications, Sikkim University, Gangtok, India
| |
Collapse
|