1
Janzen T, Etienne RS. Phylogenetic tree statistics: A systematic overview using the new R package 'treestats'. Mol Phylogenet Evol 2024; 200:108168. [PMID: 39117295 DOI: 10.1016/j.ympev.2024.108168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Revised: 07/19/2024] [Accepted: 08/04/2024] [Indexed: 08/10/2024]
Abstract
Phylogenetic trees are believed to contain a wealth of information on diversification processes. However, comparing phylogenetic trees is not straightforward due to their high dimensionality. Researchers have therefore defined a wide range of low-dimensional summary statistics. Currently, it remains unexplored to what extent these summary statistics cover the same underlying information and what summary statistics best explain observed variation across phylogenies. Furthermore, a large subset of available summary statistics focuses on measuring the topological features of a phylogenetic tree, but these statistics are often explored only at the extreme edge cases of the fully balanced or imbalanced tree and not for trees of intermediate balance. Here, we introduce a new R package called 'treestats' that provides speed-optimized code to compute 70 summary statistics. We study correlations between summary statistics on empirical trees and on trees simulated using several diversification models. Furthermore, we introduce an algorithm to create intermediately balanced trees in a well-defined manner, in order to explore variation in summary statistics across a balance gradient. We find that almost all summary statistics are correlated with tree size, and find that it is difficult, if not impossible, to correct for tree size, unless the tree generating model is known. Furthermore, we find that across empirical and simulated trees, at least three large clusters of correlated summary statistics can be found, where statistics group together based on information used (topology or branching times). However, the finer-grained correlation structure appears to depend strongly on either the taxonomic group studied (in empirical studies) or the tree generating model (in simulation studies). Amongst statistics describing the (im)balance of a tree, we find that almost all statistics vary non-linearly, and sometimes even non-monotonically, with our generated balance gradient. This indicates that balance is perhaps a more complex property of a tree than previously thought. Furthermore, using our new imbalancing algorithm, we devise a numerical test to identify balance statistics, and identify several statistics as balance statistics that were not previously considered as such. Lastly, our results lead to several recommendations on which statistics to select when analyzing and comparing phylogenetic trees.
Affiliation(s)
- Thijs Janzen
  - Groningen Institute for Evolutionary Life Sciences, University of Groningen, Groningen, the Netherlands
- Rampal S Etienne
  - Groningen Institute for Evolutionary Life Sciences, University of Groningen, Groningen, the Netherlands
2
Alser M, Lawlor B, Abdill RJ, Waymost S, Ayyala R, Rajkumar N, LaPierre N, Brito J, Ribeiro-Dos-Santos AM, Almadhoun N, Sarwal V, Firtina C, Osinski T, Eskin E, Hu Q, Strong D, Kim BDBD, Abedalthagafi MS, Mutlu O, Mangul S. Packaging and containerization of computational methods. Nat Protoc 2024; 19:2529-2539. [PMID: 38565959 DOI: 10.1038/s41596-024-00986-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2022] [Accepted: 02/12/2024] [Indexed: 04/04/2024]
Abstract
Methods for analyzing the full complement of a biomolecule type, e.g., proteomics or metabolomics, generate large amounts of complex data. The software tools used to analyze omics data have reshaped the landscape of modern biology and become an essential component of biomedical research. These tools are themselves quite complex and often require the installation of other supporting software, libraries and/or databases. A researcher may also be using multiple different tools that require different versions of the same supporting materials. The increasing dependence of biomedical scientists on these powerful tools creates a need for easier installation and greater usability. Packaging and containerization are different approaches to satisfy this need by delivering omics tools already wrapped in additional software that makes the tools easier to install and use. In this systematic review, we describe and compare the features of prominent packaging and containerization platforms. We outline the challenges, advantages and limitations of each approach and some of the most widely used platforms from the perspectives of users, software developers and system administrators. We also propose principles to make the distribution of omics software more sustainable and robust to increase the reproducibility of biomedical and life science research.
Affiliation(s)
- Mohammed Alser
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
- Brendan Lawlor
  - Department of Computer Science, Munster Technological University, Cork, Ireland
  - Department of Biological Sciences, Munster Technological University, Cork, Ireland
- Richard J Abdill
  - Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA
- Sharon Waymost
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
- Ram Ayyala
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
  - Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA, USA
- Neha Rajkumar
  - Department of Bioengineering, University of California, Los Angeles, Los Angeles, CA, USA
- Nathan LaPierre
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Jaqueline Brito
  - Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA, USA
- Nour Almadhoun
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
- Varuni Sarwal
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
- Can Firtina
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
- Tomasz Osinski
  - Center for Advanced Research Computing, University of Southern California, Los Angeles, CA, USA
- Eleazar Eskin
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, University of California, Los Angeles, CA, USA
- Qiyang Hu
  - Office of Advanced Research Computing, University of California, Los Angeles, CA, USA
- Derek Strong
  - Center for Advanced Research Computing, University of Southern California, Los Angeles, CA, USA
- Byoung-Do B D Kim
  - Center for Advanced Research Computing, University of Southern California, Los Angeles, CA, USA
- Malak S Abedalthagafi
  - Department of Pathology & Laboratory Medicine, Emory University Hospital, Atlanta, GA, USA
  - King Salman Center for Disability Research, Riyadh, Saudi Arabia
- Onur Mutlu
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
- Serghei Mangul
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
  - Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA, USA
3
Abdill RJ, Talarico E, Grieneisen L. A how-to guide for code sharing in biology. PLoS Biol 2024; 22:e3002815. [PMID: 39255324 PMCID: PMC11414921 DOI: 10.1371/journal.pbio.3002815] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Revised: 09/20/2024] [Indexed: 09/12/2024] [Open Access]
Abstract
In 2024, all biology is computational biology. Computer-aided analysis continues to spread into new fields, becoming more accessible to researchers trained in the wet lab who are eager to take advantage of growing datasets, falling costs, and novel assays that present new opportunities for discovery. It is currently much easier to find guidance for implementing these techniques than for reporting their use, leaving biologists to guess which details and files are relevant. In this essay, we review existing literature on the topic, summarize common tips, and link to additional resources for training. Following this overview, we then provide a set of recommendations for sharing code, with an eye toward guiding those who are comparatively new to applying open science principles to their computational work. Taken together, we provide a guide for biologists who seek to follow code sharing best practices but are unsure where to start.
Affiliation(s)
- Richard J. Abdill
  - Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America
- Emma Talarico
  - Department of Biology, University of British Columbia—Okanagan Campus, Kelowna, British Columbia, Canada
- Laura Grieneisen
  - Department of Biology, University of British Columbia—Okanagan Campus, Kelowna, British Columbia, Canada
  - Okanagan Institute for Biodiversity, Resilience, and Ecosystem Services, University of British Columbia—Okanagan Campus, Kelowna, British Columbia, Canada
4
Deshpande D, Chhugani K, Ramesh T, Pellegrini M, Shiffman S, Abedalthagafi MS, Alqahtani S, Ye J, Liu XS, Leek JT, Brazma A, Ophoff RA, Rao G, Butte AJ, Moore JH, Katritch V, Mangul S. The evolution of computational research in a data-centric world. Cell 2024; 187:4449-4457. [PMID: 39178828 DOI: 10.1016/j.cell.2024.07.045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2024] [Revised: 07/21/2024] [Accepted: 07/24/2024] [Indexed: 08/26/2024]
Abstract
Computational data-centric research techniques play a prevalent and multi-disciplinary role in life science research. In the past, scientists in wet labs generated the data, and computational researchers focused on creating tools for the analysis of those data. Computational researchers are now becoming more independent and taking leadership roles within biomedical projects, leveraging the increased availability of public data. We are now able to generate vast amounts of data, and the challenge has shifted from data generation to data analysis. Here we discuss the pitfalls, challenges, and opportunities facing the field of data-centric research in biology. We discuss the evolving perception of computational data-driven research and its rise as an independent domain in biomedical research while also addressing the significant collaborative opportunities that arise from integrating computational research with experimental and translational biology. Additionally, we discuss the future of data-centric research and its applications across various areas of the biomedical field.
Affiliation(s)
- Dhrithi Deshpande
  - Titus Department of Clinical Pharmacy, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Karishma Chhugani
  - Titus Department of Clinical Pharmacy, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Tejasvene Ramesh
  - Department of Pharmacology and Pharmaceutical Sciences, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Matteo Pellegrini
  - Department of Molecular, Cell and Developmental Biology, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Sagiv Shiffman
  - Department of Genetics, The Alexander Silberman Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel
- Malak S Abedalthagafi
  - Genomics Research Department, King Fahad Medical City, Riyadh, Saudi Arabia; Department of Pathology & Laboratory Medicine, Emory University Hospital, Atlanta, GA, USA
- Saleh Alqahtani
  - The Liver Transplant Unit, King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia; The Division of Gastroenterology and Hepatology, Johns Hopkins University, Baltimore, MD 21205, USA
- Jimmie Ye
  - Department of Epidemiology & Biostatistics, Institute for Human Genetics, University of California, San Francisco, 513 Parnassus Avenue S965F, San Francisco, CA 94143, USA
- Xiaole Shirley Liu
  - GV20 Oncotherapy, One Broadway, 14th Floor, Kendall Square, Cambridge, MA 02142, USA
- Jeffrey T Leek
  - Biostatistics and Oncology at the Johns Hopkins Bloomberg School of Public Health and Johns Hopkins Data Science Lab, Johns Hopkins University, 615 N. Wolfe Street, Baltimore, MD 21205, USA
- Alvis Brazma
  - EMBL European Bioinformatics Institute, Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Roel A Ophoff
  - Department of Psychiatry and Human Genetics, Center for Neurobehavioral Genetics, University of California, Los Angeles, Los Angeles, CA, USA
- Gauri Rao
  - Titus Department of Clinical Pharmacy, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Atul J Butte
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, 490 Illinois Street, San Francisco, CA 94158, USA
- Jason H Moore
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Boulevard, Pacific Design Center Suite G540, West Hollywood, CA 90068, USA
- Vsevolod Katritch
  - Department of Quantitative and Computational Biology, USC Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, CA 90007, USA
- Serghei Mangul
  - Department of Quantitative and Computational Biology, USC Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, CA 90007, USA
5
Baykal PI, Łabaj PP, Markowetz F, Schriml LM, Stekhoven DJ, Mangul S, Beerenwinkel N. Genomic reproducibility in the bioinformatics era. Genome Biol 2024; 25:213. [PMID: 39123217 PMCID: PMC11312195 DOI: 10.1186/s13059-024-03343-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Accepted: 07/23/2024] [Indexed: 08/12/2024] [Open Access]
Abstract
In biomedical research, validating a scientific discovery hinges on the reproducibility of its experimental results. However, in genomics, the definition and implementation of reproducibility remain imprecise. We argue that genomic reproducibility, defined as the ability of bioinformatics tools to maintain consistent results across technical replicates, is essential for advancing scientific knowledge and medical applications. Initially, we examine different interpretations of reproducibility in genomics to clarify terms. Subsequently, we discuss the impact of bioinformatics tools on genomic reproducibility and explore methods for evaluating these tools regarding their effectiveness in ensuring genomic reproducibility. Finally, we recommend best practices to improve genomic reproducibility.
Affiliation(s)
- Pelin Icer Baykal
  - Department of Biosystems Science and Engineering, ETH Zurich, 4058, Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, 4058, Basel, Switzerland
- Paweł Piotr Łabaj
  - Małopolska Centre of Biotechnology, Jagiellonian University, 30-387, Gronostajowa 7A, Krakow, Poland
  - Department of Biotechnology, Boku University Vienna, Muthgasse 18, 1190, Vienna, Austria
- Florian Markowetz
  - Cancer Research UK Cambridge Research Institute, Cambridge, CB2 0RE, UK
  - Department of Oncology, University of Cambridge, Cambridge, CB2 2XZ, UK
- Lynn M Schriml
  - Institute for Genome Sciences, University of Maryland School of Medicine, HSFIII, 670 W. Baltimore St, Baltimore, MD, 21201, USA
- Daniel J Stekhoven
  - SIB Swiss Institute of Bioinformatics, 4058, Basel, Switzerland
  - NEXUS Personalized Health Technologies, ETH Zurich, 8952, Zurich, Switzerland
- Serghei Mangul
  - Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, 1540 Alcazar Street, Los Angeles, CA, 90033, USA
  - Department of Quantitative and Computational Biology, University of Southern California Dornsife College of Letters, Arts, and Sciences, Los Angeles, CA, 90089, USA
- Niko Beerenwinkel
  - Department of Biosystems Science and Engineering, ETH Zurich, 4058, Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, 4058, Basel, Switzerland
6
Li Y, Yang Y, Tong Z, Wang Y, Mi Q, Bai M, Liang G, Li B, Shu K. A comparative benchmarking and evaluation framework for heterogeneous network-based drug repositioning methods. Brief Bioinform 2024; 25:bbae172. [PMID: 38647153 PMCID: PMC11033846 DOI: 10.1093/bib/bbae172] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2023] [Revised: 02/25/2024] [Accepted: 04/02/2024] [Indexed: 04/25/2024] [Open Access]
Abstract
Computational drug repositioning, which involves identifying new indications for existing drugs, is an increasingly attractive research area due to its advantages in reducing both overall cost and development time. As a result, a growing number of computational drug repositioning methods have emerged. Heterogeneous network-based drug repositioning methods have been shown to outperform other approaches. However, there is a dearth of systematic evaluation studies of these methods, encompassing performance, scalability and usability, as well as a standardized process for evaluating new methods. Additionally, previous studies have compared only a few methods, with conflicting results. In this context, we conducted a systematic benchmarking study of 28 heterogeneous network-based drug repositioning methods on 11 existing datasets. We developed a comprehensive framework to evaluate their performance, scalability and usability. Our study revealed that methods such as HGIMC, ITRPCA and BNNR exhibit the best overall performance, as they rely on matrix completion or factorization. HINGRL, MLMC, ITRPCA and HGIMC demonstrate the best performance, while NMFDR, GROBMC and SCPMF display superior scalability. For usability, HGIMC, DRHGCN and BNNR are the top performers. Building on these findings, we developed an online tool called HN-DREP (http://hn-drep.lyhbio.com/) to help researchers view all the detailed evaluation results and select an appropriate method. HN-DREP also provides an external drug repositioning prediction service for a specific disease or drug by integrating predictions from all methods. Furthermore, we have released a Snakemake workflow named HN-DRES (https://github.com/lyhbio/HN-DRES) to facilitate benchmarking and support the extension of new methods into the field.
Affiliation(s)
- Yinghong Li
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China
- Yinqi Yang
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China
- Zhuohao Tong
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China
- Yu Wang
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China
- Qin Mi
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China
- Mingze Bai
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China
- Guizhao Liang
  - Key Laboratory of Biorheological Science and Technology, Ministry of Education, Bioengineering College, Chongqing University, Chongqing, 400044, P. R. China
- Bo Li
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, P. R. China
- Kunxian Shu
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China
7
Loreto ELS, Melo ESD, Wallau GL, Gomes TMFF. The good, the bad and the ugly of transposable elements annotation tools. Genet Mol Biol 2024; 46:e20230138. [PMID: 38373163 PMCID: PMC10876081 DOI: 10.1590/1678-4685-gmb-2023-0138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Accepted: 11/26/2023] [Indexed: 02/21/2024] [Open Access]
Abstract
Transposable elements (TEs) are repetitive and mobile DNA segments that can be found in virtually all organisms investigated to date. Their complex structure and variable nature are particularly challenging from the genomic annotation point of view. Many software tools have been developed to automate and facilitate TE annotation at the genomic level, but they are highly heterogeneous regarding documentation, usability and methods. In this review, we revisit the existing software for TE genomic annotation, concentrating on the most frequently used tools, the methodologies they apply, and their usability. Building on the state of the art of TE annotation software, we propose best practices and highlight the strengths and weaknesses of the available solutions.
Affiliation(s)
- Elgion L S Loreto
  - Universidade Federal do Rio Grande do Sul, Programa de Pós-Graduação em Genética e Biologia Molecular, Porto Alegre, RS, Brazil
  - Universidade Federal de Santa Maria, Departamento de Bioquímica e Biologia Molecular, Santa Maria, RS, Brazil
- Elverson S de Melo
  - Fundação Oswaldo Cruz, Instituto Aggeu Magalhães, Departamento de Entomologia, Recife, PE, Brazil
- Gabriel L Wallau
  - Fundação Oswaldo Cruz, Instituto Aggeu Magalhães, Departamento de Entomologia, Recife, PE, Brazil
- Tiago M F F Gomes
  - Universidade Federal do Rio Grande do Sul, Programa de Pós-Graduação em Genética e Biologia Molecular, Porto Alegre, RS, Brazil
8
Rocha U, Coelho Kasmanas J, Kallies R, Saraiva JP, Toscan RB, Štefanič P, Bicalho MF, Borim Correa F, Baştürk MN, Fousekis E, Viana Barbosa LM, Plewka J, Probst AJ, Baldrian P, Stadler PF. MuDoGeR: Multi-Domain Genome recovery from metagenomes made easy. Mol Ecol Resour 2024; 24:e13904. [PMID: 37994269 DOI: 10.1111/1755-0998.13904] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/12/2022] [Revised: 10/18/2023] [Accepted: 11/13/2023] [Indexed: 11/24/2023]
Abstract
Several computational frameworks and workflows that recover genomes from prokaryotes, eukaryotes and viruses from metagenomes exist. Yet, it is difficult for scientists with little bioinformatics experience to evaluate quality, annotate genes, dereplicate, assign taxonomy and calculate relative abundance and coverage of genomes belonging to different domains. MuDoGeR is a user-friendly tool tailored for those familiar with the Unix command-line environment that makes it easy to recover genomes of prokaryotes, eukaryotes and viruses from metagenomes, either alone or in combination. We tested MuDoGeR using 24 individual-isolated genomes and 574 metagenomes, demonstrating the applicability for a few samples and high throughput. While MuDoGeR can recover eukaryotic viral sequences, its characterization is predominantly skewed towards bacterial and archaeal viruses, reflecting the field's current state. However, acting as a dynamic wrapper, MuDoGeR is designed to constantly incorporate updates and integrate new tools, ensuring its ongoing relevance in the rapidly evolving field. MuDoGeR is open-source software available at https://github.com/mdsufz/MuDoGeR. MuDoGeR is also available as a Singularity container.
Affiliation(s)
- Ulisses Rocha
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Jonas Coelho Kasmanas
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, Brazil
- René Kallies
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Joao Pedro Saraiva
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Rodolfo Brizola Toscan
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Polonca Štefanič
  - Biotechnical Faculty, University of Ljubljana, Ljubljana, Slovenia
- Marcos Fleming Bicalho
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Felipe Borim Correa
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Merve Nida Baştürk
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Efthymios Fousekis
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Luiz Miguel Viana Barbosa
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Julia Plewka
  - Environmental Microbiology and Biotechnology, Department of Chemistry, University of Duisburg-Essen, Essen, Germany
- Alexander J Probst
  - Environmental Microbiology and Biotechnology, Department of Chemistry, University of Duisburg-Essen, Essen, Germany
- Petr Baldrian
  - Laboratory of Environmental Microbiology, Institute of Microbiology of the Czech Academy of Sciences, Praha 4, Czech Republic
- Peter F Stadler
  - Department of Computer Science and Interdisciplinary Center of Bioinformatics, University of Leipzig, Leipzig, Germany
  - Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
  - Institute for Theoretical Chemistry, University of Vienna, Vienna, Austria
  - The Santa Fe Institute, Santa Fe, New Mexico, USA
9
Cavalcante JVF, de Souza ID, Morais DADA, Dalmolin RJS. Bridging the Gaps in Meta-Omic Analysis: Workflows and Reproducibility. OMICS: A Journal of Integrative Biology 2023; 27:547-549. [PMID: 38019198 DOI: 10.1089/omi.2023.0232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2023]
Abstract
The past few years have seen significant advances in the study of complex microbial communities associated with the evolution of sequencing technologies and increasing adoption of whole genome shotgun sequencing methods over the once more traditional amplicon-based methods. Although these advances have broadened the horizon of meta-omic analyses in planetary health, human health, and ecology from simple sample composition studies to comprehensive taxonomic and metabolic profiles, there are still significant challenges in processing these data. First, there is a widespread lack of standardization in data processing, including software choices and the ease of installing and running attendant software. This can lead to several inconsistencies, making comparing results across studies and reproducing original results difficult. We argue that these drawbacks are especially evident in metatranscriptomic analysis, with most analyses relying on ad hoc scripts instead of pipelines implemented in workflow managers. Additional challenges lie in integrating meta-omic data, since methods have to consider the biases in the library preparation and sequencing methods and the technical noise that can arise from them. Here, we critically discuss the current limitations in metagenomics and metatranscriptomics methods with a view to catalyzing future innovations in the field of planetary health, ecology, and allied fields of life sciences. We highlight possible solutions for these constraints to bring about more standardization, with ease of installation, high performance, and reproducibility as guiding principles.
Affiliation(s)
- Iara Dantas de Souza
  - Bioinformatics Multidisciplinary Environment-IMD, Federal University of Rio Grande do Norte, Natal, Brazil
- Rodrigo Juliani Siqueira Dalmolin
  - Bioinformatics Multidisciplinary Environment-IMD, Federal University of Rio Grande do Norte, Natal, Brazil
  - Department of Biochemistry-CB, Federal University of Rio Grande do Norte, Natal, Brazil
10
Ziemann M, Poulain P, Bora A. The five pillars of computational reproducibility: bioinformatics and beyond. Brief Bioinform 2023; 24:bbad375. [PMID: 37870287 PMCID: PMC10591307 DOI: 10.1093/bib/bbad375] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Revised: 09/26/2023] [Accepted: 09/30/2023] [Indexed: 10/24/2023] [Open Access]
Abstract
Computational reproducibility is a simple premise in theory, but is difficult to achieve in practice. Building upon past efforts and proposals to maximize reproducibility and rigor in bioinformatics, we present a framework called the five pillars of reproducible computational research. These include (1) literate programming, (2) code version control and sharing, (3) compute environment control, (4) persistent data sharing and (5) documentation. These practices will ensure that computational research work can be reproduced quickly and easily, long into the future. This guide is designed for bioinformatics data analysts and bioinformaticians in training, but should be relevant to other domains of study.
Affiliation(s)
- Mark Ziemann
  - Deakin University, School of Life and Environmental Sciences, Geelong, Australia
  - Burnet Institute, Melbourne, Australia
- Pierre Poulain
  - Université Paris Cité, CNRS, Institut Jacques Monod, Paris, France
- Anusuiya Bora
  - Deakin University, School of Life and Environmental Sciences, Geelong, Australia
11
Sonrel A, Luetge A, Soneson C, Mallona I, Germain PL, Knyazev S, Gilis J, Gerber R, Seurinck R, Paul D, Sonder E, Crowell HL, Fanaswala I, Al-Ajami A, Heidari E, Schmeing S, Milosavljevic S, Saeys Y, Mangul S, Robinson MD. Meta-analysis of (single-cell method) benchmarks reveals the need for extensibility and interoperability. Genome Biol 2023; 24:119. [PMID: 37198712 PMCID: PMC10189979 DOI: 10.1186/s13059-023-02962-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 09/27/2022] [Accepted: 05/06/2023] [Indexed: 05/19/2023] [Open Access]
Abstract
Computational methods represent the lifeblood of modern molecular biology. Benchmarking is important for all methods, but with a focus here on computational methods, benchmarking is critical to dissect important steps of analysis pipelines, formally assess performance across common situations as well as edge cases, and ultimately guide users on what tools to use. Benchmarking can also be important for community building and advancing methods in a principled way. We conducted a meta-analysis of recent single-cell benchmarks to summarize the scope, extensibility, and neutrality, as well as technical features and whether best practices in open data and reproducible research were followed. The results highlight that while benchmarks often make code available and are in principle reproducible, they remain difficult to extend, for example, as new methods and new ways to assess methods emerge. In addition, embracing containerization and workflow systems would enhance reusability of intermediate benchmarking results, thus also driving wider adoption.
Affiliation(s)
- Anthony Sonrel
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Almut Luetge
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Charlotte Soneson
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
  - Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
- Izaskun Mallona
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
  - Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland
- Pierre-Luc Germain
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
  - D-HEST Institute for Neuroscience, ETH Zürich, Zurich, Switzerland
- Sergey Knyazev
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, USA
- Jeroen Gilis
  - Department of Applied Mathematics, Computer Science & Statistics, Ghent University, Ghent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB Center for Inflammation Research, Ghent, Belgium
  - Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
- Reto Gerber
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Ruth Seurinck
  - Department of Applied Mathematics, Computer Science & Statistics, Ghent University, Ghent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB Center for Inflammation Research, Ghent, Belgium
- Dominique Paul
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- Emanuel Sonder
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
  - D-HEST Institute for Neuroscience, ETH Zürich, Zurich, Switzerland
- Helena L Crowell
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Imran Fanaswala
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Ahmad Al-Ajami
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Elyas Heidari
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Stephan Schmeing
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Stefan Milosavljevic
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
  - Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
- Yvan Saeys
  - Department of Applied Mathematics, Computer Science & Statistics, Ghent University, Ghent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB Center for Inflammation Research, Ghent, Belgium
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, USA
- Mark D Robinson
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
12
Tello D, Gonzalez-Garcia LN, Gomez J, Zuluaga-Monares JC, Garcia R, Angel R, Mahecha D, Duarte E, Leon MDR, Reyes F, Escobar-Velásquez C, Linares-Vásquez M, Cardozo N, Duitama J. NGSEP 4: Efficient and accurate identification of orthogroups and whole-genome alignment. Mol Ecol Resour 2023; 23:712-724. [PMID: 36377253 DOI: 10.1111/1755-0998.13737] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 06/30/2022] [Revised: 10/26/2022] [Accepted: 11/09/2022] [Indexed: 11/16/2022]
Abstract
Whole-genome alignment allows researchers to understand the genomic structure and variation among genomes. Approaches based on direct pairwise comparisons of DNA sequences require large computational capacities. As a consequence, pipelines combining tools for orthologous gene identification and synteny have been developed. In this manuscript, we present the latest functionalities implemented in NGSEP 4 to identify orthogroups and perform whole-genome alignments. NGSEP implements functionalities for identification of clusters of homologous genes, synteny analysis and whole-genome alignment. Our results showed that the NGSEP algorithm for orthogroup identification has competitive accuracy and efficiency in comparison to commonly used tools. The implementation also includes a visualization of the whole-genome alignment based on synteny of the orthogroups that were identified, and a reconstruction of the pangenome based on frequencies of the orthogroups among the genomes. NGSEP 4 also includes a new graphical user interface based on the JavaFX technology. We expect that these new developments will be very useful for several studies in evolutionary biology and population genomics.
Affiliation(s)
- Daniel Tello
  - Systems and Computing Engineering Department, Universidad de los Andes, Bogotá, Colombia
- Jorge Gomez
  - Systems and Computing Engineering Department, Universidad de los Andes, Bogotá, Colombia
- Rogelio Garcia
  - Systems and Computing Engineering Department, Universidad de los Andes, Bogotá, Colombia
- Ricardo Angel
  - Systems and Computing Engineering Department, Universidad de los Andes, Bogotá, Colombia
- Daniel Mahecha
  - Systems and Computing Engineering Department, Universidad de los Andes, Bogotá, Colombia
- Erick Duarte
  - Systems and Computing Engineering Department, Universidad de los Andes, Bogotá, Colombia
- Maria Del Rosario Leon
  - Systems and Computing Engineering Department, Universidad de los Andes, Bogotá, Colombia
- Fernando Reyes
  - Systems and Computing Engineering Department, Universidad de los Andes, Bogotá, Colombia
- Mario Linares-Vásquez
  - Systems and Computing Engineering Department, Universidad de los Andes, Bogotá, Colombia
- Nicolas Cardozo
  - Systems and Computing Engineering Department, Universidad de los Andes, Bogotá, Colombia
- Jorge Duitama
  - Systems and Computing Engineering Department, Universidad de los Andes, Bogotá, Colombia
13
Deshpande D, Chhugani K, Chang Y, Karlsberg A, Loeffler C, Zhang J, Muszyńska A, Munteanu V, Yang H, Rotman J, Tao L, Balliu B, Tseng E, Eskin E, Zhao F, Mohammadi P, Łabaj PP, Mangul S. RNA-seq data science: From raw data to effective interpretation. Front Genet 2023; 14:997383. [PMID: 36999049 PMCID: PMC10043755 DOI: 10.3389/fgene.2023.997383] [Citation(s) in RCA: 18] [Impact Index Per Article: 18.0] [Received: 07/18/2022] [Accepted: 02/24/2023] [Indexed: 03/14/2023] Open
Abstract
RNA sequencing (RNA-seq) has become an exemplary technology in modern biology and clinical science. Its immense popularity is due in large part to the continuous efforts of the bioinformatics community to develop accurate and scalable computational tools to analyze the enormous amounts of transcriptomic data that it produces. RNA-seq analysis enables genes and their corresponding transcripts to be probed for a variety of purposes, such as detecting novel exons or whole transcripts, assessing expression of genes and alternative transcripts, and studying alternative splicing structure. It can be a challenge, however, to obtain meaningful biological signals from raw RNA-seq data because of the enormous scale of the data as well as the inherent limitations of different sequencing technologies, such as amplification bias or biases of library preparation. The need to overcome these technical challenges has pushed the rapid development of novel computational tools, which have evolved and diversified in accordance with technological advancements, leading to the current myriad of RNA-seq tools. These tools, combined with the diverse computational skill sets of biomedical researchers, help to unlock the full potential of RNA-seq. The purpose of this review is to explain basic concepts in the computational analysis of RNA-seq data and define discipline-specific jargon.
Affiliation(s)
- Dhrithi Deshpande
  - Department of Pharmacology and Pharmaceutical Sciences, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
- Karishma Chhugani
  - Department of Pharmacology and Pharmaceutical Sciences, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
- Yutong Chang
  - Department of Pharmacology and Pharmaceutical Sciences, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
- Aaron Karlsberg
  - Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
- Caitlin Loeffler
  - Department of Computer Science, University of California, Los Angeles, CA, United States
- Jinyang Zhang
  - Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, China
- Agata Muszyńska
  - Małopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
  - Institute of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
- Viorel Munteanu
  - Department of Computers, Informatics and Microelectronics, Technical University of Moldova, Chisinau, Moldova
- Harry Yang
  - Department of Microbiology, Immunology and Molecular Genetics, University of California Los Angeles, Los Angeles, CA, United States
- Jeremy Rotman
  - Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
- Laura Tao
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, CHS, Los Angeles, CA, United States
- Brunilda Balliu
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, CHS, Los Angeles, CA, United States
- Eleazar Eskin
  - Department of Computer Science, University of California, Los Angeles, CA, United States
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, CHS, Los Angeles, CA, United States
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA, United States
- Fangqing Zhao
  - Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, China
  - Key Laboratory of Systems Biology, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, China
- Pejman Mohammadi
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, United States
- Paweł P. Łabaj
  - Małopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
  - Department of Biotechnology, Boku University Vienna, Vienna, Austria
- Serghei Mangul
  - Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
  - Department of Quantitative and Computational Biology, USC Dornsife College of Letters, Arts and Sciences, Los Angeles, CA, United States
  - *Correspondence: Serghei Mangul,
14
Abstract
Experiments involving metagenomics data are becoming increasingly commonplace. Processing such data requires a unique set of considerations. Quality control of metagenomics data is critical to extracting pertinent insights. In this chapter, we outline some considerations in terms of study design and other confounding factors that can often only be realized at the point of data analysis. We also outline some basic principles of quality control in metagenomics, including overall reproducibility and some good practices to follow. The general quality control of sequencing data is then outlined, and we introduce ways to process this data by using bash scripts and developing pipelines in Snakemake (Python). A significant part of quality control in metagenomics is in analyzing the data to ensure you can spot relationships between variables and to identify when they might be confounded. This chapter provides a walkthrough of analyzing some microbiome data (in the R statistical language) and demonstrates a few ways to identify overall differences and similarities in microbiome data. The chapter concludes with remarks about considering taxonomic results in the context of the study and interrogating sequence alignments using the command line.
Affiliation(s)
- Abraham Gihawi
  - Bob Champion Research & Education Building, Norwich Medical School, University of East Anglia, Norwich, UK
- Ryan Cardenas
  - Bob Champion Research & Education Building, Norwich Medical School, University of East Anglia, Norwich, UK
- Rachel Hurst
  - Bob Champion Research & Education Building, Norwich Medical School, University of East Anglia, Norwich, UK
- Daniel S Brewer
  - Bob Champion Research & Education Building, Norwich Medical School, University of East Anglia, Norwich, UK
  - Earlham Institute, Norwich Research Park, Norwich, UK
15
Durant E, Rouard M, Ganko EW, Muller C, Cleary AM, Farmer AD, Conte M, Sabot F. Ten simple rules for developing visualization tools in genomics. PLoS Comput Biol 2022; 18:e1010622. [PMID: 36355753 PMCID: PMC9648702 DOI: 10.1371/journal.pcbi.1010622] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022] Open
Affiliation(s)
- Eloi Durant
  - DIADE, University of Montpellier, CIRAD, IRD, Montpellier, France
  - Syngenta Seeds SAS, Saint-Sauveur, France
  - Bioversity International, Parc Scientifique Agropolis II, Montpellier, France
  - French Institute of Bioinformatics (IFB)—South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, Montpellier, France
- Mathieu Rouard
  - Bioversity International, Parc Scientifique Agropolis II, Montpellier, France
  - French Institute of Bioinformatics (IFB)—South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, Montpellier, France
- Eric W. Ganko
  - Seeds Research, Syngenta Crop Protection, LLC, Research Triangle Park, Durham, North Carolina, United States of America
- Alan M. Cleary
  - National Center for Genome Resources, Santa Fe, New Mexico, United States of America
- Andrew D. Farmer
  - National Center for Genome Resources, Santa Fe, New Mexico, United States of America
- Francois Sabot
  - DIADE, University of Montpellier, CIRAD, IRD, Montpellier, France
  - French Institute of Bioinformatics (IFB)—South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, Montpellier, France
16
Peng K, Moore J, Vahed M, Brito J, Kao G, Burkhardt AM, Alachkar H, Mangul S. pyTCR: A comprehensive and scalable solution for TCR-Seq data analysis to facilitate reproducibility and rigor of immunogenomics research. Front Immunol 2022; 13:954078. [PMID: 36451811 PMCID: PMC9704496 DOI: 10.3389/fimmu.2022.954078] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/26/2022] [Accepted: 10/05/2022] [Indexed: 01/29/2023] Open
Abstract
T cell receptor (TCR) studies have grown substantially with advances in T cell receptor repertoire sequencing (TCR-Seq) techniques. The analysis of TCR-Seq data requires computational skills to run TCR repertoire analysis tools. However, biomedical researchers with limited computational backgrounds face numerous obstacles to properly and efficiently utilizing bioinformatics tools for analyzing TCR-Seq data. Here we report pyTCR, a computational notebook-based solution for comprehensive and scalable TCR-Seq data analysis. Computational notebooks, which combine code, calculations, and visualization, are able to provide users with a high level of flexibility and transparency for the analysis. Additionally, computational notebooks are demonstrated to be user-friendly and suitable for researchers with limited computational skills. Our tool has a rich set of functionalities including various TCR metrics, statistical analysis, and customizable visualizations. The application of pyTCR on large and diverse TCR-Seq datasets will enable the effective analysis of large-scale TCR-Seq data with flexibility, and eventually facilitate new discoveries.
Affiliation(s)
- Kerui Peng
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, United States
- Jaden Moore
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, United States
  - Computer Science Department, Orange Coast College, Costa Mesa, CA, United States
- Mohammad Vahed
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, United States
- Jaqueline Brito
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, United States
- Guoyun Kao
  - Department of Pharmacology and Pharmaceutical Sciences, School of Pharmacy, University of Southern California, Los Angeles, CA, United States
- Amanda M. Burkhardt
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, United States
- Houda Alachkar
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, United States
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, United States
  - *Correspondence: Serghei Mangul,
17
Sarwal V, Niehus S, Ayyala R, Kim M, Sarkar A, Chang S, Lu A, Rajkumar N, Darci-Maher N, Littman R, Chhugani K, Soylev A, Comarova Z, Wesel E, Castellanos J, Chikka R, Distler MG, Eskin E, Flint J, Mangul S. A comprehensive benchmarking of WGS-based deletion structural variant callers. Brief Bioinform 2022; 23:bbac221. [PMID: 35753701 PMCID: PMC9294411 DOI: 10.1093/bib/bbac221] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/08/2021] [Revised: 04/30/2022] [Accepted: 05/11/2022] [Indexed: 01/10/2023] Open
Abstract
Advances in whole-genome sequencing (WGS) promise to enable accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from WGS data presents a substantial number of challenges, and a plethora of SV detection methods have been developed. Currently, evidence that investigators can use to select appropriate SV detection tools is lacking. In this article, we have evaluated the performance of SV detection tools on mouse and human WGS data using a comprehensive polymerase chain reaction-confirmed gold standard set of SVs and the genome-in-a-bottle variant set, respectively. In contrast to previous benchmarking studies, our gold standard dataset included a complete set of SVs, allowing us to report both precision and sensitivity rates of the SV detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, as SV detection methods that fail to detect deletions are likely to miss more complex SVs. We found that SV detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low- and ultralow-pass sequencing data as well as for different deletion length categories.
Affiliation(s)
- Varuni Sarwal
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
  - Indian Institute of Technology Delhi, Hauz Khas, New Delhi, Delhi 110016, India
- Sebastian Niehus
  - Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Str. 2, 10178 Berlin, Germany
  - Charité-Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Charitéplatz 1, 10117 Berlin, Germany
- Ram Ayyala
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Minyoung Kim
  - Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089
- Aditya Sarkar
  - School of Computing and Electrical Engineering, Indian Institute of Technology Mandi, Kamand, Mandi, Himachal Pradesh 175001, India
- Sei Chang
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Angela Lu
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Neha Rajkumar
  - Department of Bioengineering, University of California Los Angeles, Los Angeles, CA, 90095
- Nicholas Darci-Maher
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Russell Littman
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Karishma Chhugani
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA 90089-9121
- Arda Soylev
  - Department of Computer Engineering, Konya Food and Agriculture University, Konya, Turkey
- Zoia Comarova
  - Department of Civil and Environmental Engineering, University of Southern California, Los Angeles, CA, United States
- Emily Wesel
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Jacqueline Castellanos
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Rahul Chikka
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Margaret G Distler
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Eleazar Eskin
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, 695 Charles E. Young Drive South, Box 708822, Los Angeles, CA, 90095, USA
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, 73-235 CHS, Los Angeles, CA, 90095, USA
- Jonathan Flint
  - Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, University of California Los Angeles, 760 Westwood Plaza, Los Angeles, CA 90095, USA
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA 90089-9121
18
Niu YN, Roberts EG, Denisko D, Hoffman MM. Assessing and assuring interoperability of a genomics file format. Bioinformatics 2022; 38:3327-3336. [PMID: 35575355 PMCID: PMC9237710 DOI: 10.1093/bioinformatics/btac327] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2022] [Revised: 03/30/2022] [Accepted: 05/11/2022] [Indexed: 12/01/2022] Open
Abstract
Motivation: Bioinformatics software tools operate largely through the use of specialized genomics file formats. Often these formats lack formal specification, making it difficult or impossible for the creators of these tools to robustly test them for correct handling of input and output. This causes problems in interoperability between different tools that, at best, wastes time and frustrates users. At worst, interoperability issues could lead to undetected errors in scientific results. Results: We developed a new verification system, Acidbio, which tests for correct behavior in bioinformatics software packages. We crafted tests to unify correct behavior when tools encounter various edge cases: potentially unexpected inputs that exemplify the limits of the format. To analyze the performance of existing software, we tested the input validation of 80 Bioconda packages that parsed the Browser Extensible Data (BED) format. We also used a fuzzing approach to automatically perform additional testing. Of 80 software packages examined, 75 achieved less than 70% correctness on our test suite. We categorized multiple root causes for the poor performance of different types of software. Fuzzing detected other errors that the manually designed test suite could not. We also created a badge system that developers can use to indicate more precisely which BED variants their software accepts and to advertise the software’s performance on the test suite. Availability and implementation: Acidbio is available at https://github.com/hoffmangroup/acidbio. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yi Nian Niu
  - Princess Margaret Cancer Centre University Health Network, Toronto, ON, M5G 2C1, Canada
- Eric G Roberts
  - Princess Margaret Cancer Centre University Health Network, Toronto, ON, M5G 2C1, Canada
- Danielle Denisko
  - Princess Margaret Cancer Centre University Health Network, Toronto, ON, M5G 2C1, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, M5G 1L7, Canada
- Michael M Hoffman
  - Princess Margaret Cancer Centre University Health Network, Toronto, ON, M5G 2C1, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, M5G 1L7, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, M5S 2E4, Canada
  - Vector Institute, Toronto, ON, M5G 1M1, Canada
19
Steenwyk JL, Buida TJ III, Gonçalves C, Goltz DC, Morales G, Mead ME, LaBella AL, Chavez CM, Schmitz JE, Hadjifrangiskou M, Li Y, Rokas A. BioKIT: a versatile toolkit for processing and analyzing diverse types of sequence data. Genetics 2022; 221:6583183. [PMID: 35536198 PMCID: PMC9252278 DOI: 10.1093/genetics/iyac079] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/21/2022] [Accepted: 05/03/2022] [Indexed: 11/14/2022] Open
Abstract
Bioinformatic analysis (such as genome assembly quality assessment, alignment summary statistics, relative synonymous codon usage, file format conversion, and processing and analysis) is integrated into diverse disciplines in the biological sciences. Several command-line pieces of software have been developed to conduct some of these individual analyses, but unified toolkits that conduct all these analyses are lacking. To address this gap, we introduce BioKIT, a versatile command-line toolkit that has, upon publication, 42 functions, several of which were community-sourced, that conduct routine and novel processing and analysis of genome assemblies, multiple sequence alignments, coding sequences, sequencing data, and more. To demonstrate the utility of BioKIT, we conducted a comprehensive examination of relative synonymous codon usage across 171 fungal genomes that use alternative genetic codes, showed that the novel metric of gene-wise relative synonymous codon usage can accurately estimate gene-wise codon optimization, evaluated the quality and characteristics of 901 eukaryotic genome assemblies, and calculated alignment summary statistics for 10 phylogenomic data matrices. BioKIT will be helpful in facilitating and streamlining sequence analysis workflows. BioKIT is freely available under the MIT license from GitHub (https://github.com/JLSteenwyk/BioKIT), PyPI (https://pypi.org/project/jlsteenwyk-biokit/), and the Anaconda Cloud (https://anaconda.org/jlsteenwyk/jlsteenwyk-biokit). Documentation, user tutorials, and instructions for requesting new features are available online (https://jlsteenwyk.com/BioKIT).
Affiliation(s)
- Jacob L Steenwyk
  - Department of Biological Sciences, Vanderbilt University, VU Station B #35-1634, Nashville, TN 37235, USA
  - Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
- Carla Gonçalves
  - Department of Biological Sciences, Vanderbilt University, VU Station B #35-1634, Nashville, TN 37235, USA
  - Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
  - Associate Laboratory i4HB-Institute for Health and Bioeconomy, NOVA School of Science and Technology, NOVA University Lisbon, 2819-516 Caparica, Portugal
  - UCIBIO-Applied Molecular Biosciences Unit, Department of Life Sciences, NOVA School of Science and Technology, NOVA University Lisbon, 2819-516 Caparica, Portugal
- Grace Morales
  - Department of Pathology, Microbiology & Immunology, Center for Personalized Microbiology, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Matthew E Mead
  - Department of Biological Sciences, Vanderbilt University, VU Station B #35-1634, Nashville, TN 37235, USA
  - Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
- Abigail L LaBella
  - Department of Biological Sciences, Vanderbilt University, VU Station B #35-1634, Nashville, TN 37235, USA
  - Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
- Christina M Chavez
  - Department of Biological Sciences, Vanderbilt University, VU Station B #35-1634, Nashville, TN 37235, USA
  - Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
- Jonathan E Schmitz
  - Department of Pathology, Microbiology & Immunology, Center for Personalized Microbiology, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Maria Hadjifrangiskou
  - Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
  - Department of Pathology, Microbiology & Immunology, Center for Personalized Microbiology, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Yuanning Li
  - Department of Biological Sciences, Vanderbilt University, VU Station B #35-1634, Nashville, TN 37235, USA
- Antonis Rokas
  - Department of Biological Sciences, Vanderbilt University, VU Station B #35-1634, Nashville, TN 37235, USA
  - Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
20
Gardner PP, Paterson JM, McGimpsey S, Ashari-Ghomi F, Umu SU, Pawlik A, Gavryushkin A, Black MA. Sustained software development, not number of citations or journal choice, is indicative of accurate bioinformatic software. Genome Biol 2022; 23:56. [PMID: 35172880 PMCID: PMC8851831 DOI: 10.1186/s13059-022-02625-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/05/2021] [Accepted: 02/06/2022] [Indexed: 11/29/2022]
Abstract
Background: Computational biology provides software tools for testing and making inferences about biological data. In the face of increasing volumes of data, heuristic methods that trade software speed for accuracy may be employed. We have studied these trade-offs using the results of a large number of independent software benchmarks, and evaluated whether external factors, including speed, author reputation, journal impact, recency and developer efforts, are indicative of accurate software. Results: We find that software speed, author reputation, journal impact, number of citations and age are unreliable predictors of software accuracy. This is unfortunate because these are frequently cited reasons for selecting software tools. However, GitHub-derived statistics and high version numbers show that accurate bioinformatic software tools are generally the product of many improvements over time. We also find an excess of slow and inaccurate bioinformatic software tools, and this is consistent across many sub-disciplines. There are few tools that are middle-of-road in terms of accuracy and speed trade-offs. Conclusions: Our findings indicate that accurate bioinformatic software is primarily the product of long-term commitments to software development. In addition, we hypothesise that bioinformatics software suffers from publication bias. Software that is intermediate in terms of both speed and accuracy may be difficult to publish, possibly due to author, editor and reviewer practises. This leaves an unfortunate hole in the literature, as ideal tools may fall into this gap. High accuracy tools are not always useful if they are slow, while high speed is not useful if the results are also inaccurate. Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-022-02625-x.
Affiliation(s)
- Paul P Gardner
- Department of Biochemistry, University of Otago, Dunedin, New Zealand; Biomolecular Interaction Centre, University of Canterbury, Christchurch, New Zealand
- James M Paterson
- Department of Civil and Natural Resources Engineering, University of Canterbury, Christchurch, New Zealand
- Fatemeh Ashari-Ghomi
- Research Group for Genomic Epidemiology, Technical University of Denmark, Kongens Lyngby, Denmark
- Sinan U Umu
- Department of Research, Cancer Registry of Norway, Oslo, Norway
- Alex Gavryushkin
- Department of Computer Science, University of Otago, Dunedin, New Zealand; School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
- Michael A Black
- Department of Biochemistry, University of Otago, Dunedin, New Zealand
21
Reisle C, Williamson LM, Pleasance E, Davies A, Pellegrini B, Bleile DW, Mungall KL, Chuah E, Jones MR, Ma Y, Lewis E, Beckie I, Pham D, Matiello Pletz R, Muhammadzadeh A, Pierce BM, Li J, Stevenson R, Wong H, Bailey L, Reisle A, Douglas M, Bonakdar M, Nelson JMT, Grisdale CJ, Krzywinski M, Fisic A, Mitchell T, Renouf DJ, Yip S, Laskin J, Marra MA, Jones SJM. A platform for oncogenomic reporting and interpretation. Nat Commun 2022; 13:756. [PMID: 35140225 PMCID: PMC8828759 DOI: 10.1038/s41467-022-28348-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/08/2021] [Accepted: 01/14/2022] [Indexed: 01/01/2023]
Abstract
Manual interpretation of variants remains rate limiting in precision oncology. The increasing scale and complexity of molecular data generated from comprehensive sequencing of cancer samples requires advanced interpretative platforms as precision oncology expands beyond individual patients to entire populations. To address this unmet need, we introduce a Platform for Oncogenomic Reporting and Interpretation (PORI), comprising an analytic framework that facilitates the interpretation and reporting of somatic variants in cancer. PORI integrates reporting and graph knowledge base tools combined with support for manual curation at the reporting stage. PORI represents an open-source platform alternative to commercial reporting solutions suitable for comprehensive genomic data sets in precision oncology. We demonstrate the utility of PORI by matching 9,961 pan-cancer genome atlas tumours to the graph knowledge base, calculating therapeutically informative alterations, and making available reports describing select individual samples. The interpretation of somatic variants in cancer is challenging due to the scale and complexity of sequencing data. Here, the authors present PORI, an open-source framework for interpreting somatic variants in cancer using graph knowledge base tools, automated reporting, and manual curation.
Affiliation(s)
- Caralyn Reisle
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada; Bioinformatics Graduate Program, Faculty of Science, University of British Columbia, Vancouver, BC, Canada
- Erin Pleasance
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Anna Davies
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Dustin W Bleile
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Karen L Mungall
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Eric Chuah
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Martin R Jones
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Yussanne Ma
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Eleanor Lewis
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Isaac Beckie
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- David Pham
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Brandon M Pierce
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Jacky Li
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Ross Stevenson
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Hansen Wong
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Lance Bailey
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Abbey Reisle
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Matthew Douglas
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Melika Bonakdar
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Martin Krzywinski
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- Ana Fisic
- Department of Medical Oncology, BC Cancer, Vancouver, BC, Canada
- Teresa Mitchell
- Department of Medical Oncology, BC Cancer, Vancouver, BC, Canada
- Daniel J Renouf
- Pancreas Centre BC, Vancouver, BC, Canada; Department of Pathology and Laboratory Medicine, Faculty of Medicine, University of British Columbia, Vancouver, BC, Canada
- Stephen Yip
- Department of Pathology and Laboratory Medicine, Faculty of Medicine, University of British Columbia, Vancouver, BC, Canada
- Janessa Laskin
- Department of Medical Oncology, BC Cancer, Vancouver, BC, Canada
- Marco A Marra
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada; Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
- Steven J M Jones
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada; Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada; Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC, Canada
22
Helmy M, Agrawal R, Ali J, Soudy M, Bui TT, Selvarajoo K. GeneCloudOmics: A Data Analytic Cloud Platform for High-Throughput Gene Expression Analysis. Front Bioinform 2021; 1:693836. [PMID: 36303746 PMCID: PMC9581002 DOI: 10.3389/fbinf.2021.693836] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/07/2021] [Accepted: 10/14/2021] [Indexed: 11/18/2022]
Abstract
Gene expression profiling techniques, such as DNA microarray and RNA-Sequencing, have provided significant impact on our understanding of biological systems. They contribute to almost all aspects of biomedical research, including studying developmental biology, host-parasite relationships, disease progression and drug effects. However, the high-throughput data generations present challenges for many wet experimentalists to analyze and take full advantage of such rich and complex data. Here we present GeneCloudOmics, an easy-to-use web server for high-throughput gene expression analysis that extends the functionality of our previous ABioTrans with several new tools, including protein datasets analysis, and a web interface. GeneCloudOmics allows both microarray and RNA-Seq data analysis with a comprehensive range of data analytics tools in one package that no other current standalone software or web-based tool can do. In total, GeneCloudOmics provides the user access to 23 different data analytical and bioinformatics tasks including reads normalization, scatter plots, linear/non-linear correlations, PCA, clustering (hierarchical, k-means, t-SNE, SOM), differential expression analyses, pathway enrichments, evolutionary analyses, pathological analyses, and protein-protein interaction (PPI) identifications. Furthermore, GeneCloudOmics allows the direct import of gene expression data from the NCBI Gene Expression Omnibus database. The user can perform all tasks rapidly through an intuitive graphical user interface that overcomes the hassle of coding, installing tools/packages/libraries and dealing with operating systems compatibility and version issues, complications that make data analysis tasks challenging for biologists. Thus, GeneCloudOmics is a one-stop open-source tool for gene expression data analysis and visualization. It is freely available at http://combio-sifbi.org/GeneCloudOmics.
Affiliation(s)
- Mohamed Helmy
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Department of Computer Science, Lakehead University, Thunder Bay, ON, Canada
- Rahul Agrawal
- Department of Geology and Geophysics, Indian Institute of Technology (IIT) Kharagpur, Kharagpur, India
- Javed Ali
- Department of Geology and Geophysics, Indian Institute of Technology (IIT) Kharagpur, Kharagpur, India
- Mohamed Soudy
- Proteomics and Metabolomics Unit, Children Cancer Hospital (CCHE-57357), Cairo, Egypt
- Thuy Tien Bui
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Kumar Selvarajoo
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Singapore Institute of Food and Biotechnology Innovation (SIFBI), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Synthetic Biology for Clinical and Technological Innovation (SynCTI), National University of Singapore (NUS), Singapore, Singapore
- *Correspondence: Kumar Selvarajoo
23
Wratten L, Wilm A, Göke J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat Methods 2021; 18:1161-1168. [PMID: 34556866 DOI: 10.1038/s41592-021-01254-9] [Citation(s) in RCA: 53] [Impact Index Per Article: 17.7] [Received: 10/20/2020] [Accepted: 07/29/2021] [Indexed: 02/08/2023]
Abstract
The rapid growth of high-throughput technologies has transformed biomedical research. With the increasing amount and complexity of data, scalability and reproducibility have become essential not just for experiments, but also for computational analysis. However, transforming data into information involves running a large number of tools, optimizing parameters, and integrating dynamically changing reference data. Workflow managers were developed in response to such challenges. They simplify pipeline development, optimize resource usage, handle software installation and versions, and run on different compute platforms, enabling workflow portability and sharing. In this Perspective, we highlight key features of workflow managers, compare commonly used approaches for bioinformatics workflows, and provide a guide for computational and noncomputational users. We outline community-curated pipeline initiatives that enable novice and experienced users to perform complex, best-practice analyses without having to manually assemble workflows. In sum, we illustrate how workflow managers contribute to making computational analysis in biomedical research shareable, scalable, and reproducible.
Affiliation(s)
- Jonathan Göke
- Genome Institute of Singapore, Singapore, Singapore
24
Steenwyk JL, Rokas A. orthofisher: a broadly applicable tool for automated gene identification and retrieval. G3 (Bethesda) 2021; 11:6321954. [PMID: 34544141 PMCID: PMC8496211 DOI: 10.1093/g3journal/jkab250] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/07/2021] [Accepted: 07/06/2021] [Indexed: 11/15/2022]
Abstract
Identification and retrieval of genes of interest from genomic data is an essential step for many bioinformatic applications. We present orthofisher, a command-line tool for automated identification and retrieval of genes with high sequence similarity to a query profile Hidden Markov Model sequence alignment across a set of proteomes. Performance assessment of orthofisher revealed high accuracy and precision during single-copy orthologous gene identification. orthofisher may be useful for assessing gene annotation quality, identifying single-copy orthologous genes for phylogenomic analyses, estimating gene copy number, and other evolutionary analyses that rely on identification and retrieval of homologous genes from genomic data. orthofisher comes with comprehensive documentation (https://jlsteenwyk.com/orthofisher/), is freely available under the MIT license, and can be downloaded from GitHub (https://github.com/JLSteenwyk/orthofisher), PyPI (https://pypi.org/project/orthofisher/), and the Anaconda Cloud (https://anaconda.org/jlsteenwyk/orthofisher).
Affiliation(s)
- Jacob L Steenwyk
- Department of Biological Sciences, Vanderbilt University, Nashville, TN 37235, USA
- Antonis Rokas
- Department of Biological Sciences, Vanderbilt University, Nashville, TN 37235, USA
25
Bathke J, Lühken G. OVarFlow: a resource optimized GATK 4 based Open source Variant calling workFlow. BMC Bioinformatics 2021; 22:402. [PMID: 34388963 PMCID: PMC8361789 DOI: 10.1186/s12859-021-04317-y] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 06/10/2021] [Accepted: 08/04/2021] [Indexed: 12/30/2022]
Abstract
Background: The advent of next generation sequencing has opened new avenues for basic and applied research. One application is the discovery of sequence variants causative of a phenotypic trait or a disease pathology. The computational task of detecting and annotating sequence differences between a target dataset and a reference genome is known as "variant calling". Typically, this task is computationally involved, often combining a complex chain of linked software tools. A major player in this field is the Genome Analysis Toolkit (GATK). The "GATK Best Practices" is a commonly referenced recipe for variant calling. However, current computational recommendations on variant calling predominantly focus on human sequencing data and ignore ever-changing demands of high-throughput sequencing developments. Furthermore, frequent updates to such recommendations run counter to the goal of offering a standard workflow and hamper reproducibility over time. Results: A workflow for automated detection of single nucleotide polymorphisms and insertion-deletions offers a wide range of applications in sequence annotation of model and non-model organisms. The introduced workflow builds on the GATK Best Practices, while enabling reproducibility over time and offering an open, generalized computational architecture. The workflow achieves parallelized data evaluation and maximizes performance of individual computational tasks. Optimized Java garbage collection and heap size settings for the GATK applications SortSam, MarkDuplicates, HaplotypeCaller, and GatherVcfs effectively cut the overall analysis time in half. Conclusions: The demand for variant calling, efficient computational processing, and standardized workflows is growing. The Open source Variant calling workFlow (OVarFlow) offers automation and reproducibility for a computationally optimized variant calling task. By reducing usage of computational resources, the workflow removes prior existing entry barriers to the variant calling field and enables standardized variant calling.
Affiliation(s)
- Jochen Bathke
- Institute of Animal Breeding and Genetics, Justus Liebig University Gießen, Ludwigstraße 21, 35390, Gießen, Germany
- Gesine Lühken
- Institute of Animal Breeding and Genetics, Justus Liebig University Gießen, Ludwigstraße 21, 35390, Gießen, Germany
26
Hauschild AC, Eick L, Wienbeck J, Heider D. Fostering reproducibility, reusability, and technology transfer in health informatics. iScience 2021; 24:102803. [PMID: 34296072 PMCID: PMC8282945 DOI: 10.1016/j.isci.2021.102803] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/31/2023]
Abstract
Computational methods can transform healthcare. In particular, health informatics with artificial intelligence has shown tremendous potential when applied in various fields of medical research and has opened a new era for precision medicine. The development of reusable biomedical software for research or clinical practice is time-consuming and requires rigorous compliance with quality requirements as defined by international standards. However, research projects rarely implement such measures, hindering smooth technology transfer into the research community or manufacturers as well as reproducibility and reusability. Here, we present a guideline for quality management systems (QMS) for academic organizations incorporating the essential components while confining the requirements to an easily manageable effort. It provides a starting point to implement a QMS tailored to specific needs effortlessly and greatly facilitates technology transfer in a controlled manner, thereby supporting reproducibility and reusability. Ultimately, the emerging standardized workflows can pave the way for an accelerated deployment in clinical practice.
Affiliation(s)
- Anne-Christin Hauschild
- Department of Data Science in Biomedicine, Faculty of Mathematics & Computer Science, Philipps University of Marburg, Hans-Meerwein-Strasse 6, Marburg, 35032, Germany
- Lisa Eick
- Department of Data Science in Biomedicine, Faculty of Mathematics & Computer Science, Philipps University of Marburg, Hans-Meerwein-Strasse 6, Marburg, 35032, Germany
- Joachim Wienbeck
- Department of Data Science in Biomedicine, Faculty of Mathematics & Computer Science, Philipps University of Marburg, Hans-Meerwein-Strasse 6, Marburg, 35032, Germany
- Dominik Heider
- Department of Data Science in Biomedicine, Faculty of Mathematics & Computer Science, Philipps University of Marburg, Hans-Meerwein-Strasse 6, Marburg, 35032, Germany
27
Auer S, Haeltermann NA, Weissberger TL, Erlich JC, Susilaradeya D, Julkowska M, Gazda MA, Schwessinger B, Jadavji NM. A community-led initiative for training in reproducible research. eLife 2021; 10:64719. [PMID: 34151774 PMCID: PMC8282331 DOI: 10.7554/elife.64719] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/09/2020] [Accepted: 06/18/2021] [Indexed: 12/15/2022]
Abstract
Open and reproducible research practices increase the reusability and impact of scientific research. The reproducibility of research results is influenced by many factors, most of which can be addressed by improved education and training. Here we describe how workshops developed by the Reproducibility for Everyone (R4E) initiative can be customized to provide researchers at all career stages and across most disciplines with education and training in reproducible research practices. The R4E initiative, which is led by volunteers, has reached more than 3000 researchers worldwide to date, and all workshop materials, including accompanying resources, are available under a CC-BY 4.0 license at https://www.repro4everyone.org/.
Affiliation(s)
- Susann Auer
- Department of Plant Physiology, Institute of Botany, Faculty of Biology, Technische Universität Dresden, Dresden, Germany
- Nele A Haeltermann
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, United States
- Tracey L Weissberger
- QUEST Center, Berlin Institute of Health, Charité Universitätsmedizin Berlin, Berlin, Germany
- Jeffrey C Erlich
- Shanghai Key Laboratory of Brain Functional Genomics, East China Normal University, Shanghai, China
- Damar Susilaradeya
- Medical Technology Cluster, Indonesian Medical Education and Research Institute, Faculty of Medicine, Universitas Indonesia, Jakarta, Indonesia
- Małgorzata Anna Gazda
- CIBO/InBIOO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Campus Agrário de Vairão, Porto, Portugal; Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Porto, Portugal
- Nafisa M Jadavji
- Department of Biomedical Science, Midwestern University, Glendale, United States; Department of Neuroscience, Carleton University, Ottawa, Canada
- Reproducibility for Everyone, New York, United States
28
Kurgan G, Turk R, Li H, Roberts N, Rettig GR, Jacobi AM, Tso L, Sturgeon M, Mertens M, Noten R, Florus K, Behlke MA, Wang Y, McNeill MS. CRISPAltRations: a validated cloud-based approach for interrogation of double-strand break repair mediated by CRISPR genome editing. Mol Ther Methods Clin Dev 2021; 21:478-491. [PMID: 33981780 PMCID: PMC8082044 DOI: 10.1016/j.omtm.2021.03.024] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/20/2020] [Accepted: 03/29/2021] [Indexed: 12/26/2022]
Abstract
CRISPR systems enable targeted genome editing in a wide variety of organisms by introducing single- or double-strand DNA breaks, which are repaired using endogenous molecular pathways. Characterization of on- and off-target editing events from CRISPR proteins can be evaluated using targeted genome resequencing. We characterized DNA repair fingerprints that result from non-homologous end joining (NHEJ) after double-stranded breaks (DSBs) were introduced by Cas9 or Cas12a for >500 paired treatment/control experiments. We found that building biological understanding of the repair into a novel analysis tool (CRISPAltRations) improved the quality of the results. We validated our software using simulated, targeted amplicon sequencing data (11 guide RNAs [gRNAs] and 603 on- and off-target locations) and demonstrated that CRISPAltRations outperforms other publicly available software tools in accurately annotating CRISPR-associated indels and homology-directed repair (HDR) events. We enable non-bioinformaticians to use CRISPAltRations by developing a web-accessible, cloud-hosted deployment, which allows rapid batch processing of samples in a graphical user interface (GUI) and complies with HIPAA security standards. By ensuring that our software is thoroughly tested, version controlled, and supported with a user interface (UI), we enable resequencing analysis of CRISPR genome editing experiments to researchers no matter their skill in bioinformatics.
Affiliation(s)
- Gavin Kurgan
- Integrated DNA Technologies, Coralville, IA 52241, USA
- Rolf Turk
- Integrated DNA Technologies, Coralville, IA 52241, USA
- Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215
- Lauren Tso
- Integrated DNA Technologies, Coralville, IA 52241, USA
- Yu Wang
- Integrated DNA Technologies, Coralville, IA 52241, USA
29
Queirós P, Delogu F, Hickl O, May P, Wilmes P. Mantis: flexible and consensus-driven genome annotation. Gigascience 2021; 10:6291114. [PMID: 34076241 PMCID: PMC8170692 DOI: 10.1093/gigascience/giab042] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 11/04/2020] [Revised: 03/22/2021] [Accepted: 05/14/2021] [Indexed: 12/22/2022]
Abstract
Background: The rapid development of the (meta-)omics fields has produced an unprecedented amount of high-resolution and high-fidelity data. Through the use of these datasets we can infer the role of previously functionally unannotated proteins from single organisms and consortia. In this context, protein function annotation can be described as the identification of regions of interest (i.e., domains) in protein sequences and the assignment of biological functions. Despite the existence of numerous tools, challenges remain in terms of speed, flexibility, and reproducibility. In the big data era, it is also increasingly important to cease limiting our findings to a single reference, coalescing knowledge from different data sources, and thus overcoming some limitations in overly relying on computationally generated data from single sources. Results: We implemented a protein annotation tool, Mantis, which uses database identifiers intersection and text mining to integrate knowledge from multiple reference data sources into a single consensus-driven output. Mantis is flexible, allowing for the customization of reference data and execution parameters, and is reproducible across different research goals and user environments. We implemented a depth-first search algorithm for domain-specific annotation, which significantly improved annotation performance compared to sequence-wide annotation. The parallelized implementation of Mantis results in short runtimes while also outputting high coverage and high-quality protein function annotations. Conclusions: Mantis is a protein function annotation tool that produces high-quality consensus-driven protein annotations. It is easy to set up, customize, and use, scaling from single genomes to large metagenomes. Mantis is available under the MIT license at https://github.com/PedroMTQ/mantis.
Affiliation(s)
- Pedro Queirós
- Systems Ecology, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 Avenue du Swing, 4367 Esch-sur-Alzette, Luxembourg
- Francesco Delogu
- Systems Ecology, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 Avenue du Swing, 4367 Esch-sur-Alzette, Luxembourg
- Oskar Hickl
- Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 Avenue du Swing, 4367 Esch-sur-Alzette, Luxembourg
- Patrick May
- Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 Avenue du Swing, 4367 Esch-sur-Alzette, Luxembourg
- Paul Wilmes
- Systems Ecology, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 Avenue du Swing, 4367 Esch-sur-Alzette, Luxembourg
30
Misra BB. Advances in high resolution GC-MS technology: a focus on the application of GC-Orbitrap-MS in metabolomics and exposomics for FAIR practices. Anal Methods 2021; 13:2265-2282. [PMID: 33987631 DOI: 10.1039/d1ay00173f] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Indexed: 06/12/2023]
Abstract
Gas chromatography-mass spectrometry (GC-MS) provides a complementary analytical platform for capturing volatiles, non-polar and (derivatized) polar metabolites and exposures from a diverse array of matrixes. High resolution (HR) GC-MS as a data generation platform can capture data on analytes that are usually not detectable/quantifiable in liquid chromatography mass-spectrometry-based solutions. With the rise of high-resolution accurate mass (HRAM) GC-MS systems such as GC-Orbitrap-MS in the last decade after the time-of-flight (ToF) renaissance, numerous applications have been found in the fields of metabolomics and exposomics. In a short span of time, a multitude of studies have used GC-Orbitrap-MS to generate exciting new high throughput data spanning from diverse basic to applied research areas. The GC-Orbitrap-MS has found application in both targeted and untargeted efforts for capturing metabolomes and exposomes across diverse studies. In this review, I capture and summarize all the reported studies to date, and provide a snapshot of the milieu of commercial and open-source software solutions, spectral libraries, and informatics solutions available to a GC-Orbitrap-MS system instrument user or a data analyst dealing with these datasets. Lastly, but importantly, I provide an account on data sharing and meta-data capturing solutions that are available to make HRAM GC-MS based metabolomics and exposomics studies findable, accessible, interoperable, and reproducible (FAIR). These FAIR practices would allow data generators and users of GC-HRMS instruments to help the community of GC-MS researchers to collaborate and co-develop exciting tools and algorithms in the future.
Affiliation(s)
- Biswapriya B Misra
- Independent Researcher, Pine-211, Raintree Park Dwaraka Krishna, Namburu, AP-522508, India.
|
31
|
Chang HY, Colby SM, Du X, Gomez JD, Helf MJ, Kechris K, Kirkpatrick CR, Li S, Patti GJ, Renslow RS, Subramaniam S, Verma M, Xia J, Young JD. A Practical Guide to Metabolomics Software Development. Anal Chem 2021; 93:1912-1923. [PMID: 33467846 PMCID: PMC7859930 DOI: 10.1021/acs.analchem.0c03581] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Indexed: 12/22/2022]
Abstract
A growing number of software tools have been developed for metabolomics data processing and analysis. Many new tools are contributed by metabolomics practitioners who have limited prior experience with software development, and the tools are subsequently implemented by users with expertise that ranges from basic point-and-click data analysis to advanced coding. This Perspective is intended to introduce metabolomics software users and developers to important considerations that determine the overall impact of a publicly available tool within the scientific community. The recommendations reflect the collective experience of an NIH-sponsored Metabolomics Consortium working group that was formed with the goal of researching guidelines and best practices for metabolomics tool development. The recommendations are aimed at metabolomics researchers with little formal background in programming and are organized into three stages: (i) preparation, (ii) tool development, and (iii) distribution and maintenance.
Affiliation(s)
- Hui-Yin Chang
- Department of Pathology, University of Michigan, 1301 Catherine Street, Ann Arbor, Michigan 48109, United States; Department of Biomedical Sciences and Engineering, National Central University, No. 300, Zhongda Road, Zhongli District, Taoyuan City 320, Taiwan
- Sean M Colby
- Biological Sciences Division, Pacific Northwest National Laboratory, P.O. Box 999, MSIN: K8-98, Richland, Washington 99352, United States
- Xiuxia Du
- Department of Bioinformatics & Genomics, University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, North Carolina 28223, United States
- Javier D Gomez
- Department of Chemical and Biomolecular Engineering, Vanderbilt University, PMB 351604, 2301 Vanderbilt Place, Nashville, Tennessee 37235, United States
- Maximilian J Helf
- Boyce Thompson Institute and Department of Chemistry and Chemical Biology, Cornell University, 533 Tower Road, Ithaca, New York 14853, United States
- Katerina Kechris
- Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, 13001 East 17th Place B119, Aurora, Colorado 80045, United States
- Christine R Kirkpatrick
- San Diego Supercomputer Center, University of California San Diego, MC 0505, 9500 Gilman Drive, La Jolla, California 92093, United States
- Shuzhao Li
- The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, Connecticut 06032, United States
- Gary J Patti
- Department of Chemistry, Department of Medicine, and Siteman Cancer Center, Washington University in St. Louis, CB 1134, One Brookings Drive, St. Louis, Missouri 63130, United States
- Ryan S Renslow
- Biological Sciences Division, Pacific Northwest National Laboratory, P.O. Box 999, MSIN: K8-98, Richland, Washington 99352, United States; Gene and Linda Voiland School of Chemical Engineering and Bioengineering, Washington State University, P.O. Box 646515, Pullman, Washington 99164, United States
- Shankar Subramaniam
- San Diego Supercomputer Center, University of California San Diego, MC 0505, 9500 Gilman Drive, La Jolla, California 92093, United States; Department of Bioengineering, Department of Computer Science and Engineering, Department of Cellular and Molecular Medicine, and Department of Chemistry and Biochemistry, University of California San Diego, 9500 Gilman Drive #0412, La Jolla, California 92093, United States
- Mukesh Verma
- Epidemiology and Genomics Research Program, National Cancer Institute, National Institutes of Health, Suite 4E102, 9609 Medical Center Drive, MSC 9763, Rockville, Maryland 20850, United States
- Jianguo Xia
- Faculty of Agricultural and Environmental Sciences, McGill University, 21111 Lakeshore Road, Ste. Anne de Bellevue, Quebec H9X 3V9, Canada
- Jamey D Young
- Department of Chemical and Biomolecular Engineering, Vanderbilt University, PMB 351604, 2301 Vanderbilt Place, Nashville, Tennessee 37235, United States; Department of Molecular Physiology and Biophysics, Vanderbilt University, PMB 351604, 2301 Vanderbilt Place, Nashville, Tennessee 37235, United States
|
32
|
König M. Executable Simulation Model of the Liver. SYSTEMS MEDICINE 2021. [DOI: 10.1016/b978-0-12-801238-3.11682-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022] Open
|
33
|
Steenwyk JL, Buida TJ, Li Y, Shen XX, Rokas A. ClipKIT: A multiple sequence alignment trimming software for accurate phylogenomic inference. PLoS Biol 2020; 18:e3001007. [PMID: 33264284 PMCID: PMC7735675 DOI: 10.1371/journal.pbio.3001007] [Citation(s) in RCA: 190] [Impact Index Per Article: 47.5] [Received: 06/24/2020] [Revised: 12/14/2020] [Accepted: 11/10/2020] [Indexed: 12/22/2022] Open
Abstract
Highly divergent sites in multiple sequence alignments (MSAs), which can stem from erroneous inference of homology and saturation of substitutions, are thought to negatively impact phylogenetic inference. Thus, several different trimming strategies have been developed for identifying and removing these sites prior to phylogenetic inference. However, a recent study reported that doing so can worsen inference, underscoring the need for alternative alignment trimming strategies. Here, we introduce ClipKIT, an alignment trimming software that, rather than identifying and removing putatively phylogenetically uninformative sites, instead aims to identify and retain parsimony-informative sites, which are known to be phylogenetically informative. To test the efficacy of ClipKIT, we examined the accuracy and support of phylogenies inferred from 14 different alignment trimming strategies, including those implemented in ClipKIT, across nearly 140,000 alignments from a broad sampling of evolutionary histories. Phylogenies inferred from ClipKIT-trimmed alignments are accurate, robust, and time saving. Furthermore, ClipKIT consistently outperformed other trimming methods across diverse datasets, suggesting that strategies based on identifying and retaining parsimony-informative sites provide a robust framework for alignment trimming.
Affiliation(s)
- Jacob L. Steenwyk
- Vanderbilt University, Department of Biological Sciences, Nashville, Tennessee, United States of America
- Yuanning Li
- Vanderbilt University, Department of Biological Sciences, Nashville, Tennessee, United States of America
- Xing-Xing Shen
- Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Institute of Insect Sciences, Zhejiang University, Hangzhou, China
- Antonis Rokas
- Vanderbilt University, Department of Biological Sciences, Nashville, Tennessee, United States of America
|
34
|
Mitchell K, Ronas J, Dao C, Freise AC, Mangul S, Shapiro C, Moberg Parker J. PUMAA: A Platform for Accessible Microbiome Analysis in the Undergraduate Classroom. Front Microbiol 2020; 11:584699. [PMID: 33123113 PMCID: PMC7573227 DOI: 10.3389/fmicb.2020.584699] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/17/2020] [Accepted: 09/14/2020] [Indexed: 12/22/2022] Open
Abstract
Improvements in high-throughput sequencing make targeted amplicon analysis an ideal method for the study of human and environmental microbiomes by undergraduates. Multiple bioinformatics programs are available to process and interpret raw microbial diversity datasets, and the choice of programs to use in curricula is largely determined by student learning goals. Many of the most commonly used microbiome bioinformatics platforms offer end-to-end data processing and data analysis using a command line interface (CLI), but the downside for novice microbiome researchers is the steep learning curve often required. Alternatively, some sequencing providers include processing of raw data and taxonomy assignments as part of their pipelines. This, when coupled with available web-based or graphical user interface (GUI) analysis and visualization tools, eliminates the need for students or instructors to have extensive CLI experience. However, lack of universal data formats can make integration of these tools challenging. For example, tools for upstream and downstream analyses frequently use multiple different data formats, which then require writing custom scripts or hours of manual work to make the files compatible. Here, we describe a microbial ecology bioinformatics curriculum that focuses on data analysis, visualization, and statistical reasoning by taking advantage of existing web-based and GUI tools. We created the Program for Unifying Microbiome Analysis Applications (PUMAA), which solves the problem of inconsistent files by formatting the output files from several raw data processing programs to seamlessly transition to a suite of GUI programs for analysis and visualization of microbiome taxonomic and inferred functional profiles. Additionally, we created a series of tutorials to accompany each of the microbiome analysis curricular modules.
From pre- and post-course surveys, students in this curriculum self-reported conceptual and confidence gains in bioinformatics and data analysis skills. Students also demonstrated gains in biologically relevant statistical reasoning based on rubric-guided evaluations of open-ended survey questions and the Statistical Reasoning in Biology Concept Inventory. The PUMAA program and associated analysis tutorials enable students and researchers with no computational experience to effectively analyze real microbiome datasets to investigate real-world research questions.
Affiliation(s)
- Keith Mitchell
- Department of Microbiology, Immunology and Molecular Genetics, University of California, Los Angeles, Los Angeles, CA, United States
- Jiem Ronas
- Department of Microbiology, Immunology and Molecular Genetics, University of California, Los Angeles, Los Angeles, CA, United States
- Christopher Dao
- Department of Microbiology, Immunology and Molecular Genetics, University of California, Los Angeles, Los Angeles, CA, United States
- Amanda C Freise
- Department of Microbiology, Immunology and Molecular Genetics, University of California, Los Angeles, Los Angeles, CA, United States
- Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, United States
- Casey Shapiro
- Center for Educational Assessment, Center for the Advancement of Teaching, University of California, Los Angeles, Los Angeles, CA, United States
- Jordan Moberg Parker
- Department of Microbiology, Immunology and Molecular Genetics, University of California, Los Angeles, Los Angeles, CA, United States
|
35
|
Grabia S, Smyczynska U, Pagacz K, Fendler W. NormiRazor: tool applying GPU-accelerated computing for determination of internal references in microRNA transcription studies. BMC Bioinformatics 2020; 21:425. [PMID: 32993488 PMCID: PMC7523363 DOI: 10.1186/s12859-020-03743-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 04/07/2020] [Accepted: 09/07/2020] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Multi-gene expression assays are an attractive tool in revealing complex regulatory mechanisms in living organisms. Normalization is an indispensable step of data analysis in all those studies, since it removes unwanted, non-biological variability from data. In targeted qPCR assays it is typically performed with respect to prespecified reference genes, but the lack of a robust strategy for their selection is reported in the literature, especially in studies concerning circulating microRNAs (miRNA). Unfortunately, this problem impedes translation of scientific discoveries on miRNA biomarkers into widely available laboratory assays. Previous studies concluded that averaged expressions of multi-miRNA combinations are more stable references than single genes. However, due to the number of such combinations the computational load is considerable and may hinder objective reference selection in large datasets. Existing implementations of normalization algorithms (geNorm, NormFinder and BestKeeper) have poor performance and may require days to compute stability values for all potential references, as the evaluation is performed sequentially. RESULTS We designed NormiRazor, an integrative tool which implements those methods in a parallel manner on a graphics processing unit (GPU) using the CUDA platform. We tested our approach on publicly available miRNA expression datasets. As a result, the execution times on 8 datasets containing from 50 to 400 miRNAs (subsets of GSE68314) decreased 18.7 ± 0.6 (mean ± SD), 104.7 ± 4.2 and 76.5 ± 2.2 times for geNorm, BestKeeper and NormFinder, respectively, with respect to the previous Python implementation. To allow easy access to the normalization pipeline for biomedical researchers, we implemented NormiRazor as an online platform where a user can normalize their datasets based on the automatically selected references. It is available at norm.btm.umed.pl, together with an instruction manual and example datasets.
CONCLUSIONS NormiRazor allows for an easy, informed choice of reference genes for qPCR transcriptomic studies. As such it can improve comparability and repeatability of experiments and in longer perspective help translate newly discovered biomarkers into readily available assays.
Affiliation(s)
- Szymon Grabia
- Department of Biostatistics and Translational Medicine, Medical University of Lodz, 15 Mazowiecka St., Lodz, 92-215 Poland
- Institute of Applied Computer Science, Lodz University of Technology, 18/22 Stefanowskiego St., Lodz, 90-537 Poland
- Urszula Smyczynska
- Department of Biostatistics and Translational Medicine, Medical University of Lodz, 15 Mazowiecka St., Lodz, 92-215 Poland
- Konrad Pagacz
- Department of Biostatistics and Translational Medicine, Medical University of Lodz, 15 Mazowiecka St., Lodz, 92-215 Poland
- Postgraduate School of Molecular Medicine, Medical University of Warsaw, 61 Zwirki i Wigury St., Warsaw, 02-091 Poland
- Wojciech Fendler
- Department of Biostatistics and Translational Medicine, Medical University of Lodz, 15 Mazowiecka St., Lodz, 92-215 Poland
- Dana-Farber Cancer Institute, Harvard Medical School, Boston, 450 Brookline Av., Boston, MA 02215 USA
|
36
|
Eliseev A, Gibson KM, Avdeyev P, Novik D, Bendall ML, Pérez-Losada M, Alexeev N, Crandall KA. Evaluation of haplotype callers for next-generation sequencing of viruses. INFECTION, GENETICS AND EVOLUTION : JOURNAL OF MOLECULAR EPIDEMIOLOGY AND EVOLUTIONARY GENETICS IN INFECTIOUS DISEASES 2020; 82:104277. [PMID: 32151775 PMCID: PMC7293574 DOI: 10.1016/j.meegid.2020.104277] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 11/04/2019] [Revised: 03/04/2020] [Accepted: 03/06/2020] [Indexed: 01/30/2023]
Abstract
Currently, the standard practice for assembling next-generation sequencing (NGS) reads of viral genomes is to summarize thousands of individual short reads into a single consensus sequence, thus confounding useful intra-host diversity information for molecular phylodynamic inference. It is hypothesized that a few viral strains may dominate the intra-host genetic diversity with a variety of lower frequency strains comprising the rest of the population. Several software tools currently exist to convert NGS sequence variants into haplotypes. Previous benchmarks of viral haplotype reconstruction programs used simulation scenarios that are useful from a mathematical perspective but do not reflect viral evolution and epidemiology. Here, we tested twelve NGS haplotype reconstruction methods using viral populations simulated under realistic evolutionary dynamics. We simulated coalescent-based populations that spanned known levels of viral genetic diversity, including mutation rates, sample size and effective population size, to test the limits of the haplotype reconstruction methods and to ensure coverage of predicted intra-host viral diversity levels (especially HIV-1). All twelve investigated haplotype callers showed variable performance and produced drastically different results that were mainly driven by differences in mutation rate and, to a lesser extent, in effective population size. Most methods were able to accurately reconstruct haplotypes when genetic diversity was low. However, under higher levels of diversity (e.g., those seen in intra-host HIV-1 infections), haplotype reconstruction quality was highly variable and, on average, poor. All haplotype reconstruction tools, except QuasiRecomb and ShoRAH, greatly underestimated intra-host diversity and the true number of haplotypes.
PredictHaplo outperformed, in regard to highest precision, recall, and lowest UniFrac distance values, the other haplotype reconstruction tools followed by CliqueSNV, which, given more computational time, may have outperformed PredictHaplo. Here, we present an extensive comparison of available viral haplotype reconstruction tools and provide insights for future improvements in haplotype reconstruction tools using both short-read and long-read technologies.
Affiliation(s)
- Anton Eliseev
- Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Keylie M Gibson
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
- Pavel Avdeyev
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Mathematics, George Washington University, Washington, DC, USA
- Dmitry Novik
- Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Matthew L Bendall
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
- Marcos Pérez-Losada
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Vairão, Portugal
- Nikita Alexeev
- Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia
- Keith A Crandall
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA
|
37
|
Hanna RE, Doench JG. Design and analysis of CRISPR-Cas experiments. Nat Biotechnol 2020; 38:813-823. [PMID: 32284587 DOI: 10.1038/s41587-020-0490-7] [Citation(s) in RCA: 96] [Impact Index Per Article: 24.0] [Received: 11/02/2018] [Accepted: 03/06/2020] [Indexed: 02/08/2023]
Abstract
A large and ever-expanding set of CRISPR-Cas systems now enables the rapid and flexible manipulation of genomes in both targeted and large-scale experiments. Numerous software tools and analytical methods have been developed for the design and analysis of CRISPR-Cas experiments, including resources to design optimal guide RNAs for various modes of manipulation and to analyze the results of such experiments. A major recent focus has been the development of comprehensive tools for use on data from large-scale CRISPR-based genetic screens. As this field continues to progress, a clear ongoing challenge is not only to innovate, but to actively maintain and improve existing tools so that researchers across disciplines can rely on a stable set of excellent computational resources for CRISPR-Cas experiments.
Affiliation(s)
- Ruth E Hanna
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
- John G Doench
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
|
38
|
Brito JJ, Li J, Moore JH, Greene CS, Nogoy NA, Garmire LX, Mangul S. Recommendations to enhance rigor and reproducibility in biomedical research. Gigascience 2020; 9:giaa056. [PMID: 32479592 PMCID: PMC7263079 DOI: 10.1093/gigascience/giaa056] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 02/24/2020] [Revised: 04/08/2020] [Accepted: 05/06/2020] [Indexed: 12/25/2022] Open
Abstract
Biomedical research depends increasingly on computational tools, but mechanisms ensuring open data, open software, and reproducibility are variably enforced by academic institutions, funders, and publishers. Publications may present software for which source code or documentation are or become unavailable; this compromises the role of peer review in evaluating technical strength and scientific contribution. Incomplete ancillary information for an academic software package may bias or limit subsequent work. We provide 8 recommendations to improve reproducibility, transparency, and rigor in computational biology, precisely the values that should be emphasized in life science curricula. Our recommendations for improving software availability, usability, and archival stability aim to foster a sustainable data science ecosystem in life science research.
Affiliation(s)
- Jaqueline J Brito
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA 90089, USA
- Jun Li
- Department of Computational Medicine & Bioinformatics, Medical School, University of Michigan, 1301 Catherine Street, Ann Arbor, MI 48109, USA
- Jason H Moore
- Department of Biostatistics, Epidemiology, and Informatics, Institute for Biomedical Informatics, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA 19104, USA
- Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, 3400 Civic Center Boulevard, Philadelphia, PA 19104, USA
- Childhood Cancer Data Lab, Alex's Lemonade Stand, 1429 Walnut St, Floor 10, Philadelphia, PA 19102, USA
- Nicole A Nogoy
- GigaScience, 26/F, Kings Wing Plaza 2, 1 On Kwan Street, Shek Mun, N.T., Hong Kong
- Lana X Garmire
- Department of Computational Medicine & Bioinformatics, Medical School, University of Michigan, 1301 Catherine Street, Ann Arbor, MI 48109, USA
- Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA 90089, USA
|
39
|
Brito JJ, Mosqueiro T, Rotman J, Xue V, Chapski DJ, la Hoz JD, Matias P, Martin LS, Zelikovsky A, Pellegrini M, Mangul S. Telescope: an interactive tool for managing large-scale analysis from mobile devices. Gigascience 2020; 9:giz163. [PMID: 31972019 PMCID: PMC6977584 DOI: 10.1093/gigascience/giz163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2019] [Revised: 11/26/2019] [Accepted: 12/19/2019] [Indexed: 11/27/2022] Open
Abstract
BACKGROUND In today's world of big data, computational analysis has become a key driver of biomedical research. High-performance computational facilities are capable of processing considerable volumes of data, yet often lack an easy-to-use interface to guide the user in supervising and adjusting bioinformatics analysis via a tablet or smartphone. RESULTS To address this gap we proposed Telescope, a novel tool that interfaces with high-performance computational clusters to deliver an intuitive user interface for controlling and monitoring bioinformatics analyses in real-time. By leveraging last generation technology now ubiquitous to most researchers (such as smartphones), Telescope delivers a friendly user experience and manages connectivity and encryption under the hood. CONCLUSIONS Telescope helps to mitigate the digital divide between wet and computational laboratories in contemporary biology. By delivering convenience and ease of use through a user experience not relying on expertise with computational clusters, Telescope can help researchers close the feedback loop between bioinformatics and experimental work with minimal impact on the performance of computational tools. Telescope is freely available at https://github.com/Mangul-Lab-USC/telescope.
Affiliation(s)
- Jaqueline J Brito
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA 90089-9121, USA
- Thiago Mosqueiro
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, 611 Charles E. Young Drive East, Los Angeles, CA 90095, USA
- Jeremy Rotman
- Department of Computer Science, University of California, Los Angeles, 404 Westwood Plaza, Los Angeles, CA 90095, USA
- Victor Xue
- Department of Computer Science, University of California, Los Angeles, 404 Westwood Plaza, Los Angeles, CA 90095, USA
- Douglas J Chapski
- Department of Anesthesiology, David Geffen School of Medicine at UCLA, 650 Charles E. Young Drive, Los Angeles, CA 90095, USA
- Juan De la Hoz
- Center for Neurobehavioral Genetics, University of California Los Angeles, 695 Charles E Young Dr S, Los Angeles, CA 90095, USA
- Paulo Matias
- Department of Computer Science, Federal University of São Carlos, km 325 Rod. Washington Luis, São Carlos, SP 13565–905, Brazil
- Lana S Martin
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA 90089-9121, USA
- Alex Zelikovsky
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA 30303, USA
- The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia
- Matteo Pellegrini
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, 611 Charles E. Young Drive East, Los Angeles, CA 90095, USA
- Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA 90089-9121, USA
|