1. Caliskan A, Dangwal S, Dandekar T. Metadata integrity in bioinformatics: Bridging the gap between data and knowledge. Comput Struct Biotechnol J 2023; 21:4895-4913. [PMID: 37860229 PMCID: PMC10582761 DOI: 10.1016/j.csbj.2023.10.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2023] [Revised: 10/04/2023] [Accepted: 10/04/2023] [Indexed: 10/21/2023]
Abstract
In the fast-evolving landscape of biomedical research, the emergence of big data has presented researchers with extraordinary opportunities to explore biological complexities. In biomedical research, big data imply also a big responsibility. This is not only due to genomics data being sensitive information but also due to genomics data being shared and re-analysed among the scientific community. This saves valuable resources and can even help to find new insights in silico. To fully use these opportunities, detailed and correct metadata are imperative. This includes not only the availability of metadata but also their correctness. Metadata integrity serves as a fundamental determinant of research credibility, supporting the reliability and reproducibility of data-driven findings. Ensuring metadata availability, curation, and accuracy are therefore essential for bioinformatic research. Not only must metadata be readily available, but they must also be meticulously curated and ideally error-free. Motivated by an accidental discovery of a critical metadata error in patient data published in two high-impact journals, we aim to raise awareness for the need of correct, complete, and curated metadata. We describe how the metadata error was found, addressed, and present examples for metadata-related challenges in omics research, along with supporting measures, including tools for checking metadata and software to facilitate various steps from data analysis to published research.
Affiliation(s)
- Aylin Caliskan
  - Department of Bioinformatics, Biocenter, University of Würzburg, 97074 Würzburg, Germany
- Seema Dangwal
  - Stanford Cardiovascular Institute, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305-5101, United States
- Thomas Dandekar
  - Department of Bioinformatics, Biocenter, University of Würzburg, 97074 Würzburg, Germany
2. Qian DC, Busam JA, Xiao X, O'Mara TA, Eeles RA, Schumacher FR, Phelan CM, Amos CI. seXY: a tool for sex inference from genotype arrays. Bioinformatics 2017; 33:561-563. [PMID: 28035028 DOI: 10.1093/bioinformatics/btw696] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 04/20/2016] [Accepted: 11/03/2016] [Indexed: 11/13/2022]
Abstract
Motivation: Checking concordance between reported sex and genotype-inferred sex is a crucial quality control measure in genome-wide association studies (GWAS). However, limited insights exist regarding the true accuracy of software that infer sex from genotype array data. Results: We present seXY, a logistic regression model trained on both X chromosome heterozygosity and Y chromosome missingness, that consistently demonstrated >99.5% sex inference accuracy in cross-validation for 889 males and 5,361 females enrolled in prostate cancer and ovarian cancer GWAS. Compared to PLINK, one of the most popular tools for sex inference in GWAS that assesses only X chromosome heterozygosity, seXY achieved marginally better male classification and 3% more accurate female classification. Availability and Implementation: https://github.com/Christopher-Amos-Lab/seXY. Contact: Christopher.I.Amos@dartmouth.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
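Illustrative note: the abstract above describes a logistic regression on two features, X chromosome heterozygosity and Y chromosome missingness. The following is a minimal sketch of that general idea, not the seXY implementation. The training values, the plain gradient-descent fit, and all function names are assumptions made for illustration only.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit a logistic regression by batch gradient descent (hypothetical helper)."""
    n_features = len(features[0])
    n = len(features)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log-loss with respect to the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict_female_probability(model, x):
    """Return the modelled probability that a sample is female."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Toy, made-up training data: (X heterozygosity, Y missingness).
# Label 1 = female, 0 = male. Real arrays would supply these per sample.
TRAIN_X = [(0.01, 0.02), (0.00, 0.05), (0.02, 0.01), (0.01, 0.03),
           (0.28, 0.97), (0.31, 0.99), (0.25, 0.95), (0.33, 0.98)]
TRAIN_Y = [0, 0, 0, 0, 1, 1, 1, 1]

model = fit_logistic(TRAIN_X, TRAIN_Y)
print(predict_female_probability(model, (0.30, 0.96)))
```

A sample whose predicted sex disagrees with its reported sex would then be flagged for review rather than corrected automatically.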
Affiliation(s)
- David C Qian
  - Department of Biomedical Data Science, Dartmouth Geisel School of Medicine, Lebanon, NH 03756, USA
- Jonathan A Busam
  - Department of Biological Sciences, Dartmouth College, Hanover, NH 03755, USA
- Xiangjun Xiao
  - Department of Biomedical Data Science, Dartmouth Geisel School of Medicine, Lebanon, NH 03756, USA
- Tracy A O'Mara
  - Department of Genetics and Computational Biology, QIMR Berghofer Medical Research Institute, Brisbane, QLD 4006, Australia
- Rosalind A Eeles
  - Division of Genetics and Epidemiology, Institute of Cancer Research, London SW7 3RP, UK
- Frederick R Schumacher
  - Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH 44106, USA
- Catherine M Phelan
  - Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL 33612, USA
- Christopher I Amos
  - Department of Biomedical Data Science, Dartmouth Geisel School of Medicine, Lebanon, NH 03756, USA
3. Jeong S, Kim J, Park W, Jeon H, Kim N. SEXCMD: Development and validation of sex marker sequences for whole-exome/genome and RNA sequencing. PLoS One 2017; 12:e0184087. [PMID: 28886064 PMCID: PMC5590872 DOI: 10.1371/journal.pone.0184087] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 05/15/2017] [Accepted: 08/17/2017] [Indexed: 12/01/2022]
Abstract
Over the last decade, a large number of nucleotide sequences have been generated by next-generation sequencing technologies and deposited to public databases. However, most of these datasets do not specify the sex of individuals sampled because researchers typically ignore or hide this information. Male and female genomes in many species have distinctive sex chromosomes, XX/XY and ZW/ZZ, and expression levels of many sex-related genes differ between the sexes. Herein, we describe how to develop sex marker sequences from syntenic regions of sex chromosomes and use them to quickly identify the sex of individuals being analyzed. Array-based technologies routinely use either known sex markers or the B-allele frequency of X or Z chromosomes to deduce the sex of an individual. The same strategy has been used with whole-exome/genome sequence data; however, all reads must be aligned onto a reference genome to determine the B-allele frequency of the X or Z chromosomes. SEXCMD is a pipeline that can extract sex marker sequences from reference sex chromosomes and rapidly identify the sex of individuals from whole-exome/genome and RNA sequencing after training with a known dataset through a simple machine learning approach. The pipeline counts total numbers of hits from sex-specific marker sequences and identifies the sex of the individuals sampled based on the fact that XX/ZZ samples do not have Y or W chromosome hits. We have successfully validated our pipeline with mammalian (Homo sapiens; XY) and avian (Gallus gallus; ZW) genomes. Typical calculation time when applying SEXCMD to human whole-exome or RNA sequencing datasets is a few minutes, and analyzing human whole-genome datasets takes about 10 minutes. Another important application of SEXCMD is as a quality control measure to avoid mixing samples before bioinformatics analysis. SEXCMD comprises simple Python and R scripts and is freely available at https://github.com/lovemun/SEXCMD.
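Illustrative note: the hit-counting idea in the abstract above (samples without a Y chromosome yield no Y-marker hits) can be sketched as follows. This is not the SEXCMD pipeline. The marker sequences, the exact-substring matching, the `min_y_fraction` threshold, and the function names are all hypothetical, and the example is framed for a human XY system only.

```python
def count_marker_hits(reads, markers):
    """Count reads containing at least one marker as an exact substring."""
    return sum(1 for read in reads if any(marker in read for marker in markers))

def infer_sex(reads, x_markers, y_markers, min_y_fraction=0.05):
    """Call XY when Y-marker hits are a non-trivial share of all marker hits.

    min_y_fraction is an assumed tolerance for stray hits, not a published value.
    """
    x_hits = count_marker_hits(reads, x_markers)
    y_hits = count_marker_hits(reads, y_markers)
    total = x_hits + y_hits
    if total == 0:
        return "undetermined"  # no marker evidence at all
    return "XY" if y_hits / total >= min_y_fraction else "XX"

# Made-up marker sequences and reads, for demonstration only.
X_MARKERS = ["ACGTTGCA"]
Y_MARKERS = ["TTGACCGA"]

reads_a = ["GGACGTTGCAGG", "CCTTGACCGACC", "AAAAAAAA"]
reads_b = ["GGACGTTGCAGG", "TTACGTTGCATT"]

print(infer_sex(reads_a, X_MARKERS, Y_MARKERS))
print(infer_sex(reads_b, X_MARKERS, Y_MARKERS))
```

Used as a quality control step, the inferred label would be compared against the recorded sex of each sample before any downstream analysis.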
Affiliation(s)
- Seongmun Jeong
  - Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon, Korea
- Jiwoong Kim
  - Quantitative Biomedical Research Center, Department of Clinical Sciences, University of Texas Southwestern Medical Center, Dallas, TX, United States of America
- Won Park
  - Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon, Korea
- Hongmin Jeon
  - Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon, Korea
- Namshin Kim
  - Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon, Korea
  - * E-mail:
4. Toker L, Feng M, Pavlidis P. Whose sample is it anyway? Widespread misannotation of samples in transcriptomics studies. F1000Res 2016; 5:2103. [PMID: 27746907 PMCID: PMC5034794 DOI: 10.12688/f1000research.9471.2] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Accepted: 09/29/2016] [Indexed: 01/21/2023]
Abstract
Concern about the reproducibility and reliability of biomedical research has been rising. An understudied issue is the prevalence of sample mislabeling, one impact of which would be invalid comparisons. We studied this issue in a corpus of human transcriptomics studies by comparing the provided annotations of sex to the expression levels of sex-specific genes. We identified apparent mislabeled samples in 46% of the datasets studied, yielding a 99% confidence lower-bound estimate for all studies of 33%. In a separate analysis of a set of datasets concerning a single cohort of subjects, 2/4 had mislabeled samples, indicating laboratory mix-ups rather than data recording errors. While the number of mixed-up samples per study was generally small, because our method can only identify a subset of potential mix-ups, our estimate is conservative for the breadth of the problem. Our findings emphasize the need for more stringent sample tracking, and that re-users of published data must be alert to the possibility of annotation and labelling errors.
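Illustrative note: the check described in the abstract above compares annotated sex with the expression of sex-specific genes. A minimal sketch of such a comparison follows. It is not the authors' method: the choice of XIST and RPS4Y1 as marker genes, the simple "whichever is higher" rule, the function names, and the expression values are assumptions for illustration.

```python
def predict_sex_from_expression(expression, female_gene="XIST", male_gene="RPS4Y1"):
    """Predict sex from two marker genes (assumed markers, simplistic rule)."""
    return "F" if expression[female_gene] > expression[male_gene] else "M"

def flag_mismatches(samples):
    """Return (sample_id, annotated, predicted) for every disagreeing sample."""
    flagged = []
    for sample_id, annotated_sex, expression in samples:
        predicted = predict_sex_from_expression(expression)
        if predicted != annotated_sex:
            flagged.append((sample_id, annotated_sex, predicted))
    return flagged

# Made-up log-scale expression values, for demonstration only.
SAMPLES = [
    ("S1", "F", {"XIST": 9.8, "RPS4Y1": 2.1}),
    ("S2", "M", {"XIST": 2.3, "RPS4Y1": 10.4}),
    ("S3", "M", {"XIST": 9.5, "RPS4Y1": 2.0}),  # annotation disagrees with expression
]

for sample_id, annotated, predicted in flag_mismatches(SAMPLES):
    print(sample_id, "annotated", annotated, "but expression suggests", predicted)
```

As the abstract notes, a check of this kind can only detect mix-ups between samples of different sex, so a clean result does not rule out mislabeling.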
Affiliation(s)
- Lilah Toker
  - Department of Psychiatry, University of British Columbia, Vancouver, V6T 2A1, Canada; Michael Smith Laboratories, University of British Columbia, Vancouver, V6T 1Z4, Canada
- Min Feng
  - Department of Psychiatry, University of British Columbia, Vancouver, V6T 2A1, Canada; Michael Smith Laboratories, University of British Columbia, Vancouver, V6T 1Z4, Canada; Graduate Program in Genome Sciences and Technology, University of British Columbia, Vancouver, V5Z 4S6, Canada
- Paul Pavlidis
  - Department of Psychiatry, University of British Columbia, Vancouver, V6T 2A1, Canada; Michael Smith Laboratories, University of British Columbia, Vancouver, V6T 1Z4, Canada
5. Toker L, Feng M, Pavlidis P. Whose sample is it anyway? Widespread misannotation of samples in transcriptomics studies. F1000Res 2016; 5:2103. [PMID: 27746907 DOI: 10.12688/f1000research.9471.1] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Accepted: 08/22/2016] [Indexed: 11/20/2022]
6. Yoo S, Huang T, Campbell JD, Lee E, Tu Z, Geraci MW, Powell CA, Schadt EE, Spira A, Zhu J. MODMatcher: multi-omics data matcher for integrative genomic analysis. PLoS Comput Biol 2014; 10:e1003790. [PMID: 25122495 PMCID: PMC4133046 DOI: 10.1371/journal.pcbi.1003790] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Received: 02/21/2014] [Accepted: 06/26/2014] [Indexed: 12/30/2022]
Abstract
Errors in sample annotation or labeling often occur in large-scale genetic or genomic studies and are difficult to avoid completely during data generation and management. For integrative genomic studies, it is critical to identify and correct these errors. Different types of genetic and genomic data are inter-connected by cis-regulations. On that basis, we developed a computational approach, Multi-Omics Data Matcher (MODMatcher), to identify and correct sample labeling errors in multiple types of molecular data, which can be used in further integrative analysis. Our results indicate that inspection of sample annotation and labeling error is an indispensable data quality assurance step. Applied to a large lung genomic study, MODMatcher increased statistically significant genetic associations and genomic correlations by more than two-fold. In a simulation study, MODMatcher provided more robust results by using three types of omics data than two types of omics data. We further demonstrate that MODMatcher can be broadly applied to large genomic data sets containing multiple types of omics data, such as The Cancer Genome Atlas (TCGA) data sets.
Many human diseases are complex with multiple genetic and environmental causal factors interacting together to give rise to disease phenotypes. Such factors affect biological systems through many layers of regulations, including transcriptional and epigenetic regulation, and protein changes. To fully understand their molecular mechanisms, complex diseases are often studied in diverse dimensions including genetics (genotype variations by single nucleotide polymorphism (SNP) arrays or whole exome sequencing), transcriptomics, epigenetics, and proteomics. However, errors in sample annotation or labeling often occur in large-scale genetic and genomic studies and are difficult to avoid completely during data generation and management. Identifying and correcting these errors are critical for integrative genomic studies. In this study, we developed a computational approach, Multi-Omics Data Matcher (MODMatcher), to identify and correct sample labeling errors based on multiple types of molecular data before further integrative analysis. Our results indicate that signals increased more than 100% after correction of sample labeling errors in a large lung genomic study. Our method can be broadly applied to large genomic data sets with multiple types of omics data, such as TCGA (The Cancer Genome Atlas) data sets.
Affiliation(s)
- Seungyeul Yoo
  - Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Tao Huang
  - Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Joshua D. Campbell
  - Division of Computational Biomedicine, Department of Medicine, Boston University School of Medicine, Boston, Massachusetts, United States of America
- Eunjee Lee
  - Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Zhidong Tu
  - Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Mark W. Geraci
  - Division of Pulmonary Sciences and Critical Care Medicine, University of Colorado Denver, Aurora, Colorado, United States of America
- Charles A. Powell
  - Division of Pulmonary, Critical Care and Sleep Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Eric E. Schadt
  - Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Avrum Spira
  - Division of Computational Biomedicine, Department of Medicine, Boston University School of Medicine, Boston, Massachusetts, United States of America
- Jun Zhu
  - Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
  - * E-mail: