1
Fu Y, Yuan ZF, Wu L, Peng J, Wang X, High AA. Addressing Sample Mix-Ups: Tools and Approaches for Large-Scale Multi-Omics Studies. Proteomics 2025; 25:e202400271. [PMID: 39659081; DOI: 10.1002/pmic.202400271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2024] [Revised: 11/25/2024] [Accepted: 11/26/2024] [Indexed: 12/12/2024]
Abstract
Advances in high-throughput omics technologies have enabled system-wide characterization of biological samples across multiple molecular levels, such as the genome, transcriptome, and proteome. However, as sample sizes rapidly increase in large-scale multi-omics studies, sample mix-ups have become a prevalent issue, compromising data integrity and leading to erroneous conclusions. The interconnected nature of multi-omics data presents an opportunity to identify and correct these errors. This review examines the potential sources of sample mix-ups and evaluates the methodologies and tools developed for detecting and correcting these errors, with an emphasis on approaches applicable to proteomics data. We categorize existing tools into three main groups: expression/protein quantitative trait loci-based, genotype concordance-based, and gene/protein expression correlation-based approaches. Notably, only a handful of tools currently utilize the proteogenomics approach for correcting sample mix-ups at the proteomics level. Integrating the strengths of current tools across diverse data types could enable the development of more versatile and comprehensive solutions. In conclusion, verifying sample identity is a critical first step to reduce bias and increase precision in subsequent analyses for large-scale multi-omics studies. By leveraging these tools for identifying and correcting sample mix-ups, researchers can significantly improve the reliability and reproducibility of biomedical research.
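The expression correlation-based category named in the abstract can be illustrated with a small sketch. This is not taken from any of the reviewed tools: the function names (`zscore_by_gene`, `flag_mismatches`), the toy data, and the choice of Pearson correlation on per-gene z-scores are all assumptions made for illustration. The idea is that a sample's RNA profile should correlate best with the protein profile carrying the same label; when another label wins, the pair is a candidate mix-up.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def zscore_by_gene(profiles):
    """Standardize each gene across samples.

    profiles maps sample label -> list of values in a fixed gene order.
    """
    labels = list(profiles)
    n_genes = len(profiles[labels[0]])
    scaled = {s: [] for s in labels}
    for g in range(n_genes):
        column = [profiles[s][g] for s in labels]
        m = mean(column)
        sd = sqrt(sum((v - m) ** 2 for v in column) / len(column))
        for s in labels:
            scaled[s].append((profiles[s][g] - m) / sd if sd > 0 else 0.0)
    return scaled


def flag_mismatches(rna, protein):
    """Return {rna_label: best_protein_label} where the two labels disagree."""
    rna_z, prot_z = zscore_by_gene(rna), zscore_by_gene(protein)
    flagged = {}
    for s in rna_z:
        best = max(prot_z, key=lambda p: pearson(rna_z[s], prot_z[p]))
        if best != s:
            flagged[s] = best
    return flagged


# Hypothetical toy data: four samples, four genes; protein labels C and D are swapped.
rna = {"A": [10, 1, 1, 1], "B": [1, 10, 1, 1], "C": [1, 1, 10, 1], "D": [1, 1, 1, 10]}
protein = {"A": [9, 2, 1, 1], "B": [1, 8, 2, 1], "C": [1, 2, 1, 9], "D": [2, 1, 9, 1]}

flagged = flag_mismatches(rna, protein)
print(flagged)  # {'C': 'D', 'D': 'C'}
```

In real data, RNA-protein correlations are far noisier than in this toy case, so a practical check would likely need many genes and some threshold on how clearly the best match beats the alternatives.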
Affiliation(s)
- Yingxue Fu
  - Center for Proteomics and Metabolomics, St. Jude Children's Research Hospital, Memphis, Tennessee, USA
- Zuo-Fei Yuan
  - Center for Proteomics and Metabolomics, St. Jude Children's Research Hospital, Memphis, Tennessee, USA
- Long Wu
  - Center for Proteomics and Metabolomics, St. Jude Children's Research Hospital, Memphis, Tennessee, USA
- Junmin Peng
  - Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, Tennessee, USA
  - Department of Developmental Neurobiology, St. Jude Children's Research Hospital, Memphis, Tennessee, USA
- Xusheng Wang
  - Department of Neurology, University of Tennessee Health Science Center, Memphis, Tennessee, USA
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, Tennessee, USA
- Anthony A High
  - Center for Proteomics and Metabolomics, St. Jude Children's Research Hospital, Memphis, Tennessee, USA
2
Yang J, Liu Y, Shang J, Chen Q, Chen Q, Ren L, Zhang N, Yu Y, Li Z, Song Y, Yang S, Scherer A, Tong W, Hong H, Xiao W, Shi L, Zheng Y. The Quartet Data Portal: integration of community-wide resources for multiomics quality control. Genome Biol 2023; 24:245. [PMID: 37884999; PMCID: PMC10601216; DOI: 10.1186/s13059-023-03091-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/08/2022] [Accepted: 10/17/2023] [Indexed: 10/28/2023] [Open Access]
Abstract
The Quartet Data Portal facilitates community access to well-characterized reference materials, reference datasets, and related resources established based on a family of four individuals with identical twins from the Quartet Project. Users can request DNA, RNA, protein, and metabolite reference materials, as well as datasets generated across omics, platforms, labs, protocols, and batches. Reproducible analysis tools allow for objective performance assessment of user-submitted data, while interactive visualization tools support rapid exploration of reference datasets. A closed-loop "distribution-collection-evaluation-integration" workflow enables updates and integration of community-contributed multiomics data. Ultimately, this portal helps promote the advancement of reference datasets and multiomics quality control.
Affiliation(s)
- Jingcheng Yang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
  - Greater Bay Area Institute of Precision Medicine, Guangzhou, Guangdong, China
- Yaqing Liu
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Jun Shang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Qiaochu Chen
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Qingwang Chen
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Luyao Ren
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Naixin Zhang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Ying Yu
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Zhihui Li
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Yueqiang Song
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Shengpeng Yang
  - Intelligent Storage, Alibaba Cloud, Alibaba Group, Hangzhou, Zhejiang, China
- Andreas Scherer
  - Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
  - EATRIS ERIC-European Infrastructure for Translational Medicine, Amsterdam, the Netherlands
- Weida Tong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- Huixiao Hong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- Wenming Xiao
  - Office of Oncological Diseases, Office of New Drugs, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, USA
- Leming Shi
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
  - International Human Phenome Institutes (Shanghai), Shanghai, China
- Yuanting Zheng
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
3
Li L, Niu M, Erickson A, Luo J, Rowbotham K, Guo K, Huang H, Li Y, Jiang Y, Hur J, Liu C, Peng J, Wang X. SMAP is a pipeline for sample matching in proteogenomics. Nat Commun 2022; 13:744. [PMID: 35136070; PMCID: PMC8825821; DOI: 10.1038/s41467-022-28411-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2021] [Accepted: 01/17/2022] [Indexed: 11/12/2022] [Open Access]
Abstract
The integration of genomics and proteomics data (proteogenomics) holds the promise of furthering the in-depth understanding of human disease. However, sample mix-up is a pervasive problem in proteogenomics because of the complexity of sample processing. Here, we present a pipeline for Sample Matching in Proteogenomics (SMAP) to verify sample identity and ensure data integrity. SMAP infers sample-dependent protein-coding variants from quantitative mass spectrometry (MS), and aligns the MS-based proteomic samples with genomic samples by two discriminant scores. Theoretical analysis with simulated data indicates that SMAP is capable of uniquely matching proteomic and genomic samples when ≥20% genotypes of individual samples are available. When SMAP was applied to a large-scale dataset generated by the PsychENCODE BrainGVEX project, 54 samples (19%) were corrected. The correction was further confirmed by ribosome profiling and chromatin sequencing (ATAC-seq) data from the same set of samples. Our results demonstrate that SMAP is an effective tool for sample verification in a large-scale MS-based proteogenomics study. SMAP is publicly available at https://github.com/UND-Wanglab/SMAP, and a web-based version can be accessed at https://smap.shinyapps.io/smap/. Sample mix-up is a potential problem in large-scale omic studies due to the complexity of sample processing. Here, the authors present a pipeline for sample matching in proteogenomics to verify sample identity and ensure data integrity.
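The matching step described in this abstract, aligning proteomic samples with genomic samples from inferred variants, can be sketched in a few lines. The code below is a generic genotype-concordance illustration and not SMAP's implementation: the abstract does not define its two discriminant scores, so the pair used here (best-match concordance and its margin over the runner-up), the `min_margin` threshold, the function names, and the toy genotypes are all assumptions. Missing inferred genotypes are written as `None`, reflecting that MS data typically cover only a fraction of variant sites.

```python
def concordance(inferred, reference):
    """Fraction of sites called in both samples at which the genotypes agree."""
    pairs = [(a, b) for a, b in zip(inferred, reference)
             if a is not None and b is not None]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def match_samples(proteomic, genomic, min_margin=0.2):
    """Match each proteomic sample to its most concordant genomic sample.

    Both inputs map sample label -> list of genotypes (0, 1, 2, or None)
    over the same ordered variant sites. Needs at least two genomic samples.
    """
    results = {}
    for pid, inferred in proteomic.items():
        ranked = sorted(
            ((concordance(inferred, ref), gid) for gid, ref in genomic.items()),
            reverse=True,
        )
        (best_score, best_id), (second_score, _) = ranked[0], ranked[1]
        margin = best_score - second_score
        results[pid] = {
            "best_match": best_id,
            "concordance": best_score,      # illustrative score 1
            "margin": margin,               # illustrative score 2
            "confident": margin >= min_margin,
            "mislabeled": best_id != pid,
        }
    return results


# Hypothetical toy data: six variant sites; proteomic labels S2 and S3 are swapped.
genomic = {
    "S1": [0, 1, 2, 0, 1, 2],
    "S2": [2, 1, 0, 2, 0, 0],
    "S3": [1, 0, 1, 1, 2, 1],
}
proteomic = {
    "S1": [0, 1, None, 0, 1, 2],
    "S2": [1, 0, 1, None, 2, 1],
    "S3": [2, None, 0, 2, 0, 0],
}

results = match_samples(proteomic, genomic)
for pid, r in results.items():
    print(pid, "->", r["best_match"], "mislabeled" if r["mislabeled"] else "ok")
```

Reporting a margin alongside the best score is one simple way to separate a unique match from an ambiguous one; a real pipeline would also need to handle genotype-inference errors, which this sketch ignores.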
Affiliation(s)
- Ling Li
  - Department of Biology, University of North Dakota, Grand Forks, ND, 58202, USA
- Mingming Niu
  - Departments of Structural Biology and Developmental Neurobiology, Center for Proteomics and Metabolomics, St. Jude Children's Research Hospital, Memphis, TN, 38105, USA
- Alyssa Erickson
  - Department of Biology, University of North Dakota, Grand Forks, ND, 58202, USA
- Jie Luo
  - State Key Laboratory for Managing Biotic and Chemical Threats to the Quality and Safety of Agro-products, Zhejiang Academy of Agricultural Sciences, Hangzhou, 310021, China
- Kincaid Rowbotham
  - Department of Biology, University of North Dakota, Grand Forks, ND, 58202, USA
- Kai Guo
  - Department of Neurology, University of Michigan, Ann Arbor, MI, 48109, USA
- He Huang
  - Department of Biology, University of North Dakota, Grand Forks, ND, 58202, USA
- Yuxin Li
  - Departments of Structural Biology and Developmental Neurobiology, Center for Proteomics and Metabolomics, St. Jude Children's Research Hospital, Memphis, TN, 38105, USA
- Yi Jiang
  - Department of Epidemiology and Biostatistics, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430030, China
- Junguk Hur
  - Department of Biomedical Sciences, School of Medicine and Health Sciences, University of North Dakota, Grand Forks, ND, 58202, USA
- Chunyu Liu
  - Department of Psychiatry, SUNY Upstate Medical University, Syracuse, NY, 13210, USA
- Junmin Peng
  - Departments of Structural Biology and Developmental Neurobiology, Center for Proteomics and Metabolomics, St. Jude Children's Research Hospital, Memphis, TN, 38105, USA
- Xusheng Wang
  - Department of Biology, University of North Dakota, Grand Forks, ND, 58202, USA
4
Sturm G, List M, Zhang JD. Tissue heterogeneity is prevalent in gene expression studies. NAR Genom Bioinform 2021; 3:lqab077. [PMID: 34514392; PMCID: PMC8415427; DOI: 10.1093/nargab/lqab077] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/16/2021] [Revised: 08/01/2021] [Accepted: 08/29/2021] [Indexed: 12/17/2022] [Open Access]
Abstract
Lack of reproducibility in gene expression studies is a serious issue being actively addressed by the biomedical research community. Besides established factors such as batch effects and incorrect sample annotations, we recently reported tissue heterogeneity, a consequence of unintended profiling of cells of other origins than the tissue of interest, as a source of variance. Although tissue heterogeneity exacerbates irreproducibility, its prevalence in gene expression data remains unknown. Here, we systematically analyse 2 667 publicly available gene expression datasets covering 76 576 samples. Using two independent data compendia and a reproducible, open-source software pipeline, we find a prevalence of tissue heterogeneity in gene expression data that affects between 1 and 40% of the samples, depending on the tissue type. We discover both cases of severe heterogeneity, which may be caused by mistakes in annotation or sample handling, and cases of moderate heterogeneity, which are likely caused by tissue infiltration or sample contamination. Our analysis establishes tissue heterogeneity as a widespread phenomenon in publicly available gene expression datasets, which constitutes an important source of variance that should not be ignored. Consequently, we advocate the application of quality-control methods such as BioQC to detect tissue heterogeneity prior to mining or analysing gene expression data.
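A quality-control check of the kind advocated here can be sketched as a rank-based comparison of tissue signature genes against the remaining genes in a sample. The code below is not BioQC's implementation; it is a minimal pure-Python illustration that computes a Mann-Whitney-style score (the fraction of signature/background gene pairs in which the signature gene is more highly expressed) and compares the top-scoring tissue with the annotation. The function names, the two-gene signatures, and the expression values are hypothetical.

```python
def rank_auc(expression, signature):
    """Share of (signature gene, background gene) pairs won by the signature gene.

    Equals the Mann-Whitney U statistic divided by n1 * n2; ties count as half.
    expression maps gene symbol -> expression value for one sample.
    """
    inside = [v for g, v in expression.items() if g in signature]
    outside = [v for g, v in expression.items() if g not in signature]
    if not inside or not outside:
        raise ValueError("need at least one signature gene and one background gene")
    wins = sum((a > b) + 0.5 * (a == b) for a in inside for b in outside)
    return wins / (len(inside) * len(outside))


def check_tissue(expression, signatures, annotated):
    """Score every tissue signature and compare the top one with the annotation."""
    scores = {tissue: rank_auc(expression, genes)
              for tissue, genes in signatures.items()}
    top = max(scores, key=scores.get)
    return {"scores": scores, "top_tissue": top, "consistent": top == annotated}


# Hypothetical toy signatures and a sample annotated as heart that looks like liver.
signatures = {
    "liver": {"ALB", "APOA1"},
    "heart": {"MYH6", "TNNT2"},
}
sample = {"ALB": 950, "APOA1": 800, "MYH6": 5, "TNNT2": 8, "ACTB": 300, "GAPDH": 400}

report = check_tissue(sample, signatures, annotated="heart")
print(report["top_tissue"], report["consistent"])  # liver False
```

A sample whose annotated tissue scores well but is accompanied by a second high-scoring signature would correspond to the moderate heterogeneity described above; distinguishing the two cases would require thresholds that this sketch does not attempt to set.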
Affiliation(s)
- Gregor Sturm
  - Biocenter, Institute of Bioinformatics, Medical University of Innsbruck, 6020 Innsbruck, Austria
- Markus List
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences Weihenstephan, Technical University of Munich, 85354 Freising, Germany
- Jitao David Zhang
  - Pharma Research and Early Development, Pharmaceutical Sciences, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Grenzacherstrasse 124, 4070 Basel, Switzerland