1
Su H, Zhou X, Lin G, Luo C, Meng W, Lv C, Chen Y, Wen Z, Li X, Wu Y, Xiao C, Yang J, Lu J, Luo X, Chen Y, Tam PKH, Li C, Sun H, Pan X. Deciphering the Oncogenic Landscape of Hepatocytes Through Integrated Single-Nucleus and Bulk RNA-Seq of Hepatocellular Carcinoma. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2025; 12:e2412944. [PMID: 39960344 PMCID: PMC11984907 DOI: 10.1002/advs.202412944] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2024] [Revised: 01/01/2025] [Indexed: 04/12/2025]
Abstract
Hepatocellular carcinoma (HCC) is a major cause of cancer-related mortality, yet the hepatocyte mechanisms driving oncogenesis remain poorly understood. In this study, single-nucleus RNA sequencing of samples from 22 HCC patients revealed 10 distinct hepatocyte subtypes, including beneficial Hep0, predominantly malignant Hep2, and immunosuppressive Hep9. These subtypes were strongly associated with patient prognosis, as confirmed in the TCGA-LIHC and Fudan HCC cohorts through hepatocyte composition deconvolution. A quantile-based scoring method is developed to integrate data from 29 public HCC datasets, creating a Quantile Distribution Model (QDM) with excellent diagnostic accuracy (Area Under the Curve, AUC = 0.968-0.982). QDM was employed to screen potential biomarkers, revealing that PDE7B functions as a key gene whose suppression promotes HCC progression. Guided by the genes specific to the Hep0/2/9 subtypes, HCC is categorized into metabolic, inflammatory, and matrix classes, which are distinguishable by gene mutation frequencies, survival times, enriched pathways, and immune infiltration. Drugs to which the three HCC classes are sensitive are also identified, namely ouabain, teniposide, and TG-101348. This study presents the largest single-cell hepatocyte dataset to date, offering transformative insights into hepatocarcinogenesis and a comprehensive framework for advancing HCC diagnostics, prognostics, and personalized treatment strategies.
Affiliation(s)
- Huanhou Su
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Southern Medical University and Guangdong Provincial Key Laboratory of Single Cell Technology and Application, Guangzhou 510515, China
  - Precision Regenerative Medicine Research Centre, Medical Science Division, and State Key Laboratory of Quality Research in Chinese Medicine, Macau University of Science and Technology, Macao 999078, China
- Xuewen Zhou
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Southern Medical University and Guangdong Provincial Key Laboratory of Single Cell Technology and Application, Guangzhou 510515, China
  - Precision Regenerative Medicine Research Centre, Medical Science Division, and State Key Laboratory of Quality Research in Chinese Medicine, Macau University of Science and Technology, Macao 999078, China
- Guanchuan Lin
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Southern Medical University and Guangdong Provincial Key Laboratory of Single Cell Technology and Application, Guangzhou 510515, China
- Chaochao Luo
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Southern Medical University and Guangdong Provincial Key Laboratory of Single Cell Technology and Application, Guangzhou 510515, China
  - College of Life Sciences, Shihezi University, Shihezi, Xinjiang 832003, China
- Wei Meng
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Southern Medical University and Guangdong Provincial Key Laboratory of Single Cell Technology and Application, Guangzhou 510515, China
- Cui Lv
  - Clinical Biobank Center, Microbiome Medicine Center, Department of Laboratory Medicine, Guangdong Provincial Clinical Research Center for Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
- Yuting Chen
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Southern Medical University and Guangdong Provincial Key Laboratory of Single Cell Technology and Application, Guangzhou 510515, China
- Zebin Wen
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Southern Medical University and Guangdong Provincial Key Laboratory of Single Cell Technology and Application, Guangzhou 510515, China
- Xu Li
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Southern Medical University and Guangdong Provincial Key Laboratory of Single Cell Technology and Application, Guangzhou 510515, China
- Yongzhang Wu
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Southern Medical University and Guangdong Provincial Key Laboratory of Single Cell Technology and Application, Guangzhou 510515, China
- Changtai Xiao
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Southern Medical University and Guangdong Provincial Key Laboratory of Single Cell Technology and Application, Guangzhou 510515, China
- Jian Yang
  - Department of Hepatobiliary Surgery I, General Surgery Center and Guangdong Provincial Clinical and Engineering Center of Digital Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
- Jiameng Lu
  - Precision Regenerative Medicine Research Centre, Medical Science Division, and State Key Laboratory of Quality Research in Chinese Medicine, Macau University of Science and Technology, Macao 999078, China
- Xingguang Luo
  - Department of Psychiatry, Yale University School of Medicine, New Haven, CT 06510, USA
- Yan Chen
  - Precision Regenerative Medicine Research Centre, Medical Science Division, and State Key Laboratory of Quality Research in Chinese Medicine, Macau University of Science and Technology, Macao 999078, China
- Paul KH Tam
  - Precision Regenerative Medicine Research Centre, Medical Science Division, and State Key Laboratory of Quality Research in Chinese Medicine, Macau University of Science and Technology, Macao 999078, China
- Chuanjiang Li
  - Division of Hepatobiliopancreatic Surgery, Department of General Surgery, Nanfang Hospital, Southern Medical University, Guangzhou, Guangdong 510515, China
- Haitao Sun
  - Clinical Biobank Center, Microbiome Medicine Center, Department of Laboratory Medicine, Guangdong Provincial Clinical Research Center for Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
- Xinghua Pan
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Southern Medical University and Guangdong Provincial Key Laboratory of Single Cell Technology and Application, Guangzhou 510515, China
  - Precision Regenerative Medicine Research Centre, Medical Science Division, and State Key Laboratory of Quality Research in Chinese Medicine, Macau University of Science and Technology, Macao 999078, China
  - Key Laboratory of Infectious Diseases Research in South China (China Ministry Education), Southern Medical University, Guangzhou, Guangdong 510515, China
2
Yu Y, Mai Y, Zheng Y, Shi L. Assessing and mitigating batch effects in large-scale omics studies. Genome Biol 2024; 25:254. [PMID: 39363244 PMCID: PMC11447944 DOI: 10.1186/s13059-024-03401-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 07/11/2023] [Accepted: 09/23/2024] [Indexed: 10/05/2024]
Abstract
Batch effects in omics data are notoriously common technical variations unrelated to study objectives, and may result in misleading outcomes if uncorrected, or hinder biomedical discovery if over-corrected. Assessing and mitigating batch effects is crucial for ensuring the reliability and reproducibility of omics data and minimizing the impact of technical variations on biological interpretation. In this review, we highlight the profound negative impact of batch effects and the urgent need to address this challenging problem in large-scale omics studies. We summarize potential sources of batch effects, current progress in evaluating and correcting them, and consortium efforts aiming to tackle them.
Affiliation(s)
- Ying Yu
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
- Yuanbang Mai
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
- Yuanting Zheng
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
- Leming Shi
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, China
  - Cancer Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
  - International Human Phenome Institutes (Shanghai), Shanghai, China
3
Feldman S, Ner-Gaon H, Treister E, Shay T. Comparison and development of cross-study normalization methods for inter-species transcriptional analysis. PLoS One 2024; 19:e0307997. [PMID: 39255285 PMCID: PMC11386461 DOI: 10.1371/journal.pone.0307997] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Accepted: 07/16/2024] [Indexed: 09/12/2024]
Abstract
Performing joint analysis of gene expression datasets from different experiments can present challenges brought on by multiple factors: differences in equipment, protocols, climate, etc. "Cross-study normalization" is a general term for transformations aimed at eliminating such effects, thus making datasets more comparable. However, joint analysis of datasets from different species is rarely done, and there are no dedicated normalization methods for such inter-species analysis. In order to test the usefulness of cross-study normalization methods for inter-species analysis, we first applied three cross-study normalization methods, EB, DWD and XPN, to RNA sequencing datasets from different species. We then developed a new approach to evaluate the performance of cross-study normalization in eliminating experimental effects, while also maintaining the biologically significant differences between species and conditions. Our results indicate that all normalization methods performed relatively well in the cross-species setting. We found XPN to be better at reducing experimental differences, and found EB to be better at preserving biological differences. Still, according to our in-silico experiments, none of the methods can enforce the preservation of the biological differences in the normalization process. In addition to the study above, in this work we propose a new dedicated cross-study and cross-species normalization method. Our aim is to address the shortcoming mentioned above: in the normalization process, we wish to reduce the experimental differences while preserving the biological differences. We term our method CSN, and base it on the performance evaluation criteria mentioned above. Repeating the same experiments, the CSN method obtained a better and more balanced conservation of biological differences within the datasets compared to existing methods. To summarize, we demonstrate the usefulness of cross-study normalization methods in the inter-species setting, and suggest a dedicated cross-study cross-species normalization method that will hopefully open the way to the development of improved normalization methods for the inter-species setting.
Affiliation(s)
- Sofya Feldman
  - Dept of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Hadas Ner-Gaon
  - Dept of Life Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Eran Treister
  - Dept of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Tal Shay
  - Dept of Life Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
4
Zheng Y, Liu Y, Yang J, Dong L, Zhang R, Tian S, Yu Y, Ren L, Hou W, Zhu F, Mai Y, Han J, Zhang L, Jiang H, Lin L, Lou J, Li R, Lin J, Liu H, Kong Z, Wang D, Dai F, Bao D, Cao Z, Chen Q, Chen Q, Chen X, Gao Y, Jiang H, Li B, Li B, Li J, Liu R, Qing T, Shang E, Shang J, Sun S, Wang H, Wang X, Zhang N, Zhang P, Zhang R, Zhu S, Scherer A, Wang J, Wang J, Huo Y, Liu G, Cao C, Shao L, Xu J, Hong H, Xiao W, Liang X, Lu D, Jin L, Tong W, Ding C, Li J, Fang X, Shi L. Multi-omics data integration using ratio-based quantitative profiling with Quartet reference materials. Nat Biotechnol 2024; 42:1133-1149. [PMID: 37679543 PMCID: PMC11252085 DOI: 10.1038/s41587-023-01934-1] [Citation(s) in RCA: 24] [Impact Index Per Article: 24.0] [Received: 10/25/2022] [Accepted: 07/31/2023] [Indexed: 09/09/2023]
Abstract
Characterization and integration of the genome, epigenome, transcriptome, proteome and metabolome of different datasets is difficult owing to a lack of ground truth. Here we develop and characterize suites of publicly available multi-omics reference materials of matched DNA, RNA, protein and metabolites derived from immortalized cell lines from a family quartet of parents and monozygotic twin daughters. These references provide built-in truth defined by relationships among the family members and the information flow from DNA to RNA to protein. We demonstrate how using a ratio-based profiling approach that scales the absolute feature values of a study sample relative to those of a concurrently measured common reference sample produces reproducible and comparable data suitable for integration across batches, labs, platforms and omics types. Our study identifies reference-free 'absolute' feature quantification as the root cause of irreproducibility in multi-omics measurement and data integration and establishes the advantages of ratio-based multi-omics profiling with common reference materials.
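The ratio-based idea described in this abstract can be illustrated with a minimal sketch. It assumes that each batch measures a common reference sample alongside the study sample and that the batch effect is purely multiplicative; the function name `ratio_profile`, the log2 transform, and the toy numbers are illustrative assumptions, not details taken from the paper.

```python
import math

def ratio_profile(study, reference):
    """Log2 ratio of each study-sample feature to the same feature
    in the reference sample measured in the same batch (hypothetical helper)."""
    return [math.log2(s / r) for s, r in zip(study, reference)]

# Toy "true" abundances for one study sample and the common reference.
true_sample = [10.0, 200.0, 50.0]
true_reference = [20.0, 100.0, 50.0]

# Assumption: batch B inflates every absolute value threefold.
batch_factor = 3.0

batch_a = ratio_profile(true_sample, true_reference)
batch_b = ratio_profile(
    [batch_factor * x for x in true_sample],
    [batch_factor * x for x in true_reference],
)

print(batch_a)  # ratios from batch A
print(batch_b)  # ratios from batch B; the multiplicative factor cancels
```

Under these assumptions the absolute values differ threefold between batches while the ratios agree, which is the sense in which ratio-scaled data are comparable across batches. Real batch effects are rarely a single multiplicative constant, so this is only a first-order picture.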
Affiliation(s)
- Yuanting Zheng
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Yaqing Liu
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Jingcheng Yang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
  - Greater Bay Area Institute of Precision Medicine, Guangzhou, China
- Rui Zhang
  - National Center for Clinical Laboratories, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing Hospital, Beijing, China
- Sha Tian
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Ying Yu
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Luyao Ren
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Wanwan Hou
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Feng Zhu
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Yuanbang Mai
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Ling Lin
  - Zhangjiang Center for Translational Medicine, Shanghai Biotecan Medical Diagnostics Co. Ltd., Shanghai, China
- Jingwei Lou
  - Zhangjiang Center for Translational Medicine, Shanghai Biotecan Medical Diagnostics Co. Ltd., Shanghai, China
- Ruiqiang Li
  - Novogene Bioinformatics Institute, Beijing, China
- Jingchao Lin
  - Metabo-Profile Biotechnology (Shanghai) Co. Ltd., Shanghai, China
- Depeng Wang
  - Nextomics Biosciences Institute, Wuhan, China
- Ding Bao
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Zehui Cao
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Qiaochu Chen
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Qingwang Chen
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Xingdong Chen
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Yuechen Gao
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- He Jiang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Bin Li
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Bingying Li
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Jingjing Li
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
  - Nextomics Biosciences Institute, Wuhan, China
- Ruimei Liu
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Tao Qing
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Erfei Shang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Jun Shang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Shanyue Sun
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Haiyan Wang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Xiaolin Wang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Naixin Zhang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Peipei Zhang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Ruolan Zhang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Sibo Zhu
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Andreas Scherer
  - Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
  - EATRIS ERIC-European Infrastructure for Translational Medicine, Amsterdam, the Netherlands
- Jiucun Wang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Jing Wang
  - National Institute of Metrology, Beijing, China
- Yinbo Huo
  - Key Laboratory of Bioanalysis and Metrology for State Market Regulation, Shanghai Institute of Measurement and Testing Technology, Shanghai, China
- Gang Liu
  - Key Laboratory of Bioanalysis and Metrology for State Market Regulation, Shanghai Institute of Measurement and Testing Technology, Shanghai, China
- Chengming Cao
  - Key Laboratory of Bioanalysis and Metrology for State Market Regulation, Shanghai Institute of Measurement and Testing Technology, Shanghai, China
- Li Shao
  - Key Laboratory of Bioanalysis and Metrology for State Market Regulation, Shanghai Institute of Measurement and Testing Technology, Shanghai, China
- Joshua Xu
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- Huixiao Hong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- Wenming Xiao
  - Office of Oncologic Diseases, Office of New Drugs, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, USA
- Xiaozhen Liang
  - Shanghai Institute of Immunity and Infection, Chinese Academy of Sciences, Shanghai, China
- Daru Lu
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Li Jin
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Weida Tong
  - Key Laboratory of Bioanalysis and Metrology for State Market Regulation, Shanghai Institute of Measurement and Testing Technology, Shanghai, China
- Chen Ding
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
- Jinming Li
  - National Center for Clinical Laboratories, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing Hospital, Beijing, China
- Xiang Fang
  - National Institute of Metrology, Beijing, China
- Leming Shi
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Human Phenome Institute and Shanghai Cancer Center, Fudan University, Shanghai, China
  - International Human Phenome Institutes (Shanghai), Shanghai, China
5
Goldstein Y, Cohen OT, Wald O, Bavli D, Kaplan T, Benny O. Particle uptake in cancer cells can predict malignancy and drug resistance using machine learning. SCIENCE ADVANCES 2024; 10:eadj4370. [PMID: 38809990 PMCID: PMC11314625 DOI: 10.1126/sciadv.adj4370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Accepted: 04/23/2024] [Indexed: 05/31/2024]
Abstract
Tumor heterogeneity is a primary factor that contributes to treatment failure. Predictive tools, capable of classifying cancer cells based on their functions, may substantially enhance therapy and extend patient life span. The connection between cell biomechanics and cancer cell functions is used here to classify cells through mechanical measurements, via particle uptake. Machine learning (ML) was used to classify cells based on single-cell patterns of uptake of particles with diverse sizes. Three pairs of human cancer cell subpopulations, varied in their level of drug resistance or malignancy, were studied. Cells were allowed to interact with fluorescently labeled polystyrene particles ranging in size from 0.04 to 3.36 μm and analyzed for their uptake patterns using flow cytometry. ML algorithms accurately classified cancer cell subtypes with accuracy rates exceeding 95%. The uptake data were especially advantageous for morphologically similar cell subpopulations. Moreover, the uptake data were found to serve as a form of "normalization" that could reduce variation in repeated experiments.
Affiliation(s)
- Yoel Goldstein
  - Institute for Drug Research, The School of Pharmacy, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel
- Ora T. Cohen
  - Institute for Drug Research, The School of Pharmacy, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel
- Ori Wald
  - Department of Cardiothoracic Surgery, Hadassah Medical Center, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem, Israel
- Danny Bavli
  - Department of Stem Cell and Regenerative Biology, Harvard Stem Cell Institute, Harvard University, Cambridge, MA, USA
- Tommy Kaplan
  - School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel
  - Department of Developmental Biology and Cancer Research, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel
- Ofra Benny
  - Institute for Drug Research, The School of Pharmacy, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem 9112001, Israel
6
Cheng SL, Hedges M, Keski-Rahkonen P, Chatziioannou AC, Scalbert A, Chung KF, Sinharay R, Green DC, de Kok TMCM, Vlaanderen J, Kyrtopoulos SA, Kelly F, Portengen L, Vineis P, Vermeulen RCH, Chadeau-Hyam M, Dagnino S. Multiomic Signatures of Traffic-Related Air Pollution in London Reveal Potential Short-Term Perturbations in Gut Microbiome-Related Pathways. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2024; 58:8771-8782. [PMID: 38728551 PMCID: PMC11112755 DOI: 10.1021/acs.est.3c09148] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Revised: 04/25/2024] [Accepted: 04/26/2024] [Indexed: 05/12/2024]
Abstract
This randomized crossover study investigated the metabolic and mRNA alterations associated with exposure to high and low traffic-related air pollution (TRAP) in 50 participants who were either healthy or were diagnosed with chronic obstructive pulmonary disease (COPD) or ischemic heart disease (IHD). For the first time, this study combined transcriptomics and serum metabolomics measured in the same participants over multiple time points (2 h before, and 2 and 24 h after exposure) and over two contrasted exposure regimes to identify potential multiomic modifications linked to TRAP exposure. With a multivariate normal model, we identified 78 metabolic features and 53 mRNA features associated with at least one TRAP exposure. Nitrogen dioxide (NO2) emerged as the dominant pollutant, with 67 unique associated metabolomic features. Pathway analysis and annotation of metabolic features consistently indicated perturbations in tryptophan metabolism associated with NO2 exposure, particularly in the gut-microbiome-associated indole pathway. Conditional multiomics networks revealed complex and intricate mechanisms associated with TRAP exposure, with some effects persisting 24 h after exposure. Our findings indicate that exposure to TRAP can alter important physiological mechanisms even after a short-term exposure of a 2 h walk. We describe for the first time a potential link between NO2 exposure and perturbation of microbiome-related pathways.
Affiliation(s)
- Sibo Lucas Cheng
  - NIHR HPRU in Environmental Exposures and Health, Imperial College London, London W12 0BZ, U.K.
  - MRC Centre for Environment and Health, Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, London W12 7TA, U.K.
- Michael Hedges
  - MRC Centre for Environment and Health, Environmental Research Group, Imperial College London, London W12 0BZ, U.K.
- Augustin Scalbert
  - International Agency for Research on Cancer (IARC), Lyon 69366 Cedex, France
- Kian Fan Chung
  - National Heart & Lung Institute, Imperial College London, London SW7 2AZ, U.K.
  - Royal Brompton & Harefield NHS Trust, London SW3 6NP, U.K.
- Rudy Sinharay
  - National Heart & Lung Institute, Imperial College London, London SW7 2AZ, U.K.
  - Imperial College Healthcare NHS Trust, London W2 1NY, U.K.
- David C. Green
  - NIHR HPRU in Environmental Exposures and Health, Imperial College London, London W12 0BZ, U.K.
  - MRC Centre for Environment and Health, Environmental Research Group, Imperial College London, London W12 0BZ, U.K.
- Theo M. C. M. de Kok
  - Department of Toxicogenomics, GROW School for Oncology and Reproduction, Maastricht University, Maastricht 6229 ER, The Netherlands
- Jelle Vlaanderen
  - Division of Environmental Epidemiology, Institute for Risk Assessment Sciences, Utrecht University, Utrecht 3584 CS, The Netherlands
- Frank Kelly
  - NIHR HPRU in Environmental Exposures and Health, Imperial College London, London W12 0BZ, U.K.
  - MRC Centre for Environment and Health, Environmental Research Group, Imperial College London, London W12 0BZ, U.K.
- Lützen Portengen
  - Division of Environmental Epidemiology, Institute for Risk Assessment Sciences, Utrecht University, Utrecht 3584 CS, The Netherlands
- Paolo Vineis
  - MRC Centre for Environment and Health, Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, London W12 7TA, U.K.
- Roel C. H. Vermeulen
  - Division of Environmental Epidemiology, Institute for Risk Assessment Sciences, Utrecht University, Utrecht 3584 CS, The Netherlands
  - Julius Centre for Health Sciences and Primary Care, University Medical Centre, Utrecht University, Utrecht 3584 CG, The Netherlands
- Marc Chadeau-Hyam
  - NIHR HPRU in Environmental Exposures and Health, Imperial College London, London W12 0BZ, U.K.
  - MRC Centre for Environment and Health, Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, London W12 7TA, U.K.
- Sonia Dagnino
  - MRC Centre for Environment and Health, Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, London W12 7TA, U.K.
  - Transporters in Imaging and Radiotherapy in Oncology (TIRO), School of Medicine, Direction de la Recherche Fondamentale (DRF), Institut des Sciences du Vivant Fréderic Joliot, Commissariat à l’Energie Atomique et aux Énergies Alternatives (CEA), Université Côte d’Azur (UniCA), Nice 06107, France
7
Vejrup K, Brantsæter AL, Caspersen IH, Haug LS, Villanger GD, Aase H, Knutsen HK. Mercury exposure in the Norwegian Mother, Father, and Child Cohort Study - measured and predicted blood concentrations and associations with birth weight. Heliyon 2024; 10:e30246. [PMID: 38726118 PMCID: PMC11078626 DOI: 10.1016/j.heliyon.2024.e30246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Revised: 02/20/2024] [Accepted: 04/22/2024] [Indexed: 05/12/2024]
Abstract
Background Blood total mercury concentration (BTHg) predominantly contains methyl Hg from seafood, and less inorganic Hg. Measured BTHg is often available only in a small proportion of large cohort study samples. Associations between estimated dietary intake of total Hg (THg) and lower birth weight within strata of maternal seafood intake were previously reported in the Norwegian Mother, Father, and Child Cohort Study (MoBa). However, maternal seafood consumption was associated with increased birth weight, indicating negative confounding by seafood in the association between THg intake and birth weight. Using predicted BTHg as a proxy for measured BTHg, we hypothesized that predicted BTHg would be associated with decreased birth weight. Objectives To develop and validate a prediction model for BTHg in MoBa and to examine the association between predicted BTHg and birth weight in the MoBa population. Methods Using linear regression, measured maternal BTHg (n = 1437) was used to build the best fitting model (highest R-squared value). Model validation (n = 1436) was based on correlation and weighted Kappa (κw). Associations between predicted BTHg in the MoBa population (n = 86,775) or measured BTHg (n = 3590) and birth weight were assessed by multivariate linear regression models. Results The best fitting model had R-squared = 0.3 and showed strong correlation (r = 0.53, p < 0.001) between predicted and measured BTHg. Cross-classification (quintiles) showed 73% correctly classified and 3.3% grossly misclassified, with κw of 0.37. Measured BTHg was not associated with birth weight. Predicted BTHg was significantly associated with higher birth weight. There were no trends in birth weight with increasing quintiles of measured or predicted BTHg after stratification into high or low seafood consumption. Conclusions The results indicate that prediction of BTHg did not overcome negative confounding of the association between Hg exposure and birth weight by seafood intake. Furthermore, an effect on birth weight of toxicological concern is not expected in our observed BTHg range.
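The validation step in this abstract (quintile cross-classification plus a weighted kappa) can be illustrated with a minimal Python sketch. The abstract does not state which weighting scheme or quintile rule was used, so the linear weights and the rank-based quintile assignment below are assumptions, and the function names `quintile_labels` and `weighted_kappa` are hypothetical.

```python
from collections import Counter


def quintile_labels(values):
    """Assign each value a quintile label 0..4 by rank (assumed rule; ties broken by input order)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = min(4, rank * 5 // len(values))
    return labels


def weighted_kappa(a, b, k=5):
    """Cohen's kappa with linear disagreement weights |i - j| / (k - 1) (assumed weighting)."""
    n = len(a)
    observed = Counter(zip(a, b))   # joint counts of (category in a, category in b)
    row, col = Counter(a), Counter(b)
    # Weighted disagreement actually observed.
    obs = sum(abs(i - j) / (k - 1) * c / n for (i, j), c in observed.items())
    # Weighted disagreement expected if the two ratings were independent.
    exp = sum(
        abs(i - j) / (k - 1) * row[i] * col[j] / n ** 2
        for i in range(k)
        for j in range(k)
    )
    return 1.0 - obs / exp


# Toy data, not from the study: predicted versus measured concentrations.
measured = [0.4, 1.1, 1.9, 2.6, 3.8, 0.7, 1.4, 2.2, 3.1, 4.5]
predicted = [0.6, 1.0, 2.4, 2.1, 3.5, 0.9, 1.2, 2.0, 3.9, 4.0]
kappa = weighted_kappa(quintile_labels(measured), quintile_labels(predicted))
```

A value of 1 means the two quintile assignments agree exactly, and a value near 0 means agreement no better than chance.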
Affiliation(s)
- Kristine Vejrup
- Institute of Military Epidemiology, Norwegian Armed Forces Joint Medical Services, Norway
| | - Anne Lise Brantsæter
- Department of Food Safety and Centre for Sustainable Diets, Norwegian Institute of Public Health, Norway
| | - Ida H. Caspersen
- Centre for Fertility and Health, Norwegian Institute of Public Health, Norway
| | - Line S. Haug
- Department of Food Safety and Centre for Sustainable Diets, Norwegian Institute of Public Health, Norway
| | - Gro D. Villanger
- Department of Child Health and Development, Norwegian Institute of Public Health, Norway
| | - Heidi Aase
- Department of Child Health and Development, Norwegian Institute of Public Health, Norway
| | - Helle K. Knutsen
- Department of Food Safety and Centre for Sustainable Diets, Norwegian Institute of Public Health, Norway
|
8
|
Van R, Alvarez D, Mize T, Gannavarapu S, Chintham Reddy L, Nasoz F, Han MV. A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies. BMC Bioinformatics 2024; 25:181. [PMID: 38720247 PMCID: PMC11080237 DOI: 10.1186/s12859-024-05801-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Accepted: 05/02/2024] [Indexed: 05/12/2024] Open
Abstract
BACKGROUND RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated from different labs. Often, the quality of the data is variable, procured differently, and contains unwanted noise hampering the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in attempts to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin. RESULTS We aimed to investigate the impact of data preprocessing steps (normalization, batch effect correction, and data scaling) through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer. CONCLUSION By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance measured by weighted F1-score in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
Affiliation(s)
- Richard Van
- School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV, USA
- Nevada Institute of Personalized Medicine, Las Vegas, NV, USA
| | - Daniel Alvarez
- Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA
- Nevada Institute of Personalized Medicine, Las Vegas, NV, USA
| | - Travis Mize
- Icahn School of Medicine at Mount Sinai, Institute for Genomic Health, New York City, NY, USA
| | - Sravani Gannavarapu
- Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA
- Nevada Institute of Personalized Medicine, Las Vegas, NV, USA
| | - Lohitha Chintham Reddy
- Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA
- Nevada Institute of Personalized Medicine, Las Vegas, NV, USA
| | - Fatma Nasoz
- Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA
- Nevada Institute of Personalized Medicine, Las Vegas, NV, USA
| | - Mira V Han
- School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV, USA.
- Nevada Institute of Personalized Medicine, Las Vegas, NV, USA.
|
9
|
Chin JL, Tan ZC, Chan LC, Ruffin F, Parmar R, Ahn R, Taylor SD, Bayer AS, Hoffmann A, Fowler VG, Reed EF, Yeaman MR, Meyer AS. Tensor modeling of MRSA bacteremia cytokine and transcriptional patterns reveals coordinated, outcome-associated immunological programs. PNAS NEXUS 2024; 3:pgae185. [PMID: 38779114 PMCID: PMC11109816 DOI: 10.1093/pnasnexus/pgae185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Accepted: 04/17/2024] [Indexed: 05/25/2024]
Abstract
Methicillin-resistant Staphylococcus aureus (MRSA) bacteremia is a common and life-threatening infection that imposes up to 30% mortality even when appropriate therapy is used. Despite in vitro efficacy determined by minimum inhibitory concentration breakpoints, antibiotics often fail to resolve these infections in vivo, resulting in persistent MRSA bacteremia. Recently, several genetic, epigenetic, and proteomic correlates of persistent outcomes have been identified. However, the extent to which single variables or their composite patterns operate as independent predictors of outcome or reflect shared underlying mechanisms of persistence is unknown. To explore this question, we employed a tensor-based integration of host transcriptional and cytokine datasets across a well-characterized cohort of patients with persistent or resolving MRSA bacteremia outcomes. This method yielded high correlative accuracy with outcomes and immunologic signatures united by transcriptomic and cytokine datasets. Results reveal that patients with persistent MRSA bacteremia (PB) exhibit signals of granulocyte dysfunction, suppressed antigen presentation, and deviated lymphocyte polarization. In contrast, patients with resolving bacteremia (RB) heterogeneously exhibit correlates of robust antigen-presenting cell trafficking and enhanced neutrophil maturation corresponding to appropriate T lymphocyte polarization and B lymphocyte response. These results suggest that transcriptional and cytokine correlates of PB vs. RB outcomes are complex and may not be disclosed by conventional modeling. In this respect, a tensor-based integration approach may help to reveal consensus molecular and cellular mechanisms and their biological interpretation.
Affiliation(s)
- Jackson L Chin
- Department of Bioengineering, University of California, Los Angeles, Los Angeles, CA 90024, USA
| | - Zhixin Cyrillus Tan
- Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90024, USA
| | - Liana C Chan
- The Lundquist Institute for Biomedical Innovation, Harbor-UCLA Medical Center, Torrance, CA 90502, USA
- Department of Medicine, David Geffen School of Medicine, University of California, Los Angeles, CA 90095, USA
- Division of Infectious Diseases, Department of Medicine, Harbor-UCLA Medical Center, Torrance, CA 90502, USA
- Division of Molecular Medicine, Department of Medicine, Harbor-UCLA Medical Center, Torrance, CA 90502, USA
| | - Felicia Ruffin
- Division of Infectious Diseases, Duke University School of Medicine, Durham, NC 27710, USA
| | - Rajesh Parmar
- Department of Pathology and Laboratory Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Richard Ahn
- Institute for Quantitative and Computational Biosciences, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095, USA
| | - Scott D Taylor
- Department of Bioengineering, University of California, Los Angeles, Los Angeles, CA 90024, USA
| | - Arnold S Bayer
- The Lundquist Institute for Biomedical Innovation, Harbor-UCLA Medical Center, Torrance, CA 90502, USA
- Department of Medicine, David Geffen School of Medicine, University of California, Los Angeles, CA 90095, USA
| | - Alexander Hoffmann
- Institute for Quantitative and Computational Biosciences, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095, USA
| | - Vance G Fowler
- Division of Infectious Diseases, Duke University School of Medicine, Durham, NC 27710, USA
| | - Elaine F Reed
- Department of Pathology and Laboratory Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Michael R Yeaman
- The Lundquist Institute for Biomedical Innovation, Harbor-UCLA Medical Center, Torrance, CA 90502, USA
- Department of Medicine, David Geffen School of Medicine, University of California, Los Angeles, CA 90095, USA
- Division of Infectious Diseases, Department of Medicine, Harbor-UCLA Medical Center, Torrance, CA 90502, USA
- Division of Molecular Medicine, Department of Medicine, Harbor-UCLA Medical Center, Torrance, CA 90502, USA
- Division of Infectious Diseases, Duke University School of Medicine, Durham, NC 27710, USA
| | - Aaron S Meyer
- Department of Bioengineering, University of California, Los Angeles, Los Angeles, CA 90024, USA
- Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90024, USA
- Jonsson Comprehensive Cancer Center, University of California, Los Angeles, Los Angeles, CA 90024, USA
- Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research, University of California, Los Angeles, Los Angeles, CA 90024, USA
|
10
|
Kotlov N, Shaposhnikov K, Tazearslan C, Chasse M, Baisangurov A, Podsvirova S, Fernandez D, Abdou M, Kaneunyenye L, Morgan K, Cheremushkin I, Zemskiy P, Chelushkin M, Sorokina M, Belova E, Khorkova S, Lozinsky Y, Nuzhdina K, Vasileva E, Kravchenko D, Suryamohan K, Nomie K, Curran J, Fowler N, Bagaev A. Procrustes is a machine-learning approach that removes cross-platform batch effects from clinical RNA sequencing data. Commun Biol 2024; 7:392. [PMID: 38555407 PMCID: PMC10981711 DOI: 10.1038/s42003-024-06020-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Accepted: 03/06/2024] [Indexed: 04/02/2024] Open
Abstract
With the increased use of gene expression profiling for personalized oncology, optimized RNA sequencing (RNA-seq) protocols and algorithms are necessary to provide comparable expression measurements between exome capture (EC)-based and poly-A RNA-seq. Here, we developed and optimized an EC-based protocol for processing formalin-fixed, paraffin-embedded samples and a machine-learning algorithm, Procrustes, to overcome batch effects across RNA-seq data obtained using different sample preparation protocols like EC-based or poly-A RNA-seq protocols. Applying Procrustes to samples processed using EC and poly-A RNA-seq protocols showed the expression of 61% of genes (N = 20,062) to correlate across both protocols (concordance correlation coefficient > 0.8, versus 26% before transformation by Procrustes), including 84% of cancer-specific and cancer microenvironment-related genes (versus 36% before applying Procrustes; N = 1,438). Benchmarking analyses also showed Procrustes to outperform other batch correction methods. Finally, we showed that Procrustes can project RNA-seq data for a single sample to a larger cohort of RNA-seq data. Future application of Procrustes will enable direct gene expression analysis for single tumor samples to support gene expression-based treatment decisions.
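The cross-protocol comparison above reports the share of genes whose concordance correlation coefficient exceeds 0.8. As a minimal sketch, assuming the standard definition by Lin (the abstract does not say which estimator was used), the coefficient for one gene measured in paired samples could be computed as follows. The function name and the toy values are hypothetical.

```python
def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient for paired measurements.

    Unlike Pearson's r, it penalises shifts in location and scale, so two
    protocols must agree in absolute value, not just in ordering.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


# Toy expression values for one gene in three paired samples (illustrative only).
poly_a = [1.0, 2.0, 3.0]
exome_capture = [2.0, 3.0, 4.0]  # perfectly correlated but shifted by +1
ccc = concordance_correlation(poly_a, exome_capture)
```

In this toy case Pearson's r would be 1, while the concordance coefficient is 4/7 because of the constant offset. That offset is the kind of protocol-level disagreement a batch correction method aims to remove.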
Affiliation(s)
- Mary Abdou
- BostonGene, Corp., Waltham, MA, 02453, USA
|
11
|
Kumar B, Lorusso E, Fosso B, Pesole G. A comprehensive overview of microbiome data in the light of machine learning applications: categorization, accessibility, and future directions. Front Microbiol 2024; 15:1343572. [PMID: 38419630 PMCID: PMC10900530 DOI: 10.3389/fmicb.2024.1343572] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Accepted: 01/29/2024] [Indexed: 03/02/2024] Open
Abstract
Metagenomics, Metabolomics, and Metaproteomics have significantly advanced our knowledge of microbial communities by providing culture-independent insights into their composition and functional potential. However, a critical challenge in this field is the lack of standard and comprehensive metadata associated with raw data, hindering the ability to perform robust data stratifications and consider confounding factors. In this comprehensive review, we categorize publicly available microbiome data into five types: shotgun sequencing, amplicon sequencing, metatranscriptomic, metabolomic, and metaproteomic data. We explore the importance of metadata for data reuse and address the challenges in collecting standardized metadata. We also assess the limitations in metadata collection of existing public repositories collecting metagenomic data. This review emphasizes the vital role of metadata in interpreting and comparing datasets and highlights the need for standardized metadata protocols to fully leverage metagenomic data's potential. Furthermore, we explore future directions for implementing Machine Learning (ML) in metadata retrieval, offering promising avenues for a deeper understanding of microbial communities and their ecological roles. Leveraging these tools will enhance our insights into microbial functional capabilities and ecological dynamics in diverse ecosystems. Finally, we emphasize the crucial role of metadata in ML model development.
Affiliation(s)
- Bablu Kumar
- Università degli Studi di Milano, Milan, Italy
- Department of Biosciences, Biotechnology and Environment, University of Bari A. Moro, Bari, Italy
| | - Erika Lorusso
- Department of Biosciences, Biotechnology and Environment, University of Bari A. Moro, Bari, Italy
- National Research Council, Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, Bari, Italy
| | - Bruno Fosso
- Department of Biosciences, Biotechnology and Environment, University of Bari A. Moro, Bari, Italy
| | - Graziano Pesole
- Department of Biosciences, Biotechnology and Environment, University of Bari A. Moro, Bari, Italy
- National Research Council, Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, Bari, Italy
|
12
|
Xiong Y, Ma Y, Liu K, Lei J, Zhao J, Zhu J, Wang W, Wen M, Wang X, Sun Y, Zhao Y, Han Y, Jiang T, Liu Y. A gene-based score for the risk stratification of stage IA lung adenocarcinoma. Respir Res 2024; 25:18. [PMID: 38178073 PMCID: PMC10765678 DOI: 10.1186/s12931-023-02647-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2022] [Accepted: 12/20/2023] [Indexed: 01/06/2024] Open
Abstract
OBJECTIVE We aim to molecularly stratify stage IA lung adenocarcinoma (LUAD) for precision medicine. METHODS Twelve multi-institution datasets (837 cases of IA) were used to classify the high- and low-risk types (based on survival status within 5 years), and the biological differences were compared. Then, a gene-based classifying score (IA score) was trained, tested and validated by several machine learning methods. Furthermore, we estimated the significance of the IA score in the prognostic assessment, chemotherapy prediction and risk stratification of stage IA LUAD. We also developed an R package for the clinical application. The SEER database (15708 IA samples) and TCGA Pan-Cancer (1881 stage I samples) database were used to verify clinical significance. RESULTS Compared with the low-risk group, the high-risk group of stage IA LUAD has obvious enrichment of the malignant pathway and more driver mutations and copy number variations. The effect of the IA score on the classification of high- and low-risk stage IA LUAD was much better than that of classical clinicopathological factors (training set: AUC = 0.9, validation set: AUC = 0.7). The IA score can significantly predict the prognosis of stage IA LUAD and has a prognostic effect for stage I pancancer. The IA score can effectively predict chemotherapy sensitivity and occult metastasis or invasion in stage IA LUAD. The R package IAExpSuv has a good risk probability prediction effect for both groups and single stages of IA LUAD. CONCLUSIONS The IA score can effectively stratify the risk of stage IA LUAD, offering good assistance in precision medicine.
Affiliation(s)
- Yanlu Xiong
- Department of Thoracic Surgery, First Medical Center, Chinese PLA General Hospital and PLA Medical School, Beijing, China
- Department of Thoracic Surgery, Tangdu Hospital, Fourth Military Medical University, Xi'an, China
- Innovation Center for Advanced Medicine, Tangdu Hospital, Fourth Military Medical University, Xi'an, China
| | - Yongfu Ma
- Department of Thoracic Surgery, First Medical Center, Chinese PLA General Hospital and PLA Medical School, Beijing, China
| | - Kun Liu
- Department of Epidemiology, Ministry of Education Key Laboratory of Hazard Assessment and Control in Special Operational Environment, School of Public Health, Air Force Medical University, Xi'an, China
| | - Jie Lei
- Department of Thoracic Surgery, Tangdu Hospital, Fourth Military Medical University, Xi'an, China
| | - Jinbo Zhao
- Department of Thoracic Surgery, Tangdu Hospital, Fourth Military Medical University, Xi'an, China
| | - Jianfei Zhu
- Department of Thoracic Surgery, Tangdu Hospital, Fourth Military Medical University, Xi'an, China
- Department of Thoracic Surgery, Shaanxi Provincial People's Hospital, Xi'an, China
| | - Wenchen Wang
- Department of Thoracic Surgery, Tangdu Hospital, Fourth Military Medical University, Xi'an, China
| | - Miaomiao Wen
- Department of Thoracic Surgery, Tangdu Hospital, Fourth Military Medical University, Xi'an, China
| | - Xuejiao Wang
- Department of Thoracic Surgery, Tangdu Hospital, Fourth Military Medical University, Xi'an, China
| | - Ying Sun
- Department of Thoracic Surgery, Tangdu Hospital, Fourth Military Medical University, Xi'an, China
| | - Yabo Zhao
- Department of Thoracic Surgery, Tangdu Hospital, Fourth Military Medical University, Xi'an, China
| | - Yong Han
- Department of Thoracic Surgery, Tangdu Hospital, Fourth Military Medical University, Xi'an, China.
- Department of Thoracic Surgery, Air Force Medical Center, Fourth Military Medical University, Beijing, China.
| | - Tao Jiang
- Department of Thoracic Surgery, Tangdu Hospital, Fourth Military Medical University, Xi'an, China.
| | - Yang Liu
- Department of Thoracic Surgery, First Medical Center, Chinese PLA General Hospital and PLA Medical School, Beijing, China.
|
13
|
Mizuno T, Kusuhara H. Investigation of normalization procedures for transcriptome profiles of compounds oriented toward practical study design. J Toxicol Sci 2024; 49:249-259. [PMID: 38825484 DOI: 10.2131/jts.49.249] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2024]
Abstract
The transcriptome profile is a representative phenotype-based descriptor of compounds, widely acknowledged for its ability to effectively capture compound effects. However, the presence of batch differences is inevitable. Despite the existence of sophisticated statistical methods, many of them presume a substantial sample size. How should we design a transcriptome analysis to obtain robust compound profiles, particularly in the context of small datasets frequently encountered in practical scenarios? This study addresses this question by investigating the normalization procedures for transcriptome profiles, focusing on the baseline distribution employed in deriving biological responses as profiles. Firstly, we investigated two large GeneChip datasets, comparing the impact of different normalization procedures. Through an evaluation of the similarity between response profiles of biological replicates within each dataset and the similarity between response profiles of the same compound across datasets, we revealed that the baseline distribution defined by all samples within each batch under batch-corrected condition is a good choice for large datasets. Subsequently, we conducted a simulation to explore the influence of the number of control samples on the robustness of response profiles across datasets. The results offer insights into determining the suitable quantity of control samples for diminutive datasets. It is crucial to acknowledge that these conclusions stem from constrained datasets. Nevertheless, we believe that this study enhances our understanding of how to effectively leverage transcriptome profiles of compounds and promotes the accumulation of essential knowledge for the practical application of such profiles.
Affiliation(s)
- Tadahaya Mizuno
- Laboratory of Molecular Pharmacokinetics, Graduate School of Pharmaceutical Sciences, The University of Tokyo
| | - Hiroyuki Kusuhara
- Laboratory of Molecular Pharmacokinetics, Graduate School of Pharmaceutical Sciences, The University of Tokyo
|
14
|
Hosseini M, Hammami B, Kazemi M. Identification of potential diagnostic biomarkers and therapeutic targets for endometriosis based on bioinformatics and machine learning analysis. J Assist Reprod Genet 2023; 40:2439-2451. [PMID: 37555920 PMCID: PMC10504186 DOI: 10.1007/s10815-023-02903-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/31/2023] [Accepted: 07/28/2023] [Indexed: 08/10/2023] Open
Abstract
PURPOSE Endometriosis (EMs) is a major gynecological condition in women. Due to the absence of definitive symptoms, its early detection is very challenging; thus, it is crucial to find biomarkers to ease its diagnosis and therapy. Here, we aimed to identify potential diagnostic and therapeutic targets for EMs by constructing a regulatory network and using machine learning approaches. METHODS Three Gene Expression Omnibus (GEO) datasets were merged, and differentially expressed genes (DEGs) were identified after preprocessing steps. Using the DEGs, a transcription factor (TF)-mRNA-miRNA regulatory network was constructed, and hub genes were detected based on four different algorithms in CytoHubba. The hub genes were used to build a GaussianNB diagnostic model and also in docking analyses performed using Discovery Studio and AutoDock Vina software. RESULTS A total of 119 DEGs were identified between EMs and non-EMs samples. A regulatory network consisting of 52 mRNAs, 249 miRNAs, and 37 TFs was then constructed. The diagnostic model was built using the hub genes selected from the network (GATA6, HMOX1, HS3ST1, NFASC, and PTGIS); its area under the curve (AUC) was 0.98 and 0.92 in the training and validation cohorts, respectively. Based on docking analysis, two chemical compounds, rofecoxib and retinoic acid, had potential therapeutic effects on EMs. CONCLUSION In conclusion, this study identified potential diagnostic and therapeutic targets for EMs, which require further experimental confirmation.
Affiliation(s)
- Maryam Hosseini
- Department of Genetics and Molecular Biology, School of Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
| | - Behnaz Hammami
- Department of Genetics and Molecular Biology, School of Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
| | - Mohammad Kazemi
- Department of Genetics and Molecular Biology, School of Medicine, Isfahan University of Medical Sciences, Isfahan, Iran.
- Reproductive Sciences and Sexual Health Research Center, Isfahan University of Medical Sciences, Isfahan, Iran.
|
15
|
Zhao J, Chao K, Wang A. Integrative analysis of metabolome, proteome, and transcriptome for identifying genes influencing total lignin content in Populus trichocarpa. FRONTIERS IN PLANT SCIENCE 2023; 14:1244020. [PMID: 37771490 PMCID: PMC10525687 DOI: 10.3389/fpls.2023.1244020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Accepted: 08/22/2023] [Indexed: 09/30/2023]
Abstract
Lignin, a component of plant cell walls, possesses significant research potential as a renewable energy source to replace carbon-based products and as a notable pollutant in papermaking processes. The monolignol biosynthetic pathway has been elucidated and it is known that not all monolignol genes influence the total lignin content. However, it remains unclear which monolignol genes are more closely related to the total lignin content and which potential genes influence the total lignin content. In this study, we present a combination of t-test, differential gene expression analysis, correlation analysis, and weighted gene co-expression network analysis to identify genes that regulate the total lignin content by utilizing multi-omics data from transgenic knockdowns of the monolignol genes, comprising transcriptome, proteome, and total lignin content data. Firstly, it was discovered that enzymes from the PtrPAL, Ptr4CL, PtrC3H, and PtrC4H gene families are more strongly correlated with the total lignin content. Additionally, the co-downregulation of three genes, PtrC3H3, PtrC4H1, and PtrC4H2, had the greatest impact on the total lignin content. Secondly, GO and KEGG analysis of lignin-related modules revealed that the total lignin content is not only influenced by monolignol genes, but also closely related to genes involved in the "glutathione metabolic process", "cellular modified amino acid metabolic process" and "carbohydrate catabolic process" pathways. Finally, the cinnamyl alcohol dehydrogenase genes CAD1, CADL3, and CADL8 emerged as potential contributors to total lignin content. The genes HYR1 (UDP-glycosyltransferase superfamily protein) and UGT71B1 (UDP-glucosyltransferase), exhibiting a close relationship with coumarin, have the potential to influence total lignin content by regulating coumarin metabolism. Additionally, the monolignol genes PtrC3H3, PtrC4H1, and PtrC4H2, which belong to the cytochrome P450 genes, may have a significant impact on the total lignin content. Overall, this study establishes connections between gene expression levels and total lignin content, effectively identifying genes that have a significant impact on total lignin content and offering novel perspectives for future lignin research endeavours.
Affiliation(s)
- Jia Zhao
- College of Computer and Control Engineering, Northeast Forestry University, Harbin, China
| | - Kairui Chao
- College of Forestry, Inner Mongolia Agricultural University, Hohhot, China
| | - Achuan Wang
- College of Computer and Control Engineering, Northeast Forestry University, Harbin, China
|
16
|
Yu Y, Zhang N, Mai Y, Ren L, Chen Q, Cao Z, Chen Q, Liu Y, Hou W, Yang J, Hong H, Xu J, Tong W, Dong L, Shi L, Fang X, Zheng Y. Correcting batch effects in large-scale multiomics studies using a reference-material-based ratio method. Genome Biol 2023; 24:201. [PMID: 37674217 PMCID: PMC10483871 DOI: 10.1186/s13059-023-03047-z] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Received: 11/08/2022] [Accepted: 05/18/2023] [Indexed: 09/08/2023] Open
Abstract
BACKGROUND Batch effects are notoriously common technical variations in multiomics data and may result in misleading outcomes if uncorrected or over-corrected. A plethora of batch-effect correction algorithms are proposed to facilitate data integration. However, their respective advantages and limitations are not adequately assessed in terms of omics types, the performance metrics, and the application scenarios. RESULTS As part of the Quartet Project for quality control and data integration of multiomics profiling, we comprehensively assess the performance of seven batch effect correction algorithms based on different performance metrics of clinical relevance, i.e., the accuracy of identifying differentially expressed features, the robustness of predictive models, and the ability of accurately clustering cross-batch samples into their own donors. The ratio-based method, i.e., by scaling absolute feature values of study samples relative to those of concurrently profiled reference material(s), is found to be much more effective and broadly applicable than others, especially when batch effects are completely confounded with biological factors of study interests. We further provide practical guidelines for implementing the ratio based approach in increasingly large-scale multiomics studies. CONCLUSIONS Multiomics measurements are prone to batch effects, which can be effectively corrected using ratio-based scaling of the multiomics data. Our study lays the foundation for eliminating batch effects at a ratio scale.
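The ratio-based idea described above, scaling each study sample's feature values relative to reference material profiled in the same batch, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function name `ratio_scale`, the dictionary layout, the use of a plain ratio to the mean of the reference replicates, and the toy numbers are all hypothetical.

```python
def ratio_scale(batch, reference_ids):
    """Divide each study sample's feature values by the mean of the
    reference-material samples profiled in the same batch.

    batch: dict mapping sample id -> dict mapping feature name -> value
    reference_ids: ids of the reference-material samples in this batch
    """
    features = list(next(iter(batch.values())).keys())
    ref_mean = {
        f: sum(batch[r][f] for r in reference_ids) / len(reference_ids)
        for f in features
    }
    return {
        sample_id: {f: values[f] / ref_mean[f] for f in features}
        for sample_id, values in batch.items()
        if sample_id not in reference_ids
    }


# Toy example: the same study sample measured in two batches, where batch 2
# carries a 3x multiplicative batch effect on every feature.
batch1 = {"ref": {"g1": 10.0, "g2": 4.0}, "s1": {"g1": 20.0, "g2": 2.0}}
batch2 = {"ref": {"g1": 30.0, "g2": 12.0}, "s1": {"g1": 60.0, "g2": 6.0}}

scaled1 = ratio_scale(batch1, ["ref"])
scaled2 = ratio_scale(batch2, ["ref"])
```

Because the reference material experiences the same batch effect as the study samples, the ratio cancels it here. The scaled profiles of `s1` are identical across the two batches, even though the batch labels could not have been modelled separately from biology.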
Affiliation(s)
- Ying Yu
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Naixin Zhang
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Yuanbang Mai
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Luyao Ren
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Qiaochu Chen
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Zehui Cao
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Qingwang Chen
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Yaqing Liu
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Wanwan Hou
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
| | - Jingcheng Yang
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China
- Greater Bay Area Institute of Precision Medicine, Guangzhou, Guangdong, China
| | - Huixiao Hong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
| | - Joshua Xu
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
| | - Weida Tong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
| | | | - Leming Shi
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China.
- International Human Phenome Institutes, Shanghai, China.
| | - Xiang Fang
- National Institute of Metrology, Beijing, China.
| | - Yuanting Zheng
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Shanghai Cancer Center, Fudan University, Shanghai, China.
| |
Collapse
|
17
|
Weyde KVF, Winterton A, Surén P, Andersen GL, Vik T, Biele G, Knutsen HK, Thomsen C, Meltzer HM, Skogheim TS, Engel SM, Aase H, Villanger GD. Association between gestational levels of toxic metals and essential elements and cerebral palsy in children. Front Neurol 2023; 14:1124943. [PMID: 37662050 PMCID: PMC10470125 DOI: 10.3389/fneur.2023.1124943] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2022] [Accepted: 07/11/2023] [Indexed: 09/05/2023]
Abstract
Introduction Cerebral palsy (CP) is the most common motor disability in childhood, but its causes are only partly known. Early-life exposure to toxic metals and inadequate or excess amounts of essential elements can adversely affect brain and nervous system development. However, little is known about these as perinatal risk factors for CP. This study aims to investigate the associations of second-trimester maternal blood levels of toxic metals, essential elements, and mixtures thereof with CP diagnoses in children. Methods In a large, population-based prospective birth cohort (The Norwegian Mother, Father, and Child Cohort Study), children with CP diagnoses were identified through The Norwegian Patient Registry and Cerebral Palsy Registry of Norway. One hundred forty-four children with CP and 1,082 controls were included. The relationships between maternal blood concentrations of five toxic metals and six essential elements and CP diagnoses were investigated using mixture approaches: elastic net with stability selection to identify important metals/elements in the mixture in relation to CP; then logistic regressions of the selected metals/elements to estimate the odds ratio (OR) of CP and two-way interactions among metals/elements and with child sex and maternal education. Finally, the joint effects of the mixtures on CP diagnoses were estimated using quantile-based g-computation analyses. Results The essential elements manganese and copper, as well as the toxic metal mercury (Hg), were the most important in relation to CP. Elevated maternal levels of copper (OR = 1.40) and manganese (OR = 1.20) were associated with increased risk of CP, while Hg levels were, counterintuitively, inversely related to CP. Metal/element interactions associated with CP were observed, and sex and maternal education influenced the relationships between metals/elements and CP.
In the joint mixture approach no significant association between the mixture of metals/elements and CP (OR = 1.00, 95% CI = [0.67, 1.50]) was identified. Conclusion Using mixture approaches, elevated levels of copper and manganese measured in maternal blood during the second trimester could be related to increased risk of CP in children. The inverse associations between maternal Hg and CP could reflect Hg as a marker of maternal fish intake and thus nutrients beneficial for foetal brain development.
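For readers less familiar with how odds ratios such as those quoted above relate to a fitted logistic model, the sketch below shows the usual conversion from a coefficient and its standard error to an OR with a Wald confidence interval. The function name `odds_ratio_ci` and the input values are illustrative assumptions, not numbers or code from the study, and the study may have derived its intervals differently.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Convert a logistic-regression coefficient and its standard error
    into an odds ratio with a Wald confidence interval (95% when z=1.96)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Illustrative inputs only; these are not values taken from the study.
or_, lower, upper = odds_ratio_ci(beta=0.0, se=0.2)
```

A coefficient of zero maps to an OR of 1, and the interval is symmetric on the log scale, which is why reported intervals around an OR of 1 look asymmetric on the natural scale.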
Affiliation(s)
- Kjell Vegard F. Weyde: Division of Mental and Physical Health, Norwegian Institute of Public Health, Oslo, Norway
- Adriano Winterton: Division of Mental and Physical Health, Norwegian Institute of Public Health, Oslo, Norway
- Pål Surén: Division of Mental and Physical Health, Norwegian Institute of Public Health, Oslo, Norway
- Guro L. Andersen: Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
- Torstein Vik: Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
- Guido Biele: Division of Mental and Physical Health, Norwegian Institute of Public Health, Oslo, Norway
- Helle K. Knutsen: Division of Infection Control and Environmental Health, Norwegian Institute of Public Health, Oslo, Norway
- Cathrine Thomsen: Division of Infection Control and Environmental Health, Norwegian Institute of Public Health, Oslo, Norway
- Helle M. Meltzer: Division of Infection Control and Environmental Health, Norwegian Institute of Public Health, Oslo, Norway
- Thea S. Skogheim: Division of Mental and Physical Health, Norwegian Institute of Public Health, Oslo, Norway
- Stephanie M. Engel: Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Heidi Aase: Division of Mental and Physical Health, Norwegian Institute of Public Health, Oslo, Norway
- Gro D. Villanger: Division of Mental and Physical Health, Norwegian Institute of Public Health, Oslo, Norway
|
18
|
Rahman MA, Tutul AA, Sharmin M, Bayzid MS. BEENE: deep learning-based nonlinear embedding improves batch effect estimation. Bioinformatics 2023; 39:btad479. [PMID: 37561107 PMCID: PMC10448987 DOI: 10.1093/bioinformatics/btad479] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2022] [Revised: 06/11/2023] [Accepted: 08/09/2023] [Indexed: 08/11/2023]
Abstract
MOTIVATION Analyzing large-scale single-cell transcriptomic datasets generated using different technologies is challenging due to the presence of batch-specific systematic variations known as batch effects. Since biological and technological differences are often interspersed, detecting and accounting for batch effects in RNA-seq datasets is critical for effective data integration and interpretation. Low-dimensional embeddings, such as principal component analysis (PCA), are widely used for visual inspection and estimation of batch effects. Linear dimensionality reduction methods like PCA are effective in assessing the presence of batch effects, especially when batch effects exhibit linear patterns. However, batch effects are inherently complex, and existing linear dimensionality reduction methods can be inadequate and imprecise in the presence of sophisticated nonlinear batch effects. RESULTS We present Batch Effect Estimation using Nonlinear Embedding (BEENE), a deep nonlinear auto-encoder network specially tailored to generate an alternative lower-dimensional embedding suitable for both linear and nonlinear batch effects. BEENE simultaneously learns the batch and biological variables from RNA-seq data, resulting in an embedding that is more robust and sensitive than a PCA embedding in detecting and quantifying batch effects. BEENE was assessed on a collection of carefully controlled simulated datasets as well as biological datasets, including two technical replicates of mouse embryogenesis cells, peripheral blood mononuclear cells from three largely different experiments, and five studies of pancreatic islet cells. AVAILABILITY AND IMPLEMENTATION BEENE is freely available as an open source project at https://github.com/ashiq24/BEENE.
Affiliation(s)
- Md Ashiqur Rahman: Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh; Department of Computer Science, Purdue University, West Lafayette, IN 47907, United States
- Abdullah Aman Tutul: Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh; Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77843, United States
- Mahfuza Sharmin: Department of Genetics, Stanford University, Stanford, CA 94305, United States
- Md Shamsuzzoha Bayzid: Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
|
19
|
Gebre RK, Senjem ML, Raghavan S, Schwarz CG, Gunter JL, Hofrenning EI, Reid RI, Kantarci K, Graff-Radford J, Knopman DS, Petersen RC, Jack CR, Vemuri P. Cross-scanner harmonization methods for structural MRI may need further work: A comparison study. Neuroimage 2023; 269:119912. [PMID: 36731814 PMCID: PMC10170652 DOI: 10.1016/j.neuroimage.2023.119912] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 10/10/2022] [Revised: 01/26/2023] [Accepted: 01/28/2023] [Indexed: 02/01/2023]
Abstract
The clinical usefulness of MRI biomarkers for aging and dementia studies relies on precise brain morphological measurements; however, scanner and/or protocol variations may introduce noise or bias. One approach to address this is post-acquisition scan harmonization. In this work, we evaluate deep learning (neural style transfer (NST), CycleGAN and CGAN), histogram matching (HM), and statistical (ComBat and LongComBat) methods. Data came from participants scanned on both GE and Siemens scanners: cross-sectional participants, known as Crossover (n = 113), and participants scanned longitudinally on both scanners (n = 454). The goal was to match GE MPRAGE (T1-weighted) scans to Siemens improved resolution MPRAGE scans. Harmonization was performed on raw native and preprocessed (resampled, affine transformed to template space) scans. Cortical thicknesses were measured using FreeSurfer (v.7.1.1). Distributions were checked using Kolmogorov-Smirnov tests. Intra-class correlation (ICC) was used to assess the degree of agreement in the Crossover datasets, and annualized percent change in cortical thickness was calculated to evaluate the Longitudinal datasets. Prior to harmonization, the least agreement was found at the frontal pole (ICC = 0.72) for the raw native scans, and at the caudal anterior cingulate (0.76) and frontal pole (0.54) for the preprocessed scans. Harmonization with NST, CycleGAN, and HM improved the ICCs of the preprocessed scans at the caudal anterior cingulate (>0.81) and frontal pole (>0.67). In the Longitudinal raw native scans, over- and under-estimations of cortical thickness were observed due to the change of scanners. ComBat matched the cortical thickness distributions throughout but was not able to increase the ICCs or remove the effects of scanner changeover in the Longitudinal datasets. CycleGAN and NST performed slightly better at addressing the cortical thickness variations across the scanner change.
However, none of the methods succeeded in harmonizing the Longitudinal dataset. CGAN was the worst performer for both datasets. In conclusion, the performance of the methods was overall similar and region dependent. Future research is needed to improve the existing approaches, since none of them outperformed the others in harmonizing the datasets at all ROIs. The findings of this study establish a framework for future research into the scan harmonization problem.
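The abstract reports agreement between scanners as an intra-class correlation but does not say which ICC form was used. As a sketch only, the following computes ICC(2,1) (two-way random effects, absolute agreement, single measure), which is one common choice for paired scanner measurements; the function name and the toy thickness values are assumptions, not data from the study.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    ratings: list of rows, one per subject, each with one value per scanner.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for subjects (rows), scanners (columns) and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical cortical thickness values (mm) for three subjects, two scanners.
identical = [[2.0, 2.0], [2.5, 2.5], [3.0, 3.0]]
offset = [[2.0, 2.5], [2.5, 3.0], [3.0, 3.5]]  # second scanner reads 0.5 mm high
icc_identical = icc_2_1(identical)
icc_offset = icc_2_1(offset)
```

Because ICC(2,1) measures absolute agreement, a constant offset between scanners lowers it even when subjects are ranked identically. A consistency-type ICC would not penalize that offset, so the choice of form matters when comparing harmonization methods.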
Affiliation(s)
- Robel K Gebre: Department of Radiology, Mayo Clinic, Rochester, MN 55905, USA
- Matthew L Senjem: Department of Radiology, Mayo Clinic, Rochester, MN 55905, USA; Department of Information Technology, Mayo Clinic, Rochester, MN 55905, USA
- Robert I Reid: Department of Information Technology, Mayo Clinic, Rochester, MN 55905, USA
- Kejal Kantarci: Department of Radiology, Mayo Clinic, Rochester, MN 55905, USA
- David S Knopman: Department of Neurology, Mayo Clinic, Rochester, MN 55905, USA
- Clifford R Jack: Department of Radiology, Mayo Clinic, Rochester, MN 55905, USA
|
20
|
Yosef A, Shnaider E, Schneider M, Gurevich M. Heuristic normalization procedure for batch effect correction. Soft Comput 2023. [DOI: 10.1007/s00500-023-08049-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/31/2023]
|
21
|
Carry PM, Vigers T, Vanderlinden LA, Keeter C, Dong F, Buckner T, Litkowski E, Yang I, Norris JM, Kechris K. Propensity scores as a novel method to guide sample allocation and minimize batch effects during the design of high throughput experiments. BMC Bioinformatics 2023; 24:86. [PMID: 36882691 PMCID: PMC9990331 DOI: 10.1186/s12859-023-05202-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2022] [Accepted: 02/22/2023] [Indexed: 03/09/2023]
Abstract
BACKGROUND We developed a novel approach to minimize batch effects when assigning samples to batches. Our algorithm selects a batch allocation, among all possible ways of assigning samples to batches, that minimizes differences in average propensity score between batches. This strategy was compared to randomization and stratified randomization in a case-control study (30 per group) with a covariate (case vs control, represented as β1, set to be null) and two biologically relevant confounding variables (age, represented as β2, and hemoglobin A1c (HbA1c), represented as β3). Gene expression values were obtained from a publicly available dataset of pancreas islet cell expression data. Batch effects were simulated as twice the median biological variation across the gene expression dataset and were added to the publicly available dataset to simulate a batch effect condition. Bias was calculated as the absolute difference between observed betas under the batch allocation strategies and the true beta (no batch effects). Bias was also evaluated after adjustment for batch effects using ComBat as well as a linear regression model. To understand the performance of our optimal allocation strategy under the alternative hypothesis, we also evaluated bias at a single gene associated with both age and HbA1c levels in the 'true' dataset (CAPN13 gene). RESULTS Pre-batch correction, under the null hypothesis (β1), maximum absolute bias and root mean square (RMS) of maximum absolute bias were minimized using the optimal allocation strategy. Under the alternative hypothesis (β2 and β3 for the CAPN13 gene), maximum absolute bias and RMS of maximum absolute bias were also consistently lower using the optimal allocation strategy. ComBat and the regression batch adjustment methods performed well, as the bias estimates moved towards the true values in all conditions under both the null and alternative hypotheses.
Although the differences between methods were less pronounced following batch correction, estimates of bias (average and RMS) were consistently lower using the optimal allocation strategy under both the null and alternative hypotheses. CONCLUSIONS Our algorithm provides an extremely flexible and effective method for assigning samples to batches by exploiting knowledge of covariates prior to sample allocation.
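The allocation rule described in this abstract (pick the assignment that minimizes the difference in average propensity score between batches) can be sketched for the simplest case of two equal-size batches. This is not the authors' algorithm or code: the propensity scores are assumed to be estimated already, the function name is hypothetical, and exhaustive enumeration is only practical for small sample counts.

```python
from itertools import combinations


def best_two_batch_split(scores):
    """Exhaustively split samples into two equal batches so that the
    absolute difference in mean propensity score is as small as possible.

    scores: one precomputed propensity score per sample (even count).
    Returns (indices of batch 1, indices of batch 2, difference in means).
    """
    n = len(scores)
    half = n // 2
    total = sum(scores)
    best = None
    for batch1 in combinations(range(n), half):
        if 0 not in batch1:  # keep sample 0 in batch 1 to skip mirror splits
            continue
        s1 = sum(scores[i] for i in batch1)
        diff = abs(s1 / half - (total - s1) / (n - half))
        if best is None or diff < best[2]:
            batch2 = tuple(i for i in range(n) if i not in batch1)
            best = (batch1, batch2, diff)
    return best


# Hypothetical propensity scores for six samples.
scores = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
batch1, batch2, diff = best_two_batch_split(scores)
```

For realistic study sizes the number of candidate allocations grows too quickly to enumerate, so a randomized or heuristic search over allocations would be needed instead; that choice is outside what the abstract specifies.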
Affiliation(s)
- Patrick M Carry: Colorado Program for Musculoskeletal Research, Department of Orthopedics, University of Colorado Anschutz Medical Campus, 12631 E. 17th Ave, Room 4602, Mail Stop B202, Aurora, CO 80045, USA; Department of Epidemiology, Colorado School of Public Health, Aurora, CO, USA
- Tim Vigers: Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, USA; Barbara Davis Center for Diabetes, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Lauren A Vanderlinden: Department of Epidemiology, Colorado School of Public Health, Aurora, CO, USA; Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, USA
- Carson Keeter: Colorado Program for Musculoskeletal Research, Department of Orthopedics, University of Colorado Anschutz Medical Campus, 12631 E. 17th Ave, Room 4602, Mail Stop B202, Aurora, CO 80045, USA
- Fran Dong: Barbara Davis Center for Diabetes, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Teresa Buckner: Department of Epidemiology, Colorado School of Public Health, Aurora, CO, USA
- Elizabeth Litkowski: Department of Epidemiology, Colorado School of Public Health, Aurora, CO, USA
- Ivana Yang: Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Jill M Norris: Department of Epidemiology, Colorado School of Public Health, Aurora, CO, USA
- Katerina Kechris: Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, USA
|
22
|
The importance of batch sensitization in missing value imputation. Sci Rep 2023; 13:3003. [PMID: 36810890 PMCID: PMC9944322 DOI: 10.1038/s41598-023-30084-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 02/05/2022] [Accepted: 02/15/2023] [Indexed: 02/23/2023]
Abstract
Data analysis is complex due to a myriad of technical problems. Amongst these, missing values and batch effects are endemic. Although many methods have been developed for missing value imputation (MVI) and batch correction respectively, no study has directly considered the confounding impact of MVI on downstream batch correction. This is surprising as missing values are imputed during early pre-processing while batch effects are mitigated during late pre-processing, prior to functional analysis. Unless actively managed, MVI approaches generally ignore the batch covariate, with unknown consequences. We examine this problem by modelling three simple imputation strategies: global (M1), self-batch (M2) and cross-batch (M3) first via simulations, and then corroborated on real proteomics and genomics data. We report that explicit consideration of batch covariates (M2) is important for good outcomes, resulting in enhanced batch correction and lower statistical errors. However, M1 and M3 are error-generating: global and cross-batch averaging may result in batch-effect dilution, with concomitant and irreversible increase in intra-sample noise. This noise is unremovable via batch correction algorithms and produces false positives and negatives. Hence, careless imputation in the presence of non-negligible covariates such as batch effects should be avoided.
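The three strategies named in this abstract can be made concrete with a toy sketch. Mean imputation and the exact reading of each strategy (M1 uses all observed values, M2 only values from the same batch, M3 only values from other batches) are assumptions made here for illustration; the paper's own procedures may differ.

```python
def impute(values, batches, strategy):
    """Fill None entries of one feature with a mean chosen by strategy.

    strategy: "global" (M1), "self" (M2) or "cross" (M3).
    (Hypothetical helper; mean imputation is an assumption.)
    """
    def mean(xs):
        return sum(xs) / len(xs)

    observed = [(v, b) for v, b in zip(values, batches) if v is not None]
    filled = []
    for v, b in zip(values, batches):
        if v is not None:
            filled.append(v)
        elif strategy == "global":
            filled.append(mean([x for x, _ in observed]))
        elif strategy == "self":
            filled.append(mean([x for x, xb in observed if xb == b]))
        elif strategy == "cross":
            filled.append(mean([x for x, xb in observed if xb != b]))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return filled


# Toy feature with a batch shift: batch "A" sits near 1, batch "B" near 5.
values = [1.0, None, 1.0, 5.0, 5.0, 5.0]
batches = ["A", "A", "A", "B", "B", "B"]
m1 = impute(values, batches, "global")[1]
m2 = impute(values, batches, "self")[1]
m3 = impute(values, batches, "cross")[1]
```

In this toy case only the self-batch fill keeps the imputed sample at its own batch's level. The global and cross-batch fills pull it toward the other batch, which is the kind of within-batch distortion the abstract says later batch correction cannot undo.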
|
23
|
Huang EP, Pennello G, deSouza NM, Wang X, Buckler AJ, Kinahan PE, Barnhart HX, Delfino JG, Hall TJ, Raunig DL, Guimaraes AR, Obuchowski NA. Multiparametric Quantitative Imaging in Risk Prediction: Recommendations for Data Acquisition, Technical Performance Assessment, and Model Development and Validation. Acad Radiol 2023; 30:196-214. [PMID: 36273996 PMCID: PMC9825642 DOI: 10.1016/j.acra.2022.09.018] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/02/2022] [Revised: 09/12/2022] [Accepted: 09/17/2022] [Indexed: 01/11/2023]
Abstract
Combinations of multiple quantitative imaging biomarkers (QIBs) are often able to predict the likelihood of an event of interest such as death or disease recurrence more effectively than single imaging measurements can alone. The development of such multiparametric quantitative imaging and evaluation of its fitness of use differs from the analogous processes for individual QIBs in several key aspects. A computational procedure to combine the QIB values into a model output must be specified. The output must also be reproducible and be shown to have reasonably strong ability to predict the risk of an event of interest. Attention must be paid to statistical issues not often encountered in the single QIB scenario, including overfitting and bias in the estimates of model performance. This is the fourth in a five-part series on statistical methodology for assessing the technical performance of multiparametric quantitative imaging. Considerations for data acquisition are discussed and recommendations from the literature on methodology to construct and evaluate QIB-based models for risk prediction are summarized. The findings in the literature upon which these recommendations are based are demonstrated through simulation studies. The concepts in this manuscript are applied to a real-life example involving prediction of major adverse cardiac events using automated plaque analysis.
Affiliation(s)
- Erich P Huang: Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, MSC 9735, Bethesda, MD 20892-9735
- Gene Pennello: Center for Devices and Radiological Health, US Food and Drug Administration
- Nandita M deSouza: Division of Radiotherapy and Imaging, The Institute of Cancer Research (London, UK), European Imaging Biomarkers Alliance
- Xiaofeng Wang: Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic Foundation
- Jana G Delfino: Center for Devices and Radiological Health, US Food and Drug Administration
- Timothy J Hall: Department of Medical Physics, University of Wisconsin, Madison
- David L Raunig: Data Science Institute, Statistical and Quantitative Sciences, Takeda
- Nancy A Obuchowski: Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic Foundation
|
24
|
Huang EP, O'Connor JPB, McShane LM, Giger ML, Lambin P, Kinahan PE, Siegel EL, Shankar LK. Criteria for the translation of radiomics into clinically useful tests. Nat Rev Clin Oncol 2023; 20:69-82. [PMID: 36443594 PMCID: PMC9707172 DOI: 10.1038/s41571-022-00707-0] [Citation(s) in RCA: 114] [Impact Index Per Article: 57.0] [Accepted: 11/02/2022] [Indexed: 11/29/2022]
Abstract
Computer-extracted tumour characteristics have been incorporated into medical imaging computer-aided diagnosis (CAD) algorithms for decades. With the advent of radiomics, an extension of CAD involving high-throughput computer-extracted quantitative characterization of healthy or pathological structures and processes as captured by medical imaging, interest in such computer-extracted measurements has increased substantially. However, despite the thousands of radiomic studies, the number of settings in which radiomics has been successfully translated into a clinically useful tool or has obtained FDA clearance is comparatively small. This relative dearth might be attributable to factors such as the varying imaging and radiomic feature extraction protocols used from study to study, the numerous potential pitfalls in the analysis of radiomic data, and the lack of studies showing that acting upon a radiomic-based tool leads to a favourable benefit-risk balance for the patient. Several guidelines on specific aspects of radiomic data acquisition and analysis are already available, although a similar roadmap for the overall process of translating radiomics into tools that can be used in clinical care is needed. Herein, we provide 16 criteria for the effective execution of this process in the hopes that they will guide the development of more clinically useful radiomic tests in the future.
Affiliation(s)
- Erich P Huang: Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Rockville, MD, USA
- James P B O'Connor: Division of Radiotherapy and Imaging, Institute of Cancer Research, London, UK
- Lisa M McShane: Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Rockville, MD, USA
- Philippe Lambin: Department of Precision Medicine, Maastricht University, Maastricht, Netherlands
- Paul E Kinahan: Department of Radiology, University of Washington, Seattle, WA, USA
- Eliot L Siegel: Department of Diagnostic Radiology, University of Maryland, Baltimore, MD, USA
- Lalitha K Shankar: Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Rockville, MD, USA
|
25
|
Fallet M, Blanc M, Di Criscio M, Antczak P, Engwall M, Guerrero Bosagna C, Rüegg J, Keiter SH. Present and future challenges for the investigation of transgenerational epigenetic inheritance. Environment International 2023; 172:107776. [PMID: 36731188 DOI: 10.1016/j.envint.2023.107776] [Citation(s) in RCA: 25] [Impact Index Per Article: 12.5] [Received: 10/24/2022] [Revised: 01/18/2023] [Accepted: 01/23/2023] [Indexed: 06/18/2023]
Abstract
Epigenetic pathways are essential in different biological processes and in phenotype-environment interactions in response to different stressors, and they can induce phenotypic plasticity. They encompass several processes that are mitotically and, in some cases, meiotically heritable, so they can be transferred to subsequent generations via the germline. Transgenerational Epigenetic Inheritance (TEI) describes the phenomenon whereby phenotypic traits, such as changes in fertility, metabolic function, or behavior, induced by environmental factors (e.g., parental care, pathogens, pollutants, climate change), can be transferred to offspring generations via epigenetic mechanisms. Investigations of TEI contribute to deciphering the role of epigenetic mechanisms in adaptation, adversity, and evolution. However, the molecular mechanisms underlying the transmission of epigenetic changes between generations, and the downstream chain of events leading to persistent phenotypic changes, remain unclear. Therefore, the intergenerational (transmission of information between parental and offspring generations via direct exposure) and transgenerational (transmission of information through several generations after the triggering factor has disappeared) consequences of epigenetic modifications remain major issues in modern biology. In this article, we review and describe the major gaps and issues still encountered in the TEI field: the general challenges faced in epigenetic research; deciphering the key epigenetic mechanisms in inheritance processes; identifying the relevant drivers for TEI; and implementing a collaborative and multi-disciplinary approach to study TEI. Finally, we provide suggestions on how to overcome these challenges and ultimately identify the specific contribution of epigenetics to transgenerational inheritance and use the correct tools for environmental science investigation and biomarker identification.
Affiliation(s)
- Manon Fallet: Man-Technology-Environment Research Centre (MTM), School of Science and Technology, Örebro University, Fakultetsgatan 1, 70182 Örebro, Sweden; Department of Biochemistry, Dorothy Crowfoot Hodgkin Building, University of Oxford, South Parks Rd, Oxford OX1 3QU, United Kingdom
- Mélanie Blanc: MARBEC, Univ Montpellier, CNRS, Ifremer, IRD, INRAE, Palavas, France
- Michela Di Criscio: Department of Organismal Biology, Uppsala University, Norbyv. 18A, 75236 Uppsala, Sweden
- Philipp Antczak: University of Cologne, Faculty of Medicine and Cologne University Hospital, Center for Molecular Medicine Cologne, Germany; Excellence Cluster on Cellular Stress Responses in Aging Associated Diseases, University of Cologne, Cologne, Germany
- Magnus Engwall: Man-Technology-Environment Research Centre (MTM), School of Science and Technology, Örebro University, Fakultetsgatan 1, 70182 Örebro, Sweden
- Joëlle Rüegg: Department of Organismal Biology, Uppsala University, Norbyv. 18A, 75236 Uppsala, Sweden
- Steffen H Keiter: Man-Technology-Environment Research Centre (MTM), School of Science and Technology, Örebro University, Fakultetsgatan 1, 70182 Örebro, Sweden
|
26
|
Yosef A, Shnaider E, Schneider M, Gurevich M. Normalization of Large-Scale Transcriptome Data Using Heuristic Methods. Bioinform Biol Insights 2023; 17:11779322231160397. [PMID: 37020503 PMCID: PMC10068970 DOI: 10.1177/11779322231160397] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2022] [Accepted: 02/09/2023] [Indexed: 04/03/2023]
Abstract
In this study, we introduce an artificial intelligence method for addressing the batch effect in transcriptome data. The method has several clear advantages over the alternative methods presently in use. Batch effect refers to the discrepancy between gene expression data series measured under different conditions. While data from the same batch (measurements performed under the same conditions) are compatible, combining various batches into one data set is problematic because the measurements are incompatible. Therefore, it is necessary to correct the combined data (normalization) before performing biological analysis. Numerous methods attempt to correct a data set for batch effect. These methods rely on various assumptions regarding the distribution of the measurements. Forcing the data elements into a presupposed distribution can severely distort biological signals, thus leading to incorrect results and conclusions. The wider the discrepancy between the assumed and the actual data distribution, the greater the biases introduced by such “correction methods”. We introduce a heuristic method to reduce batch effect. The method does not rely on any assumptions regarding the distribution and the behavior of data elements. Hence, it does not introduce any new biases in the process of correcting the batch effect. It strictly maintains the integrity of measurements within the original batches.
|
27
|
Gregori J, Sánchez À, Villanueva J. msmsEDA & msmsTests: Label-Free Differential Expression by Spectral Counts. Methods Mol Biol 2023; 2426:197-242. [PMID: 36308691 DOI: 10.1007/978-1-0716-1967-4_10] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/16/2023]
Abstract
msmsTests is an R/Bioconductor package providing functions for statistical tests in label-free LC-MS/MS data by spectral counts. These functions aim at discovering differentially expressed proteins between two biological conditions. Three tests are available: Poisson GLM regression, quasi-likelihood GLM regression, and the negative binomial of the edgeR package. The three models admit blocking factors to control for nuisance variables. To ensure a good level of reproducibility, a post-test filter is available in which (1) a minimum effect size considered biologically relevant and (2) a minimum expression in the most abundant condition may be set. A companion package, msmsEDA, provides functions to explore datasets based on MS/MS spectral counts. The provided graphics help in identifying outliers, detecting possible batch factors, and checking the effects of different normalization strategies. This protocol illustrates the use of both packages on two examples: a purely spike-in experiment of 48 human proteins in a standard yeast cell lysate, and a cancer cell-line secretome dataset requiring a biological normalization.
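The logic of a post-test filter like the one described here can be sketched in a few lines. The sketch is in Python and is not the msmsTests API: the function name, parameter names, and default thresholds are hypothetical, chosen only to show how a significance cutoff, a minimum effect size, and a minimum abundance in the most abundant condition combine.

```python
import math


def passes_post_test_filter(mean_a, mean_b, adj_p,
                            alpha=0.05, min_log2fc=1.0, min_counts=2.0):
    """Keep a protein only if it is significant, abundant enough in its
    most abundant condition, and changes by at least min_log2fc.

    mean_a, mean_b: mean spectral counts per condition (non-negative)
    adj_p: multiplicity-adjusted p-value from the chosen test
    (All names and defaults are hypothetical, not package settings.)
    """
    if adj_p > alpha:
        return False
    if max(mean_a, mean_b) < min_counts:
        return False
    low, high = sorted((mean_a, mean_b))
    if low == 0:
        return True  # present in one condition only: unbounded fold change
    return math.log2(high / low) >= min_log2fc


kept = passes_post_test_filter(10.0, 2.0, adj_p=0.01)
too_sparse = passes_post_test_filter(1.5, 0.2, adj_p=0.001)
too_small = passes_post_test_filter(10.0, 9.0, adj_p=0.01)
not_significant = passes_post_test_filter(10.0, 2.0, adj_p=0.2)
```

The abundance condition matters for spectral counts in particular, since a large fold change between two very small counts can be statistically significant yet hard to reproduce.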
Affiliation(s)
- Josep Gregori
  - Vall Hebron Research Institute (VHIR), Barcelona, Spain.
- Àlex Sánchez
  - VHIR, Barcelona, Spain
  - Department of Genetics Statistics and Microbiology, UB, Barcelona, Spain
- Josep Villanueva
  - Tumor Biomarkers Lab, Vall Hebron Institute of Oncology, Barcelona, Spain
28
Finney CA, Delerue F, Gold WA, Brown DA, Shvetcov A. Artificial intelligence-driven meta-analysis of brain gene expression identifies novel gene candidates and a role for mitochondria in Alzheimer's disease. Comput Struct Biotechnol J 2022; 21:388-400. [PMID: 36618979 PMCID: PMC9798142 DOI: 10.1016/j.csbj.2022.12.018] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 10/10/2022] [Revised: 12/11/2022] [Accepted: 12/12/2022] [Indexed: 12/23/2022]
Abstract
Alzheimer's disease (AD) is the most common form of dementia. There is no treatment and AD models have focused on a small subset of genes identified in familial AD. Microarray studies have identified thousands of dysregulated genes in the brains of patients with AD yet identifying the best gene candidates to both model and treat AD remains a challenge. We performed a meta-analysis of microarray data from the frontal cortex (n = 697) and cerebellum (n = 230) of AD patients and healthy controls. A two-stage artificial intelligence approach, with both unsupervised and supervised machine learning, combined with a functional network analysis was used to identify functionally connected and biologically relevant novel gene candidates in AD. We found that in the frontal cortex, genes involved in mitochondrial energy, ATP, and oxidative phosphorylation, were the most significant dysregulated genes. In the cerebellum, dysregulated genes were involved in mitochondrial cellular biosynthesis (mitochondrial ribosomes). Although there was little overlap between dysregulated genes between the frontal cortex and cerebellum, machine learning models comprised of this overlap. A further functional network analysis of these genes identified that two downregulated genes, ATP5L and ATP5H, which both encode subunits of ATP synthase (mitochondrial complex V) may play a role in AD. Combined, our results suggest that mitochondrial dysfunction, particularly a deficit in energy homeostasis, may play an important role in AD.
Affiliation(s)
- Caitlin A. Finney
  - Neuroinflammation Research Group, Centre for Immunology and Allergy Research, Westmead Institute for Medical Research, Sydney, Australia
  - School of Medical Sciences, Faculty of Medicine and Health, The University of Sydney, Sydney, Australia
  - Correspondence to: 176 Hawkesbury Rd, Westmead, NSW, Australia.
- Fabien Delerue
  - Dementia Research Centre, Macquarie Medical School, Faculty of Medicine, Health and Human Sciences, Macquarie University, Sydney, Australia
- Wendy A. Gold
  - School of Medical Sciences, Faculty of Medicine and Health, The University of Sydney, Sydney, Australia
  - Molecular Neurobiology Research Laboratory, Kids Research, Children’s Hospital at Westmead and the Children’s Medical Research Institute, Westmead, Australia
  - Kids Neuroscience Centre, Kids Research, Children’s Hospital at Westmead, Westmead, Australia
- David A. Brown
  - Neuroinflammation Research Group, Centre for Immunology and Allergy Research, Westmead Institute for Medical Research, Sydney, Australia
  - Department of Immunopathology, Institute for Clinical Pathology and Medical Research-New South Wales Health Pathology, Westmead Hospital, Sydney, Australia
  - Westmead Clinical School, Faculty of Medicine and Health, The University of Sydney, Sydney, Australia
- Artur Shvetcov
  - Black Dog Institute, Sydney, Australia
  - School of Psychiatry, Faculty of Medicine, University of New South Wales, Sydney, Australia
  - Correspondence to: Hospital Rd., Randwick, NSW, Australia.
29
Abstract
Applying computational statistics or machine learning methods to data is a key component of many scientific studies in any field, but alone it might not be sufficient to generate robust and reliable outcomes and results. Before applying any discovery method, preprocessing steps are necessary to prepare the data for the computational analysis. In this framework, data cleaning and feature engineering are key pillars of any scientific study involving data analysis, and they should be adequately designed and performed from the first phases of the project. We call "feature" a variable describing a particular trait of a person or an observation, usually recorded as a column in a dataset. Although pivotal, these data cleaning and feature engineering steps are sometimes done poorly or inefficiently, especially by beginners and inexperienced researchers. For this reason, we propose here our quick tips for data cleaning and feature engineering on how to carry out these important preprocessing steps correctly, avoiding common mistakes and pitfalls. Although we designed these guidelines with bioinformatics and health informatics scenarios in mind, we believe they can more generally be applied to any scientific area. We therefore target these guidelines to any researchers or practitioners wanting to perform data cleaning or feature engineering. We believe our simple recommendations can help researchers and scholars perform better computational analyses that can lead, in turn, to more solid outcomes and more reliable discoveries.
Affiliation(s)
- Davide Chicco
  - Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
- Luca Oneto
  - Dipartimento di Informatica Bioingegneria Robotica e Ingegneria dei Sistemi, Università di Genova, Genoa, Italy
  - ZenaByte S.r.l., Genoa, Italy
- Erica Tavazzi
  - Dipartimento di Ingegneria dell’Informazione, Università di Padova, Padua, Italy
30
Lyu Q, Yang Q, Hao J, Yue Y, Wang X, Tian J, An L. A small proportion of X-linked genes contribute to X chromosome upregulation in early embryos via BRD4-mediated transcriptional activation. Curr Biol 2022; 32:4397-4410.e5. [PMID: 36108637 DOI: 10.1016/j.cub.2022.08.059] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/06/2022] [Revised: 06/30/2022] [Accepted: 08/19/2022] [Indexed: 02/08/2023]
Abstract
Females have two X chromosomes and males have only one in most mammals. X chromosome inactivation (XCI) occurs in females to equalize X-dosage between sexes. In addition, mammals balance the dosage between X chromosomes and autosomes via X chromosome upregulation (XCU) to fine-tune X-linked expression and thus maintain genomic homeostasis. Despite some studies highlighting the importance of XCU in somatic cells, little is known about how XCU is achieved and its developmental role during early embryogenesis. Herein, using mouse preimplantation embryos as the model, we report that XCU initially occurs upon major zygotic genome activation and co-regulates X-linked expression in cooperation with imprinted XCI during preimplantation development. An in-depth analysis further indicated, unexpectedly, that only a small proportion of X-linked genes, rather than genes across the whole X chromosome, contribute greatly to XCU. Furthermore, we identified that bromodomain containing 4 (BRD4) plays a key role in the transcriptional activation of XCU during preimplantation development. BRD4 deficiency or inhibition impaired XCU, leading to reduced developmental potential and mitochondrial dysfunction in blastocysts. Our finding was also supported by the tight association between BRD4 dysregulation and XCU disruption in the pathology of cholangiocarcinoma. Thus, our results not only advance the current knowledge of X-dosage compensation and provide a mechanism for understanding XCU initiation but also present an important clue for understanding the developmental and pathological role of XCU.
Affiliation(s)
- Qingji Lyu
  - Key Laboratory of Animal Genetics, Breeding and Reproduction of the Ministry of Agriculture and Rural Affairs, National Engineering Laboratory for Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing 100193, P.R. China
- Qianying Yang
  - Key Laboratory of Animal Genetics, Breeding and Reproduction of the Ministry of Agriculture and Rural Affairs, National Engineering Laboratory for Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing 100193, P.R. China
- Jia Hao
  - Key Laboratory of Animal Genetics, Breeding and Reproduction of the Ministry of Agriculture and Rural Affairs, National Engineering Laboratory for Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing 100193, P.R. China
- Yuan Yue
  - Key Laboratory of Animal Genetics, Breeding and Reproduction of the Ministry of Agriculture and Rural Affairs, National Engineering Laboratory for Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing 100193, P.R. China
- Xiaodong Wang
  - Key Laboratory of Animal Genetics, Breeding and Reproduction of the Ministry of Agriculture and Rural Affairs, National Engineering Laboratory for Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing 100193, P.R. China
- Jianhui Tian
  - Key Laboratory of Animal Genetics, Breeding and Reproduction of the Ministry of Agriculture and Rural Affairs, National Engineering Laboratory for Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing 100193, P.R. China
- Lei An
  - Key Laboratory of Animal Genetics, Breeding and Reproduction of the Ministry of Agriculture and Rural Affairs, National Engineering Laboratory for Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing 100193, P.R. China.
31
Phua SX, Lim KP, Goh WWB. Perspectives for better batch effect correction in mass-spectrometry-based proteomics. Comput Struct Biotechnol J 2022; 20:4369-4375. [PMID: 36051874 PMCID: PMC9411064 DOI: 10.1016/j.csbj.2022.08.022] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 04/14/2022] [Revised: 08/09/2022] [Accepted: 08/09/2022] [Indexed: 11/08/2022]
Abstract
Mass-spectrometry-based proteomics presents some unique challenges for batch effect correction. Batch effects are technical sources of variation that can confound analysis and are usually non-biological in nature. As proteomic analysis involves several stages of data transformation from spectra to protein, the decision on when and to what batch correction should be applied is often unclear. Here, we explore several issues pertinent to batch effect correction. The first involves applications of batch effect correction requiring prior knowledge of batch factors, and exploring data to uncover new or unknown batch factors. The second considers recent literature suggesting that there is no single best batch effect correction algorithm; i.e., instead of a best approach, one may instead ask what a suitable approach is. The third considers issues of batch effect detection. Finally, we look at potential developments for proteomic-specific batch effect correction methods and how to do better functional evaluations on batch-corrected data.
Affiliation(s)
- Ser-Xian Phua
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
  - School of Biological Sciences, Nanyang Technological University, Singapore
- Kai-Peng Lim
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
  - School of Biological Sciences, Nanyang Technological University, Singapore
- Wilson Wen-Bin Goh
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
  - School of Biological Sciences, Nanyang Technological University, Singapore
  - Center for Biomedical Informatics, Nanyang Technological University, Singapore
32
Gargiuli C, De Cecco L, Mariancini A, Iannò MF, Micali A, Mancinelli E, Boeri M, Sozzi G, Dugo M, Sensi M. A Cross-Comparison of High-Throughput Platforms for Circulating MicroRNA Quantification, Agreement in Risk Classification, and Biomarker Discovery in Non-Small Cell Lung Cancer. Front Oncol 2022; 12:911613. [PMID: 35928879 PMCID: PMC9343840 DOI: 10.3389/fonc.2022.911613] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2022] [Accepted: 06/16/2022] [Indexed: 11/13/2022]
Abstract
Background Circulating microRNAs (ct-miRs) are promising cancer biomarkers. This study focuses on platform comparison to assess performance variability, agreement in the assignment of a miR signature classifier (MSC), and concordance for the identification of cancer-associated miRs in plasma samples from non-small cell lung cancer (NSCLC) patients. Methods A plasma cohort of 10 NSCLC patients and 10 healthy donors matched for clinical features and MSC risk level was profiled for miR expression using two sequencing-based and three quantitative reverse transcription PCR (qPCR)-based platforms. Intra- and inter-platform variations were examined by correlation and concordance analysis. The MSC risk levels were compared with those estimated using a reference method. Differentially expressed ct-miRs were identified among NSCLC patients and donors, and the diagnostic value of those dysregulated in patients was assessed by receiver operating characteristic curve analysis. The downregulation of miR-150-5p was verified by qPCR. The Cancer Genome Atlas (TCGA) lung carcinoma dataset was used for validation at the tissue level. Results The intra-platform reproducibility was consistent, whereas the highest values of inter-platform correlations were among qPCR-based platforms. MSC classification concordance was >80% for four platforms. The dysregulation and discriminatory power of miR-150-5p and miR-210-3p were documented. Both were significantly dysregulated also on TCGA tissue-originated profiles from lung cell carcinoma in comparison with normal samples. Conclusion Overall, our studies provide a large performance analysis between five different platforms for miR quantification, indicate the solidity of MSC classifier, and identify two noninvasive biomarkers for NSCLC.
Affiliation(s)
- Chiara Gargiuli
  - Platform of Integrated Biology Unit, Department of Applied Research and Technology Development, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
- Loris De Cecco
  - Platform of Integrated Biology Unit, Department of Applied Research and Technology Development, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
  - *Correspondence: Marialuisa Sensi; Loris De Cecco
- Andrea Mariancini
  - Platform of Integrated Biology Unit, Department of Applied Research and Technology Development, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
- Maria Federica Iannò
  - Platform of Integrated Biology Unit, Department of Applied Research and Technology Development, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
- Arianna Micali
  - Platform of Integrated Biology Unit, Department of Applied Research and Technology Development, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
- Elisa Mancinelli
  - Platform of Integrated Biology Unit, Department of Applied Research and Technology Development, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
- Mattia Boeri
  - Tumor Genomics Unit, Department of Research, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
- Gabriella Sozzi
  - Tumor Genomics Unit, Department of Research, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
- Matteo Dugo
  - Platform of Integrated Biology Unit, Department of Applied Research and Technology Development, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
- Marialuisa Sensi
  - Platform of Integrated Biology Unit, Department of Applied Research and Technology Development, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
  - *Correspondence: Marialuisa Sensi; Loris De Cecco
33
Niu J, Yang J, Guo Y, Qian K, Wang Q. Joint deep learning for batch effect removal and classification toward MALDI MS based metabolomics. BMC Bioinformatics 2022; 23:270. [PMID: 35818047 PMCID: PMC9275160 DOI: 10.1186/s12859-022-04758-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2021] [Accepted: 05/30/2022] [Indexed: 12/02/2022]
Abstract
Background Metabolomics is a primary omics topic, which occupies an important position in both clinical applications and basic research for metabolic signatures and biomarkers. Unfortunately, the relevant studies are challenged by the batch effect caused by many external factors. In the last decade, deep learning has become a dominant tool in data science, such that one may train a diagnosis network on a known batch and then generalize it to a new batch. However, the batch effect inevitably hinders such efforts, as the two batches under consideration can be highly mismatched. Results We propose an end-to-end deep learning framework for joint batch effect removal and classification on metabolomics data. We first validate the proposed deep learning framework on a public CyTOF dataset as a simulated experiment. We also visually compare the t-SNE distribution and demonstrate that our method effectively removes the batch effects in latent space. Then, for a private MALDI MS dataset, we achieve the highest diagnostic accuracy, with an average increase of about 5.1-7.9% over state-of-the-art methods. Conclusions Both experiments indicate that our method performs significantly better in classification than conventional methods, benefitting from the effective removal of batch effect.
Affiliation(s)
- Jingyang Niu
  - School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030, China
- Jing Yang
  - School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030, China
- Yuyu Guo
  - School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030, China
- Kun Qian
  - School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030, China
- Qian Wang
  - School of Biomedical Engineering, ShanghaiTech University, Shanghai, 201210, China.
34
Niu J, Xu W, Wei D, Qian K, Wang Q. Deep Learning Framework for Integrating Multibatch Calibration, Classification, and Pathway Activities. Anal Chem 2022; 94:8937-8946. [PMID: 35709357 DOI: 10.1021/acs.analchem.2c00601] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
Abstract
The amount of available biological data has exploded since the emergence of high-throughput technologies, which is not only revolutionizing the way we recognize molecules and diseases but also bringing novel analytical challenges to bioinformatics analysis. In recent years, deep learning has become a dominant technique in data science. However, classification accuracy is plagued by domain discrepancy. Notably, in the presence of multiple batches, domain discrepancy typically arises between individual batches. Most pairwise adaptation approaches may be suboptimal, as they fail to eliminate external factors across multiple batches while simultaneously taking the classification task into account. We propose a joint deep learning framework for integrating batch effect removal, classification, and downstream pathway activities on biological data. We validate it on two MALDI MS-based metabolomics datasets and achieve the highest diagnostic accuracy (ACC), with a notable ∼10% improvement over other methods. Overall, these results indicate that our approach removes batch effect more effectively than state-of-the-art methods and yields more accurate classification as well as biomarkers for smart diagnosis.
Affiliation(s)
- JingYang Niu
  - School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200030, China
- Wei Xu
  - School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200030, China
- DongMing Wei
  - School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200030, China
- Kun Qian
  - School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200030, China
- Qian Wang
  - School of Biomedical Engineering, ShanghaiTech University, Shanghai 201210, China
35
Tamposis IA, Manios GA, Charitou T, Vennou KE, Kontou PI, Bagos PG. MAGE: An Open-Source Tool for Meta-Analysis of Gene Expression Studies. Biology 2022; 11:895. [PMID: 35741417 PMCID: PMC9220151 DOI: 10.3390/biology11060895] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2022] [Revised: 06/05/2022] [Accepted: 06/08/2022] [Indexed: 11/16/2022]
Abstract
MAGE (Meta-Analysis of Gene Expression) is a Python open-source software package designed to perform meta-analysis and functional enrichment analysis of gene expression data. We incorporate standard methods for the meta-analysis of gene expression studies, bootstrap standard errors, corrections for multiple testing, and meta-analysis of multiple outcomes. Importantly, the MAGE toolkit includes additional features for the conversion of probes to gene identifiers, and for conducting functional enrichment analysis, with annotated results, of statistically significant enriched terms in several formats. Along with the tool itself, a web-based infrastructure was also developed to support the features of this package.
Affiliation(s)
- Ioannis A. Tamposis
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, 35131 Lamia, Greece
- Georgios A. Manios
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, 35131 Lamia, Greece
- Theodosia Charitou
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, 35131 Lamia, Greece
- Konstantina E. Vennou
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, 35131 Lamia, Greece
- Pantelis G. Bagos
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, 35131 Lamia, Greece
36
Ross JP, van Dijk S, Phang M, Skilton MR, Molloy PL, Oytam Y. Batch-effect detection, correction and characterisation in Illumina HumanMethylation450 and MethylationEPIC BeadChip array data. Clin Epigenetics 2022; 14:58. [PMID: 35488315 PMCID: PMC9055778 DOI: 10.1186/s13148-022-01277-9] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 02/28/2022] [Accepted: 04/10/2022] [Indexed: 11/20/2022]
Abstract
Background Genomic technologies can be subject to significant batch-effects which are known to reduce experimental power and to potentially create false positive results. The Illumina Infinium Methylation BeadChip is a popular technology choice for epigenome-wide association studies (EWAS), but presently, little is known about the nature of batch-effects on these designs. Given the subtlety of biological phenotypes in many EWAS, control for batch-effects should be a consideration.
Results Using the batch-effect removal approaches in the ComBat and Harman software, we examined two in-house datasets and compared results with three large publicly available datasets (1214 HumanMethylation450 and 1094 MethylationEPIC BeadChips in total), and found that despite various forms of preprocessing, some batch-effects persist. This residual batch-effect is associated with the day of processing, the individual glass slide and the position of the array on the slide. Consistently across all datasets, 4649 probes required high amounts of correction. To understand the impact of this set on EWAS, we explored the literature and found three instances where persistently batch-effect prone probes have been reported in abstracts as key sites of differential methylation. As well as batch-effect susceptible probes, we also discovered a set of probes which are erroneously corrected. We provide batch-effect workflows for Infinium Methylation data and provide reference matrices of batch-effect prone and erroneously corrected features across the five datasets spanning regionally diverse populations and three commonly collected biosamples (blood, buccal and saliva). Conclusions Batch-effects are ever present, even in high-quality data, and a strategy to deal with them should be part of experimental design, particularly for EWAS. Batch-effect removal tools are useful to reduce technical variance in Infinium Methylation data, but they need to be applied with care and make use of post hoc diagnostic measures.
Affiliation(s)
- Jason P Ross
  - Human Health Program, Health and Biosecurity, CSIRO, Sydney, Australia.
- Susan van Dijk
  - Human Health Program, Health and Biosecurity, CSIRO, Sydney, Australia
- Melinda Phang
  - Charles Perkins Centre, The University of Sydney, Sydney, Australia
- Michael R Skilton
  - Charles Perkins Centre, The University of Sydney, Sydney, Australia
  - Sydney Medical School, The University of Sydney, Sydney, Australia
  - Sydney Institute for Women, Children and Their Families, Sydney Local Health District, Sydney, Australia
- Peter L Molloy
  - Human Health Program, Health and Biosecurity, CSIRO, Sydney, Australia
- Yalchin Oytam
  - Clinical Insights and Analytics Unit, South Eastern Sydney Local Health District, Sydney, Australia
37
Decision Theory versus Conventional Statistics for Personalized Therapy of Breast Cancer. J Pers Med 2022; 12:570. [PMID: 35455687 PMCID: PMC9028435 DOI: 10.3390/jpm12040570] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2022] [Revised: 03/24/2022] [Accepted: 03/28/2022] [Indexed: 11/17/2022]
Abstract
Estrogen and progesterone receptors being present or not represents one of the most important biomarkers for therapy selection in breast cancer patients. Conventional measurement by immunohistochemistry (IHC) involves errors, and numerous attempts have been made to increase precision by additional information from gene expression. This raises the question of how to fuse information, in particular, if there is disagreement. It is the primary domain of Dempster–Shafer decision theory (DST) to deal with contradicting evidence on the same item (here: receptor status), obtained through different techniques. DST is widely used in technical settings, such as self-driving cars and aviation, and is also promising to deliver significant advantages in medicine. Using data from breast cancer patients already presented in previous work, we focus on comparing DST with classical statistics in this work, to pave the way for its application in medicine. First, we explain how DST not only considers probabilities (a single number per sample), but also incorporates uncertainty in a concept of ‘evidence’ (two numbers per sample). This allows for very powerful displays of patient data in so-called ternary plots, a novel and crucial advantage for medical interpretation. Results are obtained according to conventional statistics (ODDS) and, in parallel, according to DST. Agreement and differences are evaluated, and the particular merits of DST discussed. The presented application demonstrates how decision theory introduces new levels of confidence in diagnoses derived from medical data.
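The fusion of contradicting evidence described in this abstract can be illustrated with Dempster's rule of combination, the standard combination step in Dempster-Shafer theory. The sketch below is only an illustration under stated assumptions: the abstract does not say that the authors use exactly this rule, the function name `combine` is hypothetical, and the mass values for the two measurement techniques are invented for the example rather than taken from the study.

```python
from itertools import product

# Frame of discernment for receptor status. Mass placed on EITHER
# expresses uncertainty rather than support for one specific status.
POS = frozenset({"positive"})
NEG = frozenset({"negative"})
EITHER = POS | NEG


def combine(m1, m2):
    """Fuse two mass functions with Dempster's rule of combination."""
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        overlap = a & b
        if overlap:
            combined[overlap] = combined.get(overlap, 0.0) + mass_a * mass_b
        else:
            # The two sources support mutually exclusive statuses.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    # Renormalize so that the remaining masses sum to one.
    return {s: m / (1.0 - conflict) for s, m in combined.items()}


# Hypothetical masses for one sample (illustrative values only).
ihc = {POS: 0.6, NEG: 0.1, EITHER: 0.3}
expression = {POS: 0.5, NEG: 0.2, EITHER: 0.3}
fused = combine(ihc, expression)
```

In this toy case both sources lean towards a positive status, so the fused mass on `POS` is higher than either input, while the mass left on `EITHER` shrinks. That remaining mass is what separates an evidence-based description from a single probability.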
38
Su L, Xu C, Zeng S, Su L, Joshi T, Stacey G, Xu D. Large-Scale Integrative Analysis of Soybean Transcriptome Using an Unsupervised Autoencoder Model. Front Plant Sci 2022; 13:831204. [PMID: 35310659 PMCID: PMC8927983 DOI: 10.3389/fpls.2022.831204] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2021] [Accepted: 02/09/2022] [Indexed: 06/14/2023]
Abstract
Plant tissues are distinguished by their gene expression patterns, which can help identify tissue-specific highly expressed genes and their differential functional modules. For this purpose, large-scale soybean transcriptome samples were collected and processed starting from raw sequencing reads in a uniform analysis pipeline. To address the gene expression heterogeneity in different tissues, we utilized an adversarial deconfounding autoencoder (AD-AE) model to map gene expressions into a latent space and adapted a standard unsupervised autoencoder (AE) model to help effectively extract meaningful biological signals from the noisy data. As a result, four groups of 1,743, 914, 2,107, and 1,451 genes were found highly expressed specifically in leaf, root, seed and nodule tissues, respectively. To obtain key transcription factors (TFs), hub genes and their functional modules in each tissue, we constructed tissue-specific gene regulatory networks (GRNs) and differential correlation networks by using corrected and compressed gene expression data. We validated our results against the literature and by gene enrichment analysis, which confirmed many identified tissue-specific genes. Our study represents the largest gene expression analysis in soybean tissues to date. It provides valuable targets for tissue-specific research and helps uncover broader biological patterns. The code is open source and publicly available at https://github.com/LingtaoSu/SoyMeta.
Collapse
Affiliation(s)
- Lingtao Su
- Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
| | - Chunhui Xu
- Institute for Data Science and Informatics, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
| | - Shuai Zeng
- Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
| | - Li Su
- Institute for Data Science and Informatics, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
| | - Trupti Joshi
- Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Institute for Data Science and Informatics, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Department of Health Management and Informatics and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
| | - Gary Stacey
- Division of Plant Sciences and Technology and Biochemistry Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
| | - Dong Xu
- Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Institute for Data Science and Informatics, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
| |
|
39
|
Schurink NW, van Kranen SR, Roberti S, van Griethuysen JJM, Bogveradze N, Castagnoli F, El Khababi N, Bakers FCH, de Bie SH, Bosma GPT, Cappendijk VC, Geenen RWF, Neijenhuis PA, Peterson GM, Veeken CJ, Vliegen RFA, Beets-Tan RGH, Lambregts DMJ. Sources of variation in multicenter rectal MRI data and their effect on radiomics feature reproducibility. Eur Radiol 2022; 32:1506-1516. [PMID: 34655313 PMCID: PMC8831294 DOI: 10.1007/s00330-021-08251-8] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 06/16/2021] [Revised: 07/23/2021] [Accepted: 08/06/2021] [Indexed: 12/22/2022]
Abstract
OBJECTIVES To investigate sources of variation in a multicenter rectal cancer MRI dataset focusing on hardware and image acquisition, segmentation methodology, and radiomics feature extraction software. METHODS T2W and DWI/ADC MRIs from 649 rectal cancer patients were retrospectively acquired in 9 centers. Fifty-two imaging features (14 first-order/6 shape/32 higher-order) were extracted from each scan using whole-volume (expert/non-expert) and single-slice segmentations using two different software packages (PyRadiomics/CapTk). Influence of hardware, acquisition, and patient-intrinsic factors (age/gender/cTN-stage) on ADC was assessed using linear regression. Feature reproducibility was assessed between segmentation methods and software packages using the intraclass correlation coefficient. RESULTS Image features differed significantly (p < 0.001) between centers with more substantial variations in ADC compared to T2W-MRI. In total, 64.3% of the variation in mean ADC was explained by differences in hardware and acquisition, compared to 0.4% by patient-intrinsic factors. Feature reproducibility between expert and non-expert segmentations was good to excellent (median ICC 0.89-0.90). Reproducibility for single-slice versus whole-volume segmentations was substantially poorer (median ICC 0.40-0.58). Between software packages, reproducibility was good to excellent (median ICC 0.99) for most features (first-order/shape/GLCM/GLRLM) but poor for higher-order (GLSZM/NGTDM) features (median ICC 0.00-0.41). CONCLUSIONS Significant variations are present in multicenter MRI data, particularly related to differences in hardware and acquisition, which will likely negatively influence subsequent analysis if not corrected for. Segmentation variations had a minor impact when using whole volume segmentations. Between software packages, higher-order features were less reproducible and caution is warranted when implementing these in prediction models. 
KEY POINTS • Features derived from T2W-MRI and in particular ADC differ significantly between centers when performing multicenter data analysis. • Variations in ADC are mainly (> 60%) caused by hardware and image acquisition differences and less so (< 1%) by patient- or tumor-intrinsic variations. • Features derived using different image segmentations (expert/non-expert) were reproducible, provided that whole-volume segmentations were used. When using different feature extraction software packages with similar settings, higher-order features were less reproducible.
Affiliation(s)
- Niels W Schurink
- Department of Radiology, The Netherlands Cancer Institute, POB 90203, 1006 BE, Amsterdam, The Netherlands
- GROW School for Oncology & Developmental Biology, University of Maastricht, Maastricht, The Netherlands
| | - Simon R van Kranen
- Department of Radiation Oncology, The Netherlands Cancer Institute, Amsterdam, The Netherlands
| | - Sander Roberti
- Department of Epidemiology and Biostatistics, The Netherlands Cancer Institute, Amsterdam, The Netherlands
| | - Joost J M van Griethuysen
- Department of Radiology, The Netherlands Cancer Institute, POB 90203, 1006 BE, Amsterdam, The Netherlands
- GROW School for Oncology & Developmental Biology, University of Maastricht, Maastricht, The Netherlands
| | - Nino Bogveradze
- Department of Radiology, The Netherlands Cancer Institute, POB 90203, 1006 BE, Amsterdam, The Netherlands
- GROW School for Oncology & Developmental Biology, University of Maastricht, Maastricht, The Netherlands
- Department of Radiology, Acad. F. Todua Medical Center, Research Institute of Clinical Medicine, Tbilisi, Georgia
| | - Francesca Castagnoli
- Department of Radiology, The Netherlands Cancer Institute, POB 90203, 1006 BE, Amsterdam, The Netherlands
| | - Najim El Khababi
- Department of Radiology, The Netherlands Cancer Institute, POB 90203, 1006 BE, Amsterdam, The Netherlands
- GROW School for Oncology & Developmental Biology, University of Maastricht, Maastricht, The Netherlands
| | - Frans C H Bakers
- Department of Radiology, Maastricht University Medical Centre, Maastricht, The Netherlands
| | - Shira H de Bie
- Department of Radiology, Deventer Ziekenhuis, Deventer, The Netherlands
| | - Gerlof P T Bosma
- Department of Interventional Radiology, Elisabeth Tweesteden Hospital, Tilburg, The Netherlands
| | - Vincent C Cappendijk
- Department of Radiology, Jeroen Bosch Hospital, 's-Hertogenbosch, The Netherlands
| | - Remy W F Geenen
- Department of Radiology, Northwest Clinics, Alkmaar, The Netherlands
| | | | | | - Cornelis J Veeken
- Department of Radiology, IJsselland Hospital, Capelle Aan Den IJssel, The Netherlands
| | - Roy F A Vliegen
- Department of Radiology, Zuyderland Medical Center, Heerlen, The Netherlands
| | - Regina G H Beets-Tan
- Department of Radiology, The Netherlands Cancer Institute, POB 90203, 1006 BE, Amsterdam, The Netherlands.
- GROW School for Oncology & Developmental Biology, University of Maastricht, Maastricht, The Netherlands.
| | - Doenja M J Lambregts
- Department of Radiology, The Netherlands Cancer Institute, POB 90203, 1006 BE, Amsterdam, The Netherlands.
| |
|
40
|
Moretto M, Sonego P, Pilati S, Matus JT, Costantini L, Malacarne G, Engelen K. A COMPASS for VESPUCCI: A FAIR Way to Explore the Grapevine Transcriptomic Landscape. Front Plant Sci 2022; 13:815443. [PMID: 35283898 PMCID: PMC8908374 DOI: 10.3389/fpls.2022.815443] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/15/2021] [Accepted: 01/24/2022] [Indexed: 06/14/2023]
Abstract
Successfully integrating transcriptomic experiments is a challenging task with the ultimate goal of analyzing gene expression data in the broader context of all available measurements, all from a single point of access. In its second major release VESPUCCI, the integrated database of gene expression data for grapevine, has been updated to be FAIR-compliant, employing standards and created with open-source technologies. It includes all public grapevine gene expression experiments from both microarray and RNA-seq platforms. Transcriptomic data can be accessed in multiple ways through the newly developed COMPASS GraphQL interface, while the expression values are normalized using different methodologies to flexibly satisfy different analysis requirements. Sample annotations are manually curated and use standard formats and ontologies. The updated version of VESPUCCI provides easy querying and analyzing of integrated grapevine gene expression (meta)data and can be seamlessly embedded in any analysis workflow or tools. VESPUCCI is freely accessible and offers several ways of interaction, depending on the specific goals and purposes and/or user expertise; an overview can be found at https://vespucci.readthedocs.io/.
Affiliation(s)
- Marco Moretto
- Unit of Computational Biology, Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Italy
| | - Paolo Sonego
- Unit of Computational Biology, Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Italy
| | - Stefania Pilati
- Unit of Plant Biology and Physiology, Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Italy
| | - José Tomás Matus
- Institute for Integrative Systems Biology (I2SysBio), Universitat de València-CSIC, Paterna, Spain
| | - Laura Costantini
- Unit of Grapevine Genetics and Breeding, Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Italy
| | - Giulia Malacarne
- Unit of Plant Biology and Physiology, Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Italy
| | - Kristof Engelen
- Unit of Computational Biology, Research and Innovation Centre, Fondazione Edmund Mach, San Michele all’Adige, Italy
| |
|
41
|
Kubinski R, Djamen-Kepaou JY, Zhanabaev T, Hernandez-Garcia A, Bauer S, Hildebrand F, Korcsmaros T, Karam S, Jantchou P, Kafi K, Martin RD. Benchmark of Data Processing Methods and Machine Learning Models for Gut Microbiome-Based Diagnosis of Inflammatory Bowel Disease. Front Genet 2022; 13:784397. [PMID: 35251123 PMCID: PMC8895431 DOI: 10.3389/fgene.2022.784397] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 09/27/2021] [Accepted: 01/13/2022] [Indexed: 12/14/2022]
Abstract
Patients with inflammatory bowel disease (IBD) wait months and undergo numerous invasive procedures between the initial appearance of symptoms and receiving a diagnosis. To reduce the time until diagnosis and improve patient wellbeing, machine learning algorithms capable of diagnosing IBD from the gut microbiome's composition are currently being explored. To date, these models have had limited clinical application because their performance decreases when they are applied to a new cohort of patient samples. Various methods have been developed to analyze microbiome data, which may improve the generalizability of machine learning IBD diagnostic tests. With an abundance of methods, there is a need to benchmark the performance and generalizability of various machine learning pipelines (from data processing to training a machine learning model) for microbiome-based IBD diagnostic tools. We collected fifteen 16S rRNA microbiome datasets (7,707 samples) from North America to benchmark combinations of gut microbiome features, data normalization and transformation methods, batch effect correction methods, and machine learning models. Pipeline generalizability to new cohorts of patients was evaluated with two binary classification metrics following leave-one-dataset-out (LODO) cross-validation, in which all samples from one study were left out of the training set and used for testing. We demonstrate that taxonomic features processed with a compositional transformation method and batch effect correction with the naive zero-centering method attain the best classification performance. In addition, machine learning models that identify non-linear decision boundaries between labels are more generalizable than those that are linearly constrained. Lastly, we illustrate the importance of generating a curated training dataset to ensure similar performance across patient demographics.
These findings will help improve the generalizability of machine learning models as we move towards non-invasive diagnostic and disease management tools for patients with IBD.
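As a rough illustration of the two ideas named in this abstract, the sketch below shows naive per-batch zero-centering and leave-one-dataset-out splitting on toy data. It is a minimal sketch and not the authors' pipeline: the function names `zero_center_by_batch` and `lodo_splits`, the single-feature toy values, and the study labels are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def zero_center_by_batch(values, batches):
    """Subtract each batch's own mean from its values (naive zero-centering)."""
    groups = defaultdict(list)
    for value, batch in zip(values, batches):
        groups[batch].append(value)
    means = {batch: fmean(vals) for batch, vals in groups.items()}
    return [value - means[batch] for value, batch in zip(values, batches)]


def lodo_splits(batches):
    """Yield (held_out, train_idx, test_idx), holding out one whole dataset per fold."""
    for held_out in sorted(set(batches)):
        train_idx = [i for i, b in enumerate(batches) if b != held_out]
        test_idx = [i for i, b in enumerate(batches) if b == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical toy data: one feature measured in three studies.
abundance = [1.0, 3.0, 10.0, 14.0, 5.0, 7.0]
study = ["A", "A", "B", "B", "C", "C"]

centered = zero_center_by_batch(abundance, study)
folds = list(lodo_splits(study))
```

Each fold trains on every study except one and tests on the held-out study, so the test samples never share a batch with the training samples.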
Affiliation(s)
| | | | | | - Alex Hernandez-Garcia
- Mila, Quebec Artificial Intelligence Institute, University of Montreal, Montréal, QC, Canada
| | - Stefan Bauer
- Max Planck Institute for Intelligent Systems, Tübingen, Germany
| | - Falk Hildebrand
- Gut Microbes and Health, Quadram Institute Bioscience, Norwich, United Kingdom
- Earlham Institute, Norwich, United Kingdom
| | - Tamas Korcsmaros
- Gut Microbes and Health, Quadram Institute Bioscience, Norwich, United Kingdom
- Earlham Institute, Norwich, United Kingdom
| | - Sani Karam
- Phyla Technologies Inc, Montréal, QC, Canada
| | - Prévost Jantchou
- Centre Hospitalier Universitaire Sainte-Justine, Montréal, QC, Canada
| | - Kamran Kafi
- Phyla Technologies Inc, Montréal, QC, Canada
| | | |
|
42
|
Ba R, Geffard E, Douillard V, Simon F, Mesnard L, Vince N, Gourraud PA, Limou S. Surfing the Big Data Wave: Omics Data Challenges in Transplantation. Transplantation 2022; 106:e114-e125. [PMID: 34889882 DOI: 10.1097/tp.0000000000003992] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 11/26/2022]
Abstract
In both research and care, patients, caregivers, and researchers are facing a leap forward in the quantity of data that are available for analysis and interpretation, marking the daunting "big data era." In the biomedical field, this quantitative shift refers mostly to the -omics, which permit measuring and analyzing biological features of the same type as a whole. Omics studies have greatly impacted transplantation research and highlighted their potential to better understand transplant outcomes. Some studies have emphasized the contribution of omics in developing personalized therapies to avoid graft loss. However, integrating omics data remains challenging in terms of analytical processes. These data come from multiple sources. Consequently, they may contain biases and systematic errors that can be mistaken for relevant biological information. Normalization and batch effect correction methods have been developed to tackle issues related to data quality and homogeneity. In addition, imputation methods handle data missingness. Importantly, the transplantation field represents a unique analytical context, as the biological statistical unit is the donor-recipient pair, which brings additional complexity to omics analyses. Strategies such as combined risk scores between 2 genomes that take genetic ancestry into account are emerging to better understand graft mechanisms and refine biological interpretations. Future omics will be based on integrative biology, analyzing the system as a whole rather than studying a single characteristic. In this review, we summarize advances in omics studies in transplantation and address the most challenging analytical issues regarding these approaches.
Affiliation(s)
- Rokhaya Ba
- Université de Nantes, Centre Hospitalier Universitaire Nantes, Institute of Health and Medical Research, Centre de Recherche en Transplantation et Immunologie, UMR 1064, Institut de Transplantation Urologie-Néphrologie, Nantes, France
- Département Informatique et Mathématiques, Ecole Centrale de Nantes, Nantes, France
| | - Estelle Geffard
- Université de Nantes, Centre Hospitalier Universitaire Nantes, Institute of Health and Medical Research, Centre de Recherche en Transplantation et Immunologie, UMR 1064, Institut de Transplantation Urologie-Néphrologie, Nantes, France
| | - Venceslas Douillard
- Université de Nantes, Centre Hospitalier Universitaire Nantes, Institute of Health and Medical Research, Centre de Recherche en Transplantation et Immunologie, UMR 1064, Institut de Transplantation Urologie-Néphrologie, Nantes, France
| | - Françoise Simon
- Université de Nantes, Centre Hospitalier Universitaire Nantes, Institute of Health and Medical Research, Centre de Recherche en Transplantation et Immunologie, UMR 1064, Institut de Transplantation Urologie-Néphrologie, Nantes, France
- Mount Sinai School of Medicine, New York, NY
| | - Laurent Mesnard
- Urgences Néphrologiques et Transplantation Rénale, Hôpital Tenon, Assistance Publique-Hôpitaux de Paris, Paris, France
- Sorbonne Université, Paris, France
| | - Nicolas Vince
- Université de Nantes, Centre Hospitalier Universitaire Nantes, Institute of Health and Medical Research, Centre de Recherche en Transplantation et Immunologie, UMR 1064, Institut de Transplantation Urologie-Néphrologie, Nantes, France
| | - Pierre-Antoine Gourraud
- Université de Nantes, Centre Hospitalier Universitaire Nantes, Institute of Health and Medical Research, Centre de Recherche en Transplantation et Immunologie, UMR 1064, Institut de Transplantation Urologie-Néphrologie, Nantes, France
| | - Sophie Limou
- Université de Nantes, Centre Hospitalier Universitaire Nantes, Institute of Health and Medical Research, Centre de Recherche en Transplantation et Immunologie, UMR 1064, Institut de Transplantation Urologie-Néphrologie, Nantes, France
- Département Informatique et Mathématiques, Ecole Centrale de Nantes, Nantes, France
| |
|
43
|
Van Asselt AJ, Ehli EA. Whole-Genome Genotyping Using DNA Microarrays for Population Genetics. Methods Mol Biol 2022; 2418:269-287. [PMID: 35119671 DOI: 10.1007/978-1-0716-1920-9_16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
The field of population genetics has exploded in the last two decades following the sequencing of the human genome in 2001 (Green et al. Nature 526:29-31, 2015). Tools to measure genetic variation have matured significantly throughout this advancement in knowledge (Lenoir and Giannella. J Biomed Discov Collab 1:11, 2006; Marzancola et al. Methods Mol Biol 1368:161-178, 2016). In this chapter, the focus is on the laboratory methods developed to perform genome-wide genotyping utilizing DNA microarrays, which is one of the most commonly used molecular techniques to assess global genetic variation (Heller MJ, Annu Rev Biomed Eng 4:129-153, 2002). DNA microarrays allow for the interrogation of hundreds of thousands of SNPs (single nucleotide polymorphisms) at once utilizing array-based technology in conjunction with fluorescent molecular labels in a process referred to as genotyping (Marzancola et al. Methods Mol Biol 1368:161-178, 2016). Genotype data can be utilized to associate certain phenotypes in relation with specific genetic variants within a population in a process known as genome-wide association studies or GWAS (Charlesworth and Charlesworth. Heredity (Edinb) 118(1):2-9, 2017; Casillas and Barbadilla. Genetics 205(3):1003-1035, 2017). This experimental technique is a multiple-day process involving the combination of DNA extraction, amplification, fragmentation, binding, and staining (Illumina Infinium HTS Assay Protocol Guide, 2013). Many vendors supply platforms and products to assess global genetic variation using DNA microarrays (Illumina Infinium HTS Assay Protocol Guide, 2013). In this chapter, the focus is on the methods utilized to generate high-quality genotype data with the Illumina® Infinium Global Screening Array. Although data analysis and quality control are not the focus for this chapter, they are also briefly addressed.
Affiliation(s)
- Austin J Van Asselt
- Avera Institute for Human Genetics, Avera McKennan Hospital and University Health Center, Sioux Falls, SD, USA
- Division of Basic Biomedical Sciences, Sanford School of Medicine, University of South Dakota, Vermillion, SD, USA
| | - Erik A Ehli
- Avera Institute for Human Genetics, Avera McKennan Hospital and University Health Center, Sioux Falls, SD, USA.
- Department of Psychiatry, Sanford School of Medicine, University of South Dakota, Sioux Falls, SD, USA.
| |
|
44
|
Qin Y, Yi D, Chen X, Guan Y. Deep learning identifies erroneous microarray-based, gene-level conclusions in literature. NAR Genom Bioinform 2021; 3:lqab089. [PMID: 34617014 PMCID: PMC8489595 DOI: 10.1093/nargab/lqab089] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/10/2021] [Revised: 08/25/2021] [Accepted: 09/17/2021] [Indexed: 11/14/2022]
Abstract
More than 110 000 publications have used microarrays to decipher phenotype-associated genes, clinical biomarkers and gene functions. Microarrays rely on digitally assaying the fluorescence signals of arrays. In this study, we retrospectively constructed raw images for 37 724 published microarray data, and developed deep learning algorithms to automatically detect systematic defects. We report that an alarming 26.73% of the microarray-based studies are affected by serious imaging defects. By literature mining, we found that publications associated with these affected microarrays have reported disproportionately more biological discoveries on the genes in the contaminated areas compared to other genes. 28.82% of the gene-level conclusions reported in these publications were based on measurements falling into the contaminated area, indicating severe, systematic problems caused by such contamination. We provide the identified published, problematic datasets, the affected genes and the imputed arrays, as well as software tools for scanning for such contamination, which will become essential for future studies to scrutinize and critically analyze microarray data.
Affiliation(s)
- Yanan Qin
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
| | - Daiyao Yi
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
| | - Xianghao Chen
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
| | - Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Department of Internal Medicine, University of Michigan Medical School, Ann Arbor, MI 48109, USA
| |
|
45
|
Wu Y, Guo Y, Ma J, Sa Y, Li Q, Zhang N. Research Progress of Gliomas in Machine Learning. Cells 2021; 10:3169. [PMID: 34831392 PMCID: PMC8622230 DOI: 10.3390/cells10113169] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/04/2021] [Revised: 11/04/2021] [Accepted: 11/05/2021] [Indexed: 12/29/2022]
Abstract
In the field of glioma research, the broad availability of genetic and image information generated by computer technologies and the boom in biomedical publications have led to the advent of the big-data era. Machine learning methods have been applied as possible approaches to speed up data mining. In this article, we review the present situation and future directions of machine learning applications in gliomas within the context of workflows that integrate analyses for precision cancer care. Publicly available tools and algorithms for key machine learning technologies in literature mining for glioma clinical research are reviewed and compared. Further, existing machine learning solutions and their limitations in glioma prediction and diagnostics, such as overfitting and class imbalance, are critically analyzed.
|
46
|
Lindbergh CA, Asken BM, Casaletto KB, Elahi FM, Goldberger LA, Fonseca C, You M, Apple AC, Staffaroni AM, Fitch R, Rivera Contreras W, Wang P, Karydas A, Kramer JH. Interbatch Reliability of Blood-Based Cytokine and Chemokine Measurements in Community-Dwelling Older Adults: A Cross-Sectional Study. J Gerontol A Biol Sci Med Sci 2021; 76:1954-1961. [PMID: 34110415 DOI: 10.1093/gerona/glab162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2021] [Indexed: 11/13/2022]
Abstract
Blood-based inflammatory markers hold considerable promise for diagnosis and prognostication of age-related neurodegenerative disease, though a paucity of research has empirically tested how reliably they can be measured across different experimental runs ("batches"). We quantified the interbatch reliability of 13 cytokines and chemokines in a cross-sectional study of 92 community-dwelling older adults (mean age = 74; 48% female). Plasma aliquots from the same blood draw were processed in parallel in 2 separate batches using the same analytic platform and procedures (high-performance electrochemiluminescence by Meso Scale Discovery). Interbatch correlations (Pearson's r) ranged from small and nonsignificant (r = .13 for macrophage inflammatory protein-1 alpha [MIP-1α]) to very large (r > .90 for interferon gamma [IFNγ], interleukin-10 [IL-10], interferon gamma-induced protein 10 [IP-10], MIP-1β, thymus and activation-regulated chemokine [TARC]) with most markers falling somewhere in between (.67 ≤ r ≤ .90 for IL-6, tumor necrosis factor alpha [TNF-α], Eotaxin, Eotaxin-3, monocyte chemoattractant protein-1 [MCP-1], MCP-4, macrophage-derived chemokine [MDC]). All markers, except for IL-6 and MCP-4, showed significant differences in absolute values between batches, with discrepancies ranging in effect size (Cohen's d) from small to moderate (0.2 ≤ |d| ≤ 0.5 for IL-10, IP-10, MDC) to large or very large (0.68 ≤ |d| ≤ 1.5 for IFNγ, TNF-α, Eotaxin, Eotaxin-3, MCP-1, MIP-1α, MIP-1β, TARC). Relatively consistent associations with external variables of interest (age, sex, systolic blood pressure, body mass index, cognition) were observed across batches. Taken together, our results suggest heterogeneity in measurement reliability of blood-based cytokines and chemokines, with some analytes outperforming others. Future work is needed to evaluate the generalizability of these findings while identifying potential sources of batch effect measurement error.
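The two statistics reported in this abstract answer different questions, which a small sketch can make concrete: Pearson's r measures whether samples keep their rank and spacing across batches, while Cohen's d measures a shift in absolute values. The example below is an assumption-laden illustration only. The toy values are invented, and the pooled-standard-deviation form of Cohen's d is one common variant; the abstract does not state which variant the study used.

```python
from math import sqrt
from statistics import fmean, variance


def pearson_r(x, y):
    """Pearson correlation between paired measurements x and y."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cohens_d(x, y):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (fmean(x) - fmean(y)) / sqrt(pooled_var)


# Hypothetical analyte: batch 2 reads exactly one unit higher for every sample.
batch1 = [1.0, 2.0, 3.0, 4.0, 5.0]
batch2 = [2.0, 3.0, 4.0, 5.0, 6.0]

r = pearson_r(batch1, batch2)  # samples are perfectly ordered across batches
d = cohens_d(batch1, batch2)   # yet absolute values differ systematically
```

In this toy case the correlation is perfect while the effect size is clearly non-zero, the same pattern the abstract describes for markers with r > .90 and large |d|.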
Affiliation(s)
- Cutter A Lindbergh
- Memory and Aging Center, Department of Neurology, University of California San Francisco, USA
| | - Breton M Asken
- Memory and Aging Center, Department of Neurology, University of California San Francisco, USA
| | - Kaitlin B Casaletto
- Memory and Aging Center, Department of Neurology, University of California San Francisco, USA
| | - Fanny M Elahi
- Memory and Aging Center, Department of Neurology, University of California San Francisco, USA
| | - Lauren A Goldberger
- Memory and Aging Center, Department of Neurology, University of California San Francisco, USA
| | - Corrina Fonseca
- Memory and Aging Center, Department of Neurology, University of California San Francisco, USA
| | - Michelle You
- Memory and Aging Center, Department of Neurology, University of California San Francisco, USA
| | - Alexandra C Apple
- Memory and Aging Center, Department of Neurology, University of California San Francisco, USA
| | - Adam M Staffaroni
- Memory and Aging Center, Department of Neurology, University of California San Francisco, USA
| | - Ryan Fitch
- Memory and Aging Center, Department of Neurology, University of California San Francisco, USA
| | - Will Rivera Contreras
- Memory and Aging Center, Department of Neurology, University of California San Francisco, USA
| | - Paul Wang
- Memory and Aging Center, Department of Neurology, University of California San Francisco, USA
| | - Anna Karydas
- Memory and Aging Center, Department of Neurology, University of California San Francisco, USA
| | - Joel H Kramer
- Memory and Aging Center, Department of Neurology, University of California San Francisco, USA
| | | |
|
47
|
Whitney HM, Li H, Ji Y, Liu P, Giger ML. Multi-Stage Harmonization for Robust AI across Breast MR Databases. Cancers (Basel) 2021; 13:4809. [PMID: 34638294 PMCID: PMC8508003 DOI: 10.3390/cancers13194809] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/02/2021] [Revised: 09/16/2021] [Accepted: 09/18/2021] [Indexed: 12/22/2022]
Abstract
Simple Summary: Batch harmonization of radiomic features extracted from magnetic resonance images of breast lesions from two databases was applied to an artificial intelligence/machine learning classification workflow. Training and independent test sets from the two databases, as well as the combination of them, were used in pre-harmonization and post-harmonization forms to investigate the generalizability of performance in the task of distinguishing between malignant and benign lesions. Most training and independent test scenarios were statistically equivalent, demonstrating that batch harmonization with feature selection harmonization can potentially develop generalizable classification models.
Abstract: Radiomic features extracted from medical images may demonstrate a batch effect when cases come from different sources. We investigated classification performance using training and independent test sets drawn from two sources using both pre-harmonization and post-harmonization features. In this retrospective study, a database of thirty-two radiomic features, extracted from DCE-MR images of breast lesions after fuzzy c-means segmentation, was collected. There were 944 unique lesions in Database A (208 benign lesions, 736 cancers) and 1986 unique lesions in Database B (481 benign lesions, 1505 cancers). The lesions from each database were divided by year of image acquisition into training and independent test sets, separately by database and in combination. ComBat batch harmonization was conducted on the combined training set to minimize the batch effect on eligible features by database. The empirical Bayes estimates from the feature harmonization were applied to the eligible features of the combined independent test set. The training sets (A, B, and combined) were then used in training linear discriminant analysis classifiers after stepwise feature selection. The classifiers were then run on the A, B, and combined independent test sets.
Classification performance was compared using pre-harmonization features to post-harmonization features, including their corresponding feature selection, evaluated using the area under the receiver operating characteristic curve (AUC) as the figure of merit. Four out of five training and independent test scenarios demonstrated statistically equivalent classification performance when compared pre- and post-harmonization. These results demonstrate that translation of machine learning techniques with batch data harmonization can potentially yield generalizable models that maintain classification performance.
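The key design point in this abstract is that harmonization parameters are estimated on the training set only and then applied unchanged to the independent test set. The sketch below illustrates that fit-then-apply pattern with a plain per-batch location-scale adjustment. It is explicitly not ComBat: there is no empirical Bayes shrinkage and no covariate modeling, and the function names, toy values, and database labels are hypothetical.

```python
from math import sqrt
from statistics import fmean, pstdev


def fit_harmonization(values, batches):
    """Estimate per-batch (mean, SD) and a common target (mean, SD) from training data."""
    by_batch = {}
    for value, batch in zip(values, batches):
        by_batch.setdefault(batch, []).append(value)
    params = {b: (fmean(vs), pstdev(vs)) for b, vs in by_batch.items()}
    # Target scale: pooled within-batch SD, so between-batch shifts do not inflate it.
    residual_sq = [(v - params[b][0]) ** 2 for v, b in zip(values, batches)]
    target = (fmean(values), sqrt(fmean(residual_sq)))
    return params, target


def apply_harmonization(values, batches, params, target):
    """Map each value onto the common scale using previously fitted parameters."""
    target_mean, target_sd = target
    out = []
    for value, batch in zip(values, batches):
        batch_mean, batch_sd = params[batch]  # assumes batch_sd > 0
        out.append((value - batch_mean) / batch_sd * target_sd + target_mean)
    return out


# Hypothetical training feature values from two databases.
train_values = [1.0, 3.0, 10.0, 14.0]
train_db = ["A", "A", "B", "B"]
params, target = fit_harmonization(train_values, train_db)

# Independent test cases reuse the training estimates; nothing is refitted.
test_values = [2.0, 3.0, 12.0]
test_db = ["A", "A", "B"]
harmonized_test = apply_harmonization(test_values, test_db, params, target)
```

Test cases that sit at their own database's training mean land on the shared target mean, which is the behavior that lets one classifier be evaluated across both sources.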
Affiliation(s)
- Heather M. Whitney
  - Department of Radiology, The University of Chicago, Chicago, IL 60637, USA
  - Department of Physics, Wheaton College, Wheaton, IL 60187, USA
  - Correspondence: (H.M.W.); (M.L.G.)
- Hui Li
  - Department of Radiology, The University of Chicago, Chicago, IL 60637, USA
- Yu Ji
  - Department of Radiology, The University of Chicago, Chicago, IL 60637, USA
  - Tianjin Medical University Cancer Institute and Hospital, Tianjin 300060, China
- Peifang Liu
  - Tianjin Medical University Cancer Institute and Hospital, Tianjin 300060, China
- Maryellen L. Giger
  - Department of Radiology, The University of Chicago, Chicago, IL 60637, USA
  - Correspondence: (H.M.W.); (M.L.G.)
|
48
|
Modos D, Thomas JP, Korcsmaros T. A handy meta-analysis tool for IBD research. NATURE COMPUTATIONAL SCIENCE 2021; 1:571-572. [PMID: 38217126 DOI: 10.1038/s43588-021-00124-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 01/15/2024]
Affiliation(s)
- Dezso Modos
  - Earlham Institute, Norwich, UK
  - Quadram Institute Bioscience, Norwich, UK
- John P Thomas
  - Earlham Institute, Norwich, UK
  - Quadram Institute Bioscience, Norwich, UK
  - Department of Gastroenterology, Norfolk and Norwich University Hospital, Norwich, UK
- Tamas Korcsmaros
  - Earlham Institute, Norwich, UK
  - Quadram Institute Bioscience, Norwich, UK
|
49
|
Čuklina J, Lee CH, Williams EG, Sajic T, Collins BC, Rodríguez Martínez M, Sharma VS, Wendt F, Goetze S, Keele GR, Wollscheid B, Aebersold R, Pedrioli PGA. Diagnostics and correction of batch effects in large-scale proteomic studies: a tutorial. Mol Syst Biol 2021; 17:e10240. [PMID: 34432947 PMCID: PMC8447595 DOI: 10.15252/msb.202110240] [Citation(s) in RCA: 75] [Impact Index Per Article: 18.8] [Received: 01/22/2021] [Revised: 07/16/2021] [Accepted: 07/26/2021] [Indexed: 12/11/2022]
Abstract
Advancements in mass spectrometry-based proteomics have enabled experiments encompassing hundreds of samples. While these large sample sets deliver much-needed statistical power, handling them introduces technical variability known as batch effects. Here, we present a step-by-step protocol for the assessment, normalization, and batch correction of proteomic data. We review established methodologies from related fields and describe solutions specific to proteomic challenges, such as ion intensity drift and missing values in quantitative feature matrices. Finally, we compile a set of techniques that enable control of batch effect adjustment quality. We provide an R package, "proBatch", containing functions required for each step of the protocol. We demonstrate the utility of this methodology on five proteomic datasets each encompassing hundreds of samples and consisting of multiple experimental designs. In conclusion, we provide guidelines and tools to make the extraction of true biological signal from large proteomic studies more robust and transparent, ultimately facilitating reliable and reproducible research in clinical proteomics and systems biology.
Affiliation(s)
- Jelena Čuklina
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
  - PhD Program in Systems Biology, University of Zurich and ETH Zurich, Zurich, Switzerland
  - IBM Research Europe, Rüschlikon, Switzerland
- Chloe H Lee
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
- Evan G Williams
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
  - Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Luxembourg, Luxembourg
- Tatjana Sajic
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
- Ben C Collins
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
  - Queen’s University Belfast, Belfast, UK
- Varun S Sharma
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
- Fabian Wendt
  - Department of Health Sciences and Technology, Institute of Translational Medicine, ETH Zurich, Zurich, Switzerland
- Sandra Goetze
  - Department of Health Sciences and Technology, Institute of Translational Medicine, ETH Zurich, Zurich, Switzerland
  - ETH Zürich, PHRT-CPAC, Zürich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Bernd Wollscheid
  - Department of Health Sciences and Technology, Institute of Translational Medicine, ETH Zurich, Zurich, Switzerland
  - ETH Zürich, PHRT-CPAC, Zürich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Ruedi Aebersold
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
  - Faculty of Science, University of Zurich, Zurich, Switzerland
- Patrick G A Pedrioli
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
  - Department of Health Sciences and Technology, Institute of Translational Medicine, ETH Zurich, Zurich, Switzerland
  - ETH Zürich, PHRT-CPAC, Zürich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
|
50
|
Zhang Y, Patil P, Johnson WE, Parmigiani G. Robustifying genomic classifiers to batch effects via ensemble learning. Bioinformatics 2021; 37:1521-1527. [PMID: 33245114 DOI: 10.1093/bioinformatics/btaa986] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 02/15/2020] [Revised: 10/20/2020] [Accepted: 11/13/2020] [Indexed: 01/08/2023]
Abstract
MOTIVATION: Genomic data are often produced in batches due to practical restrictions, which may lead to unwanted variation in data caused by discrepancies across batches. Such 'batch effects' often have negative impact on downstream biological analysis and need careful consideration. In practice, batch effects are usually addressed by specifically designed software, which merge the data from different batches, then estimate batch effects and remove them from the data. Here, we focus on classification and prediction problems, and propose a different strategy based on ensemble learning. We first develop prediction models within each batch, then integrate them through ensemble weighting methods.
RESULTS: We provide a systematic comparison between these two strategies using studies targeting diverse populations infected with tuberculosis. In one study, we simulated increasing levels of heterogeneity across random subsets of the study, which we treat as simulated batches. We then use the two methods to develop a genomic classifier for the binary indicator of disease status. We evaluate the accuracy of prediction in another independent study targeting a different population cohort. We observed that in independent validation, while merging followed by batch adjustment provides better discrimination at low level of heterogeneity, our ensemble learning strategy achieves more robust performance, especially at high severity of batch effects. These observations provide practical guidelines for handling batch effects in the development and evaluation of genomic classifiers.
AVAILABILITY AND IMPLEMENTATION: The data underlying this article are available in the article and in its online supplementary material. Processed data is available in the Github repository with implementation code, at https://github.com/zhangyuqing/bea_ensemble.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
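The strategy of training one model per batch and combining them by ensemble weights can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the learner is a one-feature nearest-centroid classifier, each weight is a model's mean accuracy on the other batches (one of many possible weighting schemes), and every function name is hypothetical. At least two batches are required.

```python
from statistics import mean

def fit_centroids(xs, ys):
    """Train a one-feature nearest-centroid model: store the mean of each class."""
    c0 = mean(x for x, y in zip(xs, ys) if y == 0)
    c1 = mean(x for x, y in zip(xs, ys) if y == 1)
    return c0, c1

def predict(model, x):
    """Return the label (0 or 1) of the closer class centroid."""
    c0, c1 = model
    return 1 if abs(x - c1) < abs(x - c0) else 0

def accuracy(model, xs, ys):
    """Fraction of samples the model labels correctly."""
    return mean(1.0 if predict(model, x) == y else 0.0 for x, y in zip(xs, ys))

def fit_ensemble(batches):
    """batches maps a batch name to (xs, ys); returns a list of (model, weight)."""
    models = {name: fit_centroids(xs, ys) for name, (xs, ys) in batches.items()}
    raw = {}
    for name, model in models.items():
        # Assumed weighting: score each model on every batch it was not trained on.
        others = [data for other, data in batches.items() if other != name]
        raw[name] = mean(accuracy(model, xs, ys) for xs, ys in others)
    total = sum(raw.values())
    n = len(models)
    return [(models[k], raw[k] / total if total > 0 else 1.0 / n) for k in models]

def ensemble_predict(ensemble, x):
    """Weighted vote over the per-batch models; the weights sum to 1."""
    score = sum(w * predict(m, x) for m, w in ensemble)
    return 1 if score >= 0.5 else 0
```

Because no batch is merged with another, no explicit batch adjustment is needed before training; the weights decide how much each batch-specific model contributes to the final prediction.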
Affiliation(s)
- Yuqing Zhang
  - Clinical Bioinformatics, Gilead Sciences, Inc., Foster City, CA 94404, USA
- Prasad Patil
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
- W Evan Johnson
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
  - Division of Computational Biomedicine, Boston University School of Medicine, Boston, MA 02118, USA
- Giovanni Parmigiani
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
|