1
Chappel JR, Kirkwood-Donelson KI, Reif DM, Baker ES. From big data to big insights: statistical and bioinformatic approaches for exploring the lipidome. Anal Bioanal Chem 2024; 416:2189-2202. [PMID: 37875675 PMCID: PMC10954412 DOI: 10.1007/s00216-023-04991-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2023] [Revised: 10/01/2023] [Accepted: 10/05/2023] [Indexed: 10/26/2023]
Abstract
The goal of lipidomic studies is to provide a broad characterization of cellular lipids present and changing in a sample of interest. Recent lipidomic research has significantly contributed to revealing the multifaceted roles that lipids play in fundamental cellular processes, including signaling, energy storage, and structural support. Furthermore, these findings have shed light on how lipids dynamically respond to various perturbations. Continued advancement in analytical techniques has also led to improved abilities to detect and identify novel lipid species, resulting in increasingly large datasets. Statistical analysis of these datasets can be challenging not only because of their vast size, but also because of the highly correlated data structure that exists due to many lipids belonging to the same metabolic or regulatory pathways. Interpretation of these lipidomic datasets is also hindered by a lack of current biological knowledge for the individual lipids. These limitations can therefore make lipidomic data analysis a daunting task. To address these difficulties and shed light on opportunities and also weaknesses in current tools, we have assembled this review. Here, we illustrate common statistical approaches for finding patterns in lipidomic datasets, including univariate hypothesis testing, unsupervised clustering, supervised classification modeling, and deep learning approaches. We then describe various bioinformatic tools often used to biologically contextualize results of interest. Overall, this review provides a framework for guiding lipidomic data analysis to promote a greater assessment of lipidomic results, while understanding potential advantages and weaknesses along the way.
Affiliation(s)
- Jessie R Chappel
- Bioinformatics Research Center, Department of Biological Sciences, North Carolina State University, Raleigh, NC, 27606, USA
- Kaylie I Kirkwood-Donelson
- Immunity, Inflammation, and Disease Laboratory, Division of Intramural Research, National Institute of Environmental Health Sciences, Durham, NC, 27709, USA
- David M Reif
- Predictive Toxicology Branch, Division of Translational Toxicology, National Institute of Environmental Health Sciences, Durham, NC, 27709, USA
- Erin S Baker
- Department of Chemistry, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27514, USA
2
Tan ALM, Getzen EJ, Hutch MR, Strasser ZH, Gutiérrez-Sacristán A, Le TT, Dagliati A, Morris M, Hanauer DA, Moal B, Bonzel CL, Yuan W, Chiudinelli L, Das P, Zhang HG, Aronow BJ, Avillach P, Brat GA, Cai T, Hong C, La Cava WG, Hooi Will Loh H, Luo Y, Murphy SN, Yuan Hgiam K, Omenn GS, Patel LP, Jebathilagam Samayamuthu M, Shriver ER, Shakeri Hossein Abad Z, Tan BWL, Visweswaran S, Wang X, Weber GM, Xia Z, Verdy B, Long Q, Mowery DL, Holmes JH. Informative missingness: What can we learn from patterns in missing laboratory data in the electronic health record? J Biomed Inform 2023; 139:104306. [PMID: 36738870 PMCID: PMC10849195 DOI: 10.1016/j.jbi.2023.104306] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2022] [Revised: 01/21/2023] [Accepted: 01/29/2023] [Indexed: 02/05/2023]
Abstract
BACKGROUND In electronic health records, patterns of missing laboratory test results could capture patients' course of disease as well as reflect clinicians' concerns or worries for possible conditions. These patterns are often understudied and overlooked. This study aims to identify informative patterns of missingness among laboratory data collected across 15 healthcare system sites in three countries for COVID-19 inpatients. METHODS We collected and analyzed demographic, diagnosis, and laboratory data for 69,939 patients with positive COVID-19 PCR tests across three countries from 1 January 2020 through 30 September 2021. We analyzed missing laboratory measurements across sites, missingness stratification by demographic variables, temporal trends of missingness, correlations between labs based on missingness indicators over time, and clustering of groups of labs based on their missingness/ordering pattern. RESULTS With these analyses, we identified mapping issues faced in seven out of 15 sites. We also identified nuances in data collection and variable definition for the various sites. Temporal trend analyses may support the use of laboratory test result missingness patterns in identifying severe COVID-19 patients. Lastly, using missingness patterns, we determined relationships between various labs that reflect clinical behaviors. CONCLUSION In this work, we use computational approaches to relate missingness patterns to hospital treatment capacity and highlight the heterogeneity of looking at COVID-19 over time and at multiple sites, where there might be different phases, policies, etc. Changes in missingness could suggest a change in a patient's condition, and patterns of missingness among laboratory measurements could potentially identify clinical outcomes. This allows sites to consider missing data as informative to analyses and helps researchers identify which sites are better poised to study particular questions.
Affiliation(s)
- Emily J Getzen
- University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Trang T Le
- University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Priam Das
- Harvard Medical School, Cambridge, MA, USA
- Bruce J Aronow
- Cincinnati Children's Hospital Medical Center, University of Cincinnati, Cincinnati, OH, USA
- Tianxi Cai
- Harvard Medical School, Cambridge, MA, USA
- Chuan Hong
- Harvard Medical School, Cambridge, MA, USA; Duke University, Durham, NC, USA
- William G La Cava
- Harvard Medical School, Cambridge, MA, USA; Boston Children's Hospital, Boston, MA, USA
- Yuan Luo
- Northwestern University, Chicago, IL, USA
- Lav P Patel
- University of Kansas Medical Center, United States
- Emily R Shriver
- University of Pennsylvania Health System, Philadelphia, PA, USA
- Xuan Wang
- Harvard Medical School, Cambridge, MA, USA
- Zongqi Xia
- University of Pittsburgh, Pittsburgh, PA, USA
- Qi Long
- University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Danielle L Mowery
- University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- John H Holmes
- University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
3
Olshansky G, Giles C, Salim A, Meikle PJ. Challenges and opportunities for prevention and removal of unwanted variation in lipidomic studies. Prog Lipid Res 2022; 87:101177. [PMID: 35780914 DOI: 10.1016/j.plipres.2022.101177] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 01/23/2022] [Revised: 05/19/2022] [Accepted: 06/26/2022] [Indexed: 10/17/2022]
Abstract
Large 'omics studies are of particular interest to population and clinical research as they allow elucidation of biological pathways that are often out of reach of other methodologies. Typically, these information-rich datasets are produced from multiple coordinated profiling studies that may include lipidomics, metabolomics, proteomics or other strategies to generate high-dimensional data. In lipidomics, the generation of such data presents a series of unique technological and logistical challenges: maximizing the power (number of samples) and coverage (number of analytes) of the dataset while minimizing the sources of unwanted variation. Technological advances in analytical platforms, as well as computational approaches, have led to improvement of data quality, especially with regard to instrumental variation. On a small scale, it is possible to control systematic bias from beginning to end. However, as the size and complexity of datasets grow, it is inevitable that unwanted variation arises from multiple sources, some potentially unknown and out of the investigators' control. Increases in cohort sizes and complexity have led to new challenges in the sample collection, handling, storage, and preparation stages. If not considered and dealt with appropriately, this unwanted variation may undermine the quality of the data and the reliability of any subsequent analysis. Here we review the various experimental phases where unwanted variation may be introduced and review general strategies and approaches to handle this variation, specifically addressing issues relevant to lipidomics studies.
Affiliation(s)
- Gavriel Olshansky
- Metabolomics Laboratory, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia; Baker Department of Cardiometabolic Health, University of Melbourne, Parkville, Victoria, Australia
- Corey Giles
- Metabolomics Laboratory, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia; Baker Department of Cardiometabolic Health, University of Melbourne, Parkville, Victoria, Australia
- Agus Salim
- Melbourne School of Population and Global Health, University of Melbourne, Parkville, VIC 3010, Australia; School of Mathematics and Statistics, University of Melbourne, Parkville, VIC 3010, Australia
- Peter J Meikle
- Metabolomics Laboratory, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia; Baker Department of Cardiometabolic Health, University of Melbourne, Parkville, Victoria, Australia; Faculty of Medicine, Nursing and Health Sciences, Central Clinical School, Monash University, Melbourne, Victoria, Australia
4
Penalized Variable Selection for Lipid-Environment Interactions in a Longitudinal Lipidomics Study. Genes (Basel) 2019; 10:genes10121002. [PMID: 31816972 PMCID: PMC6947406 DOI: 10.3390/genes10121002] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 11/07/2019] [Accepted: 11/26/2019] [Indexed: 12/20/2022] Open
Abstract
Lipid species are critical components of eukaryotic membranes. They play key roles in many biological processes such as signal transduction, cell homeostasis, and energy storage. Investigations of lipid-environment interactions, in addition to the lipid and environment main effects, have important implications in understanding the lipid metabolism and related changes in phenotype. In this study, we developed a novel penalized variable selection method to identify important lipid-environment interactions in a longitudinal lipidomics study. An efficient Newton-Raphson based algorithm was proposed within the generalized estimating equation (GEE) framework. We conducted extensive simulation studies to demonstrate the superior performance of our method over alternatives, in terms of both identification accuracy and prediction performance. As weight control via dietary calorie restriction and exercise has been demonstrated to prevent cancer in a variety of studies, analysis of the high-dimensional lipid datasets collected using 60 mice from the skin cancer prevention study identified meaningful markers that provide fresh insight into the underlying mechanism of cancer preventive effects.
5
Lozano M, Manyes L, Peiró J, Iftimi A, Ramada JM. Strategic procedure in three stages for the selection of variables to obtain balanced results in public health research. Cad Saude Publica 2018; 34:e00174017. [PMID: 30043852 DOI: 10.1590/0102-311x00174017] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/05/2017] [Accepted: 05/03/2018] [Indexed: 11/22/2022] Open
Abstract
Multidisciplinary research in public health is approached using methods from many scientific disciplines. One of the main characteristics of this type of research is dealing with large data sets. Classic statistical variable selection methods, known as "screen and clean" and used in a single step, select the variables with greater explanatory weight in the model. These methods, commonly used in public health research, may induce masking and multicollinearity, excluding relevant variables for the experts in each discipline and skewing the result. Some specific techniques are used to solve this problem, such as penalized regressions and Bayesian statistics; they offer more balanced results among subsets of variables, but with less restrictive selection thresholds. Using a combination of classical methods, a three-step procedure is proposed in this manuscript, capturing the relevant variables of each scientific discipline, minimizing the selection of variables in each of them and obtaining a balanced distribution that explains most of the variability. This procedure was applied to a dataset from a public health research study. Compared with the single-step methods, the proposed method shows a greater reduction in the number of variables, as well as a balanced distribution among the scientific disciplines associated with the response variable. We propose an innovative procedure for variable selection and apply it to our dataset. Furthermore, we compare the new method with the classic single-step procedures.
Affiliation(s)
- Manuel Lozano
- Departament de Medicina Preventiva i Salut Pública, Ciències de l'Alimentació, Toxicologia i Medicina Legal, Universitat de València, Valencia, España
- Lara Manyes
- Departament de Medicina Preventiva i Salut Pública, Ciències de l'Alimentació, Toxicologia i Medicina Legal, Universitat de València, Valencia, España
- Juanjo Peiró
- Departament d'Estadística i Investigació Operativa, Universitat de València, Valencia, España
- Adina Iftimi
- Departament d'Estadística i Investigació Operativa, Universitat de València, Valencia, España; Department of Biosciences and Nutrition, Karolinska Institutet, Huddinge, Sweden
- José María Ramada
- Institut Hospital del Mar d'Investigacions Mèdiques, Barcelona, España; CIBER de Epidemiología y Salud Pública, Madrid, España