1
Yin A, Yuan A, Tan MT. Highly robust causal semiparametric U-statistic with applications in biomedical studies. Int J Biostat 2024; 20:69-91. [PMID: 36433631 PMCID: PMC10225018 DOI: 10.1515/ijb-2022-0047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2022] [Accepted: 10/31/2022] [Indexed: 11/28/2022]
Abstract
With our increased ability to capture large data, causal inference has received renewed attention and is playing an ever-important role in biomedicine and economics. However, one major methodological hurdle is that existing methods rely on many unverifiable model assumptions. Thus robust modeling is a critically important approach, complementary to sensitivity analysis, which compares results under various model assumptions. The more robust a method is with respect to model assumptions, the more valuable it is. The doubly robust estimator (DRE) is a significant advance in this direction. However, in practice, many outcome measures are functionals of multiple distributions, and so are the associated estimands, which can only be estimated via U-statistics. Thus most existing DREs do not apply. This article proposes a broad class of highly robust U-statistic estimators (HREs), which use semiparametric specifications for both the propensity score and outcome models in constructing the U-statistic. Thus, the HRE is more robust than the existing DREs. We derive comprehensive asymptotic properties of the proposed estimators and perform extensive simulation studies to evaluate their finite sample performance and compare them with the corresponding parametric U-statistics and the naive estimators, over which they show significant advantages. Then we apply the method to analyze a clinical trial from the AIDS Clinical Trials Group.
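To make concrete what an estimand that is a functional of two distributions looks like, the following is a minimal sketch of a naive two-sample U-statistic, here the Mann-Whitney probability that a treated outcome exceeds a control outcome. It is an illustration only: it is the unadjusted estimator, not the HRE proposed in the article, and the function name and the toy outcomes are hypothetical.

```python
def two_sample_u(treated, control):
    """Naive two-sample U-statistic with the Mann-Whitney kernel.

    Averages h(y1, y0) = 1 if y1 > y0, 0.5 if tied, 0 otherwise over all
    treated/control pairs. No propensity or outcome model is used, so this
    is only valid without confounding (an assumption of this sketch).
    """
    total = 0.0
    for y1 in treated:
        for y0 in control:
            if y1 > y0:
                total += 1.0
            elif y1 == y0:
                total += 0.5
    return total / (len(treated) * len(control))


# Hypothetical outcomes, for illustration only.
treated_outcomes = [3.1, 4.0, 2.2, 5.3]
control_outcomes = [2.0, 3.5, 1.8]
u_hat = two_sample_u(treated_outcomes, control_outcomes)
```

The kernel depends on one observation from each group, which is why a single-sample mean estimator cannot express this quantity.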
Affiliation(s)
- Anqi Yin
- Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, Washington, DC 20057, USA
- Ao Yuan
- Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, Washington, DC 20057, USA
- Ming T. Tan
- Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, Washington, DC 20057, USA
2
Pashazadeh A, Navimipour NJ. Big data handling mechanisms in the healthcare applications: A comprehensive and systematic literature review. J Biomed Inform 2018; 82:47-62. [PMID: 29655946 DOI: 10.1016/j.jbi.2018.03.014] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 06/22/2017] [Revised: 11/19/2017] [Accepted: 03/23/2018] [Indexed: 01/08/2023]
Abstract
Healthcare provides many services, such as the diagnosis, treatment, and prevention of diseases, illnesses, injuries, and other physical and mental disorders. Large-scale distributed data processing applications in healthcare operate, as a basic concept, on large amounts of data. Therefore, big data application functions are the main part of healthcare operations, but there has not been any comprehensive and systematic survey studying and evaluating the important techniques in this field. This paper therefore aims to provide a comprehensive, detailed, and systematic study of the state-of-the-art mechanisms in big data related to healthcare applications in five categories: machine learning, cloud-based, heuristic-based, agent-based, and hybrid mechanisms. It also presents a systematic literature review (SLR) of big data applications in the healthcare literature up to the end of 2016. Initially, 205 papers were identified, but a paper selection process reduced the number to 29 important studies.
Affiliation(s)
- Asma Pashazadeh
- Department of Computer Engineering, Tabriz Branch, Islamic Azad University, Tabriz, Iran
- Nima Jafari Navimipour
- Department of Computer Engineering, Tabriz Branch, Islamic Azad University, Tabriz, Iran
3
Pan W. Relationship between genomic distance-based regression and kernel machine regression for multi-marker association testing. Genet Epidemiol 2011; 35:211-6. [PMID: 21308765 DOI: 10.1002/gepi.20567] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Received: 08/18/2010] [Revised: 11/21/2010] [Accepted: 01/04/2011] [Indexed: 11/10/2022]
Abstract
To detect genetic association with common and complex diseases, two powerful yet quite different multimarker association tests have been proposed: genomic distance-based regression (GDBR) (Wessel and Schork [2006] Am J Hum Genet 79:792–806) and kernel machine regression (KMR) (Kwee et al. [2008] Am J Hum Genet 82:386–397; Wu et al. [2010] Am J Hum Genet 86:929–942). GDBR is based on relating a multimarker similarity metric for a group of subjects to variation in their trait values, while KMR is based on nonparametric estimates of the effects of the multiple markers on the trait through a kernel function or kernel matrix. Since the two approaches are both powerful and general, but appear quite different, it is important to know their specific relationships. In this report, we show that, under the condition that there is no other covariate, there is a striking correspondence between the two approaches for a quantitative or a binary trait: if the same positive semi-definite matrix is used as the centered similarity matrix in GDBR and as the kernel matrix in KMR, the F-test statistic in GDBR and the score test statistic in KMR are equal (up to some ignorable constants). The result is based on the connections of both methods to linear or logistic (random-effects) regression models.
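The correspondence described above can be checked numerically on a toy example. The sketch below is not the paper's derivation: it assumes a quantitative trait with no covariates, a linear kernel on hypothetical genotype codes, a pseudo-F ratio without degrees-of-freedom constants, and a score statistic without variance scaling. Under those assumptions the pseudo-F is a monotone function of the score statistic, F = Q / (ss * tr(G) - Q), where ss is the centered trait sum of squares.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def identity_minus(M):
    """Return I - M for a square matrix M."""
    n = len(M)
    return [[(1.0 if i == j else 0.0) - M[i][j] for j in range(n)] for i in range(n)]

def centered(y):
    mean = sum(y) / len(y)
    return [v - mean for v in y]

def gdbr_pseudo_f(y, G):
    """tr(HGH) / tr((I-H)G(I-H)), H projecting onto the centered trait."""
    yc = centered(y)
    ss = sum(v * v for v in yc)
    H = [[a * b / ss for b in yc] for a in yc]
    R = identity_minus(H)
    return trace(matmul(matmul(H, G), H)) / trace(matmul(matmul(R, G), R))

def kmr_score(y, K):
    """Unscaled score statistic (y - mean)' K (y - mean)."""
    yc = centered(y)
    n = len(yc)
    return sum(yc[i] * K[i][j] * yc[j] for i in range(n) for j in range(n))

# Hypothetical genotype codes (6 subjects x 3 markers) and trait values.
X = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 0, 1], [1, 2, 2], [2, 1, 0]]
y = [1.2, 0.4, 2.1, 0.3, 1.9, 1.1]
n = len(y)

K = matmul(X, [list(col) for col in zip(*X)])   # linear kernel X X^T
C = identity_minus([[1.0 / n] * n for _ in range(n)])
G = matmul(matmul(C, K), C)                     # centered similarity matrix

F = gdbr_pseudo_f(y, G)
Q = kmr_score(y, K)
ss = sum(v * v for v in centered(y))
F_from_Q = Q / (ss * trace(G) - Q)              # should match F
```

Because ss and tr(G) do not change when the trait values are permuted, the two statistics rank permutations identically in this simplified setting.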
Affiliation(s)
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455–0392, USA.
4
Lin WY, Tiwari HK, Gao G, Zhang K, Arcaroli JJ, Abraham E, Liu N. Similarity-based multimarker association tests for continuous traits. Ann Hum Genet 2012; 76:246-60. [PMID: 22497480 DOI: 10.1111/j.1469-1809.2012.00706.x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Abstract
Testing multiple markers simultaneously can not only capture the linkage disequilibrium patterns but also decrease the number of tests and thus alleviate the multiple-testing penalty. If a gene is associated with a phenotype, subjects with similar genotypes in this gene should also have similar phenotypes. Based on this concept, we have developed a general framework that is applicable to continuous traits. Two similarity-based tests (namely, SIMc and SIMp tests) were derived as special cases of the general framework. In our simulation study, we compared the power of the two tests with that of the single-marker analysis, a standard haplotype regression, and a popular and powerful kernel machine regression. Our SIMc test outperforms other tests when the average R² (a measure of linkage disequilibrium) between the causal variant and the surrounding markers is larger than 0.3 or when the causal allele is common (say, frequency = 0.3). Our SIMp test outperforms other tests when the causal variant is introduced at common haplotypes (the maximum frequency of risk haplotypes >0.4). We also applied our two tests to an adiposity data set to show their utility.
Affiliation(s)
- Wan-Yu Lin
- Department of Biostatistics, University of Alabama at Birmingham, USA
5
Han F, Pan W. Powerful multi-marker association tests: unifying genomic distance-based regression and logistic regression. Genet Epidemiol 2010; 34:680-8. [PMID: 20976795 DOI: 10.1002/gepi.20529] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Indexed: 11/11/2022]
Abstract
To detect genetic association with common and complex diseases, many statistical tests have been proposed for candidate gene or genome-wide association studies with the case-control design. Due to linkage disequilibrium (LD), multi-marker association tests can gain power over single-marker tests with a Bonferroni multiple testing adjustment. Among many existing multi-marker association tests, most aim to detect only one of the many possible aspects of distributional differences between the genotypes of cases and controls, such as allele frequency differences, while a few new ones aim to target two or three aspects, all of which can be implemented in logistic regression. In contrast to logistic regression, a genomic distance-based regression (GDBR) approach aims to detect some high-order genotypic differences between cases and controls. A recent study has confirmed the high power of GDBR tests. At this moment, the popular logistic regression and the emerging GDBR approaches are completely unrelated; for example, one has to choose between the two. In this article, we reformulate GDBR as logistic regression, opening an avenue to constructing other powerful tests while overcoming some limitations of GDBR. For example, asymptotic distributions can replace time-consuming permutations for deriving P-values, and covariates, including gene-gene interactions, can be easily incorporated. Importantly, this reformulation facilitates combining GDBR with other existing methods in a unified framework of logistic regression. In particular, we show that Fisher's P-value combining method can boost statistical power by incorporating information from allele frequencies, Hardy-Weinberg disequilibrium, LD patterns, and other higher-order interactions among multi-markers as captured by GDBR.
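As a reminder of how Fisher's P-value combining method works, here is a minimal sketch. It assumes the component P-values are independent, which is the textbook setting; the abstract does not say how dependence between tests on the same data is handled, so this should not be read as the authors' procedure. The function name and the example P-values are hypothetical.

```python
import math

def fisher_combine(p_values):
    """Fisher's method: X = -2 * sum(ln p_i) ~ chi-square with 2k df under H0.

    For an even number of degrees of freedom 2k the chi-square survival
    function has the closed form exp(-x/2) * sum_{j<k} (x/2)^j / j!,
    so no external library is needed.
    """
    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)
    half = stat / 2.0
    term, total = 1.0, 1.0          # j = 0 term of the series
    for j in range(1, k):
        term *= half / j            # (x/2)^j / j! built incrementally
        total += term
    return stat, math.exp(-half) * total


# Hypothetical P-values from three different aspects of association.
stat, combined_p = fisher_combine([0.04, 0.20, 0.03])
```

With a single P-value the method returns that P-value unchanged, which is a quick sanity check on the closed form.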
Affiliation(s)
- Fang Han
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455–0392, USA
6
Schaid DJ. Genomic similarity and kernel methods I: advancements by building on mathematical and statistical foundations. Hum Hered 2010; 70:109-31. [PMID: 20610906 DOI: 10.1159/000312641] [Citation(s) in RCA: 75] [Impact Index Per Article: 5.4] [Received: 08/02/2009] [Accepted: 03/09/2010] [Indexed: 01/05/2023]
Abstract
Measures of genomic similarity are the basis of many statistical analytic methods. We review the mathematical and statistical basis of similarity methods, particularly based on kernel methods. A kernel function converts information for a pair of subjects to a quantitative value representing either similarity (larger values meaning more similar) or distance (smaller values meaning more similar), with the requirement that it must create a positive semidefinite matrix when applied to all pairs of subjects. This review emphasizes the wide range of statistical methods and software that can be used when similarity is based on kernel methods, such as nonparametric regression, linear mixed models and generalized linear mixed models, hierarchical models, score statistics, and support vector machines. The mathematical rigor for these methods is summarized, as is the mathematical framework for making kernels. This review provides a framework to move from intuitive and heuristic approaches to define genomic similarities to more rigorous methods that can take advantage of powerful statistical modeling and existing software. A companion paper reviews novel approaches to creating kernels that might be useful for genomic analyses, providing insights with examples [1].
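The similarity/distance duality and the positive semidefinite requirement can be illustrated with a small sketch. It assumes a linear kernel on hypothetical additive genotype codes (0, 1, 2), chosen only because it is positive semidefinite by construction; it is not one of the kernels recommended in the review, and the helper names are made up for this example.

```python
def linear_kernel(X):
    """Similarity matrix K[i][j] = <x_i, x_j> over all pairs of subjects."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

def kernel_to_sq_distance(K):
    """Kernel-induced squared distance: d2(i, j) = K_ii + K_jj - 2 K_ij."""
    n = len(K)
    return [[K[i][i] + K[j][j] - 2 * K[i][j] for j in range(n)] for i in range(n)]

def quadratic_form(v, K):
    """v' K v, which is non-negative for every v when K is positive semidefinite."""
    n = len(v)
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))


# Hypothetical genotype codes for 4 subjects at 3 markers.
genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 0, 1]]
K = linear_kernel(genotypes)
D2 = kernel_to_sq_distance(K)
qf = quadratic_form([1.0, -2.0, 0.5, 1.0], K)
```

For this kernel the induced squared distance is the ordinary squared Euclidean distance between genotype vectors, so larger similarity and smaller distance describe the same pairs.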
Affiliation(s)
- Daniel J Schaid
- Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minn., USA
7
Lin WY, Schaid DJ. Power comparisons between similarity-based multilocus association methods, logistic regression, and score tests for haplotypes. Genet Epidemiol 2009; 33:183-97. [PMID: 18814307 DOI: 10.1002/gepi.20364] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Indexed: 11/07/2022]
Abstract
Recently, a genomic distance-based regression for multilocus associations was proposed (Wessel and Schork [2006] Am. J. Hum. Genet. 79:792-806) in which either locus or haplotype scoring can be used to measure genetic distance. Although it allows various measures of genomic similarity and simultaneous analyses of multiple phenotypes, its power relative to other methods for case-control analyses is not well known. We compare the power of traditional methods with this new distance-based approach, for both locus-scoring and haplotype-scoring strategies. We discuss the relative power of these association methods with respect to five properties: (1) the marker informativity; (2) the number of markers; (3) the causal allele frequency; (4) the preponderance of the most common high-risk haplotype; (5) the correlation between the causal single-nucleotide polymorphism (SNP) and its flanking markers. We found that locus-based logistic regression and the global score test for haplotypes suffered from power loss when many markers were included in the analyses, due to many degrees of freedom. In contrast, the distance-based approach was not as vulnerable to more markers or more haplotypes. A genotype counting measure was more sensitive to the marker informativity and the correlation between the causal SNP and its flanking markers. After examining the impact of the five properties on power, we found that on average, the genomic distance-based regression that uses a matching measure for diplotypes was the most powerful and robust method among the seven methods we compared.
Affiliation(s)
- Wan-Yu Lin
- Institute of Epidemiology, National Taiwan University, Taipei, Taiwan.
8
Yuan A, Yue Q, Apprey V, Bonney G. An extension of the weighted dissimilarity test to association study in families. Hum Genet 2007; 122:83-94. [PMID: 17530290 DOI: 10.1007/s00439-007-0376-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 02/02/2007] [Accepted: 05/07/2007] [Indexed: 10/23/2022]
Abstract
Association studies for complex diseases based on pedigree haplotype or genotype data have received increasing attention in the last few years. Similarity tests are appealing for these studies because they take into account the DNA structure, but they have blind areas in which significant association cannot be detected. Recently, we developed a dissimilarity method for this problem based on independent haplotype data, which eliminates the blind areas of the existing methods. Because DNA data collected on families are common in practice, in the form of either genotypes or haplotypes, we extend our method for association studies to family data. It can be used to evaluate different designs in terms of power. Simulation studies confirmed that the extended method improves the type I error rate and power. Applying this method to the Genetic Analysis Workshop 14 alcoholism data, we find that markers rs716581, rs1017418, rs1332184 and rs1943418 on chromosomes 1, 2, 9 and 18 yield strong signals (with P values of 0.001 or lower) for association with alcoholism. Our work can serve as a guide in the design of association studies in families.
Affiliation(s)
- Ao Yuan
- National Human Genome Center, Department of Community Health and Family Medicine, Howard University, Washington, DC 20059, USA.