26
|
Bergemann TL, Laws RJ, Quiaoit F, Zhao LP. A Statistically Driven Approach for Image Segmentation and Signal Extraction in cDNA Microarrays. J Comput Biol 2004; 11:695-713. [PMID: 15579239 DOI: 10.1089/cmb.2004.11.695] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022] Open
Abstract
The increasing use of cDNA microarrays necessitates the development of methods for extracting quality data. Here, we set forth hurdles to overcome in image analysis of microarrays. We emphasize the importance of objective data extraction methods resulting in reliable signal estimates. Based on statistical principles, we describe a method for automated grid alignment, spot detection, background estimation, flagging, and signal extraction. A software application that we call SignalViewer has been implemented for this method. We identify areas where we improved upon current methods used for array image analysis at each step in the process. Finally, we give examples to illustrate the performance of our algorithms on raw data.
|
27
|
Hsu L, Prentice RL, Zhao LP, Li S. Incorporating age at onset information into the transmission/disequilibrium test. Genet Epidemiol 2001; 21 Suppl 1:S347-52. [PMID: 11793696 DOI: 10.1002/gepi.2001.21.s1.s347] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022]
Abstract
In this paper, we proposed two approaches for incorporating the age at onset into the transmission/disequilibrium test (TDT). Trios (affected offspring and their parents) were extracted from the first four replicate data sets of the general population type. Focusing on chromosome 6 where MG6 and MG7 reside, we compared the usual TDT with the newly proposed tests in terms of gene localization.
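For orientation, the "usual TDT" that the abstract uses as its baseline can be sketched as a McNemar-type chi-square on transmissions from heterozygous parents. This is a minimal illustration of the standard statistic only; it does not reproduce the age-at-onset extensions proposed in the paper, and the function and argument names are hypothetical.

```python
def tdt_statistic(transmitted, untransmitted):
    """McNemar-type TDT chi-square (1 degree of freedom) for one marker allele.

    transmitted:   times heterozygous parents passed the allele to an affected child
    untransmitted: times heterozygous parents passed the other allele instead
    """
    informative = transmitted + untransmitted
    if informative == 0:
        raise ValueError("no heterozygous parents, so the TDT is undefined")
    return (transmitted - untransmitted) ** 2 / informative


if __name__ == "__main__":
    # Illustrative counts only, not data from the paper.
    print(tdt_statistic(60, 40))
```

Under the null hypothesis of no linkage, the statistic is compared with a chi-square distribution with one degree of freedom (3.84 at the 5% level).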
|
28
|
Thomas JG, Olson JM, Tapscott SJ, Zhao LP. An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Res 2001; 11:1227-36. [PMID: 11435405 PMCID: PMC311075 DOI: 10.1101/gr.165101] [Citation(s) in RCA: 219] [Impact Index Per Article: 9.5] [Indexed: 11/25/2022]
Abstract
We have developed a statistical regression modeling approach to discover genes that are differentially expressed between two predefined sample groups in DNA microarray experiments. Our model is based on well-defined assumptions, uses rigorous and well-characterized statistical measures, and accounts for the heterogeneity and genomic complexity of the data. In contrast to cluster analysis, which attempts to define groups of genes and/or samples that share common overall expression profiles, our modeling approach uses known sample group membership to focus on expression profiles of individual genes in a sensitive and robust manner. Further, this approach can be used to test statistical hypotheses about gene expression. To demonstrate this methodology, we compared the expression profiles of 11 acute myeloid leukemia (AML) and 27 acute lymphoblastic leukemia (ALL) samples from a previous study (Golub et al. 1999) and found 141 genes differentially expressed between AML and ALL with a 1% significance at the genomic level. Using this modeling approach to compare different sample groups within the AML samples, we identified a group of genes whose expression profiles correlated with that of thrombopoietin and found that genes whose expression associated with AML treatment outcome lie in recurrent chromosomal locations. Our results are compared with those obtained using t-tests or Wilcoxon rank sum statistics.
|
29
|
Zhao LP, Prentice R, Breeden L. Statistical modeling of large microarray data sets to identify stimulus-response profiles. Proc Natl Acad Sci U S A 2001; 98:5631-6. [PMID: 11344303 PMCID: PMC33264 DOI: 10.1073/pnas.101013198] [Citation(s) in RCA: 98] [Impact Index Per Article: 4.3] [Indexed: 11/18/2022] Open
Abstract
A statistical modeling approach is proposed for use in searching large microarray data sets for genes that have a transcriptional response to a stimulus. The approach is unrestricted with respect to the timing, magnitude or duration of the response, or the overall abundance of the transcript. The statistical model makes an accommodation for systematic heterogeneity in expression levels. Corresponding data analyses provide gene-specific information, and the approach provides a means for evaluating the statistical significance of such information. To illustrate this strategy we have derived a model to depict the profile expected for a periodically transcribed gene and used it to look for budding yeast transcripts that adhere to this profile. Using objective criteria, this method identifies 81% of the known periodic transcripts and 1,088 genes, which show significant periodicity in at least one of the three data sets analyzed. However, only one-quarter of these genes show significant oscillations in at least two data sets and can be classified as periodic with high confidence. The method provides estimates of the mean activation and deactivation times, induced and basal expression levels, and statistical measures of the precision of these estimates for each periodic transcript.
|
30
|
Zhao LP, Aragaki C, Hsu L, Potter J, Elston R, Malone KE, Daling JR, Prentice R. Integrated designs for gene discovery and characterization. J Natl Cancer Inst Monogr 2000:71-80. [PMID: 10854489 DOI: 10.1093/oxfordjournals.jncimonographs.a024229] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022] Open
Abstract
Recent advances, including near completion of the human genome map, ever improving high-throughput technologies, and successes in discovering chronic disease-related genes, have stimulated the further development of genetic epidemiology. The primary mission of genetic epidemiology is to discover and characterize genes, whether independent of or interactive with environmental factors, that cause human diseases. To accomplish such a mission, genetic epidemiology needs to integrate both genetic and epidemiologic approaches. One of the challenges facing such an integrated approach is the identification of study designs that are efficient for both gene discovery and characterization. Because designs for gene discovery alone and designs for gene characterization alone have been elaborated in the other two panels, the focus of this paper is to describe those designs that may be useful for discovery and characterization jointly, including case-family and case-control-family designs. Examples of integrated designs are described, and studies of breast cancer conducted at the Fred Hutchinson Cancer Research Center are used for illustration. Finally, related analytic issues are also discussed.
|
31
|
Zhao LP, Hsu L, Davidov O, Potter J, Elston RC, Prentice RL. Population-based family study designs: an interdisciplinary research framework for genetic epidemiology. Genet Epidemiol 1997; 14:365-88. [PMID: 9271710 DOI: 10.1002/(sici)1098-2272(1997)14:4<365::aid-gepi3>3.0.co;2-2] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.2] [Indexed: 02/05/2023]
Abstract
Most complex traits such as cancer and coronary heart diseases are attributed either to heritable factors or to environmental factors or to both. Dissecting the genetic and environmental etiology of complex traits thus requires an interdisciplinary research strategy. Genetic studies generally involve families and investigate familial aggregations of traits, segregation of major disease genes, and locations of disease genes on the human genome, the latter of which can be identified via linkage analysis. Epidemiologic studies often use population-based case-control studies to establish the role of specific environmental factors. Integrating both objectives, genetic epidemiology is to assess the associations of environmental factors with disease status, to quantify the aggregation of cases within families, to characterize putative disease genes via segregation analysis, and to localize disease genes via linkage analysis with genetic markers. To accomplish these objectives through designed studies, we propose a class of population-based family study designs, which are formed by choosing among sampling designs at three stages. The objectives of sampling at these three stages are 1) combined aggregation and association analysis, 2) combined segregation, aggregation, and association analysis, and 3) combined linkage, segregation, aggregation, and association analysis. These designs form an interdisciplinary research framework for genetic epidemiology. Our preliminary exploration of this framework and related analytic methods indicates that population-based family study designs retain the efficiency of linkage analysis for localizing disease genes without losing the property of being population-based, and they will therefore allow an assessment of a joint contribution of genetic and environmental factors to complex traits.
|
32
|
Holte S, Quiaoit F, Hsu L, Davidov O, Zhao LP. A population based family study of a common oligogenic disease--Part I: Association/aggregation analysis. Genet Epidemiol 1997; 14:803-7. [PMID: 9433581 DOI: 10.1002/(sici)1098-2272(1997)14:6<803::aid-gepi40>3.0.co;2-r] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 02/05/2023]
|
33
|
Abstract
We used segregation analysis to investigate the genetic etiology of the disease in Problem 2A. Under the assumption of a dominant major gene, our analysis suggests a major gene with relative risk of 58 and an allele frequency of 0.013. Under an additive gene assumption, it appears that there may be two genes with relative risks of 39 and 17 and allele frequencies of 0.015 and 0.075, respectively.
|
34
|
Hsu L, Zhao LP, Aragaki C. A note on a conditional-likelihood approach for family-based association studies of candidate genes. Hum Hered 2000; 50:194-200. [PMID: 10686500 DOI: 10.1159/000022914] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022] Open
Abstract
The family-based association study design is a variation of the case-control study design, where unaffected family members instead of unrelated subjects are sampled as controls. This variation is useful in assessing the effects of candidate genes on disease, because it avoids false associations caused by admixture of populations. A complication of this design is that because of an inherited genotypic correlation among family members, the genotypic distributions between cases and relative controls may be distorted by the ascertainment criteria of families, which could involve not only cases and relative controls, but also other relatives. Analyzing such data naively may lead to biased estimates of relative risk. In this note, we will discuss the consistency of a conditional-likelihood approach. We show analytically that maximum conditional-likelihood estimators are consistent for the true relative risks, if genotypes for family members are exchangeable under the sampling process, for example, sibling clusters. Besides being straightforward conceptually and computationally, this approach is robust to ascertainment bias and naturally accommodates genetic heterogeneity across families.
|
35
|
Abstract
Decreased age at onset in successive generations has been observed for a number of diseases. Two nonparametric matched and unmatched test statistics are proposed, taking into account not only current age or age at death for unaffected individuals and age at disease onset for affected individuals, but also possible correlations among family members. Both are asymptotically normal with readily estimated variances from the data. A simulation study is conducted to compare the proposed tests with the commonly used paired t-test and log-rank test. It has been shown that the proposed test statistics yield valid conclusions in assessing genetic anticipation under all situations considered. However, the paired t-test is valid only when the censoring distributions are comparable between two generations, whereas the log-rank test is valid when the correlation among family members is weak. As expected, the matched test is most powerful when the data are heterogeneous, and the unmatched and the log-rank tests are most powerful when the data are homogeneous and the correlation is weak. Lastly, a population-based family study of breast cancer conducted at the Fred Hutchinson Cancer Research Center is used for illustration of the proposed and the log-rank tests. The preliminary analysis suggests that there appears a decreased age at onset over the successive generations in breast cancer.
|
36
|
Aragaki C, Quiaoit F, Hsu L, Zhao LP. Mapping alcoholism genes using linkage/linkage disequilibrium analysis. Genet Epidemiol 1999; 17 Suppl 1:S43-8. [PMID: 10597410 DOI: 10.1002/gepi.1370170708] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Indexed: 11/08/2022]
Abstract
Using a recently developed semiparametric method for combined linkage/linkage-disequilibrium analysis, we analyzed the Collaborative Study on the Genetics of Alcoholism data subset developed for Genetic Analysis Workshop 11 (GAW11). This semiparametric approach estimates recombination fractions for linkage, marker log odds ratios for linkage-disequilibrium, their product for combined linkage/linkage-disequilibrium, and corresponding z-scores. We used two outcomes: alcohol dependence and "alcoholism-free" and a genome-wide significance level of 4.1 (which corresponds to a genome-wide lod score of 3.6). For the alcohol dependence outcome, we observed significant linkage signals at D1S1588-D1S1631, D1S547, D2S399, D2S425, D4S2361, D7S1796, and D7S1824. We also found significant linkage-disequilibrium signals at D1S547 and D7S1795. For the "alcoholism-free" outcome, we found significant linkage signals at D4S2457, D41651 (both flank ADH3), D11S2359, and D16S47 and significant linkage-disequilibrium signals at D4S2361, FABP2, D11S2359, D19S431 and D19S47-D19S198-D19S601.
|
37
|
Hsu L, Aragaki C, Quiaoit F, Wang X, Xu X, Zhao LP. A genome-wide scan for a simulated data set using two newly developed methods. Genet Epidemiol 1999; 17 Suppl 1:S621-6. [PMID: 10597503 DOI: 10.1002/gepi.13701707101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Abstract
A genome-wide scan of a simulated data set for fictitious disease genes was conducted using both semiparametric and nonparametric methods. The semiparametric model-based method, which tests for linkage/linkage disequilibrium separately and together, correctly identified all three underlying disease loci along with two false positives through the linkage analysis. However, the nonparametric model-free method which tests combined linkage/linkage disequilibrium, failed to yield any results due to the lack of linkage disequilibrium information in the data.
|
38
|
Zarbl H, Aragaki C, Zhao LP. An efficient protocol for rare mutation genotyping in a large population. Genet Test 1998; 2:315-21. [PMID: 10464610 DOI: 10.1089/gte.1998.2.315] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022]
Abstract
We introduce a method to efficiently detect rare mutations for individual subjects in a large population by pooling samples and retesting subgroups of positive pooled samples. We conducted computer simulations of this method and discovered that it seems efficient for mutation prevalences less than 0.1, regardless of the number of samples. The simulations also indicate that splitting the pooled samples into three to five subgroups at each level is optimal. The expected number of necessary tests and relative efficiency of this method are given, by mutation prevalence and sample size.
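The pool-and-retest idea can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' protocol: the assay is taken to be error-free (a pool tests positive exactly when it contains at least one carrier), every positive pool is split into at most `k` roughly equal subgroups, and the names `count_tests` and `mean_tests` are hypothetical.

```python
import random


def count_tests(samples, k):
    """Number of assays needed to classify every sample in one pool.

    samples: list of booleans, True for a mutation carrier
    k:       number of subgroups a positive pool is split into (k >= 2)
    """
    tests = 1  # one assay on the pooled material
    if len(samples) == 1 or not any(samples):
        return tests  # a single sample, or a negative pool, needs no retest
    size = -(-len(samples) // k)  # ceiling division: subgroup size
    for start in range(0, len(samples), size):
        tests += count_tests(samples[start:start + size], k)
    return tests


def mean_tests(n, prevalence, k, reps=200, seed=0):
    """Average number of assays over simulated populations of n samples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(reps):
        samples = [rng.random() < prevalence for _ in range(n)]
        total += count_tests(samples, k)
    return total / reps


if __name__ == "__main__":
    n = 81
    for k in (2, 3, 5):
        avg = mean_tests(n, prevalence=0.01, k=k)
        print(f"k={k}: mean tests {avg:.1f} vs {n} individual tests")
```

Dividing the mean number of tests by `n` gives a relative efficiency that can be compared across prevalences and subgroup counts, in the spirit of the simulations the abstract describes.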
|
39
|
Zhao LP, Quiaoit F, Aragaki C, Hsu L. An efficient, robust and unified method for mapping complex traits (III): combined linkage/linkage-disequilibrium analysis. Am J Med Genet 1999; 84:433-53. [PMID: 10360398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/13/2023]
Abstract
Extending the method for linkage analysis [Zhao et al., 1998a: Am. J. Med. Genet. 77:366-383; 1998b: Am. J. Med. Genet. 79:49-61], this article describes a method for the linkage-disequilibrium analysis, and for combining linkage and linkage-disequilibrium analyses. As highly dense markers are increasingly used in genome scans, one or more markers are not only linked with the disease genes if they exist, but also likely in linkage-disequilibrium with those putative genes. Hence, linkage-disequilibrium analysis potentially offers additional information about positions of putative disease genes. Combining both linkage and linkage-disequilibrium signals, this approach is able to improve positional signals. As before, the proposed method is a model-based approach, but semiparametric via the estimating equation technique. Under the assumptions of penetrance and allele frequency, this method efficiently estimates recombination fractions for linkage analysis and odds ratios for linkage-disequilibrium analysis. As described in two previous papers, this method is relatively more robust than the lod score methods, since it requires weaker assumption than conditional independence. While the estimated recombination fractions are used for inference as part of linkage analysis, the estimated odds ratios are used for linkage-disequilibrium inference and combined linkage, and linkage-disequilibrium parameters can be used to test combined linkage/linkage-disequilibrium analysis. This approach has been implemented, named gSCAN, and its compiled version is available for trial on request via the web site (http:/lynx.fhcrc.org/qge). We applied this new approach to affected sib-pair data collected for the genome scan to localize type 1 diabetes genes. Under an assumed autosomal dominant gene model, the linkage analysis confirms an earlier suggestion of one major gene around D6S281. 
Interestingly, the linkage-disequilibrium analysis suggests several additional signals around D6S250, GATA30, D6S311, D6S441, D6S442, D6S415, D6S411, D6S305, and a290xh9. The linkage analysis, on the other hand, suggests a signal around D6S281, while providing supporting evidence for several other marker loci. However, the combined analysis did not provide strong support for any of the findings, implying that linkage and linkage-disequilibrium findings are not consistent.
|
40
|
Zhao LP, Prentice R, Shen F, Hsu L. On the assessment of statistical significance in disease-gene discovery. Am J Hum Genet 1999; 64:1739-53. [PMID: 10330362 PMCID: PMC1377918 DOI: 10.1086/512072] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Indexed: 11/03/2022] Open
Abstract
One of the major challenges facing genome-scan studies to discover disease genes is the assessment of the genomewide significance. The assessment becomes particularly challenging if the scan involves a large number of markers collected from a relatively small number of meioses. Typically, this assessment has two objectives: to assess genomewide significance under the null hypothesis of no linkage and to evaluate true-positive and false-positive prediction error rates under alternative hypotheses. The distinction between these goals allows one to formulate the problem in the well-established paradigm of statistical hypothesis testing. Within this paradigm, we evaluate the traditional criterion of LOD score 3.0 and a recent suggestion of LOD score 3.6, using the Monte Carlo simulation method. The Monte Carlo experiments show that the type I error varies with the chromosome length, with the number of markers, and also with sample sizes. For a typical setup with 50 informative meioses on 50 markers uniformly distributed on a chromosome of average length (i.e., 150 cM), the use of LOD score 3.0 entails an estimated chromosomewide type I error rate of .00574, leading to a genomewide significance level >.05. In contrast, the corresponding type I error for LOD score 3.6 is .00191, giving a genomewide significance level of slightly <.05. However, with a larger sample size and a shorter chromosome, a LOD score between 3.0 and 3.6 may be preferred, on the basis of proximity to the targeted type I error. In terms of reliability, these two LOD-score criteria appear not to have appreciable differences. These simulation experiments also identified factors that influence power and reliability, shedding light on the design of genome-scan studies.
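The step from a chromosomewide to a genomewide error rate can be checked with a short calculation. The abstract does not state how the authors extrapolated, so this sketch assumes 23 independent chromosomes of the average length, each scanned at the same chromosomewide rate; the function name and the default of 23 are assumptions made for illustration.

```python
def genomewide_level(chromosomewide_alpha, n_chromosomes=23):
    """Chance of at least one false positive across independent chromosomes.

    Assumes every chromosome has the same type I error rate and that the
    scans are independent, which is a simplification.
    """
    return 1.0 - (1.0 - chromosomewide_alpha) ** n_chromosomes


if __name__ == "__main__":
    for lod, alpha in ((3.0, 0.00574), (3.6, 0.00191)):
        level = genomewide_level(alpha)
        print(f"LOD {lod}: chromosomewide {alpha}, genomewide about {level:.3f}")
```

Under these assumptions the LOD 3.0 criterion lands above .05 genomewide and the LOD 3.6 criterion slightly below it, which matches the direction of the figures reported in the abstract.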
|
41
|
Zhao LP, Quiaoit F, Aragaki C, Hsu L. An efficient, robust, and unified method for mapping complex traits (II): multipoint linkage analysis. Am J Med Genet 1998; 79:48-61. [PMID: 9738869 DOI: 10.1002/(sici)1096-8628(19980827)79:1<48::aid-ajmg12>3.0.co;2-m] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 01/24/2023]
Abstract
Extending the method for two-point linkage analysis [Zhao et al., 1998: Am J Med Genet 77:366-383], this paper introduces a semiparametric method for multipoint linkage analysis, expected to gain efficiency by using multiple markers simultaneously. Overcoming the longstanding statistical and computational challenge to the parametric approaches (or lod score methods) for multipoint linkage analysis, this semiparametric approach, based on the estimating equation technique, yields statistically efficient and yet robust estimates and enjoys the computational efficiency in processing multiple markers from large pedigrees. Its computational burden increases linearly with the sizes of pedigrees and with the number of marker loci. To illustrate this semiparametric method, we apply it to marker data gathered for the Breast Cancer Consortium. The result supports the earlier finding of the positive linkage with BRCA1 and has also shown that the multipoint linkage analysis has an improved power. In addition, we have applied this method to analyze genome scanning data that have been used to localize genes responsible for type 1 diabetes. In support of the earlier findings, the genome scanning detects the linkage signals on chromosome 6 but does not support the earlier suggestions of two major genes in that genome segment. Through sensitivity analysis, it appears that the results are robust to misspecification of penetrance and allele frequency.
|
42
|
Offermanns S, Zhao LP, Gohla A, Sarosi I, Simon MI, Wilkie TM. Embryonic cardiomyocyte hypoplasia and craniofacial defects in G alpha q/G alpha 11-mutant mice. EMBO J 1998; 17:4304-12. [PMID: 9687499 PMCID: PMC1170764 DOI: 10.1093/emboj/17.15.4304] [Citation(s) in RCA: 187] [Impact Index Per Article: 7.2] [Indexed: 01/08/2023] Open
Abstract
Heterotrimeric G proteins of the Gq class have been implicated in signaling pathways regulating cardiac growth under physiological and pathological conditions. Knockout mice carrying inactivating mutations in both of the widely expressed G alpha q class genes, G alpha q and G alpha 11, demonstrate that at least two active alleles of these genes are required for extrauterine life. Mice carrying only one intact allele [G alpha q(-/+);G alpha 11(-/-) or G alpha q(-/-);G alpha 11(-/+)] died shortly after birth. These mutants showed a high incidence of cardiac malformation. In addition, G alpha q(-/-);G alpha 11(-/+) newborns suffered from craniofacial defects. Mice lacking both G alpha q and G alpha 11 [G alpha q(-/-);G alpha 11(-/-)] died at embryonic day 11 due to cardiomyocyte hypoplasia. These data demonstrate overlap in G alpha q and G alpha 11 gene functions and indicate that the Gq class of G proteins plays a crucial role in cardiac growth and development.
|
43
|
Zhao LP, Aragaki C, Hsu L, Quiaoit F. Mapping of complex traits by single-nucleotide polymorphisms. Am J Hum Genet 1998; 63:225-40. [PMID: 9634510 PMCID: PMC1377233 DOI: 10.1086/301909] [Citation(s) in RCA: 47] [Impact Index Per Article: 1.8] [Indexed: 11/03/2022] Open
Abstract
Molecular geneticists are developing the third-generation human genome map with single-nucleotide polymorphisms (SNPs), which can be assayed via chip-based microarrays. One use of these SNP markers is the ability to locate loci that may be responsible for complex traits, via linkage/linkage-disequilibrium analysis. In this communication, we describe a semiparametric method for combined linkage/linkage-disequilibrium analysis using SNP markers. Asymptotic results are obtained for the estimated parameters, and the finite-sample properties are evaluated via a simulation study. We also applied this technique to a simulated genome-scan experiment for mapping a complex trait with two major genes. This experiment shows that separate linkage and linkage-disequilibrium analyses correctly detected the signals of both major genes; but the rates of false-positive signals seem high. When linkage and linkage-disequilibrium signals were combined, the analysis yielded much stronger and clearer signals for the presence of two major genes than did two separate analyses.
|
44
|
Zhao LP, Quiaoit F, Hsu L, Aragaki C. Efficient, robust, and unified method for mapping complex traits (I): two-point linkage analysis. Am J Med Genet 1998; 77:366-83. [PMID: 9632166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/07/2023]
Abstract
The completion of a preliminary human genome map and development of molecular methods have enabled researchers to assay a large number of polymorphic markers that are evenly spaced along the entire human genome. Among many applications, marker data are valuable for mapping complex traits through linkage or linkage-disequilibrium analysis, the former of which is the focus of this paper, the first in a series on this subject. Formalizing the concept and computation for linkage analysis, Elston and Stewart [1971; Human Heredity 21:523-542] introduced a likelihood function to capture relevant genetic information and a recursive algorithm for computing the likelihood function. However, the computing burden is prohibitive in processing complex pedigrees. Since that fundamental development, improving the computational algorithm and extending the method has been a dynamic area of research. The primary objective of this communication is to introduce a semiparametric method for linkage analysis. It is a particularly suitable approach with desirable properties for mapping complex traits that may be binary, continuous, and partially observed (i.e., censored). It incorporates candidate genes, environmental factors, and their interactions with the putative gene and is expected to be robust and efficient in comparison with likelihood-based methods. The properties of the estimates have been studied in finite samples with a limited simulation study. This method is illustrated with an application to family data contributed to the Breast Cancer Consortium.
|
45
|
Le Marchand L, Zhao LP, Quiaoit F, Wilkens LR, Kolonel LN. Family history and risk of colorectal cancer in the multiethnic population of Hawaii. Am J Epidemiol 1996; 144:1122-8. [PMID: 8956624 DOI: 10.1093/oxfordjournals.aje.a008890] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.1] [Indexed: 02/03/2023] Open
Abstract
Increased risk of colorectal cancer in individuals with family history of the disease has been observed consistently in past studies. However, limited attention has been given to the influence of ethnicity, the characteristics of the proband's tumor, and kinship. A population-based case-control study was conducted between 1987 and 1991 in Hawaii among 1,192 incident colorectal cancer cases and 1,192 sex-, age-, and ethnicity-matched population controls. The study identified 7,673 relatives for the cases and 7,823 relatives for the controls. With an estimating equation-based regression method, relatives of cases were found to have a 2.5-fold increased risk of colorectal cancer compared with relatives of controls (95% confidence interval (CI) 1.8-3.4) after adjustment for covariates. This increase in risk was greater for Japanese (odds ratio (OR) = 3.0, 95% CI 1.7-5.4) than Caucasians (OR = 1.8, 95% CI 1.2-2.9), for siblings (OR = 3.1, 95% CI 2.1-4.6) than parents (OR = 2.0, 95% CI 1.1-3.1), and when the index patient was diagnosed before the age of 55 years (OR = 4.1, 95% CI 2.1-8.0) with multiple tumors (OR = 9.5, 95% CI 4.4-20.6), with a distant stage (OR = 4.6, 95% CI 2.7-7.8), or with cancer of the right colon (OR = 3.0, 95% CI 2.0-4.4) or the rectum (OR = 3.0, 95% CI 1.8-4.8). The increase in risk was not affected by the relative's sex. Relatives of cases were not at increased risk for other common cancers. It is estimated that approximately 11.1% and 6.5% of colorectal cancers are attributable to a first degree family history of the disease for Japanese and Caucasians, respectively. These data and those of previous studies strongly suggest that individuals with a family history of colorectal cancer in a first degree relative are at increased risk for the disease and should receive regular diagnostic screening. 
Characteristics of the index case, such as age and stage at diagnosis, subsite and number of tumors, and race, as well as kinship, may be important in assessing the colorectal cancer risk of a relative.
|
46
|
Zhao LP, Lipsitz S, Lew D. Regression analysis with missing covariate data using estimating equations. Biometrics 1996; 52:1165-82. [PMID: 8962448] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/03/2023]
Abstract
In regression analysis, missing covariate data has been among the most common problems. Frequently, practitioners adopt the so-called complete-case analysis, i.e., performing the analysis on only a complete dataset after excluding records with missing covariates. Performing a complete-case analysis is convenient with existing statistical packages, but it may be inefficient since the observed outcomes and covariates on those records with missing covariates are not used. It can even give misleading statistical inference if the data are not missing completely at random. This paper introduces a joint estimating equation (JEE) for regression analysis in the presence of missing observations on one covariate, which may be thought of as a method in a general framework for the missing covariate data problem proposed by Robins, Rotnitzky, and Zhao (1994, Journal of the American Statistical Association 89, 846-866). A generalization of JEE to more than one such covariate is discussed. The JEE is generally applicable to estimating regression coefficients from a regression model, including linear and logistic regression. Provided that the missing covariate data is either missing completely at random or missing at random (in addition to mild regularity conditions), estimates of regression coefficients from the JEE are consistent and have an asymptotic normal distribution. Simulation results show that the asymptotic distribution of estimated coefficients performs well in finite samples. Also shown through the simulation study is that the validity of JEE estimates depends on the correct specification of the probability function that characterizes the missing mechanism, suggesting a need for further research on how to make the estimation robust to this nuisance assumption. Finally, the JEE is illustrated with an application from a case-control study of diet and thyroid cancer.
|
47
|
Zhao LP, Koslovsky JS, Reinhard J, Bähler M, Witt AE, Provance DW, Mercer JA. Cloning and characterization of myr 6, an unconventional myosin of the dilute/myosin-V family. Proc Natl Acad Sci U S A 1996; 93:10826-31. [PMID: 8855265 PMCID: PMC38240 DOI: 10.1073/pnas.93.20.10826] [Citation(s) in RCA: 71] [Impact Index Per Article: 2.5] [Indexed: 02/02/2023] Open
Abstract
We have isolated cDNAs encoding a second member of the dilute (myosin-V) unconventional myosin family in vertebrates, myr 6 (myosin from rat 6). Expression of myr 6 transcripts in the brain is much more limited than is the expression of dilute, with highest levels observed in choroid plexus and components of the limbic system. We have mapped the myr 6 locus to mouse chromosome 18 using an interspecific backcross. The 3' portion of the myr 6 cDNA sequence from rat is nearly identical to that of a previously published putative glutamic acid decarboxylase from mouse [Huang, W.M., Reed-Fourquet, L., Wu, E. & Wu, J.Y. (1990) Proc. Natl. Acad. Sci. USA 87, 8491-8495].
|
48
|
Zhao LP, Kristal AR, White E. Estimating relative risk functions in case-control studies using a nonparametric logistic regression. Am J Epidemiol 1996; 144:598-609. [PMID: 8797520 DOI: 10.1093/oxfordjournals.aje.a008970] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.3] [Indexed: 02/02/2023] Open
Abstract
The authors describe an approach to the analysis of case-control studies in which the exposure variables are continuous, i.e., quantitative variables, and one wishes neither to categorize levels of the exposure variable nor to assume a log-linear relation between level of exposure and disease risk. A dose-response association of an exposure variable with a disease outcome can be depicted by estimated relative risks at various exposure levels, and the functional relation between exposure dose and disease risk is here termed a relative risk function (RRF). An RRF takes values that are greater than zero: values less than one imply lower risk, the value one implies no difference in risk, and values greater than one imply increased risk, when compared with a reference value. The authors describe how a nonparametric logistic regression can be used to estimate and display these RRFs. Using data from a previously published case-control study of diet and colon cancer, RRFs for total energy, dietary fiber, and alcohol intakes are compared with the original results obtained from using categorized levels of exposure variables. For total energy and alcohol intakes, there were meaningful differences in study results based on the two analytic approaches. For energy, the nonparametric logistic regression detected a significant protective effect of low intakes, which was not found in the original analysis. For alcohol, the nonparametric logistic regression suggested that there were two underlying populations, non- or very light drinkers and moderate to heavy drinkers, with different relations of dose to disease risk. In contrast, the original analysis found a nonlinear increase in risk across intake categories and did not detect the complex, bimodal nature of the exposure distribution. These results demonstrate that nonparametric logistic regression can be a useful approach to displaying and interpreting results of case-control studies.
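As an illustration of the RRF definition in this abstract, the sketch below converts a log-odds curve into a relative risk function by comparing each exposure level with a reference level, RRF(x) = exp(f(x) - f(x_ref)). This is an assumption-laden sketch, not the authors' implementation: it treats the odds ratio from a logistic model as the relative risk (the usual case-control approximation), and `example_log_odds` is a hypothetical smooth curve standing in for a fitted nonparametric model.

```python
import math
from typing import Callable


def relative_risk_function(
    log_odds: Callable[[float], float], reference: float
) -> Callable[[float], float]:
    """Build RRF(x) = exp(log_odds(x) - log_odds(reference)).

    The result is always greater than zero and equals one at the
    reference exposure level.
    """
    base = log_odds(reference)

    def rrf(x: float) -> float:
        return math.exp(log_odds(x) - base)

    return rrf


def example_log_odds(x: float) -> float:
    # Hypothetical U-shaped log-odds curve (not taken from the paper).
    return -1.0 + 0.02 * (x - 10.0) ** 2


rrf = relative_risk_function(example_log_odds, reference=10.0)
print(rrf(10.0))  # reference level: relative risk of one
print(rrf(15.0))  # above one: increased risk relative to the reference
```

Because the function only needs a callable that returns log-odds, any smoother could in principle supply `log_odds`; the choice of reference level determines where the curve crosses one.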
|
49
|
Liu SL, Rodrigo AG, Shankarappa R, Learn GH, Hsu L, Davidov O, Zhao LP, Mullins JI. HIV quasispecies and resampling. Science 1996; 273:415-6. [PMID: 8677432 DOI: 10.1126/science.273.5274.415] [Citation(s) in RCA: 146] [Impact Index Per Article: 5.2] [Indexed: 02/01/2023]
|
50
|
Hsu L, Zhao LP. Assessing familial aggregation of age at onset, by using estimating equations, with application to breast cancer. Am J Hum Genet 1996; 58:1057-71. [PMID: 8651267 PMCID: PMC1914617] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023] Open
Abstract
In genetic research of chronic diseases, age-at-onset outcomes within families are often correlated. The nature of correlation of age-at-onset outcomes is indicative of common genetic and/or shared environmental risk factors among family members. Understanding patterns of such correlation may shed light on the disease etiology and, hence, is an important step to take prior to further searching for the responsible genes via segregation and linkage studies. Age-at-onset outcomes are different from those familiar quantitative or qualitative traits for which many statistical methods have been developed. In comparison with the quantitative traits, age-at-onset outcomes are often censored, i.e., instead of actual age-at-onset outcomes, only the current ages or ages at death are observed. They are also different from qualitative traits because of their continuity. Because of the complexity of correlated censored outcomes, few methods have yet been developed. A traditional approach is to impose a parametric joint distribution for the correlated age-at-onset outcomes, which has been criticized for requiring a stringent assumption about the entire distribution of age at onset. The purpose of this paper is to describe a method for assessing familial aggregation of correlated age-at-onset outcomes semiparametrically, by use of estimating equations. This method does not require any parametric assumption for modeling the age at onset. The estimates of parameters, including those quantifying the correlation within families, are consistent and have an asymptotic normal distribution that can be used to make inferences. To illustrate this new method, we analyzed two age-at-onset data sets that were obtained from studies conducted in the States of Washington and Hawaii, with the objective of quantifying the familial aggregations of age at onset of breast cancer.
|