Goodness-of-fit tests for logistic regression models when data are collected using a complex sampling design

https://doi.org/10.1016/j.csda.2006.07.006

Abstract

Logistic regression models are frequently used in epidemiological studies for estimating the associations between demographic, behavioral, and risk factor variables and a dichotomous outcome, such as disease being present versus absent. After the coefficients in a logistic regression model have been estimated, goodness-of-fit of the resulting model should be examined, particularly if the purpose of the model is to estimate probabilities of event occurrences. While various goodness-of-fit tests have been proposed, the properties of these tests have been studied under the assumption that the selected observations were independent and identically distributed. Increasingly, epidemiologists are using large-scale sample survey data, such as the National Health Interview Survey or the National Health and Nutrition Examination Survey, when fitting logistic regression models. Unfortunately, for such situations no goodness-of-fit testing procedures have been developed or implemented in available software. To address this problem, goodness-of-fit tests for logistic regression models when data are collected using complex sampling designs are proposed. Properties of the proposed tests were examined using extensive simulation studies and results were compared to traditional goodness-of-fit tests. A Stata ado function svylogitgof for estimating the F-adjusted mean residual test after svylogit fit is available at the author's website http://www.people.vcu.edu/~kjarcher/Research/Data.htm.

Introduction

Logistic regression is frequently used in epidemiological studies to model the relationship between a categorical outcome variable and a set of predictor variables. Traditionally, logistic regression assumes that the observations represent a random sample from a population (i.e., independent and identically distributed (iid)), where the model is expressed as

$$y_i = \pi(x_i) + \varepsilon_i.$$

In this equation, $y_i$ represents the dichotomous dependent or outcome variable; $\pi(x_i)$ represents the conditional probability of experiencing the event given independent predictor variables $x_i$, or $\Pr(Y_i = 1 \mid x_i)$; and $\varepsilon_i$ represents the binomial random error term. More formally, the conditional probability $\pi(x_i)$ as a function of the independent covariates $x_i$ is expressed as

$$\pi(x_i) = \Pr(Y_i = 1 \mid x_i) = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}},$$

where $\beta = (\beta_0, \beta_1, \beta_2, \ldots, \beta_p)'$ are the model parameters to be estimated and $p$ is the number of independent terms in the model.
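As a minimal illustration of the logistic form above, the following Python sketch evaluates the conditional probability for a single observation. The function name `logistic_probability`, the coefficient values, and the covariate values are hypothetical; they were chosen only so that the linear predictor equals zero.

```python
import math


def logistic_probability(x, beta):
    """Return pi(x) = exp(eta) / (1 + exp(eta)), where eta = beta[0] + sum(beta[k] * x[k-1]).

    beta[0] is the intercept; x holds the p covariate values for one observation.
    """
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    # Use the algebraically equivalent form that avoids overflow for large |eta|.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


# Hypothetical coefficients (intercept, beta_1, beta_2) and covariates.
beta = [-1.0, 0.5, 2.0]
x = [2.0, 0.0]
p = logistic_probability(x, beta)  # linear predictor is -1 + 0.5*2 + 2*0 = 0
```

Because the linear predictor is zero here, the estimated probability is one half, the midpoint of the logistic curve.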

Under iid-based sampling, elements are selected independently; therefore, the covariance between elements is zero. Under complex sampling, there may be a number of primary sampling units (PSUs), that is, there are $j = 1, \ldots, M$ PSUs (or “clusters”) from which $m$ PSUs are sampled. Furthermore, within each sampled PSU there are $i = 1, \ldots, N_j$ units from which $n_j$ are sampled. A disadvantage generally associated with cluster sampling is that elements from the same cluster are often more homogeneous than elements from different clusters. This results in a positive covariance between elements within a cluster. Therefore, the intra-class correlation, which measures the homogeneity within clusters, is generally positive for cluster sample designs, and as a result, traditional maximum likelihood methods for estimation cannot be used. Rather, under complex sampling, which involves both stratification and possibly several stages of cluster sampling, pseudo-maximum likelihood is used (Skinner et al., 1989). The sampling weight, $w_{ji}$, calculated as the inverse of the product of the conditional inclusion probabilities at each stage of sampling, represents the number of units that the given sampled observation represents in the total population. Expanding each observation by its sampling weight will produce a dataset for the $N$ units in the total population. Conceptually, pseudo-maximum likelihood estimation is like obtaining the maximum likelihood estimates for the expanded dataset. In other words, the logistic regression model is being fit to the ‘census’ data. The model parameters $\beta$ for logistic regression models built from complex survey data are found by using pseudo-maximum likelihood.
The contribution of a single observation using pseudo-maximum likelihood is

$$\pi(x_{ji})^{w_{ji} \times y_{ji}} \left[1 - \pi(x_{ji})\right]^{w_{ji} \times (1 - y_{ji})}.$$

The pseudo-maximum likelihood function is still constructed as the product of the individual contributions to the likelihood, but now it is the product over the $m$ clusters sampled and the $n_j$ observations within the given cluster, expressed as

$$L_p(\beta) = \prod_{j=1}^{m} \prod_{i=1}^{n_j} \pi(x_{ji})^{w_{ji} \times y_{ji}} \left[1 - \pi(x_{ji})\right]^{w_{ji} \times (1 - y_{ji})}.$$

The pseudo-maximum likelihood estimator (PMLE) is the value of $\beta$ that maximizes the pseudo log-likelihood function

$$\ln L_p(\beta) = \sum_{j=1}^{m} \sum_{i=1}^{n_j} \left\{ w_{ji} \times y_{ji} \times \ln \pi(x_{ji}) + w_{ji} \times (1 - y_{ji}) \times \ln\left[1 - \pi(x_{ji})\right] \right\}.$$
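The weighted pseudo log-likelihood can be sketched in a few lines of Python. The sketch below only evaluates the function at a given coefficient vector; it does not maximize it and makes no attempt at design-based variance estimation. The data layout (a list of clusters, each a list of `(w, y, x)` tuples), the function names, and all numeric values are assumptions made for illustration. With integer weights, the sketch also checks the ‘census’ interpretation: the weighted value equals the ordinary log-likelihood of the dataset in which each observation is repeated as many times as its weight.

```python
import math


def pi(x, beta):
    """Logistic probability for covariates x; beta[0] is the intercept."""
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-eta))


def sampling_weight(inclusion_probs):
    """Inverse of the product of the conditional inclusion probabilities."""
    product = 1.0
    for q in inclusion_probs:
        product *= q
    return 1.0 / product


def pseudo_log_likelihood(clusters, beta):
    """Sum over clusters j and observations i of
    w*y*ln(pi) + w*(1-y)*ln(1-pi)."""
    total = 0.0
    for cluster in clusters:
        for w, y, x in cluster:
            p = pi(x, beta)
            total += w * y * math.log(p) + w * (1 - y) * math.log(1.0 - p)
    return total


# Hypothetical two-stage design: PSU selected with probability 0.5,
# unit selected within the PSU with probability 0.25.
w_example = sampling_weight([0.5, 0.25])

# Hypothetical sample: two clusters, integer weights, one covariate.
clusters = [[(2, 1, [0.5]), (3, 0, [1.5])], [(1, 1, [-0.2])]]
beta = [0.3, -0.7]
weighted = pseudo_log_likelihood(clusters, beta)

# "Census" dataset: repeat each observation w times with unit weight.
expanded = [[(1, y, x) for cluster in clusters for w, y, x in cluster for _ in range(w)]]
census = pseudo_log_likelihood(expanded, beta)
```

In this toy example `weighted` and `census` agree up to floating-point rounding, which is the sense in which the model is fit to the expanded population data.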

The survey sampling design may induce correlation among observations, particularly when cluster samples are drawn. To appropriately estimate standard errors associated with model parameters and estimated odds ratios, it is important to account for the sampling design.

The need to account for the sampling design in the statistical analysis of survey data has been widely reported in the literature. A brief tutorial regarding the importance of accounting for clustering and sampling weights, accompanied by an illustration using the National Health and Nutrition Examination Survey I data, has previously been reported (Korn and Graubard, 1991). A more comprehensive review was subsequently provided by Korn and Graubard (1995). In another example, the difference between “model-based” analyses (which assume the observations are from a random sample) and “design-based” analyses (which account for the survey design) was illustrated using the Personnes Agées Quid study, a stratified cluster sample (Lemeshow et al., 1998). It is of particular importance to model the survey design when estimating standard errors associated with model parameters or odds ratios.

Once a logistic regression model has been fit to a given set of data, the adequacy of the model is examined by overall goodness-of-fit tests and examination of influential observations. One concludes a model fits if the differences between the observed and fitted values are small and if there is no systematic contribution of the differences to the error structure of the model. A goodness-of-fit test that is commonly used to assess the fit of logistic regression models is the Hosmer–Lemeshow test (Hosmer and Lemeshow, 1980). Other goodness-of-fit tests for logistic regression models have been proposed (Cox, 1958; Tsiatis, 1980; Brown, 1982; Azzalini et al., 1989; le Cessie and van Houwelingen, 1991, 1995; Su and Wei, 1991; Osius and Rojek, 1992; Pigeon and Heyse, 1999a, 1999b). These goodness-of-fit tests have been studied under independent and identically distributed random variable assumptions, which we refer to as the ‘iid-based’ setting.
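For orientation, the standard iid-based Hosmer–Lemeshow statistic can be sketched as follows. This is the textbook form of the test, which sorts observations by estimated probability, splits them into equal-sized groups, and compares observed with expected event counts; it ignores sampling weights and clustering, so it is not one of the design-based tests proposed in this article. The function name, the grouping rule for ties and uneven group sizes, and the toy data are assumptions made for illustration.

```python
def hosmer_lemeshow(y, p_hat, groups=10):
    """Return (statistic, degrees_of_freedom) for the iid-based test.

    y      : list of 0/1 outcomes
    p_hat  : list of estimated probabilities, same length as y
    groups : number of risk groups (10 gives "deciles of risk")

    The statistic is the sum over groups of
    (observed - expected)^2 / (n_g * mean_p * (1 - mean_p)),
    usually referred to a chi-square distribution with groups - 2 df.
    Assumes len(y) >= groups and no group mean probability of exactly 0 or 1.
    """
    order = sorted(range(len(y)), key=lambda k: p_hat[k])
    n = len(order)
    statistic = 0.0
    for g in range(groups):
        # Near-equal group sizes based on the sorted positions.
        idx = order[g * n // groups:(g + 1) * n // groups]
        n_g = len(idx)
        observed = sum(y[k] for k in idx)
        expected = sum(p_hat[k] for k in idx)
        mean_p = expected / n_g
        statistic += (observed - expected) ** 2 / (n_g * mean_p * (1.0 - mean_p))
    return statistic, groups - 2


# Hypothetical toy data: 20 observations with estimated probabilities
# spread over (0, 1) and outcomes that roughly track them.
p_hat = [(k + 0.5) / 20 for k in range(20)]
y = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1]
stat, df = hosmer_lemeshow(y, p_hat, groups=5)
```

A small statistic relative to the reference distribution indicates that observed and expected counts agree within the risk groups.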

Although appropriate estimation methods which take into account the sampling design in estimating logistic regression model parameters are available in various statistical packages, there is a corresponding absence of design-based goodness-of-fit testing procedures. Due to this noted absence, it has been suggested that goodness-of-fit be examined by first fitting the design-based model, then estimating the probabilities, and subsequently using iid-based tests for goodness-of-fit and applying any findings to the design-based model (Hosmer and Lemeshow, 2000). Unfortunately, the statistical properties of this method have not been examined. In this article we studied this proposed method and additionally proposed alternative design-based goodness-of-fit tests for logistic regression models. Unlike ordinary goodness-of-fit tests, the proposed tests take the sampling design and weights into account.

Goodness-of-fit

Three modifications to existing goodness-of-fit tests for design-based logistic regression models were previously studied. First, Graubard et al. (1997) proposed an alternative grouping strategy for establishing deciles of risk for the Hosmer–Lemeshow goodness-of-fit test. As usual, after fitting the logistic regression model, the probabilities are estimated as $\hat{\pi}(x_{ji}) = \Pr(Y_{ji} = 1 \mid x_{ji}) = e^{x_{ji}'\hat{\beta}} / (1 + e^{x_{ji}'\hat{\beta}})$. When the estimated probabilities are sorted in ascending order and subsequently grouped into 10

Proposed goodness-of-fit tests for complex sampling

Due to the inflated Type I error rates for the previously proposed tests, we proposed several alternative design-based goodness-of-fit tests and examined their properties empirically. The proposed goodness-of-fit tests for logistic regression applied to complex survey data are calculated in the following manner: after the logistic regression model is fit, the residuals $\hat{r}_{ji} = y_{ji} - \hat{\pi}(x_{ji})$ are obtained. These goodness-of-fit tests are based on the residuals since large departures between observed and

Simulation study: type I error

The results from the simulations for the correctly specified models when 25, 75, and 125 clusters were sampled appear in Tables 3, 4, and 5, respectively. The tabled value is the percent of times the p-value from the goodness-of-fit test was less than 0.05. With 2000 replications, this percent should range from 4% to 6%, which agrees with an approximate 95% binomial interval around the nominal 5% level, $0.05 \pm 1.96\sqrt{0.05 \times 0.95 / 2000} \approx 0.05 \pm 0.0096$. Therefore, when the percent ranged from 4% to 6%, the goodness-of-fit test was interpreted to have a Type I error rate close to the nominal level; when the percent was

Discussion

The simulations for the simulated population were run under a variety of conditions. Conditions that were varied in the simulation study included (i) different number of sampled clusters (m=25, 75, 125), and (ii) examination of the Type I error using the probabilities from both the design-based model as well as the probabilities from the population parameter values for several proposed goodness-of-fit tests. The Type I error rates estimated when the Hosmer–Lemeshow goodness-of-fit test was

References (25)

  • K.J. Archer et al. Goodness-of-fit test for a logistic regression model estimated using survey sample data. The Stata J. (2006)
  • A. Azzalini et al. On the use of nonparametric regression for model checking. Biometrika (1989)
  • C.C. Brown. On a goodness-of-fit test for the logistic model based on score statistics. Comm. Statist. Theory Meth. (1982)
  • D.R. Cox. Two further applications of a model for binary responses. Biometrika (1958)
  • B.I. Graubard et al. Testing goodness-of-fit for logistic regression with survey data. (1997)
  • N.J. Horton. Goodness-of-fit for GEE: an example with mental health service utilization. Statist. Medicine (1999)
  • D.W. Hosmer et al. Goodness-of-fit tests for the multiple logistic regression model. Comm. Statist. Theory Meth. A (1980)
  • D.W. Hosmer et al. Applied Logistic Regression. (2000)
  • D.W. Hosmer. A comparison of goodness-of-fit tests for the logistic regression model. Statist. Medicine (1997)
  • E.L. Korn et al. Surveys: accounting for the sampling design. Amer. J. Public Health (1991)
  • E.L. Korn et al. Analysis of large health surveys: accounting for the sampling design. J. Roy. Statist. Soc. A (1995)
  • E.L. Korn et al. Analysis of Health Surveys. (1999)