Quantifying the magnitude of environmental exposure misclassification when using imprecise address proxies in public health research

https://doi.org/10.1016/j.sste.2012.02.006

Abstract

In spatial epidemiologic and public health research it is common to use spatially aggregated units such as centroids of postal/zip codes, census tracts, dissemination areas, blocks or block groups as proxies for sample unit locations. Few studies, however, address the potential problems associated with using these units as address proxies. The purpose of this study is to quantify the magnitude of distance errors and accessibility misclassification that result from using several commonly-used address proxies in public health research. The impact of these positional discrepancies for spatial epidemiology is illustrated by examining misclassification of accessibility to several health-related facilities, including hospitals, public recreation spaces, schools, grocery stores, and junk food retailers throughout the City of London and Middlesex County, Ontario, Canada. Positional errors are quantified by multiple neighborhood types, revealing that address proxies are most problematic when used to represent residential locations in small towns and rural areas compared to suburban and urban areas. Findings indicate that the shorter the threshold distance used to measure accessibility between subject population and health-related facility, the greater the proportion of misclassified addresses. Using address proxies based on large aggregated units such as centroids of census tracts or dissemination areas can result in very large positional discrepancies (median errors up to 343 and 2088 m in urban and rural areas, respectively), and therefore should be avoided in spatial epidemiologic research. Even smaller, commonly-used, proxies for residential address such as postal code centroids can have large positional discrepancies (median errors up to 109 and 1363 m in urban and rural areas, respectively), and are prone to misrepresenting accessibility in small towns and rural Canada; therefore, postal codes should only be used with caution in spatial epidemiologic research.

Introduction

Recent advances in the analytical capacity of desktop geographic information system (GIS) software, combined with the increasing availability of spatially-referenced health and environmental data in digital format, have created new opportunities for making breakthroughs in spatial epidemiology (Zandbergen, 2008). As digital mapping is an abstraction of reality, the spatial data used for visualizing and analyzing geographic phenomena will always be inaccurate to some degree. Such inaccuracies can be compounded when spatially aggregated units are used as locational proxies for mapping and analyzing spatial relationships, rather than more precise geographic locations. In environmental and public health research, it is common to use proxies for sample unit locations, such as centroids of postal/zip codes, census tracts, dissemination areas, blocks, or lots; however, it is very uncommon for studies to address, or even mention, the potential problems ensuing from the positional discrepancies associated with using imprecise address proxies. It is the responsibility of the researcher to identify, quantify, interpret, and attempt to reduce any errors associated with using particular spatial data and locational proxies, so that they do not interfere with any conclusions and recommendations to be made from the findings (Fotheringham, 1989, Anselin, 2006).

Researchers in spatial epidemiology have long been concerned about the absolute or relative spatial accuracy of the address points used to map sample populations or phenomena within a GIS (Goldberg, 2008). Numerous researchers have examined the ‘positional errors’ which occur when the address from a database is located on a digital map, but the point is not located at the true position of the address (Cayo and Talbot, 2003, Ward et al., 2005, Schootman et al., 2007, Strickland et al., 2007, Zandbergen and Green, 2007, Jacquez and Rommel, 2009). In many previous studies, positional errors are reported as Euclidean distance errors, or errors in the X and Y dimensions. While much has been said about positional errors, much less has been said about how study results might be affected when researchers use spatially aggregated units (which themselves might be positionally accurate) as address proxies. Very few studies measure and compare the positional discrepancies between address proxies and the exact addresses they are used to represent (Bow et al., 2004).
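
The positional discrepancy described above can be expressed as the straight-line (Euclidean) distance between a proxy point and the exact address it stands in for. The following is a minimal sketch in Python, assuming both points are already in a projected coordinate system with units of metres; the function name and the sample coordinates are hypothetical and are not taken from the study.

```python
import math

def positional_discrepancy_m(proxy_xy, true_xy):
    """Euclidean distance (m) between an address proxy and the true address point.

    Assumes both points are (x, y) tuples in the same projected coordinate
    system with units of metres.
    """
    dx = proxy_xy[0] - true_xy[0]
    dy = proxy_xy[1] - true_xy[1]
    return math.hypot(dx, dy)

# Hypothetical coordinates, in metres
true_address = (1000.0, 2000.0)
postal_code_centroid = (1060.0, 2080.0)
print(positional_discrepancy_m(postal_code_centroid, true_address))  # 100.0
```

The separate X and Y errors mentioned above correspond to `dx` and `dy` before they are combined into a single distance.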

A major area of investigation in the fields of spatial epidemiology, health geography, and public health attempts to assess the levels of accessibility or ‘exposure’ of subject populations to elements in their local environments that are believed to be health-promoting or health-damaging, and are related to certain health-related behaviors or outcomes. Accessibility is typically measured in relation to the distance between subject populations and selected environmental features, and is often operationalized as a binary variable (i.e., accessible/inaccessible, exposed/not exposed) or a density variable (i.e., number of sites within, volume of contaminant within) in relation to an areal unit or ‘buffer’ of a certain threshold distance (radius) around the subject’s address. There is much variability, but unfortunately not much debate, regarding the particular threshold distances to be used in accessibility studies; however, most authors do attempt to justify their choice of threshold distances based on human behavior (e.g. ‘walking distance’) or perhaps some characteristic of the contaminant source (e.g. 150 m from a roadway). The chosen accessibility thresholds also typically vary by study population (e.g. children vs. adults), setting (e.g. urban vs. rural), and by health-related outcome (e.g. physical activity vs. asthma). In their study of the environmental influences on whether or not a child will walk or bike to school, for example, Larsen and colleagues (2009) justify the choice of a 1600 m neighborhood buffer based on the local school board cut-off distance for providing school bus service (see also Schlossberg et al., 2006, Muller et al., 2008, Brownson et al., 2009, Panter et al., 2009).

Studies which have focussed on access to neighborhood resources such as public parks and recreation spaces have utilized a variety of threshold distances, typically between 400 and 1600 m (compare Lee et al., 2007, Bjork et al., 2008, Tucker et al., 2008, Maroko et al., 2009); however, we submit that a threshold distance of 500 m is ideal, as it represents a short 5–7 min walk and is therefore easily accessible for populations of all ages (see Tucker et al., 2008, Sarmiento et al., 2010, Wolch et al., 2010). The 5–7 min walk zone, as represented by the 500 m buffer around a home or public school, is also a common distance used in studies exploring the relationship between access to junk food and obesity (see Austin et al., 2005, Morland and Evenson, 2009; Gilliland, 2010). Studies of ‘food deserts’ (disadvantaged areas with poor access to retailers of healthy and affordable food) and the potential impact of poor access to grocery stores on dietary habits and obesity have tended to focus on longer distances (800 m or greater), and vary according to urban vs. rural setting (see Wang et al., 2007, Larsen and Gilliland, 2008, Pearce et al., 2008, Sharkey, 2009, Sadler et al., 2011). For the purpose of this analysis, we focus on 1000 m, or the 10–15 min walk zone around a grocery store, as has been identified in previous studies of food deserts in Canadian cities (Apparicio et al., 2007, Larsen and Gilliland, 2008). Explorations of how distance from a patient’s home to emergency services available at hospitals is associated with increased risk of mortality are more likely to use much larger threshold distances than standard ‘walk zones’ (e.g. greater than 5 km) (see Jones et al., 1997, Cudnick et al., 2010, Nicholl et al., 2007, Acharya et al., 2011). Nicholl and colleagues (2007), for example, discovered that a 10 km increase in straight-line distance to hospital is associated with a 1% increase in mortality. As hospitals tend to be a regional, rather than a neighborhood, facility, we will use the threshold distance of 10 km for our analyses.
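
The binary and density forms of accessibility described above can be illustrated with a short Python sketch. It uses the threshold distances named in the text (500 m, 1000 m, and 10 km) and makes several assumptions that are not stated in the source: distances are straight-line rather than network distances, coordinates are in metres, a facility exactly at the threshold counts as within the buffer, and the dictionary keys, function names, and sample coordinates are hypothetical.

```python
import math

# Threshold distances (m) named in the text; the keys are illustrative labels.
THRESHOLD_M = {
    "recreation_space": 500,
    "junk_food": 500,
    "grocery_store": 1000,
    "hospital": 10_000,
}

def _distance_m(a_xy, b_xy):
    """Straight-line distance (m) between two projected (x, y) points."""
    return math.hypot(a_xy[0] - b_xy[0], a_xy[1] - b_xy[1])

def nearest_distance_m(origin_xy, facilities_xy):
    """Distance to the closest facility, or infinity if there are none."""
    return min((_distance_m(origin_xy, f) for f in facilities_xy), default=math.inf)

def is_accessible(origin_xy, facilities_xy, facility_type):
    """Binary measure: is at least one facility within the buffer radius?"""
    return nearest_distance_m(origin_xy, facilities_xy) <= THRESHOLD_M[facility_type]

def count_within(origin_xy, facilities_xy, facility_type):
    """Density measure: number of facilities within the buffer radius."""
    limit = THRESHOLD_M[facility_type]
    return sum(1 for f in facilities_xy if _distance_m(origin_xy, f) <= limit)

# Hypothetical home location and facility coordinates, in metres
home = (0.0, 0.0)
grocery_stores = [(600.0, 800.0), (3000.0, 0.0)]
junk_food_outlets = [(300.0, 400.0), (0.0, 450.0), (900.0, 0.0)]

print(is_accessible(home, grocery_stores, "grocery_store"))  # True
print(count_within(home, junk_food_outlets, "junk_food"))    # 2
```

Running the same functions with a proxy point in place of `home` gives the proxy-based classification that can then be compared against the one obtained from the exact address.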

Rushton and colleagues (2006) have argued that when short distances between subject population and environmental features are associated with health effects in epidemiologic studies, the geocoding result must have a positional accuracy that is sufficient to resolve whether such effects are truly present. The purpose of this study is to quantify the magnitude of the positional discrepancies in terms of distance errors and accessibility misclassification that result from using several commonly-used address proxies in public health research. Positional errors have been shown to vary greatly by setting (Bonner et al., 2003, Cayo and Talbot, 2003, Ward et al., 2005); therefore, we quantify errors by multiple neighborhood types: urban, suburban, small town, and rural. We also attempt to ascribe ‘meaning’ to these errors for spatial epidemiologic studies by examining errors in distance and accessibility misclassification with respect to several health-related features, including hospitals, public recreation facilities, schools, grocery stores, and junk food retailers.
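
One way to picture accessibility misclassification by neighborhood type is as the share of addresses whose accessible/inaccessible status changes when the proxy location replaces the exact location. The sketch below is an illustration under stated assumptions, not the authors' procedure: the record layout, function name, and sample distances are hypothetical, and a distance equal to the threshold is treated as accessible.

```python
from collections import defaultdict

def misclassification_rate(records, threshold_m):
    """Proportion of misclassified addresses per neighborhood type.

    records: iterable of (neighborhood_type, true_distance_m, proxy_distance_m),
    where each distance is measured to the nearest facility of interest.
    An address is misclassified when its proxy-based and true accessibility
    classifications disagree at the given threshold.
    """
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for hood, true_d, proxy_d in records:
        totals[hood] += 1
        if (true_d <= threshold_m) != (proxy_d <= threshold_m):
            wrong[hood] += 1
    return {hood: wrong[hood] / totals[hood] for hood in totals}

# Hypothetical distances (m) to the nearest facility
sample = [
    ("urban", 420.0, 480.0),    # both within 500 m: classified the same
    ("urban", 490.0, 530.0),    # proxy falls outside: misclassified
    ("rural", 300.0, 1500.0),   # misclassified
    ("rural", 2500.0, 2700.0),  # both outside: classified the same
    ("rural", 450.0, 900.0),    # misclassified
]
rates = misclassification_rate(sample, threshold_m=500)
print(rates["urban"])  # 0.5
```

Repeating the calculation with different values of `threshold_m` shows how the misclassified share responds to the choice of threshold distance.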

Section snippets

Study area and data

The City of London (population 350,200) and Middlesex County (population 69,024) in Southwestern Ontario, Canada are ideal study areas for examining the geocoding errors in accessibility studies as they encompass a mix of urban, suburban, small town, and rural agricultural areas (Statistics Canada, 2011) (see Fig. 1). The study area was categorized into four neighborhood types as follows: (1) urban areas correspond to neighborhoods in the City of London built primarily before World War II; (2)

Magnitude of positional discrepancies

In almost every case, urban neighborhoods show the smallest median distance error for all address proxies, followed successively by suburban, small town, and rural areas (see Table 1). As expected, lot centroids were the most accurate proxy for precise residential dwelling location that we examined in relation to nearest distance to health related facilities, with the median positional discrepancy (50th percentile) between lot centroids and dwelling centroids equal to 6–9 m for locations in

Discussion

It is common in public health research to use spatially aggregated units as address proxies for the locations of subjects and facilities when more precise address information is unavailable. It is rare, however, for public health researchers to examine, or even mention, the potential distance and misclassification errors resulting from the positional discrepancies between the locations of imprecise address proxies and precise subject locations. It is inappropriate for researchers to ignore

References (46)

  • S. Austin et al. Clustering of fast-food restaurants around schools: a novel application of spatial statistics to the study of food environments. Am J Public Health (2005)
  • J. Bjork et al. Recreational values of the natural environment in relation to neighbourhood satisfaction, physical activity, obesity and wellbeing. J Epidemiol Community Health (2008)
  • M. Bonner et al. Positional accuracy of geocoded addresses in epidemiologic research. Epidemiology (2003)
  • C. Bow et al. Accuracy of city postal code coordinates as a proxy for location of residence. Int J Health Geogr (2004)
  • R. Brownson et al. Measuring the built environment for physical activity: state of the science. Am J Prev Med (2009)
  • M. Cayo et al. Positional error in automated geocoding of residential addresses. Int J Health Geogr (2003)
  • Parcels, buildings, address points, and health facilities GIS files [DVD] (2010)
  • M. Cudnick et al. A geospatial assessment of transport distance and survival to discharge in out of hospital cardiac arrest patients: Implications for resuscitation centers. Resuscitation (2010)
  • DMTI Spatial Inc. Database of postal code centroids and street centerline GIS files [Internet], Ottawa(On);2009....
  • S. Fotheringham. Scale-independent spatial analysis
  • Gilliland J. The Built environment and obesity: trimming waistlines through neighbourhood design. In: Bunting, Filion,...
  • Goldberg D. A Geocoding Best Practices Guide. Springfield, IL North Am Assoc Cent Cancer...
  • K. Henry et al. Estimating the accuracy of geographical imputation. Int J Health Geogr (2008)