Recruiting and retaining youth and young adults: challenges and opportunities in survey research for tobacco control
Jennifer Cantrell1,2, Elizabeth C Hair1,2, Alexandria Smith1, Morgane Bennett1, Jessica Miller Rath1,2, Randall K Thomas3, Mansour Fahimi3, J Michael Dennis4, Donna Vallone1,5

1. Evaluation and Science Research, Truth Initiative, Washington DC, USA
2. Department of Health, Behavior and Society, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD, USA
3. GfK New York, New York, NY, USA
4. National Opinion Research Center, Chicago, Illinois, USA
5. Global Institute of Public Health, New York University, New York, NY, USA

Correspondence to Dr Jennifer Cantrell, Evaluation and Science Research, Truth Initiative, 900 G Street, NW, Fourth Floor, Washington, DC 20001, Columbia, USA; jcantrell{at}truthinitiative.org

## Abstract

Introduction Evaluation studies of population-based tobacco control interventions often rely on large-scale survey data from numerous respondents across many geographic areas to provide evidence of their effectiveness. Significant challenges for survey research have emerged with the evolving communications landscape, particularly for surveying hard-to-reach populations such as youth and young adults. This study combines the comprehensive coverage of an address-based sampling (ABS) frame with the timeliness of online data collection to develop a nationally representative longitudinal cohort of young people aged 15-21.

Methods We constructed an ABS frame, partially supplemented with auxiliary data, to recruit this hard-to-reach sample. Branded and tested mail-based recruitment materials were designed to bring respondents online for screening, consent and surveying. Once enrolled, respondents completed online surveys every 6 months via computer, tablet or smartphone. Numerous strategies were utilized to enhance retention and representativeness.

Results Results detail sample performance, representativeness and retention rates as well as device utilization trends for survey completion among youth and young adult respondents. Panel development efforts resulted in a large, nationally representative sample with high retention rates.

Conclusions This study is among the first to employ this hybrid ABS-to-online methodology to recruit and retain youth and young adults in a probability-based online cohort panel. The approach is particularly valuable for conducting research among younger populations as it capitalizes on their increasing access to and comfort with digital communication. We discuss challenges and opportunities of panel recruitment and retention methods in an effort to provide valuable information for tobacco control researchers seeking to obtain representative, population-based samples of youth and young adults in the U.S. as well as across the globe.

• Prevention
• Priority/special populations
• Media
• Surveillance and Monitoring
• Social Marketing

## Introduction

Evaluation studies of population-based tobacco control interventions often rely on large-scale survey data from numerous respondents across many geographical areas to provide evidence of their effectiveness.1 2 Large survey data collections also serve as critical surveillance mechanisms to examine tobacco use patterns at a national level.3

Traditionally, many of the largest and most rigorous tobacco-related survey data collections used phone-based methods.4 5 However, significant challenges for conducting phone surveys have emerged with the evolving communications landscape, including reduced coverage of landline sampling frames, declining response rates and increasing costs.6 The capacity to fully enumerate units of a population who could be potentially selected via landline has declined dramatically given the rise of cellphone-only households. Current coverage rates of landline frames are estimated to represent less than 50% of the population,7 with evidence indicating these frames will reflect less than 10% of households within 10 years.8 Studies that combine landline and cellphone frames have not resolved these problems given declining phone response rates.7 9 Such declines can be attributed to numerous factors, including the broad penetration of caller identification and voicemail.10 These factors serve to significantly reduce the probability of reaching potential respondents, and in turn, increase labour costs associated with data collection.11 Higher costs for cellphone-only surveys6 12 and evolving legal restrictions on automated dialling of cellphones for research purposes present other challenges.13

Online data collection can be less expensive than phone and is especially valuable when surveying younger populations.6 Web-based surveys allow researchers to capitalise on youths’ increasing access to and comfort with digital platforms. However, the quality of online survey data for population-based studies depends, in part, on the degree to which the sampling frame from which the sample is drawn represents the target population. Although non-probability samples are typically used for market and commercial research, academic or government studies often require the coverage of representative sampling frames which allow for the estimation of selection probabilities and ensure representative and accurate estimates.14–16 Such probability-based sampling frames combined with online data collection are growing in popularity in the USA and abroad.17–24 These surveys use phone or address-based sampling (ABS) to recruit individuals for cross-sectional or longitudinal studies. In the USA, ABS frames are based on the US Postal Service (USPS) Computerized Delivery Sequence File which is estimated to cover close to 100% of US households.25

Given the challenges of phone-based survey sampling and data collection, as well as the limitations of online non-probability samples, we employed a hybrid approach that combines the strengths of ABS sampling with the speed of online data collection in an effort to engage a cohort of youth and young adults. Since the majority of tobacco users begin smoking in adolescence and young adulthood,26 27 methods to recruit and retain respondents from this age group are extremely valuable for effective programme planning, implementation and evaluation studies. This paper outlines the specifications related to the study design, sampling methods, recruitment and retention strategies, data collection methods, and sample performance of the Truth Longitudinal Cohort (TLC), a national sample of 15–21-year-olds designed to evaluate the 2014 relaunch of the antitobacco mass media campaign, truth. We present the methodology, results, challenges and opportunities for survey research in an effort to provide valuable information for tobacco control researchers seeking to obtain representative, population-based samples of youth and young adults in the USA as well as across the globe.

## Methods

### Study design

The evaluation of the truth campaign required the recruitment and maintenance of a large nationally representative cohort of the target audience: youth and young adults aged 15–21 years. To ensure sufficient power to detect campaign effects, target sample goals specified a prospective cohort of 10 000 respondents at baseline, with follow-up assessments every 6 months for 3 years. Cross-sectional samples of approximately 1000 new respondents of the same age range were added at each data collection period and followed thereafter to address attrition and panel conditioning effects. Figure 1 shows data collection and the timeline for waves 1 to 6. Each data collection period for follow-up and refreshment waves was approximately 3–4 months.

Figure 1

Truth Longitudinal Cohort timeline.

### Probability-based sampling

The sample was obtained using the ABS frame described above, which consisted of addresses from the USPS.25 To maximise the likelihood of locating a member of the ABS-sourced household who fell within our age target, the frame was supplemented with auxiliary information obtained from public and commercial sources for a portion of the sample. Supplementing the ABS frame with auxiliary information provides data-based insights into the demographic and socioeconomic composition of households in the frame and, if properly designed, allows researchers to adhere to the probability sampling paradigm when designing representative samples of specific hard-to-reach populations.25 28 Auxiliary data sources are commercially available from marketing information firms (eg, infoUSA, Acxiom, Experian and Targus).

Of the total sample units invited to the TLC, 14.7% were derived from ‘listed’ households (ie, leveraging auxiliary data) indicated to have an eligible study participant aged 15–17 years and 17.9% from ‘listed’ households indicated to have an 18–21-year-old. The remaining 67.4% of the sample was selected from non-list-assisted households. The sample was not stratified by geography. This protocol led to a highly acceptable design effect of 1.08. Use of the auxiliary information was substantially effective for the intended purpose of increasing the percentage of sample units that qualify for the TLC. In stratum 1, 75.7% of households had a 15–17-year-old; in stratum 2, 70.1% had an 18–21-year-old; in stratum 3, 12.2% had a 15–21-year-old. Online supplementary table 1 provides information on the sample allocation and inclusion probabilities across the strata.
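As an illustration of what a design effect of 1.08 implies, the following minimal sketch relates weight variability to effective sample size. It uses Kish's approximation, deff = n·Σw² / (Σw)², which is an assumption here: the text does not state how the reported design effect was computed. The example weights are hypothetical; only the value 1.08 and the 10 257 baseline completes are taken from the study.

```python
from typing import Sequence


def kish_design_effect(weights: Sequence[float]) -> float:
    """Kish's approximate design effect due to unequal weighting."""
    n = len(weights)
    if n == 0:
        raise ValueError("weights must not be empty")
    total = sum(weights)
    sum_sq = sum(w * w for w in weights)
    return n * sum_sq / (total * total)


def effective_sample_size(n: int, deff: float) -> float:
    """Nominal sample size deflated by the design effect."""
    if deff <= 0:
        raise ValueError("design effect must be positive")
    return n / deff


# Hypothetical weights (not study data): equal weights give deff == 1.
equal_weights = [1.0] * 100
# Hypothetical unequal weights: half the cases weighted 1, half weighted 2.
unequal_weights = [1.0] * 50 + [2.0] * 50

deff_equal = kish_design_effect(equal_weights)
deff_unequal = kish_design_effect(unequal_weights)

# Values reported in the text: deff = 1.08 and 10 257 baseline completes.
n_eff_reported = effective_sample_size(10257, 1.08)
print(f"effective sample size at deff 1.08: {n_eff_reported:.0f}")
```

Under this approximation, a design effect close to 1 means the weighted sample behaves almost like a simple random sample of the same size.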

### Data collection

In an effort to optimise recruitment and retention, study branding and a project-specific website were used. Branding of health-related goods, processes or services can help establish positive associations between a target audience and the product and can serve to promote commitment and loyalty over time. The TLC branding was designed to present the study as distinctive, trusted and reliable, which is especially important in longitudinal cohort studies where retention is critical.29 Branding included the study name, Connecting Health and Technology (CHAT), paired with an associated logo and logos of study partners (affiliated academic institutions and contract research organisations) on materials to help increase legitimacy (see online supplementary figure A1). The website featured the respondent screener and subsequent surveys, as well as instructions on accessing future surveys and updating one’s contact information, clear and concise information on frequently asked questions, data privacy policies and contact information of the study coordinator through the web portal. Participants were able to return to this website at any time to check for upcoming surveys, review survey completion and the status of participation incentives, update contact information and provide feedback (see online supplementary figure A2).

Testing of all recruitment and retention materials was conducted among a sample of the target audience to refine the content, tone and imagery. Forced exposure tests using online convenience samples were conducted to evaluate receptivity to the study brand name, web content, letters, postcards, envelopes, email messages, logos and imagery for study materials. Cognitive testing of the online data collection process was also conducted with a combined online sample of 15–21-year-olds as well as parents of 15–17-year-olds to examine any potential barriers related to the comprehension of survey items and skip patterns in the baseline instrument to improve survey comprehension and flow. We also tested usability of the household screener and the parental consent process. Once data instruments were finalised, surveys were optimised for completion on desktop and mobile devices. All recruitment, retention and survey materials were provided in English and Spanish.

Study invitations were sent by mail to all potential respondents in a 9×12 envelope with a letter structured in a simple ‘Question and Answer’ format, which provided information on the study, study sponsors, survey website, a unique passcode, and a 1800 toll-free telephone number established to field questions from potential respondents. Reminder postcards and letters were sent to all non-respondents. All materials included the CHAT logo and study sponsor information. Descriptions of the study were provided on study sponsors’ websites and an online ad, and relevant Google search terms were purchased to direct potential respondents who wanted to learn more about the study.

Once a participant logged into the recruitment website with the unique passcode, they were screened, rostered, selected and consented. First, the household respondent completed the online screening questionnaire, which included a rostering of household members for purposes of selecting a qualified respondent. If multiple eligible 15–21-year-olds were listed in the household, one was randomly chosen. Second, the parent/guardian of an eligible youth age 15–17 years was asked to provide consent for that youth, and then asked to have the youth assent and complete the survey. Those aged 18–21 years could provide their own consent and were asked to complete the survey directly. Finally, both parents/guardians and selected participants were informed of the incentive structure for their participation, and the subsequent requests for completing additional surveys. Parents could choose whether to receive the incentive directly or elect the incentive and surveys be sent directly to their teen. Teens were required to complete the survey themselves and not via parental proxy. If participants started the screening process but did not complete the full screening, consent or survey process, an additional reminder letter and two emails (for those who had provided email addresses) were sent. Reminder letters included the member portal website and emails included direct links to the portal and the survey.
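The screening and consent flow described above can be sketched as follows. This is a minimal illustration, not the study's actual implementation: the `HouseholdMember` class, the function names and the example roster are hypothetical, and uniform random selection among eligible members is an assumption (the text says only that one eligible member was randomly chosen).

```python
import random
from dataclasses import dataclass
from typing import List, Optional

MIN_AGE, MAX_AGE = 15, 21  # study age range given in the text
ADULT_AGE = 18             # 18-21-year-olds provide their own consent


@dataclass(frozen=True)
class HouseholdMember:
    label: str  # hypothetical roster label
    age: int


def select_respondent(
    roster: List[HouseholdMember], rng: random.Random
) -> Optional[HouseholdMember]:
    """Pick one eligible member uniformly at random, or None if nobody qualifies."""
    eligible = [m for m in roster if MIN_AGE <= m.age <= MAX_AGE]
    if not eligible:
        return None
    return rng.choice(eligible)


def consent_steps(member: HouseholdMember) -> List[str]:
    """Return the ordered steps a selected member goes through before surveying."""
    if member.age < ADULT_AGE:
        return ["parent/guardian consent", "youth assent", "survey"]
    return ["self consent", "survey"]


# Hypothetical household roster (not study data).
roster = [
    HouseholdMember("member A", 45),
    HouseholdMember("member B", 16),
    HouseholdMember("member C", 19),
    HouseholdMember("member D", 12),
]
rng = random.Random(2014)  # seeded only to make the sketch reproducible
selected = select_respondent(roster, rng)
if selected is not None:
    print(selected.label, consent_steps(selected))
```

Selecting a single member per household keeps each eligible person's selection probability computable from the roster size, which matters for the weighting of a probability-based sample.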

Each respondent received a base contingent incentive of $10 for completion of a survey. Respondents defined as hard-to-reach, such as African-Americans and Hispanics, received an additional contingent incentive of $10, for a total of $20. Respondents without household internet access were also defined as hard-to-reach and received an additional contingent incentive of $20. The maximum incentive offered to any single respondent for the baseline survey was $40.

Numerous pathways for retention were used, including varying contacts at each retention wave if needed, which included an initial invite letter and email, up to seven reminders via email, postcard or letter (some of which provided additional incentives), and an alternative contact and text reminder for those who provided this information. A suspend reminder postcard, letter and email were also sent for those who started a follow-up survey but did not complete it. A large majority of respondents completed the survey within the first three contacts and the additional contacts were used for a small proportion of the sample to boost response. Other strategies were also used to engage respondents, including birthday cards, humorous emails and social media messages. Between-wave postcards and emails, both of which had a link to the member portal, were sent to request updated contact information 2 months before each survey.

## Results

### Sample performance

Table 1 provides a detailed summary of the total number of invites sent by strata and responses by disposition. At baseline, 1 293 801 participants were sent mailed invites. A total of 1 293 801 reminder postcards and 1 288 572 reminder letters were sent following the initial invite. Across the strata, the listed sample performed better than the unlisted sample: rates of screener eligibility, survey completion and consent were five to six times higher for the listed sample compared with the unlisted sample.

Of the total invites sent, 40 464 were screened and, of those screened, 12 882 participants were determined to be eligible for the baseline survey. Of these, 11 981 were consented. A total of 10 257 respondents completed the baseline. A portion of those who were eligible and consented started but did not complete the survey initially, and thus reminder letters (n=3244) and two emails (n=3244 and 3075) (for those who had provided email addresses) were sent to encourage respondents to complete the survey. Of those who received these reminder letters and emails, 766 eventually completed the survey.

For the baseline survey, the incentive structure was more efficient for the listed strata versus the unlisted strata (see online supplementary table 2). Among the sample overall, 66% of respondents received a $10 incentive, 31% received $20, 1% received $30 and 2% received $40. A greater proportion of respondents in the two listed strata received $10 compared with those in the unlisted strata. This was primarily due to differences in demographics, discussed further below.
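The baseline incentive rules and the reported distribution of incentive amounts can be checked with a short worked example. The sketch below encodes the rules as stated in the Data collection section; the function and constant names are hypothetical, and the single `hard_to_reach_demographic` flag is a simplifying assumption because the text gives African-Americans and Hispanics only as examples of that group.

```python
from typing import Dict

BASE_INCENTIVE = 10     # USD, paid for completing a survey
DEMOGRAPHIC_BONUS = 10  # USD, hard-to-reach demographic groups
NO_INTERNET_BONUS = 20  # USD, no household internet access


def baseline_incentive(
    hard_to_reach_demographic: bool, has_household_internet: bool
) -> int:
    """Contingent baseline incentive in USD under the stated rules."""
    amount = BASE_INCENTIVE
    if hard_to_reach_demographic:
        amount += DEMOGRAPHIC_BONUS
    if not has_household_internet:
        amount += NO_INTERNET_BONUS
    return amount


# Reported share of respondents (in per cent) receiving each amount.
REPORTED_SHARE_PCT: Dict[int, int] = {10: 66, 20: 31, 30: 1, 40: 2}


def mean_incentive(share_pct: Dict[int, int]) -> float:
    """Average incentive per respondent implied by a percentage distribution."""
    total_pct = sum(share_pct.values())
    return sum(amount * pct for amount, pct in share_pct.items()) / total_pct


average_incentive = mean_incentive(REPORTED_SHARE_PCT)
print(f"average baseline incentive: ${average_incentive:.2f}")
```

The four possible amounts produced by the rule ($10, $20, $30 and $40) match the four amounts in the reported distribution, and the largest equals the stated $40 maximum.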

In computing the response rate, each of the three strata had a separate rate of eligibility for the two age groups (15–17 years and 18–21 years). These rates of eligibility were used to estimate the proportion of possible eligible participants in the ‘unlisted’ sample within each stratum and age group. The 18–21-year-old group had a quota limit; once this quota was achieved, the estimated eligibility for all in the ‘listed’ and ‘unlisted’ samples of 18–21-year-olds was assigned ‘0’. Any additional completes within this age range were not included as they were considered ‘adults overquota’.25 The response rate was calculated as a weighted RR3 using the American Association for Public Opinion Research (AAPOR) response rate calculator for each of the six sample groups (three strata by two age groups).30 The final weighted survey response rate was 52.4% (AAPOR Response Rate 3 (RR3) using our quota limit). The refreshment samples were recruited using the same custom ABS methodology and response rates were similar.
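To make the response rate calculation concrete, the sketch below implements the AAPOR RR3 formula, RR3 = I / ((I + P) + (R + NC + O) + e·U), where e is the estimated eligibility rate among cases of unknown eligibility (U). All disposition counts and group weights are hypothetical and do not come from the study. Combining group-level rates as a weighted mean with caller-supplied weights is also an assumption, since the text does not describe how the six groups were weighted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Dispositions:
    complete: int           # I: complete interviews
    partial: int            # P: partial interviews
    refusal: int            # R: refusals and break-offs
    non_contact: int        # NC: non-contacts
    other: int              # O: other eligible non-interviews
    unknown: int            # U: cases of unknown eligibility
    est_eligibility: float  # e: estimated share of unknown cases that are eligible


def rr3(d: Dispositions) -> float:
    """AAPOR Response Rate 3 for one sample group."""
    denominator = (
        (d.complete + d.partial)
        + (d.refusal + d.non_contact + d.other)
        + d.est_eligibility * d.unknown
    )
    if denominator <= 0:
        raise ValueError("no eligible or potentially eligible cases")
    return d.complete / denominator


def weighted_rr3(groups: List[Tuple[Dispositions, float]]) -> float:
    """Weighted mean of group-level RR3 values (weighting scheme is assumed)."""
    total_weight = sum(weight for _, weight in groups)
    return sum(rr3(d) * weight for d, weight in groups) / total_weight


# Hypothetical counts for two groups (not study data).
group_a = Dispositions(50, 10, 20, 10, 10, unknown=1000, est_eligibility=0.1)
group_b = Dispositions(30, 0, 10, 10, 0, unknown=100, est_eligibility=0.5)

rate_a = rr3(group_a)
rate_b = rr3(group_b)
combined = weighted_rr3([(group_a, 1.0), (group_b, 3.0)])

# Once a quota is met, setting e to 0 removes unknown cases from the denominator.
group_b_over_quota = Dispositions(30, 0, 10, 10, 0, unknown=100, est_eligibility=0.0)
rate_b_over_quota = rr3(group_b_over_quota)
```

The last two lines mirror the quota rule described above: assigning an estimated eligibility of 0 means later unresolved cases in that age group no longer lower the response rate.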

Table 1

Total numbers of invites and responses by strata and disposition.
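The overall counts reported in the Sample performance section can be turned into stage-to-stage conversion rates with a few lines of arithmetic. The sketch below uses only the totals given in the text; the stage labels and function name are illustrative. These raw ratios are not the AAPOR response rate, which additionally adjusts for estimated eligibility among households that never responded.

```python
from typing import Dict, List, Tuple

# Overall baseline counts reported in the text.
FUNNEL: List[Tuple[str, int]] = [
    ("invited", 1_293_801),
    ("screened", 40_464),
    ("eligible", 12_882),
    ("consented", 11_981),
    ("completed", 10_257),
]


def stage_rates(funnel: List[Tuple[str, int]]) -> Dict[str, float]:
    """Ratio of each stage's count to the count of the stage before it."""
    rates: Dict[str, float] = {}
    for (prev_name, prev_n), (name, n) in zip(funnel, funnel[1:]):
        rates[f"{name}/{prev_name}"] = n / prev_n
    return rates


rates = stage_rates(FUNNEL)
for label, rate in rates.items():
    print(f"{label}: {rate:.1%}")
```

Most of the loss occurs before screening, which is expected when a large share of mailed addresses contain no one in the target age range.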

### Device utilisation

Approximately 20% of the baseline sample at wave 1 completed the survey on a mobile device. In later waves, the proportion of respondents taking the survey on mobile increased, with close to 30% of refreshment respondents taking the survey via a mobile device by wave 5 (see table 2). The baseline survey took a median of 31.8 min, which varied by device: 29.9 min for desktop/laptop; 33.8 min for tablet and 42.6 min for smartphone. The follow-up surveys at waves 2–5 were generally shorter, with a median time taken of approximately 25 min, ranging from 24–25 min for desktop/laptop, 25–27 min for tablet and 27–28 min for smartphone.

Table 2

Device used to complete baseline survey for the original sample and each refreshment sample

### Sample representativeness

Table 3 provides demographics by sample strata and for the sample overall in comparison with 2014 census statistics to assess representativeness. Proportions for age groups were lower for 15–17-year-olds (36.4% vs 44.5%), which may be partially due to the additional parental consent process required for youth participation under age 18 years. Proportions for gender were similar to census numbers and did not vary significantly across strata. The sample overall reflects a somewhat greater proportion of respondents from the North-East and Midwest compared with census data, and this was largely driven by differences in the listed strata. The biggest difference was in race/ethnicity. The total sample reflects a larger proportion of whites, somewhat fewer African-Americans and significantly fewer Hispanics compared with census estimates, and this, again, was largely driven by the listed sample. The proportion of respondents who reside within a city and its suburbs, also known as a metropolitan statistical area (MSA), was similar to census targets with 86.4% of respondents residing in MSAs compared with 84.4% in census estimates.

Table 3

Unweighted demographic characteristics by sample strata

### Sample retention

Retention efforts at follow-up waves were quite successful. Consistent with other longitudinal data collection, the largest proportion of attrition occurred between the baseline survey and first follow-up. Approximately 72% of the original wave 1 cohort was retained at wave 2. Retention rates remained stable through waves 3–5 (see table 4). Respondents who missed one or more waves were recontacted for every subsequent wave. Retention rates were superior for the listed sample compared with the unlisted sample. The retention pattern was similar for the refreshment samples followed over time.

Table 4

Response rates across waves as a proportion of wave 1 sample

## Discussion

This paper provides one of the first detailed published descriptions of a probability-based sample that employs a hybrid approach of combining an ABS sampling frame with online data collection to recruit a longitudinal cohort of youth and young adults. These methods leverage the representativeness of an ABS probability-based frame that accurately calculates the selection probability for each respondent while benefiting from the timeliness of online data collection. Given the effectiveness of mass media campaigns and other population-level interventions in reducing youth tobacco use,31 survey methods to accurately and efficiently evaluate such efforts are critical to advancing public health.

This methodology provides both strong sample performance and time efficiencies for survey research efforts with youth and young adults, particularly for those seeking to obtain representative, population-based samples. ABS frames are currently considered the gold standard for developing representative household survey samples in the USA.25 Among this often hard-to-reach population, we increased data collection efficiency by supplementing the ABS frame with auxiliary data to create ‘listed’ and ‘unlisted’ strata.25 This method improved rates of eligibility, consent, interview completion and retention while maintaining acceptable sample representativeness on most demographic variables of interest, reducing sampling costs and maintaining an acceptable design effect.25

Despite probability-based coverage, an acceptable design effect and rapid recruitment, other challenges to recruitment remained. Multiple contacts were necessary to achieve sufficient response, which increased costs primarily at baseline due to printed material mailing costs. Follow-up surveys leveraged email as much as possible to improve efficiency and reduce costs. Researchers may want to consider targeted efforts to reduce non-response by using auxiliary data, in-person surveys32 or additional survey mode options33 during initial recruitment to reduce costs and improve response efficiencies. Further, neither ‘listed’ nor ‘unlisted’ samples adequately represented population estimates for minority populations, especially Hispanic respondents, despite all recruitment and survey materials being available in Spanish. Lower response rates among Hispanic populations are not uncommon. The practical effect of reduced response among these subpopulations is greater variance in survey estimates. Efforts to increase response among Hispanics include using available data to oversample these groups, such as targeting individuals with common Hispanic surnames,34 using consumer data where Hispanic households are ‘flagged’,35 targeting areas with a high Hispanic population density for ‘listed’ and ‘unlisted’ strata development, or conducting in-person interviews in areas with a high density of Hispanic households in connection with area-based targeting.36 This study employed enhanced incentives for engaging harder-to-reach subgroups. Using such procedures can help improve the representativeness of Hispanic and other minority samples, and requires that appropriate correctives are employed to address misclassification and other potential biases these approaches may introduce.28

ABS sampling paired with either in-person or direct mail-based survey data collection requires significant labour and time.37 38 Alternatively, this method of linking ABS with web-based data collection has been viewed with scepticism given the gaps in internet access.25 38 39 New strategies, however, are being employed to improve access. For example, the advent of mobile devices has led to greater internet penetration in the USA and globally,40 41 particularly among young people,42 minorities and low-income groups.40 43 By providing respondents with home internet access44 or access in other locations, supplementing data collection with other modes (phone, mail or in-person)21 45 46 and/or structuring incentives to engage those who are less likely to respond online, issues of internet access can be minimised.45

Web-based data collection is particularly appropriate when surveying youth and young adults. Compared with older adults, these populations are more likely to respond to an internet survey than to a mailed paper survey.6 47 48 Additionally, youth and young adults have the highest rates of internet use,40 49 are less likely to use conventional mail, and tend to change residential location, especially as youth transition to young adulthood.46 Evidence also demonstrates online data collection almost universally reduces social desirability biases in surveys compared with interviewer-administered modes.50–52 This may be most relevant for surveys of tobacco use among underage youth or illegal substance use. Web surveys also allow respondents to complete the survey at their convenience, and can be efficiently optimised to facilitate survey navigation and reduce item non-response.53 Finally, online data collection can allow for the viewing of visual content such as photos, videos or hyperlinks to digital content in real time,54 platforms that are likely to be familiar and accessible to younger populations, and increasingly important in survey research.46 Thus, online surveys are especially valuable for media evaluation studies, as they allow for the easy inclusion of actual digital content assessing exposure to the campaign.

Data from this study demonstrated a 50% increase over a relatively short time period in respondents’ use of mobile devices for completing the survey. Given the increased penetration of mobile devices across the globe, including in low-income and middle-income countries (LMIC), online data collection via mobile devices may be relevant for obtaining representative survey samples.55 In the USA, young adults and minorities are more likely to identify their mobile device as their main source to access the internet.9 Given the rapid increase in mobile phone usage by these harder-to-reach populations,42 49 survey instruments should be modified for the mobile platform. Online surveys can also be deployed relatively easily in multiple languages—a clear benefit to global tobacco control evaluation efforts as mobile phone ownership becomes more common in LMIC.41
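The 50% figure above is a relative increase, not a change in percentage points. A minimal worked example, using the approximate shares from the Device utilisation section (about 20% at wave 1 and close to 30% by wave 5) and hypothetical function names, makes the distinction explicit.

```python
def relative_change(old: float, new: float) -> float:
    """Change expressed as a fraction of the starting value."""
    if old == 0:
        raise ValueError("starting value must be non-zero")
    return (new - old) / old


def point_change(old: float, new: float) -> float:
    """Change expressed in percentage points."""
    return new - old


# Approximate mobile completion shares reported in the text, in per cent.
wave1_mobile_pct = 20.0
wave5_mobile_pct = 30.0

relative = relative_change(wave1_mobile_pct, wave5_mobile_pct)
points = point_change(wave1_mobile_pct, wave5_mobile_pct)
print(f"relative increase: {relative:.0%}, absolute increase: {points:.0f} points")
```

Because both input shares are rounded in the text, the result should be read as approximate.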

Since consistent contact with participants is a necessary component of longitudinal studies, online platforms can provide a rapid and efficient mode for engaging and retaining participants in a study. Fielding time for online data collection in this study averaged 1–2 months less, at a minimum, than estimated phone or in-person data collection (personal communication, GfK, 2016). Further, survey completion can require less time as compared with other modes.46 56 For example, taking a survey online is estimated to take approximately half the time of taking it orally (personal communication, GfK, 2016). Survey completion averaged 31.8 min for the baseline and 25 min for the follow-ups. It is likely that longer survey completion estimates would have resulted in lower response rates. Rapid, easy and efficient data collection via online surveys can benefit both the researcher and the respondent.

Retaining youth and young adults in longitudinal research is challenging. The longitudinal portion of Monitoring the Future, a national survey of 8th, 10th and 12th graders, yielded 54% participation rates in the first year after high school among seniors initially targeted. The second through sixth follow-ups after high school averaged 49% of the initial target sample.57 A representative Massachusetts survey of 12–17-year-olds obtained retention rates of 72.8% after 2 years.58 Probability-based longitudinal surveys targeting nationally representative samples of adult smokers in the USA have reported retention rates of 62.8% and 37.4% at waves 2 and 3.5 The TLC’s retention rates over 2 years are either comparable or significantly higher. Evidence indicates these higher retention rates may reflect enhanced retention efforts used in this study, including listed sample, personalised communications,59 the use of humour and varying incentives.60–68

Although this methodology provides an effective and timely approach to obtaining representative survey data, it is not without limitations. Using ABS frames is most applicable for countries that have robust administrative data sources for enumerating all households across a community. Even with ABS sampling frames, recruitment and retention of this age group generally requires multiple contacts, incentives and additional engagement strategies. Studies using non-probability-based samples can use some of the strategies described here for online data collection. For areas with limited internet access, such as some LMICs, mixed mode studies (online combined with mail and telephone) may be needed to improve representativeness. Finally, the approach described here is relatively novel for this age group, thus further research is needed to evaluate total survey error with respect to sampling frame coverage, online access, non-response and measurement errors. Analyses related to potential sampling and response bias were conducted in some of the pretesting research and additional analyses are ongoing. Researchers should continue to assess total survey error in the context of the survey mode, costs, timeliness and feasibility.

## Conclusions

Tobacco control programmes rely on rigorous evaluation to determine effectiveness at preventing and decreasing tobacco use. Collecting survey data from nationally representative samples in the age of declining response rates presents many challenges. The methodology described above takes advantage of the coverage of an ABS frame while leveraging the time efficiencies of online data collection. This approach is valuable for obtaining representative survey data among younger populations as it capitalises on their increasing access to and preference for digital communication. Given the rapidly evolving communication and technology landscapes, new approaches are needed to evaluate the effectiveness of tobacco control programmes and policies among younger populations who are most at risk of tobacco use initiation. With the penetration of digital technology worldwide, the sampling and data collection strategies described here can provide key information on challenges and potential solutions for evaluating tobacco control research efforts across the globe.

• Evaluation of population-based tobacco control interventions targeting young people often demands accurate and representative survey data.

• Recruiting and retaining a large, probability-based longitudinal sample can be prohibitive in terms of cost, time and effort.

• The study detailed here provides a feasible approach that links the rigour of address-based sampling with the time efficiencies of online data collection to effectively evaluate tobacco control interventions among youth and young adults.

• Despite challenges with recruitment, this methodology provides opportunities for developing robust representative samples that can be retained over time via online platforms.

## Acknowledgments

The authors thank Edward Mulrow, PhD P Stat and Ned English, both of NORC at the University of Chicago, for reviewing and providing feedback on the study’s methods and response rate section.

## Footnotes

• Contributors JC conceptualised and wrote the paper. DV, ECH, JMR and JMD contributed to the conceptualisation and DV contributed to writing and revisions. AS and MB contributed to writing and revisions. JMD, RKT and MF contributed to the study design, analyses and revisions.

• Funding This study was funded by Truth Initiative.

• Competing interests None declared.

• Ethics approval Chesapeake IRB.

• Provenance and peer review Not commissioned; externally peer reviewed.

• Correction notice This article has been corrected since it was published Online First. Several typos have been corrected throughout the text.