Machine learning applications in tobacco research: a scoping review
  1. Rui Fu1,
  2. Anasua Kundu2,
  3. Nicholas Mitsakakis1,3,
  4. Tara Elton-Marshall4,
  5. Wei Wang5,
  6. Sean Hill5,
  7. Susan J Bondy5,
  8. Hayley Hamilton5,
  9. Peter Selby5,
  10. Robert Schwartz2,4,
  11. Michael Oliver Chaiton2,4
  1. 1 Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
  2. 2 Ontario Tobacco Research Unit, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
  3. 3 Children's Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada
  4. 4 Institute for Mental Health Policy Research, Centre for Addiction and Mental Health, Toronto, Ontario, Canada
  5. 5 Centre for Addiction and Mental Health, Toronto, Ontario, Canada
  1. Correspondence to Dr Michael Oliver Chaiton, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada; Michael.chaiton{at}


Objective Identify and review the body of tobacco research literature that self-identified as using machine learning (ML) in the analysis.

Data sources MEDLINE, EMBASE, PubMed, CINAHL Plus, APA PsycINFO and IEEE Xplore databases were searched up to September 2020. Studies were restricted to peer-reviewed, English-language journal articles, dissertations and conference papers comprising an empirical analysis where ML was identified to be the method used to examine human experience of tobacco. Studies of genomics and diagnostic imaging were excluded.

Study selection Two reviewers independently screened the titles and abstracts. The reference list of articles was also searched. In an iterative process, eligible studies were classified into domains based on their objectives and types of data used in the analysis.

Data extraction Using data charting forms, two reviewers independently extracted data from all studies. A narrative synthesis method was used to describe findings from each domain such as study design, objective, ML classes/algorithms, knowledge users and the presence of a data sharing statement. Trends of publication were visually depicted.

Data synthesis 74 studies were grouped into four domains: ML-powered technology to assist smoking cessation (n=22); content analysis of tobacco on social media (n=32); smoker status classification from narrative clinical texts (n=6) and tobacco-related outcome prediction using administrative, survey or clinical trial data (n=14). Implications of these studies and future directions for ML researchers in tobacco control were discussed.

Conclusions ML represents a powerful tool that could advance the research and policy decision-making of tobacco control. Further opportunities should be explored.

  • health services
  • public policy
  • surveillance and monitoring



The urgent status of eliminating tobacco-related health problems and the increasingly complex facets of tobacco research call for sophisticated analytical methods to deal with vast amounts of data and perform highly specialised tasks. This paper provides a brief introduction to machine learning (ML) and a scoping review to assess the tobacco literature for studies that self-identified as using ML for analyses.

A gentle introduction to machine learning

ML was historically described as ‘a field of study that gives computers the ability to learn without being explicitly programmed’.1 A more intuitive definition of ML is ‘a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty’.2 The core of ML is the use of brute computational force to replace human guidance in data analysis; thus, ML can be viewed as a natural extension to traditional statistical approaches, with a much lower ratio of human-guided to machine-guided steps in the analytical pipeline.3

ML is commonly divided into three classes: supervised, unsupervised and reinforcement learning.2 A broader definition of ML also includes deep learning, which entails the use of human brain-inspired artificial neural networks to perform supervised, unsupervised or reinforcement learning tasks.4 Each class of ML aims to solve a distinct problem and has unique features that may appeal to tobacco researchers.

Supervised learning deals with prediction. It involves the training and validation of a model to ‘predict the values of one or more outputs or response variables for a given set of input or predictor variables’.5 When the goal is to obtain a highly accurate predictive model for future data through repeated trials of training and testing, the task is supervised learning. On the other hand, if the objective is statistical inference (ie, hypothesis testing and point or CI estimation), the task falls into the realm of statistical modelling, not supervised learning.6 Supervised learning is relevant in any tobacco research that demands highly accurate prediction, such as the development of a public health surveillance tool that automatically predicts adolescents’ risk of smoking initiation.
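
As a minimal sketch of the train-then-test workflow described above, the following Python example fits a one-nearest-neighbour classifier on a training set and scores it on held-out data. The data, the two predictors and the helper names (`predict_1nn`, `accuracy`) are hypothetical and chosen only for illustration; they are not drawn from any study in this review.

```python
# Hypothetical toy data: (number of friends who smoke, age) -> initiated smoking (1) or not (0).
# All values are invented for illustration only.
train = [((0.0, 12.0), 0), ((1.0, 13.0), 0), ((8.0, 16.0), 1), ((10.0, 17.0), 1)]
test = [((0.5, 12.5), 0), ((9.0, 16.5), 1)]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_1nn(training_set, features):
    """Return the label of the training example closest to `features`."""
    nearest = min(training_set, key=lambda example: squared_distance(example[0], features))
    return nearest[1]


def accuracy(training_set, test_set):
    """Proportion of held-out examples whose label is predicted correctly."""
    correct = sum(predict_1nn(training_set, x) == y for x, y in test_set)
    return correct / len(test_set)


held_out_accuracy = accuracy(train, test)
print(held_out_accuracy)  # 1.0 on this toy data
```

The point of the sketch is the separation of roles: the model only ever sees `train`, and performance is judged on `test`, which stands in for future data.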

Unsupervised learning excludes the need for an output variable; instead, its objective is to ‘learn from the input itself’.2 Typical tasks of unsupervised learning are finding meaningful groups of similar subjects (ie, clustering), identifying latent factors that capture the essence of high-dimensional data (ie, dimensionality reduction) and determining the underlying probability distribution of the data (ie, density estimation). Examples of unsupervised learning in tobacco research include exploring the themes of tobacco-related discussions on social media and discovering potential subtypes of nicotine dependence by analysing the brain MRI data of patients.
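
To illustrate clustering without an output variable, the sketch below runs a small k-means procedure (Lloyd's algorithm) on one-dimensional data. The scores, the starting centres and the function name `k_means_1d` are hypothetical; real applications would typically use many more dimensions and an established library.

```python
def k_means_1d(values, centres, iterations=10):
    """Cluster 1-D values around the given starting centres (Lloyd's algorithm)."""
    centres = list(centres)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centre.
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Update step: move each centre to the mean of its cluster.
        centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres, clusters


# Hypothetical dependence scores; note that no labels are supplied.
scores = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
centres, clusters = k_means_1d(scores, centres=[0.0, 5.0])
print(centres)   # [2.0, 9.0]
print(clusters)  # [[1.0, 2.0, 3.0], [8.0, 9.0, 10.0]]
```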

Finally, reinforcement learning refers to goal-oriented methods whereby an algorithm is trained to be a decision-maker that finds suitable actions to take in an interactive and complex situation.7 8 In healthcare, reinforcement learning can be applied to design personalised treatment plans that automatically adapt to changing clinical states and to the health effects of past treatments.9 Techniques of reinforcement learning can be used to develop smoking cessation programmes that adjust to the individualised needs of smokers and to help understand how recent smoking abstainers made the decision to smoke again.10
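
One of the simplest settings in which an algorithm learns which action to take from feedback is the multi-armed bandit. The sketch below simulates an epsilon-greedy agent choosing among three message types. The response rates, the reward definition and the function name `epsilon_greedy` are assumptions made for illustration, and the example is not taken from any of the cited systems.

```python
import random


def epsilon_greedy(response_rates, rounds=1000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent choosing among message types.

    `response_rates` are hypothetical probabilities that each message type is
    followed by a favourable response (reward 1); the agent does not know them.
    """
    rng = random.Random(seed)
    counts = [0] * len(response_rates)        # times each action was taken
    estimates = [0.0] * len(response_rates)   # running mean reward per action
    for _ in range(rounds):
        if rng.random() < epsilon:
            action = rng.randrange(len(response_rates))  # explore at random
        else:
            action = max(range(len(estimates)), key=lambda i: estimates[i])  # exploit
        reward = 1 if rng.random() < response_rates[action] else 0
        counts[action] += 1
        # Incremental update of the mean reward for the chosen action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return counts, estimates


counts, estimates = epsilon_greedy([0.2, 0.5, 0.8])
print(counts, estimates)
```

With enough rounds, the agent would be expected to favour the action with the highest response rate, while the small exploration probability keeps it sampling the alternatives.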

A few review studies have been conducted on ML applications in substance use, psychiatry and public health. An apparent focus of these reviews is on supervised learning, including diagnosing and prognosing diseases using neuroimaging data,11–14 predicting neurosurgical outcomes15 and forecasting population health-related outcomes.16 A recent systematic review assessed ML methods beyond supervised learning in analysing clinical text data17 while another review examined ML applied to addiction research and identified four tobacco-related studies.18 A limitation of prior reviews is the omission of engineering databases, such as the IEEE Xplore library, that have been shown to yield valuable health literature.19 Furthermore, most of these reviews focus on one type of data and/or certain classes of ML, thereby precluding a comprehensive assessment of ML applications. Finally, there has been limited attention to ML in tobacco research, which gives rise to an absence of systematic insights and guidance that could benefit tobacco researchers. Hence, we conducted this scoping review to address these gaps in the literature. By searching a wide range of databases, we aimed to provide an overview of ML applications in tobacco research, including domains of application, common techniques and findings.


The review protocol is registered on the Open Science and follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guideline for Scoping Reviews.20

Our objective is to comprehensively assess the tobacco research literature that self-identified as using ML. A scoping review methodology is appropriate here as it allows us to review a large body of potentially heterogeneous studies, describe their range and depth and generate an overall account in a flexible yet systematic manner. Analytical strategies for this scoping review were informed by an enhanced methodological framework by Arksey and O’Malley21 and Levac et al.22 Our interdisciplinary research team identified the following research questions:

  • What is the volume of the literature and are there any time trends regarding publications?

  • What are the key domains of tobacco research that have been studied by ML?

  • Which classes of ML (ie, supervised, unsupervised, reinforcement or deep learning) and specific algorithms have been used?

  • What are the sources of data in these analyses?

  • Who are the knowledge users of these studies and in what settings or in what forms will their findings inform real-world practice?

  • How is data transparency handled in these studies?

We included English-language peer-reviewed journal articles, conference papers and dissertations that present a primary investigation where ML was identified as the method being used to assess human experiences with tobacco. The reference list of these studies was examined by one reviewer (RF) to yield additional citations. We excluded studies that lacked an empirical component, did not use real-world data or were unavailable in full text. Studies of genomics and diagnostic imaging were also excluded.

One reviewer (AK) conducted literature searches on MEDLINE, EMBASE, CINAHL Plus, APA PsycINFO, PubMed and the IEEE Xplore library up to 21 September 2020 using a combination of subject heading and keyword searching (see online supplemental appendix). We required studies either to be associated with the ‘machine learning’ subject heading (MeSH/Emtree) or to have mentioned ‘machine learning’ in the title, abstract or keywords.

Two reviewers (AK and RF) independently screened the titles and abstracts of all studies on the Covidence platform. The full texts of studies that passed title and abstract screening were obtained and screened independently by the same reviewers to determine their eligibility. Disagreements were resolved through discussion with a third reviewer (MC).

Two reviewers (AK and RF) classified the included studies into domains based on the predominant research objective and the type of data used in the analysis. First, reviewers independently reviewed all studies, extracted key phrases and concepts from each study and formulated a set of categories; then, in an iterative process, reviewers collapsed these categories into four mutually exclusive domains: (1) ML-powered technology to assist smoking cessation; (2) content analysis of tobacco on social media; (3) smoker status classification from narrative clinical texts and (4) tobacco-related outcome prediction using administrative, survey or clinical trial data. To avoid counting multiple studies about the same technology, we included only the one study with the most comprehensive description. Shared-task competition papers were excluded to accurately reflect the diversity of objectives and data sources in the organic literature.

For all studies, we recorded their authors, year, country, objective, data source, classes of ML and the ML algorithms being used. We also documented the knowledge users and statement of data availability, if available. Data extraction forms were customised with additional items for each domain. For studies of ML-powered technology, we recorded whether it permitted just-in-time intervention. For studies of social media or narrative clinical texts, we extracted the sample size of their data. For studies that predicted tobacco-related outcomes, we recorded the study cohort, outcome(s) of interest and candidate predictors; six items denoting the key components of ML modelling in biomedical research23 (ie, whether the study has performed feature selection, resampling, model training, internal testing, external validation and model performance) and three items describing if studies have reported results beyond model performance, including importance ranking of individual predictors, intersectionality and other results. The same reviewers extracted data from all included studies independently and resolved disagreements through discussions.

Using a thematic analysis approach, we first presented descriptive results of all studies: the number of publications was plotted by year to assess the trend of publication; a pie chart was then used to show the results of a frequency analysis where we computed the proportion of studies that fell into each domain and finally, a bubble graph was used to demonstrate the domains by publication year.


Study inclusion

Initial search of databases yielded 2997 citations, of which 2691 were unique (figure 1). We identified 12 additional studies through reference-list searching. Screening of titles and abstracts identified 120 studies for full-text assessment. We excluded 46 studies for the following reasons: duplications (n=7); non-English (n=1); unavailable in full text (n=10); unrelated to ML (n=11); studies of diagnostic imaging (n=3), genetics (n=2) or non-human (n=1); used a simulated dataset (n=1) or a dataset supplied by a competition (n=4); a study protocol (n=1), commentary (n=1), letter to the editor (n=1), conference poster (n=1), meta-analysis (n=1) or review (n=1). After exclusions, 74 studies were included in the review.
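
The exclusion counts reported above can be checked with simple arithmetic. The short Python sketch below only restates the numbers in the paragraph; the dictionary keys are shorthand labels introduced here for readability.

```python
# Counts as reported in the paragraph above.
full_text_assessed = 120
exclusions = {
    "duplication": 7,
    "non-English": 1,
    "unavailable in full text": 10,
    "unrelated to ML": 11,
    "diagnostic imaging": 3,
    "genetics": 2,
    "non-human": 1,
    "simulated dataset": 1,
    "competition dataset": 4,
    "study protocol": 1,
    "commentary": 1,
    "letter to the editor": 1,
    "conference poster": 1,
    "meta-analysis": 1,
    "review": 1,
}

total_excluded = sum(exclusions.values())
included = full_text_assessed - total_excluded
print(total_excluded, included)  # 46 74
```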

Figure 1

PRISMA diagram documenting study exclusion. ML, machine learning.

Study characteristics

Thirty per cent (n=22) of these studies described an ML-powered technology to assist smoking cessation. Almost half of all studies leveraged ML to conduct content analysis of tobacco on social media (n=32, 43%). Six studies (8%) classified smoker status from unstructured narrative clinical texts. The remaining 14 studies (19%) used ML in a traditional quantitative analysis framework to predict a tobacco-related outcome using administrative, survey or clinical trial data (figure 2).
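
The frequency analysis behind these proportions amounts to dividing each domain count by the total number of included studies. A minimal Python sketch, using only the counts stated above (the domain labels are abbreviated here for readability):

```python
# Number of included studies per domain, as reported above.
domain_counts = {
    "ML-powered cessation technology": 22,
    "social media content analysis": 32,
    "smoker status from clinical texts": 6,
    "outcome prediction": 14,
}

total_studies = sum(domain_counts.values())
percentages = {
    domain: round(100 * n / total_studies) for domain, n in domain_counts.items()
}

print(total_studies)  # 74
for domain, pct in percentages.items():
    print(f"{domain}: {pct}%")
```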

Figure 2

Distribution of studies in the four domains. ML, machine learning.

The volume of ML studies has increased dramatically since 2012 and has accelerated since 2018 (figure 3). When stratified by the four domains (figure 4), early ML applications in tobacco research were mostly on outcome prediction. Since 2012, there has been an explosive growth in the use of ML in analysing social media content and narrative clinical texts as well as the development of devices to monitor cigarette use and to support cessation.

Figure 3

Yearly publication trends of machine learning studies in tobacco.

Figure 4

A bubble graph showing the distributions of domains of studies over years. ML, machine learning.

ML-powered technology to assist smoking cessation

Table 1 and online supplemental table S1 summarise the 22 studies on ML-powered technology. Almost half of these studies (n=9, 41%) described a technology that required a custom-made wearable sensor system. Sensors were placed on the wrist,24–26 chest,27 both regions28–30 or the arm31 to collect data on hand-to-mouth gestures or the breathing patterns of the user. These data were processed by a supervised or deep learning ML algorithm to detect smoking activities. The support vector machine was the most commonly used ML algorithm, applied in seven of the nine studies in this group.

Table 1

Studies focusing on ML-powered technology to assist smoking cessation

Another seven studies (32%) examined a bespoke sensor system paired with a smartphone application, where ML algorithms were used to detect smoking and, on some occasions, to generate smoking-triggered, just-in-time interventions.32–34 The RisQ system sent an alert message to users once smoking was detected by a random forest model.35 The system described by Chen et al 36 operated in a similar fashion and the wrist sensor developed by Añazco et al 37 vibrated in response to a smoking event. On one occasion, the alert messages were chosen by an ML algorithm to suit the preferences and needs of users.38

The remaining six studies (27%) described ML algorithms that operated directly on an off-the-shelf device (such as an Android smartwatch or an iPhone). The StopWatch system only required a smartwatch in order to detect smoking activities using decision trees.39 Other smartwatch-based systems required connection to a smartphone app, including the SmokeBeat,40 SmokeSense41 and the system presented by Fan et al.42 Among the two systems that only required the built-in sensors in a smartphone,43 44 Ahsan et al described a recommender system that automatically sends tailored motivational SMS texts to quit attempters based on their preferences and smoking history.43

Regarding knowledge users, all studies in this domain were conducted to directly inform the development of an actual device; therefore, they identified the knowledge users to be manufacturers, developers and decision-makers who are responsible for considering the adaptation of novel products into smoking cessation programmes. Only one study provided a statement of data availability; this study made both their data and codes publicly available.26

Content analysis of tobacco on social media

Table 2 and online supplemental table S2 present 32 studies where ML was applied to explore tobacco-related discussions on social media. Half of these studies (n=16) were based on Twitter posts (ie, tweets) and the volume of tweets ranged from 500 to an impressive 86 million. Most Twitter studies focused on e-cigarettes (n=10) and cigarettes (n=4) while one study each assessed hookah, little cigars/cigarillos and tobacco in general. Supervised learning was predominantly used, except for one recent e-cigarettes-related study that applied deep learning.45 In terms of objectives, understanding and classifying attitudes and opinions towards vaping,45–51 smoking,48 52 53 hookah smoking54 and tobacco use in general55 was the biggest theme. A few studies developed algorithms to detect advertisements of e-cigarettes45 47 50 56 and little cigars/cigarillos57 on Twitter. Smoking cessation, including intention to quit58 and quitting via vaping,59 was classified on two occasions. JUUL was investigated by two studies in order to uncover its user demographics60 and underage use.49

Table 2

Summary of studies that drew data from the social media or used narrative clinical texts

Eight studies mined posts and comments on various online forums, including Reddit, Facebook, vaping-oriented forums (ie, E-Cigarette Forum, Vapor Talk and Hookah Forum) and smoking cessation forums (ie, QuitStop and BecomeAnEx). The sample size of these studies was large, too, ranging from 2000 to more than 21 million.61 A mix of ML classes was used: supervised and deep learning were applied to identify and classify adverse health effects of vaping,61–63 to recognise tobacco use,64 to categorise types of social support for smoking cessation65 and to identify stages of smoking cessation.66 Unsupervised learning was used in more exploratory contexts to reveal topics on vaping63 and specifically, on JUUL67 and the sentiments and scope of discussion in Facebook groups.68

Text data extracted from the media, including the transcripts of TV and podcasts, newspaper articles and online news stories, were used in four studies. The largest study in this category—with more than 135 000 texts—evaluated the public responses to the enactment of the US Federal Tobacco 21 Policy which raised the legal age of accessing tobacco products from 18 to 21.69 This study found that while young non-smokers were generally in favour of the policy, their smoking counterparts showed support that was strongly and positively associated with the volume of Tobacco 21 media coverage.

Images and videos were also used as data sources in these studies. Three recently published studies mined picture posts on Instagram to identify promotional content of JUUL,70 explore JUUL-related youth culture70 and to assess the types of vaping posts that gained the highest engagement.71 In a study involving 2287 Instagram users, their picture posts, captions and comments were pooled in a deep learning analysis to classify their risk of tobacco use.72 Another study recruited 169 adult smokers who were asked to take photos of their daily smoking environments and non-smoking environments.73 Through a deep learning approach, this study developed an algorithm that could identify environments that invoked urges to smoke. ML studies of videos were relatively scarce, as we only identified two that recognised smoking activities in live stream videos74 or categorised attitudes towards smoking on YouTube.48

Studies in this domain identified public health officials as the main knowledge users of their findings. In particular, understanding the tobacco-related content on social media could inform the design of online campaigns that target the audience.54 56 Eleven studies provided a data availability statement where just one publicised the entirety of their codes.59

Smoker status classification from narrative clinical text

Table 2 and online supplemental table S2 also present the six studies based on narrative clinical texts. The sample size of these studies was smaller, ranging from 0.7K to 85K. The unstructured text data were obtained from electronic medical records,75 electronic dental records76 and doctor notes documenting patient visits.77–80 Natural Language Processing was performed to process the text data before entering them into supervised learning analysis. Various classification pipelines were developed to classify types of cigarette smokers and smoking intensity in order to supplement the health registry database with data on smoking; hence, officials at health data management were identified as knowledge users. Only one study provided a data availability statement and made both data and codes available to the public.79

Tobacco-related outcome prediction using administrative, survey or clinical trial data

Table 3 summarises the 14 studies where ML was applied to administrative, survey or clinical trial data to predict a tobacco-related outcome. While most studies recruited adult participants, two specifically focused on adolescents81 or youth.82 Sample size of these studies varied vastly from 92 to 2 million.83 Half of these studies used cross-sectional survey designs; the remaining studies used clinical trial data,84 longitudinal surveys,85 86 linked administrative data83 87 or participant records on a device.88 89

Table 3

Summary of studies predicting tobacco-related outcomes using administrative, survey or clinical trial data

The majority of these studies predicted a binary outcome related to smoking cessation, including the intention to quit smoking,90 adherence to nicotine replacement therapy,84 having high or low urges to smoke during a quit attempt88 and self-reported86 91 or lab-established85 cessation status. Other binary outcomes pertained to tobacco use patterns and history, including ever or current use of tobacco,81–83 87 nicotine dependence92 and whether individuals were exclusive or dual e-cigarette users.93 Only two continuous outcomes—time to first smoking lapse among recent cigarette quitters89 and biological age87—were examined. In terms of ML algorithms, decision trees (n=6)85 88 90–92 94 and tree ensembles—including random forest82 86 87 93 and boosting tree86—were the most popular. Unsupervised clustering analysis was combined with a decision tree in one instance.90 Neural networks82 86 94 and the more advanced, multilayered form87 were also used. The number of candidate features ranged from 6 to 53 027.

While all studies adequately reported the process of model training, internal testing and performance of their model, the reporting of other aspects varied (table 4). Only six studies described a feature selection procedure using regression,85 86 92 random forest87 or model-free filter88 or automated subset selection methods.94 Data imbalance was assessed by three studies82 86 94 where two of them82 86 performed oversampling.
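
Oversampling addresses class imbalance by replicating records from the less frequent class in the training data. The sketch below shows random oversampling, which is one common variant; the reviewed studies may have used other schemes. The records and the helper name `random_oversample` are hypothetical.

```python
import random
from collections import Counter


def random_oversample(records, label_of, seed=0):
    """Duplicate randomly chosen minority-class records until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(label_of(record), []).append(record)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap to the majority-class size.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Hypothetical training set: 8 non-quitters (label 0) and 2 quitters (label 1).
training = [("p%d" % i, 0) for i in range(8)] + [("q0", 1), ("q1", 1)]
balanced = random_oversample(training, label_of=lambda r: r[1])
class_sizes = Counter(label for _, label in balanced)
print(class_sizes)  # both classes now have 8 records
```

Resampling of this kind is normally applied to the training portion only, so that the test data keep the original class distribution.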

Table 4

Procedures and results of machine learning studies predicting tobacco-related outcomes

Ten studies reported findings beyond the performance of ML models. Nine of these studies presented a ranking of individual predictors based on a relative importance score.81–85 87 89 92 93 One study also assessed intersectionality by plotting the predicted outcomes by sex/smoker subgroups.87 Another study estimated a multivariate logistic model side-by-side to compare its results and performance with ML.94 Among studies that fitted a decision tree model (n=6), half of them visualised their model structure.84 85 92 Other additional findings included the results from the clustering analysis,90 the distribution of predicted vs actual outcomes87 and performance of the model using varied decision thresholds.83
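
Reporting performance at varied decision thresholds means converting predicted probabilities into binary predictions at several cut-offs and recomputing the metrics each time. The sketch below does this for sensitivity and specificity; the probabilities, labels and function name are hypothetical and are not results from any reviewed study.

```python
def sensitivity_specificity(probabilities, labels, threshold):
    """Sensitivity and specificity when probabilities >= threshold count as positive."""
    tp = fn = tn = fp = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted probabilities of being a smoker, with true labels.
probabilities = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]

for threshold in (0.25, 0.5, 0.75):
    sens, spec = sensitivity_specificity(probabilities, labels, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On this toy data, lowering the threshold raises sensitivity at the cost of specificity, which is the trade-off such a table is meant to expose.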

Several knowledge users were identified (online supplemental table S3). By gaining an understanding of important predictors of poor smoking cessation outcomes, healthcare professionals could develop personalised behavioural counselling and other services or tools to support individuals who are trying to quit84 85 88 92 94 and recent quitters who are at risk of relapsing.89 Public health officials were identified as the knowledge user of findings on the environmental effects of substance use among children,81 82 predictors of smoking cessation intentions90 and unique characteristics of exclusive vapers.93 An ML-powered phenotype model for smokers may be valuable for researchers and health registry managers to identify populations at risk of smoking using data from health claims databases.83 Five studies provided a data availability statement and two made their codes available to the public.84 94


This is the first scoping review to comprehensively assess the tobacco literature that identified ML as their analytical method. By categorising these studies into domains, we provided a systematic description on how ML has been applied to investigating different aspects of tobacco control. These studies have demonstrated various strengths of ML that could potentially benefit tobacco researchers and the tobacco control community. However, limitations of these studies—and the limitation of ML in general—warrant attention.

Our synthesis revealed unique aspects of ML in health research that have not been discussed previously. Prior reviews placed their main focus on the predictive performance of supervised learning models in comparison with conventional regression models.12–16 In this review we found unsupervised learning has also been applied in content analysis studies with data drawn from social media to uncover qualitatively distinct topics of discussion of tobacco. Hence, ML is a promising tool for prediction and it also has the potential to support qualitative research by handling large textual datasets. A recent article has demonstrated quantitative methods to analyse court transcripts, though the methods used were not ML.95 We believe that when dealing with even larger volumes of tobacco industry documents and other text data, ML could be applied to content analyses and yield valuable insights in an efficient manner. Next, we found ML studies belonging to one domain tended to investigate a certain class of tobacco products. Specifically, our search did not yield any papers in ML-powered technology that targeted non-cigarette tobacco users. In contrast, more than half of ML-based studies using social media data were about e-cigarettes or other alternative forms of tobacco. These observations imply that current ML applications in non-cigarette tobacco products are largely descriptive in nature and have yet to dive deeper into analysing person-level outcomes.93

Implications and future directions

Results of this review shed light on why tobacco researchers might consider adding ML to their analytical toolkit. First, ML allows researchers to streamline the analytical pipeline to improve efficiency. However, doing so requires good practice of data and/or code sharing in the research community, which, according to our review, is unsatisfactory at the current stage. Second, supervised and deep learning represent promising tools to devise highly accurate predictive models to aid tobacco control interventions. Specifically, ML enables highly efficient data-driven procedures to select important risk factors by screening tens of thousands of candidate variables,83 a volume that is usually not amenable to traditional statistical approaches. Third, for researchers working with unstructured data (such as free-text doctor notes), ML combined with Natural Language Processing provides an effective way to extract and classify information. This strength is particularly relevant now as vast amounts of data are emerging that potentially permit unprecedentedly extensive data linkage and analysis, but these data may not have been manually processed to become ready for traditional statistical analysis.

Successful and ethical implementation of ML can bring remarkable advancements to the decision-making and design of tobacco control interventions. ML-enabled technology, including wearable devices and smartphone apps, may represent cost-effective complements to existing tobacco cessation programmes as they optimise the precision and personalisation of the cessation experience. Large-scale public surveillance of tobacco use may become feasible as ML is applied to automatically monitor and analyse social media content. Such deployment can be used to target youth and young adults, who are both avid users of social media and a population of interest in tobacco control efforts. Furthermore, ML-based content analysis can provide timely feedback on public reactions to tobacco control policy, which allows decision-makers to make swift adjustments to reflect the needs of the population.69

We suggest five ways to move ML forward in tobacco research. First, there is an urgent need to mandate data and code sharing and external validation to permit research reproducibility and reliable adaptation of ML pipelines in other data settings. Second, ML models that predict population health-related outcomes are generally built on large person-level health records, an asset that is often lacking in tobacco research.16 This underscores the need to establish large person-level databanks of tobacco use and algorithms that link existing health records with data on tobacco to form the basis of ML applications.96–101 Third, ML research needs to extend from cigarette smoking and cessation to other tobacco products. Such effort could lead to innovative public health tools that are tailored to users of alternative tobacco products. Fourth, the current guideline for reporting ML studies in biomedical research does not require investigators to offer explanatory findings beyond model performance.23 Consequently, complex ML models, especially those that can neither be expressed in a mathematical equation nor represented graphically, remain a ‘black box’. Hence, tobacco researchers using ML need to be fully aware of statistical methods and computer programmes that generate explanatory findings, such as the relative importance of individual predictors, and that characterise complex high-dimensional interaction effects that are not otherwise easily approximated using conventional statistical approaches.102–106 Finally, tobacco researchers need to recognise the inherent bias of implementing ML in tobacco control and solutions to mitigate such bias.107 For example, the use of biased data (such as a dataset that underrepresents racial/ethnic minorities) is likely to lead to a biased ML algorithm whose predictions tend to amplify the underlying disparities. Understanding these issues will help standardise the procedures of ML research to ensure ethical and equitable applications of ML to support tobacco control policy.108
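As one illustration of an explanatory finding beyond model performance, the relative importance of individual predictors can be estimated by permutation: each predictor is shuffled in turn and the resulting drop in accuracy is recorded. The sketch below is a minimal example under stated assumptions and is not drawn from any reviewed study. The data are synthetic, the predictor names (cigarettes per day, years smoked, an unrelated noise variable) are hypothetical, and a simple fixed rule stands in for a fitted ‘black box’ model.

```python
import random


def permutation_importance(predict, rows, labels, n_features, n_repeats=20, seed=0):
    """Return baseline accuracy and, per feature, the mean drop in accuracy
    observed when that feature's values are shuffled across rows."""
    rng = random.Random(seed)

    def accuracy(data):
        return sum(predict(r) == y for r, y in zip(data, labels)) / len(labels)

    baseline = accuracy(rows)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the outcome
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(shuffled))
        importances.append(sum(drops) / n_repeats)
    return baseline, importances


# Synthetic data (assumption): (cigarettes_per_day, years_smoked, noise)
data_rng = random.Random(42)
rows = [(data_rng.randint(0, 30), data_rng.randint(0, 40), data_rng.random())
        for _ in range(200)]


def predict(row):
    """Hypothetical stand-in for a fitted model; ignores the noise variable."""
    return int(row[0] + 0.5 * row[1] > 20)


# Labels follow the same rule, so the stand-in model is perfectly accurate here.
labels = [predict(r) for r in rows]

baseline, importances = permutation_importance(predict, rows, labels, n_features=3)
for name, value in zip(["cigarettes_per_day", "years_smoked", "noise"], importances):
    print(f"{name}: mean accuracy drop = {value:.3f}")
```

In this sketch, shuffling the noise variable leaves accuracy unchanged because the stand-in model never uses it, whereas shuffling either smoking variable reduces accuracy. With a real fitted model the same procedure would be applied to held-out data.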


Our review has limitations. First, due to the high level of heterogeneity between studies in terms of objective, design, data and procedures, we were unable to perform a formal appraisal of study quality. Hence, future reviewers may use either the guideline for ML studies in biomedical research23 or modified versions of existing checklists for quantitative analyses109 110 to critically assess the reporting quality of these studies. Second, potential selection bias may arise from the English-language restriction. Thus, future reviewers might want to remove the language restriction to yield additional studies. Third, we fully recognise that by requiring studies to either mention ‘machine learning’ or be categorised under the ‘machine learning’ subject header, we might not have captured the entire ML literature in tobacco research. However, the purpose of this review was to assess tobacco studies that identified themselves as using ML in their analysis, and thus we expected studies to either specifically address ‘machine learning’ in the manuscript or be accessible from the ‘machine learning’ subject header. Future researchers should consider using a thesaurus to develop a standardised vocabulary that exhausts searchable concepts of ML. Such a vocabulary could inform the design of a concept map and the use of specific keywords in the literature search.111 Fourth, we excluded ML studies relying solely on diagnostic imaging data as they have already been systematically reviewed and they generally involved similar methods.11–14 Furthermore, the setting of these studies is strictly clinical, so their findings have relatively limited implications in a broader tobacco control context. We also did not have access to ML findings or reports, if any, from large commercial enterprises such as Google, Amazon or Facebook, other data aggregators, or tobacco/vaping companies.

In conclusion, ML represents a new class of statistical tools that could improve our understanding of tobacco use. There is untapped potential to make better use of these techniques to explore areas such as intersectionality. Future research needs to explore the use of ML with due consideration of interpretability, equity and research transparency. The value and risk of ML need to be carefully evaluated before implementation in personalised prevention, treatment or public policy.

What this paper adds

  • This study identified 74 papers that described themselves as using machine learning techniques.

  • These included studies on ML-powered technology to assist smoking cessation (n=22); content analysis of tobacco on social media (n=32); smoker status classification from narrative clinical texts (n=6) and tobacco-related outcome prediction using administrative, survey or clinical trial data (n=14).

What important gaps in knowledge exist on this topic

  • Machine learning techniques have become mainstream over the past few years. This paper explores how the field of tobacco control research has used machine learning and provides recommendations for where it might be useful in the future.



  • Twitter @drpselby

  • Contributors RF, AK and MOC designed the study. AK conducted the literature search. RF, AK and MOC conducted eligibility screening. RF and AK extracted the data. RF undertook data synthesis and led writing and revision of the manuscript. NM, TE-M, WW, SH, SJB, HH, PS, RS and MOC made substantial contribution to the preparation and revision of the manuscript. All authors read, critically revised and approved the final version of the manuscript before submission.

  • Funding This work was supported by the Canadian Institutes of Health Research Catalyst Grant #172898.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.