We performed a simulation study using data from callers to the Arizona Smoker’s Helpline who answered at least one question on the FTND. First, we removed incomplete observations from those real data, so that we could control the missing data mechanism and calculate the bias against a known “true” value. Then, we drew random samples with replacement to generate 1000 datasets for each parameter combination: two sample sizes, several proportions of missing data for subjects and for items, and two missingness mechanisms. We then applied six different methods for managing missing items to calculate the total FTND score, and we evaluated the performance of the methods for both the total FTND score and the coefficient for the total FTND score regressed on a covariate. Figure 1 gives an overview of these steps in the simulation study, and they are detailed below. The simulation was coded in R version 3.6.2 [12] and used the tidyverse R package [13]; the simulation code is available in Additional files 2, 3, 4 and 5.

### Data source

We started with data from 49,284 clients enrolled in ASHLine (Arizona Smoker’s Helpline; the state of Arizona’s quitline) from January 3, 2011 to June 23, 2016 who received standardized tobacco cessation protocols. Data for this study were collected at two time points: the first phone call at time of enrollment (including demographics), and the second phone call (first coaching call, including answers to FTND items). The data used in this study start with the subset of 38,742 clients who answered at least one of the FTND questions during the second phone call. Clients were 56.7% female (42.4% male; 0.9% missing) and ranged in age from 15 to 93 years with a mean age of 49.2 years [standard deviation (SD): 14.1 years; 0.9% missing]. Clients were asked about race and ethnicity in two separate questions: for race, clients were 72.2% White, 7.1% Black or African American, and 6.6% other race (14.1% missing); for ethnicity, 16.9% were Hispanic (65.7% non-Hispanic; 17.4% missing). For education, 15.4% of clients had less than high school, 29.6% had high school, 34.0% had some education above high school, and 18.0% had a college degree (3.0% missing). Out of the 38,742 clients, 38,334 (98.9%) answered all the FTND items.

### Simulation sample sizes

We started with the subset of 38,334 clients who completed all items on the FTND. Then we sampled with replacement to select a random sample for a simulation. We chose a small sample size of n_{obs} = 52 and a large sample size of n_{obs} = 788 based on detecting, respectively, a large effect size corresponding to Cohen’s standardized effect [14] d = 0.8 and a small effect size corresponding to Cohen’s d = 0.2, with 80% power for a two-sided t-test with Type I error of 0.05.
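The two sample sizes can be checked with a short power calculation. The sketch below is in Python rather than the R used for the simulation, and it assumes a two-sample t-test with equal group sizes (the text does not state the design) and a normal approximation with a small-sample correction term rather than an exact noncentral-t calculation. The function name `total_n_two_sample` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def total_n_two_sample(d, alpha=0.05, power=0.80):
    """Approximate total N for a two-sided, two-sample t-test (equal groups).

    Assumption: normal approximation plus the z_alpha^2 / 4 correction,
    not an exact noncentral-t power calculation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided test
    z_beta = z.inv_cdf(power)           # quantile for the target power
    n_per_group = 2 * (z_alpha + z_beta) ** 2 / d ** 2 + z_alpha ** 2 / 4
    return 2 * ceil(n_per_group)


print(total_n_two_sample(0.8))  # large effect -> 52
print(total_n_two_sample(0.2))  # small effect -> 788
```

Under these assumptions the approximation reproduces the totals of 52 and 788 used in the simulation.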

### Missingness mechanisms

To imitate missingness patterns seen in questionnaire responses, where missing items appear in only a subset of subjects, only a subset of subjects (a proportion of approximately p_{sub}) was eligible for item-level missingness. Which subjects were eligible for item-level missingness depended on the missing data mechanism, as described below. Then item-level missingness was assigned to eligible subjects with probability p_{item} using the Bernoulli distribution: for each subject eligible for missingness, a 0 or 1 was generated for each item, with 1 being generated with probability p_{item}. Items assigned the value 1 were changed to missing. To span missingness rates from the low level seen in our ASHLine data to rates high enough to distinguish the performance of the different methods of managing missing items, we simulated with values of p_{sub} = 0.1, 0.3, 0.5 and p_{item} = 0.1, 0.3, 0.5, 0.7, resulting in overall proportions of missing items (p_{sub} × p_{item}) ranging from 0.01 to 0.35.
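The two-stage assignment can be sketched as follows. This is an illustrative Python translation, not the authors' R code; the function name `add_item_missingness` and the use of `None` for a missing item are assumptions. Each subject's eligibility probability is passed in, because in the study it comes from the mechanism-specific logistic model.

```python
import random


def add_item_missingness(rows, eligibility_probs, p_item, rng):
    """Return a copy of rows with items set to None in two stages.

    Stage 1: each subject becomes eligible with their own probability.
    Stage 2: each item of an eligible subject is set to missing with
    probability p_item (independent Bernoulli draws).
    """
    out = []
    for row, p_eligible in zip(rows, eligibility_probs):
        eligible = rng.random() < p_eligible
        out.append(
            [None if (eligible and rng.random() < p_item) else v for v in row]
        )
    return out


# Expected overall proportion of missing items for each parameter combination.
p_sub_values = [0.1, 0.3, 0.5]
p_item_values = [0.1, 0.3, 0.5, 0.7]
expected_missing = {
    (ps, pi): ps * pi for ps in p_sub_values for pi in p_item_values
}
```

The dictionary `expected_missing` spans the 12 combinations, with the smallest value at (0.1, 0.1) and the largest at (0.5, 0.7).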

Investigating different causes of missingness (i.e., missingness mechanisms) is important, because the performance of different methods for managing missing items depends on the underlying reasons for the data being missing. Little and Rubin [15] defined three categories of missingness mechanisms. Data are missing completely at random (MCAR) if the probability of an item being missing (i.e., its missingness) is independent of the subject’s missing or observed characteristics. Data are missing at random (MAR) if an item’s missingness depends on the subject’s observed characteristics (note: some consider this covariate-dependent missingness). Data are missing not at random (MNAR) if an item’s missingness depends on what would have been the true value. In this study, we implemented MAR and MNAR missingness mechanisms, omitting MCAR missingness because its main effect is just to reduce the sample size, and MAR and MNAR missingness are likely more realistic.

The MAR missingness mechanism was carried out by choosing subjects to be eligible for item-level missingness based on two variables that were associated at *p* < 0.05 in our original sample (*N* = 38,742) with FTND missingness using single-variable logistic regression and with the FTND value itself using single-variable regression. The two variables were gender (0 = female; 1 = male) and the answer to the question, “If you smoke at home, where?” with possible answers 0 = No; 1 = Yes (outside), and 2 = Yes (inside). Subjects were eligible for missingness with a probability (Pr) determined by the following logistic regression model, where *smoke_where*_{1} and *smoke_where*_{2} are indicator variables for smoking outside at home and smoking inside at home, respectively:

$$\operatorname{logit}\Pr(\mathrm{missing})=\beta_0+\beta_1\cdot\mathrm{gender}+\beta_2\cdot\mathrm{smoke\_where}_1+\beta_3\cdot\mathrm{smoke\_where}_2$$

Additionally, we fixed the coefficients *β*_{1}, *β*_{2}, and *β*_{3} from logistic regression of FTND missingness on gender and the two smoke_where variables in the original dataset (*N* = 38,742) as follows: *β*_{1} = 0.20 (corresponding to an odds ratio OR = 1.22), *β*_{2} = − 2.12 (corresponding to OR = 0.12), and *β*_{3} = − 1.99 (corresponding to OR = 0.14). Thus, males were more likely to be eligible for missingness than females, and those who answered that they did not smoke at home were more likely to be eligible for missingness than those who answered that they did smoke at home (either inside or outside). We then chose the value of *β*_{0} empirically to make approximately p_{sub} subjects eligible for missingness. Finally, item-level missingness was performed within those subjects with probability p_{item} as detailed above.
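One way to choose the intercept empirically is sketched below in Python. The text does not say how *β*_{0} was tuned, so the bisection search, the function names, and the toy subjects are all assumptions for illustration; only the three slope coefficients come from the text.

```python
from math import exp


def inv_logit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + exp(-x))


def calibrate_intercept(linear_predictors, p_sub, lo=-20.0, hi=20.0, tol=1e-8):
    """Bisection search for beta_0 so the mean eligibility probability
    across subjects is approximately p_sub (an assumed tuning method)."""
    n = len(linear_predictors)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        mean_p = sum(inv_logit(mid + lp) for lp in linear_predictors) / n
        if mean_p < p_sub:
            lo = mid  # intercept too small: too few eligible subjects
        else:
            hi = mid
    return (lo + hi) / 2


# Slopes fixed as in the text; the subjects below are made-up toy data.
B_GENDER, B_OUTSIDE, B_INSIDE = 0.20, -2.12, -1.99
toy = [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]  # (gender, smoke_where)
lin_pred = [
    B_GENDER * g + B_OUTSIDE * (sw == 1) + B_INSIDE * (sw == 2)
    for g, sw in toy
]
beta0 = calibrate_intercept(lin_pred, p_sub=0.3)
probs = [inv_logit(beta0 + lp) for lp in lin_pred]
```

Each value in `probs` is then used as that subject's probability of being eligible for item-level missingness.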

The MNAR missingness mechanism was implemented by choosing subjects to be eligible for item-level missingness based on their total FTND score, with a probability determined by the following logistic regression model:

$$\operatorname{logit}\Pr(\mathrm{missing})=\beta_0+\beta_1\cdot\mathrm{FTND}$$

where we fixed the coefficient *β*_{1} = 0.2 (corresponding to OR = 1.22 for a 1-point increase in FTND), so that subjects exhibiting higher nicotine dependence were more likely to be eligible for missingness. The value of *β*_{0} was again chosen empirically to qualify roughly p_{sub} subjects for missingness. Finally, item-level missingness was carried out within those subjects with probability p_{item} as described above.

### Methods for managing missing items

**Complete case analysis (CCA)**, also referred to as listwise deletion, was performed by only calculating the total FTND score for subjects who had no missing items.

The **drop one** method was implemented as follows: if a subject had only one item missing, their total FTND score was calculated without it. Thus, the missing item was assumed to have a value of zero. If a subject had more than one item missing, their total FTND score was coded as missing.

**Item mean** imputation was carried out by imputing a missing item with the mean score for that item for all participants who answered it. **Half-rule (HR) item mean** imputation was performed like item mean imputation, but only if at least half (3) of the items on the FTND were non-missing for a subject (otherwise the total FTND score was coded as missing).
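The four scoring rules described so far can be sketched as small Python functions. This is an illustration rather than the authors' R code: the function names are hypothetical, a missing item is represented as `None`, and the toy rows are made up.

```python
def cca_total(row):
    """Complete case analysis: total only if no item is missing."""
    return None if any(v is None for v in row) else sum(row)


def drop_one_total(row):
    """Drop one: sum the observed items if at most one item is missing."""
    if sum(v is None for v in row) > 1:
        return None
    return sum(v for v in row if v is not None)


def item_means(rows):
    """Mean of each item over the subjects who answered it.

    Assumes every item was answered by at least one subject.
    """
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return means


def item_mean_total(row, means, min_observed=0):
    """Item mean imputation; min_observed=3 gives the half-rule variant."""
    if sum(v is not None for v in row) < min_observed:
        return None
    return sum(means[j] if v is None else v for j, v in enumerate(row))


toy_rows = [
    [3, 1, 1, 2, 0, 1],
    [2, None, 0, 1, 1, 0],
    [None, None, None, None, 1, 0],
    [1, 0, 1, None, None, None],
]
means = item_means(toy_rows)
```

For the third toy row, which has only two observed items, `item_mean_total(toy_rows[2], means, min_observed=3)` returns `None`, while plain item mean imputation still produces a total.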

**Proration**, sometimes called person or subject mean imputation, involves imputing the value of a subject’s missing item based on their answers to other items in the questionnaire. Imputation with proration was only performed following the half rule suggested for other questionnaires [4, 6]: if at least half (3) of the items on the FTND were non-missing for a subject, then the item was imputed (otherwise the total FTND score was coded as missing). For the FTND, items 1 and 4 have possible points (0, 1, 2, 3); the other four items all have possible points (0, 1). Thus, to weight items appropriately for proration, imputed item values were calculated as follows before summing values from all items to obtain the total FTND score:

$$\text{Items 1 and 4:}\quad \text{imputed item value}=3\times\frac{\text{total score of questions subject answered}}{\text{total possible score of questions subject answered}}$$

$$\text{Items 2, 3, 5, 6:}\quad \text{imputed item value}=\frac{\text{total score of questions subject answered}}{\text{total possible score of questions subject answered}}$$
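The weighted proration rule with the half rule can be sketched in Python as follows. The function name `prorated_total` and the `None` encoding of missing items are assumptions, and the sketch leaves the total unrounded because the text does not say whether imputed values were rounded.

```python
# Maximum points for FTND items 1..6 (items 1 and 4 score 0-3, others 0-1).
FTND_MAX = [3, 1, 1, 3, 1, 1]


def prorated_total(row, item_max=FTND_MAX, min_observed=3):
    """Proration with the half rule.

    Each missing item is imputed as (item maximum) x (observed score /
    possible score on the answered items); returns None when fewer than
    min_observed items were answered.
    """
    observed = [(v, m) for v, m in zip(row, item_max) if v is not None]
    if len(observed) < min_observed:
        return None
    ratio = sum(v for v, _ in observed) / sum(m for _, m in observed)
    return sum(m * ratio if v is None else v for v, m in zip(row, item_max))


example = [2, None, 0, None, 1, 0]
print(prorated_total(example))  # 5.0
```

In the example the subject scored 3 of a possible 6 points on the answered items, so item 2 is imputed as 0.5 and item 4 as 1.5. Because every imputed value is the item maximum times the same ratio, the prorated total always equals that ratio times the full-scale maximum of 10.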

K-nearest neighbor **hot deck** imputation replaced the missing items of each subject with values from a similar subject (a donor) who had data for those items. First, for each subject with one or more missing FTND items (called a recipient), the predictors of gender, the smoking allowed in home variable “Is smoking allowed in your home?” with responses 0 = No and 1 = Yes, and the subject’s non-missing FTND items were used to calculate Gower’s distance [16] between subjects to determine the k most similar subjects with non-missing data for those items. Second, a single donor was randomly chosen from those k subjects to donate imputed values to the recipient. Hot deck was implemented using the *simputation* R package [17] with k = 5 and pool = “multivariate”, so a pool of five donors was generated for each recipient, and if a recipient had more than one missing item, all imputations were provided by a single donor.
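The two steps can be illustrated with a small pure-Python sketch. It is not the *simputation* implementation: the function names, the record layout, and the simplified Gower distance (0/1 mismatch for categorical variables, range-scaled absolute difference for numeric ones, averaged over variables observed in both records) are assumptions made for illustration.

```python
import random


def gower_distance(a, b, ranges):
    """Simplified Gower distance over variables observed in both records.

    ranges[j] is None for a categorical variable, else its numeric range.
    """
    parts = []
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            continue
        if r is None:
            parts.append(0.0 if x == y else 1.0)
        else:
            parts.append(abs(x - y) / r if r > 0 else 0.0)
    return sum(parts) / len(parts) if parts else float("inf")


def knn_hot_deck(records, item_idx, ranges, k=5, rng=None):
    """Impute all missing items of a recipient from one randomly chosen
    donor among the k nearest subjects that have those items observed."""
    rng = rng or random.Random()
    imputed = [list(r) for r in records]
    for i, rec in enumerate(records):
        missing = [j for j in item_idx if rec[j] is None]
        if not missing:
            continue
        donors = [
            d for n, d in enumerate(records)
            if n != i and all(d[j] is not None for j in missing)
        ]
        donors.sort(key=lambda d: gower_distance(rec, d, ranges))
        pool = donors[:k]
        if pool:
            donor = rng.choice(pool)  # single donor for every missing item
            for j in missing:
                imputed[i][j] = donor[j]
    return imputed


# Toy layout: [gender, smoking_allowed, item1, ..., item6]
RANGES = [None, None, 3, 1, 1, 3, 1, 1]
toy = [
    [0, 1, 3, 1, 1, 2, 0, 1],
    [0, 1, 3, None, 1, None, 0, 1],
    [1, 0, 0, 0, 0, 1, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0],
]
filled = knn_hot_deck(toy, range(2, 8), RANGES, k=1, rng=random.Random(0))
```

With k = 1 the second toy subject receives both missing items from the first subject, who matches on every observed variable.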

### Performance measures

We assessed accuracy by estimating the bias of the descriptive parameter of the mean total FTND score itself and of the regression coefficient (association parameter) for the total FTND score in simple linear regression on the explanatory variable, “Is smoking allowed in your home?” with responses 0 = No and 1 = Yes. We estimated the bias of each quantity for each simulated dataset by calculating the difference between its value after a method for managing missing items was applied (\(\hat{\theta}\)) and its “true” value from the same simulated dataset before missingness was generated (*θ*): \(\hat{\theta}-\theta\); the percent bias was calculated by dividing the bias by the “true” value and multiplying by 100: \(\frac{\hat{\theta}-\theta }{\theta}\ast 100.\) Then, we calculated the mean and 95% Monte Carlo confidence interval (CI) over the 1000 datasets.
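The bias summaries can be sketched in Python as follows. The text does not give the formula for the Monte Carlo CI, so the normal-based interval (mean ± 1.96 × SD / √(number of datasets)) is an assumption, as are the function names.

```python
from math import sqrt
from statistics import mean, stdev


def mc_summary(values):
    """Mean and an assumed normal-based 95% Monte Carlo CI."""
    m = mean(values)
    half_width = 1.96 * stdev(values) / sqrt(len(values))
    return m, m - half_width, m + half_width


def bias_summary(estimates, truths):
    """Per-dataset bias and percent bias, summarised over datasets.

    estimates[i] is the value after a missing-item method was applied to
    dataset i; truths[i] is the value before missingness was generated.
    """
    bias = [e - t for e, t in zip(estimates, truths)]
    pct_bias = [100 * (e - t) / t for e, t in zip(estimates, truths)]
    return {"bias": mc_summary(bias), "percent_bias": mc_summary(pct_bias)}
```

Each entry of the returned dictionary is a (mean, lower, upper) tuple, one for the bias and one for the percent bias.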

We assessed precision for the FTND by estimating the bias of the standard error (SE) of the mean FTND score compared to the “true” empirical SE calculated from the standard deviation (SD) of the mean FTND score over the 1000 datasets for each missingness mechanism and method. Similarly, we assessed precision for the regression coefficient by estimating the bias of its SE compared to its “true” empirical SE calculated from the SD of the regression coefficient over the 1000 datasets for each missingness mechanism and method [18]. Zero bias for these quantities indicates high precision. We calculated percent bias as above, as well as the mean and 95% CI over the 1000 datasets.
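The comparison of estimated SEs against the empirical SE can be sketched in Python as follows. The function name `se_percent_bias` is hypothetical, and the inputs are assumed to be one point estimate and one estimated SE per simulated dataset for a given mechanism and method.

```python
from statistics import stdev


def se_percent_bias(estimates, estimated_ses):
    """Percent bias of each dataset's estimated SE relative to the
    empirical SE, taken as the SD of the estimates across datasets."""
    empirical_se = stdev(estimates)
    return [100 * (se - empirical_se) / empirical_se for se in estimated_ses]
```

The returned list has one value per dataset, so its mean and a 95% CI can be computed in the same way as for the bias of the point estimates. Values near zero indicate that the estimated SE tracks the actual sampling variability.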

A clinically important level of bias for the mean FTND score was considered to be 1 for two reasons: (1) a difference of 1 point can make the difference between distinct categories of nicotine dependence [e.g., 1 point can mean the difference between a subject being categorized as having medium (score 5) versus high (scores 6 or 7) nicotine dependence], and (2) a minimal important difference of 10% (equal to 1 point on the FTND) may be used in patient-reported outcome questionnaires [19].