
Pre-statistical harmonization of behavioral instruments across eight surveys and trials



Data harmonization is a powerful method to equate items from measures that evaluate the same underlying construct. Multiple measures exist for evaluating dementia-related behavioral symptoms. Pre-statistical harmonization of behavioral instruments in dementia research is the first step toward developing a statistical crosswalk between measures. Studies that conduct pre-statistical harmonization of behavioral instruments rarely document their methods in a structured, reproducible manner. Yet this crucial step entails careful review, documentation, and scrutiny of source data to ensure sufficient comparability between items prior to data pooling. Here, we document the pre-statistical harmonization of items measuring behavioral and psychological symptoms among people with dementia. We provide a box of recommended procedures for future studies.


We identified behavioral instruments that are used in clinical practice, a national survey, and randomized trials of dementia care interventions. We rigorously reviewed question content and scoring procedures to establish sufficient comparability across items as well as item quality prior to data pooling. Additionally, we standardized coding to Stata-readable format, which allowed us to automate approaches to identify potential cross-study differences in items and low-quality items. To ensure reasonable model fit for statistical co-calibration, we estimated two-parameter logistic Item Response Theory models within each of the eight studies.


We identified 59 items from 11 behavioral instruments across the eight datasets. We found considerable cross-study heterogeneity in administration and coding procedures for items that measure the same attribute. Discrepancies existed in terms of directionality and quantification of behavioral symptoms for even seemingly comparable items. We resolved item response heterogeneity, missingness and skewness, conditional dependency prior to estimation of item response theory models for statistical co-calibration. We used several rigorous data transformation procedures to address these issues, including re-coding and truncation.


This study highlights the importance of each aspect involved in the pre-statistical harmonization process of behavioral instruments. We provide guidelines and recommendations for how future research may detect and account for similar issues in pooling behavioral and related instruments.



Often there are multiple instruments that evaluate the same underlying construct. Data harmonization is a powerful method that combines data obtained from different items representing the same underlying construct. For example, in dementia research, behavioral symptoms are collected using different measures such as the Neuropsychiatric Inventory-Questionnaire (NPI-Q) and the Problem Behavior Checklist. Such instruments can be combined by anchoring on the shared behavioral symptomatology. Integrating data from different study populations encourages collaboration, increases sample size and statistical power, improves generalizability, facilitates subgroup analyses and investigation of rare phenotypes [21, 23, 24], ensures reliability of published study results [9], and optimizes existing data and research infrastructures [17]. Data harmonization has been used to advance research in genome-wide association studies [8, 26], neuroimaging [42], and dementia care [20, 25].

There are several approaches to data harmonization. For example, Item Response Theory (IRT) models a latent variable based on different sets of items that represent the same underlying construct; items can be measured by different instruments within studies or across studies [21, 23, 24]. Other commonly used statistical methods include standardization and missing-data-by-design with multiple imputation [18]. Regardless of the statistical harmonization method of choice, pooling of data is complex and requires careful scrutiny of source datasets and items [2]. Most methods require data to share some common items for linking purposes, which necessitates the qualitative process of pre-statistical harmonization [17]. However, this crucial step for optimizing existing research resources and infrastructures is rarely described in research.

Pre-statistical harmonization is the series of procedures undertaken before data pooling. The goal of pre-statistical harmonization is to identify items that are likely comparable across studies [17]. Pre-statistical harmonization involves selection of participant studies (e.g., careful review of study design, methods and study population), and identification of items to be harmonized [11]. It is crucial to identify items that are measured using comparable instruments. Candidate items for linking can be those measuring a comparable underlying construct, and can be harmonized using a simple transformation algorithm via IRT or other approaches. Several studies have described pre-statistical harmonization from disease areas including substance use [40] and cognitive impairment [6, 17].

In this study, we document the pre-statistical harmonization of dementia behavioral symptom measures captured in National Institute on Aging funded Alzheimer’s Disease Research Centers clinical evaluations, a National Institute on Aging national survey of cognitive health, and six National Institute on Aging funded randomized trials of nondrug dementia care interventions that include assessments of dementia-related behaviors. This study is the first step in a larger initiative to develop a statistical crosswalk between the different measures of dementia-related behaviors administered in clinical practice, national surveys, and randomized trials of dementia care interventions. A major challenge in harmonizing behavioral data is the remarkable variability in questions across instruments and in how they are asked. Studies vary in response options, directionality in coding responses (e.g., 0 = No, 1 = Yes vs. 0 = Yes, 1 = No), quantification of behavioral manifestations, and other factors. To ensure the quality and reproducibility of data pooling, careful scrutiny of the items to be harmonized is crucial to account for these differences. However, this phase of data harmonization is rarely discussed in published research.

In this paper, we aim to describe procedures we undertook during the pre-statistical harmonization process of measures of dementia related behaviors. We do so for the sake of reproducibility and to provide a guide for future studies requiring the consolidation of multiple data sources to evaluate behavioral symptoms of dementia [38, 39]. Specifically, we describe approaches to review question content and scoring procedures in order to establish comparability across items before data pooling. We then summarize our findings on how heterogeneity in items both within and across studies might lead to difficulty in interpretations of statistical models. Finally, we offer guidelines for how future research needs to acknowledge and address similar issues in pooling behavioral instruments.



We identified measures of dementia behavioral symptoms that are used in clinical practice, a national survey, and randomized trials of dementia care interventions. Because our ultimate goal is to develop a statistical crosswalk between common measures of behavioral symptoms, we only needed a single record per participant in studies with longitudinal data. National Institute on Aging funded Alzheimer’s Disease Research Centers submit longitudinal clinical evaluations to the National Alzheimer’s Coordinating Center (NACC) uniform dataset, which includes an assessment of behavioral symptoms. For NACC, we used data from clinical evaluations submitted between September 2005 and May 2020 (N = 14,654), and we selected a single random visit for each participant with a dementia diagnosis [1, 4]. The Aging, Demographics and Memory Study (ADAMS) is a US representative survey of cognitive health, in which participants were administered a clinical assessment that includes measures of behaviors [32]. We restricted our analysis to ADAMS Wave A participants with a dementia diagnosis (N = 308). Care of Older Persons in Their Environment (COPE) (N = 237), the Tailored Activity Program (TAP) (N = 60), the Alzheimer’s Quality of Life Study (ALZQOL) (N = 88), the Advancing Caregiver Training (ACT) project (N = 272), the Resources for Enhancing Alzheimer’s Caregiver Health project (REACH) (N = 670), and the Adult Day Services Plus study (ADS Plus) (N = 194) are National Institute on Aging funded trials of non-drug dementia care interventions that included measures of dementia related behaviors and in which the study principal investigator was willing to share data. We used baseline data from these trials so that responses would not be confounded by participation in the trial.

Specifically, COPE was a randomized trial testing the effectiveness of a nonpharmacologic intervention that aims to improve physical functioning, quality of life, and behavioral outcomes for people living with dementia [15]. The TAP trial tested a home-based occupational therapy intervention that aimed to reduce behavioral symptoms of people living with dementia [16]. ALZQOL was a randomized trial assessing potentially modifiable risk factors associated with quality of life for persons with dementia and their caregivers [12]. ACT was a randomized trial testing a home-based multicomponent intervention targeting environmental and behavioral factors contributing to the quality of life of persons with functional disabilities [13, 14]. REACH II was a randomized controlled trial of the effects of a multicomponent intervention on quality of caregiving among caregivers of dementia patients [3]. The ADS Plus study was a randomized controlled trial of the effect of a management intervention on quality of caregiving among dementia caregivers, service utilization, and institutional placement of care recipients [13, 14]. We selected only baseline data from COPE, TAP, ALZQOL, ACT, REACH II, and ADS Plus to be merged with the other studies.


We acquired codebooks, data entry and test forms, and procedural instructions from each study. We then identified common behavioral instruments and items used within and across studies. We reviewed each individual item to identify its respective behavioral attribute, skip patterns (e.g. questions that are conditional on other items), question stems, response options or scoring types, and theoretical score ranges. This step revealed multiple sources of variation across studies.

Upon reviewing available documentation, we created a crosswalk document that links common items from assorted instruments adopted within and across studies that assess behaviors. As implemented here, a crosswalk is a table that maps common elements between different studies. Each row represents an item of interest (e.g., whether a respondent exhibits false beliefs). The relevant individual test item for a study associated with a construct is placed on the corresponding row in the crosswalk. Items judged by experts to be similar across studies were placed on a row together. Additionally, this crosswalk contained relevant information about each item in each study including the name of source dataset, specific section of the survey, name and version of the instrument being used, study-specific name for each item, question stem, and responses options which included the possible score range. The crosswalk was updated throughout the pre-statistical harmonization procedures listed below. For the purpose of data sharing, this crosswalk will be made available upon request to the senior author.


Establishing a workflow is a process that encompasses all aspects involved in data management. We followed procedures described in The Workflow of Data Analysis Using Stata, by J. Scott Long [33]. In our data analysis project, we used a generalizable file structure sharable via a secured online server that can be accessed by team members from different terminals. This structure facilitates reproducible research.

There are nine common folders: 1) Hold then delete, 2) To file, 3) Admin, 4) Documentation, 5) Mailbox, 6) Posted, 7) Private, 8) Readings, and 9) Work. Hold then delete and To file temporarily hold files so that we can determine their purpose later, as needed. Admin is a folder for budgeting, correspondence with other investigators, IRB paperwork, and the proposal. Posted is probably the most important folder: it contains sub-folders for analysis, data (both source data and data derived with our analytic code; distinguishing between these is especially crucial for reproducible research), descriptive statistics, figures, and interim analyses. The other folders are self-explanatory. Under the Posted folder, there is a sub-folder containing the common set of analytic files. Analytic files contain sections of code pertinent to a specific task during data cleaning and processing. Descriptions of each analytic file are below:

Data management files:

  1. sets up the working directory and calls all files;

  2. sets local and global macros to store study-specific item names;

  3. produces PDFs for reports;

  4. calls data from source data files and minimally processes the raw data, such as reshaping data into long format and generating record IDs;

  5. generates the rename, recode, and labeling commands for each item and stores them as global macros;

  6. calls the renaming, recoding, and labeling global macros and merges studies together;

  7. performs data cleaning so that items have the same values across datasets;

  8. identifies data-specific global macros of items for each attribute;

Data analysis files:

  9. This program runs IRT analyses of behavioral items in each study separately.

  10. This program conducts statistical co-calibration via IRT of behavioral items.

  11. This is an optional syntax file for any sensitivity analyses necessary to probe robustness of results.
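The nine-folder workflow skeleton described above can be sketched as a short script. This is a minimal sketch, not the authors' actual setup: the top-level folder names come from the text, while the Posted sub-folder names and the `workflow-demo` root are our reading of the description.

```python
# Create the nine-folder workflow skeleton under a hypothetical project root.
# Top-level names follow the text; Posted sub-folder names are our assumption.
import os

TOP_LEVEL = ["Hold then delete", "To file", "Admin", "Documentation",
             "Mailbox", "Posted", "Private", "Readings", "Work"]
# Keeping source and derived data separate supports reproducible research.
POSTED_SUBFOLDERS = ["analysis", "data/source", "data/derived",
                     "descriptive", "figures", "interim"]

def create_workflow(root):
    for name in TOP_LEVEL:
        os.makedirs(os.path.join(root, name), exist_ok=True)
    for name in POSTED_SUBFOLDERS:
        os.makedirs(os.path.join(root, "Posted", name), exist_ok=True)

create_workflow("workflow-demo")
```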

Conditional dependency

Responses to some items are conditional on others. For example, answering “yes” to a question about a behavior may prompt, in some datasets, additional questions about the frequency or recency of that behavior. Such items are inappropriate to include in statistical harmonization because they are conditionally dependent on each other. To address this in our datasets, we rigorously and manually reviewed each instrument and the items therein. We found that, in this project, items assessing the severity or frequency of a behavior were usually conditional on a binary yes/no question assessing the presence of that behavior.

Standardization of item coding

A critical stage of pre-statistical harmonization is to ensure that common items, and items that can be made comparable via transformation, are comparable across instruments or studies. One way to achieve this is to code response options in the same way across multiple instruments and studies. For example, ADAMS adopted the Neuropsychiatric Inventory (NPI) as an instrument to measure behavioral outcomes. One specific test item measures whether a delusion of danger is present in the participant’s behavior. The question stem is “Does (NAME) believe that (HE/SHE) is in danger--that others are planning to hurt (HIM/HER)?”. Response options include: yes (coded 1), no (coded 5), invalid (coded 6), skipped or not asked because the screening symptom was not confirmed (coded 96), not assessed/not asked (coded 97), don’t know (coded 98), and not applicable/not assessed for this item (coded 99). To standardize response options, we edited the crosswalk column and replaced text descriptions of value labels with Stata-readable code that revised values so that the resulting variable would be ready for analysis. For the above example, we set values of 6, 96, 97, 98, and 99 to missing, and values of 5 to 0, such that the final analysis item for this question has values of 0 (not endorsed) and 1 (endorsed).
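The recode just described can be expressed compactly. This sketch uses the exact codes from the ADAMS example above; the function name is ours, and the real pipeline stored equivalent recode commands as Stata code in the crosswalk rather than Python.

```python
# Recode the ADAMS NPI "delusion of danger" item: 1 = yes -> 1 (endorsed),
# 5 = no -> 0 (not endorsed); 6/96/97/98/99 -> missing (None).
MISSING_CODES = {6, 96, 97, 98, 99}  # invalid / skipped / not assessed / DK / NA

def recode_npi_yes_no(value):
    """Standardize one raw ADAMS NPI response to 0/1/missing."""
    if value in MISSING_CODES:
        return None
    return {1: 1, 5: 0}[value]  # any other raw value would signal a data error

recoded = [recode_npi_yes_no(v) for v in [1, 5, 96, 98, 5]]
```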

Comparability of items

In addition to manually reviewing each test item, we leveraged several automated approaches to uncover potential cross-study differences in items. We summarized and visualized data by displaying item values specific to each study. For example, we cross-tabulated items and studies, conditioning on an item having different minimum and maximum values across studies. The resulting tables identify items that have different scoring procedures across studies. Another approach to uncovering potential sources of heterogeneity is to estimate correlation matrices. These matrices help identify items with sizable negative intercorrelations with other items, which may need to be reverse coded to be in the same direction as other items. Within each study, we tabulated the frequency of each item to identify skewness and potential outliers. We cycled through every item within each study, filtering out items with the same minimum and maximum value within a specific study (indicating no variability). We also filtered out items with a maximum value between 90 and 100, or with negative values, because for our scales such values indicated missing-data codes that should be removed prior to analysis via recoding in our crosswalk. In our preliminary IRT analysis, we leveraged both univariate and bivariate residual analysis to identify items whose model-estimated correlations mismatched their empirical correlations. Details on how each approach was adopted for specific harmonization tasks are given in the following sections.
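The automated scans described above can be sketched as follows. The data, item names, and function name are invented for illustration (the actual pipeline cycled over items in Stata); the flags mirror the three checks in the text: discrepant min/max across studies, no variability, and residual missing-data codes (values at or above 90, or negative values).

```python
# Flag items with cross-study scoring discrepancies or quality problems.
data = {
    "studyA": {"wandering": [0, 1, 1, 0], "agitation": [0, 1, 2, 3],
               "delusions": [0, 1, 96, 0]},   # 96 is a leftover missing code
    "studyB": {"wandering": [1, 2, 1, 2], "agitation": [3, 1, 2, 0]},
}

def flag_items(data):
    flags = {"range_mismatch": set(), "no_variability": set(),
             "missing_codes": set()}
    items = {item for study in data.values() for item in study}
    for item in items:
        # Differing (min, max) across studies suggests discrepant scoring.
        ranges = {(min(v), max(v)) for study in data.values()
                  if (v := study.get(item))}
        if len(ranges) > 1:
            flags["range_mismatch"].add(item)
        for study in data.values():
            v = study.get(item)
            if v is None:
                continue
            if min(v) == max(v):                 # no variability in this study
                flags["no_variability"].add(item)
            if max(v) >= 90 or min(v) < 0:       # residual missing-data codes
                flags["missing_codes"].add(item)
    return flags

flags = flag_items(data)
```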

Missingness and skewness

On top of the missing-value codes already specified in the original documentation (e.g., in ADAMS: 6 = Invalid; 96 = Skipped, or not asked because screening symptom not confirmed; 97 = Not assessed/Not asked (NPI not completed); 98 = DK (Don’t Know); 99 = Not applicable/not assessed for this item), we paid close attention to other sources of outlying values or skewness. For example, some items represent severe or extremely rare behavioral symptoms (e.g., inappropriate sexual contact), such that the frequency with which they are endorsed within a particular study is low. Another possible scenario is an item that is only assessed for a subset of participants, as an artifact of conditional dependency (i.e., an item can only be answered given another item’s response). Sometimes there is little to no variability in responses due to small sample sizes; ALZQOL and TAP are both small randomized trials.

Items with excessive missingness (which is usually explainable) or skewed responses could create issues in later statistical models such as IRT analysis, mainly because they would correlate poorly with other items.


Items should be consistent in their polarity or direction to facilitate interpretability of harmonized factor analysis. We decided to code “higher” values as indicative of adverse behaviors. If the presence of a certain behavior is considered “worse”, we gave this response option a higher value. The same procedure was undertaken for items that indicate severity or frequency, such that higher values indicate more severe presentations of a behavioral symptom. This step required careful review of question content and response options.

On top of these qualitative review procedures, we adopted an automated approach to identify items in need of reverse coding. Within each study, we computed correlation matrices and flagged potentially problematic negative correlations (r < −0.2). Unless due to chance, a sizable negative correlation suggests that an item may need reverse scoring relative to other items within the same study.
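The reverse-coding screen can be sketched as below. The items and values are invented for illustration ("calm" is deliberately scored in the opposite direction), and the −0.2 threshold is the one stated above; the actual analysis ran correlation matrices in Stata rather than Python.

```python
# Within one study, flag item pairs whose Pearson correlation is below -0.2,
# suggesting one item of the pair may need reverse coding.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

study = {
    "agitation": [0, 1, 1, 2, 3, 3],
    "wandering": [0, 0, 1, 2, 2, 3],
    "calm":      [3, 3, 2, 1, 1, 0],  # higher = better here: needs reversal
}

names = sorted(study)
flagged = [(a, b)
           for i, a in enumerate(names) for b in names[i + 1:]
           if pearson(study[a], study[b]) < -0.2]
```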

Scoring type and scales

Beyond ensuring consistent directionality, all studies must have the same response levels for a given item (e.g., 4-point Likert scale, 5-point Likert scale, binary no/yes). Discrepant response levels arose when different instruments were used but judged to have common items. We undertook rigorous efforts to reconcile presumably comparable items that were subject to differential scoring across studies.

Model fitting

To check that the above procedures aided, rather than hindered, statistical co-calibration, we estimated two-parameter logistic Item Response Theory (IRT) models in each of the eight studies. This procedure is akin to testing configural measurement invariance (e.g., [31]). The two-parameter IRT model predicts the probability that an individual endorses a behavioral symptom or item as a function of the discrepancy between that person’s unobserved level on the underlying trait and the item difficulty parameter, modified by the item discrimination parameter. Item difficulty, akin to a threshold in factor analysis, is the level of the underlying trait at which a randomly selected individual from the population has a 50% probability of endorsing the item. Item discrimination, analogous to a factor loading in factor analysis, describes the strength of association between the item and the underlying trait, or how well the item separates individuals with low and high levels of the underlying trait [34, 41]. We scrutinized Mplus output for high residuals between empirical and model-estimated covariances (specifically, standardized residuals greater than 0.3), which would suggest a mismatch between model-estimated and empirical correlations [37]. High residuals could imply that items measure a similar behavior, or a heretofore undetected problem with conditional independence. Estimating separate models within each individual study helps establish configural invariance by detecting couplet items that may be subject to multidimensionality [30]. This procedure lets us examine whether participants from different studies interpret the behavioral measurement items in a conceptually equivalent way [5]. We used three fit statistics to examine model fit: the root mean square error of approximation (RMSEA), the comparative fit index (CFI), and the standardized root mean square residual (SRMR).
By convention, we adopted an RMSEA below 0.05 to indicate excellent fit and an RMSEA between 0.05 and 0.08 to indicate mediocre fit. For CFI, a value greater than or equal to 0.90 indicates good fit, and an SRMR value below 0.08 indicates good fit [28, 29, 35]. Misfit of such a model is not conclusive evidence of a harmonization problem, because misfit can also result from model misspecification; we therefore used model fitting as an exploratory approach.
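The two-parameter logistic model described above can be written as P(endorse) = 1 / (1 + exp(−a(θ − b))), where θ is the latent trait level, b the item difficulty, and a the item discrimination. The sketch below uses this standard 2PL form; the parameter values are illustrative only, not estimates from the study.

```python
# Two-parameter logistic (2PL) IRT item response function.
import math

def p_endorse(theta, difficulty, discrimination):
    """Probability of endorsing an item given trait level theta."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# At theta == difficulty, the endorsement probability is exactly 0.5,
# matching the definition of item difficulty in the text.
p_at_difficulty = p_endorse(theta=1.2, difficulty=1.2, discrimination=1.7)
```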


Characteristics of study samples

Among the eight samples, NACC has the largest sample size (n = 14,564). Thus, the baseline characteristics of our pooled sample are largely driven by participants from NACC. TAP (n = 60) and ALZQOL (n = 88) are both community-based trials and have the smallest sample sizes. Most study cohorts are balanced in terms of participant sex, except for ADAMS and COPE, which had predominantly female participants. Mean ages are reasonably comparable across studies: ADAMS has the oldest cohort (85.2 years) and ADS Plus the youngest (67.3 years). Study cohorts are predominantly White and of non-Hispanic origin. Detailed baseline characteristics of study participants in each cohort are available in Table 1.

Table 1 Characteristics of study participants

We identified 59 items from 11 instruments that measure a theoretically similar construct of behavioral symptomatology. Among these items, 29 items are unique to only one study; 4 items are common across 2 studies; 8 items are common across 3 studies; 6 items are common across 4 studies; 4 items are common across 5 studies; 1 item is common across 6 studies; 4 items are common across 7 studies; 3 items are common across 8 studies.

Conditional dependency

A threat to statistical harmonization is conditional dependency among items. For example, the Neuropsychiatric Inventory-Clinician rating scale (NPI-C) first determines whether a behavior is present (e.g., verbal aggression). If so, conditional questions regarding different types of verbal aggression are asked. During our qualitative review of instruments and item descriptions, we identified 13 items that are conditional upon other items in NACC; 136 conditional items in ADAMS; 38 in COPE; 42 in TAP; 120 in ALZQOL; 39 in ACT; 57 in REACH; and 34 in ADS Plus. We excluded these conditional items because they are redundant with other items in a given dataset, and because items in a psychometric model should not depend on other items. As with most statistical methods, dependency among items can inappropriately boost reliability and give a false sense of measurement precision in a psychometric model.

Missingness and skewness

In reviewing every item within each study, we found that one item, indicating refusal to cooperate with appropriate help or resisting care with daily activities, had no variability in ALZQOL. We thus removed this item from ALZQOL.


Per our qualitative review of question stems and response options, together with automated statistical analysis on correlation matrices, we reverse-coded items within each study to ensure consistent directionality. Specifically, we reverse-coded 22 items in ADAMS; 2 items in COPE; 4 items in TAP; 24 items in ALZQOL; 2 items in ACT; 4 items in REACH.

One example of an item flagged during qualitative review is the poor-concentration item of the Dementia Quality of Life Instrument (DEMQOL) in ALZQOL. The question stem is: In the last week, how worried have you been about poor concentration? The original response options are: 1 = A lot, 2 = Quite a bit, 3 = A little, 4 = Not at all, −5 = Can’t answer. Since we decided that, for all items, higher values should indicate worse or more severe symptomatology, we recoded this item as 3 = A lot, 2 = Quite a bit, 1 = A little, 0 = Not at all, and recoded Can’t answer as missing. Another example is an item from ADAMS using the Neuropsychiatric Inventory (NPI) assessing whether the participant has enough energy. The question stem is: Does (HE/SHE) have enough energy? The original response options are: 1 = Yes, 5 = No, 6 = Invalid, 96 = Skipped, or not asked because screening symptom not confirmed, 97 = Not assessed/Not asked (NPI not completed), 98 = DK (Don’t Know), 99 = Not applicable/not assessed for this item. Since not having enough energy represents a graver symptom and thus should be given a higher score, we recoded the response as 0 = Yes, 1 = No.

Scoring types and scales

By reviewing each instrument, we found discrepancies in scoring procedures. For example, the Neuropsychiatric Inventory (NPI), the Neuropsychiatric Inventory-Questionnaire (NPI-Q), the Neuropsychiatric Inventory-Clinician rating scale (NPI-C), the Geriatric Depression Screening Scale (GDS), the Cornell Depression in Dementia Scale (CDDS), the Care-recipient Behavioral Occurrence and Care-giver Upset (BOUP), and the Care-recipient Behavioral Occurrence and Care-giver Upset & Problem Behavioral Checklist (BOCGU) used binary coding, whereas the Blessed Dementia Rating (Blessed) instrument used ordered categories for scoring a behavioral symptom. Moreover, some instruments, such as the Revised Memory and Behavior Checklist, the Dementia Quality of Life Instrument (DEMQOL), and the Dementia Quality of Life Instrument-Proxy (DEMQOL-Proxy), use both binary and categorical scoring. Some items were counts because they were summary scores for behavioral symptoms; we identified four such summary-score items and excluded them from our datasets.

One example of a discrepancy in scoring procedures among instruments: one item from the Blessed Dementia Rating (Blessed) used scores of 0, 1, 2, and 3 to code the presence of no serious behavioral language problem, shouting, cursing, or verbal aggression, with higher scores indicating greater severity, whereas the Care-recipient Behavioral Occurrence and Care-giver Upset (BOUP) instrument used binary codes of 0 and 1 to indicate the presence of such behavior. By cycling through each item and comparing minimum and maximum values across studies, we identified 20 items with different scoring or coding procedures across studies. We therefore hard-coded some modifications to ensure cross-study consistency in scoring. Specifically, we kept the 0 option and truncated scores of 1 or greater to 1. For studies that coded the lowest score value as 1 rather than 0, we shifted all scores down by 1 point.
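The two transforms just described (truncation to a presence indicator, and shifting a 1-based scale to start at 0) can be sketched as follows. The function names are ours; the actual recodes were applied as hard-coded Stata commands.

```python
# Harmonization transforms for discrepant scoring across studies.
def truncate_to_binary(score):
    """Collapse an ordinal severity code (e.g., Blessed 0/1/2/3)
    to 0 = absent, 1 = present, matching binary instruments like BOUP."""
    return None if score is None else min(score, 1)

def shift_to_zero_based(score, lowest=1):
    """Shift a scale whose lowest value is `lowest` so it starts at 0."""
    return None if score is None else score - lowest

blessed = [truncate_to_binary(s) for s in [0, 1, 2, 3]]
likert = [shift_to_zero_based(s) for s in [1, 2, 3, 4]]
```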

To reduce the risk of small cells and outliers in analysis, we also performed such truncation for items with small counts in certain cells, if the items were only present in one study. We collapsed one ordinal item into a binary response in REACH II; 3 items in ADAMS; 2 items in COPE; 2 items in TAP.

Model fitting

We examined model fit statistics, including empirical and model-estimated frequencies and standardized residuals. We used this output to flag items that had a high impact on model fit and items responsible for high standardized residuals (i.e., greater than 0.3 or smaller than −0.3). For bivariate residual correlations, we flagged items with residuals greater than 3 or smaller than −3. Specifically, we identified 4 sets of couplet items in ACT; 6 in ADAMS; 36 in ALZQOL; 2 in COPE; 6 in REACH II; 13 in TAP; and 40 in NACC. Inspection showed these couplets to be conceptually similar items; the full list is available in the Supplemental materials. Such high residuals are potential artifacts of violations of unidimensionality in the underlying item set. Table 2 shows the final model fit statistics for each study’s IRT model. Judging by these criteria, model fit ranged from acceptable to excellent in each study.

Table 2 Fit statistics of two-parameter IRT models within each study


Our study’s goal was to describe and document the detailed procedures undertaken in the pre-statistical harmonization of behavioral instruments, and to document the findings those procedures uncovered in a reproducible manner [7]. To establish comparability of items across instruments and studies, we conducted extensive manual review of instruments and scrutinized raw data using automated procedures. We addressed several sources of heterogeneity across studies. Even when seemingly comparable items were asked across instruments and studies, differences in scoring schemes rendered them essentially incompatible with each other. Table 3 contains a summary of the procedures we undertook and may serve as a checklist for future studies requiring pre-statistical harmonization.

Table 3 Recommended procedures

We summarize our recommended procedures and potential solutions for the issues we uncovered in Table 3. We offer three general guidelines to avoid potential pitfalls and streamline the process for applied researchers planning a data harmonization project. First, establishing comparability of individual items is warranted even when standardized tests and instruments are used, because administration differences can produce key cross-study differences. Researchers need to carefully document and scrutinize sources of variance, such as discrepancies in scoring and coding procedures, that may lead to erroneous results. To facilitate this review, obtaining abundant information from the source studies is especially important. Second, use a harmonized coding scheme that is both easy to use and retains all meaningful information [10]. In our study, we used the lowest common denominator (i.e., presence) to select variables representing comparable behavioral symptoms; this means we discarded additional information about other aspects of the symptoms (e.g., frequency or severity). Finally, researchers should be cautious about sources of misfit in statistical models. Misfit is usually attributable to a misspecified model; however, in an integrative analysis across multiple data sources, misfit in parametric modeling can also point toward non-equivalent items. Failing to detect and account for item non-comparability and conditional dependency may prevent acceptable model fit. Our study offers a template for conducting pre-statistical harmonization and fostering reproducibility.

Acknowledging and addressing limitations in raw data are critical steps before data pooling. Our findings have three implications for harmonization of similar datasets involving survey data. First, pooling non-comparable items, such as items with reverse polarity or items with different coding schemes across studies, may introduce bias by artificially creating variance between individuals from different studies. Second, conditional items, or items with substantial missingness or skewness, may lead to spurious correlations and large residuals in statistical models. Items identified as logically conditional on other items are especially problematic because they are highly correlated with one another. Moreover, conditional items usually assess the frequency or severity of a behavioral symptom that is present; such count items are often highly skewed, especially when the condition is rare. Finally, conditional items provide essentially duplicate information relative to other items.
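The automated screening for low-quality items alluded to above can be illustrated with a minimal sketch. The thresholds and function name below are hypothetical, chosen only to demonstrate the idea of flagging excessive missingness and near-constant (skewed) binary items before pooling; they are not the study's actual cutoffs.

```python
# Hypothetical item-screening sketch: flag binary items whose missingness
# or endorsement rate would make them problematic in pooled models.
# Thresholds are illustrative assumptions, not the study's criteria.

def screen_item(responses, max_missing=0.25, min_endorsement=0.05):
    """Return a list of quality flags for one item's binary responses."""
    n = len(responses)
    observed = [r for r in responses if r is not None]
    missing_rate = 1 - len(observed) / n
    endorsement = sum(observed) / len(observed) if observed else 0.0
    flags = []
    if missing_rate > max_missing:
        flags.append("missingness")
    if endorsement < min_endorsement or endorsement > 1 - min_endorsement:
        flags.append("skewness")  # nearly everyone (or no one) endorses
    return flags

rare = [0, 0, 0, None, 0, 0, 0, 0, None, 0]          # never endorsed
patchy = [1, None, None, 0, None, 1, None, 0, 1, 0]  # 40% missing
print(screen_item(rare))    # -> ['skewness']
print(screen_item(patchy))  # -> ['missingness']
```

In practice such flags would be reviewed alongside the instrument documentation, since a rare symptom and a miscoded skip pattern can look identical in the raw data.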

This study highlights the importance of each step in the pre-statistical harmonization of behavioral instruments. Specifically, careful review of each instrument and item is critical for discovering potential limitations in raw data. This step identified common items to be pooled both across and within studies, uncovered discrepancies in coding schemes, and detected skip patterns. Performing preliminary IRT analyses within each study helped ensure reasonable model fit before pooling across studies. This step detected items that may violate the local independence or unidimensionality assumptions of IRT models [27]. Regardless of the statistical method chosen to harmonize behavioral instruments, the approaches described in this paper for uncovering and addressing issues specific to harmonizing behavioral instruments are important to consider before carrying out the analysis.
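The within-study two-parameter logistic (2PL) IRT models mentioned above specify the probability that a person endorses an item as a logistic function of latent symptom severity. As a minimal sketch of that item response function (the study itself fit these models with dedicated software; the parameter values below are hypothetical):

```python
import math

def p_endorse(theta, a, b):
    """2PL IRT item response function: probability that a person with latent
    trait level theta endorses an item with discrimination a and
    difficulty (severity threshold) b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: well-discriminating (a = 2.0), endorsed mostly by
# people with above-average symptom severity (b = 1.0).
print(round(p_endorse(1.0, 2.0, 1.0), 2))  # -> 0.5, exactly at theta == b
print(round(p_endorse(0.0, 2.0, 1.0), 2))  # -> 0.12, an average person
```

Items whose estimated response probabilities fit poorly under this function, for example because two items rise and fall together beyond what theta explains, are the candidates for the local-independence violations discussed in the text.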

After pre-statistical harmonization, one conducts statistical harmonization to derive scores for a construct that are commonly scaled across multiple data sources. This score or set of scores can then be used to address substantive research questions [17, 19,20,21,22,23,24,25, 36].

One limitation of the current study is that we did not combine items indicating the frequency or severity of a behavioral symptom with the corresponding screener item (i.e., the item indicating presence of the symptom); instead, we excluded all conditionally dependent items. Additionally, we dichotomized all ordinal items. Indicator coding simplifies our analysis but may reduce resolution and item quality. Another potential limitation is that identifying common items across datasets can be subjective; however, we leveraged expert review to assign items, which is considered state of the art. In the next stage of analysis, after deriving a factor score for each participant, we were able to quantify the amount of error based on the quality and missingness of the respective items in a given study battery. We found that a considerable number of participants had imprecisely estimated factor scores, especially in ADSPlus, ADAMS, and REACH II. This observation may reflect the inherent nature of psychometric measurements used to assess problematic dementia behaviors. It warrants careful interpretation of the harmonized factor scores and may point to the need for sensitivity analyses in the future.


Data harmonization is an essential step toward effective use of existing data. In this study of pre-statistical harmonization, we pooled data on measures and items of dementia-related behavioral symptoms captured in clinical assessments, a national survey, and randomized trials of non-drug dementia care interventions. An important next step is to reproduce the pre-statistical harmonization procedures described in this paper in other domains of interest, such as measures of the functional and cognitive abilities of dementia patients across datasets.

Availability of data and materials

The datasets used during the current study are available in JHU OneDrive. The crosswalk we created, which can be used to reproduce the pre-statistical harmonization described in this manuscript, can be obtained from the corresponding author upon request.



Abbreviations

ACT: The Advancing Caregiver Training project

ADAMS: The Aging, Demographics and Memory Study

ADS Plus: The Adult Day Services Plus study

ALZQOL: The Alzheimer’s Quality of Life study

The Blessed Dementia Rating instrument

The Care-recipient Behavioral Occurrence and Care-giver Upset instrument

The Care-recipient Behavioral Occurrence and Care-giver Upset & Problem Behavioral Checklist

The Cornell Depression in Dementia Scale

COPE: The Care of Older Persons in Their Environment study

The Dementia Quality of Life Instrument

The Dementia Quality of Life Instrument-Proxy

GDS: The Geriatric Depression Screening Scale

IRT: Item Response Theory

NACC: The National Alzheimer’s Coordinating Center

NPI: The Neuropsychiatric Inventory

NPI-C: The Neuropsychiatric Inventory-Clinician rating scale

NPI-Q: The Neuropsychiatric Inventory-Questionnaire

REACH II: The Resources for Enhancing Alzheimer’s Caregiver Health project

TAP: The Tailored Activity Program

UDS: The Uniform Data Set


  1. About NACC data | National Alzheimer’s Coordinating Center. [cited 2021 Feb 9].

  2. Bangdiwala SI, Bhargava A, O’Connor DP, Robinson TN, Michie S, Murray DM, et al. Statistical methodologies to pool across multiple intervention studies. Behav Med Pract Policy Res. 2016;6(2):228–35.

  3. Belle SH, Burgio L, Burns R, Coon D, Czaja SJ, Gallagher-Thompson D, et al. Enhancing the quality of life of dementia caregivers from different ethnic or racial groups: a randomized, controlled trial. Ann Intern Med. 2006;145(10):727–38.

  4. Besser L, Kukull W, Knopman DS, Chui H, Galasko D, Weintraub S, et al. Version 3 of the National Alzheimer’s coordinating center’s uniform data set. Alzheimer Dis Assoc Disord. 2018;32(4):351–8.

  5. Bialosiewicz S, Murphy K, Berry T. Do our measures measure up? The critical role of measurement invariance; 2013. p. 37.

  6. Briceño EM, Gross AL, Giordani B, Manly JJ, Gottesman RF, Elkind MSV, et al. P4-368: pre-statistical harmonization of cognitive measures across six population-based cohorts: ARIC, CARDIA, CHS, FHS, MESA, and NOMAS. Alzheimers Dement. 2018;14(7S_Part_30):P1611–2.

  7. Committee on Reproducibility and Replicability in Science, Board on Behavioral, Cognitive, and Sensory Sciences, Committee on National Statistics, Division of Behavioral and Social Sciences and Education, Nuclear and Radiation Studies Board, Division on Earth and Life Studies, et al. Reproducibility and replicability in science. Washington, D.C.: National Academies Press; 2019.

  8. Crane PK, Trittschuh E, Mukherjee S, Saykin AJ, Sanders RE, Larson EB, et al. Incidence of cognitively defined late-onset Alzheimer’s dementia subgroups from a prospective cohort study. Alzheimers Dement. 2017;13(12):1307–16.

  9. Doiron D, Burton P, Marcon Y, Gaye A, Wolffenbuttel BHR, Perola M, et al. Data harmonization and federated analysis of population-based studies: the BioSHaRE project. Emerg Themes Epidemiol. 2013;10(1):12.

  10. Esteve A, Sobek M. Challenges and methods of international census harmonization. Hist Methods J Quant Interdiscip Hist. 2003;36(2):66–79.

  11. Fortier I, Raina P, Van den Heuvel ER, Griffith LE, Craig C, Saliba M, et al. Maelstrom research guidelines for rigorous retrospective data harmonization. Int J Epidemiol. 2016:dyw075.

  12. Gitlin LN, Hodgson N, Piersol CV, Hess E, Hauck WW. Correlates of quality of life for individuals with dementia living at home: the role of home environment, caregiver, and patient-related characteristics. Am J Geriatr Psychiatry. 2014;22(6):587–97.

  13. Gitlin LN, Reever K, Dennis MP, Mathieu E, Hauck WW. Enhancing quality of life of families who use adult day services: short- and long-term effects of the adult day services plus program. Gerontologist. 2006a;46(5):630–9.

  14. Gitlin LN, Winter L, Dennis MP, Corcoran M, Schinfeld S, Hauck WW. A randomized trial of a multicomponent home intervention to reduce functional difficulties in older adults. J Am Geriatr Soc. 2006b;54(5):809–16.

  15. Gitlin LN, Winter L, Dennis MP, Hodgson N, Hauck WW. A biobehavioral home-based intervention and the well-being of patients with dementia and their caregivers: the COPE randomized trial. JAMA. 2010;304(9):983–91.

  16. Gitlin LN, Winter L, Vause Earland T, Adel Herge E, Chernett NL, Piersol CV, et al. The tailored activity program to reduce behavioral symptoms in individuals with dementia: feasibility, acceptability, and replication potential. Gerontologist. 2009;49(3):428–39.

  17. Griffith, et al. Methods research report - harmonization of cognitive measures in individual participant data and aggregate data meta-analysis; 2013. p. 182.

  18. Griffith LE, van den Heuvel E, Fortier I, Sohel N, Hofer SM, Payette H, et al. Statistical approaches to harmonize data on cognitive measures in systematic reviews are rarely reported. J Clin Epidemiol. 2015;68(2):154–62.

  19. Griffith LE, van den Heuvel E, Raina P, Fortier I, Sohel N, Hofer SM, et al. Comparison of standardization methods for the harmonization of phenotype data: an application to cognitive measures. Am J Epidemiol. 2016;184(10):770–8.

  20. Gross AL, Jones RN, Fong TG, Tommet D, Inouye SK. Calibration and validation of an innovative approach for estimating general cognitive performance. Neuroepidemiology. 2014a;42(3):144–53.

  21. Gross AL, Jones RN, Inouye SK. Development of an expanded measure of physical functioning for older persons in epidemiologic research. Res Aging. 2015a;37(7):671–94.

  22. Gross AL, Kueider-Paisley AM, Sullivan C, Schretlen D, International Neuropsychological Normative Database Initiative. Comparison of approaches for equating different versions of the mini-mental state examination administered in 22 studies. Am J Epidemiol. 2019;188(12):2202–12.

  23. Gross AL, Mungas DM, Crane PK, Gibbons LE, MacKay-Brandt A, Manly JJ, et al. Effects of education and race on cognitive decline: an integrative study of generalizability versus study-specific results. Psychol Aging. 2015b;30(4):863–80.

  24. Gross AL, Power MC, Albert MS, Deal JA, Gottesman RF, Griswold M, et al. Application of latent variable methods to the study of cognitive decline when tests change over time. Epidemiology. 2015c;26(6):878–87.

  25. Gross AL, Sherva R, Mukherjee S, Newhouse S, Kauwe JSK, Munsie LM, et al. Calibrating longitudinal cognition in Alzheimer’s disease across diverse test batteries and datasets. Neuroepidemiology. 2014b;43(3–4):194–205.

  26. Hamilton CM, Strader LC, Pratt JG, Maiese D, Hendershot T, Kwok RK, et al. The PhenX toolkit: get the most from your measures. Am J Epidemiol. 2011;174(3):253–60.

  27. Hill CD, Edwards MC, Thissen D, Langer MM, Wirth RJ, Burwinkle TM, et al. Practical issues in the application of item response theory: a demonstration using items from the pediatric quality of life inventory (PedsQL) 4.0 generic core scales. Med Care. 2007;45(5):S39–47.

  28. Hooper D, Coughlan J, Mullen MR. Structural equation modelling: guidelines for determining model fit. Electron J Bus Res Methods. 2008;6(1):53–60.

  29. Hu L, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct Equ Model Multidiscip J. 1999;6(1):1–55.

  30. Kline RB. Principles and practice of structural equation modeling. 3rd ed. New York: Guilford Press; 2011. p. 427. (Methodology in the social sciences).

  31. Kobayashi LC, Gross AL, Gibbons LE, Tommet D, Sanders RE, Choi S-E, et al. You say tomato, I say radish: can brief cognitive assessments in the U.S. health retirement study be harmonized with its international partner studies? Neupert S, editor. J Gerontol Ser B. 2020;gbaa205.

  32. Langa KM, Plassman BL, Wallace RB, Herzog AR, Heeringa SG, Ofstedal MB, et al. The aging, demographics, and memory study: study design and methods. Neuroepidemiology. 2005;25(4):181–91.

  33. Long JS. The workflow of data analysis using Stata. College Station: Stata Press; 2009. p. 379.

  34. Lord FM. The relation of test score to the trait underlying the test. Educ Psychol Meas. 1953;13(4):517–49.

  35. Maydeu-Olivares A. Evaluating fit in IRT models; 2015. p. 111–27.

  36. Mukherjee S, Mez J, Trittschuh EH, Saykin AJ, Gibbons LE, Fardo DW, et al. Genetic data and cognitively defined late-onset Alzheimer’s disease subgroups. Mol Psychiatry. 2020;25(11):2942–51.

  37. Muthén LK, Muthén BO. Mplus user’s guide. 8th ed. Los Angeles: Muthén & Muthén; 2017.

  38. Peng RD. Reproducible research in computational science. Science. 2011;334(6060):1226–7.

  39. Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten simple rules for reproducible computational research. Bourne PE, editor. PLoS Comput Biol. 2013;9(10):e1003285.

  40. Susukida R, Amin-Esmaeili M, Mayo-Wilson E, Mojtabai R. Data management in substance use disorder treatment research: implications from data harmonization of National Institute on Drug Abuse-funded randomized controlled trials. Clin Trials. 2021;18(2):215–25.

  41. Takane Y, de Leeuw J. On the relationship between item response theory and factor analysis of discretized variables. Psychometrika. 1987;52(3):393–408.

  42. Zhu AH, Moyer DC, Nir TM, Thompson PM, Jahanshad N. Challenges and opportunities in dMRI data harmonization. In: Bonet-Carne E, Grussu F, Ning L, Sepehrband F, Tax CMW, editors. Computational diffusion MRI. Cham: Springer International Publishing; 2019. p. 157–72. (Mathematics and Visualization).



The NACC database is funded by NIA/NIH Grant U24 AG072122. NACC data are contributed by the NIA-funded ADRCs: P30 AG019610 (PI Eric Reiman, MD), P30 AG013846 (PI Neil Kowall, MD), P50 AG008702 (PI Scott Small, MD), P50 AG025688 (PI Allan Levey, MD, PhD), P50 AG047266 (PI Todd Golde, MD, PhD), P30 AG010133 (PI Andrew Saykin, PsyD), P50 AG005146 (PI Marilyn Albert, PhD), P50 AG005134 (PI Bradley Hyman, MD, PhD), P50 AG016574 (PI Ronald Petersen, MD, PhD), P50 AG005138 (PI Mary Sano, PhD), P30 AG008051 (PI Thomas Wisniewski, MD), P30 AG013854 (PI Robert Vassar, PhD), P30 AG008017 (PI Jeffrey Kaye, MD), P30 AG010161 (PI David Bennett, MD), P50 AG047366 (PI Victor Henderson, MD, MS), P30 AG010129 (PI Charles DeCarli, MD), P50 AG016573 (PI Frank LaFerla, PhD), P50 AG005131 (PI James Brewer, MD, PhD), P50 AG023501 (PI Bruce Miller, MD), P30 AG035982 (PI Russell Swerdlow, MD), P30 AG028383 (PI Linda Van Eldik, PhD), P30 AG053760 (PI Henry Paulson, MD, PhD), P30 AG010124 (PI John Trojanowski, MD, PhD), P50 AG005133 (PI Oscar Lopez, MD), P50 AG005142 (PI Helena Chui, MD), P30 AG012300 (PI Roger Rosenberg, MD), P30 AG049638 (PI Suzanne Craft, PhD), P50 AG005136 (PI Thomas Grabowski, MD), P50 AG033514 (PI Sanjay Asthana, MD, FRCP), P50 AG005681 (PI John Morris, MD), P50 AG047270 (PI Stephen Strittmatter, MD, PhD).

Dr. Qian-Li Xue1,2 contributed to reviewing and approving the final manuscript as the second reader. Michelle Chung2 participated in the effort to standardize and harmonize items and developed the data analytic file to merge datasets together.

1 Johns Hopkins School of Medicine, Baltimore, MD.

2 Johns Hopkins University Center on Aging and Health, Baltimore, MD.


This work is part of a larger research project, Microsimulations to Compare Effectiveness and Cost-Effectiveness of Nondrug Interventions to Manage Clinical Symptoms in Racially/Ethnically Diverse Persons with Dementia, funded by the National Institute on Aging. The research protocol was approved by the JHSPH IRB (No. IRB00013668) and the Brown University IRB (No. 00000556).

COPE was funded by the National Institutes of Health (R01 AG061945–01).

TAP was funded by the National Institutes of Health (R21 MH069425).

ACT was funded by the National Institutes of Health (R01 AG13687).

REACH II was funded by the National Institutes of Health (AG13305, AG13289, AG13313, AG20277, AG13265, and NR004261).

ALZQOL was funded by the Alzheimer’s Association (IIRG-07-28686) and the National Institutes of Health (R01 AG22254).

ADS Plus was funded by the National Institutes of Health (R01 AG049692–01).

Author information




Diefei Chen participated in designing the analytic plan, conducted the analysis, interpreted the results, and drafted the manuscript. Dr. Eric Jutkowitz obtained the funding, participated in the design of the research plan, and oversaw and led the qualitative review of instruments. Skylar Iosepovici participated in the qualitative review of instruments. John Lin participated in the qualitative review of instruments. Dr. Alden Gross obtained the funding, participated in the design of the research plan, oversaw and led the design of the analytic plan, conducted the analysis, interpreted the results, and approved the final manuscript. The author(s) read and approved the final manuscript.

Corresponding author

Correspondence to Diefei Chen.

Ethics declarations

Ethics approval and consent to participate

This is a secondary analysis of several existing datasets, for which the current research is consistent with the scope of the original consent processes. Written informed consent for the original studies included in the current research was obtained from all participants according to protocols approved by the Institutional Review Boards of JHSPH and Brown University. The methods were carried out in accordance with the relevant guidelines and regulations specified in the Declaration of Helsinki.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Chen, D., Jutkowitz, E., Iosepovici, S.L. et al. Pre-statistical harmonization of behavioral instruments across eight surveys and trials. BMC Med Res Methodol 21, 227 (2021).



Keywords

  • ADRD
  • Dementia
  • Behavioral symptom management
  • Epidemiology
  • Pre-statistical harmonization