 Research article
 Open Access
 Open Peer Review
 Published:
Systematic review of statistical approaches to quantify, or correct for, measurement error in a continuous exposure in nutritional epidemiology
BMC Medical Research Methodologyvolume 17, Article number: 146 (2017)
Abstract
Background
Several statistical approaches have been proposed to assess and correct for exposure measurement error. We aimed to provide a critical overview of the most common approaches used in nutritional epidemiology.
Methods
MEDLINE, EMBASE, BIOSIS and CINAHL were searched for reports published in English up to May 2016 in order to ascertain studies that described methods aimed to quantify and/or correct for measurement error for a continuous exposure in nutritional epidemiology using a calibration study.
Results
We identified 126 studies, 43 of which described statistical methods and 83 that applied any of these methods to a real dataset. The statistical approaches in the eligible studies were grouped into: a) approaches to quantify the relationship between different dietary assessment instruments and “true intake”, which were mostly based on correlation analysis and the method of triads; b) approaches to adjust point and interval estimates of dietdisease associations for measurement error, mostly based on regression calibration analysis and its extensions. Two approaches (multiple imputation and moment reconstruction) were identified that can deal with differential measurement error.
Conclusions
For regression calibration, the most common approach to correct for measurement error used in nutritional epidemiology, it is crucial to ensure that its assumptions and requirements are fully met. Analyses that investigate the impact of departures from the classical measurement error model on regression calibration estimates can be helpful to researchers in interpreting their findings. With regard to the possible use of alternative methods when regression calibration is not appropriate, the choice of method should depend on the measurement error model assumed, the availability of suitable calibration study data and the potential for bias due to violation of the classical measurement error model assumptions. On the basis of this review, we provide some practical advice for the use of methods to assess and adjust for measurement error in nutritional epidemiology.
Background
Many exposures investigated in epidemiological research, such as physical activity, air pollution and dietary intake, are challenging to measure and therefore prone to measurement error. It has been suggested that efforts should be devoted to improving the measurement of an individual’s environmental exposure, the “exposome”, akin to the efforts that have been devoted to the “genome”, [1] and that international collaborations should be formed in order to translate the “exposome” from a concept to an approach that can be implemented in order to address public health issues [2]. An exemplar is diet and public health, about which there has been renewed controversy recently [3, 4]. The controversy is due in part to the difficulty of measuring nutritional and dietary exposures, and disentangling their effects because both specific exposures and their errors are intercorrelated.
Measurement error in nutritional epidemiology
Two general types of measurement error are described in the nutritional epidemiological literature: random errors and systematic errors. Random errors are chance fluctuations or random variations in dietary intake that average out to the truth in the long run if many repeats are taken (i.e. the law of large numbers applies) [5]. Systematic errors are more serious as they do not average out to the true value even when a large number of repeats are taken. In epidemiological studies random or systematic errors can occur at two different levels: within a person or between persons. Thus at least four types of measurement error can exist and these are described in detail in Table 1.
The situation where the only measurement errors are withinperson random errors that are independent of true exposure with a mean of zero and constant variance is known as the “classical measurement error model” (Table 2) and, in the case of a single mismeasured exposure, its effect is always attenuation of the estimated effect size toward the null [6]. Thus, the magnitude of association is reduced but the statistical test used to estimate the dietary effect is still valid (i.e. the Type I error is unaffected), though its power is reduced. However, in the presence of covariates in the disease model which are also measured imprecisely, the effect of the main exposure could be biased in any direction due to residual confounding and, as a result, the statistical test becomes invalid [7, 8].
Dietary assessment methods
Many largescale epidemiological studies rely on selfreported measures of dietary intake as this is the most costeffective way to collect the information, but these are subject to error. Methods to assess dietary intake include i) food frequency questionnaires, that ask how often certain foods are eaten over a designated period of time; ii) 24h recall are a memory based assessment method that asks about dietary intake over the past 24 h; iii) food diaries which prospectively record dietary data intake for a designated period; (iv) diet history and (v) checklist questions that assess one specific aspect of dietary intake.
The primary aim of many epidemiological studies is to obtain the “true dietary exposure” (which is usually defined as the habitual or longterm average dietary intake over a designated period of time), and establish whether this “true dietary exposure” is associated with a disease of interest. In order to facilitate this process the semiquantitative food frequency questionnaire (FFQ) became popular [9]. An FFQ consists of a structured food list and a frequency response section on which participants indicate his or her usual frequency of intake of each food over a certain period of time in the past, usually the past year. The FFQ has a low participant burden, and thus it is possible to conduct repeated measures over time, which is important to capture longterm variation in diets. The FFQ usually has lower withinperson variation than other dietary assessment methods described above because they are designed to assess longterm dietary intake, the exposure of etiological interest for most diseases [10].
Gold standards and alloyed gold standards
A gold standard dietary assessment instrument measures the “true dietary exposure level” plus classical measurement error [7]. Multiple week diet records which require participants to record everything they eat or drink over the course of several weeks, are considered to be the “gold standard” for ascertaining selfreport dietary information because unlike other methods described they do not rely on memory [10]. In some situations, biomarkers can be used as objective measures of dietary intake; these include recovery biomarkers which provide an estimate of absolute intake over a fixed period of time. Examples of recovery biomarkers are doubly labelled water for total energy intake and 24h urinary nitrogen for protein intake, which provide a measure of intake based on a known direct quantitative relationship between intake and output [8].
Because multiple week diet records are seldom practical, and there are few available biomarkers, “alloyed gold standard” diet assessment instruments have been used to assess longterm dietary intake, such as multiple 24h dietary recalls. Essentially, these are the best performing instruments under reasonable conditions, known to have some residual error but practical to use. Repeated 24h dietary recalls involve a participant reporting all foods consumed in the previous 24h or calendar day to a trained interviewer either in person or over the phone on multiple occasions over time. Although reliance on a participant’s memory leaves room for measurement error that is not necessarily classical, a welltrained interviewer can elicit highly detailed and potentially useful nutritional data. “Alloyed gold standard” biomarkers applied to the assessment of dietary intake includes i) predictive biomarkers and ii) concentration biomarkers. Predictive biomarkers are sensitive, stable, timedependent, and show a doseresponse relationship with dietary intakes. However, they may be affected by personal characteristics but their relationship with diet generally outweighs those factors. Examples of predictive biomarkers are urinary fructose, sucrose and dietary sugars. Concentration biomarkers are measured concentrations of specific compounds in blood (e.g. serum carotenoids, vitamin C and vitamin E) or other tissues (e.g. adipose tissue fatty acids). Unlike recovery biomarkers, concentration biomarkers do not have the same quantitative relationship with intake over a specific time period for all individuals in a given population due to betweensubject variation in digestion, absorption, body distribution, synthesis, metabolism and excretion [8]. Thus, an alloyed gold standard dietary assessment instrument measures “dietary exposure level” plus nonclassical measurement error.
Reference instruments and calibration studies
As there is no universal method of assessing diet/nutritional exposures, and because of technological innovation, investigation of dietdisease relationships often have required, and increasingly require, the development of new instruments that can be used in largescale epidemiological investigations. In principle, a measure of dietary exposure is valid if the dietary instrument measures what it purports to measure, which can be assessed if we can compare it to a reference instrument [11]. In the ideal scenario the new dietary instrument should be compared to a perfect reference instrument (that is a gold standard dietary instrument), that measures the “true dietary exposure”. In most situations, however, it is not possible to have an ideal or gold standard reference instrument and the new instrument for assessing dietary exposure is usually compared with an imperfect reference instrument (an alloyed gold standard), which is considered from previous research to be more accurate than the new dietary instrument. Thus, the relative validity (Table 3) of the new instrument compared with the reference instrument is investigated [11].
To quantify or make corrections for the effects of measurement error on estimated dietdisease associations, information is required additional to the main study data. Generally, random error can be addressed simply with replicate measurements of dietary intake using a specific dietary instrument for a sample of participants, while addressing systematic error requires an additional instrument, assumed to be free of systematic error, to be used as a reference. A smaller detailed study (we use the term “calibration study” to describe these) can be designed to obtain this additional information on random or systematic error in the dietary instrument of interest, in order for this information to be used to quantify or adjust for measurement error (see Table 4). The high participant burden and cost of keeping food records has limited their use in largescale epidemiological studies. However, their ability to accurately ascertain detailed dietary information makes them useful as reference instruments for other dietary assessment methods in a calibration study. Because of their reliance on memory, FFQs may suffer from greater measurement error relative to 24 h dietary recalls and food records. Food records and 24h recalls collected over several days can be used as reference instruments to reflect longerterm intakes (for certain nutrients, just a few days of diet records or 24h recalls in a calibration study might be enough, provided the days are spread out over the entire reference period of the FFQ) [10]. The choice of reference instrument is therefore important and must be based on the judgement of the investigator and the research question. Choosing a particular dietary instrument as a reference means that the researcher is implicitly assuming that this dietary instrument is an unbiased estimate of the “true underlying dietary exposure” that they wish to measure.
Data collected from different types of calibration studies can provide information on the measurement error structure of the dietary instrument that can be incorporated in the analysis of the dietdisease association, to mitigate the impact of measurement error.
Different statistical approaches have been proposed to quantify and correct for measurement error, both in general and specifically for continuous dietary exposures. However, only one overview is available that has summarized all of the main issues and implications for policy decisions based on dietary associations but this review did not give a detailed description of the methods used [10]. We therefore undertook a systematic review to identify and appraise methods that have been applied in nutritional epidemiology to assess “true dietary intake”(defined as usual or habitual levels of a continuous dietary exposure), to assess different types of measurement error and adjust dietdisease associations for them.
Methods
Studies were identified from the online databases Medline, Embase, BIOSIS and CINAHL, using the following search strategy up to the end of May 2016: [(“diet” or “diet records” or “diet surveys” or “food” or “food preferences” or “food habits” or “food analysis” or “nutrition assessment” or “nutrition surveys” or “nutritional physiological phenomena”) in subject headings or (“food frequency questionnaire” or “FFQ” or “diet questionnaire” or “diet” or “24 h recall” or “24 h recall” or “24 h recall” or “weighed record” or “unweighed record” or “diet diary” or “food diary”) in abstract or title] and (“measurement error” or “mismeasurement”) in abstract or title. We also searched the Dietary Assessment Calibration/Validation Register [12] and scrutinised the reference lists of relevant study reports and review articles, and by enquiring among collaborators and colleagues.
Study identification and data extraction
Reports were potentially eligible for inclusion in this systematic review if they satisfied any of the following criteria: (1) they reported the development of a new method to investigate measurement error (2) they considered the issue of measurement error using some form of calibration study; (3) one or more methods were motivated by, and applied to, a real dataset. Using a predesigned piloted data extraction form, data were extracted on: the statistical assumptions of the method [e.g. whether the method can deal with differential (error structure differs between groups of subjects in the study) or nondifferential (error structure is the same between groups of subjects in the study) measurement error]; the reported increase in precision of the method compared with other possible methods of correcting for measurement error; the main and calibration study designs reported to be most appropriate for implementation of the method; whether the method can be implemented in standard statistical software. Based on our search strategy a single reviewer reviewed the titles, abstracts and keywords of every record retrieved. Full articles were retrieved for further assessment of eligibility for inclusion in the review by pairs of reviewers independently using the predesigned extraction form and any disagreements were discussed with a third party. The full study protocol for this systematic review has been previously published [13]. This systematic review is reported according to the guidelines specified in the Preferred Reporting Items for Systematic Reviews and MetaAnalyses (PRISMA) statement [14].
Results
Of 1671 potentially eligible articles identified in the search of the four databases, 126 studies met our eligibility criteria and had data extracted (Fig. 1). Of these, 43 were methodological (reported the development of a novel method and its application in a secondary analysis of a previously published nutritional epidemiological study) [Additional file 1: Table S1 gives the individual studies]; 83 were applied reports (i.e. they reported the application of an existing method) [Additional file 2: Table S2 gives the individual studies].
We used a narrative approach for data synthesis and broadly classified the primary statistical approaches used in the 126 studies into two groups: a) approaches to quantify or assess the relationship between different dietary assessment instruments and “true intake”; b) approaches to adjust point and interval estimates of dietdisease associations for measurement error (Fig. 2).
Approaches to quantify the relationship between different dietary assessment instruments and “true intake”
The dietary assessment instrument used most often in largescale epidemiological studies is the Food Frequency Questionnaire (FFQ), which suffers from random measurement errors. The 50 studies included in this group aimed to assess the ‘relative validity’ of FFQs with alternative dietary assessment methods (e.g., 24h recalls or diet records) by reporting correlations. The two main approaches identified were based on correlation analysis (34 studies) and method of triads (10 studies) [see Fig. 2b]. Table 5 provides a summary of the methods and their assumptions with reference to the original manuscripts and each of the methods is now described in more detail below.
Correlation analysis
Correlation analyses were used by 34 studies included in this systematic review (Fig. 2b; Additional file 2: Table S2) to quantify the amount of measurement error. Correlation coefficients are commonly used to assess relative validity as they allow the presentation of the association between two continuous measurements as a single number. For example, in some situations a dietary questionnaire may be compared with a detailed diet history interview (or some other superior approach) administered at some later interval. Errors in the reference method should be uncorrelated with those of the main study dietary instrument being assessed; if these errors are uncorrelated then the correlation coefficients between the dietary intake as assessed by these two dietary instruments are usually underestimated [15]. In order to obtain a better estimate of the “true” correlation between the dietary intake as assessed by the two dietary instruments then a deattenuated correlation can be computed using the approach summarized in Table 6.
Two typical examples of analysis of correlations between two different instruments with subsequent calculation of deattenuated correlation coefficients from studies included in this review are now briefly described. HornRoss et al. [16] measured four 24h recalls in a 10 month study. The dietary recalls were spaced at threemonth intervals beginning in early 2000. Two selfadministered FFQs were completed. The first, covering usual dietary intake during 1999, was left with the participant following the first dietary recall and returned to the study office by mail. At the end of the study, a second FFQ, covering usual dietary intake during 2000, was mailed to the participant. The authors then calculated energyadjusted deattenuated Pearson correlations for the FFQ with 24h recalls but did not mention performing any transformations prior to computing the correlations. Generally, the deattenuated Pearson correlations ranged between 0.55 and 0.85 and the authors reported that these were consistent with other cohort studies [16]. Katsouyanni et al. [17] assessed the ‘relative validity’ of a 190item semiquantitative FFQ to be used in a large prospective study in Greece. Eighty participants completed two selfadministered semiquantitative FFQ spaced approximately 1 year apart, and within this 1year interval they visited the study centre monthly and completed an intervieweradministered 24h diet recall questionnaire. The authors reported that mean and standard deviations were calculated for all nutrient intakes from both FFQ and for the mean of the 24h recall interviews. In order to account for nonnormality all nutrient intakes were logtransformed prior to further analysis. The average correlation between the energyadjusted nutrients measured by repeated 24h recalls and the semiquantitative FFQ was 0.46 for men and 0.39 for women. Deattenuated Pearson correlations varied for specific nutrients, ranging from 0.25 for betacarotene and polyunsaturated fats to > 0.50 for saturated fats, cislinoleic acid, calcium and phosphorus in both sexes combined.
As dietary variables are usually skewed toward higher values, transformations (such as logarithmic) to increase normality should always be considered before computing correlation coefficients. This has the advantage of reducing the influence of extreme values and of creating a correlation coefficient that is more interpretable. Alternatively, nonparametric correlation coefficients (e.g., Spearman) can be employed when one or both variables are not normally distributed [18]. Two studies in this review reported using Spearman’s correlation to assess the relative validity of dietary questionnaires in the Netherlands [19] and South Africa [20]. A similar study conducted in Germany performed transformations for nonnormally distributed nutrients prior to analysis [21].
Willet has noted that a disadvantage of the correlation coefficient is that it is a function of the true betweenperson variation in the population being studied as well as of the accuracy of the dietary assessment method. Researchers who use correlations to assess relative validity of a FFQ with some other method should be aware that the correlation coefficient obtained is only generalizable to those populations with similar betweenperson variation as the test population, and the populations should be similar in terms of food items included, portion size and interpretation of the questions.
Method of triads
The method of triads (see Table 5), which has been proposed to assess the relative validity of three dietary assessment instruments and derive their validity coefficients, was used in ten studies included in this review [22,23,24,25,26,27,28,29,30,31]. The method of triads is often used with a FFQ (Q), a 24h recall (I_{1}) and one or more biological marker measurements (I_{2}) [32]. The assumptions are that the measurements from the three instruments are linearly related to the true intake levels and that their random errors are statistically independent [11]. The assumption of independent random errors implies that correlations between any pair of measurements are entirely due to the fact that all measurements are related to the unknown ‘true intake’. Under these assumptions, the observed correlation between the three measurements Q, I_{1}, and I_{2} can be written as products of the correlations of each of these three measurements with the “true” level of dietary intake of interest [33]. This method allows the comparison of food or nutrient consumption estimated by the three methods with the true (but unknown) intake by calculating a validity coefficient (the correlation between the dietary intake measured by the three methods and the true [unknown] dietary intake).
One of the problems with the method of triads is that implausible validity coefficients can occur. Validity coefficients typically lie between 0 and 1, but negative correlations and high random variation in the sample can lead to inestimable validity coefficients or validity coefficients greater than unity, a condition referred to as a Heywood case [25, 33]. The main reasons for a Heywood case are either violation of one or more assumptions of the method of triads, or random sampling variation [32]. Violation of the assumption of independence of random errors is more common, particularly when 24h recalls or food records are used as the reference method since their errors are correlated with those of the FFQ.
Kaaks has suggested that using larger sample sizes and more accurate reference methods and biomarkers should reduce the chances of observing negative correlations [32]. More recent evidence from Geelen et al. [34] demonstrated that the method of triads can give misleading results when there are correlated errors between FFQ and 24h recalls, because the method of triads is not able to correct for systematic errors (such as intakerelated bias). Due to the correlation of errors between FFQs and 24h recalls, some researchers have used the ‘validity coefficient’ for the FFQ as the upper limit, and the correlation coefficient of the FFQ with a biological marker as the lower limit, of the validity coefficient between FFQ and true intake [22, 23]. The ‘validity coefficients’ of a nutrient are not always comparable between studies as they may be estimated for different study populations or subgroups using population or subgroup specific reference methods [18].
Extensions to approaches based on method of triads
Three studies included in this systematic review were extensions to approaches based on the method of triads. Fraser and Shavlik [27] investigated how well data from a FFQ, a reference method, and a biological marker correlate with “true” dietary intake [referred to a MOTEX1 in Table 5]. They developed an error model that does not assume the classical measurement error model for either the reference method or the biomarker, and does not assume that the correlation between errors in the FFQ and reference method is zero. When they applied their proposed model to a calibration study, they found that correlations between reference method (24h recalls) and true intake generally exceeded correlations between FFQ and true intake, which they suggested as supporting evidence that the reference method had better ‘relative validity’ than the FFQ. They also found that estimated correlations between errors in the 24h recalls and FFQ were often much larger than zero, which violate the assumptions of the classical measurement error model. Generally for this approach to be applied a biomarker is needed in addition to dietary data (from two different dietary assessment methods) are required to allow calculation of correlations between estimated and ‘true’ dietary intakes [27]. In practice, however, sensitivity analyses are required in order to estimate the correlation due to the violation of these classical measurement error assumptions.
Two other methods that extend the originally proposed method of triads were proposed by Fraser et al. [35, 36] and incorporate two concentration biomarkers as well as a multiple 24h recall or a FFQ [denoted by MOTEX2 and MOTEX3 in Table 5]. The authors analysed information from a calibration study that had used a FFQ, and collected biological data (biomarkers measured in blood or subcutaneous fat). They proposed using two biomarkers, of which the first was considered to provide an estimate of the unknown true intake of a specific nutrient, while the second was a biologic correlate only and did not directly measure the nutrient of interest [35, 36]. Thus, one of their examples used erythrocyte folate (considered to be a concentration measure) and vitamin E measured in the blood (a biologic correlate) in order to assess the relative validity of questionnaire measurements of folate intake [35]. The authors acknowledge that there are several limitations to their approach. First, there are few concentration biomarkers of nutritional intake, and even fewer recovery biomarkers. The available concentration biomarkers show a variety of halflives and may require estimation in a variety of body fluids. Second, it is difficult to select an additional dietary biomarker that is a biological correlate of the intake of a given food or nutrient [35]. Third, it must be assumed that correlations between the errors of different dietary biomarkers are close to zero and the effect estimation must be in standard deviation units of the “true” variable. Fourth, the sample size of calibration studies needs to be large enough to ensure adequate precision of the validity coefficient using this approach  the authors recommend between 2000 and 3000 individuals representative of the main study population [35].
Approaches to adjust estimates in dietdisease associations for measurement error
Two main approaches, one correlationbased and the other regressionbased, have been used to adjust (“calibrate”) estimates in dietdisease associations assuming a classical measurement error model. The details of assumptions and implementation of these methods are summarized in Table 7 and are now described below.
Intraclass correlation
Two studies used the intraclass correlation as the main method of adjusting dietdisease associations for measurement error (Fig. 2a, Additional file 2: Table S2). In the situation where the measurement error is assumed to be strictly due to random withinperson error, the intraclass correlation coefficient (ICC) can be computed based on replicate measures of the dietary exposure instrument, [37] and then the dietdisease association can be adjusted for it. The ICC represents the attenuation factor under the assumption of a classical measurement error model. When using repeat measurement data to compute ICCs to correct for measurement error, it is important to ensure that the calibration study is large enough to estimate ICC with reasonable precision [38, 39]. An example of the application of this method is the study of HornRoss et al. [40] that aimed to assess the association of alcohol intake with breast cancer. Participants were asked about alcohol intake in the year prior to the study start date, and at the end of the study oneyear later asked about intake over the past year using an FFQ. The report found that ICCs for alcohol for pre and post FFQ ranged from 0.63 to 0.87. They reported that the corrected relative risk (per 20 g/1000 kcal/d) was 1.36 (1.03 to 1.51) compared to the uncorrected estimate of 1.25 (1.10 to 1.42) [16] for the association of alcohol intake with breast cancer. Another example is a study that aimed to assess the association of calcium intake with bone mineral content and bone mineral density. The study used repeat administrations of a youth specific FFQ in a multiethnic sample of children and adolescents and the ICC for calcium intake on a logscale was found to be 0.61 [41].
Regression calibration
In this systematic review 71 studies used a linear regression calibration approach (referred to as standard regression calibration) in order to correct for measurement error or some variant of classical regression calibration (Fig. 2a; Additional file 1: Table S1; Additional file 2: Table S2; Table 7). Rosner’s linear regression calibration involves a regression of the reference measurement against main study measurement [37, 42]. The fitted values from the regression of the reference method against the main study dietary method should be a good estimate of the expected “true dietary” intake. Using these fitted values in place of the measured dietary intake in the main study should produce the “regression dilution” corrected estimate. This linear approximation approach is equivalent to the estimation of the regression coefficient of disease outcome on true dietary intake. The regression dilution correction factor b_{RC} is obtained by dividing the ‘naïve’ regression coefficient obtained by regressing the diseaseoutcome on the measured exposure in the main study by the regression coefficient obtained from a regression of true exposure measurements (reference method) on crude exposure measurements (based on the main dietary instrument) in the calibration study [7]. This is the classical linear regression calibration approach but it has also been argued that it may be more appropriate to regress the crude measurement on the more precise (i.e. the reference measurement) [43].
Linear regression calibration is by far the most common approach to correct for measurement error in dietary studies (Fig. 2a). In addition to the assumption of the classical measurement error model, linear regression calibration also assumes a linear relationship between disease outcome and true exposure, between reference method and main study method, and between disease outcome and main study method. These assumptions are only valid when the distributions of true exposure and measurement errors are normal. The method is tolerant to a modest degree of nonnormality, but it would not be expected to perform well with grossly nonnormal distributions such as those that may arise, for example, for intakes of some micronutrients (such as dietary carotenoids and tocopherols) [11]. Often, to get a better approximation for linearity, normality, or nonconstant error variance, the dietary data are transformed, e.g. by taking logarithms. For example, the study of Prentice et al. [44] used urinary recovery biomarkers to correct FFQ assessments for measurement error, and examined absolute energy and protein consumption in relation to cardiovascular disease. Urinary recovery biomarkers of energy and protein were obtained from a subsample of 544 women, with concurrent FFQ information. The authors used simple linear regressions with logtransformed values for both the biomarker and the FFQ assessments in order to obtain an estimate of the correction factor b_{RC}. The authors applied similar approaches to a more extensive series of dietary associations with chronic disease endpoints in a subsequent report [44].
Linear regression calibration also assumes that errors are nondifferential with respect to outcome, which is a reasonable assumption in most prospective cohort studies but less secure in casecontrol studies that are not nested within a cohort. As mentioned earlier it also assumes uncorrelated errors between the dietary assessment method under investigation and the reference method.
For a binary outcome, where logistic regression analyses may be used the conditions for regression calibration are satisfied if the disease is relatively rare (< 10%), the odds ratio small (the odds ratio is a good estimate of the relative risk for rare diseases), and the measurement error small [2]. Rosner and colleagues have extended the regression calibration method to multiple logistic regression that accounts for both random and systematic withinperson measurement error. Furthermore these methods provides tests and confidence limits for the association of interest that incorporate both the uncertainty in the estimate of the odds ratio (relative risk) from the main study and the uncertainty in the estimation of the correction factor [37]. The latter component is important because when calibration studies are small the degree of correction applied to the observed relative risk itself has error. As this method is based on a multiple logistic regression model, measurement error can be accounted for while simultaneously adjusting for confounding by other variables [18].
As an extension of the standard linear regression calibration method, again in the framework of the classical measurement error model, to deal with multiple covariates measured with error, Rosner and colleagues developed multivariable regression calibration (MVRC, Table 7), assuming either strictly random withinperson error [45] or combinations of random and systematic error [42]. The latter approach is a multivariable extension of the linear approximation approach and requires a calibration study with all of the important covariates measured simultaneously. The method aims to take into account not only the error in the measured variables but also the correlation between the errors. This approach has also been applied to multiple linear regression and Cox regression models [46].
Rosner’s linear regression calibration approach is an indirect method of correction for measurement error. In an alternative direct approach to regression calibration first described by Carroll [47], the reported dietary intakes used as explanatory variables in the risk model are directly replaced by the expected values of the true usual intake predicted from the reported intakes and other important factors (such as confounders) that are included in the risk model. This expected value of true usual intake is usually obtained from a calibration study and produces an approximately unbiased estimate of the true relative risk for a dietary intake under the assumption of a classical measurement error model. In Carroll’s approach there does not have to be a linear relationship between the expected value of the true usual intake and the outcome of interest [47]. This is an important extension to Rosner’s method as sometimes linear regression calibration may not be able to obtain the optimal prediction of usual intake and a nonlinear model may be more appropriate [6].
Extensions to linear regression calibration to address departures from the main assumptions
Using “correct” reference instruments (“gold standards”) in regression calibration provides an unbiased estimate of the correction factor b_{RC}, but using imperfect reference instruments will produce biased estimates of b_{RC}. Several reports in this systematic review have conducted sensitivity analyses to assess and quantify the impact of the use of imperfect reference instruments and/or departures from the classical measurement error model on regression calibration. It should be recognised that some of the assumptions made in these approaches, as described below, are difficult or impossible to test directly, so investigators have suggested using sensitivity analyses or simulationbased analyses to approach these issues. Table 8 provides a summary of the methods that aim to address departures from the main assumptions of regression calibration and the key points are now described.
Departures from assumptions of classical measurement error in the reference instrument
There is abundant evidence that food record measurements are biased estimates of ‘true intake’, and that the error in food record measurements depends on both the level of ‘true’ exposure and personspecific errors. [48, 49] Kipnis et al. [50] define these sources of error as “groupspecific bias”, common to all persons with the same “true dietary intake”, and personspecific bias, the groupspecific bias together with the effects of personality, social and cultural influences on the reporting of dietary intake. Both are specific forms of systematic error and result in violation of the assumptions of the classical measurement error model. They also point out that these biases are part of withinperson systematic error and will be reproduced in repeated measurements on the same individual.
Estimates of groupspecific biases are affected by the choice of reference instrument. If the reference instrument is flawed (e.g. an alloyed gold standard with measurement errors related to true intake and not independent from the errors in the instrument under investigation), then this can bias the estimate of the correction factor. Kipnis et al. [50] investigated the impact of using a flawed reference instrument in standard regression calibration (FRIRC). Their results suggested that group and personspecific biases existed in both the FFQ (instrument under investigation) and the weighed food record (flawed or imperfect reference instrument), and that the personspecific biases of the two measurements were correlated. They also suggested that the use of an imperfect reference instrument could lead to underestimation of the true effect by about 50%, with an approximately 2fold inflation of the sample size required to detect the ‘true’ effect.
Use of biomarkers in addition to dietary instruments in regression calibration
Both recovery and concentration biomarkers have been used as reference instruments in four studies included in this report [51,52,53,54], and the implications of their inclusion have been explored via sensitivity analyses. An approach proposed by Speigelman et al. [51], described here as biomarker and alloyed gold standard regression calibration (BAGSRC), uses an imperfect reference instrument, a cruder dietary instrument and a biomarker. An additional assumption is that the errors in the biomarker are uncorrelated with the errors in both imperfect reference instrument and cruder instrument. These authors presented an example that assesses the measurement of dietary intakes of vitamins A and E. The semiquantitative FFQ was the cruder instrument, the imperfect reference instrument was based on two diet records, and a biomarker for each of vitamins A and E intake were plasma concentrations of total carotene and αtocopherol respectively. The findings of their study suggest that this approach is robust when the errors in the biomarker are uncorrelated with the selfreported measures, and can be used as an extension to standard regression calibration when a suitable biomarker is available [51].
Kipnis et al. [52] considered a measurement error model which takes into account both correlated personspecific biases and withperson measurement errors. This model incorporates the models of Freedman et al. [55] and Speigelman et al. [51] as special cases. However, when using a calibration study with an imperfect reference instrument the correction factor cannot be estimated as there are too many parameters in the statistical model, and therefore Kipnis et al. [52] had to assign fixed values to some parameters in order to make their models statistically identifiable. In the hypothetical scenarios considered, these authors found that the estimated relative risk using standard regression calibration could be dramatically underestimated when personspecific biases are highly correlated.
A more general version of BAGSRC was also described by Spiegelman et al. [53] and Preis et al. [54], here referred to as Auxiliary Information Regression Calibration (AIRC) (Table 7). The paper by Speigelman et al. [53] derived an explicit expression for the bias in the correction factor, and describes the magnitude and direction of bias in the relative risk estimate obtained through standard regression calibration as a function of the correlations of subjectspecific biases and withinsubject errors in the two measures being compared in the calibration study. To make the application of the Kipnis et al. [52] model feasible, they proposed two minimally sufficient external calibration study designs, each of which fully identifies the correction factor and other parameters of interest. They proposed a “replicated biomarker design” that includes information on dietary instrument of interest (Z), alloyed gold standard measure of exposure (G) and biomarker (W). They also proposed an augmented design that uses both a biomarker and an instrumental variable (a variable that is correlated with the true exposure X, and uncorrelated with the random and systematic error components in G and Z). They considered the impact of missing data on these new study designs. This was because more than two methods of exposure measurement are needed, some of which must be replicated, and there may be variable numbers of replicates within participants, and different patterns of missingness of the multiple modes of measurement among the study participants. Pries et al. [54] sought to estimate the correction factor for total energy intake, protein and protein density in the calibration and main studies. They used repeatbiomarker measurement error models, which account for both correlated errors in the FFQ and 24h recalls and random withinperson variation in the biomarkers. They further assumed that the diet records or the 24h recalls were unbiased, while the biomarker was assumed to be biased in order to accommodate the fact that most biomarkers available to nutritional epidemiologists are of concentration rather than of recovery [54]. The authors concluded that under certain measurement error models failure to adequately account for withinperson variability in the assessment method that is assumed to be unbiased (i.e. the reference method) can falsely lead to an appearance of correlated errors and to underestimation of the FFQ’s ‘relative validity’ [54]. However, it was later argued that the authors had erroneously concluded that withinperson variation was underestimated as their model had assumed that the recovery biomarkers used (doubly labelled water and urinary nitrogen) were biased which in fact is more likely to be the case for concentration biomarkers, where this model had been previously applied [56]. Keogh, White and Rodwell [57] used a concentration biomarker to assess the levels of fruit and vegetable intake, and their findings were that the estimated effects of error in selfreported measurements was highly sensitive to model assumptions. They concluded that making the incorrect assumption of a classical measurement error could result in a large overcorrection for the effects of measurement error [57].
Incorporating episodically consumed foods in regression calibration
Four reports included in this systematic review attempted to assess the impact of episodically consumed foods on linear regression calibration estimates [58,59,60,61]. Willet has suggested that systematic betweenperson errors in dietary intake are likely to be frequent and can have many sources [18]. One source is episodically consumed foods, defined as foods not consumed every day by most people. Dietary assessment of episodically consumed foods gives rise to nonnegative data that have excess zeroes (due to infrequent consumption) and measurement error [58]. Often, withinperson random error in the 24h recalls intake of episodically consumed foods is dependent on the individual mean and has a skewed distribution, violating the classical measurement error model assumptions. The most common technique has been to transform the data, so that the dietary intake measures more closely follow the classical measurement model with normally distributed errors. Kipnis et al. [59] incorporated episodically consumed dietary components (see ECFRC, Table 8) and assessed the impact using a nonlinear regression calibration approach. In their model, the probability of consumption for an individual may be arbitrarily small but always positive, thus allowing for any finite number of days with zero intakes, but not incorporating neverconsumers (if they existed). Their model was in two parts: in the first part, consumption probability is modelled using mixed effects logistic regression; in the second part, the measurement error for the nonzero reported intake is modelled using a nonlinear mixed effects approach [59]. In addition, Kipnis et al. [59] proposed an extension to the method described by Tooze et al. [58] to predict individual usual intake of such foods and to evaluate the relationships of usual intakes with disease. One feature of their proposed method is that additional covariates that are potentially related to usual intake may be incorporated in order to improve the estimates of usual intake and of dietdisease associations. Due to measurement error, the naïve approach (using simple individual means of several 24h recalls) grossly underestimates the true value, as expected theoretically. In their simulations, however, although the bias of their proposed method was smaller, the precision was poorer (i.e. wider confidence intervals which is a feature of correcting for attenuation) than using the naïve approach [59]. Agogo et al. [60] used a simulation study to assess a twopart model with part 1 assuming a logistic distribution and part II assuming a gamma distribution and required a single replicate of the dietary assessment instrument (a 24h recall). This model was closely related to those already described by Tooze [62] and Kipnis [59] and the authors report that transforming the dietary assessment data to an appropriate scale was found to have the largest influence on model performance [60].
Keogh and White [61] describe a three part measurement error model that extends the twopart models described by Kipnis [59] and Tooze [58] above, with the thirdpart accounting for never consumers. They call their model the “never and episodic consumers (NEC, Table 8) model” that allows for both real zeros (never consumers), and excess zeros (episodic consumers of some foods), and requires repeat measurements. In the most realistic situation, where repeat measurements are available on only a subset of participants, the authors found via simulation that the NEC model appeared to be superior to classical regression calibration in terms of reduced bias [61].
Other approaches
In this systematic review there were several other approaches that were identified in the literature to deal with measurement error and these are summarized in Table 9 and an overview of these methods is now briefly given.
One report [63] used Simulation Extrapolation (SIMEX), which is similar to regression calibration but uses a simulationbased approach rather than a classical regression approach. The algorithm performs multiple iterations whilst adding a small known amount of error at each successive iteration, and reestimates the parameters each time. A trend in the effect of the measurement error is then estimated, and extrapolation made to the case where there is no measurement error. However, SIMEX can only be applied if the measurement error is classical and additive or multiplicative (Table 9). Two reports demonstrate that the four main types of measurement error (see Table 1) can also be incorporated in Structural Equation Modelling (SEM) [63, 64]. SEM considers the measurement error process as a latent variable problem, with the unknown “true exposure” estimated on the basis of a calibration study. The SEM approach is a method for estimating the measurement error structure using additional instruments, with at least one dietary instrument assumed to have no systematic error. However, the approaches of Speigelman et al. [53] and Preis et al. [54] require only that the model is correctly specified, whereas the SEM approach is more restrictive as it requires full multivariate normality of all random variables in the model.
All of the approaches considered in the review so far have assumed that the measurement error in nondifferential (i.e. the error is the same for those with and without the outcome). Two other approaches, moment reconstruction and imputation, may have an advantage over standard regression calibration when there is differential measurement error (error whose magnitude or direction is different for individuals who have the outcome), such as in a casecontrol study where cases and controls may recall their dietary intake differently [65]. First, the moment reconstruction (MR) method was proposed by Freedman et al. [66] as a method for correcting error in univariate continuous exposures, and requires an internal calibration study where the ‘true’ exposure is measured for some individuals where their disease (or outcome status) is known. The estimates of the true exposure derived from the MR method can be used directly in the dietdisease model, and it has been found that this results in unbiased estimates of linear exposuredisease associations [65]. Second, as measurement error can also be conceptualised as a missing data problem in that the “true exposures” are missing, imputation methods have sometimes been used. Multiple imputation (MI) was first introduced by Rubin and has become a widely used approach for dealing with missing data for a variety of epidemiological study designs [67]. Freedman et al. [65] suggested using MI to correct for measurement error in continuous exposures, by treating the true continuous exposure data as missing. The key idea is that values of the true exposure are imputed by drawing a random value from the distribution of the true exposure conditional on all observed data, including the outcome (or disease status). Multiple imputation methods involve creating a number of imputed datasets that would then be available for standard statistical analyses that correct for measurement error, before combining the estimates from each analysis into a final result, using an approach first developed by Rubin [67]. If a calibration study within which the true exposure was available, (via a “gold standard reference instrument”), estimation of the true exposure distribution would follow a similar process to a standard missing data analysis procedure. If the calibration study was based on repeat measurements, then the procedure requires the estimation of the true exposure conditionally on repeat measurements of the errorprone exposure, and the covariates measured without error. The authors demonstrate that if MR and MI are used when the measurement error is in fact nondifferential, then these methods will in general result in a loss in efficiency relative to standard regression calibration [65].
Impact of departures from classical measurement error model on statistical power
All of the approaches identified in the systematic review were generally developed and employed to assess the impact when the assumptions of standard regression calibration are not satisfied. In addition to the bias that may ensue in the estimation of dietdisease association, Fraser and Stram [68] demonstrate via simulation that correlated errors can lead to an added loss of statistical power (reduced to as low as 11% in their simulations) if small (< 100 participants) calibration studies are used to adjust point and interval estimates using standard regression calibration and there are few events (for example ≤250) in the main cohort. Fraser and Stram [68] conclude that if there are modest correlations (≤ 0.5) between the reference instrument and the cruder instrument, useful gains in power accrue (up to 5fold) with calibration study size up to 1000 participants with a reasonable number of events (for example ≥2500) in the main study. More recently, Fraser and Stram [69] considered the impact of nonGaussian distributed dietary intake data on standard regression calibration, and they reported that poor fit in the calibration model: a) does not produce biased calibrated estimates when the “disease” model is linear; b) it produces little bias in a nonlinear “disease” model if the model is approximately linear; and c) will adversely affect statistical power, but this could be alleviated by attempting some of the more complex sensitivity analysis models for dealing with departures from the classical measurement error model outlined in this report [69].
Discussion
Most of the literature addressing the issue of measurement error in nutritional epidemiology is based on the assumption of random withinperson error. This is due to the fact that a) systematic error is much more difficult to measure and b) most of the statistical theory and methodological developments have been based around the assumption of random error. In this systematic review, we found that the most commonly reported statistical methods in nutritional epidemiological studies to quantify measurement error were correlationbased methods. With regard to assessing and correcting for measurement error, by far the most commonly used method was linear regression calibration. A key issue is that the assumptions of the method are satisfied.
Based on the evidence identified in this systematic review, we now propose useful points to consider when the intention is to use a calibration study to either assess a new dietary instrument, quantify measurement error or implement measurement error corrections of dietdisease associations.
General design of a calibration study
A key design issue, which has been addressed by most papers reviewed, is that the calibration study has to be representative of the main study, and this is most likely to be achieved using an internal calibration study based on a random sample of the main study. Depending on the type and purpose of the calibration study, Caroll et al. has suggested that one may use large sample sizes and few food records per individual or smaller samples and more records per subject [70].
Calibration studies to assess and correct for measurement error
In the vast majority of the studies reviewed, a dietary questionnaire measurement was compared to a more detailed and reliable reference instrument, such as multiple 24h recalls or food diaries [71]. The utility of comparing a crude dietary instrument with a reference instrument depends on the important statistical assumption that the reference instrument follows the classical measurement error structure, i.e. withinperson errors for the reference instrument are strictly random. When errors in the crude and reference instrument are dependent, standard regression calibration will undercorrect the dietdisease association. Thompson et al. [72] report that, due to correlation of errors in FFQs and selfreported reference instruments such as the 24h recalls, the correlations and correction factors observed in most calibration studies tend to overestimate FFQ performance. Other researchers have quantified the direction and magnitude of the bias, and have concluded that bias appears to lead to small overestimation in realistic examples [48, 54].
It has been recommended that, when possible, biomarkers should be incorporated into the calibration study as objective measures of intake [73], under the assumption that the measurement errors in biomarkers are independent of the measurement errors from selfreported dietary measurements, [74] although this has never been verified in practice. Keogh et al. [57] used repeated biomarkers to quantify the bias due to measurement error for a selfreported dietary intake from FFQs and food records. They caution that the magnitude of the correction factor is highly sensitive to the model assumptions and that sensitivity analyses that assess the impact of varying these assumptions are essential [57]. In theory additional types of biomarkers, for example metabolomic profiling, may also be incorporated into the calibration study [73]. Unfortunately, only a limited number of recovery biomarkers have so far been identified as able to reflect an individual’s intake with a high degree of accuracy.
Careful consideration of the assessment period and the use of updated dietary information are necessary, particularly if the interest is longterm usual dietary intake since dietdisease relationships may change over time. It is therefore important that the comparison method employed is able to reflect the longer time frame. If there is likely to be medium term variation due to seasonality, then it would be sensible to have multiple measurements throughout the year. Biomarkers can also change over time, so it is important to obtain repeat assessment of these too if possible. It has been demonstrated that if a FFQ is being used with a diet record as the reference instrument and a suitable biomarker exists, and the error in the biomarker is independent of the error in the FFQ and the diet record, then the methods of triads can produce an estimate of the correction factor that is unbiased provided that there are replicate biomarker data on at least a subsample of calibration study subjects [31].
Calibration studies to assess the ‘relative validity’ of a new dietary instrument
Calibration studies can also be used in the development of a new measurement instrument to test whether it provides improvement over currently used methods, and several studies in this report were of this type. In this context, the ‘relative validity’ between the current and the new measurement instrument would be of primary interest [75]. However, the use of correlations may not be the optimal approach in order to assess ‘relative validity. An improvement is to assess the level of agreement (between say a FFQ and 24h dietary recalls), as proposed by Bland and Altman [76]. In this approach, the difference between mean intakes estimated by the FFQ and 24h dietary recalls is plotted against the average of mean intakes by the two methods for each participant and nutritional measure. The BlandAltman approach of plotting difference against average only works if the average is an unbiased estimate of the truth. In our example it may be better to plot the difference of FFQ and 24h recalls against the average of several 24h recalls (the reference instrument and our best estimate of the truth). If the data follow a normal (Gaussian) distribution it would be expected that the mean difference (bias) lies between ±2 standard deviations, and a lack of heteroscedasticity (or nonconstant variability) indicates that the magnitude of error does not vary with the range of measurement [76]. These analyses would need to be performed for all nutritional measures (such as macro and micronutrients). Willett has suggested that the BlandAltman method may be too cumbersome as it requires considerable knowledge about the typical absolute values and betweenperson variation, which differ from nutrient to nutrient [18]. He suggests an alternative, but related method, that involves the calculation of the standard deviation of the “residual” from the regression of the true measure (estimated using the reference method) against the dietary assessment instrument. This “residual” would then represent the variation in true intake that is not accounted for by the dietary assessment instrument [18].
For an assessment of ‘relative validity’, there is always the decision as to whether or not the new dietary instrument is considered to be a satisfactory. Nelson and Margetts note that there are no hard or fast rules as to what constitutes ‘satisfactory’ correlation, as it is usually dependent on the sample size of the study [11]. As Willett has suggested, and we concur, because no single method for relating dietary intake to a measure of true intake conveys all the available information, it is probably best to present the data in several ways. At a minimum, the means and standard deviations of the reference method and dietary instrument of interest plus their correlations should be provided [18]. These may be supplemented with other data such as BlandAltman plots or the residual regression approach.
Number of replicate measures and sample size of the calibration study under the assumption of random withinperson measurement error
Many of the methods considered in this report assume that the errors in dietary intake are due to random withinperson variation, If the aim of the calibration study is to correct for measurement error rather than to assess relative validity then to minimize the correlation between errors on the different occasions, the repeated measurements should be well separated in time. In addition, the measurement of dietary intake via more than one instrument (as different instruments have different strengths and weakness) [77] in a subsample, is essential [7]. It has been recommended that fewer repeats on more individuals provide greater statistical efficiency [78, 79]. Some have argued that in most settings, costefficient designs of FFQ calibration studies do not require more than four or five 1day diet records per participant in order to estimate the true longterm dietary intake, particularly when the costs of collecting any subsequent measure is the same as for the initial dietary measure [79]. Rosner and Willett [80] also considered the cost of collecting replicate information, and found that generally there was no justification for more than two to five measurements per participant, assuming that the total cost of the study was proportional to the total number of observations. Taking the average of multiple reference instrument measurements per participant (e.g., multiple food records or multiple 24h recalls) can increase the correlation (i.e. the relative validity’) between the reference instrument and the questionnaire [74]. However, multiple 24h recalls per participant can lead to high burden on the participant and poor quality information. Moreover, averages over a few days are not an adequate representation of a participant’s usual intake [81]. Thus computing the deattenuated correlation will also increase the correlation coefficient under the assumption of a classical measurement error model. Wong et al. [82] provide examples of sample size calculations for the design of such calibration studies based on the withinperson correlation of the reference instrument. Park and Stram [83] noted that the percentage of individuals from the main study that are required to participate in the calibration substudy decreases in size as the correlation between true and observed exposure increases, that is as measurement error becomes smaller. However, others have suggested that optimal balance of repeats and individuals depends on the primary aim of the calibration substudy (either to compare a new with an old instrument for ‘relative validity’, or specifically to correct for measurement error in the main study), as well as the withinperson variation of the repeated measurements [75]. The effect of measurement error is also reduced when the variance of the true exposure is large relative to the variance of the error [84, 85], since this brings the correction factor closer to 1 and therefore reduces bias on average. The easiest way to do this is to ensure sampling of a wide range of exposures. Even though several studies in this systematic review wanted to use the calibration study information to correct for measurement error, none of these studies reported details of any formal sample size calculations for the size of the calibration study. Kaaks and colleagues have presented some more formal approaches based on the precision required, the cost of conducting the calibration study [78], and Carroll et al. have also made some sample size recommendations that take into account the instrument used (e.g. 24h recalls or multipleday food records) [70]. Willet has suggested that because the most important use of a calibration study is to correct observed dietdisease associations for measurement error, an approach for choosing an appropriate sample size would be to consider the implications of various sample sizes on the corrected effect size estimates and their corrected confidence intervals [18]. In particular calibration studies with too few participants (<30 participants) would lead to a major increase in the width of the corrected confidence intervals due to having to incorporate the imprecision in the estimated correction factor.
Generalizability of the calibration study information under the assumption of random withinperson measurement error
A reproducible instrument will give consistent answers with repeat administration, while a valid instrument measures what it was designed to measure. A valid instrument is reproducible, but a reproducible instrument is not necessarily valid. Information from calibration studies designed to compare a new instrument with a goldstandard or a reference measure (i.e. to assess ‘relative validity’) usually are not considered to be generalizable because an FFQ developed for one population cannot readily be used in another population, as different groups of people eat different foods. As such, FFQs tend to be developed and assessed for ‘relative validity’ within specific populations. Indeed, many of the calibration studies designed to correct for measurement error described in this systematic review were based on the selection of a subsample of individuals from the main cohort study. In epidemiological studies in which there is a need to correct for measurement error, but in which no calibration study data are available, investigators could incorporate information from an external study provided that there is transportability. Transportability is satisfied when the measurement error model in the external calibration study can reasonably be assumed to be the same as the one that generated the cruder dietary assessment measurement (e.g. FFQ) in the main study [86]. In this situation, the measurement error model parameters estimated can be used for measurement error correction [6, 37, 87]. However, correction factors derived from linear or nonlinear regression calibration should include information from all relevant covariates in the disease model, whilst bearing in mind that the calibration study should have sufficient participants to be able to accommodate all covariates and thus reduce the potential for model over fitting [88]. Buonaccorsi et al. investigated the impact of possible transportability problems (such as the assumed measurement error model is incorrect or the population involved in the calibration study differs from those participants in the main study) and found that this could result in over or underestimation of the correction factor in an unpredictable manner [89].
Rather than try to apply the correction factor to the data directly, Caroll et al. [47] have suggested that it is more sensible to extract from the second study the measurement error variance itself, and use this in conjunction with the distribution of mismeasured exposure in the study needing correction for the effects of measurement error [47]. Geelen et al. [34] have recommended that if the research goal is to compare results between studies or populations, calibration to a reference method that performs similarly in different populations, e.g. a 24h recall, is preferred. Even if the chosen superior dietary instrument does not fulfil all the criteria for a valid reference method, comparability between the associations obtained from different populations would be improved. They argue that this is more relevant for the generalizability of study results, and is a crucial argument for designing a calibration study to supplement dietary information gathered from large epidemiological studies, even with an imperfect reference instrument [34].
The most robust approach to correct point and interval estimates for measurement error
Linear regression calibration is the most popular and well understood method to correct associations for measurement error in the nutritional literature, and its application is usually (in the case of the assumption of random error) supported by the collection of repeated measurements of the dietary instrument of interest in a subsample of individuals from the main epidemiological study. If systematic error is assumed then dietary intake will also need to be measured on a subsample using a reference instrument (which may be imperfect). However, we caution that the key assumptions of this approach must be satisfied. This also means that all important covariates should be included in the linear regression calibration model (whilst guarding against over fitting), in order to avoid bias. Second, linear regression calibration requires that the measures obtained by chosen reference instruments (e.g. dietary records) be linearly related to those obtained by the dietary instrument of interest (e.g. FFQreported intakes), and that the residuals of these regressions have constant variance. Often it is necessary to apply a transformation (such as a natural logarithm) to the dietary intake variables in order to meet these conditions. Third, linear regression calibration that is based on an imperfect reference instrument (such as 24h recalls), should always be supplemented by sensitivity analyses (see Table 10) in order to assess and quantify the magnitude of bias that may result from the correlation between the imperfect reference instrument and the FFQ. Fourth, the assessment of episodically consumed foods and the use of concentration biomarkers (or other imperfect reference methods) that are correlated with dietary intakes also require detailed sensitivity analyses in order to assess the effects on dietdisease associations.
Biased correction factors and therefore biased corrections can occur when the errors in true and measured exposure are correlated. For example, if the same erroneous database is used to calculate nutrient intakes for the reference instrument and the dietary instrument of interest, the regression coefficient relating the two methods could be overstated. Willet has suggested that although the potential for correlations in errors between dietary instruments used for regression calibration should always be considered, and should be assessed whenever possible, even moderate correlations in these errors do not appear (on the evidence of methodological studies using simulation) to result in much bias if they are ignored when using standard regression calibration correction procedures [18]. The Strengthening Analytical Thinking for Observational Studies (STRATOS) initiative is developing guidance documents on a variety of topics (including addressing measurement error) to help the research community at large to conduct appropriate statistical analyses of observational studies [90].
Software availability
There are several implementations of regression calibration using SAS software, including userfriendly publicly available implementations (with documentation) of standard regression calibration [37, 42] at [https://www.hsph.harvard.edu/donnaspiegelman/software/blinplusmacro/ (accessed 8 June 2017)] and MVRC [42, 45] at [https://www.hsph.harvard.edu/donnaspiegelman/software/relibpls8/] (accessed 8 June 2017). There is also software that implements the method of Liao [91] for the Cox model that addresses the rare disease assumption, available at [https://www.hsph.harvard.edu/donnaspiegelman/software/rrcmacro/ (accessed 8 June 2017)]. The National Cancer Institute has produced several SAS macros that can be used to analyze data on regular and episodically consumed foods [https://epi.grants.cancer.gov/diet/usualintakes/macros_single.html (accessed 8 June 2017)]. These SAS macros are accompanied by a user guide and other documentation, as well as example datasets and output.
With regard to other software for implementing these approaches, the standard regression calibration approach, the multivariable regression calibration approach, and the simulation extrapolation approach (SIMEX) are all available as part of the merror routines in Stata [92] and there is also a merror package in R statistical software [93]. For some of the nonstandard approaches, the authors provide statistical code for the implementation for their methods in the original publications (e.g. structural equation modelling (64) and auxiliary information regression calibration [59]). Pérez et al. [94] provide details of an R function that can implement an extension to the twopart method of Kipnis et al. [59] for episodically consumed foods into a threepart method that also incorporates the estimation of amount of energy intake per day. In addition Stata [92], SAS [95], and R [96] have the capability to conduct user defined structural equation modelling analyses and imputation methods.
Conclusions
The findings of this systematic review provide insights into how to perform calibration studies to quantify and correct for measurement error. Calibration studies of sufficient size and representative of the main study can be very useful in assessing the ‘relative validity’ of dietary instruments as well as the measurement error structure of the dietary instrument being used in the main study. While previous reviews of measurement error correction methods have concentrated on the more technical statistical details of the methods, [97,98,99,100,101,102], our aim was to review the approaches used to quantify or correct for measurement error in the context of their application to a continuous dietary exposure measurement for a nutritional epidemiological audience. As a consequence, we may not have included some methods that have been described elsewhere if they were not applied to nutritional data. Regression calibration is the most widely used approach to correct for measurement error in nutritional epidemiology; however, it is crucial to ensure that the regression calibration assumptions are fully met. We have discussed other methods proposed to address situations in which the assumptions and requirements of regression calibration are not met. The choice of method should depend on: a) what measurement error model is assumed; b) whether the correct type of calibration study data are available for valid and appropriate correction; and c) if there is potential for bias due to violation of the classical measurement error model assumptions. We propose that sensitivity analyses be conducted in order to assess the impact on dietdisease associations of departures from the classical measurement error model assumptions, and that these be documented and reported in published manuscripts.
Abbreviations
 AIRC:

Auxiliary information regression calibration
 BAGSRC:

Biomarker and alloyed gold standard regression calibration
 ECFRC:

Episodically consumed foods regression calibration
 FFQ:

Food frequency questionnaire
 FRIRC:

Flawed reference instrument regression calibration
 ICC:

Intra class correlation
 MI:

Multiple imputation
 MOTEX1:

Method of triads extension 1
 MOTEX2:

Method of triads extension 2
 MOTEX3:

Method of triads extension 3
 MR:

Moment reconstruction
 MVRC:

Multivariable regression calibration
 NEC:

Never or episodic consumers
 PSBRC:

Person specific bias adjusted regression calibration
 SIMEX:

Simulation extrapolation
 STRATOS:

Strengthening analytical thinking for observational studies
References
 1.
Wild CP. Complementing the genome with an "exposome": the outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol Biomarkers Prev. 2005;14(8):1847–50.
 2.
Wild CP. The exposome: from concept to utility. Int J Epidemiol. 2012;41(1):24–32.
 3.
Teicholz N. The scientific report guiding the US dietary guidelines: is it scientific? Br Med J. 2015;351
 4.
Ravnskov U, DiNicolantonio JJ, Harcombe Z, Kummerow FA, Okuyama H, Worm N. The questionable benefits of exchanging saturated fat with polyunsaturated fat. Mayo Clin Proc. 89(4):451–3.
 5.
Rossato SL, Fuchs SC. Handling random errors and biases in methods used for shortterm dietary assessment. Rev Saude Publica. 2014;48(5):845–50.
 6.
Freedman LS, Schatzkin A, Midthune D, Kipnis V. Dealing with dietary measurement error in nutritional cohort studies. J Natl Cancer Inst. 2011;103(14):1086–92.
 7.
Clayton DG, Gill C. Covariate measurement errors in nutritional epidemiology: effects and remedies. 2nd ed. New York: Oxford University Press; 1997. p. 87–105.
 8.
Carroll RJ, Freedman LS, Kipnis V. Measurement error and dietary intake. Adv Exp Med Biol. 1998;445:139–45.
 9.
Boeing H. Nutritional epidemiology: new perspectives for understanding the dietdisease relationship[quest]. Eur J Clin Nutr. 2013;67(5):424–9.
 10.
Satija A, Yu E, Willett WC, Hu FB. Understanding nutritional epidemiology and its role in policy. Adv Nutr. 2015;6(1):5–18.
 11.
Nelson M. The validation of dietary assessment. In: Margetts BM, Nelson M, editors. Design concepts in nutritional epidemiology. New York: Oxford University Press; 1997. p. 241–68.
 12.
Health UNIo. Dietary Assessment Calibration/Validation Register. [17 May 2013]; Available from: http://appliedresearch.cancer.gov/cgibin/dacv/index.pl.
 13.
Bennett DA, Little J, Masson LF, Minelli C. Study protocol: the empirical investigation of methods to correct for measurement error in biobanks with dietary assessment. BMC Med Res Methodol. 2011;11:135.
 14.
Moher D, Liberati A, Tetzlaff J, Altman DG. Preferred reporting items for systematic reviews and metaanalyses: the PRISMA statement. BMJ. 2009;339
 15.
Padilla MA, Veprinsky A. Correlation attenuation due to measurement error: a new approach using the bootstrap procedure. Educ Psychol Meas. 2012;72(5):827–46.
 16.
HornRoss PL, Lee VS, Collins CN, Stewart SL, Canchola AJ, Lee MM, et al. Dietary assessment in the California teachers study: reproducibility and validity. Cancer Causes Control. 2008;19(6):595–603.
 17.
Katsouyanni K, Rimm EB, Gnardellis C, Trichopoulos D, Polychronopoulos E, Trichopoulou A. Reproducibility and relative validity of an extensive semiquantitative food frequency questionnaire using dietary records and biochemical markers among Greek schoolteachers. Int J Epidemiol. 1997;26(Suppl 1):S118–27.
 18.
Willett W. Nutritional epidemiology: Oxford University Press, USA, 1998; 1998.514 p.
 19.
KlipsteinGrobusch K, den Breeijen JH, Goldbohm RA, Geleijnse JM, Hofman A, Grobbee DE, et al. Dietary assessment in the elderly: validation of a semiquantitative food frequency questionnaire. Eur J Clin Nutr. 1998;52(8):588–96.
 20.
MacIntyre UE, Venter CS, Vorster HH. A culturesensitive quantitative food frequency questionnaire used in an African population: 1. Dev Reproducibility Public Health Nutr. 2001;4(1):53–62.
 21.
Kroke A, KlipsteinGrobusch K, Voss S, Moseneder J, Thielecke F, Noack R, et al. Validation of a selfadministered foodfrequency questionnaire administered in the European prospective investigation into cancer and nutrition (EPIC) study: comparison of energy, protein, and macronutrient intakes estimated with the doubly labeled water, urinary nitrogen, and repeated 24h dietary recall methods. Am J Clin Nutr. 1999;70(4):439–47.
 22.
Dixon LB, Subar AF, Wideroff L, Thompson FE, Kahle LL, Potischman N. Carotenoid and tocopherol estimates from the NCI diet history questionnaire are valid compared with multiple recalls and serum biomarkers. J Nutr. 2006;136(12):3054–61.
 23.
McNaughton SA, Marks GC, Gaffney P, Williams G, Green A. Validation of a foodfrequency questionnaire assessment of carotenoid and vitamin E intake using weighed food records and plasma biomarkers: the method of triads model. Eur J Clin Nutr. 2005;59(2):211–8.
 24.
Kabagambe EK, Baylin A, Allan DA, Siles X, Spiegelman D, Campos H. Application of the method of triads to evaluate the performance of food frequency questionnaires and biomarkers as indicators of longterm dietary intake. Am J Epidemiol. 2001;154(12):1126–35. Epub 2001/12/18
 25.
Ocke MC, Kaaks RJ. Biochemical markers as additional measurements in dietary validity studies: application of the method of triads with examples from the European prospective investigation into cancer and nutrition. Am J Clin Nutr. 1997;65(4 Suppl):1240S–5S.
 26.
McNaughton SA, Hughes MC, Marks GC. Validation of a FFQ to estimate the intake of PUFA using plasma phospholipid fatty acids and weighed foods records. Br J Nutr. 2007;97(3):561–8.
 27.
Fraser GE, Shavlik DJ. Correlations between estimated and true dietary intakes. Ann Epidemiol. 2004;14(4):287–95.
 28.
Fowke JH, Hebert JR, Fahey JW. Urinary excretion of dithiocarbamates and selfreported cruciferous vegetable intake: application of the 'method of triads' to a foodspecific biomarker. Public Health Nutr. 2002;5(6):791–9.
 29.
Rosner B, Michels KB, Chen YH, Day NE. Measurement error correction for nutritional exposures with correlated measurement error: use of the method of triads in a longitudinal setting. Stat Med. 2008;27(18):3466–89.
 30.
Daures JP, Gerber M, Scali J, Astre C, Bonifacj C, Kaaks R. Validation of a foodfrequency questionnaire using multipleday records and biochemical markers: application of the triads method. J Epidemiol Biostat. 2000;5(2):109–15.
 31.
Rosner B, Hendrickson S, Willett W. Optimal allocation of resources in a biomarker setting. Stat Med. 2015;34(2):297–306.
 32.
Kaaks RJ. Biochemical markers as additional measurements in studies of the accuracy of dietary questionnaire measurements: conceptual issues. Am J Clin Nutr. 1997;65(4 Suppl):1232S–9S.
 33.
Yokota RTC, Miyazaki ES, Ito MK. Applying the triads method in the validation of dietary intake using biomarkers. Cad Saúde Pública. 2010;26:2027–37.
 34.
Geelen A, Souverein OW, Busstra MC, de Vries JHM, van‘t Veer P. Comparison of approaches to correct intake–health associations for FFQ measurement error using a duplicate recovery biomarker and a duplicate 24 h dietary recall as reference method. Public Health Nutr. 2015;18(2):226–33.
 35.
Fraser GE, Yan R. A multivariate method for measurement error correction using pairs of concentration biomarkers. Ann Epidemiol. 2007;17(1):64–73.
 36.
Fraser GE, Butler TL, Shavlik D. Correlations between estimated and true dietary intakes: using two instrumental variables. Ann Epidemiol. 2005;15(7):509–18.
 37.
Rosner B, Willett WC, Spiegelman D. Correction of logistic regression relative risk estimates and confidence intervals for systematic withinperson measurement error. Stat Med. 1989;8(9):1051–69. discussion 713
 38.
Frobisher C, Tilling K, Emmett PM, Maynard M, Ness AR, Davey Smith G, et al. Reproducibility measures and their effect on dietcancer associations in the Boyd Orr cohort. J Epidemiol Community Health. 2007;61(5):434–40.
 39.
Phillips AN, Smith GD. The design of prospective epidemiological studies: more subjects or better measurements? J Clin Epidemiol. 1993;46(10):1203–11.
 40.
HornRoss PL, Barnes S, Lee VS, Collins CN, Reynolds P, Lee MM, et al. Reliability and validity of an assessment of usual phytoestrogen consumption (United States). Cancer Causes Control. 2006;17(1):85–93.
 41.
Ollberding NJ, Gilsanz V, Lappe JM, Oberfield SE, Shepherd JA, Winer KK, et al. Reproducibility and Intermethod reliability of a calcium food frequency questionnaire for use in Hispanic, nonHispanic black, and nonHispanic white youth. J Acad Nutr Diet. 2015;115(4):519–27.e2.
 42.
Rosner B, Spiegelman D, Willett WC. Correction of logistic regression relative risk estimates and confidence intervals for measurement error: the case of multiple covariates measured with error. Am J Epidemiol. 1990;132(4):734–45. Epub 1990/10/01
 43.
Neter J, Wasserman W, Kutner M. Applied Linear Regression Models. Boston: McGrawHill; 1989.
 44.
Prentice RL, Huang Y, Kuller LH, Tinker LF, Horn LV, Stefanick ML, et al. Biomarkercalibrated energy and protein consumption and cardiovascular disease risk among postmenopausal women. Epidemiology. 2011;22(2):170–9.
 45.
Rosner B, Spiegelman D, Willett WC. Correction of logistic regression relative risk estimates and confidence intervals for random withinperson measurement error. Am J Epidemiol. 1992;136(11):1400–13.
 46.
Spiegelman D, McDermott A, Rosner B. Regression calibration method for correcting measurementerror bias in nutritional epidemiology. Am J Clin Nutr. 1997;65(4 Suppl):1179S–86S. Epub 1997/04/01
 47.
Caroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement error in nonlinear models. 2nd ed. London: Chapman & Hall; 2006.
 48.
Kipnis V, Subar AF, Midthune D, Freedman LS, BallardBarbash R, Troiano RP, et al. Structure of dietary measurement error: results of the OPEN biomarker study. Am J Epidemiol. 2003;158(1):14–21. discussion 26
 49.
Schatzkin A, Kipnis V, Carroll RJ, Midthune D, Subar AF, Bingham S, et al. A comparison of a food frequency questionnaire with a 24hour recall for use in an epidemiological cohort study: results from the biomarkerbased observing protein and energy nutrition (OPEN) study. Int J Epidemiol. 2003;32(6):1054–62.
 50.
Kipnis V, Midthune D, Freedman LS, Bingham S, Schatzkin A, Subar A, et al. Empirical evidence of correlated biases in dietary assessment instruments and its implications. Am J Epidemiol. 2001;153(4):394–403.
 51.
Spiegelman D, Schneeweiss S, McDermott A. Measurement error correction for logistic regression models with an "alloyed gold standard". Am J Epidemiol. 1997;145(2):184–96.
 52.
Kipnis V, Carroll RJ, Freedman LS, Li L. Implications of a new dietary measurement error model for estimation of relative risk: application to four calibration studies. Am J Epidemiol. 1999;150(6):642–51.
 53.
Spiegelman D, Zhao B, Kim J. Correlated errors in biased surrogates: study designs and methods for measurement error correction. Stat Med. 2005;24(11):1657–82.
 54.
Preis SR, Spiegelman D, Zhao BB, Moshfegh A, Baer DJ, Willett WC. Application of a repeatmeasure biomarker measurement error model to 2 validation studies: examination of the effect of withinperson variation in biomarker measurements. Am J Epidemiol. 2011;173(6):683–94.
 55.
Freedman LS, Carroll RJ, Wax Y. Estimating the relation between dietary intake obtained from a food frequency questionnaire and true average intake. Am J Epidemiol. 1991;134(3):310–20.
 56.
Dodd KW, Midthune D, Kipnis V. Re: “application of a repeatmeasure biomarker measurement error model to 2 validation studies: examination of the effect of withinperson variation in biomarker measurements”. Am J Epidemiol. 2012;175(1):84–5.
 57.
Keogh RH, White IR, Rodwell SA. Using surrogate biomarkers to improve measurement error models in nutritional epidemiology. Stat Med. 2013;32(22):3838–61.
 58.
Tooze J, Midthune D, Dodd K, Freedman L, KrebsSmith S, Subar A, et al. A new statistical method for estimating the usual intake of episodically consumed foods with application to their distribution. J Am Diet Assoc. 2006;106(10):1575–87.
 59.
Kipnis V, Midthune D, Buckman DW, Dodd KW, Guenther PM, KrebsSmith SM, et al. Modeling data with excess zeros and measurement error: application to evaluating relationships between episodically consumed foods and health outcomes. Biometrics. 2009;65(4):1003–10.
 60.
Agogo GO, van der Voet H, van't Veer P, van Eeuwijk FA, Boshuizen HC. Evaluation of a twopart regression calibration to adjust for dietary exposure measurement error in the Cox proportional hazards model: A simulation study. Biometrical J. 2016;58(4):76682.
 61.
Keogh RH, White IR. Allowing for never and episodic consumers when correcting for error in food record measurements of dietary intake. Biostatistics. 2011;12(4):624–36. Epub 2011/03/08
 62.
Tooze JA, Midthune D, Dodd KW, Freedman LS, KrebsSmith SM, Subar AF, et al. A new method for estimating the usual intake of episodicallyconsumed foods with application to their distribution. J Am Diet Assoc. 2006;106(10):1575–87.
 63.
Beydoun MA, Kaufman JS, Ibrahim J, Satia JA, Heiss G. Measurement error adjustment in essential fatty acid intake from a food frequency questionnaire: alternative approaches and methods. BMC Med Res Methodol. 2007;7:41.
 64.
Kaaks R, Riboli E, Esteve J, van Kappel AL, van Staveren WA. Estimating the accuracy of dietary questionnaire assessments: validation in terms of structural equation models. Stat Med. 1994;13(2):127–42.
 65.
Freedman LS, Midthune D, Carroll RJ, Kipnis V. A comparison of regression calibration, moment reconstruction and imputation for adjusting for covariate measurement error in regression. Stat Med. 2008;27(25):5195–216.
 66.
Freedman LS, Fainberg V, Kipnis V, Midthune D, Carroll RJ. A new method for dealing with measurement error in explanatory variables of regression models. Biometrics. 2004;60(1):172–81.
 67.
Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338
 68.
Fraser GE, Stram DO. Regression calibration in studies with correlated variables measured with error. Am J Epidemiol. 2001;154(9):836–44.
 69.
Fraser GE, Stram DO. Regression calibration when foods (measured with error) are the variables of interest: markedly nonGaussian data with many zeroes. Am J Epidemiol. 2012;175(4):325–31.
 70.
Carroll RJ, Pee D, Freedman LS, Brown CC. Statistical design of calibration studies. Am J Clin Nutr. 1997;65(4):1187S–9S.
 71.
Riboli E, Kaaks R. Invited commentary: the challenge of multicenter cohort studies in the search for diet and cancer links. Am J Epidemiol. 2000;151(4):371–4. discussion 56
 72.
Thompson FE, Kipnis V, Midthune D, Freedman LS, Carroll RJ, Subar AF, et al. Performance of a foodfrequency questionnaire in the US NIHAARP (National Institutes of HealthAmerican Association of Retired Persons) diet and health study. Public Health Nutr. 2008;11(2):183–95.
 73.
Jenab M, Slimani N, Bictash M, Ferrari P, Bingham SA. Biomarkers in nutritional epidemiology: applications, needs and new horizons. Hum Genet. 2009;125(5–6):507–25.
 74.
Wong MY, Day NE, Bashir SA, Duffy SW. Measurement error in epidemiology: the design of validation studies I: univariate situation. Stat Med. 1999;18(21):2815–29.
 75.
Carroll RJ, Freedman L, Pee D. Design aspects of calibration studies in nutrition, with analysis of missing data in linear measurement error models. Biometrics. 1997;53(4):1440–57.
 76.
Bland J, Altman D. Statistical methods for assessing agreeement between two methods of clinical measurement. Lancet. 1986;327(8476):307–10.
 77.
Biró G, Hulshof KFAM, Ovesen L, Amorim Cruz JA. Selection of methodology to assess food intake. Eur J Clin Nutr. 2002;56(Supplement 2):S25–32.
 78.
Kaaks R, Riboli E, van Staveren W. Sample size requirements for calibration studies of dietary intake measurements in prospective cohort investigations. Am J Epidemiol. 1995;142(5):557–65.
 79.
Stram DO, Longnecker MP, Shames L, Kolonel LN, Wilkens LR, Pike MC, et al. Costefficient design of a diet validation study. Am J Epidemiol. 1995;142(3):353–62.
 80.
Rosner B, Willett WC. Interval estimates for correlation coefficients corrected for withinperson variation: implications for study design and hypothesis testing. Am J Epidemiol. 1988;127(2):377–86.
 81.
Dodd KW, Guenther PM, Freedman LS, Subar AF, Kipnis V, Midthune D, et al. Statistical methods for estimating usual intake of nutrients and foods: a review of the theory. J Am Diet Assoc. 2006;106(10):1640–50.
 82.
Wong MY, Day NE, Wareham NJ. Measurement error in epidemiology: the design of validation studies II: bivariate situation. Stat Med. 1999;18(21):2831–45.
 83.
Park S, Stram DO, editors. Costefficient design of main cohort and calibration studies where one or more exposure variables are measured with errors. Amercian statistical association: joint statistical meetings  section on statistics in epidemiology; 2002 2002.
 84.
Freedman LS, Schatzkin A, Thiebaut ACM, Potischman N, Subar AF, Thompson FE, et al. Abandon neither the food frequency questionnaire nor the dietary fatbreast cancer hypothesis. Cancer Epidemiol Biomakers Prev. 2007;16(6):1321–2.
 85.
Schatzkin A, Subar AF, Thompson FE, Harlan LC, Tangrea J, Hollenbeck AR, et al. Design and serendipity in establishing a large cohort with wide dietary intake distributions: the National Institutes of Health–American Association of Retired Persons Diet and Health Study. Am J Epidemiol. 2001;154(12):1119–25.
 86.
Spiegelman D. Approaches to uncertainty in exposure assessment in environmental epidemiology. Annu Rev Public Health. 2010;31:149–63. Epub 2010/01/15
 87.
Willett W. An overview of issues related to the correction of nondifferential exposure measurement error in epidemiologic studies. Stat Med. 1989;8(9):1031–40. discussion 713
 88.
Keogh RH, White IR. Allowing for never and episodic consumers when correcting for error in food record measurements of dietary intake. Biostatistics. 2011;12(4):624–36.
 89.
Buonaccorsi JP, Dalen I, Laake P, Hjartåker A, Engeset D, Thoresen M. Sensitivity of regression calibration to nonperfect validation data with application to the Norwegian women and cancer study. Stat Med. 2015;34(8):1389–403.
 90.
Sauerbrei W, Abrahamowicz M, Altman DG, le Cessie S, Carpenter J. On behalf of the Si. STRengthening analytical thinking for observational studies: the STRATOS initiative. Stat Med. 2014;33(30):5413–32.
 91.
Liao X, Zucker DM, Li Y, Spiegelman D. Survival analysis with errorprone timevarying covariates: a risk set calibration approach. Biometrics. 2011;67(1):50–8.
 92.
Statacorp. Stata software for generalized linear measurement error models. 2015 [1 May 2015]; Available from: http://www.stata.com/merror/.
 93.
Bilonick RA. Merror : accuracy and precision of measurements. R package version 10 http:\\www.rproject.org; 2003.
 94.
Pérez A, Zhang S, Kipnis V, Midthune D, Freedman LS, Carroll RJ. Intake_epis_food(): an R function for fitting a Bivariate nonlinear measurement error model to estimate usual and energy Intake for episodically consumed foods. J Stat Softw. 2012;46(c03):1–17.
 95.
SAS. SAS User Manual. 9.2 ed. Cary, NC: SAS Institute Inc; 2007.
 96.
Boker S, Neale M., Mae H., Metah P., Kenney S., Bates T., Estabrook R., Spies J., Brick T., Spiegel M OpenMx: the OpenMx statistical Modelling package.. R package version 0.2.31006 ed2010.
 97.
Holford TR, Stack C. Study design for epidemiologic studies with measurement error. Stat Methods Med Res. 1995;4(4):339–58.
 98.
Thomas D. Measurement error and exposure models. In: Thomas D, editor. Statistical methods in environmental epidemiology. Oxford: Oxford University Press; 2009. p. 221–57.
 99.
Thurigen D, Spiegelman D, Blettner M, Heuer C, Brenner H. Measurement error correction using validation data: a review of methods and their applicability in casecontrol studies. Stat Methods Med Res. 2000;9(5):447–74.
 100.
Guolo A. Robust techniques for measurement error correction: a review. Stat Methods Med Res. 2008;17(6):555–80.
 101.
Frost C, Thompson SG. Correcting for regression dilution bias: comparison of methods for a single predictor variable. J R Stat Soc Ser A. 2000;163(2):173–89.
 102.
Keogh RH, White IR. A toolkit for measurement error correction, with a focus on nutritional epidemiology. Stat Med. 2014;33(12):2137–55.
 103.
Prentice RL, Tinker LF, Huang Y, Neuhouser ML. Calibration of selfreported dietary measures using biomarkers: an approach to enhancing nutritional epidemiology reliability. Curr Atheroscler Rep. 2013;15(9):353.
 104.
National Institute of Health, Institute NC. Dietary Assessment Primer. [cited 2016 4 November ]; Available from: https://dietassessmentprimer.cancer.gov/.
 105.
Kaaks R, Riboli E, Sinha R. Biochemical markers of dietary intake. IARC Sci Publ. 1997;142:103–26.
 106.
Willet WC, Stampfer M. Total energy intake: implications for epidemiological analyses. Am J Epidemiol. 1986;124(1):17–27.
 107.
Harris JA. On the calculation the intraclass and interclass coefficients of possible correlation from class moments when the number of possible combinations is large. Biometrika. 1913;9(3–4):446–72.
Acknowledgements
Alison Campbell undertook the initial searches, and she was supported by The Public Population Project in Genomics and Society via the Canadian Genome Analysis and Technology Program. We would also like to thank Lindsey F Masson, for help with data extraction. Julian Little is supported by a tier 1 Canada Research Chair in Human Genome Epidemiology.
Funding
This project has not received any funding other than from the initial workshop which identified the need for this systematic review.
Availability of data and materials
Additional file 1: Table S1 and Additional file 2: Table S2 contain the studies with the associated references included in this report.
Author information
Affiliations
Contributions
All authors participated in the design of the study and in the data acquisition and review. DB drafted the manuscript and DL, JL, and CM made critical revisions for important intellectual content. DL executed the search strategy and retrieved all of the relevant reports. All authors contributed to, read, and approved the final manuscript.
Corresponding author
Correspondence to Derrick A. Bennett.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1: Table S1.
Reports that described the development of a method to quantify, or correct for measurement error (RTF 268 kb)
Additional file 2: Table S2.
Reports that applied an existing method to quantify, or correct for measurement error (RTF 578 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Measurement error
 Continuous exposure
 Nutrition
 Statistical methods