### Data

The developed Bayesian method is illustrated using individually matched case-control data from a study by Chan et al. [12]. The objective of the study was to examine the risk of maternal hypothyroxinemia due to exposure to three PFAs: perfluorooctanoic acid (PFOA), perfluorooctane sulfonate (PFOS) and perfluorohexane sulfonate (PFHxS). Chan et al. [12] extracted PFAs from maternal serum samples from 271 pregnant women, aged 18 or older, who elected to undergo a second trimester prenatal "triple screen", delivered at 22 weeks gestation or more to live singletons without evidence of malformations, and were referred by a physician who made at least eight recommendations for the "triple screen" over the study period. The exposure variables were reported on a continuous scale, and censored/non-detectable values (about 5.4% of the total number of records) were recorded as half the value of the limit of detection. Concentrations of the PFAs were transformed to log-molar units, after which the measured exposures approximately followed a normal distribution. A quality control experiment was performed to assess the amount of error incurred in the measurement of the exposures. In this experiment, percentages of recovery were calculated for each exposure, and the results revealed the presence of random error in the measurements. Details of this procedure and its results are presented in Appendix I.
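As an illustration of the preprocessing described above, the following minimal Python sketch substitutes half the limit of detection for non-detects and converts a concentration to log-molar units. The function name, the ng/mL input unit, the use of the natural logarithm, and the numerical values in the usage lines (a limit of detection of 0.2 and a molar mass of 400 g/mol) are assumptions made for illustration only; they are not taken from Chan et al. [12].

```python
import math

def to_log_molar(conc_ng_per_ml, lod_ng_per_ml, molar_mass_g_per_mol):
    """Return the natural log of a molar concentration (mol/L).

    Non-detects (None, or a value below the limit of detection) are
    replaced by LOD/2, mirroring the substitution rule in the text.
    The input unit (ng/mL) and the log base are assumptions.
    """
    if conc_ng_per_ml is None or conc_ng_per_ml < lod_ng_per_ml:
        conc_ng_per_ml = lod_ng_per_ml / 2.0
    grams_per_litre = conc_ng_per_ml * 1e-6  # 1 ng/mL equals 1e-6 g/L
    return math.log(grams_per_litre / molar_mass_g_per_mol)

# Hypothetical values: a detected sample and a non-detect.
detected = to_log_molar(4.0, lod_ng_per_ml=0.2, molar_mass_g_per_mol=400.0)
non_detect = to_log_molar(None, lod_ng_per_ml=0.2, molar_mass_g_per_mol=400.0)
```

Substituting LOD/2 before taking logarithms keeps every record on the log scale, which is what allows the approximately normal model for the measured exposures to be applied to all subjects.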

Chan et al. [12] classified the subjects into cases or controls, based on the analysis of their thyroid stimulating hormone (TSH) and free thyroxin (T4) concentrations. The hypothyroxinemia cases correspond to women exhibiting normal TSH concentrations with no evidence of hyperthyroidism (between 0.15 and 4.0 mU/L) and free T4 in the 10^{th} percentile (less than 8.8 pmol/L). Meanwhile the controls correspond to women with normal TSH concentrations but having free T4 concentrations between the 50^{th} and 90^{th} percentiles (between 12.0 and 14.1 pmol/L). Each case was matched to between one and three controls on the basis of two matching factors: maternal age at blood draw (± 3 years) and referring physician (a total of 29 physicians). Further details on the construction of the data can be found in Chan et al. [12].

In summary, the matched case-control data used to illustrate the Bayesian method to correct for measurement error contain information on 96 cases and 175 individually matched controls. For the purpose of this paper, it is assumed there is no misclassification of case/control status. In addition, the data contain, for each subject, the corresponding exposure to PFOA, PFOS and PFHxS, which are reported on a continuous scale in log-molar units and are assumed to be subject only to random measurement error. Moreover, four potential confounders, all precisely measured, are reported: maternal age (years), maternal weight (pounds), maternal race (Caucasian and non-Caucasian) and gestational age (days). All potential confounders except maternal race were reported on a continuous scale. The maternal age variable is retained despite its use as a matching factor, in case the matching is too coarse to fully eliminate confounding.

### Measurement model

In observational studies, a vector of imprecise surrogate exposures W is commonly recorded instead of the true exposures X. Therefore, in order to understand the relationship between the disease risk and the explanatory variables X when only data on *Y* and W are available, it is necessary to account for measurement error in the exposures. This paper concentrates on the case of purely random error, assuming zero systematic error. However, the methodology can be adapted to incorporate a systematic error.

Assume the vector of independent surrogates W arises from a classical additive measurement error model, which can be expressed as

$$\mathbf{W} = \mathbf{X} + \mathbf{U},$$

where U refers to the measurement error component. This classical model assumes the true exposures are recorded with an additive, independent error. In addition, it can be assumed the measurement error is *non-differential* and *unbiased*. The assumption of *non-differential* measurement error means that the distribution of the surrogate exposures depends only on the actual exposure variables and not on the response variable or other variables in the model. As a result, the conditional distribution of (W|X,*Y*) is identical to the conditional distribution of (W|X). The *unbiased* assumption *E*(U|X) = 0 implies *E*(W|X) = X. Typically, the measurement error component is also assumed to be normally distributed with constant variance, i.e. U ~ *N*_{P}(0, *Σ*), where *Σ* is a diagonal matrix with main diagonal entries *σ*_{p}^{2}, for *p* = 1, ..., *P*.

Under the stated assumptions, W|X follows a *P*-dimensional multivariate normal distribution with mean vector given by the vector of true exposures and covariance matrix *Σ*, which in this case is known. Thus, the measurement model is W|X, *Σ* ~ *N*_{P}(X, *Σ*). Therefore, under the assumptions of individually matched case-control data, the density of the measurement model is given by

$$f(\mathbf{w} \mid \mathbf{x}, \Sigma) = \prod_{i=1}^{N} \prod_{j=1}^{n_i} (2\pi)^{-P/2}\, |\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2} (\mathbf{w}_{ij} - \mathbf{x}_{ij})^{T} \Sigma^{-1} (\mathbf{w}_{ij} - \mathbf{x}_{ij}) \right\}, \qquad (2)$$

where w_{ij} and x_{ij} correspond to the vectors of surrogate and true exposure variables, respectively, for the *j-th* member of the *i-th* matched set, *n*_{i} is the number of subjects in the *i-th* matched set, and *N* refers to the number of matched sets.
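The measurement model can be sketched in a few lines of Python. The example below draws a surrogate vector from the classical additive model and evaluates the log of the measurement-model density over all matched sets, assuming a diagonal covariance matrix as stated in the text. The function names and the nested-list data layout are hypothetical choices for illustration, and the exposure values and error standard deviations in the usage lines are made up.

```python
import math
import random

def simulate_surrogate(x, sigma, rng):
    """Draw W = X + U with independent U_p ~ N(0, sigma_p^2)."""
    return [xp + rng.gauss(0.0, sp) for xp, sp in zip(x, sigma)]

def measurement_loglik(w_sets, x_sets, sigma):
    """Log of the measurement-model density over all matched sets.

    w_sets[i][j] and x_sets[i][j] are length-P lists for subject j of
    set i; sigma holds the P known error standard deviations, i.e. the
    square roots of the diagonal entries of the covariance matrix.
    """
    total = 0.0
    for w_i, x_i in zip(w_sets, x_sets):
        for w_ij, x_ij in zip(w_i, x_i):
            for w, x, s in zip(w_ij, x_ij, sigma):
                total += -0.5 * math.log(2.0 * math.pi * s * s)
                total += -0.5 * ((w - x) / s) ** 2
    return total

# Hypothetical data: one matched set with a case and one control, P = 3.
rng = random.Random(0)
sigma = [0.10, 0.15, 0.20]
x_sets = [[[-18.0, -17.5, -19.0], [-18.2, -17.9, -19.3]]]
w_sets = [[simulate_surrogate(x_ij, sigma, rng) for x_ij in x_i] for x_i in x_sets]
loglik = measurement_loglik(w_sets, x_sets, sigma)
```

Because the covariance matrix is diagonal, the multivariate normal log-density reduces to a sum of univariate terms, which is why no matrix inversion appears in the sketch.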

For the particular case of the data used in the study on PFAs, the surrogate variables are the measured concentrations of PFOA, PFOS, and PFHxS, reported on a continuous scale in log-molar units. Consequently, an additive measurement error model for the exposures in log-molar units translates into a multiplicative error structure on the molar scale, in which the size of the error is proportional to the true exposure. In many epidemiological studies, positive explanatory variables are subject to this sort of measurement error. Using available validation data from the quality control procedure performed by Chan et al. [12], the covariance matrix *Σ* of the measurement model can be estimated. In Appendix I we present a statistical argument for estimating *Σ* from this particular form of quality control data. The argument is based on the multivariate version of the delta method [19] and uses the estimated standard deviation of the percentages of recovery for the concentrations of the three compounds in parts-per-billion to obtain information about the error incurred in the measurement of the exposures.
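To give a feel for how a delta-method argument links variability on the original scale to variability on the log scale, the sketch below uses the first-order approximation sd(log R) ≈ sd(R) / E(R) for a positive quantity R and checks it by simulation. This is a univariate simplification and is not the Appendix I derivation itself; the function name and the recovery summary (mean 100%, standard deviation 10%) are hypothetical.

```python
import math
import random

def delta_method_log_sd(mean_recovery, sd_recovery):
    """First-order delta method: sd(log R) is approximately sd(R) / E(R)."""
    return sd_recovery / mean_recovery

# Hypothetical recovery summary (illustrative numbers only).
approx = delta_method_log_sd(100.0, 10.0)

# Monte Carlo check on simulated recoveries with the same mean and sd.
rng = random.Random(42)
logs = [math.log(rng.gauss(100.0, 10.0)) for _ in range(20000)]
mean_log = sum(logs) / len(logs)
empirical = math.sqrt(sum((v - mean_log) ** 2 for v in logs) / (len(logs) - 1))
```

With these inputs the simulated standard deviation of the log recoveries should lie close to the delta-method value, and the approximation improves as the coefficient of variation shrinks.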

### Disease model

In order to describe the relationship between the true exposures and the probability of the response variable, it is necessary to specify a disease model. Since the study analysed in this paper involves matched sets, the conditional logistic regression likelihood is adopted.

Consider a study having *N* matched sets, such that the *j-th* member (*j* = 1, ..., *n*_{i}) of the *i-th* set (*i* = 1, ..., *N*) has *P* associated continuous exposures X_{ij} = (*X*_{ij1}, ..., *X*_{ijP})^{T}. In addition, let Y_{i} = (*Y*_{i1}, ..., *Y*_{in_i})^{T} be the vector of response variables associated with the *i-th* matched set, such that *Y*_{ij} = 1 for the cases and *Y*_{ij} = 0 for the controls. Without loss of generality, the subjects can be labelled such that *Y*_{i1} = 1 and *Y*_{ij} = 0, for *j* = 2, ..., *n*_{i}. Thus, the underlying objective is to model the retrospective probabilities for the case, i.e. *P*(X_{i1} | *Y*_{i1} = 1), and for the controls, i.e. *P*(X_{ij} | *Y*_{ij} = 0) for *j* = 2, ..., *n*_{i}, which can be accomplished by using the conditional logistic regression model.

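The conditional likelihood is obtained by conditioning on the number of cases in each matched set, i.e. conditioning on $\sum_{j=1}^{n_i} Y_{ij}$. In the particular case of individual matching, the number of cases is one; therefore, the conditional likelihood for the *i-th* matched set (*i* = 1, ..., *N*) is given by [20, 21]

$$L_i(\boldsymbol{\beta}) = \frac{\exp(\boldsymbol{\beta}^{T}\mathbf{x}_{i1})}{\sum_{j=1}^{n_i} \exp(\boldsymbol{\beta}^{T}\mathbf{x}_{ij})},$$

where β = (β_{1}, ..., β_{P})^{T} corresponds to the log ORs associated with unit changes in each of the exposures. Under the assumptions of a matched case-control study, the full conditional likelihood is the product of $L_i(\boldsymbol{\beta})$ over the *N* strata or matched sets, which is

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{N} \frac{\exp(\boldsymbol{\beta}^{T}\mathbf{x}_{i1})}{\sum_{j=1}^{n_i} \exp(\boldsymbol{\beta}^{T}\mathbf{x}_{ij})}. \qquad (3)$$

The parameter β is assumed to be constant across matched sets, and it is the target of statistical inference.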
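The conditional logistic regression likelihood for individually matched sets (one case per set, listed first) can be evaluated with a short function. The following is a minimal sketch, not the authors' implementation; the function name and the nested-list data layout are hypothetical, and a log-sum-exp shift is added purely for numerical stability.

```python
import math

def conditional_loglik(beta, x_sets):
    """Log conditional likelihood for 1:M matched sets.

    x_sets[i][0] is the exposure vector of the case in set i and
    x_sets[i][1:] are the exposure vectors of its matched controls.
    """
    total = 0.0
    for x_i in x_sets:
        # Linear predictor beta^T x_ij for every member of the set.
        scores = [sum(b * x for b, x in zip(beta, x_ij)) for x_ij in x_i]
        m = max(scores)  # shift so the largest exponent is zero
        log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
        total += scores[0] - log_denom
    return total

# Hypothetical exposures: a 1:1 set and a 1:2 set with P = 2.
x_sets = [[[1.0, 2.0], [3.0, 4.0]],
          [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]]
null_value = conditional_loglik([0.0, 0.0], x_sets)
```

When β is the zero vector every member of a set is equally likely to be the case, so each set of size *n*_{i} contributes -log(*n*_{i}) to the log-likelihood; this gives a simple sanity check for any implementation.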

### Bayesian model

Consider retrospectively collected matched case-control data in which each case is matched to one or more controls, with suspected confounders as matching factors. Let *Y* be the response variable, such that *Y* = 1 for cases and *Y* = 0 for controls, let X = (*X*_{1}, ..., *X*_{P})^{T} be the *P*-dimensional vector of the true, latent, continuous exposures which are subject to measurement error, and let W = (*W*_{1}, ..., *W*_{P})^{T} be the *P*-dimensional vector of surrogate exposures.

The aim of this subsection is to develop a Bayesian method to understand the association between the vector of continuous exposures X and the probability of the response variable *Y*, after correcting for random measurement error in the exposures.

Under the Bayesian paradigm, the posterior density of the unknown quantities is given by

$$f(\mathbf{x}, \boldsymbol{\theta} \mid \mathbf{w}, \mathbf{y}) \propto f(\mathbf{x}, \mathbf{w} \mid \mathbf{y}, \boldsymbol{\theta}) \times f(\boldsymbol{\theta}), \qquad (4)$$

where θ refers to the vector of unknown parameters. The first term on the right hand side of (4) is the joint density of the true exposures X and the surrogate variables W, given the response and the parameters. As will be shown in Appendix II, this term contains the densities of the measurement model, disease model and exposure model. Meanwhile, the second term corresponds to the prior distribution of the unknown parameters.

### Exposure model

The conditional logistic regression model has been successfully applied in matched retrospective case-control studies, and the use of this procedure has been statistically justified using Bayesian (see for example [11, 22, 23]) and non-Bayesian (see for instance [20, 21]) approaches. This justification is based on the fact that the likelihood term describing the distribution of the total number of cases within-stratum given the exposures can be discarded. The reason for this is that it does not provide information about the parameter of interest, since the likelihood is only a function of the unknown parameter *β*. However, this justification is no longer directly applicable when adjusting for measurement error in exposures, since the omitted likelihood term might also contain information about these exposures [6], i.e., the likelihood is a function of *β* and the unobserved exposures. As a result, the use of a conditional likelihood approach in the presence of measurement error in multiple continuous exposures has not been widely adopted.

We justify the use of conditional logistic regression likelihood as a disease model when adjusting for measurement error in an individually matched case-control study via a random-effect exposure model; details are presented in Appendix II. A different approach that does not involve a random-effect exposure model is provided by Gulo et al. [8].

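In order to describe the random-effect exposure model, we assume that the vector of exposures for the *j-th* subject from the *i-th* matched set follows a multivariate normal distribution around the vector of exposure means of the corresponding matched set. Moreover, since the vector of exposure means of a matched set is unknown, we assume that this vector follows a multivariate normal distribution centered on the across-set exposure means. That results in

$$\mathbf{X}_{ij} = \boldsymbol{\mu} + \boldsymbol{\lambda}_{i} + \boldsymbol{\gamma}_{ij},$$

where γ_{ij} ~ *N*_{P}(0, *V*_{W}), and λ_{i} ~ *N*_{P}(0, *V*_{B}), such that γ_{ij} and λ_{i} are mutually independent. Here μ is the vector of across-set exposure means and *V*_{B} denotes the between-stratum covariance matrix. Also notice that the within-stratum covariance matrix *V*_{W} is assumed to be constant across matched sets. As a result, X_{ij} | λ_{i} ~ *N*_{P}(μ + λ_{i}, *V*_{W}), and, marginally, X_{ij} ~ *N*_{P}(μ, *V*_{W} + *V*_{B}). Thus, the density corresponding to the random-effect exposure model is given by

$$f(\mathbf{x} \mid \boldsymbol{\lambda}, \boldsymbol{\mu}, V_W)\, f(\boldsymbol{\lambda} \mid V_B) = \prod_{i=1}^{N} \left\{ \phi_P(\boldsymbol{\lambda}_{i};\, \mathbf{0}, V_B) \prod_{j=1}^{n_i} \phi_P(\mathbf{x}_{ij};\, \boldsymbol{\mu} + \boldsymbol{\lambda}_{i}, V_W) \right\}, \qquad (5)$$

where $\phi_P(\cdot\,; \mathbf{m}, V)$ denotes the density of a *P*-dimensional normal distribution with mean vector $\mathbf{m}$ and covariance matrix $V$.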

It has been assumed that the vector of true exposures follows a *P*-dimensional multivariate normal distribution. However, in observational studies, exposures often have a skew distribution [24]. Therefore, it is important to keep in mind that incorrect model specification may lead to biased estimates. To overcome potential misspecifications, for the univariate case some authors [24–26] have proposed the use of flexible distributions to increase the robustness to model specification. However, implementation of such methods can be quite challenging in the context of multivariate exposures.
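The two-level structure of the random-effect exposure model can be made concrete by simulation. The sketch below assumes the additive form "across-set mean plus set-level effect plus subject-level deviation" implied by the description above, and, as a simplification not made in the text, takes both covariance matrices to be diagonal so that only the standard library is needed. The function name, argument names, and numerical values are hypothetical.

```python
import random

def simulate_exposures(mu, sd_between, sd_within, set_sizes, rng):
    """Simulate true exposures for matched sets under a random-effect model.

    Each set receives one set-level effect (between-set variation) and
    each subject an independent deviation (within-set variation).
    Diagonal covariance matrices are assumed for simplicity.
    """
    data = []
    for n_i in set_sizes:
        lam = [rng.gauss(0.0, s) for s in sd_between]      # set-level effect
        members = []
        for _ in range(n_i):
            gam = [rng.gauss(0.0, s) for s in sd_within]   # subject-level deviation
            members.append([m + l + g for m, l, g in zip(mu, lam, gam)])
        data.append(members)
    return data

# Hypothetical settings: P = 3 exposures, three sets of sizes 2, 3 and 4.
rng = random.Random(0)
x = simulate_exposures(
    mu=[-18.0, -17.5, -19.0],
    sd_between=[0.5, 0.5, 0.5],
    sd_within=[0.3, 0.3, 0.3],
    set_sizes=[2, 3, 4],
    rng=rng,
)
```

Members of the same matched set share the set-level effect, so their exposures are positively correlated; this shared component is what lets the model borrow information within a set when reconstructing the latent exposures.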

### Joint posterior density

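For the particular case of this paper, the data considered consist of *N* = 96 matched sets, such that the *i-th* matched set has *j* = 1, ..., *n*_{i} subjects, with *n*_{i} ∈ {2, 3, 4}. Thus one subject per set is the case and the remaining *n*_{i} - 1 subjects are the controls. Let θ denote the vector of unknown parameters. Therefore, by (4) and (AII.3), and using densities in (2), (3) and (5), it follows that the posterior density of the unknown quantities can be expressed as

$$f(\mathbf{x}, \boldsymbol{\theta} \mid \mathbf{w}, \mathbf{y}) \propto f(\mathbf{w} \mid \mathbf{x}, \Sigma) \times \prod_{i=1}^{N} \frac{\exp(\boldsymbol{\beta}^{T}\mathbf{x}_{i1})}{\sum_{j=1}^{n_i} \exp(\boldsymbol{\beta}^{T}\mathbf{x}_{ij})} \times f(\mathbf{x} \mid \boldsymbol{\theta}) \times f(\boldsymbol{\theta}),$$

where $f(\mathbf{w} \mid \mathbf{x}, \Sigma)$ is the measurement-model density in (2), the product over matched sets is the conditional likelihood in (3), $f(\mathbf{x} \mid \boldsymbol{\theta})$ denotes the exposure-model density in (5), and $f(\boldsymbol{\theta})$ is the prior density.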

It is commonly assumed that the unknown parameters are independent of each other *a priori*, so that the joint prior density *f*(θ) is the product of the marginal prior densities of the components of θ. In order to implement Bayesian Markov chain Monte Carlo (MCMC) inference, it is necessary to assume prior distributions for the unknown parameters. Proper prior distributions are assumed for all the parameters. Moreover, the corresponding hyperparameters are chosen so that the priors reflect prior ignorance:

where *W*_{P}(*R*, *b*) indicates a *P*-dimensional Wishart distribution with positive definite inverse scale matrix *R* and *b* degrees of freedom, and *I*_{P} is the identity matrix of size *P*. For the particular case of the matched case-control data from the epidemiological study on PFAs, *P* = 3 and μ is estimated using the across-set sample mean of the corresponding observed exposures.

### Adjustment for additional confounders

Because confounding may be only partially addressed by matching, further potential confounders can be introduced in the disease model. In general, potential confounders should also be included in the exposure model; however, for simplicity these confounders are not considered in our random-effect exposure model, which is kept as presented in equation (5). For the PFA data, this simplification might be justified by the fact that the exposures and the confounders exhibit small correlations (less than 0.18), so we do not expect the potential confounders to be very helpful in reconstructing the true exposures. In addition, due to the assumption of *non-differential* measurement error, the measurement model also remains as presented in equation (2).

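Consider the situation where the *j-th* member of the *i-th* set has *K* associated potential confounding variables Z_{ij} = (*Z*_{ij1}, ..., *Z*_{ijK})^{T}, which are precisely measured. Then the full conditional likelihood corresponding to the disease model in (3) can be rewritten as

$$L(\boldsymbol{\beta}, \boldsymbol{\delta}) = \prod_{i=1}^{N} \frac{\exp(\boldsymbol{\beta}^{T}\mathbf{x}_{i1} + \boldsymbol{\delta}^{T}\mathbf{z}_{i1})}{\sum_{j=1}^{n_i} \exp(\boldsymbol{\beta}^{T}\mathbf{x}_{ij} + \boldsymbol{\delta}^{T}\mathbf{z}_{ij})},$$

where δ = (δ_{1}, ..., δ_{K})^{T} is the vector of parameters associated with the confounding effect.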

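Thus, the posterior density of the unknown quantities can be rewritten as

$$f(\mathbf{x}, \boldsymbol{\theta}^{*} \mid \mathbf{w}, \mathbf{y}, \mathbf{z}) \propto f(\mathbf{w} \mid \mathbf{x}, \Sigma) \times \prod_{i=1}^{N} \frac{\exp(\boldsymbol{\beta}^{T}\mathbf{x}_{i1} + \boldsymbol{\delta}^{T}\mathbf{z}_{i1})}{\sum_{j=1}^{n_i} \exp(\boldsymbol{\beta}^{T}\mathbf{x}_{ij} + \boldsymbol{\delta}^{T}\mathbf{z}_{ij})} \times f(\mathbf{x} \mid \boldsymbol{\theta}) \times f(\boldsymbol{\theta}^{*}),$$

where $\boldsymbol{\theta}^{*} = (\boldsymbol{\theta}, \boldsymbol{\delta})$ is the new vector of unknown parameters, $f(\mathbf{w} \mid \mathbf{x}, \Sigma)$ and $f(\mathbf{x} \mid \boldsymbol{\theta})$ are the densities of the measurement model (2) and the exposure model (5), and a proper and diffuse prior distribution is assumed for the parameter δ, by having δ ~ *N*_{K}(0, 10000 *I*_{K}), where *K* = 4 for the particular case of the motivating example.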
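The confounder-adjusted disease model and the diffuse normal prior on δ can be combined into the part of the log posterior that involves δ. The sketch below is illustrative only: the function names and data layout are hypothetical, the case is assumed to be listed first in each set, and the remaining factors of the posterior (measurement model, exposure model, and the other priors) are deliberately left out.

```python
import math

def disease_loglik_with_confounders(beta, delta, x_sets, z_sets):
    """Log conditional likelihood with exposures x and confounders z.

    x_sets[i][j] and z_sets[i][j] hold the exposure and confounder
    vectors of subject j in set i, with the case stored at j = 0.
    """
    total = 0.0
    for x_i, z_i in zip(x_sets, z_sets):
        scores = [
            sum(b * x for b, x in zip(beta, x_ij))
            + sum(d * z for d, z in zip(delta, z_ij))
            for x_ij, z_ij in zip(x_i, z_i)
        ]
        m = max(scores)  # log-sum-exp shift for numerical stability
        total += scores[0] - (m + math.log(sum(math.exp(s - m) for s in scores)))
    return total

def log_prior_delta(delta, variance=10000.0):
    """Log density of delta ~ N_K(0, variance * I_K)."""
    k = len(delta)
    quad = sum(d * d for d in delta) / variance
    return -0.5 * k * math.log(2.0 * math.pi * variance) - 0.5 * quad

# Hypothetical set with one case and two controls, P = 1 and K = 1.
x_sets = [[[1.0], [2.0], [3.0]]]
z_sets = [[[5.0], [6.0], [7.0]]]
log_term = disease_loglik_with_confounders([0.0], [0.0], x_sets, z_sets) + log_prior_delta([0.0])
```

A prior variance of 10000 makes the normal log-prior nearly flat over any plausible range of log odds ratios, so the likelihood term dominates the shape of the posterior for δ.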