Bayesian MR with study heterogeneity
Let X denote the exposure, Y the outcome, and U a scalar variable summarising the set of unobserved confounders of the relationship between X and Y. Traditional MR [9] requires that an IV (denoted by Z) is: (i) associated with the exposure X; (ii) not associated with the confounders U; and (iii) associated with the outcome Y only through the exposure X. These three assumptions are expressed graphically in Fig. 1, in which our interest is whether X causes Y (the X→Y arrow). For the purpose of illustration, we consider the data-generating process shown in Fig. 2. Z_{1}, Z_{2} and Z_{3} are vectors consisting of L, K and M independent IVs, respectively. Random scalar variables X_{1} and X_{2} represent two exposures, and random scalar variables Y_{1} and Y_{2} represent two outcomes.
In a two-sample (or, equivalently, two-study) MR setting with or without overlapping individuals, it has been shown that, compared to conventional MR analysis, a Bayesian approach may lead to more precise estimates of the causal effect by treating it as a case of incomplete data, which may be dealt with through iterative imputations using MCMC [12]. Here, we further generalize the approach by allowing for some degree of heterogeneity between studies.
Suppose we have data collected from two independent studies:

Study A: observed data for the IVs, exposures and outcomes {Z_{1}, Z_{2}, Z_{3}, X_{1}, X_{2}, Y_{1}, Y_{2}}.

Study B: observed data for the IVs and outcomes {Z_{1}, Z_{2}, Z_{3}, Y_{1}, Y_{2}} only.
Study A includes fully observed data for MR, whereas Study B has the exposure data completely missing. We include random-effect terms in our MR model to capture study heterogeneity. Assuming standardised observed variables and linear additivity, and following Fig. 2, our models are constructed as follows.
For Study A,
$$ U \sim N(0, 1), $$
(1)
$$ X_{1} \mid \mathbf{Z}_{1},\mathbf{Z}_{3},U \sim N\left(\boldsymbol{\alpha}_{1}\mathbf{Z}_{1} + \boldsymbol{\alpha}_{31}\mathbf{Z}_{3} + \delta_{X_{1}}U,\, \sigma_{X_{1A}}^{2}\right), $$
(2)
$$ X_{2} \mid \mathbf{Z}_{2},\mathbf{Z}_{3},U \sim N\left(\boldsymbol{\alpha}_{2}\mathbf{Z}_{2} + \boldsymbol{\alpha}_{32}\mathbf{Z}_{3} + \delta_{X_{2}}U,\, \sigma_{X_{2A}}^{2}\right), $$
(3)
$$ Y_{1} \mid X_{1},U \sim N\left(\beta_{1} X_{1} + \delta_{Y_{1}}U,\, \sigma_{Y_{1A}}^{2}\right), $$
(4)
$$ Y_{2} \mid X_{2},U \sim N\left(\beta_{2} X_{2} + \delta_{Y_{2}}U,\, \sigma_{Y_{2A}}^{2}\right). $$
(5)
For Study B,
$$ U \sim N(0, 1), $$
(6)
$$ X_{1} \mid \mathbf{Z}_{1},\mathbf{Z}_{3},U \sim N\left(V_{X_{1}} + \boldsymbol{\alpha}_{1}\mathbf{Z}_{1} + \boldsymbol{\alpha}_{31}\mathbf{Z}_{3} + \delta_{X_{1}}U,\, \sigma_{X_{1B}}^{2}\right), $$
(7)
$$ X_{2} \mid \mathbf{Z}_{2},\mathbf{Z}_{3},U \sim N\left(V_{X_{2}} + \boldsymbol{\alpha}_{2}\mathbf{Z}_{2} + \boldsymbol{\alpha}_{32}\mathbf{Z}_{3} + \delta_{X_{2}}U,\, \sigma_{X_{2B}}^{2}\right), $$
(8)
$$ Y_{1} \mid X_{1},U \sim N\left(V_{Y_{1}} + \beta_{1} X_{1} + \delta_{Y_{1}}U,\, \sigma_{Y_{1B}}^{2}\right), $$
(9)
$$ Y_{2} \mid X_{2},U \sim N\left(V_{Y_{2}} + \beta_{2} X_{2} + \delta_{Y_{2}}U,\, \sigma_{Y_{2B}}^{2}\right). $$
(10)
In the above pre-specified models, the αs are instrument-strength parameters, and the δs are the effects of U on the Xs or Ys. The causal effects of the Xs on the Ys are denoted by βs, and study heterogeneity is accounted for by the Vs. Note that X_{1} and X_{2} have no observed data in Study B, but they are part of the data-generating process and thus should be included in the model. U is a sufficient scalar summary of the unobserved confounders and is assumed to follow N(0,1).
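As a concrete illustration, data can be generated from the Study A model (1)–(5) as in the following sketch. The parameter values (instrument strengths of 0.3, causal effects of 0.3, confounder effects of 1, σ = 0.1, n = 400) are taken from the simulation settings described later; the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, K, M = 400, 15, 15, 5                 # sample size and numbers of IVs

# IV genotypes drawn independently from B(2, 0.3), as in the simulation study
Z1 = rng.binomial(2, 0.3, size=(n, L)).astype(float)
Z2 = rng.binomial(2, 0.3, size=(n, K)).astype(float)
Z3 = rng.binomial(2, 0.3, size=(n, M)).astype(float)

# illustrative parameter values (following the simulation settings)
a1, a2 = np.full(L, 0.3), np.full(K, 0.3)   # instrument strengths
a31, a32 = np.full(M, 0.3), np.full(M, 0.3)
b1 = b2 = 0.3                               # causal effects beta_1, beta_2
dX1 = dX2 = dY1 = dY2 = 1.0                 # effects of U
sigma = 0.1                                 # residual standard deviations

U = rng.normal(0, 1, n)                                        # Eq. (1)
X1 = Z1 @ a1 + Z3 @ a31 + dX1 * U + rng.normal(0, sigma, n)    # Eq. (2)
X2 = Z2 @ a2 + Z3 @ a32 + dX2 * U + rng.normal(0, sigma, n)    # Eq. (3)
Y1 = b1 * X1 + dY1 * U + rng.normal(0, sigma, n)               # Eq. (4)
Y2 = b2 * X2 + dY2 * U + rng.normal(0, sigma, n)               # Eq. (5)
```

The Study B model (6)–(10) differs only by the additive random effects Vs and the B-specific residual variances.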
The combined dataset of Studies A and B (\(\mathcal {D}\), say) will contain fully observed data for the instruments and the outcomes. However, all participants in Study B have missing values of X_{1} and X_{2}, which will be treated as unknown quantities and imputed from their conditional distributions given the observed data and the current parameter estimates using MCMC. Let X^{∗} denote imputed values of X. Our approach involves the following sequence of five steps.

1.
Specify initial values for unknown parameters and the number of Markov iterations T.

2.
At the t-th iteration, where 0≤t<T, fill the missing values of X_{1} and X_{2} in Study B with \(X_{1}^{*}\) drawn from \(N\left (V_{X_{1}}^{(t)} + \boldsymbol {\alpha }_{1}^{(t)}\mathbf {Z}_{1} + \boldsymbol {\alpha }_{31}^{(t)}\mathbf {Z}_{3} + \delta _{X_{1}}^{(t)}U, {\sigma _{X_{1B}}^{2}}^{(t)}\right)\) and \(X_{2}^{*}\) drawn from \(N\left (V_{X_{2}}^{(t)} + \boldsymbol {\alpha }_{2}^{(t)}\mathbf {Z}_{2} + \boldsymbol {\alpha }_{32}^{(t)}\mathbf {Z}_{3} + \delta _{X_{2}}^{(t)}U, {\sigma _{X_{2B}}^{2}}^{(t)}\right)\), respectively, where Z_{1}, Z_{2} and Z_{3} are the observed IV values in Study B.

3.
Create a single complete dataset including both the observed and the imputed data.

4.
Estimate model parameters using MCMC based on the complete dataset and set t←t+1.

5.
Repeat Steps 2–4 until t=T.
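The five steps above can be sketched schematically as follows. This is an illustrative skeleton, not the paper's Stan implementation: `update_parameters` stands in for one MCMC sweep over the complete data, and only the imputation of \(X_{1}^{*}\) is shown (\(X_{2}^{*}\) is analogous); all names are ours.

```python
import numpy as np

def impute_exposures(Z1, Z3, U, params, rng):
    """Step 2: draw X1* from its conditional Normal given current parameters (Eq. 7)."""
    mu = (params["V_X1"] + Z1 @ params["a1"] + Z3 @ params["a31"]
          + params["dX1"] * U)
    return rng.normal(mu, params["sigma_X1B"])

def run_chain(data_A, data_B, init_params, update_parameters, T, rng):
    """Steps 1-5 of the imputation scheme; `update_parameters` is left abstract."""
    params = dict(init_params)                        # Step 1: initial values
    for t in range(T):                                # Step 5: repeat until t = T
        # Step 2: impute the Study-B exposures; here U denotes the current
        # draw of the latent confounder for Study B (treated as given)
        X1_star = impute_exposures(data_B["Z1"], data_B["Z3"],
                                   data_B["U"], params, rng)
        complete = {**data_A, "X1_B": X1_star}        # Step 3: observed + imputed
        params = update_parameters(complete, params)  # Step 4: update, t <- t+1
    return params
```

In the actual analysis the imputation and parameter updates happen jointly within one MCMC run, as described in the Simulations section.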
Now we specify the priors in the Bayesian model (2)–(10). Previous GWAS results show that individual SNPs explain a tiny proportion of exposure variance [14–16], corresponding to small magnitudes of the α parameters in our model. In accord with this finding and previous MR simulation studies [17], we set the IV-strength parameters αs to be independent and identically distributed with mean zero and a small variance: α_{1}∼N_{L}(0,0.3^{2}I), α_{2}∼N_{K}(0,0.3^{2}I), α_{31}∼N_{M}(0,0.3^{2}I), and α_{32}∼N_{M}(0,0.3^{2}I). The priors of β_{1} and β_{2} are both set to N(0,10^{2}). Finally, we assign the standard deviations σs a common inverse-gamma prior Inv-Gamma(3,2), and the random effects Vs a N(0,1) prior in Models (7)–(10) for Study B.
Bayesian MR for large studies
The advantages of an MCMC-powered Bayesian approach to MR are counterpoised by a relatively higher computational burden and a possibly large memory requirement. A natural way of dealing with this problem is to divide the data \(\mathcal {D}\) into a number (J, say) of subsets D_{1},D_{2},...,D_{J}, which we assume for simplicity to contain an equal number (q, say) of individuals. By running separate Bayesian MR analyses in parallel on the subsets, we obtain J subset-specific posteriors, which can then be aggregated in various ways. In this study, we adopt the “divide-and-combine” approach proposed by Xue and Liang [13].
Let θ denote the entire set of unknown quantities in the model. For subset D_{j}, where j=1,2,...,J, let π(θ∣D_{j}) denote the joint posterior distribution of θ and \(\widehat {\boldsymbol {\mu }}_{j} = \widehat {E}(\boldsymbol {\theta }\mid D_{j})\) the corresponding estimated mean vector. Let \(\widehat {\boldsymbol {\mu }} = \frac {1}{J}\sum _{j=1}^{J}\widehat {\boldsymbol {\mu }}_{j}\) be the average of the \(\widehat {\boldsymbol {\mu }}_{j}\)s. According to [13], the posterior based on the full data, \(\pi (\boldsymbol {\theta }\mid \mathcal {D})\), can be estimated as the average of the recentred subset posteriors:
$$ \widetilde{\pi}(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{1}{J}\sum\limits_{j=1}^{J}\widetilde{\pi}\left(\boldsymbol{\theta} - \widehat{\boldsymbol{\mu}} + \widehat{\boldsymbol{\mu}}_{j} \mid D_{j}\right). $$
(11)
It has been proved in [13] that
$$ E_{\widetilde{\pi}}(\boldsymbol{\theta}) - E_{\pi}(\boldsymbol{\theta}) = O_{p}\left(q^{-1}\right), $$
(12)
and
$$ Var_{\widetilde{\pi}}(\boldsymbol{\theta}) - Var_{\pi}(\boldsymbol{\theta}) = o_{p}\left(n^{-1}\right), $$
(13)
where q is the sample size of the subsets and n the sample size of the full dataset. \(E_{\widetilde {\pi }}(\boldsymbol {\theta })\) and E_{π}(θ) are the expectations of the posteriors of θ aggregated from the subsets and obtained from the full data, respectively, and \(Var_{\widetilde {\pi }}(\boldsymbol {\theta })\) and Var_{π}(θ) are their variances. It is easily seen that the difference in expectation depends on the sample size of the subsets, while the difference in variance depends on the sample size of the full dataset.
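In terms of posterior draws, the recentring in Eq. (11) amounts to shifting each subset's samples so that their mean coincides with the grand mean, then pooling. A minimal sketch (our own function name, assuming draws are stored as arrays of shape draws × parameters):

```python
import numpy as np

def combine_subset_posteriors(subset_draws):
    """Combine subset posteriors by recentring (Eq. 11).

    subset_draws: list of J arrays, each of shape (n_draws, dim),
    holding posterior samples of theta from one subset D_j.
    Returns pooled draws approximating the full-data posterior.
    """
    mu_j = [d.mean(axis=0) for d in subset_draws]    # subset posterior means
    mu_bar = np.mean(mu_j, axis=0)                   # grand mean over subsets
    # shift draws from pi(. | D_j) by (mu_bar - mu_j), i.e. evaluate
    # pi(theta - mu_bar + mu_j | D_j), then pool the J recentred sets
    recentred = [d - m + mu_bar for d, m in zip(subset_draws, mu_j)]
    return np.vstack(recentred)
```

By construction, the pooled draws have mean \(\widehat{\boldsymbol{\mu}}\), while the within-subset spread is preserved.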
Simulations: Bayesian MR with study heterogeneity
We used simulated data to evaluate our Bayesian MR model with study heterogeneity in comparison with a conventional MR method. In particular, we considered 12 configurations (3 × 2 × 2), combining

3 missing rates of the exposures: 20%, 50%, 80%

2 degrees of the IV strength (α_{1},α_{2},α_{31},α_{32}): 0.1 and 0.3

Zero and nonzero causal effects of the exposures on the outcomes (β_{1},β_{2}): 0 and 0.3.
The numbers of IVs were set to 15, 15 and 5 for Z_{1}, Z_{2} and Z_{3}, respectively. The data for each IV were drawn independently from a binomial distribution B(2,0.3). The effects of U on the exposures (\(\delta _{X_{1}}, \delta _{X_{2}}\)) and on the outcomes (\(\delta _{Y_{1}}, \delta _{Y_{2}}\)) were set to 1, and the standard deviations σs were set to 0.1. We simulated 200 datasets for each configuration.
For each dataset, we

simulated a dataset of sample size n_{A} which contains observations of the IVs, exposures and outcomes (dataset A, denoted by \(\mathcal {D}_{A}\));

simulated a dataset of sample size n_{B} containing observations of the IVs, exposures and outcomes, then retained only the IV and outcome data, as if the exposure data were missing (dataset B, denoted by \(\mathcal {D}_{B}\)).
The sample size of \(\mathcal {D}\), the combined data of \(\mathcal {D}_{A}\) and \(\mathcal {D}_{B}\), was set to 400 in all configurations, i.e., n=n_{A}+n_{B}=400. The missing rate of the exposures was defined as \(\frac {n_{B}}{n}\times 100\%\). For example, at a 50% missing rate, we simulated \(\mathcal {D}_{A}\) of sample size 200 and \(\mathcal {D}_{B}\) of sample size 200. To allow for different degrees of study heterogeneity across datasets, the random effects Vs in Study B were drawn independently from a uniform distribution U(−0.5,0.5). Imputation of missing data and estimation of model parameters were then performed simultaneously using MCMC in Stan [18, 19]. \(\hat {R}\) was used to check convergence of the Markov chains [20].
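The construction of a dataset at a given missing rate can be sketched as follows; the exposure values here are placeholders, and the point is only the masking and the heterogeneity draw (names are ours).

```python
import numpy as np

rng = np.random.default_rng(1)
n, missing_rate = 400, 0.5
n_B = int(n * missing_rate)           # Study-B rows: exposures missing
n_A = n - n_B                         # Study-A rows: fully observed

# Study-B random effects, drawn independently from U(-0.5, 0.5)
V = {name: rng.uniform(-0.5, 0.5) for name in ("X1", "X2", "Y1", "Y2")}

# Mask the exposures on the Study-B rows to mimic completely missing
# exposure data (the values themselves are placeholders for illustration)
X1_full = rng.normal(size=n)
X1_obs = X1_full.copy()
X1_obs[n_A:] = np.nan                 # rows belonging to D_B
```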
Estimated causal effects obtained from our Bayesian MR and from two-sample inverse-variance weighted (IVW) estimation [6] were compared using four metrics: mean, standard deviation (sd), coverage (the proportion of times that the 95% credible/confidence intervals contained the true value of the causal effect) and power (the proportion of times that the 95% credible/confidence intervals did not contain zero when the true causal effect was non-zero; by definition, only applicable when β_{1}=β_{2}=0.3). Higher power indicates a lower chance of false-negative results. In the IVW estimation, we used the observed IV and exposure data from \(\mathcal {D}_{A}\) and the observed IV and outcome data from \(\mathcal {D}_{B}\).
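Coverage and power as defined above are simple functions of the interval endpoints across simulated datasets; a minimal sketch (our own function name):

```python
import numpy as np

def coverage_and_power(lower, upper, true_beta):
    """Coverage: share of 95% intervals containing true_beta.
    Power: share of intervals excluding zero (meaningful when true_beta != 0).
    lower, upper: interval endpoints, one pair per simulated dataset."""
    lower, upper = np.asarray(lower), np.asarray(upper)
    coverage = np.mean((lower <= true_beta) & (true_beta <= upper))
    power = np.mean((lower > 0) | (upper < 0))
    return coverage, power
```

For example, with intervals [0.1, 0.5], [-0.1, 0.4] and [0.2, 0.25] and true effect 0.3, two of three intervals cover the truth and two of three exclude zero, so both metrics equal 2/3.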
Simulations: Bayesian MR with study heterogeneity for large studies
We also assessed the performance of dividing a big dataset into subsets in our Bayesian MR with study heterogeneity in simulation experiments. The simulation scheme was the same as above, except that the sample size of \(\mathcal {D}\) was set to a much larger value, 50,000. For each configuration, a single dataset was simulated by combining \(\mathcal {D}_{A}\) and \(\mathcal {D}_{B}\). We randomly divided the data into 5 subsets of equal sample size, separately for \(\mathcal {D}_{A}\) (\(\mathcal {D}_{A_{1}},..., \mathcal {D}_{A_{5}}\)) and for \(\mathcal {D}_{B}\) (\(\mathcal {D}_{B_{1}},..., \mathcal {D}_{B_{5}}\)). Subset \(\mathcal {D}_{i}\) was then constructed by combining \(\mathcal {D}_{A_{i}}\) and \(\mathcal {D}_{B_{i}}\), where i=1,...,5, ensuring that each subset \(\mathcal {D}_{i}\) had the same missing rate as the full data \(\mathcal {D}\). Causal effects were estimated in Bayesian MR using \(\mathcal {D}\) and using the 5 subsets. To explore the impact of different data-partitioning strategies on the estimated causal effects, we repeated the analysis dividing the data into 50 subsets of sample size 1,000.
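The stratified split (partitioning \(\mathcal {D}_{A}\) and \(\mathcal {D}_{B}\) separately so that each subset keeps the full-data missing rate) can be sketched as follows; the function name and index-based representation are ours.

```python
import numpy as np

def split_preserving_missing_rate(idx_A, idx_B, J, rng):
    """Split the Study-A and Study-B row indices separately into J equal
    parts, then pair them, so each subset D_i keeps the overall missing rate."""
    perm_A = rng.permutation(idx_A)               # shuffle before splitting
    perm_B = rng.permutation(idx_B)
    parts_A = np.array_split(perm_A, J)           # D_A1, ..., D_AJ
    parts_B = np.array_split(perm_B, J)           # D_B1, ..., D_BJ
    # D_i = D_Ai combined with D_Bi
    return [np.concatenate([a, b]) for a, b in zip(parts_A, parts_B)]
```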