 Research Article
 Open access
 Published:
Synthesis of clinical prediction models under different sets of covariates with one individual patient data
BMC Medical Research Methodology volume 15, Article number: 101 (2015)
Abstract
Background
Recently, increased development of clinical prediction models has been reported in the medical literature. However, evidence synthesis methodologies for these prediction models have not been sufficiently studied, especially for practical situations such as a metaanalyses where only aggregated summaries of important predictors are available. Also, in general, the covariate sets involved in the prediction models are not common across studies. As in ordinary model misspecification problems, dropping relevant covariates would raise potentially serious biases to the prediction models, and consequently to the synthesized results.
Methods
We developed synthesizing methods for logistic clinical prediction models with possibly different sets of covariates. In order to aggregate the regression coefficient estimates from different prediction models, we adopted a generalized least squares approach with nonlinear terms (a sort of generalization of multivariate metaanalysis). Firstly, we evaluated omitted variable biases in this approach. Then, under an assumption of homogeneity of studies, we developed biascorrected estimating procedures for regression coefficients of the synthesized prediction models.
Results
Numerical evaluations with simulations showed that our approach resulted in smaller biases and more precise estimates compared with conventional methods, which use only studies with common covariates or which utilize a mean imputation method for omitted coefficients. These methods were also applied to a series of Japanese epidemiologic studies on the incidence of a stroke.
Conclusions
Our proposed methods adequately correct the biases due to different sets of covariates between studies, and would provide precise estimates compared with the conventional approach. If the assumption of homogeneity within studies is plausible, this methodology would be useful for incorporating prior published information into the construction of new prediction models.
Background
Development of accurate clinical prediction models is a relevant issue in medical research, and the number of published prediction models has increased substantially over the last few decades. For example, the literature already contains 102 proposed risk prediction models for cardiovascular disease [1] and 25 for the risk of developing type 2 diabetes (including 11 logistic regression models) [2]. However, many authors point out that these developed prediction models are not necessarily accurate and cannot be generalized well to larger, broad populations [3–5]. One of the most relevant reasons is imprecise estimates due to substantially insufficient samples compared with the number of involved predictors in the development of prediction models [3]. Therefore, the research synthesis methodologies of clinical prediction models have received interest in terms of achieving more accurate estimations by using larger datasets. Debray et al. [6] addressed the issue of synthesizing results by proposing a multivariate metaanalysis approach to combining regression coefficients of published logistic prediction models. The concept of using this approach for the synthesis of regression coefficients, which can be regarded as a variant of the generalized least squares method by Becker and Wu [7], explicitly takes into account the distinction of within and betweenstudy covariance of coefficients, and it should provide a valid solution when complete and unbiased estimates of the regression coefficients are available. However, in general, each published prediction model is developed using different sets of covariates [7–10], and dropping relevant covariates would raise potentially serious biases to the prediction models, as in ordinary model misspecification problems. To tackle this problem, the Fibrinogen Studies Collaboration [11] proposed a multivariate metaanalysis approach to borrow strength from partially adjusted results by using individual patient data (IPD), and Riley et al. [12] demonstrated the approach in practice. In addition, RescheRigon et al. [13] adopted a multiple imputation method with IPD. On the other hand, instead of using every IPD record, Debray et al. [6] considered a method that uses the reported summary statistics with one set of IPD. They adopted an adhoc approach utilizing mean or zero imputations for the missing coefficient estimates to straightforwardly apply the multivariate metaanalysis method [6]. Although Debray’s approach is a simple implementation strategy, it should raise substantial biases to the synthesized results because the interpretation of the coefficients depends on which covariates are included in each regression model.
In this article, we develop valid inference methods for synthesizing regression coefficients of published prediction models under different sets of covariates. We provide bias assessment methods for regression coefficients when important predictors are dropped in some studies, and thereby supply biascorrected estimators for the synthesized prediction models under the assumption of the homogeneity of studies in metaanalysis. We show that our method is asymptotically more efficient than the conventional approach applying multivariate metaanalysis, by using studies with common covariate sets and the previously proposed approach of Debray et al. [6]. Further, we demonstrate the robustness property against the misspecification of withinstudy covariance matrices. While we discuss here the synthesis of logistic prediction models, our approach could be extended to more general cases such as survival prediction models [14].
The rest of the paper is organized as follows. In Section ‘Methods’, we consider as the first step the problem of omitted variable bias in the logistic regression model. Then we propose a nonlinear model with the terms of omitted variable bias to synthesize the published coefficients, where the generalized nonlinear least squares (GNLS) method is applied for estimation. We also show that our method has desirable properties (i.e., efficiency and robustness). In Section ‘Results: simulation studies’, the performance of our method is numerically checked by simulation studies. Our method is illustrated through the use of a practical dataset in Section ‘Results: application in risk prediction models for occurrences of stroke’. This dataset was obtained from epidemiological studies on the incidence of stroke; these studies were conducted in Japan and contain several covariates. Each study separately analyzed with logistic regression models on each cohort, but the covariates in the models are unbalanced across cohorts. Finally, Section ‘Discussion’ provides some discussion and an examination of future problems.
Methods
We consider a similar situation as Debray et al. [6] in that we can use reported summary statistics from previous logistic regression models with different sets of covariates and at least one IPD from the publications or the authors themselves. Suppose that each published prediction model has a subset of covariates in the IPD, and is constructed for same prediction task. The number of published prediction models is N (\(i=1,\dots,N\)) and the ith article reports the estimated coefficients, \(\hat {\boldsymbol {\theta }}_{i}\), and the covariance matrix \(\boldsymbol {\Sigma }_{i}=\text {Cov}(\hat {\boldsymbol {\theta }}_{i})\) (at least its diagonal elements). Each \(\hat {\boldsymbol {\theta }}_{i}\) is a column vector of possibly different length, \(\hat {\boldsymbol {\theta }}_{i}=(\hat {\theta }_{i1},\dots,\hat {\theta }_{i p_{i}})^{T}\), and p_{ i } represents the differences of covariate sets among studies. To synthesize these regression coefficients, Debray et al. [6] utilize the mean or zero imputation methods for omitted coefficients and apply the technique of multivariate metaanalysis. Another simple approach is to apply multivariate metaanalysis using studies with common covariate sets [6]. However, the former approach leads to biased results and the latter is not biased but leads to loss of efficiency by ignoring indirect information from omitted studies. In order to improve this situation, we propose a new method for synthesis of logistic regression coefficients under different sets of covariates.
For simplicity, we assume the case where the true model has the full set of covariates in the IPD, which means the prior models have subsets of covariates and are considered as underspecified models. Note that since the omitted variables from the true models (full models) may be correlated with the included variables, the subset models are confounded and biased compared with true models. Our method can be generalized to more complex cases where the previous models for metaanalysis are a mixture of under and overspecified models. Our method can be also applied to more general models such as generalized linear models, including linear regression and other nonlinear regression models with some modification. This study was approved by the Institutional Review Board at the National Cancer Center in Tokyo and The University of Tokyo, Japan.
The omitted variable bias in the logistic regression model
Firstly, we introduce the omitted variable bias under one original logistic regression setting, which can afterward be extended to the metaanalysis setting with the assumption that the covariate sets differ among studies.
Let \(\boldsymbol {X}=\left (1,X_{2}, \dots, X_{j}\right)^{T}\) and \(\boldsymbol {Z}=(Z_{1}, \dots, Z_{k})^{T}\) be vectors of covariates and Y∈{0,1} be a binary response variable. Suppose the datagenerating process (DGP) can be formulated by the true model:
where α,β are the true parameters of interest and “logit” means the logistic function, \(\text {logit}(p)=\text {log} \left (p/(1p)\right)\). The misspecified model is assumed to be fitted, which omits relevant covariates Z from the true model (1). Specifically,
We investigate the degree to which the regression coefficient γ, estimated under the misspecified model, differs from the true parameters α,β, and define the differences as the omitted variable bias.
To derive the omitted variable bias, the unbiasedness condition of the estimating function can be employed [15]; this approach is a generalized result of the landmark paper of White [16]. In the metaanalysis framework, the idea of the omitted variable bias can function as an analogy of different covariate sets and as a representation of the incorporation of indirect information from prior models.
In general, score functions from misspecified models cannot satisfy the unbiasedness condition of estimating functions. Therefore, the first step is to find the solution of the unbiasedness condition of estimating function (3), i.e., find γ^{∗}=f(α,β,p_{ XZ }), which is the function of the true parameters α,β and the joint distribution of covariates, p_{ XZ };
Here, the expectation is taken by the true joint distribution of Y, X and Z defined from (1) and p_{ XZ }. Under some regularity conditions, the maximum likelihood estimate of γ from the misspecified model (2) is a consistent estimate of γ^{∗}.
Secondly, for assessing biases caused by dropping the important predictors, we assume to have (at least) one IPD with the outcome and the full covariates X, Z. This assumption is considered reasonable for researchers who want to develop a new prediction model on their own IPD, incorporating prior summary statistics from regression results. Using the IPD, we can empirically solve (3) and derive the omitted variable bias.
Note that in the general case, the function f cannot be written in closed form due to its nonlinearity, but in the following case where every omitted covariate is a continuous variable it can be explicitly written.
Special case : omitted covariates, Z, are continuous variables
In general, the maximum likelihood estimate of γ in (2) (consistently) estimates γ^{∗} as the solution of (3). In particular, for the cases of normal continuous variables, the following analytical evaluation can be adapted. Now we suppose ZX follows the multivariate normal distribution, N(μ_{ ZX },Ω_{ ZX }). Based on the normality assumption of ZX, we have
where \(\boldsymbol {\Delta }=(\boldsymbol {\delta }_{1}, \dots, \boldsymbol {\delta }_{k})^{T}\) is k×j matrix and τ∼N_{ τ }(0,Ω_{ ZX }).
Applying the technique of Chao et al. [17] to our covariate structure and using the probit approximation of logistic distribution, the expectation of Y conditional on X can be expressed as follow:
where Φ is the cumulative distribution function of standard normal distribution and c=16(3)^{1/2}/15π is the adjustment factor for probit approximation of the logistic distribution proposed by Johnson et al. [18].
In order to satisfy the unbiasedness condition of the estimating function, (3), we have
Therefore, the function f should be denoted as
which is the generalization of the results of Chao et al. [17] and Cramer et al. [19].
Nonlinear model for metaanalysis
Suppose there exist N reported models (\(i=1,\dots,N\)) with their estimated coefficients of α,β and γ and their covariance matrices, and when \(i=1,\dots,N\), studies fit the true model (1) with a full set of covariates, X and Z, and when \(i=n+1,\dots, N\), studies mistakenly omit covariates Z. We assume the homogeneity of studies (i.e., the distribution of covariates and outcomes are common across the studies in the metaanalysis). Here we show only the case where Z is omitted, but the case where X is omitted can be considered in the same manner, and further, it is easy to generalize to various other omittance patterns. To synthesize the estimated coefficients vectors from the logistic regression models, we apply a GNLS method to incorporate the unequal variances of studies into metaanalysis.
Based on this setting, the nonlinear model for metaanalysis can be formulated as follows;
where
and \(\hat {\boldsymbol {\theta }}_{i}\) is the column vector of reported coefficients in the ith study. The function f() comes from the omitted variable bias formula introduced in the previous section, whose formulation is reasonable if an assumption of homogeneity of studies in metaanalysis is acceptable.
In a large sample, the estimated coefficients \(\hat {\boldsymbol {\theta }}_{i}\) are (approximately) normally distributed with mean θ_{ i }=g_{ i }(α,β,p_{ XZ }) and covariance \(\text {Cov}(\hat {\boldsymbol {\theta }}_{i})\). This asymptotic normality of estimated coefficients leads to the justification of the GNLS approach.
Under the model (5), overall estimates of the regression coefficients \(\hat {\boldsymbol {\alpha }}^{*}\) and \(\hat {\boldsymbol {\beta }}^{*}\) can be obtained by GNLS as follows:
where \(\hat {p}_{\boldsymbol {XZ}}\) is an estimate of p_{ XZ } from the IPD.
The diagonal of the covariance matrix Σ is typically reported in the literature but the offdiagonals are unknown, thus offdiagonal elements can be imputed by using the IPD. We employ the same imputation method as Debray et al. [6] based on the IPD as follows;
where \(\text {Cov} (\hat {\boldsymbol {\theta }}_{i,W})\) is a working covariance matrix of the ith study which is applied to one of the block diagonal elements of Σ, \(\hat {V}_{i}=\text {diag}(\text {Cov} (\hat {\boldsymbol {\theta }}_{i}))\) is a diagonal matrix whose diagonal elements are the estimated standard errors (SE) reported from each study and R_{ IPD } is a working correlation matrix of coefficients calculated from the IPD. The covariance matrix can be calculated with a sandwich estimator under the model misspecification assumption instead of the imputation based on the IPD [15], but there computational complexity remains a problem and little improvement is gained in simulations studies. Furthermore, even if the covariance matrix is misspecified, the proposed estimator is still consistent and asymptotically normally distributed with a sandwich covariance matrix. This robustness follows the asymptotic theory of the generalized estimating equations. In this situation, let \(\hat {\boldsymbol {\alpha }}_{W}\) and \(\hat {\boldsymbol {\beta }}_{W}\) denote our estimators with the working covariance matrix. The covariance matrix of these estimators can be estimated by
where \(\hat {\boldsymbol {D}}=\left (\hat {\boldsymbol {D}}_{1}^{T},\dots,\hat {\boldsymbol {D}}_{N}^{T}\right)^{T}, \hat {\boldsymbol {D}}_{i}=\partial \boldsymbol {g}_{i}\left (\boldsymbol {\alpha }, \boldsymbol {\beta }, \hat {p}_{\boldsymbol {XZ}}\right)/\partial \left (\boldsymbol {\alpha }^{T}, \boldsymbol {\beta }^{T}\right)_{\left (\boldsymbol {\alpha }^{T}, \boldsymbol {\beta }^{T}\right)=\left (\hat {\boldsymbol {\alpha }}^{T}_{W}, \hat {\boldsymbol {\beta }}^{T}_{W}\right)}\), Σ_{ W } is a working covariance matrix, and \(\text {Cov}(\hat {\boldsymbol {\theta }}_{I})=\left (\left \{\hat {\boldsymbol {\theta }}_{i}  \boldsymbol {g}_{i}\left (\hat {\boldsymbol {\alpha }}_{W}, \hat {\boldsymbol {\beta }}_{W}, \hat {p}_{\boldsymbol {XZ}}\right)\right \}\left \{\hat {\boldsymbol {\theta }}_{i}  \boldsymbol {g}_{i}\left (\hat {\boldsymbol {\alpha }}_{W}, \hat {\boldsymbol {\beta }}_{W}, \hat {p}_{\boldsymbol {XZ}}\right)^{T}\right \}\right)\) [20, 21]. This idea essentially comes from Liu et al. [21] and can be regarded as analogy of the result proposed by Chen et al. [22].
In addition to the above, if the working covariance matrix is a good approximation of the true covariance structure, the following relationship holds:
where Avar denotes an asymptotic covariance matrix, and \(\hat {\boldsymbol {\theta }}^{*}\), \(\hat {\boldsymbol {\theta }}_{M}\) and \(\hat {\boldsymbol {\theta }}_{S}\) are the estimates of \(\boldsymbol {\theta }=\left (\boldsymbol {\alpha }^{T},\boldsymbol {\beta }^{T}\right)^{T}\) obtained from our proposed method, from multivariate metaanalysis using only the studies with full covariates and from a single study with full covariates, respectively, when the number of studies N goes to infinity.
Here, we assume a fixed effect model which presumes that there is no heterogeneity in the distribution of covariates and in the values of the parameters of interest. This assumption may sometimes be unrealistic. Therefore, we recommend considering whether this assumption is reasonable based on background knowledge or reported information. In addition, we can propose how to modify this to a random effects model to incorporate the heterogeneity by assuming that the parameters underlying studies and the parameters of distribution of covariates follow some distribution. For example, considering the case that all omitted variables are continuous (i.e., Section ‘The omitted variable bias in the logistic regression model)’, we can incorporate random effects by assuming that α,β, Δ and Ω_{ ZX } in (4) follow distributions. Random effects in α,β accommodate the heterogeneity of parameters and random effects in Δ and Ω_{ ZX } accommodates that of distribution of covariates.
Results: simulation studies
Simulation setup
In this section we describe a Monte Carlo simulation which was performed to evaluate the performance of our proposed method. In the simulation, we empirically calculate the omitted variable bias by using Eq. (3) instead of Eq. (4). The parameters which varied in the simulation scenario were the true value of a parameter in a DGP model, the number of predictors and the distribution of covariates (continuous/discrete covariates). For simplicity, we examined the case where the number of predictors in the models in this simulation was 1 or 2 (i.e. X_{1}, X_{2} or both). The DGP model was logitP(Y=1X_{1},X_{2})=α_{0}+α_{1}X_{1}+α_{2}X_{2}. For checking the sensitivity for the true value of the parameter in the DGP model, α_{1} varied from 2 to 1, and true values of other parameters was set at 1 (i.e. α_{0}=α_{2}=1).
We simulated \(N=9, \ (i=1,\dots,9)\) independent studies with 100 samples in each, and of these studies, 3 studies (i=1,2,3) included a full set of covariates (X_{1} and X_{2}), 3 (i=4,5,6) were supposed to omit X_{1} and 3 (i=7,8,9) were supposed to omit X_{2}. One of the studies with the full set of covariates was used as the IPD. As mentioned above, the offdiagonals of the covariance matrix were often unknown, thus we adopted the imputation by IPD proposed in the Methods section. In this simulation, we compared the performance of this imputation with the setting using a true covariance structure, which could be estimated from simulation settings.
We classified the scenario into 2 cases according to the distribution of covariates (continuous/discrete distribution). In Case 1, X_{1} and X_{2} were both continuous and followed the multivariate normal distribution, \(X_{1},X_{2} \sim N\left (\left (\begin {array}{l} 2 \\ 2\end {array}\right), \left (\begin {array}{ll} 1 & r \\ r &1 \end {array}\right)\right)\). The correlation, r, between X_{1} and X_{2} was set at 0 or 0.5. In this case, we checked the performance of the approximation formula (4). Case 2 was the more practical case in which continuous and discrete distributions were mixed (i.e., X_{1} was continuous and X_{2} was binary). X_{2} was binarized from the distribution in Case 1 by a threshold value set at 2. Under these settings, 1000 Monte Carlo simulations were implemented. If the models could not be fitted and converged, their results were excluded from the calculation of bias and mean squared error (MSE).
Performance of the proposed method was evaluated by bias and MSE, comparing it with two ordinary methods. M1 was the multivariate metaanalysis using only 3 studies with a full set of covariates. From a theoretical perspective, the M1 strategy does not include any bias but is inferior in efficiency compared with our proposed method, which can be checked using the results of MSE. M2 was the multivariate metaanalysis after mean imputation of missing coefficients, whose method was proposed in Debray et al. [6]. For example, coefficients and their estimated standard errors of X_{2} from 3 studies (i=4,5,6) were imputed by the means of the other 6 studies. We tried the zero imputation method, which Debray et al. adopted [6] and called uninformative regression coefficients, but it did not show notable results compared with the results from M2 (mean imputation). Therefore, we decided not to include the results of this method.
Simulation results
The results of the simulation revealed that compared with the ordinary metaanalysis, our proposed estimator generally produced more precise and lessbiased estimates for all simulation settings (Table 1). The rationale of small proportion of convergence at α_{1}=1 was that under the simulation setting, P(Y=1) could be over 0.9, which led to small number of event at a single study. The bias of our estimator ranged from 0.052 to 0.097 (mean: 0.021) for Case 1 and from 0.064 to 0.488 (mean: 0.040) for Case 2. The MSE of our estimator ranged from 0.021 to 0.803 (mean: 0.124) for Case 1 and from 0.012 to 0.486 (mean: 0.091) for Case 2. Although the M2 strategy in Case 1 and r=0 yielded somewhat biased results, the greatest amount of variation seemed to arise from the biased estimates of α_{0} in the models from which X_{2} was omitted.
The relative efficiency (RE) of the estimates of M1 versus those of our proposed method ranged from 1.023 to 9.913 (mean: 2.323) for Case 1 and from 1.098 to 10.047 (mean: 2.495) for Case 2. The RE of the estimates of M2 ranged from 1.025 to 82.069 (mean: 20.043) for Case 1 and from 0.600 to 93.405 (mean: 123.760) for Case 2.
In terms of the RE of the estimates from the true covariance structure versus the imputation method for unknown elements in the covariance structure, the RE of the covariance structure imputed from the IPD versus the true covariance structure ranged from 0.900 to 1.448 (mean: 1.126) for Case 1 and from 0.895 to 1.193 (mean: 1.065) for Case 2.
Comparing the MSE by correlation value (r=0 versus 0.5), in Case 1 the mean MSE of our proposed method was r=0: 0.074 versus r=0.5: 0.005. In Case 2, the mean MSE of our proposed method was r=0: 0.113 versus r=0.5: 0.174.
Results: application in risk prediction models for occurrences of stroke
We applied the proposed methods to a series of epidemiologic studies that developed risk prediction models for occurrences of stroke. Stroke is one of the leading causes of death or physical/cognitive impairment in both developed and developing countries, and therefore numerous prediction models have been developed and many clinical characteristics have been identified as potential predictors [23, 24]. However, the overall influence of various risk factors is still unclear, with conflicting results from several reports [24].
Application setup
We obtained 10 sets of IPD from studies conducted by the Japan Public Health Centerbased Prospective Study (JPHC study). The JPHC study covers 11 public health center areas (Areas 1 – 11) across Japan. The total number of participants was 140,420, and the study population consisted of residents who were 40 to 69 years old at the time of the baseline survey. Details of the study design are welldocumented in a previous report [25]. The outcome was confirmed according to the criteria provided by the National Survey of Stroke, which required a constellation of neurological deficits of sudden or rapid onset lasting at least 24 hours or until death [26, 27].
We fitted a logistic regression model to each available set of IPD and explored the important covariates related to patient characteristics and metabolic syndrome [28–30] such as age (years), time since last meal (minutes), body mass index (BMI (kg/ m^{2})), total cholesterol level (mg/dl), blood pressure (mmHg), cigarettes (per day), diabetes (yes/no), blood glucose (mg/dl), highdensity lipoprotein (HDL (mg/dl)), and serum triglycerides (mg/dl) (Table 2). The sets of available covariates differed by region. For example, IPD from the Area 1 cohort did not include data on blood glucose, HDL or serum triglycerides since subjects in that cohort did not undergo any blood tests. One of our motivations in this study was to overcome this discrepancy among cohorts, which is typical in largescale cohort studies investigating several outcomes.
Coefficients from each model were stored as aggregated statistics, which could be regarded as prior studies for metaanalysis. In terms of handling sporadically missing data (average missing rate was 2.8 % with a standard deviation of 2.5 %), complete case analysis was executed. One cohort (Area 9) remained as IPD and one cohort (Area 11) was used as test data for prediction. Next, we compared our methodology with conventional multivariate metaanalysis using only studies with a full set of covariates and with results based solely on IPD data.
Lastly, new prediction models were constructed by plugging the synthesized coefficients into the models and checking the performance of each model using the test data.
The discriminant performance of the prediction models was measured by the area under the receiver operator characteristic curve (AUC) and the Brier score (BS) (multiplied by 100), both of which are indicators of the accuracy of the prediction model. A higher AUC indicates higher prediction accuracy, while the BS has an inverse relationship [31]. In addition, the model’s calibration was examined by the HosmerLemeshow chisquared statistic [32].
Application results
The results demonstrated that our approach provided considerably narrower confidence intervals and slightly better prediction performance compared with conventional multivariate metaanalysis (Table 2). Our estimator reduced the SE by 38–53 and 56–71 % compared with the SE from conventional metaanalysis and from the IPD, respectively.
In terms of prediction performance, the prediction model constructed from the synthesized coefficients showed slight improvements over the conventional approach, particularly in BS. The AUC and BS were respectively increased by 1.1 and −1.0 % on average compared with conventional metaanalysis, and decreased by −0.4 and 1.0 % on average compared with the IPD. The improvements in prediction performance were relatively small because the cohort of test data was remarkably similar to other cohorts across Japan that were aggregated into summary statistics as previously published studies.
Discussion
Along with increasing attention to prediction models, there has been higher demand for approaches to the metaanalysis of regression coefficients. However these methodologies are not well developed due to the many difficulties caused by the different settings used by various studies, and further research is still needed, particularly compared with conventional metaanalysis methods such as synthesizing mean differences, correlation and so on [7]. This study demonstrated a method to conduct the metaanalysis of regression coefficients with different covariate sets under the assumption of homogeneity of studies (i.e., it is applicable in cases where studies in the metaanalysis have similar distributions of covariates and outcomes). Although this study temporarily assumed the models with a full set of covariates as a true model, our approach can be generalized to any formulation of previous models even if they are over/underspecified compared to a constructing model. We notice, however, that we need careful arguments about what is an appropriate covariate set. Further, the assumption that (at least) one IPD is available can be considered reasonable in the frequent case in which a single researcher wants to construct a new prediction model on his or her own IPD, incorporating prior regression results (but with such prior results reported just in the form of summary statistics). The minimal use of IPD (use of one IPD and other summary statistics) distinguishes our approach from that of the Fibrinogen Studies Collaboration [11]. They assume that both full and partial models are applied in each cohort by using its cohort IPD, and thus the estimation of the correlation of coefficients between full and partial models is applicable. As the future work, we need to study the relationship between our method and their one, and examine whether their method can be applicable to the situation we are considering. Regarding these discussions, our study can provide the following guidelines for practitioners about how to analyze prior models with their own IPD by recognizing the issue of omitted variable bias as the differences of sets of predictors between their constructing models and prior models: 1) the first step is to construct a new and temporal model on their own data set, and 2) the second step is to apply our method to synthesize the previous regression coefficients with their temporal model and then update the model and obtain more accurate estimators.
Our method proved robust against the misspecification of the covariance structure. Because of this property we can arbitrarily set the covariance matrix of coefficients and thus it is possible to avoid the argument, often discussed with methods such as that of Becker and Wu [7], on whether the full covariance matrix of coefficients should be reported or not. This robustness property can be considered as an analogical result provided by Liu et al. [21]. They provide a framework of metaanalysis under heterogeneity by using a confidence density function and reparametrization of the problem setting. Their approach utilizes the reparameterization connecting each studyspecific parameter to the common parameter using the transformation function M_{ i }, which is used as the omitted variable bias formula in our setting. However, they assume that the omitted covariates are fixed values and thus they can estimate M_{ i } without a consideration of the distribution of covariates. In contrast, our approach provides more general guidelines for treating missing covariates in the metaanalysis.
The simulation performed in this study illustrated that our method is unbiased and has greater efficiency than a conventional metaanalysis approach as well as the technique proposed by Debray et al. [6]. Although our estimator was most efficient if the covariance structure was truly specified, it maintained its efficiency even if we misspecified the covariance structure, with a loss of efficiency by misspecification of only around 10 %.
Finally, we demonstrated the practical use of our approach with medical data on stroke prediction. Although the improvement of accuracy of the prediction model was relatively small, the confidence intervals of synthesized coefficients were dramatically decreased because information from other studies helped improve efficiency. In the context of multivariate metaanalysis, it is well known that we can gain precision by borrowing strength from other partially reported results [33–35]. This implies that our methodology can be applied not only to prediction models but also to observational studies such as a casecontrol/crosssectional study whose main purpose is to identify causal effects.
As a limitation of this study, our method was examined in only one practical dataset. Although this data includes over 100,000 samples, the population was Japanese only, and can thus be regarded as one group with small heterogeneity. This situation may not be representative of an ordinary metaanalysis because the majority of recent metaanalyses include several groups with large heterogeneity due to studies undertaken globally. We think, however, that we took this heterogeneity into account by incorporating random effects, as mentioned in the Methods section. We welcome the reevaluation of our method in other practical cases. Another potential limitation is that we implicitly assumed that the distributions of covariates are the same between studies. The assumption is required to calculate the expectation in Eq. (3) for each study to derive the omitted variable bias formula, γ^{∗}=f(α,β,p_{ XZ }). The assumption of homogeneity can also be relaxed by incorporating random effects into parameters related to the distribution, as discussed in the Methods section. However, a random effect model obscures the objective of a metaanalysis because under this model, a global “average”İ effect and the effect prevailing in particular circumstances are not identical [36]. We need further research about how to incorporate random effects and its interpretation.
Conclusions
This study proposed a correction method for the omitted variable bias due to different sets of covariates between literature models in the metaanalysis of regression coefficients. Our approach attained efficiency that was comparable to that of conventional approaches. This study should be useful for practitioners who want to develop their prediction model on their own dataset and incorporate prior regression results.
Abbreviations
 IPD:

individual patient data
 GNLS:

generalized nonlinear least squares
 DGP:

datagenerating process
 MSE:

mean squared error
 JPHD:

Japan Public Health Centerbased prospective study
 AUC:

area under the receiver operator characteristic curve
 BS:

brier score
References
Matheny M, McPheeters ML, Glasser A, Mercaldo N, Weaver RB, Jerome RN, et al.Systematic review of cardiovascular disease risk assessment tools. USA: Agency for Healthcare Research and Quality; 2011.
Abbasi A, Peelen LM, Corpeleijn E, van der Schouw YT, Stolk RP, Spijkerman AM, et al. Prediction models for risk of developing type 2 diabetes: systematic literature search and independent external validation study. BMJ: Br Med J. 2012;345.
Steyerberg E, Eijkemans M, Van Houwelingen J, Lee K, Habbema J. Prognostic models based on literature and individual patient data in logistic regression analysis. Stat Med. 2000; 19(2):141–60.
Bleeker S, Moll H, Steyerberg E, Donders A, DerksenLubsen G, Grobbee D, et al.External validation is necessary in prediction research:: A clinical example. J Clin Epidemiol. 2003; 56(9):826–32.
Steyerberg EW, Bleeker SE, Moll HA, Grobbee DE, Moons KG. Internal and external validation of predictive models: a simulation study of bias and precision in small samples. J Clin Epidemiol. 2003; 56(5):441–7.
Debray T, Koffijberg H, Vergouwe Y, Moons KG, Steyerberg EW. Aggregating published prediction models with individual participant data: a comparison of different approaches. Stat Med. 2012; 31(23):2697–712.
Becker BJ, Wu MJ. The synthesis of regression slopes in metaanalysis. Stat Sci. 2007; 22(3):414–29.
Balázs K, Hidegkuti I, De Boeck P. Detecting heterogeneity in logistic regression models. Appl Psychol Meas. 2006; 30(4):322–44.
Higgins J, Thompson S, Deeks J, Altman D. Statistical heterogeneity in systematic reviews of clinical trials: a critical appraisal of guidelines and practice. J Health Serv Res Policy. 2002; 7(1):51–61.
Wu MJ, Becker BJ. Synthesizing regression results: a factored likelihood method. Res Synth Methods. 2013; 4(2):127–43.
The Fibrinogen Studies Collaboration. Systematically missing confounders in individual participant data metaanalysis of observational cohort studies. Stat Med. 2009; 28(8):1218.
Riley R, Price M, Jackson D, Wardle M, Gueyffier F, Wang J, et al.Multivariate metaanalysis using individual participant data. Res Synth Methods. 2014; 6:157–74.
RescheRigon M, White IR, Bartlett JW, Peters SA, Thompson SG. Multiple imputation for handling systematically missing confounders in metaanalysis of individual participant data. Stat Med. 2013; 32(28):4890–905.
Cox DR, Oakes D. Analysis of survival data. Chapman & Hall/CRC Monographs on Statistics & Applied Probability, vol. 21. UK: CRC Press; 1984.
Yi GY, Reid N. A note on misspecified estimating functions. Stat Sinica. 2010; 20:1749–69.
White H. Maximum likelihood estimation of misspecified models. Econometrica: J Econ Soc. 1982; 50(1):1–25.
Chao WH, Palta M, Young T. Effect of omitted confounders on the analysis of correlated binary data. Biometrics. 1997; 53(2):678–89.
Johnson NL, Kotz S. Distributions in statistics: continuous univariate distributions, 2nd edn. Wiley Series in Probability and Statistics, vol. 2. USA: Houghton Mifflin; 1970.
Cramer JS. Logit models from economics and other fields. UK: Cambridge University Press; 2003.
Greene WH. Econometric analysis, 7th edn. USA: Pearson Education; 2003.
Liu D, Liu R, Xie M. Multivariate metaanalysis of heterogeneous studies using only summary statistics: efficiency and robustness. J Am Stat Assoc. 2015; 110(509):326–40.
Chen Y, Hong C, Riley RD. An alternative pseudolikelihood method for multivariate randomeffects metaanalysis. Stat Med. 2015; 34(3):361–80.
Go AS, Mozaffarian D, Roger VL, Benjamin EJ, Berry JD, Blaha MJ, et al.Heart disease and stroke statistics–2014 update: a report from the american heart association. Circulation. 2014; 129(3):28.
Johnston K, Connors A, Wagner D, Knaus W, Wang XQ, Haley EC, et al.A predictive risk model for outcomes of ischemic stroke. Stroke. 2000; 31(2):448–55.
Tsugane S, Sawada N. The jphc study: design and some findings on the typical japanese diet. Jpn J Clin Oncol. 2014; 44(9):777–82.
Walker A, Robins M, Weinfeld F. The national survey of stroke. clinical findings. Stroke; J Cerebral Circ. 1981; 12(2 Pt 2 Suppl 1):13.
Iso H, Rexrode K, Hennekens CH, Manson JE. Application of computer tomographyoriented criteria for stroke subtype classification in a prospective study. Ann Epidemiol. 2000; 10(2):81–7.
Saito I, Iso H, Kokubo Y, Inoue M, Tsugane S. Body mass index, weight change and risk of stroke and stroke subtypes: the japan public health centerbased prospective (jphc) study. Int J Obes. 2010; 35(2):283–91.
Yatsuya H, Iso H, Yamagishi K, Kokubo Y, Saito I, Suzuki K, et al.Development of a pointbased prediction model for the incidence of total stroke japan public health center study. Stroke. 2013; 44(5):1295–302.
Noda H, Iso H, Saito I, Konishi M, Inoue M, Tsugane S. The impact of the metabolic syndrome and its components on the incidence of ischemic heart disease and stroke: the Japan public health centerbased study. Hypertens Res. 2009; 32(4):289–98.
Pepe MS. The statistical evaluation of medical tests for classification and prediction, 1st edn. Oxford Statistical Science Series. USA: Oxford University Press; 2003.
Collins GS, de Groot JA, Dutton S, Omar O, Shanyinde M, Tajar A, et al.External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol. 2014; 14(1):40.
Riley RD, Abrams K, Lambert P, Sutton A, Thompson J. An evaluation of bivariate randomeffects metaanalysis for the joint synthesis of two correlated outcomes. Stat Med. 2007; 26(1):78–97.
Riley RD, Abrams KR, Sutton AJ, Lambert PC, Thompson JR. Bivariate randomeffects metaanalysis and the estimation of betweenstudy correlation. BMC Med Res Methodol. 2007; 7(1):3.
Jackson D, Riley R, White IR. Multivariate metaanalysis: potential and promise. Stat Med. 2011; 30(20):2481–98.
Shi JQ, Copas J. Metaanalysis for trend estimation. Stat Med. 2004; 23(1):3–19.
Acknowledgements
We thank Hisashi Noma, the Institute of Statistical Mathematics, for advise on the statistical analyses and the drafting of the manuscript. This work was supported by JSPS KAKENHI Grant Number 24500355.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
DY and MH contributed the research questions, developed the study methods, and interpreted the results. NS and MI were responsible for data acquisition and advised to the manuscript. All authors read and approved the final manuscript.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Yoneoka, D., Henmi, M., Sawada, N. et al. Synthesis of clinical prediction models under different sets of covariates with one individual patient data. BMC Med Res Methodol 15, 101 (2015). https://doi.org/10.1186/s128740150087x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s128740150087x