Synthesis of clinical prediction models under different sets of covariates with one individual patient data

Background Recently, increased development of clinical prediction models has been reported in the medical literature. However, evidence synthesis methodologies for these prediction models have not been sufficiently studied, especially for practical situations such as a meta-analyses where only aggregated summaries of important predictors are available. Also, in general, the covariate sets involved in the prediction models are not common across studies. As in ordinary model misspecification problems, dropping relevant covariates would raise potentially serious biases to the prediction models, and consequently to the synthesized results. Methods We developed synthesizing methods for logistic clinical prediction models with possibly different sets of covariates. In order to aggregate the regression coefficient estimates from different prediction models, we adopted a generalized least squares approach with non-linear terms (a sort of generalization of multivariate meta-analysis). Firstly, we evaluated omitted variable biases in this approach. Then, under an assumption of homogeneity of studies, we developed bias-corrected estimating procedures for regression coefficients of the synthesized prediction models. Results Numerical evaluations with simulations showed that our approach resulted in smaller biases and more precise estimates compared with conventional methods, which use only studies with common covariates or which utilize a mean imputation method for omitted coefficients. These methods were also applied to a series of Japanese epidemiologic studies on the incidence of a stroke. Conclusions Our proposed methods adequately correct the biases due to different sets of covariates between studies, and would provide precise estimates compared with the conventional approach. If the assumption of homogeneity within studies is plausible, this methodology would be useful for incorporating prior published information into the construction of new prediction models.

regression coefficients, which can be regarded as a variant of the generalized least squares method by Becker and Wu [7], explicitly takes into account the distinction of within-and between-study covariance of coefficients, and it should provide a valid solution when complete and unbiased estimates of the regression coefficients are available. However, in general, each published prediction model is developed using different sets of covariates [7][8][9][10], and dropping relevant covariates would raise potentially serious biases to the prediction models, as in ordinary model misspecification problems. To tackle this problem, the Fibrinogen Studies Collaboration [11] proposed a multivariate meta-analysis approach to borrow strength from partially adjusted results by using individual patient data (IPD), and Riley et al. [12] demonstrated the approach in practice. In addition, Resche-Rigon et al. [13] adopted a multiple imputation method with IPD. On the other hand, instead of using every IPD record, Debray et al. [6] considered a method that uses the reported summary statistics with one set of IPD. They adopted an adhoc approach utilizing mean or zero imputations for the missing coefficient estimates to straightforwardly apply the multivariate meta-analysis method [6]. Although Debray's approach is a simple implementation strategy, it should raise substantial biases to the synthesized results because the interpretation of the coefficients depends on which covariates are included in each regression model.
In this article, we develop valid inference methods for synthesizing regression coefficients of published prediction models under different sets of covariates. We provide bias assessment methods for regression coefficients when important predictors are dropped in some studies, and thereby supply bias-corrected estimators for the synthesized prediction models under the assumption of the homogeneity of studies in meta-analysis. We show that our method is asymptotically more efficient than the conventional approach applying multivariate meta-analysis, by using studies with common covariate sets and the previously proposed approach of Debray et al. [6]. Further, we demonstrate the robustness property against the misspecification of within-study covariance matrices. While we discuss here the synthesis of logistic prediction models, our approach could be extended to more general cases such as survival prediction models [14].
The rest of the paper is organized as follows. In Section 'Methods' , we consider as the first step the problem of omitted variable bias in the logistic regression model. Then we propose a non-linear model with the terms of omitted variable bias to synthesize the published coefficients, where the generalized nonlinear least squares (GNLS) method is applied for estimation. We also show that our method has desirable properties (i.e., efficiency and robustness). In Section 'Results' , the performance of our method is numerically checked by simulation studies. Our method is illustrated through the use of a practical dataset in Section 'Results: application in risk prediction models for occurrences of stroke' . This dataset was obtained from epidemiological studies on the incidence of stroke; these studies were conducted in Japan and contain several covariates. Each study separately analyzed with logistic regression models on each cohort, but the covariates in the models are unbalanced across cohorts. Finally, Section 'Discussion' provides some discussion and an examination of future problems.

Methods
We consider a similar situation as Debray et al. [6] in that we can use reported summary statistics from previous logistic regression models with different sets of covariates and at least one IPD from the publications or the authors themselves. Suppose that each published prediction model has a subset of covariates in the IPD, and is constructed for same prediction task. The number of published prediction models is N (i = 1, . . . , N) and the ith article reports the estimated coefficients,θ i , and the covariance matrix i = Cov(θ i ) (at least its diagonal elements). Eachθ i is a column vector of possibly different length,θ i = (θ i1 , . . . ,θ ip i ) T , and p i represents the differences of covariate sets among studies. To synthesize these regression coefficients, Debray et al. [6] utilize the mean or zero imputation methods for omitted coefficients and apply the technique of multivariate meta-analysis. Another simple approach is to apply multivariate metaanalysis using studies with common covariate sets [6]. However, the former approach leads to biased results and the latter is not biased but leads to loss of efficiency by ignoring indirect information from omitted studies. In order to improve this situation, we propose a new method for synthesis of logistic regression coefficients under different sets of covariates.
For simplicity, we assume the case where the true model has the full set of covariates in the IPD, which means the prior models have subsets of covariates and are considered as under-specified models. Note that since the omitted variables from the true models (full models) may be correlated with the included variables, the subset models are confounded and biased compared with true models. Our method can be generalized to more complex cases where the previous models for meta-analysis are a mixture of under-and over-specified models. Our method can be also applied to more general models such as generalized linear models, including linear regression and other nonlinear regression models with some modification. This study was approved by the Institutional Review Board at the National Cancer Center in Tokyo and The University of Tokyo, Japan.

The omitted variable bias in the logistic regression model
Firstly, we introduce the omitted variable bias under one original logistic regression setting, which can afterward be extended to the meta-analysis setting with the assumption that the covariate sets differ among studies.
Let X = 1, X 2 , . . . , X j T and Z = (Z 1 , . . . , Z k ) T be vectors of covariates and Y ∈ {0, 1} be a binary response variable. Suppose the data-generating process (DGP) can be formulated by the true model: where α, β are the true parameters of interest and "logit means the logistic function, logit(p) = log (p/ (1 − p)).
The misspecified model is assumed to be fitted, which omits relevant covariates Z from the true model (1). Specifically, We investigate the degree to which the regression coefficient γ , estimated under the misspecified model, differs from the true parameters α, β, and define the differences as the omitted variable bias.
To derive the omitted variable bias, the unbiasedness condition of the estimating function can be employed [15]; this approach is a generalized result of the landmark paper of White [16]. In the meta-analysis framework, the idea of the omitted variable bias can function as an analogy of different covariate sets and as a representation of the incorporation of indirect information from prior models.
In general, score functions from misspecified models cannot satisfy the unbiasedness condition of estimating functions. Therefore, the first step is to find the solution of the unbiasedness condition of estimating function (3), i.e., find γ * = f (α, β, p XZ ), which is the function of the true parameters α, β and the joint distribution of covariates, p XZ ; Here, the expectation is taken by the true joint distribution of Y, X and Z defined from (1) and p XZ . Under some regularity conditions, the maximum likelihood estimate of γ from the misspecified model (2) is a consistent estimate of γ * .
Secondly, for assessing biases caused by dropping the important predictors, we assume to have (at least) one IPD with the outcome and the full covariates X, Z. This assumption is considered reasonable for researchers who want to develop a new prediction model on their own IPD, incorporating prior summary statistics from regression results. Using the IPD, we can empirically solve (3) and derive the omitted variable bias.
Note that in the general case, the function f cannot be written in closed form due to its nonlinearity, but in the following case where every omitted covariate is a continuous variable it can be explicitly written.

Special case : omitted covariates, Z, are continuous variables
In general, the maximum likelihood estimate of γ in (2) (consistently) estimates γ * as the solution of (3). In particular, for the cases of normal continuous variables, the following analytical evaluation can be adapted. Now we suppose Z|X follows the multivariate normal distribution, N(μ Z|X , Z|X ). Based on the normality assumption of Z|X, we have Applying the technique of Chao et al. [17] to our covariate structure and using the probit approximation of logistic distribution, the expectation of Y conditional on X can be expressed as follow: where is the cumulative distribution function of standard normal distribution and c = 16(3) 1/2 /15π is the adjustment factor for probit approximation of the logistic distribution proposed by Johnson et al. [18]. In order to satisfy the unbiasedness condition of the estimating function, (3), we have Therefore, the function f should be denoted as which is the generalization of the results of Chao et al. [17] and Cramer et al. [19].

Nonlinear model for meta-analysis
Suppose there exist N reported models (i = 1, . . . , N) with their estimated coefficients of α, β and γ and their covariance matrices, and when i = 1, . . . , n, studies fit the true model (1) with a full set of covariates, X and Z, and when i = n + 1, . . . , N, studies mistakenly omit covariates Z. We assume the homogeneity of studies (i.e., the distribution of covariates and outcomes are common across the studies in the meta-analysis). Here we show only the case where Z is omitted, but the case where X is omitted can be considered in the same manner, and further, it is easy to generalize to various other omittance patterns. To synthesize the estimated coefficients vectors from the logistic regression models, we apply a GNLS method to incorporate the unequal variances of studies into meta-analysis. Based on this setting, the nonlinear model for metaanalysis can be formulated as follows; where andθ i is the column vector of reported coefficients in the ith study. The function f () comes from the omitted variable bias formula introduced in the previous section, whose formulation is reasonable if an assumption of homogeneity of studies in meta-analysis is acceptable. In a large sample, the estimated coefficientsθ i are (approximately) normally distributed with mean θ i = g i (α, β, p XZ ) and covariance Cov(θ i ). This asymptotic normality of estimated coefficients leads to the justification of the GNLS approach.
Under the model (5), overall estimates of the regression coefficientsα * andβ * can be obtained by GNLS as follows: wherep XZ is an estimate of p XZ from the IPD.
The diagonal of the covariance matrix is typically reported in the literature but the off-diagonals are unknown, thus off-diagonal elements can be imputed by using the IPD. We employ the same imputation method as Debray et al. [6] based on the IPD as follows; is a working covariance matrix of the ith study which is applied to one of the block diagonal elements of ,V i = diag(Cov(θ i )) is a diagonal matrix whose diagonal elements are the estimated standard errors (SE) reported from each study and R IPD is a working correlation matrix of coefficients calculated from the IPD. The covariance matrix can be calculated with a sandwich estimator under the model misspecification assumption instead of the imputation based on the IPD [15], but there computational complexity remains a problem and little improvement is gained in simulations studies. Furthermore, even if the covariance matrix is misspecified, the proposed estimator is still consistent and asymptotically normally distributed with a sandwich covariance matrix. This robustness follows the asymptotic theory of the generalized estimating equations. In this situation, letα W and β W denote our estimators with the working covariance matrix. The covariance matrix of these estimators can be estimated by [20,21]. This idea essentially comes from Liu et al. [21] and can be regarded as analogy of the result proposed by Chen et al. [22]. In addition to the above, if the working covariance matrix is a good approximation of the true covariance structure, the following relationship holds: where Avar denotes an asymptotic covariance matrix, and θ * ,θ M andθ S are the estimates of θ = α T , β T T obtained from our proposed method, from multivariate meta-analysis using only the studies with full covariates and from a single study with full covariates, respectively, when the number of studies N goes to infinity.
Here, we assume a fixed effect model which presumes that there is no heterogeneity in the distribution of covariates and in the values of the parameters of interest. This assumption may sometimes be unrealistic. Therefore, we recommend considering whether this assumption is reasonable based on background knowledge or reported information. In addition, we can propose how to modify this to a random effects model to incorporate the heterogeneity by assuming that the parameters underlying studies and the parameters of distribution of covariates follow some distribution. For example, considering the case that all omitted variables are continuous (i.e., Section 'The omitted variable bias in the logistic regression model)' , we can incorporate random effects by assuming that α, β, and Z|X in (4) follow distributions. Random effects in α, β accommodate the heterogeneity of parameters and random effects in and Z|X accommodates that of distribution of covariates.

Simulation setup
In this section we describe a Monte Carlo simulation which was performed to evaluate the performance of our proposed method. In the simulation, we empirically calculate the omitted variable bias by using Eq. (3) instead of Eq. (4). The parameters which varied in the simulation scenario were the true value of a parameter in a DGP model, the number of predictors and the distribution of covariates (continuous/discrete covariates). For simplicity, we examined the case where the number of predictors in the models in this simulation was 1 or 2 (i.e. X 1 , X 2 or both). The DGP model was logitP(Y = 1|X 1 , X 2 ) = α 0 + α 1 X 1 + α 2 X 2 . For checking the sensitivity for the true value of the parameter in the DGP model, α 1 varied from -2 to 1, and true values of other parameters was set at 1 (i.e. α 0 = α 2 = 1).
We simulated N = 9, (i = 1, . . . , 9) independent studies with 100 samples in each, and of these studies, 3 studies (i = 1, 2, 3) included a full set of covariates (X 1 and X 2 ), 3 (i = 4, 5, 6) were supposed to omit X 1 and 3 (i = 7, 8,9) were supposed to omit X 2 . One of the studies with the full set of covariates was used as the IPD. As mentioned above, the off-diagonals of the covariance matrix were often unknown, thus we adopted the imputation by IPD proposed in the Methods section. In this simulation, we compared the performance of this imputation with the setting using a true covariance structure, which could be estimated from simulation settings.
We classified the scenario into 2 cases according to the distribution of covariates (continuous/discrete distribution). In Case 1, X 1 and X 2 were both continuous and followed the multivariate normal distribution, X 1 , X 2 ∼ N 2 2 , 1 r r 1 . The correlation, r, between X 1 and X 2 was set at 0 or 0.5. In this case, we checked the performance of the approximation formula (4). Case 2 was the more practical case in which continuous and discrete distributions were mixed (i.e., X 1 was continuous and X 2 was binary). X 2 was binarized from the distribution in Case 1 by a threshold value set at 2. Under these settings, 1000 Monte Carlo simulations were implemented. If the models could not be fitted and converged, their results were excluded from the calculation of bias and mean squared error (MSE). Performance of the proposed method was evaluated by bias and MSE, comparing it with two ordinary methods. M1 was the multivariate meta-analysis using only 3 studies with a full set of covariates. From a theoretical perspective, the M1 strategy does not include any bias but is inferior in efficiency compared with our proposed method, which can be checked using the results of MSE. M2 was the multivariate meta-analysis after mean imputation of missing coefficients, whose method was proposed in Debray et al. [6]. For example, coefficients and their estimated standard errors of X 2 from 3 studies (i = 4, 5, 6) were imputed by the means of the other 6 studies. We tried the zero imputation method, which Debray et al. adopted [6] and called uninformative regression coefficients, but it did not show notable results compared with the results from M2 (mean imputation). Therefore, we decided not to include the results of this method.

Simulation results
The results of the simulation revealed that compared with the ordinary meta-analysis, our proposed estimator generally produced more precise and less-biased estimates for all simulation settings ( Table 1). The rationale of small proportion of convergence at α 1 = 1 was that under the simulation setting, P(Y = 1) could be over 0.9, which led to small number of event at a single study. The bias of our estimator ranged from -0.052 to 0.097 (mean: 0.021) for Case 1 and from -0.064 to 0.488 (mean: 0.040) for Case 2. The MSE of our estimator ranged from 0.021 to 0.803 (mean: 0.124) for Case 1 and from 0.012 to 0.486 (mean: 0.091) for Case 2. Although the M2 strategy in Case 1 and r = 0 yielded somewhat biased results, the greatest amount of variation seemed to arise from the biased estimates of α 0 in the models from which X 2 was omitted.
The relative efficiency (RE) of the estimates of M1 versus those of our proposed method ranged from 1.023 to 9.913 (mean: 2.323) for Case 1 and from 1.098 to 10 In terms of the RE of the estimates from the true covariance structure versus the imputation method for unknown elements in the covariance structure, the RE of the covariance structure imputed from the IPD versus the true covariance structure ranged from 0.900 to 1.448 (mean: 1.126) for Case 1 and from 0.895 to 1.193 (mean: 1.065) for Case 2.
Comparing the MSE by correlation value (r = 0 versus 0.5), in Case 1 the mean MSE of our proposed method was r = 0: 0.074 versus r = 0.5: 0.005. In Case 2, the mean MSE of our proposed method was r = 0: 0.113 versus r = 0.5: 0.174.

Results: application in risk prediction models for occurrences of stroke
We applied the proposed methods to a series of epidemiologic studies that developed risk prediction models for occurrences of stroke. Stroke is one of the leading causes of death or physical/cognitive impairment in both developed and developing countries, and therefore numerous prediction models have been developed and many clinical characteristics have been identified as potential predictors [23, 24]. However, the overall influence of various risk factors is still unclear, with conflicting results from several reports [24].

Application setup
We obtained 10 sets of IPD from studies conducted by the Japan Public Health Center-based Prospective Study (JPHC study). The JPHC study covers 11 public health center areas (Areas 1 -11) across Japan. The total number of participants was 140,420, and the study population consisted of residents who were 40 to 69 years old at the time of the baseline survey. Details of the study design are well-documented in a previous report [25]. The outcome was confirmed according to the criteria provided by the National Survey of Stroke, which required a constellation of neurological deficits of sudden or rapid onset lasting at least 24 hours or until death [26,27].
We fitted a logistic regression model to each available set of IPD and explored the important covariates related to patient characteristics and metabolic syndrome [28][29][30] such as age (years), time since last meal (minutes), body mass index (BMI (kg/m 2 )), total cholesterol level (mg/dl), blood pressure (mmHg), cigarettes (per day), diabetes (yes/no), blood glucose (mg/dl), high-density lipoprotein (HDL (mg/dl)), and serum triglycerides (mg/dl) ( Table 2). The sets of available covariates differed by region. For example, IPD from the Area 1 cohort did not include data on blood glucose, HDL or serum triglycerides since subjects in that cohort did not undergo any blood tests. One of our motivations in this study was to overcome this discrepancy among cohorts, which is typical in large-scale cohort studies investigating several outcomes. Coefficients from each model were stored as aggregated statistics, which could be regarded as prior studies for meta-analysis. In terms of handling sporadically missing data (average missing rate was 2.8 % with a standard deviation of 2.5 %), complete case analysis was executed. One cohort (Area 9) remained as IPD and one cohort (Area 11) was used as test data for prediction. Next, we compared our methodology with conventional multivariate metaanalysis using only studies with a full set of covariates and with results based solely on IPD data.
Lastly, new prediction models were constructed by plugging the synthesized coefficients into the models and checking the performance of each model using the test data.
The discriminant performance of the prediction models was measured by the area under the receiver operator characteristic curve (AUC) and the Brier score (BS) (multiplied by 100), both of which are indicators of the accuracy of the prediction model. A higher AUC indicates higher prediction accuracy, while the BS has an inverse relationship [31]. In addition, the model's calibration was examined by the Hosmer-Lemeshow chi-squared statistic [32].

Application results
The results demonstrated that our approach provided considerably narrower confidence intervals and slightly better prediction performance compared with conventional multivariate meta-analysis (Table 2). Our estimator reduced the SE by 38-53 and 56-71 % compared with the SE from conventional meta-analysis and from the IPD, respectively.
In terms of prediction performance, the prediction model constructed from the synthesized coefficients showed slight improvements over the conventional approach, particularly in BS. The AUC and BS were respectively increased by 1.1 and −1.0 % on average compared with conventional meta-analysis, and decreased by −0.4 and 1.0 % on average compared with the IPD. The improvements in prediction performance were relatively small because the cohort of test data was remarkably similar to other cohorts across Japan that were aggregated into summary statistics as previously published studies.

Discussion
Along with increasing attention to prediction models, there has been higher demand for approaches to the meta-analysis of regression coefficients. However these methodologies are not well developed due to the many difficulties caused by the different settings used by various studies, and further research is still needed, particularly compared with conventional meta-analysis methods such as synthesizing mean differences, correlation and so on [7]. This study demonstrated a method to conduct the meta-analysis of regression coefficients with different covariate sets under the assumption of homogeneity of studies (i.e., it is applicable in cases where studies in the meta-analysis have similar distributions of covariates and outcomes). Although this study temporarily assumed the models with a full set of covariates as a true model, our approach can be generalized to any formulation of previous models even if they are over-/under-specified compared to a constructing model. We notice, however, that we need careful arguments about what is an appropriate covariate set. Further, the assumption that (at least) one IPD is available can be considered reasonable in the frequent case in which a single researcher wants to construct a new prediction model on his or her own IPD, incorporating prior regression results (but with such prior results reported just in the form of summary statistics). The minimal use of IPD (use of one IPD and other summary statistics) distinguishes our approach from that of the Fibrinogen Studies Collaboration [11]. They assume that both full and partial models are applied in each cohort by using its cohort IPD, and thus the estimation of the correlation of coefficients between full and partial models is applicable. As the future work, we need to study the relationship between our method and their one, and examine whether their method can be applicable to the situation we are considering. Regarding these discussions, our study can provide the following guidelines for practitioners about how to analyze prior models with their own IPD by recognizing the issue of omitted variable bias as the differences of sets of predictors between their constructing models and prior models: 1) the first step is to construct a new and temporal model on their own data set, and 2) the second step is to apply our method to synthesize the previous regression coefficients with their temporal model and then update the model and obtain more accurate estimators. Our method proved robust against the misspecification of the covariance structure. Because of this property we can arbitrarily set the covariance matrix of coefficients and thus it is possible to avoid the argument, often discussed with methods such as that of Becker and Wu [7], on whether the full covariance matrix of coefficients should be reported or not. This robustness property can be considered as an analogical result provided by Liu et al. [21]. They provide a framework of meta-analysis under heterogeneity by using a confidence density function and reparametrization of the problem setting. Their approach utilizes the reparameterization connecting each studyspecific parameter to the common parameter using the transformation function M i , which is used as the omitted variable bias formula in our setting. However, they assume that the omitted covariates are fixed values and thus they can estimate M i without a consideration of the distribution of covariates. In contrast, our approach provides more general guidelines for treating missing covariates in the meta-analysis.
The simulation performed in this study illustrated that our method is unbiased and has greater efficiency than a conventional meta-analysis approach as well as the technique proposed by Debray et al. [6]. Although our estimator was most efficient if the covariance structure was truly specified, it maintained its efficiency even if we misspecified the covariance structure, with a loss of efficiency by misspecification of only around 10 %.
Finally, we demonstrated the practical use of our approach with medical data on stroke prediction. Although the improvement of accuracy of the prediction model was relatively small, the confidence intervals of synthesized coefficients were dramatically decreased because information from other studies helped improve efficiency. In the context of multivariate meta-analysis, it is well known that we can gain precision by borrowing strength from other partially reported results [33][34][35]. This implies that our methodology can be applied not only to prediction models but also to observational studies such as a case-control/cross-sectional study whose main purpose is to identify causal effects.
As a limitation of this study, our method was examined in only one practical dataset. Although this data includes over 100,000 samples, the population was Japanese only, and can thus be regarded as one group with small heterogeneity. This situation may not be representative of an ordinary meta-analysis because the majority of recent meta-analyses include several groups with large heterogeneity due to studies undertaken globally. We think, however, that we took this heterogeneity into account by incorporating random effects, as mentioned in the Methods section. We welcome the re-evaluation of our method in other practical cases. Another potential limitation is that we implicitly assumed that the distributions of covariates are the same between studies. The assumption is required to calculate the expectation in Eq. (3) for each study to derive the omitted variable bias formula, γ * = f (α, β, p XZ ). The assumption of homogeneity can also be relaxed by incorporating random effects into parameters related to the distribution, as discussed in the Methods section. However, a random effect model obscures the objective of a meta-analysis because under this model, a global "average"İ effect and the effect prevailing in particular circumstances are not identical [36]. We need further research about how to incorporate random effects and its interpretation.

Conclusions
This study proposed a correction method for the omitted variable bias due to different sets of covariates between literature models in the meta-analysis of regression coefficients. Our approach attained efficiency that was comparable to that of conventional approaches. This study should be useful for practitioners who want to develop their prediction model on their own dataset and incorporate prior regression results.