 Research article
 Open Access
Prevalence estimation by joint use of big data and health survey: a demonstration study using electronic health records in New York City
BMC Medical Research Methodology volume 20, Article number: 77 (2020)
Abstract
Background
Electronic Health Records (EHR) have been increasingly used as a tool to monitor population health. However, subject-level errors in the records can yield biased estimates of health indicators. There is an urgent need for methods to estimate the prevalence of health indicators using large and real-time EHR while correcting the potential bias.
Methods
We demonstrate joint analyses of EHR and a smaller gold-standard health survey. We first adopted Mosteller's method, which pools two estimators, one of which is potentially biased. It only requires knowing the prevalence estimates from the two data sources and their standard errors. Then, we adopted the method of Schenker et al., which uses multiple imputations of subject-level health outcomes that are missing for the subjects in the EHR. This procedure requires information to link some subjects between the two sources and modeling of the misclassification mechanism in the EHR as well as of the inclusion probabilities to both sources.
Results
In a simulation study, both estimators yielded negligible bias even when the EHR was biased. They performed as well as the health survey estimator when the EHR bias was large and better than the health survey estimator when the EHR bias was moderate. It may be challenging to model the misclassification mechanism in real data for the subject-level imputation estimator. We illustrate the methods by analyzing six health indicators from the 2013–14 NYC HANES and the 2013 NYC Macroscope, together with a study that linked some subjects in both data sources.
Conclusions
When a small gold-standard health survey exists, it can serve as a safeguard against potential bias in EHR through the joint analysis of the two sources.
Background
Electronic Health Records (EHR) have been increasingly used as a tool for public health surveillance by local and national jurisdictions [1]. For example, recent studies in New York City (NYC) reported that the prevalence estimates from NYC Macroscope, an EHR-based surveillance system in NYC [2], were comparable to the survey-based estimates for diabetes, hypertension, and smoking [3, 4]. EHR often cover more people (n ≥ 100,000) than traditional population health surveys and, once the infrastructure is in place, the data collection occurs in near real-time without additional recruitment or interviewing costs.
Despite these advantages, the prevalence estimates from EHR can often be biased, mainly due to two causes. The first is selection bias; that is, EHR may not represent the target population. For example, the patient population from NYC Macroscope underrepresents young men, overrepresents patients living in high-poverty neighborhoods, and only includes patients who visit primary care doctors connected to a particular EHR system [2]. The selection bias can be corrected, if modeled correctly, by post-stratification. The other source of error is the misclassification of health outcomes, which is the main interest of our study. It comprises measurement error (e.g., due to the use of non-standardized instruments across sites), extraction error, and the collection of proxy measurements (e.g., due to recording self-reports and actual measurements without distinction). McVeigh et al. [3] reported such subject-level discrepancies by examining a chart review of participants who both visited NYC Macroscope providers and participated in the NYC Health and Nutrition Examination Survey (HANES), a population-representative survey with field interviews and biospecimen collection. Taking the NYC HANES measurements as the "gold standard," the chart review found a 5% subject-level error for obesity, 19% for depression, and 19% for influenza vaccination. Notably, the sensitivity (i.e., the proportion of medical conditions identified in NYC HANES that were also indicated in the EHR) was as low as 31% for depression and 19% for influenza vaccination. In a later study, McVeigh et al. [5] extracted chart data from more than 20 additional EHR software systems used by primary care providers and repeated a similar study for 190 participants of the 2013–14 NYC HANES. For public health surveillance systems using EHR, there is an urgent need for methods to estimate the prevalence of health indicators using large and real-time EHR while correcting the potential bias using external sources.
Many existing methods allow investigators to pool multiple data sources, and some may be suitable for the unique context of combining big data with a small gold-standard survey. They can be classified by whether the subjects are linked at the individual level and whether potential biases are accounted for. For data sources that are unavailable at the individual level, aggregate statistics are pooled from the sources. For example, Thompson [6] developed methods to combine aggregate statistics from standardized surveys by an international tobacco control project to find programs that are effective in reducing tobacco use. She studied several approaches, including a model with random effects for the country. However, her model assumed that all surveys were equally likely to be biased and that the biases across countries canceled each other out. There are a handful of works that account for pooling a gold-standard source with potentially biased sources [7,8,9,10,11]. Earlier, Mosteller [9] studied ways to combine the means from two samples when one is potentially biased. Mosteller's estimator, chosen as one end of the spectrum of methods, will be discussed further in the following section. Lohr and Brick [7] explored methods for pooling domain-level estimates from two surveys that measure victimization prevalence: their gold-standard survey, the United States National Crime Victimization Survey, and a larger but potentially biased telephone companion survey. They compared ten methods that combine a gold-standard survey with another, biased data source. The methods included calibration methods and weighted averages of the estimators from the two sources without any bias adjustment (i.e., unadjusted dual-frame estimators), with bias adjustment pooled across the domains, and with domain-specific bias adjustment. The last method performed the best. Another estimator that performed well was the multiplicative bias estimator published earlier [11]. Manzi et al. [8] used a Bayesian hierarchical model to pool domain-level smoking prevalence estimates from seven surveys in the eastern regions of England. Similarly, Raghunathan et al. [10] used a Bayesian hierarchical model to combine potentially biased county-level prevalences of cancer outcomes and risk factors from a larger telephone survey, the Behavioral Risk Factor Surveillance System, with an unbiased (or less biased) face-to-face National Health Interview Survey (NHIS) covering fewer counties and fewer households.
When data are available at the individual level, Kim and Rao [12] developed a method to combine a small survey with outcome measurements and auxiliary information with a larger independent survey with only auxiliary information. Park et al. [13] developed a model to pool one gold-standard source with outcome measurements and auxiliary information with another independent source with a potentially biased outcome and the same auxiliary measure. Schenker et al. [14] used multiple imputations to combine self-reported outcomes from a large survey, the NHIS, with a smaller NHANES that has both clinical and self-reported outcomes. They imputed the clinical measurements of health outcomes for the participants of the larger survey by modeling both the underlying mechanism of misclassification of outcomes and the mechanism of inclusion into each survey. We will study this method further in the following section as the other end of the spectrum of methods. For more than two proxy outcome variables measured with lagged overlaps, Gelman et al. [15] and He et al. [16] used similar multiple imputation approaches.
In this study, we aim to demonstrate that the joint analysis of a large EHR with a much smaller gold-standard health survey can improve the accuracy of prevalence estimates. Our aim is not to study all available methods but instead to demonstrate two statistical procedures at the two ends of the spectrum. First, we adopt Mosteller's method [9] to pool two estimators when one is potentially biased. It only requires knowing the prevalence estimates from the two data sources and their standard errors. Second, we adopt the method of Schenker et al. [14], which uses iterative multiple imputations of subject-level health outcomes for both surveys. This procedure requires information to link some subjects between the two sources and modeling of the mechanisms underlying the misclassifications in the EHR as well as of the inclusion probabilities to both sources. We demonstrate the statistical properties of the two estimators using simulation studies. Finally, we illustrate the methods by analyzing the 2013–14 NYC HANES and the 2013 NYC Macroscope together with a small study that linked some subjects between the two sources.
Methods
We consider two data sources. The first is a health survey of a smaller sample S_{1} with survey weights w_{1} that is representative of the target population. Measurement Y_{1} in the survey is the gold standard, and hence \( {\hat{p}}_1={\sum}_{i\in {S}_1}{w}_{1,i}{Y}_{1,i}/\sum {w}_{1,i} \) is an unbiased estimator of the prevalence of interest p_{1}. The other data source is an EHR of a larger sample S_{2} that becomes representative of the population with post-stratified weights w_{2}. Measurement Y_{2} in the EHR may have subject-level errors, and \( {\hat{p}}_2={\sum}_{i\in {S}_2}{w}_{2,i}{Y}_{2,i}/\sum {w}_{2,i} \) may be a biased estimator of p_{1}. We denote the logit of the prevalence as ϕ_{1} = logit(p_{1}) and the logits of the prevalence estimators from the two sources as y_{1} = logit(\( {\hat{p}}_1 \)) and y_{2} = logit(\( {\hat{p}}_2 \)). We assume that the covariance between the two estimators is ignorable, since the number of overlapping subjects (S_{1} ∩ S_{2}) is typically very small relative to the size of the EHR (S_{2}). We can link the subset of overlapping subjects (S_{c}) between the two sources. Figure 1 outlines the data structure. We used the statistical software R for all statistical analyses [17, 18].
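As a minimal illustration (in Python, with made-up outcomes and weights rather than data from either source), the survey-weighted prevalence estimator above can be computed as:

```python
# Survey-weighted prevalence: p_hat = sum(w_i * y_i) / sum(w_i)
def weighted_prevalence(y, w):
    """Estimate prevalence from binary outcomes y with survey weights w."""
    if len(y) != len(w) or not w:
        raise ValueError("y and w must be non-empty and of equal length")
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

# Illustrative data: three subjects, the second carrying twice the weight.
y = [1, 0, 1]
w = [1.0, 2.0, 1.0]
print(weighted_prevalence(y, w))  # (1*1 + 2*0 + 1*1) / 4 = 0.5
```

The same function serves for either source by passing the survey weights w_1 or the post-stratified EHR weights w_2.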
Mosteller estimator
At the core of the problem is a simple question: "Can we gain by pooling two estimates when one is possibly biased but comes from a larger sample?" Earlier, Mosteller (1948) [9] studied whether to pool two sample means when one is potentially biased. He compared the mean squared error (MSE) of various mean estimators: the unbiased mean, the test-then-pool estimator (i.e., pooling the two means only when the mean difference was not significant), and the maximum likelihood estimator (MLE) assuming a mean-zero Gaussian bias. The last estimator showed the smallest MSE. We adopt his approach to account for unequal sample sizes and unequal variances. The estimator is a weighted average of y_{1} and y_{2}:

\( {\hat{\phi}}^{\mathrm{M}}=\left({k}_1{y}_1+{k}_2{y}_2\right)/\left({k}_1+{k}_2\right). \)
It can be shown that the MSE of this family of estimators is minimized when \( {k}_1=1/{\sigma}_1^2 \) and \( {k}_2=1/\left({\tau}^2+{\sigma}_2^2\right) \), where σ_{1} and σ_{2} are the standard errors of y_{1} and y_{2}, and τ = E(y_{2}) − ϕ_{1} is the bias of y_{2}. The estimator is also the MLE of ϕ_{1} under the model y_{j} = ϕ_{1} + 1(j = 2)θ + e_{j}, where θ and e_{j} are mutually independent zero-mean normal variables with variances τ^{2} and \( {\sigma}_j^2 \), respectively. The variance and bias parameters were estimated by the consistent estimators \( {\hat{\sigma}}_1^2={s}_1^2 \), \( {\hat{\sigma}}_2^2={s}_2^2 \), and \( {\hat{\tau}}^2={\left({y}_1-{y}_2\right)}^2 \). For example, \( {s}_j^2 \) can be the sample variance estimated using survey weights.
The same estimator can also be derived from an approximate Bayesian perspective [19] by setting priors on the asymptotically normal sampling distribution of y_{j}. If we set a non-informative prior (i.e., normal with infinite variance) for ϕ_{1} and a zero-mean normal prior with variance τ^{2} for the bias E(y_{2}) − ϕ_{1}, then the posterior distribution of ϕ_{1} can be shown to be normal with mean \( {\hat{\phi}}^{\mathrm{M}} \) and variance \( {\sigma}_1^2\left({\sigma}_2^2+{\tau}^2\right)/\left({\sigma}_1^2+{\sigma}_2^2+{\tau}^2\right) \). Here τ measures the prior belief in the closeness of the prevalences measured by the EHR and the health survey. The 95% highest density credibility interval of the logit prevalence is given as

\( {\hat{\phi}}^{\mathrm{M}}\pm 1.96\sqrt{{\sigma}_1^2\left({\sigma}_2^2+{\tau}^2\right)/\left({\sigma}_1^2+{\sigma}_2^2+{\tau}^2\right)}. \)
The estimator, while less efficient than the subject-level imputation estimator below, is simpler to implement for practitioners, who often do not have the resources to link subjects across the two sources or to model the mechanisms of the misclassifications in the EHR.
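The whole Mosteller procedure fits in a few lines. The following Python sketch takes prevalence estimates and their standard errors from the two sources, moves them to the logit scale via the delta method, and applies the precision weights described above; the input numbers are illustrative, not the NYC estimates:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def expit(x):
    return 1 / (1 + math.exp(-x))

def mosteller(p1, se1, p2, se2):
    """Pool a gold-standard estimate (p1, se1) with a possibly biased
    one (p2, se2) using Mosteller's precision weights on the logit scale."""
    y1, y2 = logit(p1), logit(p2)
    # Delta-method standard errors on the logit scale: se / (p * (1 - p)).
    s1 = se1 / (p1 * (1 - p1))
    s2 = se2 / (p2 * (1 - p2))
    tau2 = (y1 - y2) ** 2                 # estimated squared bias of y2
    k1, k2 = 1 / s1**2, 1 / (tau2 + s2**2)
    phi = (k1 * y1 + k2 * y2) / (k1 + k2)
    var = s1**2 * (s2**2 + tau2) / (s1**2 + s2**2 + tau2)
    half = 1.96 * math.sqrt(var)
    return expit(phi), (expit(phi - half), expit(phi + half))

# Illustrative inputs: a small survey (p1) and a large, possibly biased EHR (p2).
est, ci = mosteller(p1=0.30, se1=0.02, p2=0.32, se2=0.002)
```

Because the estimated bias τ̂² grows with the disagreement between the two sources, the pooled estimate automatically shifts its weight toward the gold-standard survey when the EHR looks biased.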
Subject-level imputation estimator
Misclassification model
We adapted the approach of Schenker et al. [14] and modeled the misclassification between the binary outcomes of the i^{th} subject in the health survey (Y_{1, i}) and the EHR (Y_{2, i}):

logit P(Y_{2,i} = 1 ∣ Y_{1,i}, z_i) = β_{0l} + β_{1l}Y_{1,i} + β_{2l}z_i (model M1),
logit P(Y_{1,i} = 1 ∣ Y_{2,i}, z_i) = γ_{0l} + γ_{1l}Y_{2,i} + γ_{2l}z_i (model M2),

where z_{i} is a vector of predictors. Since the relationship may depend on the design factors of the surveys, each model is stratified into four levels (l = 1, 2, 3, 4) divided by the quartiles of the inclusion probabilities to the health survey, q_{11}, q_{12}, q_{13}, and to the EHR, q_{21}, q_{22}, q_{23}.
Model for inclusion to each source
Since the inclusion probabilities to the health survey (π_{1i}) are unknown for most EHR subjects, we model them by logit π_{1i} = a_{0} + a_{1}u_{i}, where u_{i} is a vector of survey design factors. The model is fit over all EHR subjects, weighted by their post-stratified weights (w_{2}). Similarly, we model the inclusion probability to the EHR by logit π_{2i} = b_{0} + b_{1}v_{i} and fit it over all health survey subjects, weighted by their survey weights.
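A minimal sketch of fitting such a weighted logistic inclusion model is below (Python with NumPy, on simulated data; the design factor u, the true coefficients, and the uniform weights are all assumptions for illustration, not quantities from the study). It uses weighted Newton–Raphson (iteratively reweighted least squares), a standard way to fit logistic regression:

```python
import numpy as np

def weighted_logit_fit(X, y, w, iters=25):
    """Fit logit P(y = 1) = X @ beta by weighted Newton-Raphson (IRLS),
    where w holds per-subject weights (e.g., post-stratified EHR weights)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))              # fitted probabilities
        grad = X.T @ (w * (y - p))                       # weighted score
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X  # weighted information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Simulated data with one design factor u; the coefficients are made up.
rng = np.random.default_rng(0)
n = 5000
u = rng.normal(size=n)
X = np.column_stack([np.ones(n), u])
true_beta = np.array([-1.0, 0.8])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))
w = np.ones(n)  # uniform weights for this sketch
beta_hat = weighted_logit_fit(X, y, w)
```

In the actual procedure the weights would be the post-stratified w_2 (for π_1) or the survey weights w_1 (for π_2) rather than the uniform weights used here.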
Bayesian iterative regression imputation
While we are ultimately interested in imputing the missing health survey outcomes ((1) in Fig. 1), we follow Schenker et al. [14] and perform iterative imputations between two models: M1, to impute the missing EHR values ((2) in the figure), and M2, to impute the missing health survey values ((1) in the figure). This is repeated B times. Imputing the missing EHR values increases the sample size when fitting M2, the model of ultimate interest. The additional variation caused by using imputed values is accounted for by the multiple imputation standard error formula below. The detailed procedure follows.
To impute missing Y_{2,i}, we divided the subjects S_{1} ∪ S_{2} into four groups (l = 1, …, 4) by the quartiles q_{21}, q_{22}, q_{23}, and within each group fit the Bayesian regression model M1, with a weakly informative prior for β_{l} = (β_{0l}, β_{1l}, β_{2l}) of independent Cauchy distributions with location zero and scale 2.5, first on the subjects S_{c} whose identities can be linked between the two data sources. Then, we drew a posterior sample of β_{l} and, in turn, of Y_{2,i} conditional on β_{l} for all health survey subjects missing Y_{2,i}. Subsequently, treating the imputed Y_{2,i} as observed, we imputed missing Y_{1,i} by dividing the subjects into four groups by q_{11}, q_{12}, q_{13} and fitting the regression model M2 on all EHR subjects, with an independent Cauchy prior for γ_{l} = (γ_{0l}, γ_{1l}, γ_{2l}) with location zero and scale 2.5. We drew a posterior sample of γ_{l} and, in turn, of Y_{1,i} for all EHR subjects missing Y_{1,i}. We iterated the fitting of models M1 and M2 B times, treating the imputed values from the previous step as observed and imputing the missing outcome variables, until convergence. Then we calculated a prevalence estimator \( {\hat{p}}_m={\sum}_{i\in {S}_2}{w}_{2,i}{\hat{Y}}_{m,1,i}/\sum {w}_{2,i} \) based on the imputed health survey measurements of all EHR subjects. Note that the outcome values were imputed only when they were missing; in other words, \( {\hat{Y}}_{m,1,i} \) = Y_{1, i} for subjects whose health survey outcome was observed. Finally, we combined the inferences from M such multiple imputations. The resulting prevalence estimator is unbiased when the specified models are correct:

\( {\hat{P}}^{\mathrm{R}}={\sum}_{m=1}^M{\hat{p}}_m/M. \)
The standard error of \( {\hat{\phi}}^{\mathrm{R}} \) = logit(\( {\hat{P}}^{\mathrm{R}} \)) was estimated in the standard way [20, 21]:

\( \widehat{\operatorname{Var}}\left({\hat{\phi}}^{\mathrm{R}}\right)=W+\left(1+1/M\right)B, \)
where W = \( {\sum}_m{s}_m^2/M \), B = \( {\sum}_m{\left({\hat{\phi}}_m-{\hat{\phi}}^{\mathrm{R}}\right)}^2/\left(M-1\right) \), and s_{m} is the naïve standard error of the logit prevalence (\( {\hat{\phi}}_m \)) calculated from the m^{th} imputation. Since the overlap between the two sources can be small, we used the Barnard–Rubin degrees of freedom [22, 23] to compute credibility intervals, first on the log-odds scale before transforming them to the probability scale.
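The combining rule can be sketched as follows (Python; the three logit-scale estimates and naïve standard errors are made-up illustrative values, and the Barnard–Rubin degrees-of-freedom adjustment is omitted for brevity):

```python
import math

def combine_imputations(phis, ses):
    """Combine M logit-prevalence estimates (phis) with naive standard
    errors (ses) by Rubin's rules: total variance T = W + (1 + 1/M) * B."""
    M = len(phis)
    phi_bar = sum(phis) / M
    W = sum(s**2 for s in ses) / M                     # within-imputation
    B = sum((p - phi_bar)**2 for p in phis) / (M - 1)  # between-imputation
    T = W + (1 + 1 / M) * B                            # total variance
    return phi_bar, math.sqrt(T)

# Three hypothetical imputations of the logit prevalence.
phi, se = combine_imputations([-0.85, -0.80, -0.88], [0.10, 0.11, 0.10])
```

The combined standard error exceeds each naïve one because the between-imputation term B charges the estimator for the uncertainty introduced by imputation.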
Results
Simulation studies
We performed simulation studies to assess the performance of the methods under various settings. We generated correlated binary outcomes (Y_{1}, Y_{2}) for a target population (N = 10,000,000) whose conditional distributions follow logistic models: logit P(Y_{1} = 1 | Y_{2}) = η_{10} + φY_{2} and logit P(Y_{2} = 1 | Y_{1}) = η_{01} + φY_{1}, where η_{10} = γ_{0} + γ_{1}x_{1} + γ_{2}x_{2} and η_{01} = β_{0} + β_{1}x_{1} + β_{2}x_{2}. To do so, we first generated an independent Bernoulli variable x_{1} with success probability 0.5 and a standard normal variable x_{2}. Then we generated the correlated binary outcomes (Y_{1}, Y_{2}), which have four possible values (0,0), (0,1), (1,0), (1,1) with corresponding joint probabilities p_{00}, p_{01}, p_{10}, p_{11}, where p_{11} : p_{10} : p_{01} : p_{00} = exp(φ + η_{10} + η_{01}) : exp(η_{10}) : exp(η_{01}) : 1. This setup guarantees that the conditional distributions of the outcomes follow the two stated logistic models. The log odds ratio φ and the linear coefficients were set so that the true prevalences based on the two datasets were p_{1} = p_{11} + p_{10} = 0.3 and p_{2} = p_{11} + p_{01} = 0.3, 0.31, 0.32, 0.33, or 0.35.
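The joint probabilities above can be used to draw (Y_1, Y_2) directly. The following Python sketch (with illustrative values of η_10, η_01, and φ, not the coefficients used in the study) draws pairs with probabilities proportional to exp(φ + η_10 + η_01) : exp(η_10) : exp(η_01) : 1 and checks the resulting prevalences:

```python
import math
import random

def draw_pair(eta10, eta01, phi, rng):
    """Draw (Y1, Y2) with joint probabilities proportional to
    p11 : p10 : p01 : p00 = exp(phi+eta10+eta01) : exp(eta10) : exp(eta01) : 1."""
    weights = [math.exp(phi + eta10 + eta01),  # (1, 1)
               math.exp(eta10),                # (1, 0)
               math.exp(eta01),                # (0, 1)
               1.0]                            # (0, 0)
    outcomes = [(1, 1), (1, 0), (0, 1), (0, 0)]
    return rng.choices(outcomes, weights=weights)[0]

# Illustrative, symmetric parameters: both prevalences should be near 0.5.
rng = random.Random(1)
pairs = [draw_pair(eta10=-1.0, eta01=-1.0, phi=2.0, rng=rng)
         for _ in range(20000)]
p1 = sum(y1 for y1, _ in pairs) / len(pairs)  # prevalence by Y1
p2 = sum(y2 for _, y2 in pairs) / len(pairs)  # prevalence by Y2
```

With these weights, the conditional law of Y_1 given Y_2 is exactly logit P(Y_1 = 1 | Y_2) = η_10 + φY_2, and symmetrically for Y_2, which is the property the simulation relies on.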
Then, we randomly selected subjects for the health survey (n_{1} = 250, 500, or 1000) and the EHR (n_{2} = 100,000) with inclusion probabilities given by logit π_{1i} = a_{0} + a_{1}u_{1i} + a_{2}u_{2i} + a_{3}x_{1i} and logit π_{2i} = b_{0} + b_{1}u_{1i} + b_{2}u_{2i} + b_{3}x_{1i}, where u_{1} was an independent Bernoulli variable with success probability 0.5 and u_{2} was a standard normal variable. We set (a_{0}, a_{1}, a_{2}, a_{3}) = (b_{0}, b_{1}, b_{2}, b_{3}) = (1, 1, 1, 0.187). x_{1}, the predictor of misclassification, was also included as a survey design factor so that the missing mechanism is missing at random but not missing completely at random. Then, we selected additional EHR subjects among the health survey participants so that the proportion of health survey participants who were also in the EHR was 20%, 50%, or 100%. Finally, we deleted the values of Y_{1} and π_{1} for the subjects not in the health survey and of Y_{2} for the subjects not in the EHR. All π_{2} values were deleted, as inclusion probabilities are unknown in typical EHR.
For each simulated survey and EHR, we used u_{1}, u_{2}, and x_{1} to calculate post-stratified weights w_{2} for the EHR. Then we calculated four prevalence estimates: the estimator based only on the health survey, the estimator based only on the EHR, the Mosteller estimator, and the subject-level imputation estimator. For the subject-level imputation estimator, we included burn-in iterations and combined the inferences of M = 30 multiple imputations. The overall process of generating the target population, sampling the health survey and the EHR from the population, and calculating the prevalence estimates was repeated 200 times.
Table 1 shows the average prevalence estimates by the four estimators. The size of the health survey (n_{1}) and the number of subjects linked between the two sources (n_{12}) were both 500. The health survey estimator was unbiased in all settings. In contrast, the EHR estimator was biased except when there was no misclassification bias (i.e., p_{2} = 0.3), in which case post-stratification successfully adjusted for the selection bias. Both the Mosteller estimator and the subject-level imputation estimator showed less than 3% bias in all settings.
Table 2 shows the MSE of the estimators. When the bias was less than or equal to 5% (i.e., p_{2} = 0.3 or 0.31), the EHR estimator outperformed the health survey estimator due to its larger sample size. When the bias was more substantial, however, it overwhelmed the benefit of the sample size. In that case, the subject-level imputation estimator and the Mosteller estimator performed better than the estimators based on either source alone. Notably, they either outperformed or were similar to the health survey estimator in all settings. Between the two, the Mosteller estimator performed better than the subject-level imputation estimator when the bias was small to moderate (p_{2} = 0.3–0.33), but worse when the bias was large (p_{2} = 0.35).
We studied how the size of the health survey and the number of subjects linked between the two sources affect performance (Table 3). We fixed the true prevalence (p_{1}) at 0.3 and the prevalence (p_{2}) measured from the EHR (Y_{2}) at 0.32. The EHR estimator performed best when the health survey was small (n_{1} = 250), but the Mosteller estimator performed best when the health survey size was moderate (n_{1} = 500 or 1000). The subject-level imputation estimator requires a sufficient number of subjects linked between the two sources. Mosteller's method, on the other hand, performed well in most settings.
Analysis of NYC Macroscope and NYC HANES
We illustrate the methods with data from NYC. To protect patient privacy, the authors did not directly access the data but submitted R code to the NYC Department of Health and Mental Hygiene (DOHMH) and received back the results of the joint analysis of the two data sources presented below.
Description of data sources
NYC Macroscope is an EHR-based surveillance system developed by the NYC DOHMH in collaboration with the City University of New York School of Public Health to estimate the prevalence of chronic diseases and risk factors for the adult population (20 years or older) in care by participating primary care providers in NYC [2, 5]. The data were available only as aggregate data stratified by age group, sex, and neighborhood poverty level. Detailed provider and patient inclusion and exclusion criteria are documented elsewhere [2]. In this study, we used the 2013 data, which included 716,076 patients.
The 2013–14 NYC HANES is a population-representative survey of NYC residents aged 20 or older (n = 1527) with interviews, physical examinations, and biospecimen collection [24]. The data used in this study were limited to in-care participants (i.e., participants who had seen a provider for primary care in the previous year; n = 1135). Recently, a chart review study was conducted among a subsample (n = 190) of in-care participants from NYC HANES (Fig. 1) [5]. In that study, charts from more than 20 EHR software systems used by primary care providers were abstracted for the chart review participants, and the data were linked to the NYC HANES data at the individual level. The chart review sample consisted of participants who received primary care from NYC Macroscope or non-NYC Macroscope providers. Because there was little difference in demographic and clinical characteristics between the two groups, we used data from all participants in this study. The chart review was performed on subjects enrolled in the 2013–14 NYC HANES (n = 1524) who had a doctor's visit during the year (n = 1135), signed a consent form and a Health Insurance Portability and Accountability Act (HIPAA) waiver (n = 491), and whose EHR were available and valid (n = 190).
Definition of health indicators
We selected six health indicators in the sources to demonstrate the methods: hypertension diagnosis, diabetes diagnosis, smoking, obesity, depression, and influenza vaccination. Newton-Dame and her colleagues describe these indicators in detail [2]. Hypertension diagnosis was defined as systolic blood pressure ≥ 140 mmHg, diastolic blood pressure ≥ 90 mmHg, or an existing record of hypertension diagnosis (based on ICD-9 codes in NYC Macroscope and self-report in NYC HANES). The diabetes indicator was based on the presence of an ICD-9 diagnosis in NYC Macroscope and on self-report in NYC HANES. Smoking was based on an indication of 'current smoking' in the most recent smoking status in NYC Macroscope and on a self-report of current smoking in NYC HANES. The obesity indicator was based on the most recent body mass index (BMI) ≥ 30 in NYC Macroscope and on the height and weight measured at the interview in NYC HANES. The depression indicator was based on the presence of an ICD-9 depression diagnosis ever recorded or a Patient Health Questionnaire (PHQ-9) score ≥ 10 in NYC Macroscope and on a self-reported diagnosis or a PHQ-9 score ≥ 10 at the interview in NYC HANES. The influenza vaccination indicator was based on the presence of a relevant ICD-9/CPT/CVX code in NYC Macroscope and on the self-report of receiving an influenza vaccination in the past 12 months in NYC HANES.
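The threshold-based indicators above can be expressed as simple decision rules. A sketch in Python (the thresholds follow the definitions above; the helper functions themselves are hypothetical, not part of either system's code):

```python
def hypertension(sbp, dbp, diagnosed):
    """Hypertension: SBP >= 140 mmHg, DBP >= 90 mmHg, or an existing diagnosis."""
    return sbp >= 140 or dbp >= 90 or diagnosed

def obesity(weight_kg, height_m):
    """Obesity: most recent body mass index (kg/m^2) of 30 or more."""
    bmi = weight_kg / height_m ** 2
    return bmi >= 30

def depression(diagnosed, phq9_score):
    """Depression: a recorded diagnosis or a PHQ-9 score of 10 or more."""
    return diagnosed or phq9_score >= 10

# A hypothetical subject: normotensive, obese, PHQ-9 below threshold.
print(hypertension(sbp=118, dbp=76, diagnosed=False),
      obesity(weight_kg=102, height_m=1.75),
      depression(diagnosed=False, phq9_score=6))  # prints: False True False
```

Subject-level misclassification arises precisely because the inputs to these rules differ between the sources (e.g., measured height and weight in NYC HANES versus the most recent recorded BMI in NYC Macroscope).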
Illustration of the methods on NYC data
The NYC Macroscope used post-stratification to address the selection bias of the Macroscope data [25, 26] by matching the joint distribution of gender, age group, and neighborhood-level poverty to that of the city's in-care population. The prevalence estimates among the in-care city population based on NYC HANES and NYC Macroscope were close for hypertension diagnosis (NYC HANES 34.3% vs. NYC Macroscope 33.7%), moderately different for diabetes diagnosis (13.3% vs. 14.8%), smoking (17.3% vs. 15.9%), and obesity (31.7% vs. 29.1%), and significantly different for depression (19.0% vs. 8.6%) and influenza vaccination (48.6% vs. 21.2%). The discrepancies in the depression prevalence and influenza vaccination rate were likely due to the underdiagnosis of depression in primary care settings and to influenza vaccinations outside of clinics (e.g., in pharmacies) that are not recorded by the primary care EHR. The population characteristics in NYC HANES and NYC Macroscope for the adult in-care population are described elsewhere [27].
We estimated prevalence with the four estimators: the estimator based only on NYC HANES, the estimator based only on the Macroscope data, the Mosteller estimator, and the subject-level imputation estimator. We assumed that NYC HANES was the gold standard, since its data were collected using a population-representative sample design with controlled and standardized data collection. The chart review study with 190 subjects whose identities were linked between NYC HANES and NYC Macroscope enabled us to calculate the subject-level imputation estimates, for which we used age group, sex, and neighborhood poverty level as covariates in the inclusion and misclassification models. There was a lack of predictors that could properly model the misclassifications in the EHR, such as hospital size, instrument labels, or types of visits.
The Mosteller prevalence estimates showed improvement over both the NYC HANES and NYC Macroscope estimates (Table 4). For all six health outcomes, they showed smaller standard errors than the NYC HANES estimates and smaller biases than the Macroscope estimates. The bias reduction was especially substantial (> 99% reduction) for the depression and influenza vaccination estimates because, for these indicators, the EHR estimates were given little weight (Table 5). On the other hand, the subject-level imputation estimates did not outperform the NYC HANES estimates: their credibility intervals were wider than those of the NYC HANES estimates. This was due to the lack of predictors, as mentioned above, that could model the mechanism of misclassification in the EHR. The subject-level imputation method requires us to correctly model the misclassification as well as to approximate the inclusion probabilities to the health survey for the EHR subjects.
Table 4 also demonstrates that the selection bias in the Macroscope was smaller than the bias due to subject-level misclassifications: the ranges of differences in prevalence estimates between the Macroscope and NYC HANES for diabetes, smoking, and obesity were similar with (1.6–3.7%) and without (1.5–2.6%) post-stratification. However, the range decreased to 0.4–0.6% for the Mosteller estimator. The ranges of differences in depression prevalence and influenza vaccination rate were also similar with (10.7–26.9%) and without (10.4–27.4%) post-stratification, but the range was reduced dramatically to 0.1% for the Mosteller estimator. This shows that post-stratification alone was insufficient to correct the bias in the EHR for these outcomes, whereas the Mosteller estimator and the subject-level imputation estimator both used NYC HANES as a safeguard against potential bias in the EHR.
Discussion
Compared to traditional health surveys, EHR have a much larger sample size and the potential to reduce the standard errors of prevalence estimates. They can be very helpful in estimating prevalence in small subgroups of the population. In NYC Macroscope and our simulation study, we found that the correction of the subject-level error in EHR is both necessary and possible.
In the simulation study, the health survey estimator was unbiased, but its standard error was the largest. In contrast, the bias in the EHR estimator can overwhelm the benefit of its sample size. When that happened, both the Mosteller estimator and the subject-level imputation estimator yielded negligible bias and small standard errors: they either outperformed or were comparable to the estimators based solely on either source. The subject-level imputation estimator may outperform the Mosteller estimator when the EHR bias is large. However, it requires a sufficient number of subjects linked between the two sources and correct modeling of the misclassification mechanism as well as of the inclusion probabilities to both sources.
The difficulty of such a task was demonstrated in the analysis of the NYC data. The Mosteller estimators showed considerably smaller standard errors than the NYC HANES estimates, especially when the NYC Macroscope and NYC HANES estimates were close. The subject-level imputation estimator did not outperform the NYC HANES estimator, in part due to a lack of predictors for misclassification. The predictors of misclassification can be both patient-level characteristics, such as the type of visit, and institution-level characteristics, such as hospital size or instrument labels. These variables are typically found in EHR (or in administrative data sets that accompany EHR), while some patient characteristics will still be found in a health survey. In practice, the fit of the misclassification model should guide the choice between the considered approaches, that is, whether to model the underlying mechanism of misclassification or to use Mosteller's estimator. This can be done, for example, by cross-validated estimation of the area under the receiver operating characteristic (ROC) curve as one moves the probability cutoff in the logistic regression model M2.
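The cross-validated AUC mentioned above rests on the equivalence between the area under the ROC curve and the Mann–Whitney statistic. A minimal Python sketch of the AUC computation on one fold (the fitted probabilities and outcomes below are made up; in practice they would be held-out predictions from the misclassification model M2):

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic: the
    probability that a random positive outscores a random negative
    (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical held-out fitted probabilities vs. observed outcomes.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(auc(scores, labels))  # 8/9: eight of nine positive-negative pairs ranked correctly
```

Averaging this quantity over held-out folds gives the cross-validated AUC; a value near 0.5 would signal that the available predictors cannot discriminate the misclassifications, favoring Mosteller's estimator over the imputation approach.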
In this article, we considered the health survey as the gold standard. We acknowledge that survey measurements are rarely unbiased. However, it is often helpful to treat one survey as the gold standard over another. For example, investigators have treated a smaller in-person survey as the gold standard over a larger telephone survey [10], or clinical measurements as the gold standard over self-reported outcomes [14, 28]. EHR are often administrative data collected for billing purposes with non-standardized instruments and protocols and with complex, unknown inclusion mechanisms. NYC HANES was designed for health survey purposes with standardized instruments and protocols and was collected by representative probability sampling. We assumed that the typical bias treatments for the health survey, such as post-stratification and calibration for non-response bias, have been successfully performed.
Conclusions
We demonstrated that the joint use of a small gold-standard health survey with a larger EHR improves the accuracy of prevalence estimation. Depending on the available data, one can aim to model the misclassification completely or calculate a weighted average of the prevalence estimates from the two sources. The studied approaches can improve the quality of EHR as a public health surveillance tool. In ongoing work, we are extending the methods to model subgroup-level prevalence estimates from health surveys and EHR.
Availability of data and materials
This study is a secondary analysis of two data sources that are not owned by the authors. Readers can inquire about the data by visiting the NYC HANES Project website (http://nychanes.org) or by contacting NYC DOHMH.
Abbreviations
BMI: Body mass index
CPT: Current Procedural Terminology
CVX: Vaccine administered
DOHMH: Department of Health and Mental Hygiene
EHR: Electronic health records
ICD-9: International Classification of Diseases, ninth revision
MLE: Maximum likelihood estimator
MSE: Mean squared error
NHIS: National Health Interview Survey
NYC: New York City
NYC HANES: New York City Health and Nutrition Examination Survey
PHQ-9: Patient Health Questionnaire-9
References
1. Paul MM, Greene CM, Newton-Dame R, Thorpe LE, Perlman SE, McVeigh KH, et al. The state of population health surveillance using electronic health records: a narrative review. Popul Health Manag. 2015;18(3):209–16.
2. Newton-Dame R, McVeigh KH, Schreibstein L, Perlman S, Lurie-Moroni E, Jacobson L, et al. Design of the New York City Macroscope: innovations in population health surveillance using electronic health records. EGEMS (Washington, DC). 2016;4(1):1265.
3. Thorpe LE, McVeigh KH, Perlman S, Chan PY, Bartley K, Schreibstein L, et al. Monitoring prevalence, treatment, and control of metabolic conditions in New York City adults using 2013 primary care electronic health records: a surveillance validation study. EGEMS (Washington, DC). 2016;4(1):1266.
4. McVeigh KH, Newton-Dame R, Chan PY, Thorpe LE, Schreibstein L, Tatem KS, et al. Can electronic health records be used for population health surveillance? Validating population health metrics against established survey data. EGEMS (Washington, DC). 2016;4(1):1267.
5. McVeigh KH, Lurie-Moroni E, Chan PY, Newton-Dame R, Schreibstein L, Tatem KS, et al. Generalizability of indicators from the New York City Macroscope electronic health record surveillance system to systems based on other EHR platforms. EGEMS (Washington, DC). 2017;5(1):25.
6. Thompson ME. International surveys: motives and methodologies. Surv Methodol. 2008;34(2):131–41.
7. Lohr SL, Brick JM. Blending domain estimates from two victimization surveys with possible bias. Can J Stat. 2012;40(4):679–96.
8. Manzi G, Spiegelhalter DJ, Turner RM, Flowers J, Thompson SG. Modelling bias in combining small area prevalence estimates from multiple surveys. J R Stat Soc Ser A. 2011;174:31–50.
9. Mosteller F. On pooling data. J Am Stat Assoc. 1948;43(242):231–42.
10. Raghunathan TE, Xie D, Schenker N, Parsons VL, Davis WW, Dodd KW. Combining information from two surveys to estimate county-level prevalence rates of cancer risk factors and screening. J Am Stat Assoc. 2007;102(478):474–86.
11. Ybarra LMR, Lohr SL. Small area estimation when auxiliary information is measured with error. Biometrika. 2008;95(4):919–31.
12. Kim J, Rao J. Combining data from two independent surveys: a model-assisted approach. Biometrika. 2012;99(1):85–100.
13. Park S, Kim JK, Stukel D. A measurement error model for survey data integration: combining information from two surveys. Metron. 2017;75:345–57.
14. Schenker N, Raghunathan TE, Bondarenko I. Improving on analyses of self-reported data in a large-scale health survey by using information from an examination-based survey. Stat Med. 2010;29(5):533–45.
15. Gelman A, King G, Liu C. Not asked and not answered: multiple imputation for multiple surveys: rejoinder. J Am Stat Assoc. 1998;93(443):869–74.
16. He Y, Landrum MB, Zaslavsky AM. Combining information from two data sources with misreporting and incompleteness to assess hospice use among cancer patients: a multiple imputation approach. Stat Med. 2014;20(33):3710–24.
17. Gelman A, Su Y. arm: data analysis using regression and multilevel/hierarchical models. http://cran.r-project.org/web/packages/arm; 2011.
18. R Core Team. R: a language and environment for statistical computing; 2016.
19. Wang Z, Kim JK, Yang S. Approximate Bayesian inference under informative sampling. Biometrika. 2017;105(1):91–102.
20. Gelman A, Hill J. Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press; 2006.
21. Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley; 2006.
22. Barnard J, Rubin DB. Small-sample degrees of freedom with multiple imputation. Biometrika. 1999;86(4):948–55.
23. van Buuren S, Groothuis-Oudshoorn K. mice: multivariate imputation by chained equations in R. J Stat Softw. 2011;45(3):67.
24. Thorpe LE, Greene C, Freeman A, Snell E, Rodriguez-Lopez JS, Frankel M, et al. Rationale, design and respondent characteristics of the 2013–2014 New York City health and nutrition examination survey (NYC HANES 2013–2014). Prev Med Rep. 2015;2:580–5.
25. Lumley T. Analysis of complex survey samples. J Stat Softw. 2004;9(8):19.
26. Valliant R. Poststratification and conditional variance estimation. J Am Stat Assoc. 1993;88(421):89–96.
27. Chan PY, Zhao Y, Lim S, Perlman SE, McVeigh KH. Using calibration to reduce measurement error in prevalence estimates based on electronic health records. Prev Chronic Dis. 2018;15:E155.
28. Raghunathan TE. Combining information from multiple surveys for assessing health disparities. Allg Stat Arch. 2006;90:515–26.
Acknowledgements
We are grateful to NYC DOHMH, which agreed to publication of the study, helped us describe its data, and ran our R code on its data sources. We thank the reviewers for their thoughtful comments, which improved the clarity and quality of the manuscript.
Funding
The analysis of NYC data began while the authors worked as paid statistical consultants for NYC DOHMH. The authors sent R code to NYC DOHMH and received back the results along with the standard descriptions of the two data sources. As a courtesy, NYC DOHMH was given the opportunity to review the manuscript.
Author information
Affiliations
Contributions
RK developed the statistical methods, performed analysis of NYC data and simulation studies, interpreted the results, and wrote the manuscript. VS also performed analysis on NYC data and edited the manuscript. The authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
This study reports aggregate prevalence estimates of six health outcomes among NYC adult in-care residents. To produce these estimates, the authors did not directly access the data but submitted R code to NYC DOHMH and received back the prevalence estimates from the joint analysis of two data sources: the 2013–14 NYC HANES and the NYC Macroscope. The 2013–2014 NYC HANES was approved by the NYC DOHMH and City University of New York School of Public Health institutional review boards, and the chart review study was approved by the NYC DOHMH institutional review board.
Consent for publication
Not applicable.
Competing interests
Drs. Kim and Shankar have provided paid statistical consultation to NYC DOHMH on projects, including the joint analysis of Macroscope and NYC HANES data.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Kim, R.S., Shankar, V. Prevalence estimation by joint use of big data and health survey: a demonstration study using electronic health records in New York city. BMC Med Res Methodol 20, 77 (2020). https://doi.org/10.1186/s12874-020-00956-6
DOI: https://doi.org/10.1186/s12874-020-00956-6
Keywords
 Big data
 Electronic health records
 Multiple imputations
 Measurement error
 Selection bias
 Population health surveillance