Prevalence estimation by joint use of big data and health survey: a demonstration study using electronic health records in New York City

Background Electronic health records (EHR) have been increasingly used as a tool to monitor population health. However, subject-level errors in the records can yield biased estimates of health indicators. There is an urgent need for methods that estimate the prevalence of health indicators from large, real-time EHR while correcting the potential bias. Methods We demonstrate joint analyses of EHR and a smaller gold-standard health survey. We first adopted Mosteller's method, which pools two estimators, one of which is potentially biased. It requires only the prevalence estimates from the two data sources and their standard errors. Then, we adopted the method of Schenker et al., which uses multiple imputations of the subject-level health outcomes that are missing for the subjects in EHR. This procedure requires information to link some subjects between the two sources, a model of the misclassification mechanism in EHR, and models of the inclusion probabilities for both sources. Results In a simulation study, both estimators yielded negligible bias even when EHR was biased. They performed as well as the health survey estimator when the EHR bias was large and better than the health survey estimator when the EHR bias was moderate. For the subject-level imputation estimator, modeling the misclassification mechanism in real data may be challenging. We illustrated the methods by analyzing six health indicators from the 2013-14 NYC HANES and the 2013 NYC Macroscope, together with a study that linked some subjects in both data sources. Conclusions When a small gold-standard health survey exists, it can serve as a safeguard against potential bias in EHR through the joint analysis of the two sources.


Background
Electronic health records (EHR) have been increasingly used as a tool for public health surveillance by local and national jurisdictions [1]. For example, recent studies in New York City (NYC) reported that the prevalence estimates from NYC Macroscope, an EHR-based surveillance system in NYC [2], were comparable to the survey-based estimates for diabetes, hypertension, and smoking [3,4]. EHR often cover more people (n ≥ 100,000) than traditional population health surveys, and once the infrastructure is in place, data collection occurs in near real-time without additional recruitment or interviewing cost.
Despite these advantages, prevalence estimates from EHR can be biased, mainly for two reasons. The first is selection bias: EHR may not represent the target population. For example, the patient population of NYC Macroscope under-represents young men and over-represents patients living in high-poverty neighborhoods. It also includes only patients who visit primary care doctors connected to a particular EHR system [2]. Selection bias can be corrected by post-stratification if it is modeled correctly. The other source of error is the misclassification of health outcomes, which is the main interest of our study. It comprises measurement error (e.g., due to the use of non-standardized instruments across sites), extraction error, and the collection of proxy measurements (e.g., due to recording self-reports and actual measurements without distinction). McVeigh et al. [3] reported such subject-level discrepancies by examining a chart review of participants who both visited NYC Macroscope providers and participated in the NYC Health and Nutrition Examination Survey (HANES), a population-representative survey with field interviews and biospecimen collection. Treating NYC HANES measurements as the "gold standard," the chart review found a 5% subject-level error for obesity, 19% for depression, and 19% for influenza vaccination. Notably, the sensitivity (i.e., the proportion of the medical conditions identified in NYC HANES that were also indicated in the EHR) was as low as 31% for depression and 19% for influenza vaccination. In a later study, McVeigh et al. [5] extracted chart data from more than 20 additional EHR software systems from primary care providers and repeated a similar study for 190 participants of the 2013-14 NYC HANES. For public health surveillance systems that use EHR, there is an urgent need for methods that estimate the prevalence of health indicators from large, real-time EHR while correcting the potential bias using external sources.
Many existing methods allow investigators to pool multiple data sources, and some may be suitable for the unique context of combining big data with a small gold-standard survey. They can be classified by whether the subjects are linked at the individual level and whether potential biases are accounted for. For data sources that are unavailable at the individual level, aggregate statistics are pooled from the sources. For example, Thompson [6] developed methods to combine aggregate statistics from standardized surveys by an international tobacco control project to find programs that are effective in reducing tobacco use. She studied several approaches, including a model with random effects for the country. However, her model assumed that all surveys were equally likely to be biased and that the biases across countries canceled each other out. A handful of works address pooling a gold-standard source with potentially biased sources [7][8][9][10][11]. Earlier, Mosteller [9] studied ways to combine the means from two samples when one is potentially biased. Mosteller's estimator, chosen as one end of the spectrum of methods, is discussed further in the following section. Lohr and Brick [7] explored methods for pooling domain-level estimates from two surveys that measure victimization prevalence: their gold-standard survey, the United States National Crime Victimization Survey, and a larger but potentially biased telephone companion survey. In their study, they compared ten methods that combine a gold-standard survey with another biased data source. The methods included calibration methods and weighted averages of the estimators from the two sources without any bias adjustment (i.e., unadjusted dual-frame estimators), with bias adjustment pooled across the domains, and with domain-specific bias adjustment. The last method performed the best. Another estimator that performed well was the multiplicative bias estimator published earlier [11]. Manzi et al. [8] used a Bayesian hierarchical model to pool domain-level smoking prevalence estimates from seven surveys in the eastern regions of England. Similarly, Raghunathan et al. [10] used a Bayesian hierarchical model to combine potentially biased county-level prevalence estimates of cancer outcomes and risk factors from a larger telephone survey, the Behavioral Risk Factor Surveillance System, with an unbiased (or less biased) face-to-face survey, the National Health Interview Survey (NHIS), covering fewer counties and fewer households.
When data are available at the individual level, Kim and Rao [12] developed a method to combine a small survey with outcome measurement and auxiliary information with a larger independent survey with only auxiliary information. Park et al. [13] developed a model to pool one gold-standard source with outcome measurement and auxiliary information with another independent source with a potentially biased outcome and the same auxiliary measure. Schenker et al. [14] used multiple imputations to combine self-reported outcomes from a large survey, NHIS, with a smaller survey, NHANES, that has both clinical and self-reported outcomes. They imputed clinical measurements of health outcomes for the participants of the larger survey by modeling both the underlying mechanism of misclassification of outcomes and the mechanism of inclusion in each survey. We study this method further in the following section as the other end of the spectrum of methods. For more than two proxy outcome variables measured with lagged overlaps, Gelman et al. [15] and He et al. [16] used similar multiple imputation approaches.
In this study, we aim to demonstrate that the joint analysis of a large EHR with a much smaller gold-standard health survey can improve the accuracy of the prevalence estimates. Our aim is not to study all available methods but to demonstrate two statistical procedures at the two ends of the spectrum. First, we adopt Mosteller's method [9] to pool two estimators when one is potentially biased. It requires only the prevalence estimates from the two data sources and their standard errors. Second, we adopt the method of Schenker et al. [14], which uses iterative multiple imputations of subject-level health outcomes for both sources. This procedure requires information to link some subjects between the two sources, a model of the mechanisms underlying the misclassifications in EHR, and models of the inclusion probabilities for both sources. We demonstrate the statistical properties of the two estimators using simulation studies. Finally, we illustrate these methods by analyzing the 2013-14 NYC HANES and the 2013 NYC Macroscope, together with a small study that linked some subjects between the two sources.

Methods
We consider two data sources. The first is a health survey of a smaller sample $S_1$ with survey weights $w_1$ that is representative of the target population. The measurement $Y_1$ in the survey is the gold standard, and hence $\hat{p}_1 = \sum_{i \in S_1} w_{1,i} Y_{1,i} / \sum_{i \in S_1} w_{1,i}$ is an unbiased estimator of the prevalence of interest $p_1$. The other data source is an EHR of a larger sample $S_2$ that becomes representative of the population with post-stratified weights $w_2$. The measurement $Y_2$ in the EHR may have subject-level errors, and $\hat{p}_2 = \sum_{i \in S_2} w_{2,i} Y_{2,i} / \sum_{i \in S_2} w_{2,i}$ may be a biased estimator of $p_1$. We denote the logit of the prevalence as $\phi_1 = \mathrm{logit}(p_1)$ and the logits of the prevalence estimators from the two sources as $y_1 = \mathrm{logit}(\hat{p}_1)$ and $y_2 = \mathrm{logit}(\hat{p}_2)$. We assume that the covariance between the two estimators is ignorable, since the number of overlapping subjects ($S_1 \cap S_2$) is typically very small relative to the size of the EHR ($S_2$). We can link a subset of the overlapping subjects ($S_c$) between the two sources. Figure 1 outlines the data structure. We used the statistical software R for all statistical analyses [17,18].
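As a small illustration of the notation above, the following Python sketch computes a weighted prevalence and its logit. The function names and the toy data are hypothetical and are not taken from the study, which used R.

```python
import math

def weighted_prevalence(y, w):
    """Weighted proportion: sum(w_i * y_i) / sum(w_i)."""
    if len(y) != len(w) or len(y) == 0:
        raise ValueError("y and w must be non-empty and of equal length")
    return sum(wi * yi for yi, wi in zip(y, w)) / sum(w)

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def expit(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))

# Toy binary outcomes and weights (made up for illustration).
outcomes = [1, 0, 0, 1, 0]
weights = [2.0, 1.0, 1.0, 3.0, 3.0]

p_hat = weighted_prevalence(outcomes, weights)  # (2 + 3) / 10
y_logit = logit(p_hat)                          # logit-scale estimate
```

Working on the logit scale keeps pooled estimates and interval endpoints inside (0, 1) once they are transformed back with `expit`.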

Mosteller estimator
At the core of the problem is a simple question: "Can we gain by pooling two estimates when one is possibly biased but from a larger sample?" Earlier, Mosteller (1948) [9] studied whether to pool two sample means when one is potentially biased. He compared the mean squared error (MSE) of various mean estimators: the unbiased mean, the test-then-pool estimator (i.e., pooling two means only when the mean difference was not significant), and the maximum likelihood estimator (MLE) assuming mean-zero Gaussian bias. The last estimator showed the smallest MSE. We adapt his approach to account for unequal sample sizes and unequal variances. The estimator is a weighted average of $y_1$ and $y_2$: $\hat{\phi}_M = (k_1 y_1 + k_2 y_2)/(k_1 + k_2)$. It can be shown that the MSE of this family of estimators is minimized when $k_1 = 1/\sigma_1^2$ and $k_2 = 1/(\tau^2 + \sigma_2^2)$, where $\sigma_1$ and $\sigma_2$ are the standard errors of $y_1$ and $y_2$, and $\tau = E(y_2) - \phi_1$ is the bias of $y_2$. The estimator is also the MLE of $\phi_1$ under the model $y_j = \phi_1 + 1(j = 2)\theta + e_j$, where $\theta$ and $e_j$ are mutually independent zero-mean normal variables with variances $\tau^2$ and $\sigma_j^2$, respectively. The variance and bias parameters were replaced by consistent estimators; $\hat{\sigma}_j^2$ can be the sample variance estimated using survey weights.
The same estimator can also be derived from an approximate Bayesian perspective [19] by setting a prior on the asymptotically normal sampling distribution of $y_j$. If we set a non-informative prior (i.e., normal with infinite variance) on $\phi_1$ and a zero-mean normal prior with variance $\tau^2$ on the bias $E(y_2) - \phi_1$, then the posterior distribution of $\phi_1$ can be shown to be normal with mean $\hat{\phi}_M$ and variance $\sigma_1^2(\sigma_2^2 + \tau^2)/(\sigma_1^2 + \sigma_2^2 + \tau^2)$. Here $\tau$ measures the prior belief in the closeness of the prevalence measured by EHR to that measured by the health survey. The 95% highest density credible interval of the logit prevalence is given as $\hat{\phi}_M \pm 1.96\sqrt{\hat{\sigma}_1^2(\hat{\sigma}_2^2 + \hat{\tau}^2)/(\hat{\sigma}_1^2 + \hat{\sigma}_2^2 + \hat{\tau}^2)}$. The estimator, while less efficient than the subject-level imputation estimator below, is simpler to implement for practitioners, who often do not have the resources to link subjects in the two sources or to model the mechanisms of the misclassifications in EHR.
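The optimal weights stated above can be checked with a short derivation. The sketch below assumes, as in the text, that $y_1$ is unbiased with variance $\sigma_1^2$, that $y_2$ has bias $\tau$ and variance $\sigma_2^2$, and that the two estimators are uncorrelated. The mixing proportion $\lambda = k_1/(k_1 + k_2)$ is notation introduced here for convenience only.

```latex
\begin{align*}
\hat{\phi}(\lambda) &= \lambda y_1 + (1-\lambda)\, y_2,
  \qquad \lambda = \frac{k_1}{k_1 + k_2},\\[4pt]
\mathrm{MSE}(\lambda) &= \lambda^2 \sigma_1^2 + (1-\lambda)^2 \left(\sigma_2^2 + \tau^2\right),\\[4pt]
\frac{d\,\mathrm{MSE}}{d\lambda} &= 2\lambda \sigma_1^2 - 2(1-\lambda)\left(\sigma_2^2 + \tau^2\right) = 0
  \;\Longrightarrow\;
  \lambda^{*} = \frac{\sigma_2^2 + \tau^2}{\sigma_1^2 + \sigma_2^2 + \tau^2}
             = \frac{1/\sigma_1^2}{1/\sigma_1^2 + 1/(\sigma_2^2 + \tau^2)},\\[4pt]
\mathrm{MSE}(\lambda^{*}) &= \frac{\sigma_1^2\left(\sigma_2^2 + \tau^2\right)}{\sigma_1^2 + \sigma_2^2 + \tau^2}.
\end{align*}
```

The minimizing proportion corresponds to $k_1 = 1/\sigma_1^2$ and $k_2 = 1/(\tau^2 + \sigma_2^2)$. When $\tau^2$ is large, $\lambda^{*}$ approaches 1 and the pooled estimate falls back on the health survey alone, so the minimized MSE never exceeds $\sigma_1^2$.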
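A minimal Python sketch of the pooled estimate and its normal-approximation interval follows. The function name `mosteller_pool` and the numeric inputs are hypothetical, and the squared bias `tau2` is treated as a user-supplied value because the estimator of the bias parameter is not reproduced here.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def mosteller_pool(y1, se1, y2, se2, tau2):
    """Pool two logit-scale estimates.

    y1, se1: estimate and standard error from the gold-standard survey.
    y2, se2: estimate and standard error from the possibly biased source.
    tau2:    assumed squared bias of y2 (supplied by the caller).
    """
    k1 = 1.0 / se1 ** 2
    k2 = 1.0 / (tau2 + se2 ** 2)
    phi_hat = (k1 * y1 + k2 * y2) / (k1 + k2)
    var = 1.0 / (k1 + k2)  # equals s1^2 (s2^2 + tau2) / (s1^2 + s2^2 + tau2)
    half_width = 1.96 * math.sqrt(var)
    return phi_hat, var, (phi_hat - half_width, phi_hat + half_width)

# Hypothetical inputs: survey prevalence 0.30, EHR prevalence 0.33.
y1, se1 = math.log(0.30 / 0.70), 0.10
y2, se2 = math.log(0.33 / 0.67), 0.01
phi_hat, var, (lower, upper) = mosteller_pool(y1, se1, y2, se2, tau2=0.05 ** 2)

# Transform the point estimate and interval back to the probability scale.
prevalence = expit(phi_hat)
prevalence_interval = (expit(lower), expit(upper))
```

Setting `tau2=0` reduces the function to ordinary inverse-variance pooling, while a very large `tau2` returns essentially the survey estimate and its variance.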

Subject-level imputation estimator
Misclassification model
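We adapted the approach of Schenker et al. [14] and modeled the misclassification between the binary outcomes of the $i$-th subject in the health survey ($Y_{1,i}$) and the EHR ($Y_{2,i}$) by two logistic regression models, M1: $\mathrm{logit}\,P(Y_{2,i} = 1 \mid Y_{1,i}, z_i) = \beta_{0l} + \beta_{1l} Y_{1,i} + \beta_{2l} z_i$ and M2: $\mathrm{logit}\,P(Y_{1,i} = 1 \mid Y_{2,i}, z_i) = \gamma_{0l} + \gamma_{1l} Y_{2,i} + \gamma_{2l} z_i$, where $z_i$ is a vector of predictors. Since the relationship may depend on the design factors of the surveys, the models are stratified into four levels ($l = 1, 2, 3, 4$) defined by the quartiles of the inclusion probabilities to the health survey, $q_{11}, q_{12}, q_{13}$, and to the EHR, $q_{21}, q_{22}, q_{23}$.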
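The stratification into four levels by quartiles of the inclusion probabilities can be sketched as follows. This is an illustrative Python fragment under assumptions: the helper names are hypothetical, the quartiles use the default "exclusive" rule of `statistics.quantiles`, and a value equal to a cut point is assigned to the higher level, none of which is specified in the text.

```python
from bisect import bisect_right
from statistics import quantiles

def quartile_cutpoints(inclusion_probs):
    """Return the three quartile cut points of the inclusion probabilities."""
    return quantiles(inclusion_probs, n=4)

def stratum(prob, cutpoints):
    """Map an inclusion probability to a level l in {1, 2, 3, 4}."""
    return bisect_right(cutpoints, prob) + 1

# Toy inclusion probabilities 0.01, 0.02, ..., 1.00 (made up for illustration).
probs = [i / 100 for i in range(1, 101)]
cuts = quartile_cutpoints(probs)
levels = [stratum(p, cuts) for p in probs]
```

Each subject's level then selects which set of stratum-specific regression coefficients applies to that subject.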

Model for inclusion to each source
Since the inclusion probabilities to the health survey ($\pi_{1i}$) are unknown for most EHR subjects, we model them as $\mathrm{logit}\,\pi_{1i} = a_0 + a_1 u_i$, where $u_i$ is a vector of survey design factors. The model is fit over all EHR subjects, weighted by their post-stratified weights ($w_2$). Similarly, we model the inclusion probability to the EHR as $\mathrm{logit}\,\pi_{2i} = b_0 + b_1 v_i$ and fit it over all health survey subjects, weighted by their survey weights.

Bayesian iterative regression imputation
While we are ultimately interested in imputing the missing health survey outcomes, (1) in Fig. 1, we follow Schenker et al. [14] and perform iterative imputations between two models: M1, to impute the missing EHR values, (2) in the figure, and M2, to impute the missing health survey values, (1) in the figure. This is repeated $B$ times. Imputing the missing EHR values, (2) in the figure, increases the sample size available for fitting M2, the model in which we are ultimately interested. The additional variation caused by using imputed values was accounted for by the multiple imputation standard error described below. The detailed procedure is as follows.
To impute missing $Y_{2,i}$, we divided the subjects $S_1 \cup S_2$ into four groups ($l = 1, \ldots, 4$) by the quartiles $q_{21}, q_{22}, q_{23}$ and, within each group, fit the Bayesian regression model M1 with a weakly informative prior for $\beta_l = (\beta_{0l}, \beta_{1l}, \beta_{2l})$ of independent Cauchy distributions with zero location and scale 2.5, first on the subjects $S_c$ whose identities can be linked between the two data sources. Then we drew a posterior sample of $\beta_l$ and, in turn, $Y_{2,i}$ conditional on $\beta_l$ for all health survey subjects missing $Y_{2,i}$. Subsequently, treating the imputed $Y_{2,i}$ as observed, we imputed missing $Y_{1,i}$ by dividing the subjects into four groups by $q_{11}, q_{12}, q_{13}$ and fitting the regression model M2 on all EHR subjects with independent Cauchy priors for $\gamma_l = (\gamma_{0l}, \gamma_{1l}, \gamma_{2l})$ with zero location and scale 2.5. We drew a posterior sample of $\gamma_l$ and, in turn, $Y_{1,i}$ for all EHR subjects missing $Y_{1,i}$. We iterated the fitting of models M1 and M2 $B$ times, treating imputed values from the previous step as observed and imputing the missing outcome variables, until convergence. Then we calculated a prevalence estimator $\hat{p}_m = \sum_{i \in S_2} w_{2i} \hat{Y}_{m,1,i} / \sum_{i \in S_2} w_{2i}$ based on the imputed health survey measurements of all EHR subjects. Note that outcome values were imputed only when they were missing; in other words, $\hat{Y}_{m,1,i} = Y_{1,i}$ for subjects whose health survey outcome was observed. Finally, we combined inferences from $M$ such multiple imputations into the prevalence estimator $\hat{p}_R$, which is unbiased when the specified models are correct. The standard error of $\hat{\phi}_R = \mathrm{logit}(\hat{p}_R)$ was estimated in the standard way for multiple imputation [20,21], which combines the naïve standard errors of the logit prevalence ($\hat{\phi}_m$) calculated from each $m$-th imputation with the variability of the estimates across imputations. Since the overlap between the two sources can be small, we used the Barnard-Rubin degrees of freedom [22,23] to compute credible intervals, first on the log-odds scale before transforming them to the probability scale.
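The combining step can be illustrated with the widely used multiple-imputation combining rules and the Barnard-Rubin small-sample degrees of freedom. The Python sketch below is an assumption-laden illustration rather than the study's code: the function name is hypothetical, pooling is done directly on the logit scale, and the complete-data degrees of freedom `df_complete` must be supplied by the caller.

```python
from statistics import mean, variance

def combine_imputations(estimates, std_errors, df_complete):
    """Combine M imputed-data estimates (for example, logit prevalences).

    Returns the pooled estimate, its standard error, and the
    Barnard-Rubin adjusted degrees of freedom.
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two imputations with matching lengths")
    if any(se <= 0 for se in std_errors) or df_complete <= 0:
        raise ValueError("standard errors and df_complete must be positive")

    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])   # average naive variance
    between = variance(estimates)                   # across-imputation variance
    total = within + (1.0 + 1.0 / m) * between

    # Fraction of the total variance attributable to the imputations.
    lam = (1.0 + 1.0 / m) * between / total
    df_obs = (df_complete + 1.0) / (df_complete + 3.0) * df_complete * (1.0 - lam)
    if lam == 0.0:
        df = df_obs
    else:
        df_old = (m - 1) / lam ** 2
        df = 1.0 / (1.0 / df_old + 1.0 / df_obs)
    return pooled, total ** 0.5, df

# Toy logit-scale estimates from three imputations (made up for illustration).
pooled, pooled_se, df = combine_imputations(
    [0.0, 0.2, 0.4], [0.1, 0.1, 0.1], df_complete=100
)
```

An interval on the logit scale would then use a t quantile with `df` degrees of freedom before transforming the endpoints back to the probability scale.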

Simulation studies
We performed simulation studies to assess the performance of the methods under various settings. We generated correlated binary outcomes $(Y_1, Y_2)$ of a target population ($N$ = 10,000,000) whose conditional distributions follow logistic models: $\mathrm{logit}\,P(Y_1 = 1 \mid Y_2) = \eta_{10} + \varphi Y_2$ and $\mathrm{logit}\,P(Y_2 = 1 \mid Y_1) = \eta_{01} + \varphi Y_1$. To do so, we first generated an independent Bernoulli variable $x_1$ with success probability 0.5 and a standard normal variable $x_2$. Then we generated the correlated binary outcomes $(Y_1, Y_2)$, which have four possible values, (0,0), (0,1), (1,0), and (1,1), with corresponding joint probabilities $p_{00}, p_{01}, p_{10}, p_{11}$, where $p_{11} : p_{10} : p_{01} : p_{00} = \exp(\varphi + \eta_{10} + \eta_{01}) : \exp(\eta_{10}) : \exp(\eta_{01}) : 1$. This setup guarantees that the conditional distributions of the outcomes are the two stated logistic models. The log odds ratio $\varphi$ and the linear coefficients were set so that the true prevalences based on the two datasets were $p_1 = p_{11} + p_{10} = 0.3$ and $p_2 = p_{11} + p_{01}$ = 0.3, 0.31, 0.32, 0.33, or 0.35. Then, we randomly selected subjects for the health survey ($n_1$ = 250, 500, or 1000) and the EHR ($n_2$ = 100,000) with inclusion probabilities proportional to $\pi_{1i}$ and $\pi_{2i}$, where $\mathrm{logit}\,\pi_{1i} = a_0 + a_1 u_{1i} + a_2 u_{2i} + a_3 x_{1i}$ and $\mathrm{logit}\,\pi_{2i} = b_0 + b_1 u_{1i} + b_2 u_{2i} + b_3 x_{1i}$. Here $u_1$ was an independent Bernoulli variable with success probability 0.5 and $u_2$ was a standard normal variable. We set $(a_0, a_1, a_2, a_3) = (b_0, b_1, b_2, b_3) = (1, 1, 1, 0.187)$. The predictor of misclassification, $x_1$, was also included as a survey design factor so that the missing mechanism is missing-at-random but not missing-completely-at-random. Then, we selected additional EHR subjects among the health survey participants so that the proportion of health survey participants who are also in the EHR was 20%, 50%, or 100%. Finally, we deleted the values of $Y_1$ and $\pi_1$ for the subjects not in the health survey and the values of $Y_2$ for the subjects not in the EHR. All $\pi_2$ values were deleted, as inclusion probabilities are unknown in a typical EHR.
For each simulated survey and EHR, we used $u_1$, $u_2$, and $x_1$ to calculate post-stratified weights $w_2$ for the EHR. Then we calculated four prevalence estimates: the estimator based only on the health survey, the estimator based only on the EHR, the Mosteller estimator, and the subject-level imputation estimator. For the subject-level imputation estimator, we included burn-in iterations and combined inferences from $M$ = 30 multiple imputations. The overall process of generating the target population, sampling the health survey and EHR from the population, and calculating the prevalence estimates was repeated 200 times.
Table 1 shows the average prevalence estimates from the four estimators. The size of the health survey ($n_1$) and the number of subjects linked between the two sources ($n_{12}$) were both 500. The health survey estimator was unbiased in all settings. In contrast, the EHR estimator was biased except when there was no misclassification bias (i.e., $p_2$ = 0.3), in which case post-stratification successfully adjusted for the selection bias. Both the Mosteller estimator and the subject-level imputation estimator showed less than 3% bias in all settings. Table 2 shows the MSE of the estimators. When the bias was less than or equal to 5% (i.e., $p_2$ = 0.3 or 0.31), the EHR estimator outperformed the health survey estimator because of its larger sample size. When the bias was more substantial, however, it overwhelmed the benefit of the sample size. In that case, the subject-level imputation estimator and the Mosteller estimator performed better than the estimators based on either source alone. Notably, they either outperformed or were similar to the health survey estimator in all settings. Between the two, the Mosteller estimator performed better than the subject-level imputation estimator when the bias was small to moderate ($p_2$ = 0.3-0.33) but worse when the bias was large ($p_2$ = 0.35).
We studied how the size of the health survey and the number of subjects linked between the two sources affect performance (Table 3). We fixed the true prevalence ($p_1$) at 0.3 and the prevalence ($p_2$) measured from the EHR ($Y_2$) at 0.32. The EHR estimator performed best when the health survey was small ($n_1$ = 250), but Mosteller's estimator performed best when the health survey was moderate in size ($n_1$ = 500, 1000). The subject-level imputation estimator requires a sufficient number of subjects linked between the two sources. Mosteller's method, on the other hand, performed well in most settings.
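The construction of the correlated binary pair can be checked numerically. The Python sketch below builds the four joint probabilities from a log odds ratio and two intercepts and draws one pair; the function names and the numeric values are hypothetical, and the intercepts are held constant here rather than depending on covariates.

```python
import math
import random

def joint_probs(log_or, eta10, eta01):
    """Joint distribution of (Y1, Y2) with p11 : p10 : p01 : p00 =
    exp(log_or + eta10 + eta01) : exp(eta10) : exp(eta01) : 1."""
    weights = {
        (1, 1): math.exp(log_or + eta10 + eta01),
        (1, 0): math.exp(eta10),
        (0, 1): math.exp(eta01),
        (0, 0): 1.0,
    }
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

def draw_pair(probs, rng):
    """Draw one (Y1, Y2) pair by inverting the cumulative distribution."""
    u = rng.random()
    cumulative = 0.0
    for pair, p in probs.items():
        cumulative += p
        if u < cumulative:
            return pair
    return pair  # guards against floating-point rounding at the upper end

def logit(p):
    return math.log(p / (1.0 - p))

# Hypothetical parameter values for illustration.
probs = joint_probs(log_or=1.5, eta10=-1.0, eta01=-0.8)
p1 = probs[(1, 1)] + probs[(1, 0)]   # marginal prevalence of Y1
p2 = probs[(1, 1)] + probs[(0, 1)]   # marginal prevalence of Y2
pair = draw_pair(probs, random.Random(2024))
```

Conditioning on $Y_2 = 0$ or $Y_2 = 1$ in `probs` recovers the logits $\eta_{10}$ and $\eta_{10} + \varphi$, which is the property the ratio construction is meant to guarantee.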
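The performance summaries reported for the simulations (average estimate, bias, and root MSE over replications) can be computed as in the following Python sketch. The function name and the replicate values are hypothetical.

```python
import math
from statistics import mean

def summarize_replications(estimates, truth):
    """Summarize repeated estimates of a known true prevalence."""
    avg = mean(estimates)
    bias = avg - truth
    rmse = math.sqrt(mean([(e - truth) ** 2 for e in estimates]))
    return {
        "mean": avg,
        "bias": bias,
        "relative_bias": bias / truth,
        "rmse": rmse,
    }

# Toy replicate estimates around a true prevalence of 0.3 (made up).
summary = summarize_replications([0.29, 0.31], truth=0.3)
```

Because the MSE is the variance plus the squared bias, an estimator from a large but biased source can have a small variance and still a large root MSE.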

Analysis of NYC Macroscope and NYC HANES
We illustrate the methods with data from NYC. To protect patient privacy, the authors did not directly access the data but submitted R code to the NYC Department of Health and Mental Hygiene (DOHMH) and received back the results of the joint analysis of the two data sources presented below.

Description of data sources
NYC Macroscope is an EHR-based surveillance system developed by the NYC DOHMH in collaboration with the City University of New York School of Public Health to estimate the prevalence of chronic diseases and risk factors for the adult population (20 years or older) in care with participating primary care providers in NYC [2,5]. The data were available only as aggregate data stratified by age group, sex, and neighborhood poverty level. Detailed provider and patient inclusion and exclusion criteria are documented elsewhere [2]. In this study, we used the 2013 data, which included 716,076 patients.
The 2013-14 NYC HANES is a population-representative survey of NYC residents aged 20 or older (n = 1527) with an interview, physical examination, and biospecimen collection [24]. The data used in this study were limited to in-care participants (i.e., participants who had seen a provider for primary care in the previous year; n = 1135). Recently, a chart review study was conducted among a subsample (n = 190) of in-care participants from NYC HANES (Fig. 1) [5]. In that study, records from more than 20 EHR systems used by primary care providers were abstracted for the chart review participants, and the data were linked to the NYC HANES data at the individual level. The chart review sample consisted of participants who received primary care from NYC Macroscope or non-NYC Macroscope providers. Because there was little difference in demographic and clinical characteristics between the two groups, we used data from all participants in this study. The chart review was performed on subjects enrolled in NYC HANES 2013-14 (n = 1524) who had a doctor's visit during the year (n = 1135), who signed a consent form and a Health Insurance Portability and Accountability Act (HIPAA) waiver (n = 491), and whose EHR were available and valid (n = 190).

Definition of health indicators
We selected six health indicators in the sources to demonstrate the methods: hypertension diagnosis, diabetes diagnosis, smoking, obesity, depression, and influenza vaccination. Newton-Dame and her colleagues describe these indicators in detail [2]. Hypertension diagnosis was defined as systolic blood pressure ≥ 140 mmHg, diastolic blood pressure ≥ 90 mmHg, or an existing record of hypertension diagnosis (based on ICD-9 in NYC Macroscope and self-report in NYC HANES). The diabetes indicator was based on the presence of an ICD-9 diagnosis in NYC Macroscope and on self-report in NYC HANES.
Smoking was based on an indication of 'current smoking' in the most recent smoking status in NYC Macroscope and on a self-report of current smoking in NYC HANES. The obesity indicator was based on the most recent body mass index (BMI) ≥ 30 in NYC Macroscope and on the height and weight measured at the interview in NYC HANES. The depression indicator was based on the presence of an ICD-9 depression diagnosis ever recorded or a Patient Health Questionnaire (PHQ-9) score ≥ 10 in NYC Macroscope and on a self-reported diagnosis or a PHQ-9 score ≥ 10 at the interview in NYC HANES. The influenza vaccination indicator was based on the presence of a relevant ICD-9/CPT/CVX code in NYC Macroscope and on the self-report of receiving an influenza vaccination in the past 12 months in NYC HANES.
Table 1 note: The size of the health survey ($n_1$) and the number of subjects linked between the two sources ($n_{12}$) are both 500.
Table 2 note: The square root of the MSE for estimating $p_1$ is shown. The size of the health survey ($n_1$) and the number of subjects linked between the two sources ($n_{12}$) are both 500. In each row, the best-performing method is highlighted in bold.

Illustration of the methods on NYC data
The NYC Macroscope used post-stratification to address the selection bias of Macroscope data [25,26]. The differences in the depression prevalence and influenza vaccination rate were likely due to the under-diagnosis of depression in primary care settings and to influenza vaccinations given outside of clinics (e.g., in pharmacies) that are not recorded in the primary care EHR. The population characteristics in NYC HANES and NYC Macroscope for the adult in-care population are described elsewhere [27]. We estimated prevalence with the four estimators: the estimator based only on NYC HANES, the estimator based only on Macroscope data, the Mosteller estimator, and the subject-level imputation estimator. We assumed that NYC HANES was the gold standard since its data were collected using a population-representative sample design with controlled and standardized data collection. The chart review study, with 190 subjects whose identities were linked between NYC HANES and NYC Macroscope, enabled us to calculate the subject-level imputation estimates, for which we used age group, sex, and neighborhood poverty level as covariates in the inclusion models and misclassification models. There was a lack of predictors that could properly model misclassifications in the EHR, such as hospital size, instrument labels, or types of visits.
Table note: The units are in percentage.
Mosteller prevalence estimates showed improvement over both the NYC HANES and NYC Macroscope estimates (Table 4). For all six health outcomes, they showed smaller standard errors than the NYC HANES estimates and smaller biases than the Macroscope estimates. The bias reduction was especially substantial (> 99% reduction) for the depression and influenza vaccination estimates because, for these indicators, the EHR estimates were given little weight (Table 5). On the other hand, the subject-level imputation estimates did not outperform the NYC HANES estimates: their credible intervals were wider than those of the NYC HANES estimates. This was due to the lack of predictors, as mentioned above, that could model the mechanism of misclassification in EHR. The subject-level imputation method requires us to correctly model the misclassification as well as to approximate the inclusion probabilities to the health survey for the EHR subjects. Table 4 also demonstrates that the selection bias in Macroscope was smaller than the bias due to subject-level misclassifications: the range of differences in prevalence estimates between Macroscope and NYC HANES for diabetes, smoking, and obesity was similar with (1.6-3.7%) and without (1.5-2.6%) post-stratification. However, it decreased to 0.4-0.6% for the Mosteller estimator. The range of differences in depression prevalence and influenza vaccination rate was also similar with (10.7-26.9%) and without (10.4-27.4%) post-stratification, but it decreased dramatically to 0.1% for the Mosteller estimator. This shows that post-stratification alone was insufficient to correct the bias in the EHR for these outcomes. In contrast, the Mosteller estimator and the subject-level imputation estimator both used NYC HANES as a safeguard against potential bias in EHR.
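Post-stratification, as used above to address selection bias, can be sketched in a few lines of Python. This is a simplified illustration with hypothetical names and toy counts: each sampled subject receives the population count of its cell divided by the number of sampled subjects in that cell, where a cell could be a combination of age group, sex, and neighborhood poverty level.

```python
from collections import Counter

def poststratified_weights(sample_cells, population_counts):
    """Return one weight per sampled subject: N_cell / n_cell."""
    sample_counts = Counter(sample_cells)
    unknown = set(sample_counts) - set(population_counts)
    if unknown:
        raise ValueError(f"cells missing from population counts: {sorted(unknown)}")
    return [population_counts[cell] / sample_counts[cell] for cell in sample_cells]

# Toy example: cell "A" is over-represented in the sample.
cells = ["A", "A", "A", "B"]
population = {"A": 60, "B": 40}
weights = poststratified_weights(cells, population)

outcomes = [1, 1, 0, 0]
weighted_prev = sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)
```

In the toy example the unweighted prevalence is 0.5, while the weighted prevalence is 0.4 because the over-represented cell is down-weighted. Such weights correct only for differences captured by the chosen cells and do not address subject-level misclassification.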

Discussion
Compared to traditional health surveys, EHR have a much larger sample size and the potential to reduce the standard errors of prevalence estimates. They can be very helpful in estimating prevalence in small subgroups of the population. In the NYC Macroscope analysis and our simulation study, we found that correcting the subject-level error of EHR is both necessary and possible.
In the simulation study, the health survey estimator was unbiased, but its standard error was the largest. In contrast, the bias in the EHR estimator can overwhelm the benefit of its sample size. When that happened, both the Mosteller estimator and the subject-level imputation estimator yielded negligible bias and small standard errors: they either outperformed or were comparable to the estimators based solely on either source. The subject-level imputation estimator may outperform the Mosteller estimator when the EHR bias is large. However, it requires a sufficient number of subjects linked between the two sources, a correct model of the mechanism of misclassification, and models of the inclusion probabilities for both sources.
The difficulty of such a task was demonstrated in the analysis of the NYC data. The Mosteller estimates showed considerably smaller standard errors than the NYC HANES estimates, especially when the NYC Macroscope estimates and NYC HANES estimates were close. The subject-level imputation estimator did not outperform the NYC HANES estimator, in part because of a lack of predictors for misclassification. The predictors for misclassification can be both patient-level characteristics, such as types of visit, and institution-level predictors, such as hospital size or instrument labels. These variables are typically found in EHR (or in administrative data sets that accompany EHR), while some patient characteristics will still be found in a health survey. In practice, the fit of the misclassification model should guide the choice between the two approaches, that is, whether to model the underlying mechanism of misclassification or to use Mosteller's estimator. This can be done, for example, by cross-validated estimation of the area under the receiver operating characteristic (ROC) curve, which is traced as one moves the probability cutoff in the logistic regression model M2.
In this article, we considered the health survey as the gold standard. We acknowledge that survey measurements are rarely unbiased. However, it is often helpful to treat one survey as the gold standard over another. For example, investigators have treated a smaller in-person survey as the gold standard over a larger telephone survey [10], or clinical measurements as the gold standard over self-reported outcomes [14,28]. EHR are often administrative data collected for billing purposes with non-standardized instruments and protocols and with complex, unknown inclusion mechanisms. NYC HANES was designed for health survey purposes with standardized instruments and protocols and was collected by representative probability sampling. We assumed that the typical bias treatments for the health survey, such as post-stratification and calibration for non-response bias, had been successfully performed.
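The area under the ROC curve mentioned above can be computed from held-out predictions without tracing the curve explicitly, using its rank interpretation. The Python sketch below is illustrative only: the function name and toy values are hypothetical, and the cross-validation step that would produce the held-out scores is omitted.

```python
def roc_auc(labels, scores):
    """Probability that a random positive case scores higher than a random
    negative case, counting ties as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy held-out outcomes and predicted probabilities (made up).
held_out_labels = [0, 0, 1, 1]
held_out_scores = [0.1, 0.4, 0.35, 0.8]
auc_value = roc_auc(held_out_labels, held_out_scores)
```

A value near 0.5 would suggest that the available predictors carry little information about misclassification, in which case the simpler pooled estimator may be the safer choice.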

Conclusions
We demonstrated that the joint use of a small gold-standard health survey with a larger EHR improves accuracy in prevalence estimation. Depending on the available data, one can aim to model the misclassification completely or calculate the weighted average of the prevalence estimates from the two sources. The studied approaches can improve the quality of EHR as a public health surveillance tool. In other work, we are extending the methods to model subgroup-level prevalence estimators from health surveys and EHR.