A method for estimating wage, using standardised occupational classifications, for use in medical research in the place of self-reported income
© Clemens and Dibben; licensee BioMed Central Ltd. 2014
Received: 6 January 2014
Accepted: 10 April 2014
Published: 28 April 2014
Income is predictive of many health outcomes and is therefore an important potential confounder to control for in studies. However, it is often missing or poorly measured in epidemiological studies because of its complexity and sensitivity. This paper presents and validates an alternative approach to the survey collection of reported income through the estimation of a synthetic wage measure based on occupation.
A synthetic measure of weekly wage was calculated using a multilevel random effects model of wage predicted by a Standard Occupational Classification (SOC) fitted in data from the UK Labour Force Survey (years 2001–2010; see endnote a). The estimates were validated and tested by comparing them to reported income and then contrasting estimated and reported income’s association with measures of health in the Scottish Health Survey (SHS) 2003 and wave one (2009) of the UK Household Longitudinal Study (UKHLS).
The synthetic estimates provided independent and additional explanatory power within models containing other traditional proxies for socio-economic position such as social class and small area based measures of socio-economic position. The estimates behaved very similarly to ‘real’, reported measures of both household and individual income when modelling a measure of ‘general health’.
The findings suggest that occupation based synthetic estimates of wage are as effective in capturing the underlying relationship between income and health as survey reported income. The paper argues that the direct survey measurement of income in every study may not actually be necessary or indeed optimal.
Keywords: Income; Synthetic data; Standard occupational classification; General health; Social survey
The association between socio-economic position (SEP) and health is well established and adjusting for the confounding effect of SEP in epidemiological and medical research is common practice. Common measures of SEP include educational attainment, housing (tenure, conditions or amenities), a number of occupational based measures and income [1, 2]. Of these measures, income is perhaps the best indicator of an individual’s material and wealth circumstances and has been linked to many health outcomes including mental health [3–6], mortality [7–10] and self-assessed health [11, 12].
Income is a sensitive topic and a potentially complex question to answer for many individuals, and this can result in measurement error or missing data in surveys. Non-response rates of around 10–25% are common [13, 14]. Furthermore, the format of the question or the subject matter of the survey is also likely to affect accuracy. All of these factors are therefore likely, to varying degrees, to introduce bias, and whilst some of these difficulties can be overcome with the implementation of more sophisticated survey designs, in many cases these are expensive and difficult to implement. In some cases, concerns over the impact of asking an income question on overall response rates mean that an income question is not asked at all. For example, despite strong pressure from the research community [18–20], the UK census will continue to omit an income question because of the difficulties respondents faced answering the question, the effect on response rates and potential negative coverage in the national press.
Despite this, and although detailed descriptions of occupation are ubiquitous in many data sources, they have seldom been used to approximate a measure of material disadvantage beyond collapsing them into traditional social class measures. This is possibly because of the difficulties associated with meaningfully incorporating large numbers of different occupational categories into a statistical analysis. However, aggregating occupation groups into social classes will result in a significant loss of occupation related discrimination in terms of socio-economic position. This paper argues that the utilisation of detailed occupation information, and its conversion onto an estimated continuous monetary scale, offers the potential for improved adjustment for SEP in medical research over traditional proxy measures. The paper sets out an approach based on common and widely available occupation classification schemes such as the UK Standard Occupation Classification (SOC). In addition to being highly discriminating in terms of wages, the SOC and other similar measures such as the International Standard Classification of Occupations (ISCO) are relatively common in most social survey datasets and so the approach described here is generalisable to a wide range of datasets from different countries.
We used a multi-level mixed effects approach to model and then synthesise wages for working individuals by SOC units (the finest detailed grouping within the SOC). To ensure that the resulting estimates could be replicated in as wide a range of data sources as possible, we restricted the variables used in the models to age, sex and SOC occupation. The mixed effect models utilise the tiered structure of the SOC and estimate random effect parameters associated with each of the groups (levels) within corresponding tiers of the SOC, together with fixed effect parameters for age and sex. Modelling was carried out in Stata version 11 and used maximum likelihood estimation within the xtmixed command. The ‘predicted’ wage (the synthetic estimate) was calculated in the survey data using the parameters from the original model, including the fixed effects coefficients for age and sex (applied to age and sex information in the survey) together with the ‘shrunk’ multi-level residuals from each SOC group (corresponding to the SOC information in the survey) (see Additional file 1). The hierarchical nature of the SOC classification and the use of Empirical Bayes ‘shrunk’ multi-level residuals meant that the estimates were statistically efficient even where the sample size in any given SOC unit group was small.
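To illustrate why ‘shrunk’ residuals remain stable in small occupation groups, the following is a minimal Python sketch of the textbook Empirical Bayes shrinkage formula for a two-level random intercept model. It is not the authors' Stata code, and it is simpler than the three-level models described here. The function names, variance components and coefficient values are all hypothetical, and back-transforming the predicted log wage by exponentiation is an assumption (Additional file 1 describes the actual procedure).

```python
import math


def shrunk_residual(raw_mean_residual, n, var_between, var_within):
    """Empirical Bayes estimate of a group-level random intercept.

    The raw mean residual of a group is pulled towards zero by the factor
    var_between / (var_between + var_within / n), so groups with few
    observations are shrunk more strongly.
    """
    shrinkage = var_between / (var_between + var_within / n)
    return shrinkage * raw_mean_residual


def predict_log_wage(intercept, b_age, b_male, age, male, group_effects):
    """Fixed part (age, sex) plus the shrunk residuals of the SOC groups."""
    return intercept + b_age * age + b_male * male + sum(group_effects)


# Hypothetical variance components and raw residuals (illustration only).
u_small = shrunk_residual(0.40, n=5, var_between=0.10, var_within=0.25)
u_large = shrunk_residual(0.40, n=500, var_between=0.10, var_within=0.25)

# Hypothetical fixed effect coefficients; exp() returns to the wage scale.
log_wage = predict_log_wage(5.0, 0.01, 0.2, age=40, male=1,
                            group_effects=[u_small])
synthetic_wage = math.exp(log_wage)
```

With the same raw residual, the group of 5 individuals receives a noticeably smaller effect than the group of 500, which is the behaviour that keeps estimates for sparsely populated SOC unit groups from being dominated by sampling noise.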
Information on individual wage, from which the prediction models were estimated, was obtained from the UK Labour Force Survey (LFS). The LFS is a survey of approximately 50,000 households living at private addresses. It is collected every 3 months and is designed to be nationally representative of the UK population, including all people resident in private households, all persons resident in National Health Service accommodation and young people living away from the parental home in a student hall of residence or similar institution during term time. We used data collected for the years 2001–2005 and 2007–2010. Data from 2006 were omitted to allow subsequent internal validation of the models. The wage information was derived from self-reported responses to a question asking for ‘gross weekly income in main job’ and was standardised to 2006 earnings levels using the consumer price index (CPI). The analysis sample was restricted to individuals of working age (16–65 for men and 16–60 for women), as pension type earnings could not be estimated reliably from occupation, and to those individuals who were in employment in the week before the survey, as those out of work were not assigned a SOC code. Outlying values in the wage distribution were determined by examining the skewness and were omitted, together with those individuals who were missing information for wage, age, sex or SOC. These adjustments left a sample size of 251,537. For the modelling, wage values were log transformed in order to reduce the overall skewness in the distribution.
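The price standardisation and log transformation can be sketched as follows. This is an illustrative Python example only: the index values in `CPI` are invented placeholders rather than published figures, the function names are hypothetical, and rescaling by the ratio of index values is assumed to be what "standardised to 2006 earnings levels" means.

```python
import math

# Placeholder index values for illustration; NOT real CPI figures.
CPI = {2005: 97.0, 2006: 100.0, 2007: 102.0}


def to_2006_prices(wage, year, cpi=CPI, base_year=2006):
    """Rescale a weekly wage reported in `year` to base-year price levels."""
    return wage * cpi[base_year] / cpi[year]


def log_wage_2006(wage, year):
    """Natural log of the price-adjusted wage, as used for modelling."""
    return math.log(to_2006_prices(wage, year))


adjusted = to_2006_prices(510.0, 2007)   # wage reported in 2007
modelled = log_wage_2006(510.0, 2007)    # value entering the model
```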
Data from both the Scottish Health Survey (SHS) 2003 and wave one (2009) of the UK Household Longitudinal Study (UKHLS) were used to test the external validity of the synthetic estimate. The SHS has a total sample size of 11,472 and the UKHLS 22,265 and, similarly to the LFS, both surveys are designed to be representative of the population and cover similar age ranges. The surveys were chosen because they both contain information on self-rated general health as well as a large amount of demographic information including SOC, social class and income. The SHS has a measure of equivalised household income and the UKHLS has individual ‘take home’ wages. This then allowed a comparison between both of these measures and our synthetic estimate of wage. Samples were restricted to individuals with complete information for the variables in the analysis and to those aged 16 or over.
The most effective model configuration was determined by examining the wage predictions generated in the 2006 Labour Force Survey data. These predictions were evaluated by calculating the standard deviation of the residuals (the actual wage minus the predicted wage).
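A minimal Python sketch of this evaluation criterion is given below. The wages and predictions are made-up numbers, the function name is hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not state which variant was used.

```python
import statistics


def residual_sd(actual, predicted):
    """Standard deviation of (actual - predicted) over held-out cases."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return statistics.stdev(residuals)


# Made-up held-out wages and predictions from two competing models.
actual = [10.0, 12.0, 14.0]
model_a = [9.0, 12.0, 15.0]
model_b = [7.0, 12.0, 17.0]

sd_a = residual_sd(actual, model_a)
sd_b = residual_sd(actual, model_b)
preferred = "model_a" if sd_a < sd_b else "model_b"
```

Under this criterion the configuration with the smaller residual standard deviation in the held-out year is preferred.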
External validation of the wage estimates
Amongst working individuals, we compared our estimate of wage with reported individual salary and household income when predicting health status, using the SHS and UKHLS. This comparison was made using both grouped and continuous versions of the wage and income variables. In each case we examined the strength of the relationship between income or wage and general health. The general health variable in the SHS was assessed with the question “how is your health in general?” with responses very good, good, fair, bad or very bad, and in the UKHLS the question was “In general, would you say your health is?” with responses excellent, very good, good, fair and poor. These variables were coded into binary indicators, with the SHS comparing those with fair, bad or very bad health with those with good or very good health and the UKHLS data comparing those with fair or poor health with those with good or better health. Because the outcomes in both cases were binary, logistic regression was used to estimate model parameters, and a method proposed by Zheng and Agresti [23] using correlations was used to compare model fit (the predicted values for the outcome are correlated with the observed values; the higher the correlation, the better the fit of the model).
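The binary coding of general health and the correlation-based comparison of fit can be sketched in Python as follows. The survey responses and fitted probabilities are invented for illustration, the helper names are hypothetical, and the logistic regression step itself is omitted: the fitted probabilities are simply assumed to come from two competing models.

```python
import math

# Response categories treated as "poor" health in each survey.
SHS_POOR = {"fair", "bad", "very bad"}
UKHLS_POOR = {"fair", "poor"}


def poor_health(response, poor_categories):
    """Return 1 if the response falls in the poorer-health group, else 0."""
    return 1 if response.strip().lower() in poor_categories else 0


def pearson(x, y):
    """Pearson correlation between observed outcomes and fitted values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented SHS-style responses and fitted probabilities from two models.
responses = ["good", "fair", "very good", "bad"]
observed = [poor_health(r, SHS_POOR) for r in responses]
fitted_a = [0.2, 0.7, 0.1, 0.8]
fitted_b = [0.4, 0.5, 0.3, 0.6]

r_a = pearson(observed, fitted_a)
r_b = pearson(observed, fitted_b)
```

The model whose fitted probabilities correlate more strongly with the observed binary outcome is judged to fit better.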
Estimation of the prediction equations
Table 1 Details of the different prediction models fitted to the master data predicting log weekly wage. Fixed and random effect parameters (to two significant figures) are reported together with the overall residual variance of each model.
Columns: Model 1, 2-level random intercepts (individuals nested in SOC minor groups); Model 2, 2-level random intercepts and age slopes (individuals nested in SOC minor groups); Model 3, 3-level random intercepts (individuals nested in SOC unit groups nested in SOC minor groups); Model 4, 3-level random intercepts and age slopes (individuals nested in SOC unit groups nested in SOC minor groups).
Rows: fixed effects (age, in increments of one year; sex, female as reference); random effects (level: SOC minor; level: SOC unit); N (for all models).
Internal validation of the prediction equations
Table 2 Evaluation (using average deviation of the predicted wage from actual wage and % reduction in total deviation) of both the prediction models and simple geometric means (grand mean and mean within SOC unit categories) for the internal validation data (2006 LFS data only).
Columns: average deviation from actual wage; % reduction of deviation.
Rows: grand geometric mean wage; geometric mean wage in SOC unit group; Model 1, 2-level intercept; Model 2, 2-level intercept and age slopes; Model 3, 3-level intercept; Model 4, 3-level intercept and age slopes.
Comparison of synthetic estimates to other measures of SEP
Table 3 Comparison of synthetic wage and measured equivalised household income coefficients from models predicting fair, bad or very bad health, estimated from the Scottish Health Survey, adjusting for other measures of socio-economic position.
Columns: wage (synthetic estimate), odds ratio of poor health; reported equivalised income, odds ratio of poor health.
Rows:
1. Wage (scaled in units of £100), controlling for age and sex
2. As 1 with additional control for social class
3. As 2 with additional control for SIMD
4. As 1 with additional control for SIMD
5. As 2 with additional control for NSSEC8 and SIMD
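Because wage is entered in units of £100, each odds ratio describes the change in the odds of poor health per additional £100 of weekly wage. The short Python sketch below shows how a logistic regression coefficient expressed per £1 would convert to that scale. The coefficient value and function name are hypothetical and are not taken from the paper's results.

```python
import math


def odds_ratio_per_100(beta_per_pound):
    """Odds ratio for a £100 increase, given a log-odds slope per £1."""
    return math.exp(100 * beta_per_pound)


# Hypothetical slope: log-odds of poor health fall by 0.002 per £1.
example_or = odds_ratio_per_100(-0.002)
```

An odds ratio below 1 on this scale indicates that higher wages are associated with lower odds of reporting poor health.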
Comparison of synthetic estimates with household and individual income
Table 4 Model coefficients for synthetic wage and survey reported income (continuous and deciled) for age and sex adjusted logistic regression models predicting fair, bad or very bad health. In the Scottish Health Survey, reported income is equivalised household income; in the UK Household Longitudinal Study, it is individual wage.
Columns: Scottish Health Survey, odds ratio for poor health; UK Household Longitudinal Study, odds ratio for poor health.
Rows: continuous income (scaled in units of £100); survey reported income; synthetic wage (lowest income decile as reference); survey reported income (lowest income decile as reference).
When comparing the synthetic estimates with an individual survey income measure, the relative patterns between models differ from those examining household income. Firstly, the correlation values suggest much smaller differences in the fit of both the discrete and continuous models between the synthetic wage and real income variables. In the continuous models, the fit is actually marginally better when using the synthetic measure, which, as in the comparison with household income, also has a stronger effect on general health. In terms of the discrete variable, the gradient pattern was less marked, particularly for the real income model, which also showed perhaps a slightly shallower gradient when compared to the synthetic model. It is worth noting that the two surveys asked the general health question in different ways, and this may explain the difference between surveys in the proportions of the population stating that they experience good or poor health.
The collection of income information on a questionnaire or within time limited interview situations is not straightforward. This is reflected in its absence from some research instruments, its simplified form in others and, more generally, its relatively high level of missing or improbable responses. In studies where a measure of income is entirely missing, other indicators of socio-economic position such as social class, educational attainment or small area based indicators are frequently used to approximate the material disadvantage that would have been captured by an income measure. This study has proposed and examined an alternative approach: the estimation of a synthetic measure of individual wages among workers based on detailed occupation groups from a standard occupation classification. While occupation forms a key component of many social class based measures, this often involves collapsing detailed occupational categories to such an extent that much ‘information’ is lost. We utilised this ‘information’ to estimate a synthetic measure of occupational based wage and then tested its external validity in relation to the prediction of an often used self-reported general health measure. We observed two main findings. Firstly, the estimates provide independent and additional explanatory power within models that otherwise contain only social class and small area based measures of socio-economic position and, secondly, they behaved very similarly to ‘real’, reported measures of both household and individual income when modelling ‘general health’. These findings suggest that occupation may be a useful variable with which to estimate a synthetic measure of wages and may provide a reliable and effective alternative or supplement to the recording of reported income in social and health surveys.
The approach we have taken has a number of advantages both when datasets are missing an income measure entirely as well as for those where it is imprecisely measured. In the former case, our findings appear to support the notion that wage measures a different component of SEP than that captured by social class and small area poverty or deprivation measures. This suggests that social class and small area deprivation on their own may not be sufficient to adjust for all socio-economic differences in general health and certainly not those differences that are related to income.
It could be argued that the synthetic occupation-based wage estimates also provide a more analytically useful measure of ‘average income’. Research suggests that, of the many aspects of SEP, income is perhaps the component with the greatest degree of short-term variability, which means that a traditional cross sectional survey, collecting the data at a single time point, may not capture the underlying information of interest. Because our estimates are closer to an individual’s medium term average wage given their occupation, they may capture important economic forces more effectively than reported measures of income for a specific period of time. This may explain why our synthetic measure has better discrimination (in terms of health) at lower levels than reported income (see Table 4, odds ratios for deciles). At this point in the income distribution, casual employment with more variable rates of wage within any period of time will be more common, and therefore a single sample in time may provide a poor estimate of average wage, the more important factor in the determination of health.
The methodology can be applied to a wide range of studies or datasets because the estimation models are reasonably parsimonious and only require a record of age, sex and occupation coded within some form of hierarchical or tiered standard classification. In most datasets these variables are unlikely to contain significant numbers of missing cases, so missingness in the resulting estimates should be negligible and mostly ignorable. It may also be possible to simplify the model further. Provided a sufficiently large dataset is available (i.e. a sufficient number of cases in each occupation group), the mean wage level within an occupational group may provide as good an estimate of wage as the Empirical Bayes estimate used in this study (see Table 2).
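The simplified alternative mentioned above, a mean wage within each occupation group, can be sketched in Python as follows. The sketch uses a geometric mean, in line with the geometric means evaluated in the internal validation; the SOC codes and wages are invented, and the function name is hypothetical.

```python
import math
from collections import defaultdict


def geometric_mean_by_group(records):
    """Geometric mean wage per occupation code.

    `records` is an iterable of (soc_code, weekly_wage) pairs. The mean is
    taken on the log scale and then exponentiated.
    """
    logs = defaultdict(list)
    for soc_code, wage in records:
        logs[soc_code].append(math.log(wage))
    return {code: math.exp(sum(v) / len(v)) for code, v in logs.items()}


# Invented occupation codes and weekly wages, for illustration only.
records = [("1111", 100.0), ("1111", 400.0), ("2222", 250.0)]
group_means = geometric_mean_by_group(records)
```

Unlike the shrunk multi-level residuals, such group means are not pulled towards the overall mean, so they are only likely to be dependable where each occupation group contains enough cases.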
The findings have a number of important implications for understanding confounding by socio-economic position and the collection of income data in surveys for health research. Firstly, it is clear that other non-income measures or components of SEP do not entirely capture the effect of income on their own and that omitting an income measure risks introducing income-related confounding. This is particularly problematic in datasets in which income is not measured such as the UK census and census based longitudinal studies. Extending this argument further, the findings may have wider implications for the measurement of income in health surveys more generally. Although we have restricted our analysis to an examination of self-reported health, the evidence begins to suggest that the collection of reported income data in health surveys may not be as crucial as the measurement of occupation. This is important as occupation is a far easier characteristic to measure and is much less problematic in terms of missing data, mis-measurement or inaccuracy.
There are limitations to the approach that we have used. Firstly, it relies on occupational information being available for subjects and, if household income needs to be calculated, for all those contributing to the household budget. Secondly, for those of working age who are not employed, or those who have retired, a description of occupation, if available, will not necessarily be an accurate measure of their income. However, it is possible to estimate the likely income for those who are unemployed or retired by using the standard welfare payments or occupational related pension payments. For those who have retired but have a pre-retirement occupation recorded, a similar modelling approach could be used to estimate pension level. Finally, the study was restricted to an examination of a measure of self-reported general health and it does not necessarily follow that our findings can be generalised to other health variables. For example, the shape, magnitude and functional form of the relationship between income and other health indicators such as mortality and physical health measures differ markedly in some cases [24–26]. It is important for future research to examine the validity of these synthetic estimates in relation to other health variables.
This study suggests that a synthetic measure of wage based on occupation can be used as an effective alternative to self-reported income in health studies. Given the problems associated with questions about income in a survey context, for example non-completion or inaccurate completion and the negative effect on overall response rates, this study also suggests that it may be more effective to ask about and use occupation as a control for material disadvantage in health studies.
Endnote a: Data available from the UK Data Service, http://ukdataservice.ac.uk/.
Tom Clemens was supported by a grant from the Wellcome Trust (The Scottish Health Informatics Programme-Ref WT086113).
1. Galobardes B, Shaw M, Lawlor DA, Lynch JW: Indicators of socioeconomic position (part 2). J Epidemiol Community Health. 2006, 60: 95-101. 10.1136/jech.2004.028092.
2. Galobardes B, Shaw M, Lawlor DA, Lynch JW, Smith GD: Indicators of socioeconomic position (part 1). J Epidemiol Community Health. 2006, 60: 7-12. 10.1136/jech.2004.023531.
3. Benzeval M, Judge K: Income and health: the time dimension. Soc Sci Med. 2001, 52: 1371-1390. 10.1016/S0277-9536(00)00244-6.
4. Lundberg O, Fritzell J: Income distribution, income change and health: on the importance of absolute and relative income for health status in Sweden. WHO Reg Publ Eur Ser. 1994, 54: 123-128.
5. Lynch JW, Kaplan GA, Shema SJ: Cumulative impact of sustained economic hardship on physical, cognitive, psychological, and social functioning. N Engl J Med. 1997, 337: 1889-1895. 10.1056/NEJM199712253372606.
6. Mullis RJ: Measures of economic well-being as predictors of psychological well-being. Soc Indic Res. 1992, 26: 119-135. 10.1007/BF00304395.
7. Duncan GJ: Income dynamics and health. Int J Health Serv. 1996, 26: 419-444. 10.2190/1KU0-4Y3K-ACFL-BYU7.
8. McDonough P, Duncan GJ, Williams D, House J: Income dynamics and adult mortality in the United States, 1972 through 1989. Am J Public Health. 1997, 87: 1476-1483. 10.2105/AJPH.87.9.1476.
9. Menchik PL: Economic status as a determinant of mortality among black and white older men: does poverty kill?. Popul Stud. 1993, 47: 427-436. 10.1080/0032472031000147226.
10. Wolfson M, Rowe G, Gentleman JF, Tomiak M: Career earnings and death: a longitudinal analysis of older Canadian men. J Gerontol. 1993, 48: 167-179.
11. Frijters P, Haisken-DeNew JP, Shields MA: The causal effect of income on health: evidence from German reunification. J Health Econ. 2005, 24: 997-1017. 10.1016/j.jhealeco.2005.01.004.
12. Jones AM, Wildman J: Health, income and relative deprivation: Evidence from the BHPS. J Health Econ. 2008, 27: 308-324. 10.1016/j.jhealeco.2007.05.007.
13. Moore JC, Stinson LL, Welniak EJ: Income measurement error in surveys: a review. J Official Stat-stockh. 2000, 16: 331-362.
14. Turrell G: Income non-reporting: implications for health inequalities research. J Epidemiol Community Health. 2000, 54: 207-214. 10.1136/jech.54.3.207.
15. Davern M, Rodin H, Beebe TJ, Call KT: The effect of income question design in health surveys on family income, poverty and eligibility estimates. Health Serv Res. 2005, 40: 1534-1552. 10.1111/j.1475-6773.2005.00416.x.
16. Kim S, Egerter S, Cubbin C, Takahashi ER, Braveman P: Potential implications of missing income data in population-based surveys: an example from a postpartum survey in California. Public Health Rep. 2007, 122: 753-763.
17. Galobardes B, Demarest S: Asking sensitive information: an example with income. Sozial-und Präventivmedizin/Social and Preventive Medicine. 2003, 48: 70-72.
18. Dorling D: Who’s afraid of income inequality?. Environ and Planning A. 1999, 31: 571-574.
19. Boyle P, Dorling D: Guest editorial: the 2001 UK census: remarkable resource or bygone legacy of the “pencil and paper era”?. Area. 2004, 36: 101-110. 10.1111/j.0004-0894.2004.00207.x.
20. White I, McLaren E: The 2011 Census taking shape: the selection of topics and questions. Popul Trends. 2009, 35: 8-19.
21. Walker S, Watson J, NISRA Census Office, Marques dos Santos M: 2007 Census test: the effects of including questions on income and implications for the 2011 Census. 2007, Newport: Office for National Statistics.
22. Office for National Statistics: ONS standard occupational classification 2000. Volume 1: Structure and description of unit groups. Volume 2: The coding index. 2000, London: Stationery Office.
23. Zheng B, Agresti A: Summarizing the predictive power of a generalized linear model. Stat Med. 2000, 19: 1771-1781. 10.1002/1097-0258(20000715)19:13<1771::AID-SIM485>3.0.CO;2-P.
24. Backlund E, Sorlie PD, Johnson NJ: The shape of the relationship between income and mortality in the United States: evidence from the National Longitudinal Mortality Study. Ann Epidemiol. 1996, 6: 12-20. 10.1016/1047-2797(95)00090-9.
25. Der G, Macintyre S, Ford G, Hunt K, West P: The relationship of household income to a range of health measures in three age cohorts from the West of Scotland. Eur J Public Health. 1999, 9: 271-277. 10.1093/eurpub/9.4.271.
26. Ecob R, Davey SG: Income and health: what is the nature of the relationship?. Soc Sci Med. 1999, 48: 693-705. 10.1016/S0277-9536(98)00385-2.
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/14/59/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.