This analysis used five statistical models to predict cost for a population of patients with MH/SA disorders in the VA. Several methods for overall model fit, as well as fit within deciles of predicted costs, were used to test the predictive ability of the models. Moreover, a test of sensitivity of model choice to sample size was performed using simulation methods.

Ordinary least squares (OLS) is often used to regress cost on patient characteristics. The population tested in this study has multiple comorbidities, with some patients, or even a large proportion, incurring very high costs. As a result, the distribution of costs is strongly right-skewed and the residuals from the model are not normally distributed. Yet even for distributions that account for long tails, there are often not enough observations with extremely high values to estimate the tail accurately.

The sample used in this study is large (more than 10 times larger than those reported in other studies), which allows extensive study of how well each model predicts, both in the full sample and in smaller samples. This matters because in many studies researchers do not have access to such large datasets or, for other reasons, cannot analyze data from an entire population.

The Gamma Log model performed worst on every statistic analyzed. It did particularly poorly for the RMSE, with a value that was more than double the smallest RMSE value corresponding to the Normal Identity model. It also performed poorly across deciles of predicted cost, consistently underpredicting in the first 9 deciles and overpredicting in the 10th decile.

Nixon and Thompson [11] conclude that skewed parametric distributions fit cost data better than the Normal distribution. In our sample, the Normal Identity model performed well, as shown by the measures of the predictions in the full sample. The Normal Identity had the second smallest RMSE and a predictive ratio for the top decile not statistically different from 1. The major problem with this model is the extremely large percentage of negative predictions it generates. If one is interested only in the overall mean prediction, the model is adequate. However, if one is interested in individual predictions, such as those for patients with low costs, the model is not as good. Nonetheless, it performed reasonably well in the top deciles, which are often an important target group for policy makers and disease management planners.
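The fit statistics discussed in this section can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study: the function names are hypothetical, MAPE is assumed to mean the mean absolute prediction error, the predictive ratio is taken as mean predicted cost over mean observed cost within groups ranked by predicted cost, and the four-patient data set is made up.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error on the cost scale."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(observed, predicted):
    """Mean absolute prediction error (the reading of MAPE assumed here)."""
    absolute = [abs(o - p) for o, p in zip(observed, predicted)]
    return sum(absolute) / len(absolute)


def predictive_ratios(observed, predicted, groups=10):
    """Predicted-to-observed cost ratio within groups ranked by predicted cost."""
    ranked = sorted(zip(predicted, observed))  # lowest predicted cost first
    n = len(ranked)
    ratios = []
    for g in range(groups):
        chunk = ranked[g * n // groups:(g + 1) * n // groups]
        total_predicted = sum(p for p, _ in chunk)
        total_observed = sum(o for _, o in chunk)
        # The ratio of totals equals the ratio of means within the group.
        ratios.append(total_predicted / total_observed)
    return ratios


def share_negative(predicted):
    """Fraction of predictions that fall below zero."""
    return sum(1 for p in predicted if p < 0) / len(predicted)


# Made-up costs for four patients, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(rmse(observed, predicted))
print(mape(observed, predicted))
print(predictive_ratios(observed, predicted, groups=4))
print(share_negative([-5.0, 3.0, 8.0, 12.0]))
```

A ratio above 1 indicates overprediction for that group and a ratio below 1 indicates underprediction. The sketch assumes every group contains at least one patient and has a positive total observed cost.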

The models tailored to deal with the skewed sample perform reasonably well. In the overall sample, models with a square-root transformation or link perform best. This could be because the square-root transformation forces a form of interaction among the independent variables, which might be needed in this sample because many of the patients have multiple MH/SA conditions. Interactions are not usually used in risk-adjustment systems, except in systems that use hierarchies within conditions. However, hierarchies are a limited form of interaction and are designed primarily to avoid double counting specific diagnoses within a disease category; e.g., for a patient with paranoid schizophrenia and psychoses NOS ("not otherwise specified"), only the paranoid schizophrenia is counted. The Square-root Normal model has the smallest MAPE and RMSE, and both values are statistically different from those of the other models. The Gamma with square-root link has PRs that are consistently very close to 1 in every decile.
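The interaction argument can be made concrete with a short derivation. As an illustration only, assume a square-root link with an intercept and two binary condition indicators, x_1 and x_2; the symbols \mu, \beta_0, \beta_1, and \beta_2 are notation introduced here and are not taken from the fitted models.

```latex
\begin{align*}
\sqrt{\mu} &= \beta_0 + \beta_1 x_1 + \beta_2 x_2,
  \qquad \mu = E[\text{cost} \mid x_1, x_2] \\
\mu &= \beta_0^2 + \beta_1^2 x_1^2 + \beta_2^2 x_2^2
      + 2\beta_0\beta_1 x_1 + 2\beta_0\beta_2 x_2 + 2\beta_1\beta_2 x_1 x_2 \\
    &= \beta_0^2 + (\beta_1^2 + 2\beta_0\beta_1)\, x_1
      + (\beta_2^2 + 2\beta_0\beta_2)\, x_2 + 2\beta_1\beta_2\, x_1 x_2,
  \quad \text{since } x_j^2 = x_j \text{ for } x_j \in \{0, 1\}.
\end{align*}
```

The last term, 2\beta_1\beta_2 x_1 x_2, is nonzero only for patients with both conditions, so squaring the linear predictor adds a cross-product term on the cost scale even though no interaction was entered explicitly. For the Square-root Normal model the same cross term appears when the linear predictor is squared during retransformation, alongside whatever retransformation adjustment is applied.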

The Log Normal is a multiplicative model. It does well when assessed on the log scale (not shown here), but after retransformation, even with adjustments, it does poorly. One reason is that we are using a sample of all MH/SA patients, who are, in general, a highly comorbid population within the VA. The most comorbid individuals are the ones found in the upper deciles. When predictions are brought back to the original scale, the multiplicative effect in this model produces large predictions, as evidenced by an extremely high PR in the 10th decile. The overprediction in the 10th decile, together with the fact that we are forcing the mean predicted cost to equal the mean observed cost, translates into very poor predictions in the middle deciles.
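The multiplicative behavior on retransformation can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the intercept and per-condition effect are hypothetical log-scale coefficients, the costs are made up, and the rescaling step is only one simple way to force the mean predicted cost to equal the mean observed cost, not necessarily the adjustment used in the study.

```python
import math

INTERCEPT = 7.0             # hypothetical log-scale baseline
EFFECT_PER_CONDITION = 0.8  # hypothetical log-scale effect of each condition


def log_scale_prediction(n_conditions):
    """Linear predictor on the log-cost scale (additive in the conditions)."""
    return INTERCEPT + EFFECT_PER_CONDITION * n_conditions


def retransform_mean_matched(log_predictions, observed_costs):
    """Exponentiate, then rescale so mean predicted equals mean observed."""
    raw = [math.exp(v) for v in log_predictions]
    mean_observed = sum(observed_costs) / len(observed_costs)
    mean_raw = sum(raw) / len(raw)
    scale = mean_observed / mean_raw
    return [scale * r for r in raw]


# Made-up patients with 0 to 3 conditions and made-up observed costs.
condition_counts = [0, 1, 2, 3]
observed = [900.0, 2500.0, 4000.0, 9000.0]

log_preds = [log_scale_prediction(k) for k in condition_counts]
dollar_preds = retransform_mean_matched(log_preds, observed)

for k, pred in zip(condition_counts, dollar_preds):
    print(k, round(pred, 2))
```

Each additional condition multiplies the predicted cost by exp(0.8), so predictions grow geometrically with the number of conditions. Because the rescaling fixes the overall mean, any overprediction at the top has to be offset by underprediction elsewhere.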

Simulation results show that, although on average the results do not differ from those in the larger sample, gamma models have some convergence problems for smaller sample sizes. However, this problem is directly related to the extremely small number of subjects in certain cells and can be dealt with by inspecting the data before running the model. When a sample has very small numbers in certain categories, the investigator should consider combining categories with small cell sizes before deciding that gamma models cannot be run. In the sample presented, the Gamma with square-root link model gives a very good fit in the overall sample; even for the small samples, where the model converges, it is a reasonable choice based on the statistics we assessed.
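The data inspection step described above can be sketched as follows. This is a minimal Python illustration with hypothetical names and made-up labels; it pools every sparse category into a single catch-all, which is only the simplest option. In practice an investigator would more likely merge clinically related categories.

```python
from collections import Counter


def cell_sizes(labels):
    """Number of subjects in each category."""
    return Counter(labels)


def collapse_sparse_categories(labels, min_count, pooled_label="OTHER"):
    """Relabel categories with fewer than min_count subjects as pooled_label."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else pooled_label for lab in labels]


# Made-up category labels: "A" is well populated, "B" and "C" are sparse.
labels = ["A"] * 5 + ["B"] * 2 + ["C"]
print(cell_sizes(labels))

collapsed = collapse_sparse_categories(labels, min_count=3)
print(cell_sizes(collapsed))
```

The pooled category can itself remain small, so the cell sizes should be checked again after collapsing and before fitting the model.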

Choosing a parsimonious model is an important statistical practice, and this argument is often used to justify the choice of OLS risk-adjustment models. However, parsimony requires that the model be the simplest one possible that also fits the data well. In this study, the large percentage of negative predictions from our OLS models means that OLS does not meet that standard.

More advanced models that extend GLMs have been introduced in the literature. Basu and Rathouz 2005 [27] introduce a method that estimates the link function in a GLM directly from the data. Manning, Basu, and Mullahy 2005 [28] describe the generalized Gamma models, which include OLS with Normal error, OLS for Log Normal, and Gamma with a log link as special cases. One limitation of our study is that we compared models using only one risk-adjustment system. The population of interest and the goals for the risk-adjustment system dictated our choice of the best model for this setting. So, although the model choice may not generalize to other risk-adjustment systems, the process of defining goals and testing whether the model meets them is generalizable as good statistical practice.