From concepts, theory, and evidence of heterogeneity of treatment effects to methodological approaches: a primer
 Richard J Willke^{1},
 Zhiyuan Zheng^{2},
 Prasun Subedi^{1},
 Rikard Althin^{3} and
 C Daniel Mullins^{2}
DOI: 10.1186/1471-2288-12-185
© Willke et al.; licensee BioMed Central Ltd. 2012
Received: 31 July 2012
Accepted: 3 December 2012
Published: 13 December 2012
Abstract
Implicit in the growing interest in patient-centered outcomes research is a growing need for better evidence regarding how responses to a given intervention or treatment may vary across patients, referred to as heterogeneity of treatment effect (HTE). A variety of methods are available for exploring HTE, each associated with unique strengths and limitations. This paper reviews a selected set of methodological approaches to understanding HTE, focusing largely but not exclusively on their uses with randomized trial data. It is oriented for the “intermediate” outcomes researcher, who may already be familiar with some methods, but would value a systematic overview of both more and less familiar methods with attention to when and why they may be used. Drawing from the biomedical, statistical, epidemiological and econometrics literature, we describe the steps involved in choosing an HTE approach, focusing on whether the intent of the analysis is for exploratory, initial testing, or confirmatory testing purposes. We also map HTE methodological approaches to data considerations as well as the strengths and limitations of each approach. Methods reviewed include formal subgroup analysis, meta-analysis and meta-regression, various types of predictive risk modeling including classification and regression tree analysis, series of n-of-1 trials, latent growth and growth mixture models, quantile regression, and selected nonparametric methods. In addition to an overview of each HTE method, examples and references are provided for further reading.
By guiding the selection of the methods and analysis, this review is meant to better enable outcomes researchers to understand and explore aspects of HTE in the context of patient-centered outcomes research.
Keywords
Heterogeneity; Risk adjustment; Estimation techniques; Comparative effectiveness research; Review
Background
Recent interest in ‘patient-centered’ outcomes research (PCOR) stems from a growing need for valid and reliable evidence that can be used by stakeholders to make individualized treatment decisions. Currently, patients, physicians and payers often make treatment or payment decisions based upon data representing the average effect of an intervention, as observed from selected pools of patients in a clinical trial setting. Reliance on trial outcomes data for real-world decision making requires an assumption that the study population from which the average was generated accurately represents the individual patient. However, a ‘real’ patient is likely different from the average trial patient in important ways, such as demographic, disease severity, or health behavior characteristics.
In terms of intervention outcomes, these differences could mean that the average effect observed in the study population may bear little resemblance to the real effect observed in the individual patient [1, 2]. Growing awareness of this phenomenon, known as heterogeneity of treatment effect (HTE), has fueled recent discussions regarding how PCOR studies can be designed to better account for HTE, so that the results of such research can guide treatment and insurance coverage decision making. Given a pressing need to achieve better value in health care spending, timely HTE evidence can contribute to both more individualized and more efficient care. While observational studies have played and will continue to play an important role in generating PCOR evidence, there is also an opportunity to more consistently incorporate HTE considerations in the design and analysis of randomized clinical trials (RCTs).
Assessment of HTE is becoming more common in the medical literature, often in the form of subgroup analysis within RCTs [3]. Formal, pre-planned subgroup analysis of RCT data is certainly one means of analyzing HTE (and is discussed further below), but may not represent the most efficient or appropriate approach to investigating HTE. This is because HTE may be the result of complex interactions or latent factors that can only be unveiled by more elaborate empirical strategies. The purpose of this paper is to outline a set of key considerations for conducting prospective HTE research through a discussion of a number of validated HTE methodological approaches. We systematically address how background prior beliefs (“priors”) and pre-existing evidence can shape a methodological approach to better understand HTE. Aimed at study designers with an “intermediate” level of understanding of both statistics and HTE, this paper provides an overview of select methods for evaluating HTE. The description of each method is accompanied by guidance on its most appropriate applications. This paper can be used as a starting point for an audience that may need to factor HTE considerations into their research plans, but may be unfamiliar with the full constellation of methods available.
This article is intended to serve as a thorough discussion about how to explore and evaluate HTE evidence, a primer of sorts, for researchers in all sectors but particularly those involved in product development requiring RCTs, who are interested in incorporating HTE considerations into their RCT studies. Because aims and circumstances can vary widely from one study to the next, we have sought to avoid a prescriptive approach to the process of methods selection. Instead, we seek to provide the reader with a general framework, supplemented with sufficient background material that, when combined with examples and references, equips the researcher who is interested in developing their own HTE study with the tools needed to do so.
The HTE methodological approaches discussed below were informed by a literature review of HTE methods and selected to provide a review of a variety of techniques in the medical, statistics, and economics literature. In order to focus on methods useful in product development, they were also selected for their apparent relevance to RCT conduct and analysis. Nonrandomized data analyses can inform trial design and real-world comparative effectiveness research (CER) studies, but are also subject to treatment selection biases which can significantly complicate HTE analysis. We considered such issues beyond the scope of this paper and so nonrandomized data applications are included to a much lesser degree. Also beyond this paper’s scope is the background science needed to analyze genetically driven differences, although the HTE methods discussed below may be used in an exploratory manner in that area.
We begin with relatively established approaches such as formal subgroup analysis of clinical trial data, and heterogeneity in meta-analysis of trials. We then discuss more exploratory approaches for HTE, particularly the family of predictive risk modeling approaches, with some detail on classification and regression tree (CART) analysis. This is followed by several approaches that account more explicitly for intra-individual effects, albeit in quite different ways: latent growth and growth mixture models, and series of “n-of-1” trials, with the latter being an example of how an alternative trial design can be combined with specific methodological approaches to model HTE. Finally, we discuss some HTE methods receiving relatively more attention in the econometrics literature: quantile-based treatment heterogeneity and nonparametric methods testing for HTE. The overview of methods is followed by a discussion of how those with an interest in PCOR and HTE during product development can develop an appropriate research agenda by matching methods with study questions.
Subgroup analysis of clinical trial data
Widespread in use but contentious in value, subgroup analysis can assess whether observed clinical trial treatment effects are consistent across all patients, or whether HTE exists across the patient population’s demographic, biologic, or disease characteristics. Subgroup analyses may be prespecified as part of a trial’s analysis plan; however, reviews of clinical trial reports suggest that subgroup analyses are often employed when a statistically significant effect is not detected in the overall population in an effort to identify a statistical effect of interest [4, 5].
Subgroup analysis should be both undertaken and interpreted with caution, especially when not prespecified [6, 7]. One key issue in subgroup analyses is incomparability of the subpopulations of interest, which can arise when the trial’s patient randomization process has not appropriately taken into account the factors that define the subgroup [8]. In cases where subgroup analyses are prespecified as part of the clinical trial protocol, imbalances may be minimized through appropriate stratification during randomization.
Sample size and power are additional concerns, as most trials are powered only to detect treatment differences at the overall population level. Even in cases where a significant effect within a subgroup is detected, multiplicity is also a concern, as the likelihood of a false-positive finding increases significantly with each additional analysis conducted [9]. Established adjustment techniques (such as the Bonferroni, Hochberg, and related methods) can help to adjust for the multiplicity issue. However, depending on the number of analyses conducted, this adjustment, in conjunction with power limitations, may significantly reduce the likelihood of observing a true effect [9]. Guidance has been developed regarding the appropriate approaches to subgroup analyses; these overviews provide a clear framework for how subgroup analyses should be undertaken [10, 11].
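The multiplicity issue and the two named adjustments can be illustrated with a short sketch. It is not drawn from the cited guidance: the function names and the four subgroup p-values are hypothetical, and the family-wise error calculation assumes independent tests, which subgroup analyses within one trial generally are not.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests tests.

    Assumes the tests are independent (an illustrative simplification).
    """
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def hochberg(p_values):
    """Hochberg step-up adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Visit p-values from largest to smallest.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank, idx in enumerate(order):
        # The k-th largest p-value is multiplied by k (k = rank + 1);
        # the running minimum keeps the adjusted values monotone.
        running_min = min(running_min, p_values[idx] * (rank + 1))
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values from four subgroup-by-treatment interaction tests.
subgroup_p = [0.01, 0.04, 0.03, 0.20]
bonferroni_adjusted = bonferroni(subgroup_p)
hochberg_adjusted = hochberg(subgroup_p)
```

With ten independent tests at the 0.05 level, familywise_error_rate(0.05, 10) is roughly 0.40. The Hochberg adjustment is never more conservative than Bonferroni, which is one reason it is often preferred when many subgroups are examined.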
Several studies have demonstrated how spurious results from subgroup analyses can be easily, if inadvertently, generated [7, 12, 13]. The publication and overinterpretation of these likely false findings only serve to increase skepticism around the potential value of subgroup analyses. Subgroup analyses have greatest value when used to generate new hypotheses that can then be tested in experimental studies specifically designed to address these questions and to manage potential imbalance, power, and multiplicity issues.
Meta-analysis
Meta-analysis is a technique that can be used to combine treatment effects across trials and their variations into an aggregated treatment effect with higher statistical power than observed in the individual trials. It can also provide an opportunity to detect HTE by testing for differences in treatment effects across similar RCTs. It is important that the individual treatment effects are similar enough for the pooling to be meaningful. If there are large clinical or methodological differences between the trials, an appropriate judgment may be to not perform a meta-analysis at all.
HTE across studies included in a meta-analysis may exist because of differences in the design or execution of the individual trials (such as randomization methods, patient selection criteria, handling of intermediate outcomes, differences in treatment modalities, etc). Cochran's Q, a statistic used to detect this type of statistical heterogeneity, is computed as the weighted sum of squared differences between each study's treatment effect and the pooled effect across the studies, and provides an indication of whether inter-trial differences are impacting the observed study result [14]. Similarly, the closely related I^{2}, H^{2}, and R^{2} indices developed by Higgins and Thompson measure the degree of statistical heterogeneity [15].
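As a minimal sketch of the calculation described above, Cochran's Q can be computed with inverse-variance weights, and I^{2} and H^{2} derived from it. The function names and the three trial estimates are hypothetical; a formal test would compare Q against a chi-square distribution with k-1 degrees of freedom, which is omitted here.

```python
def cochran_q(effects, variances):
    """Return (Q, pooled_effect) using inverse-variance (fixed-effect) weights."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    return q, pooled


def i_squared(q, n_studies):
    """I^2: percentage of total variation attributed to between-study heterogeneity."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


def h_squared(q, n_studies):
    """H^2: Q divided by its degrees of freedom."""
    return q / (n_studies - 1)


# Hypothetical treatment effects (e.g., log odds ratios) and their variances.
trial_effects = [0.1, 0.3, 0.5]
trial_variances = [0.01, 0.01, 0.01]
q_stat, pooled_effect = cochran_q(trial_effects, trial_variances)
heterogeneity_pct = i_squared(q_stat, len(trial_effects))
```

In this made-up example the three trials are equally precise, so the pooled effect is their simple mean, and most of the observed variation would be attributed to heterogeneity rather than chance.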
A possible source of error in a meta-analysis is publication bias, i.e., the likelihood that a particular trial result is reported depends on the statistical significance and the direction of the result. Trial size might also result in publication bias since larger trials would be less likely to escape publication than smaller, less-known ones. Language and accessibility might be other factors. There are methods of identifying and adjusting for publication bias, e.g. the funnel plot, which plots the effect size against the sample size and, if no bias is present, is shaped like a funnel [16]. The 'trim and fill' method is a nonparametric method of adjusting for publication bias based on the funnel plot [17, 18]. Significance tests such as Egger's test and Begg's test can also be used to identify publication bias [19, 20]. Identifying and adjusting for publication bias, in the presence of HTE, has been shown to be difficult when the meta-analysis is not large [21]. As has been noted, however, not all HTE is bad [22]. If the heterogeneity is not a consequence of poor study design, it should be welcomed as an opportunity to optimize treatment benefits for different patient categories.
HTE can sometimes be minimized by choosing the measure with the smallest variance across trials. Common measures of treatment effect when comparing proportions are: the additive Risk Difference (RD), and the multiplicative Relative Risk (RR) and Odds Ratio (OR). The RR and the OR both have good statistical properties but are less intuitive to interpret than the less statistically efficient RD. It may be the case that one measure, such as the RR, does not vary across subgroups, while the RD does vary, and it likely will if a common RR is applied to different baseline risks [22, 23]. In such cases it is important to determine what type of HTE is meaningful for the purpose at hand.
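The point about scale dependence can be checked with simple arithmetic. The sketch below applies one common relative risk to two hypothetical baseline risks (all numbers are illustrative, not taken from the cited studies) and reports the resulting risk differences and odds ratios.

```python
def effect_measures(control_risk, relative_risk):
    """Return (risk difference, relative risk, odds ratio) for one subgroup."""
    treated_risk = control_risk * relative_risk
    risk_difference = treated_risk - control_risk
    odds_ratio = (treated_risk / (1.0 - treated_risk)) / (
        control_risk / (1.0 - control_risk)
    )
    return risk_difference, relative_risk, odds_ratio


# Hypothetical low-risk and high-risk subgroups sharing a common RR of 0.8.
low_risk = effect_measures(control_risk=0.10, relative_risk=0.8)
high_risk = effect_measures(control_risk=0.40, relative_risk=0.8)
```

Here the RR is constant by construction, yet the RD is about -0.02 in the low-risk subgroup and about -0.08 in the high-risk subgroup, and the OR differs as well. Whether this counts as HTE therefore depends on the scale chosen.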
Meta-regression is a variant of meta-analytic technique that allows for a more in-depth understanding of the pooled clinical trial data, by “exploring whether a linear association exists between variables and a comparative treatment effect, along with the direction of that association” [24]. As pointed out by Baker et al, meta-regression should not be undertaken unless there is a sound rationale for the hypothesis that one, or more, covariates vary linearly with the treatment effect (e.g. what effect a one year increase in age has on the treatment effect) [24].
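A minimal sketch of the idea, assuming a single trial-level covariate and fixed-effect (inverse-variance) weights, is a weighted least squares line through the trial estimates. The function name and the trial data are hypothetical; a random-effects meta-regression would add a between-trial variance component to the weights, which is not shown.

```python
def meta_regression(effects, variances, covariate):
    """Weighted least squares fit of trial effect on one trial-level covariate.

    Returns (intercept, slope); weights are inverse variances.
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, covariate)) / total
    y_bar = sum(w * y for w, y in zip(weights, effects)) / total
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, covariate))
    sxy = sum(
        w * (x - x_bar) * (y - y_bar)
        for w, x, y in zip(weights, covariate, effects)
    )
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical trials: mean patient age and estimated treatment effect.
mean_age = [50.0, 60.0, 70.0]
trial_effects = [0.2, 0.3, 0.4]
trial_variances = [0.01, 0.02, 0.01]
intercept, slope = meta_regression(trial_effects, trial_variances, mean_age)
```

The slope is read as the change in treatment effect per one-year increase in mean trial age. Because the covariate is a trial-level average, such an association need not hold for individual patients.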
Predictive risk modeling
A rapidly growing method for identifying potential for HTE is predictive risk modeling, whereby individual patient risk for disease-related events at baseline is differentiated based on observed factors. The most common measures are disease staging criteria, such as those used in COPD or heart failure, as well as more continuous algorithmic measures such as the Framingham risk scores for cardiovascular event risk [25, 26]. Genetic variations, such as HER2 for breast cancer or cytochrome P450 polymorphisms, are other such factors [27].
Initial predictive risk modeling, also known as risk function estimation, is often but not always performed prior to including treatment effects, and can employ a variety of methods. This is an area of extensive current research for more efficient algorithms, given the proliferating sources of individual patient data with large sample sizes, better data on outcomes, and large numbers of potential predictors. Traditional least squares or Cox proportional hazards regression methods are still quite appropriate in many cases and provide relatively more interpretable risk functions, but are typically based on linearity assumptions and may not give the highest scores for predictive metrics. Partial least squares is an extension of least squares methods that can reduce the dimensionality of the predictor space by interposing latent variables, predicted by linear combinations of observable characteristics, as the intermediate predictors of one or more outcomes [28]. Other, less interpretable methods include various types of recursive partitioning (such as random forests), support vector machines, and neural networks [29–32]. Some of these latter methods, particularly support vector machines, have been shown to often have better predictive success than more linear methods, generally at the expense of clarity of the risk mechanisms. Risk function estimation can range from highly exploratory analyses to near meta-analytic model validation, and may be useful at any stage of product development. The better validated the risk mechanism, however, the more it can be used for hypothesis-driven rather than exploratory analyses.
Given a risk function that generates pretreatment event risk predictions for individual patients, one must choose how to use it with RCT data to evaluate HTE. For categorical risk predictions, methods such as subgroup interactions or stratified treatment analyses can be used, subject to the considerations discussed above. Continuous risk predictions can be interacted with the treatment response in a regression format, but questions about the nature of the interaction (linear, quadratic, logistic, etc.) must be managed. Some other techniques have been proposed as well. For example, Ioannidis and Lau propose dividing patients into quartiles based on predicted risks and analyzing accordingly [33]. Lazar et al propose a technique they term “subpopulation treatment effect pattern plot” that evaluates the effect of risk on treatment outcomes in a continuous, nonparametric manner, using moving averages over successively higher risk groups [34]. Crown describes a regression-based decomposition method that is useful in parsing out risk factor effects in non-RCT data [35]. With such continuous risk-treatment interactions, if subgroup-determining breakpoints are subsequently needed for decision-making, one approach is post-hoc application of clinically meaningful treatment effects.
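The quartile approach can be sketched as follows, assuming each patient record carries a predicted baseline risk, a treatment indicator, and a binary event indicator. The record layout, function name, and data are hypothetical, and the sketch reports only the within-quartile risk difference, not the confidence intervals or interaction tests a real analysis would need.

```python
def risk_quartile_effects(records):
    """Risk difference (treated minus control event rate) within each risk quartile.

    records: list of (predicted_risk, treated, event) with treated/event as 0 or 1.
    Returns four values ordered from lowest to highest predicted risk;
    an entry is None when a quartile lacks treated or control patients.
    """
    ranked = sorted(records, key=lambda r: r[0])
    n = len(ranked)
    results = []
    for q in range(4):
        stratum = ranked[q * n // 4:(q + 1) * n // 4]
        treated = [event for _, arm, event in stratum if arm == 1]
        control = [event for _, arm, event in stratum if arm == 0]
        if not treated or not control:
            results.append(None)
            continue
        results.append(sum(treated) / len(treated) - sum(control) / len(control))
    return results


# Hypothetical trial data: treatment benefit grows with baseline risk.
trial_records = [
    (0.05, 1, 0), (0.06, 0, 0), (0.07, 1, 0), (0.08, 0, 0),
    (0.15, 1, 0), (0.16, 0, 0), (0.17, 1, 0), (0.18, 0, 0),
    (0.25, 1, 0), (0.26, 0, 0), (0.27, 1, 0), (0.28, 0, 1),
    (0.45, 1, 0), (0.46, 0, 1), (0.47, 1, 0), (0.48, 0, 1),
]
quartile_effects = risk_quartile_effects(trial_records)
```

In this made-up data the absolute benefit appears only in the two highest risk quartiles, the pattern that risk-stratified analysis is intended to reveal.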
Classification and regression tree (CART) analysis
A decision-tree-based technique, the Classification and Regression Tree (CART) approach considers how variation observed in a given response variable (continuous or categorical) can be understood through a systematic deconstruction of the overall study population into subgroups, using explanatory variables of interest [36]. In the context of the various statistical tools that can be used to understand HTE, CART is a simple approach best suited for early-stage, exploratory analyses. CART’s relative simplicity can be powerful in helping the researcher understand basic relationships between variables of interest, and thus identify potential subgroups for more advanced analyses.
The key to CART is its ‘systematic’ approach to the development of the subgroups, which are constructed sequentially through repeated, binary splits of the population of interest, one explanatory variable at a time. In other words, each ‘parent’ group is divided into two ‘child’ groups, with the objective of creating increasingly homogeneous subgroups. The process is repeated and the subgroups are then further split, until no additional variables are available for further subgroup development. The resulting tree structure is oftentimes overgrown, but additional techniques are used to ‘trim’ the tree to a point at which its predictive power is balanced against issues of overfitting. Because the CART approach does not make assumptions regarding the distribution of the dependent variable, it can be used in situations where other multivariate modeling techniques often used for exploratory predictive risk modeling would not be appropriate – namely in situations where data are not normally distributed.
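The core step of this procedure, a single binary split, can be sketched for a continuous response. The sketch searches every explanatory variable and candidate threshold for the split that minimizes the summed squared error of the two child groups. The function names and patient data are hypothetical; a full CART implementation would apply the step recursively and then prune the tree, neither of which is shown.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the group mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, outcome_index):
    """Find the binary split giving the most homogeneous child groups.

    rows: list of tuples holding explanatory variables and the response.
    Returns (feature_index, threshold, score), or None if no split exists.
    """
    best = None
    for j in range(len(rows[0])):
        if j == outcome_index:
            continue
        levels = sorted({row[j] for row in rows})
        # Candidate thresholds lie midway between adjacent observed values.
        for low, high in zip(levels, levels[1:]):
            threshold = (low + high) / 2.0
            left = [r[outcome_index] for r in rows if r[j] <= threshold]
            right = [r[outcome_index] for r in rows if r[j] > threshold]
            score = sum_squared_error(left) + sum_squared_error(right)
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best


# Hypothetical patients: (age, severity grade, treatment response).
patients = [
    (50, 1, 1.0), (55, 2, 1.2), (58, 1, 0.9),
    (65, 2, 3.0), (70, 1, 3.1), (72, 2, 2.9),
]
split = best_split(patients, outcome_index=2)
```

For these made-up patients the chosen split is on age, midway between 58 and 65, separating low responders from high responders; severity grade offers no comparably clean division.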
CART analyses are useful in situations where there is some evidence to suggest that HTE exists, but the subgroups defining the heterogeneous response are not well understood [36]. CART allows for an exploration of response in a myriad of complex subpopulations, and more recently developed ensemble methods (such as Bayesian Additive Regression Trees) allow for more robust analyses through the combination of multiple CART analyses [37, 38].
Latent growth and growth mixture modeling
Latent growth modeling (LGM) is a structural equation modeling technique that captures inter-individual differences in longitudinal change corresponding to a particular treatment. In LGM, patients’ different timing patterns of the treatment effects are the underlying sources of HTE. Not only does LGM distinguish patients who do or do not respond, it also examines whether the patient responds quickly or slowly, and if they have temporary or durable responses. The heterogeneous individual growth trajectories are estimated from intra-individual changes over time by examining common population parameters, i.e., slopes, intercepts, and error variances. For example, each individual has unique initial status (intercept) and response rate (slope) during a specific time interval. The variances of all individuals’ baseline measures (intercepts) and changes (slopes) in health outcomes represent the degree of HTE. The HTE of individual growth curves identified in LGM can also be attributed to observed predictors, including both fixed and time varying covariates. Duncan & Duncan provide a non-technical introduction to LGM [39]. Stull applies LGM to a clinical trial and argues that LGM gives rise to better parameter estimates than the traditional regression-based approach and that LGM can explain a larger proportion of variance [40]. LGM is also closely related to multilevel modeling [41, 42].
However, the assumption that all individuals are from the same population in LGM is too restrictive in some research scenarios. If the HTE is due to observed demographic variables, such as age, gender, and marital status, one may utilize multiple-group LGM. Despite the successful applications of LGM for modeling longitudinal change, there may be multiple subpopulations with unobserved heterogeneity that it cannot capture. Growth mixture modeling (GMM), built upon LGM, allows the identification and prediction of unobserved subpopulations in longitudinal data analysis. Each unobserved subpopulation may constitute its own latent class, whose members behave differently from individuals in other latent classes. Within each latent class, there are also different trajectories across individuals; however, different latent classes do not share common population parameters. For example, Wang and Bodner use a simulated dataset to study retirees’ psychological well-being change trajectory when multiple unknown subpopulations exist [43]. They add another layer (the latent class variable) on the LGM framework so that the unobserved latent classes can be inferred from the data. Moreover, the covariates in GMM are designed to affect growth factors distinctly across different latent classes. Therefore, there are two types of HTE: 1) the latent class variable in GMM divides individuals into groups with different growth curves; and 2) coefficient estimates vary across latent classes. Stull et al apply GMM to identify and characterize differential responders to treatment for COPD [44]. In contrast to LGM and GMM, which focus on longitudinal data, Luke & Muthen discuss factor mixture modeling as a method for cross-sectional studies when heterogeneous populations arise in a similar fashion as in GMM [45].
Wang and Bodner and Jung & Wickrama provide intuitive introductions to GMM [43, 46]. Both point out that the precision of GMM depends on the number of predictors included in the model. Moreover, the optimal number of latent classes needs to be determined. In the two-step approach they discuss, different criteria (e.g., the Akaike Information Criterion and the Bayesian Information Criterion) can be applied for this purpose, but they are sensitive to sample size. GMM also comes with a potential computational burden, and may result in nonconvergence or local solutions.
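To illustrate how the two criteria are applied to the choice of the number of latent classes, the sketch below computes AIC and BIC from the standard formulas and selects the class count with the smallest BIC. The log-likelihoods, parameter counts, and sample size are hypothetical stand-ins for the output of fitted growth mixture models; no model is estimated here.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (smaller is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln n - 2 ln L (smaller is better)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Hypothetical fits: number of classes -> (log-likelihood, free parameters).
fitted_models = {1: (-1250.0, 5), 2: (-1200.0, 9), 3: (-1197.0, 13)}
n_patients = 300

bic_by_classes = {
    k: bic(ll, p, n_patients) for k, (ll, p) in fitted_models.items()
}
aic_by_classes = {k: aic(ll, p) for k, (ll, p) in fitted_models.items()}
best_by_bic = min(bic_by_classes, key=bic_by_classes.get)
```

In this made-up example both criteria favor two classes: moving to three classes improves the log-likelihood too little to offset the added parameters. Because the BIC penalty grows with ln n, the selected number of classes can change with sample size.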
Series of n-of-1 trials
Combined (aka, “series of”) n-of-1 trial data provide a unique way to identify HTE. An n-of-1 trial is a repeated crossover trial for a single patient, which randomly assigns the patient to one treatment vs. another for a given time period, after which the patient is re-randomized to treatment for the next time period, usually repeated for 4 to 6 time periods. Such trials are most feasibly done in chronic conditions, where little or no washout period is needed between treatments and treatment effects are identifiable in the short term, such as pain or reliable surrogate markers [47–49].
Zucker et al furnish a good example of the information provided by analysis of combined n-of-1 trials [50]. N-of-1 trials were conducted for 23 fibromyalgia patients, comparing amitriptyline with placebo for up to six time periods. The overall mean difference in the disease evaluation score was significantly positive but slightly below a level considered clinically significant. However, the posterior means of the treatment effect for 10 patients were greater than the clinically meaningful threshold, a signal that the treatment was beneficial to a subset of the patients.
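The contrast between the pooled effect and patient-specific effects can be shown with a deliberately simplified sketch. It uses raw per-patient mean differences rather than the Bayesian posterior means reported by Zucker et al, so it illustrates the data structure, not their model. The function names, scores, and threshold are all hypothetical.

```python
def per_patient_effects(trials):
    """Mean treatment-minus-comparator difference for each patient.

    trials: dict mapping patient id to a list of (treatment_score,
    comparator_score) pairs, one pair per crossover period pair.
    """
    return {
        patient: sum(t - c for t, c in pairs) / len(pairs)
        for patient, pairs in trials.items()
    }


def summarize(trials, clinical_threshold):
    """Return (pooled mean effect, patients whose effect exceeds the threshold)."""
    effects = per_patient_effects(trials)
    pooled = sum(effects.values()) / len(effects)
    responders = sorted(p for p, e in effects.items() if e > clinical_threshold)
    return pooled, responders


# Hypothetical improvement scores from three patients' n-of-1 trials.
n_of_1_data = {
    "A": [(3, 1), (4, 1), (3, 2)],
    "B": [(1, 1), (2, 1), (1, 2)],
    "C": [(2, 1), (2, 1), (2, 1)],
}
pooled_effect, responders = summarize(n_of_1_data, clinical_threshold=1.5)
```

In this made-up data the pooled effect falls below the threshold while one patient clearly exceeds it. A hierarchical model would additionally shrink each patient's estimate toward the pooled mean, which matters when each patient contributes only a few periods.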
Combined n-of-1 trials could be used for early exploratory testing for HTE, or for later-phase, more focused testing of comparative effectiveness or cost-effectiveness [51, 52]. While not currently well accepted for regulatory purposes, in today’s highly competitive environment for chronic treatments, n-of-1 trials could provide HTE information useful in creating phase 3 trial designs that may lead to evidence more clearly differentiating new treatments.
Quantile regression
Quantile regression provides additional distributional information about the central tendency and statistical dispersion of the treatment effect in a population, which is not normally revealed by the conventional mean estimation in RCTs. For example, patients with different comorbidity scores may respond differently to a treatment. Quantile regression has the ability to reveal HTE according to the ranking of patients’ comorbidity scores or some other relevant covariate by which patients may be ranked. Therefore, in an attempt to inform patient-centered care, quantile regression provides more information on the distribution of the treatment effect than typical conditional mean treatment effect estimation. The quantile treatment effect (QTE) characterizes the heterogeneous treatment effect on individuals and groups across various positions in the distributions of different outcomes of interest. This feature has drawn substantial attention to quantile regression analysis, which has been employed across a wide range of applications, particularly when evaluating the economic effects of welfare reform [53–55].
The literature on HTE of the impact of welfare reform has focused on mean treatment effects across demographic subgroups. This focus implicitly assumes that any HTE results from observed differences in demographic characteristics. It is possible that the statistical significance of observed HTE from subgroup analysis is due to some outliers in the dataset, especially when the number of patients in a subgroup is relatively small. Quantile regression has the advantage of being robust to outliers. In an RCT where outliers are a potential issue, QTE is likely to perform better than subgroup analysis and to provide more convincing evidence for HTE.
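A minimal sketch of the QTE idea, assuming a two-arm RCT with no covariates, takes the difference between the same quantile of the outcome distribution in each arm. This equals what a quantile regression on a treatment indicator alone would estimate; covariate-adjusted quantile regression is not shown. The function names and outcome values are hypothetical.

```python
def quantile(values, tau):
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = tau * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def quantile_treatment_effects(treated, control, taus=(0.25, 0.5, 0.75)):
    """Difference between arm-specific outcome quantiles at each tau."""
    return {tau: quantile(treated, tau) - quantile(control, tau) for tau in taus}


# Hypothetical outcomes; the treated arm contains one extreme value.
control_outcomes = [1, 2, 3, 4, 5]
treated_outcomes = [2, 3, 4, 5, 100]

qte = quantile_treatment_effects(treated_outcomes, control_outcomes)
mean_difference = (
    sum(treated_outcomes) / len(treated_outcomes)
    - sum(control_outcomes) / len(control_outcomes)
)
```

In this made-up example the single extreme value pushes the mean difference to nearly 20, while the effect at each of the three quantiles is 1, which illustrates the robustness to outliers described above.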
Nonparametric methods
Nonparametric test statistics have been proposed in the literature as powerful tools to identify potential HTE. Crump et al [59] propose two test statistics for experimental evaluations of welfare reforms by using the power series method to test whether the average effects of a binary treatment are zero or constant over different subpopulations defined by covariates. Lee proposes a kernel smoothed nonparametric test for heterogeneity of conditional treatment effects when covariates are continuous and the outcome variable is randomly censored [60].
The significance of utilizing nonparametric models lies in the less restrictive assumptions (i.e., differentiability and moment conditions) imposed in comparison with the functional form assumptions of their parametric counterparts. More often than not, the structure in parametric models implicitly assumes a homogeneous treatment effect. Therefore, some nonparametric regression frameworks are flexible in their designs so that they permit HTE across individual patients. Rather than providing a p-value for the existence of HTE, the nonparametric regression frameworks may present treatment effects that vary among patients, from which the distribution of the response to a treatment is observable. The underlying hypothesis is that differential treatment response can be explained by differences in patients’ demographic characteristics, clinical variables, and contextual variables. Frolich considers a local likelihood logit model for binary dependent variables [61]. The proposed estimator combines the parametric logit function with the nonparametric kernel smoothing framework. The HTE is identified by looking at varying conditional means and marginal effects for particular changes in the observable covariates.
Discussion and conclusion
Two factors motivated the generation of this primer. First was a growing recognition of the interest in and need for more granular and patient-centric data with which individualized treatment decisions could potentially be made. Second was a realization that there were no unifying guiding principles for those researchers who might be interested in exploring HTE as part of a PCOR agenda.
Once the level of prior information is established, a second consideration relates to the development of HTE study objectives. Key elements include the treatments to be studied as well as the prior evidence about the nature and sources of HTE, which may be population-based or treatment-based or both. These will inform the intent of the HTE investigation, whether it is to be largely exploratory or testing specific hypotheses. Early stage studies with little prior evidence may need to be more exploratory in nature. Subsequent phase 3 studies may need to determine the most appropriate doses and populations for initial labeling and may be designed to test very specific hypotheses already formed by such studies.
Features of selected approaches to analysis of HTE
Approaches compared: Meta-analysis, CART, N-of-1 trials, LGM/GMM*, QTE**, Nonparametric, Predictive risk models

Intent of HTE Analysis
· Meta-analysis: Exploratory and confirmatory
· CART: Exploratory
· N-of-1 trials: Exploratory and initial testing
· LGM/GMM: Exploratory, initial testing, and confirmatory
· QTE: Exploratory, initial testing, and confirmatory
· Nonparametric: Exploratory and confirmatory
· Predictive risk models: Initial testing and confirmatory

Data Structure
· Meta-analysis: Trial summary results, possibly with subgroup results
· CART: Panel or cross-section
· N-of-1 trials: Repeated measures for a single patient (time series)
· LGM/GMM: Time series and panel
· QTE: Panel and cross-sectional
· Nonparametric: Panel, time series, and cross-sectional
· Predictive risk models: Panel or cross-sectional

Data Size Consideration
· Meta-analysis: Advantage of combining small sample sizes
· CART: Large sample sizes
· N-of-1 trials: Small sample sizes
· LGM/GMM: LGM, small to large sample sizes; GMM, large sample sizes
· QTE: Moderate to large sample sizes
· Nonparametric: Large sample sizes
· Predictive risk models: Sample size depends on specific risk function

Key Strength(s)
· Meta-analysis: Increases statistical power by pooling of results; possible to identify HTE across trials; possibility to measure and explain covariate's effect on treatment effect
· CART: Does not require assumptions around normality of distribution; can utilize different types of response variables
· N-of-1 trials: Patient is own control; estimates patient-specific effects
· LGM/GMM: Accounting for unobserved characteristics; heterogeneous response across time
· QTE: Robust to outcome outliers; heterogeneous response across quantiles
· Nonparametric: No functional form assumptions; flexible regressions
· Predictive risk models: Multivariate approach to identifying risk factors or HTE

Key Limitation(s)
· Meta-analysis: Included studies need to be similar enough to be meaningful
· CART: Fairly sensitive to changes in underlying data; may not fully identify additive impacts of multiple variables
· N-of-1 trials: Requires de novo study; not applicable to all conditions or treatments
· LGM/GMM: Criteria for optimization solutions not clear
· QTE: Treatment effect designed for a quantile, not a specific patient
· Nonparametric: Computationally demanding; smoothing parameters required for kernel methods
· Predictive risk models: May be more or less interpretable or useful clinically
· Assumed distribution  
· Selection bias 
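The QTE idea summarized in the table above can be illustrated with a minimal Python sketch. It assumes a randomized two-arm trial, so that the difference between the empirical outcome quantiles of the two arms estimates the unconditional quantile treatment effect. This is a simplification for illustration and not a quantile regression routine; the function names and the outcome values are hypothetical.

```python
def quantile(sorted_values, p):
    """Empirical quantile with linear interpolation between order statistics."""
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)                              # index of the lower neighbour
    hi = min(lo + 1, len(sorted_values) - 1)   # index of the upper neighbour
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def quantile_treatment_effects(treated, control, probs=(0.25, 0.5, 0.75)):
    """Difference in outcome quantiles between arms (assumes randomization)."""
    t = sorted(treated)
    c = sorted(control)
    return {p: quantile(t, p) - quantile(c, p) for p in probs}


# Hypothetical outcomes, invented purely for illustration.
control_outcomes = [10, 12, 14, 16, 18]
treated_outcomes = [11, 14, 17, 20, 23]

qte = quantile_treatment_effects(treated_outcomes, control_outcomes)
mean_difference = (sum(treated_outcomes) / len(treated_outcomes)
                   - sum(control_outcomes) / len(control_outcomes))
```

In this made-up example the estimated effect grows from 2 at the 25th percentile to 4 at the 75th percentile, while the mean difference is 3. Consistent with the limitation noted in the table, each estimate describes a quantile of the outcome distribution, not any specific patient.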
The importance of ‘context’ for HTE studies goes beyond methodological concerns. The recent and growing interest in patient-centered treatment means that HTE studies are increasingly likely to be used in clinical decision-making. However, the hope that HTE evidence can significantly improve patient outcomes needs to be balanced against questions regarding the reliability of the scientific methodology used to identify the HTE in question, the weight of what is already understood about conditions in which HTE studies are developed, and new concerns that may arise as additional HTE evidence is generated.
An example of such a concern is whether it is sufficient to be able to detect that a given population responds differentially to a treatment. What if this differential response is novel, was not previously detected in prior attempts at HTE investigation, but was only discovered using a relatively new statistical technique? Should those patients who reflect the differential response population be treated differently – and if so, how?
The key question facing researchers and policy makers is as follows: what level of evidence is required before treatment paradigms may change on the basis of HTE data, and how do we understand and accept the validity of this evidence in a landscape where new tools for detection are continually being developed? This question is likely to become more urgent as the increased availability of electronic data sources yields more and more research that could profoundly impact clinical treatment paradigms. While there is great hope that recent efforts like those being undertaken by the Patient-Centered Outcomes Research Institute will result in better evidence and improved decision-making ability for all stakeholders, significant work is needed to standardize and build consensus around which methods are most appropriate for generating this evidence.
Despite the high levels of enthusiasm and funding directed towards evidence generation, key questions regarding dissemination and assimilation of evidence into clinical practice remain. In the context of HTE studies, it will be crucial to further understand what types of analyses are most likely to impact clinical decision-making behavior. By setting various existing HTE methods against a novel framework that highlights both prior evidence and other considerations related to study design, this primer sought to fill what seems to be an increasingly important gap in the outcomes research literature.
Appendix – notes on estimation routines
Standard meta-analysis methods such as fixed- and random-effects models and tests of heterogeneity, together with various plots and summaries, can be found in the R package rmeta (http://cran.r-project.org/web/packages/rmeta). Logistic regression and survival model routines, both basic approaches to predictive modeling, are found in all major statistical packages (e.g., SAS (Proc Logistic, …) and Stata (logistic or logit, stcox, etc.)). Routines for calculating empirical Bayes shrinkage estimates for n-of-1 trials are available in S-Plus, with more general Bayesian approaches available in WinBUGS or in R with R2WinBUGS. Basic quantile regressions can be estimated in Stata with the command qreg or in SAS using Proc Quantreg, although some additional programming is needed to generate the full range of quantile estimates. The linear regression decomposition approach can be implemented in Stata with the commands decomp, decompose, and oaxaca. For nonparametric approaches, R offers many routines, which are well documented at http://cran.r-project.org/web/packages/np/vignettes/np.pdf. In SAS, Proc NLMIXED and Proc TRAJ are available for the estimation of LGM/GMM; in Stata, LGM is handled within the sem command.
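As a minimal sketch of the fixed-effect and random-effects pooling and the heterogeneity statistics mentioned above, the following Python code computes inverse-variance pooled estimates, Cochran's Q, the I² statistic, and the DerSimonian-Laird estimate of between-trial variance. It is not the rmeta implementation; the function name pool_effects and the trial-level effects and variances are hypothetical and chosen only for illustration.

```python
import math


def pool_effects(effects, variances):
    """Inverse-variance pooling with Cochran's Q, I^2, and DerSimonian-Laird tau^2."""
    k = len(effects)
    w = [1.0 / v for v in variances]                 # fixed-effect weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1

    # DerSimonian-Laird moment estimator of between-trial variance, floored at 0.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # I^2: share of total variation attributed to heterogeneity, in percent.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights add tau^2 to each within-trial variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    random = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se_random = math.sqrt(1.0 / sum(w_re))

    return {"fixed": fixed, "random": random, "se_random": se_random,
            "q": q, "df": df, "tau2": tau2, "i2": i2}


# Hypothetical trial-level treatment effects and their variances.
trial_effects = [0.1, 0.3, 0.5]
trial_variances = [0.01, 0.01, 0.01]
result = pool_effects(trial_effects, trial_variances)
```

With these invented inputs, Q exceeds its degrees of freedom and I² is well above zero, which is the kind of signal that would prompt a closer look at HTE across trials, for example through metaregression.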
Authors’ information
The authors are health economists by training who work in the field of health economics and outcomes research with a focus on pharmaceuticals. The authors have observed the variety of HTE study methods becoming popular in different disciplines, and have taken a particular interest in how different methods can best contribute to the evidence regarding the optimal use of new treatments among different individuals.
Abbreviations
HTE: Heterogeneity of treatment effect
PCOR: Patient-centered outcomes research
RCT: Randomized clinical trial
CER: Comparative effectiveness research
CART: Classification and regression tree
RD: Risk difference
RR: Risk ratio
OR: Odds ratio
LGM: Latent growth modeling
GMM: Growth mixture modeling
QTE: Quantile treatment effect
Declarations
Acknowledgements
We would like to acknowledge this journal’s referees and editor for their very helpful comments, as well as Lisa Blatt and Joanna Monday for their assistance in the preparation of this manuscript.
Prepublication history
The prepublication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/12/185/prepub
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.