A wide-ranging debate has taken place in recent years on mediation analysis and causal modelling [1–9]. This debate has involved many different fields and raised profound questions about the status of scientific explanations, statistical theory and research methodology when making causal inferences. In this paper, we build on this discussion to outline an integrated approach to the analysis of research questions that situate survival outcomes in relation to complex causal pathways. There are good reasons for pursuing this goal, as researchers are increasingly seeking to shed light on the “mechanisms” that generate survival outcomes by exploring mediated effects. As Aalen et al. [1] observe, “In other areas [outside Psychology and Social Science] mediation analysis has largely been ignored. This is especially so for situations where time plays a central role, as in survival analysis. In view of the importance of survival analysis in medicine and other areas, it is surprising that not more attention has gone into the issue of mediation.”
The background to this contribution is the increasingly urgent need for policy-relevant research on the nature and form of social inequalities in relation to health and health care, as interventions to promote population health and to improve equity rest on causal interpretations of the determinants of health-related outcomes, however incomplete or flawed these may be [10]. At the same time, and despite the enormous progress that has been made in each of the aforementioned areas, an integrated framework for causal modelling has not yet been identified in health research, with a view to incorporating survival outcomes with such desirable features as (a) latent variables, (b) time-varying covariates, (c) complex pathways and (d) support for causal inferences in relation to direct and indirect effects.
We will begin by briefly summarising recent debates on causal inference, mediated effects and statistical models. We will show that these three strands of research have powerful synergies which can be exploited by bringing them together within an appropriate analytical framework. We will then present an illustrative example using survival data for a sample of patients diagnosed with colon cancer in the Republic of Ireland between 2004 and 2008. We will assess whether social class (measured by a proxy variable) exerts statistically significant direct and indirect causal effects on survival prospects. We are particularly interested in assessing whether the influence of this socio-economic baseline covariate is mediated by the route of admission to the hospital where the main treatment was received, or by the caseload of that hospital. These indirect pathways are of great relevance from a policy-making perspective, as they have the potential to shed light on the mechanisms that (re)produce social inequalities in health outcomes.
Literature review
Mediation effects
The study of mediation raises complex issues, although the basic structure of such effects is simple. When mediators are included in a regression equation, the coefficients for other variables in the model may change or become statistically or substantively non-significant. In this way, mediation effects can mask the influence of certain variables and impede a full appreciation of their role in determining outcomes. Conversely, the appropriate specification of such effects can provide practitioners and policy-makers with richer information on disparities in access to health and health care.
Mediation analysis has stimulated interest amongst health researchers due to its potential to provide answers to a series of important research questions and due to dissatisfaction with the methods and approaches which have tended to dominate health research [4]. The latter have recently been called into question, primarily due to their tendency to focus on empirical associations (“black-box epidemiology”) and consequent failure to develop plausible explanations [3, 11]. As a possible solution to this problem, “mechanisms” have been contrasted with “black boxes”. The aim of applied research, it is argued, should be to develop increasingly sophisticated accounts of the systemic relationships and processes that generate empirical regularities [12]. In this vein, mediation analysis can inform intervention strategies, identify “active ingredients” and suggest strategic sites for action.
Where the mediator and outcome are singular, continuous and observed, multiple-equation techniques for studying mediation are frequently used, building on Baron and Kenny’s influential approach [13, 14]. As this approach has been widely discussed, we will merely note that it relies on a series of linear regression models and enables the researcher to assess whether a single variable may be said to mediate between a covariate and the outcome [15]. Although these techniques have been applied countless times, they are of limited use if either the mediator or the outcome is categorical or ordinal (or represents the time to an event), or if more complex forms of mediation are involved [5, 8]. These limitations have discouraged health researchers from exploring mediation effects, partly because non-linear models like the Cox model make it difficult to estimate indirect effects [16].
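The regression steps behind this kind of analysis can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not a reproduction of any published analysis: the data are simulated, the coefficients 0.5, 0.3 and 0.4 are arbitrary, and the function names are hypothetical. It fits the three linear regressions by ordinary least squares and checks that, in the linear case, the total effect equals the direct effect plus the product of the two mediated paths.

```python
import random

def cov(u, v):
    """Sample covariance (divisor n) of two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def ols_one(x, y):
    """Slope from regressing y on x (with intercept)."""
    return cov(x, y) / cov(x, x)

def ols_two(x, m, y):
    """Slopes for x and m from regressing y on both (with intercept)."""
    sxx, smm, sxm = cov(x, x), cov(m, m), cov(x, m)
    sxy, smy = cov(x, y), cov(m, y)
    det = sxx * smm - sxm ** 2
    return (smm * sxy - sxm * smy) / det, (sxx * smy - sxm * sxy) / det

# Simulated data; the coefficients 0.5, 0.3 and 0.4 are arbitrary.
random.seed(1)
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
m = [0.5 * xi + random.gauss(0, 1) for xi in x]
y = [0.3 * xi + 0.4 * mi + random.gauss(0, 1) for xi, mi in zip(x, m)]

c_total = ols_one(x, y)          # step 1: outcome on covariate
a = ols_one(x, m)                # step 2: mediator on covariate
c_direct, b = ols_two(x, m, y)   # step 3: outcome on covariate and mediator
indirect = a * b                 # product-of-coefficients indirect effect

print(f"total={c_total:.3f} direct={c_direct:.3f} indirect={indirect:.3f}")
```

The identity between the total effect and the sum of the direct and indirect components is a property of linear least squares; it is precisely this identity that fails to carry over to categorical or time-to-event outcomes.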
Causal inference
Causality has become a major issue in Statistics in recent years [1]. The “traditional” statistical approach to the analysis of direct effects involved conditioning on a mediating variable. Aware that this does not rest on a rigorous definition of causality, Robins and Greenland [17] and Pearl [18] developed alternative formulations. The “causal inference” literature which subsequently developed relies on the counterfactual theory of causality proposed by Rubin [19]. Judea Pearl, an influential scholar in this area, contributed to the new-found popularity of causal questions amongst statisticians by combining Rubin’s approach with the theory of non-parametric Structural Equation Models. Other authors have used similar techniques to clarify the necessary and sufficient conditions for making causal inferences about mediation effects [6, 20].
Within this literature, causal inference focuses on four different kinds of effects: the total effect, the “controlled” direct effect (based on the idea of holding the mediating variables fixed by setting their values to a constant by some kind of intervention), the “natural” direct effect (where the treatment is set at a given level and we compare outcomes without fixing the mediators to a constant, but allowing them to assume the “natural” levels that they would have taken in the absence of the treatment) and the “natural” indirect effect (where the direct effect is disabled and we focus on the effect transmitted by the mediator).
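These four quantities can be written compactly in counterfactual notation. The notation below is an assumption introduced here for illustration and follows one common convention: Y(x, m) denotes the outcome that would be observed if the treatment were set to x and the mediator to m, M(x) denotes the value the mediator would take under treatment x, and x* is the reference level of the treatment.

```latex
\begin{align*}
\mathrm{TE}     &= E[Y(x, M(x))] - E[Y(x^{*}, M(x^{*}))] \\
\mathrm{CDE}(m) &= E[Y(x, m)] - E[Y(x^{*}, m)] \\
\mathrm{NDE}    &= E[Y(x, M(x^{*}))] - E[Y(x^{*}, M(x^{*}))] \\
\mathrm{NIE}    &= E[Y(x^{*}, M(x))] - E[Y(x^{*}, M(x^{*}))]
\end{align*}
```

Under these definitions the simple additive decomposition TE = NDE + NIE holds in linear models without treatment-mediator interaction, but it need not hold in general.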
Pearl, in a recent paper [10], clarified some of the issues at stake when making causal inferences about mediation effects using statistical models. Firstly, he argues that indirect effects should not be treated as artefacts or nuisance parameters, but as “an intrinsic property of reality that has tangible policy implications”. Secondly, he argues that it is possible to define direct and indirect effects within a general, causal approach that does not require particular distributional assumptions. Thirdly, he shows that the assumptions required by causal mediation analysis are essentially analogous to those that apply to causal models more generally: no confounding due to unmeasured common causes. Fourthly, he demonstrates that the total effect, natural direct effect and natural indirect effect are identified for linear Structural Equation Models as long as the aforementioned assumptions are satisfied and can be estimated in a straightforward way from the estimated coefficients. Finally, Pearl considers such models to be potentially useful despite their reliance on assumptions which cannot be tested explicitly.
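To make the fourth point concrete, consider a minimal linear Structural Equation Model with one treatment X, one mediator M and one outcome Y. The symbols a, b and c′ are labels introduced here for illustration; the sketch assumes no unmeasured confounding and no interaction between X and M. For a change in the treatment from a reference level x* to x, the three effects follow directly from the coefficients:

```latex
\begin{align*}
M &= \alpha_0 + aX + \varepsilon_M \\
Y &= \beta_0 + c'X + bM + \varepsilon_Y \\
\mathrm{NDE} &= c'(x - x^{*}), \qquad
\mathrm{NIE} = ab\,(x - x^{*}), \qquad
\mathrm{TE} = (c' + ab)(x - x^{*})
\end{align*}
```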
This raises interesting questions about the relationship between statistical models, generative mechanisms and causality – which hinge on a fundamental paradox. Although statistical models can permit valid inferences about causal mechanisms under certain conditions, the very nature of these models implies that these conditions will rarely, if ever, be (fully) satisfied. After all, reality is infinitely complex, whilst models provide relatively simple, stylised representations, and researchers can never be certain that they have included all relevant confounders.
One way of tackling this paradox is to embed it within the process of scientific discovery. The plausibility of models is assessed by the scientific community using prevailing criteria and techniques, which either reinforces or undermines the conviction that a model captures the essence of a really-existing mechanism. If a model omits an important confounder, the onus is on other researchers to demonstrate that alternative specifications yield different conclusions. In other words, it is not sufficient to appeal to the possibility of misspecification or omission (which applies to all models); this must be substantiated explicitly.
The impact of model misspecification depends on the strength of the effects associated with the omitted variables or paths, which implies that once substantively-important covariates have been included in a model, the omission of less important effects will, ceteris paribus, have a weaker influence on the model. Rather than seeking a warrant for making absolute claims, we would suggest that the aim of causal models is to clarify important relationships and pathways and to contribute to the development of mechanism-based explanations.
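The dependence of the bias on the strength of the omitted effect can be illustrated with the textbook omitted-variable result for a linear model. This is a simplified sketch with symbols introduced here for illustration: U is an omitted covariate, γ is its effect on the outcome, and the error term is assumed to be uncorrelated with X.

```latex
\begin{align*}
Y &= \beta X + \gamma U + \varepsilon \\
\operatorname{plim}\,\hat{\beta}_{\text{omit } U} &= \beta + \gamma\,\frac{\operatorname{Cov}(X, U)}{\operatorname{Var}(X)}
\end{align*}
```

The bias term shrinks towards zero as either the effect of the omitted variable (γ) or its association with the included covariate becomes weak.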
Statistical models for mediation analysis
In an attempt to overcome the limitations of existing approaches to mediation analysis, researchers have sought to extend the Baron-Kenny approach to survival outcomes by applying it directly to Cox models [21, 22]. This technique is known to yield biased results, however, and has met with forceful criticism in the scientific literature, as summarised by Lange and Hansen [23]:
Most importantly, the observed changes in hazard ratios cannot be given a causal interpretation. In addition, the important assumption of proportional hazards can never be satisfied for both models with and without the mediator. In other words, it is not mathematically consistent to use a Cox model both with and without a potential mediator (mathematically, this is due to the fact that the class of proportional hazard models is not closed under marginalization).
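The closure problem described in this passage can be checked numerically. The sketch below is an illustration under assumed values, not an analysis from the literature cited: the baseline hazard, the two log hazard ratios and the mediator probabilities are arbitrary. Conditional on a binary mediator M, the hazards for exposure X are exactly proportional; once M is averaged out, the ratio of the two hazards changes over time.

```python
import math

BASE_HAZARD = 0.1          # constant baseline hazard (arbitrary)
BETA = math.log(2.0)       # log hazard ratio for exposure X, given M
GAMMA = math.log(4.0)      # log hazard ratio for mediator M, given X
P_M1 = {0: 0.3, 1: 0.6}    # P(M = 1 | X = x): the exposure shifts the mediator

def marginal_hazard(t, x):
    """Hazard at time t for exposure x after averaging over the mediator."""
    density = 0.0
    survival = 0.0
    for m_val in (0, 1):
        p = P_M1[x] if m_val == 1 else 1.0 - P_M1[x]
        rate = BASE_HAZARD * math.exp(BETA * x + GAMMA * m_val)
        surv = math.exp(-rate * t)
        density += p * rate * surv
        survival += p * surv
    return density / survival

def marginal_hazard_ratio(t):
    """Ratio of exposed to unexposed hazards when M is left out of the model."""
    return marginal_hazard(t, 1) / marginal_hazard(t, 0)

ratios = {t: marginal_hazard_ratio(t) for t in (0, 1, 5, 10)}
for t, r in ratios.items():
    print(f"t={t:>2}: marginal hazard ratio = {r:.3f}")
```

Because the printed ratio is not constant in t, a proportional hazards model that holds with the mediator cannot also hold without it in this example.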
As a result of these difficulties, researchers have concentrated their efforts on extending survival models in different ways. One such approach uses “marginal” models and focuses on obtaining causally-valid inferences for single mediation effects using standard survival models [24]. Another approach – known as “marginal structural modelling” – can be used to identify the causal effect of time-dependent exposures while controlling for time-dependent confounders which are also affected by the exposure [25]. These models use inverse probability of treatment weights and inverse probability of censoring weights to create a pseudo-population in which treatment is unconfounded by subject-specific characteristics or censoring [26]. The models are therefore designed to remove confounding due to a specific type of mediation effect, rather than to study mediation effects more generally. The independent variable of interest has to be dichotomous, and the integration of these models with survival outcomes is limited.
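The weighting idea can be shown in a deliberately simplified setting. The sketch below assumes a single time point, one binary confounder L and one binary treatment A, with hypothetical cell counts; it omits censoring weights and time-varying structure, so it illustrates the principle rather than a full marginal structural model. Stabilised weights P(A = a) / P(A = a | L = l) are computed from the counts, and the weighted pseudo-population is checked for balance.

```python
from collections import Counter

# Hypothetical counts keyed by (confounder L, treatment A); not real data.
counts = {(0, 0): 80, (0, 1): 20, (1, 0): 40, (1, 1): 60}
total = sum(counts.values())

n_by_a = Counter()
n_by_l = Counter()
for (l, a), n in counts.items():
    n_by_a[a] += n
    n_by_l[l] += n

def stabilised_weight(l, a):
    """P(A = a) / P(A = a | L = l), estimated from the cell counts."""
    p_marginal = n_by_a[a] / total
    p_conditional = counts[(l, a)] / n_by_l[l]
    return p_marginal / p_conditional

weights = {cell: stabilised_weight(*cell) for cell in counts}
pseudo = {cell: counts[cell] * weights[cell] for cell in counts}

def treated_share(table, l):
    """Proportion treated within stratum L = l."""
    return table[(l, 1)] / (table[(l, 0)] + table[(l, 1)])

# Observed versus weighted share of treated subjects in each stratum of L.
for l in (0, 1):
    print(l, round(treated_share(counts, l), 3), round(treated_share(pseudo, l), 3))
```

In the observed counts the share of treated subjects differs between the two strata of L; in the weighted pseudo-population it is the same in both, which is the sense in which treatment is no longer confounded by L.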
The third approach uses Dynamic Path Analysis, developed by Fosen et al. [27] using Aalen’s additive hazards model, as “an extension of classical path analysis to a time-continuous survival setting where path effects are estimated as a function of time” [16]. Lange and Hansen [23] suggest that this approach has weaknesses when used to study mediation, as it cannot sustain causal interpretations and cannot be implemented using standard software. Their recommendation is to adapt the additive hazards model in a different way to calculate the counterfactual rate difference, which represents the number of deaths that can be attributed to mediation through the mediator, compared with those that can be attributed to the direct path. Martinussen and Vansteelandt [28] also use the Aalen additive hazards model to adjust survival models for confounding in a similar way.
These approaches seek to extend existing survival models to obtain valid estimates of causal effects. As a consequence, they encounter constraints on the number and kinds of variables that can be analysed, and more complex causal mechanisms typically cannot be assessed. An alternative strategy is to integrate survival outcomes within Structural Equation Models, as the latter already include specifications such as growth curves, multilevel structures, latent variables, latent classes and multiple outcome variables [29]. Iacobucci [5] offers a general motivation for this strategy:
Mediation models have also been generalized to allow for nomological networks that are richer than just the three central constructs, X, M, and Y. If there are additional predictors or consequences of any of these, Structural Equation Models are superior (i.e., mathematically statistically optimal given their smaller standard errors), substantively to get a better sense of the bigger theoretical picture, and statistically because the focal associations will be estimated more purely, having other effects partialed out and statistically controlled…
We favour this strategy, which seeks to integrate survival outcomes within a Structural Equation Model, not least because the latter has come to be seen as the most appropriate methodological framework for carrying out mediation analysis more generally [10, 30–33]. The nature of survival models has, for a long time, appeared to exclude this possibility [5]. We will show in the next section how this challenge may be tackled, preparing the ground for an integrated framework.
Structural equation modelling
There is an intuitively appealing way of integrating time-to-event data within Structural Equation Models. The idea of using a linear specification of the hazard function based on discrete-time modelling techniques was proposed more than 20 years ago, and Singer and Willett [34] showed that this model could be estimated using the tools of traditional logistic regression analysis. Muthén and colleagues subsequently integrated the discrete-time survival model within the Mplus program [35, 36]. This approach – which will be described in greater detail below – makes it possible to estimate complex discrete-time survival models using existing software. It is possible, for example, to relate survival outcomes to other kinds of data structures and to develop models which more accurately reflect real-world mechanisms:
Discrete-time models have the strength that they can easily accommodate time-varying covariates. They also do not require a hazard-related proportionality assumption that is commonly used in continuous-time survival analysis, for example, the Cox proportional hazards model. In addition, these models easily allow for unstructured as well as structured estimation of the hazard function at each discrete time point. [35]
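The data restructuring that underlies the discrete-time approach can be sketched as follows. The example is a minimal illustration with hypothetical subjects and function names, and it assumes that censoring occurs at the end of an interval. Each subject is expanded into one record per interval at risk, with a binary indicator for whether the event occurred in that interval. The unstructured hazard in each interval is then the proportion of events among those at risk, which coincides with the fitted probabilities from a logistic regression of the indicator on a full set of interval dummies; covariates, including time-varying ones, would enter as additional predictors in that regression.

```python
# Hypothetical subjects: (last interval observed, 1 = event, 0 = censored).
subjects = [(1, 1), (2, 0), (2, 1), (3, 1), (3, 0), (3, 0)]

def person_period(data):
    """Expand one row per subject into one row per subject per interval at risk."""
    rows = []
    for subject_id, (last, event) in enumerate(data):
        for interval in range(1, last + 1):
            y = 1 if (event == 1 and interval == last) else 0
            rows.append((subject_id, interval, y))
    return rows

def hazards(rows):
    """Events divided by number at risk, per interval."""
    at_risk, events = {}, {}
    for _, interval, y in rows:
        at_risk[interval] = at_risk.get(interval, 0) + 1
        events[interval] = events.get(interval, 0) + y
    return {j: events[j] / at_risk[j] for j in sorted(at_risk)}

def survival(hazard_by_interval):
    """Survival after each interval: running product of (1 - hazard)."""
    out, s = {}, 1.0
    for j, h_j in hazard_by_interval.items():
        s *= 1.0 - h_j
        out[j] = s
    return out

rows = person_period(subjects)
h = hazards(rows)
s = survival(h)
print(h)
print(s)
```

Because the expanded records form an ordinary binary-outcome data set, the survival part of the model can sit alongside other equations in a larger system without special-purpose estimation routines.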
This conceptual shift – from continuous to discrete time, and from a single equation to a Structural Equation Model – permits the kind of integration of methods that is required for mediation analysis to yield its full potential in health research. Amongst the benefits of this approach are that it encourages researchers to formulate and test more comprehensive hypotheses and to develop more ambitious theories regarding generative mechanisms.
The notion of developing and testing mechanism-based accounts of the world involves a metaphorical mapping which is highly effective in this context. One way of understanding this concept is to situate it, once again, within the process of scientific discovery, whereby a little-understood association may be replaced, over time, by a more detailed explanation. This process gives rise to a constant revision of explanations, accompanied by new and more powerful accounts which articulate the relationship between processes situated at different levels. We argue that the central aim of scientific research is to provide an increasingly accurate or powerful account of these “mechanisms”.
The mechanism-based approach can be applied effectively to the development of statistical models. Models offer a stylised representation of real-world mechanisms; by interpreting the results of statistical models, we can make substantiated claims about the ways in which these mechanisms work. In fact, “direct” and “indirect” effects always relate to a specific theory/model, as “typically, there are other (unmeasured) intermediate variables that would mediate the direct effect” [3]. Indeed, every direct effect in a statistical model may be treated as a “black box”, and replaced (over time) by a more complex set of direct and indirect effects. It is the substantive focus of each research project that ultimately decides which black boxes should be opened (simultaneously creating new black boxes).