In the following we start by discussing our findings from the breast cancer example. We continue by discussing general methodological issues of our metaTEF procedure, mention alternatives and compare results with the meta-STEPP approach. We stress the importance of complete reporting with a structured profile and discuss strengths and limitations of our approach.

### Results of metaTEF for the DataMart studies

For early breast cancer patients treated with tamoxifen, the metaTEF functions provide convincing evidence of an interaction between chemotherapy treatment and estrogen receptor values. Whereas CT has hardly any effect for larger (say > 500 fmol) values, the log hazard ratio is monotonically increasing from about −0.25 for ER ‘0’ to about 0. An overall test for an interaction is significant (*p* = 0.0215), but the estimated treatment effect function is a much stronger argument for the interaction. As individual RCTs are typically underpowered for exploring patient characteristics interacting with treatment [32], it is no surprise that only one of the eight studies (study 8) pointed towards a significant interaction. Three of the studies (4, 7, 8, see Fig. A2) included very few patients with ER below 10 fmol, a potential reason that two of the corresponding individual treatment effect functions had a negative slope (Fig. 2). This IPD meta-analysis clearly shows that such approaches are needed to provide evidence when individual studies are too small. The effective sample size (number of events) ranged from 127 to 712 in the eight studies, too low to investigate a treatment-covariate interaction in single studies. Irrespective of whether a fixed or a random effects model is used and whether an FP1 or an FP2 function is chosen, the metaTEF approach provides clear evidence of an interaction between ER and CT. FP2 functions point to slightly larger effects for low values, and the fixed effect functions are flat for ER values up to about 5 fmol, whereas the random effect functions increase even in this area.

### General issues of the three-stage metaTEF approach

To investigate an interaction between a continuous predictor and treatment, metaTEF combines three stages: first, the derivation of the functional relationship with fractional polynomials in both treatment groups (extension to more than two groups is straightforward); second, the estimation of a continuous TEF as the difference between the functions in the two groups; and third, the averaging of the TEFs across studies. The first stage requires deciding between an FP1 and an FP2 function (FP3 or FP4 are possible but unlikely to be needed) and choosing between several variants. A simulation study [10] provided arguments for FP1 (flex3) as the preferred option. FP2 (flex1) is the preferred approach if non-monotonic functions are expected [10]; see [13] for an example. It is advisable to use one of the approaches for the main analysis and the other for a sensitivity analysis.
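The first two stages can be illustrated for a single study with a minimal sketch. It assumes that an FP1 function has already been fitted in each treatment group; the coefficients, the chosen power and the names `fp1_term`, `group_function` and `tef` are hypothetical and serve only to show how a TEF arises as a difference of two group-specific functions. Model fitting and the selection of powers are not shown.

```python
import math

# Conventional FP power set; power 0 denotes the natural logarithm.
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp1_term(x, power):
    """FP1 transformation of a positive covariate value x."""
    if x <= 0:
        raise ValueError("FP transformations require x > 0")
    return math.log(x) if power == 0 else x ** power


def group_function(x, beta0, beta1, power):
    """Stage 1: fitted function of the covariate within one treatment group."""
    return beta0 + beta1 * fp1_term(x, power)


def tef(x, treated, control):
    """Stage 2: treatment effect function as the difference between the groups."""
    return group_function(x, *treated) - group_function(x, *control)


# Hypothetical fitted values (beta0, beta1, power); not taken from any study.
treated = (-0.30, 0.05, 0)
control = (0.00, 0.00, 0)

grid = [1, 11, 101, 1001]  # covariate values at which the TEF is evaluated
curve = [tef(x, treated, control) for x in grid]
```

Evaluating the TEF on a common grid of covariate values in every study is what makes the pointwise averaging of the third stage possible.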

TEFs from single studies show considerable variation and using a monotonic FP1 function within each study does not logically imply that the overall TEF will be monotonic. In our case study, the TEFs seem to suggest that in some studies there is no effect of ER values on the effect of CT or even that the effect points in the ‘wrong’ direction. However, with small sample sizes in many of the studies, that is not surprising. To avoid difficulties caused by too-small studies, Royston and Sauerbrei [9, 10] used sample sizes of 250 and 500 in their simulation study with a continuous outcome. Since it is also known that single points can have a major influence on FP functions selected, we decided to truncate ER values at 1000 fmol/l.

As the estimation of a treatment effect function with fractional polynomial methodology requires positive values for ER we used ER + 1 in the study. An extended FP approach was developed for variables with a spike at zero [33]. To our knowledge MFPI has not been extended to cover this situation. In principle it should be straightforward, and we would expect results to be similar to the simpler ‘standard’ approach used here.
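The two preprocessing steps, truncation at 1000 and the shift by 1, can be written as a small helper. This is only a sketch: the function name `prepare_er` is hypothetical, and the order of operations (truncate first, then shift) is an assumption that is not stated in the text.

```python
def prepare_er(er, cap=1000.0, shift=1.0):
    """Truncate ER at `cap`, then add `shift` so that zero values become positive.

    The order (truncate, then shift) is an assumption made for illustration.
    """
    if er < 0:
        raise ValueError("ER values are expected to be non-negative")
    return min(er, cap) + shift


raw = [0, 4, 250, 3200]  # illustrative values only
prepared = [prepare_er(v) for v in raw]
```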

### Potential alternatives for each of the three stages

FP functions in the first stage may be replaced by STEPP functions [14] (see below) or spline functions. In principle, splines are a natural alternative, but it is unclear which specific spline approach should be chosen. Regression splines with 2, 3 or 4 d.f. and automatic knot selection were considered in the simulation study mentioned above. Riley et al. show details of a meta-analysis with restricted cubic splines, but many more spline approaches are available [34]. Perperoglou et al. [35] provide an overview of the most widely used spline-based techniques and their implementation. There is no ‘best’ spline approach. Further guidance is needed before a comparison of spline techniques with FPs can provide useful information for selecting the most suitable approach to estimate a continuous TEF.

In the second stage, STEPP compares treatment effects (e.g. estimate of survival rate at 5 years or hazard ratios) in subgroups. This means that treatment differences are calculated for patients belonging to k (overlapping) intervals. If the pointwise approach is used in the third stage, it is possible to use any approach which estimates a functional relationship in each of the treatment groups and estimates the treatment difference with related variance for each point in a relevant interval.

In the third stage we use the two-step approach with pointwise weighted averaging of the derived study-specific TEFs. Weights depend on the variances of the individual functions and therefore on the distribution of the predictor in each study. Differences between studies are also used in the random effects approach, which implies that differences in predictor distributions are down-weighted. In the context of the assessment of risk factors, White et al. [19] compared the pointwise approach with a multivariate meta-analysis procedure (‘mvmeta’) which combines the set of regression coefficients from each study [36,37,38]. In the latter, study-specific estimates based on the same type of function are required. Linear functions per study are simplest, but non-linear functions are possible if common powers are used across studies. Under these restrictions it would be possible to describe the individual TEFs with a set of regression coefficients, and a multivariate meta-analysis would be possible. Results from the two approaches showed only minor differences in a very large IPD meta-analysis of risk factors (> 80 cohorts), but the pointwise approach is more flexible [19].
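Pointwise weighted averaging can be sketched as follows. The fixed effect version uses inverse-variance weights, in line with the statement that weights depend on the variances of the individual functions. For the random effects version, the sketch assumes the DerSimonian-Laird moment estimator of the between-study variance; this choice, and the names `pool_pointwise` and `pool_curves`, are assumptions made for illustration and may differ from the implementation actually used.

```python
def pool_pointwise(estimates, variances, random_effects=False):
    """Pool study-specific TEF values at a single covariate value.

    estimates, variances: one entry per study.
    Returns (pooled estimate, variance of the pooled estimate).
    """
    w = [1.0 / v for v in variances]  # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * e for wi, e in zip(w, estimates)) / sw
    if not random_effects:
        return fixed, 1.0 / sw

    # Between-study variance: DerSimonian-Laird moment estimator (assumed).
    k = len(estimates)
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, estimates))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    w_re = [1.0 / (v + tau2) for v in variances]
    sw_re = sum(w_re)
    pooled = sum(wi * e for wi, e in zip(w_re, estimates)) / sw_re
    return pooled, 1.0 / sw_re


def pool_curves(tefs, tef_vars, random_effects=False):
    """Pool at every grid point; tefs[s][j] is study s at grid point j."""
    n_points = len(tefs[0])
    return [
        pool_pointwise([t[j] for t in tefs], [v[j] for v in tef_vars], random_effects)
        for j in range(n_points)
    ]


# Hypothetical TEF values and variances: three studies, two grid points.
tefs = [[-0.30, -0.05], [-0.10, 0.00], [-0.40, 0.02]]
tef_vars = [[0.04, 0.01], [0.02, 0.01], [0.08, 0.02]]
fixed_curve = pool_curves(tefs, tef_vars)
random_curve = pool_curves(tefs, tef_vars, random_effects=True)
```

Adding the between-study variance to every study's variance makes the weights more similar across studies, which is one way to see why differences in precision that stem from different predictor distributions carry less weight in the random effects version.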

### Comparison with meta-STEPP

To extract all information from a continuous variable, Bonetti and Gelber [14, 15] proposed the ‘Subpopulation Treatment Effect Pattern Plot’ (STEPP), a graphical tool to elucidate the pattern of treatment-covariate interactions in two-arm clinical trials with time-to-event endpoints when the covariate of interest is continuous. The primary advantage of STEPP is that it is very intuitive: no functional form for the interaction needs to be specified, the method is based on traditional measures of treatment effect in well-defined, overlapping subgroups of patients, and it allows one to explore the pattern of possible treatment effect heterogeneity [18]. Subgroups can be defined in two different ways, known as sliding window (SW) and tail-oriented (TO). Differences between the estimates in the subgroups are arguments for an interaction of the prognostic factor with treatment. Related significance tests were proposed [39].
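The two ways of forming overlapping subpopulations can be illustrated with a simplified sketch. It works on sorted covariate values only, ignores ties and the estimation of treatment effects within subpopulations, and shows only the ‘less than or equal to’ tails for the TO variant. The function names and the parameters `size` and `overlap` are hypothetical and do not reproduce the exact definitions used in the STEPP literature.

```python
def sliding_windows(values, size, overlap):
    """Overlapping subpopulations of `size` sorted values.

    Adjacent subpopulations share `overlap` values; the last one may be smaller.
    """
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    ordered = sorted(values)
    step = size - overlap
    windows = []
    start = 0
    while True:
        windows.append(ordered[start:start + size])
        if start + size >= len(ordered):
            break
        start += step
    return windows


def tail_oriented(values, cutpoints):
    """Nested subpopulations containing all values <= each cutpoint."""
    ordered = sorted(values)
    return [[v for v in ordered if v <= c] for c in cutpoints]


er_values = list(range(10))  # illustrative covariate values
sw = sliding_windows(er_values, size=4, overlap=2)
to = tail_oriented([5, 1, 3], cutpoints=[1, 3])
```

The sketch makes the dependence on the window size visible: smaller values of `size` give more subpopulations with fewer patients each, and hence more variable subgroup estimates.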

However, STEPP has disadvantages as a tool for inference and estimation. There are the two variants (SW and TO) to choose between. The size of the subpopulations is critical to the performance of the method and hence to the interpretation of the results. That is a specific issue for the SW variant, as shown for a single study by Sauerbrei et al. [17]. For a fixed effects meta-STEPP analysis Wang et al. [18] propose to create ER subpopulations based on meta-windows which use the data from the joint distribution of all studies. Consequently, some studies can have small sample sizes and some of the individual functions fluctuate substantially. A random effects meta-STEPP approach was proposed in [40]. There are only small differences between the fixed and random effect approaches. In agreement with the FP results, the plots show a clear increasing trend in the treatment effect as ER value increases, suggesting that the magnitude of the chemotherapy effect is smaller for tumours with higher levels of ER.

While metaTEF allows additional variables (prognostic factors, confounders) in a regression model, meta-STEPP cannot accommodate such variables. However, the issue is not critical here since we use data from eight randomized trials of chemotherapy with no covariates other than ER.

### Good reporting to help assessment of credibility

In Table 1 we introduce the three-part profile MethProf-MA as an instrument to improve reporting of available data and of all steps in the analysis. With an emphasis on the latter, Altman et al. [23] proposed the REMARK profile, a structured display in the context of prognostic factor research. Created prospectively, the profile helps by pointing out relevant issues, such as the necessity of initial data analysis and checking of important assumptions of models used [41]. Concentrating on all steps of our analysis, we adapt the key ideas of the REMARK profile to methodological investigations, here to a meta-analysis, and call the result a MethProf-MA profile, relating it to the reporting guidelines for meta-analyses. Such profiles can also be used to better present and understand investigations of properties and comparisons of variable selection procedures, simulation studies and many more [42]. In our methodological presentation we use the data from an earlier meta-analysis, which means that the PRISMA statement [24] is less relevant here. A key feature is the illustration of all steps of the analysis conducted, which can be easily seen in part C of MethProf-MA. The profile will also help to assess the credibility of effect modification analyses with ICEMAN (Instrument to assess the Credibility of Effect Modification Analyses), an instrument recently developed for randomized trials and meta-analyses [43]. The version for RCTs includes 5 core questions and that for meta-analyses 8 core questions, 4 of which overlap. One of the overlapping core questions is ‘If the effect modifier is a continuous variable, were arbitrary cutpoints avoided?’. Clearly this emphasizes the use of the full information from a continuous variable, as done with MFPI and the related metaTEF approach. A recent systematic survey clearly showed the necessity of assessing whether claims of subgroup effects are supported by the available data [44].
Other researchers may use ICEMAN to assess the credibility of the investigated interaction between estrogen receptor values and chemotherapy in patients with early breast cancer treated with tamoxifen. The overview provided by the profile will be most helpful for assessing some of the criteria.

### Strengths and limitations

MetaTEF is an approach which uses the full information from a continuous variable to assess whether and how the effect of a treatment is modified by the variable. It avoids the use of cutpoints with their related critical issues [4]. It extends the MFPI approach, which has sufficient flexibility to model non-linear treatment effect functions in many cases. MFPI has greater power than several alternatives [10] and provides functions which are simple, understandable and transferable. In contrast, metaTEF functions are pointwise weighted averages of FP functions and are therefore complex: the resulting functional form cannot be described by a simple formula. Smoothing the pointwise average and variance functions may help to improve visualisation and practical use, but it is beyond the scope of the paper.

Provided IPD are available from all relevant randomized trials, metaTEF summarizes all relevant information concerning a treatment-modifying effect of a continuous variable and can even include further prognostic factors. This may increase the power of an analysis and may point to other factors which may modify the treatment effect. Age and progesterone receptor would be of interest in our example, but these data were not available to us. At first glance MFPI modelling is straightforward, but there is the danger that some of the studies may be (too) small. In such cases it is likely that the algorithm will select a linear function in each group, even though the true effect may be far from linear. Mismodelling is also an issue if outliers or influential points are present. To cope with such issues, we suggest selecting a suitable truncation point. Also, we note that data-dependent modelling introduces bias in the estimates of the regression coefficients and that variances are underestimated [13].

We used data which were previously analyzed by Wang et al. [18]. This avoided the considerable task of assembling and cleaning the data, a complex and difficult issue in IPD meta-analysis. In addition, we have a clinically relevant and methodologically well-defined data set. It is unlikely that considerations such as publication bias are particularly relevant in our example. Furthermore, before starting the analysis there was no doubt that ER influences the effect of chemotherapy. The main clinical questions relate to the selection of a cutpoint for clinical decision making. In general, the treatment effect function will facilitate this decision. In our example, however, the TEF provides no clear answer.