 Research article
 Open Access
 Published:
Using Bonferroni, BIC and AIC to assess evidence for alternative biological pathways: covariate selection for the multilevel EmbryoUterus model
BMC Medical Research Methodology volume 13, Article number: 73 (2013)
Abstract
Background
IVF treatments for infertility involve the transfer of multiple embryos in any one treatment cycle. When data is available on individual embryos the outcomes of each embryo are only partially observed, as treatment outcome (live birth) is assessed at the patient level. Twolevel EmbryoUterus (EU) models have been developed which assume a biologically plausible mechanism and assume that effects are mediated directly through the embryo (E) and also through the uterine environment (U), represented by two submodels. This approach potentially allows inference as to the association of patient variables with outcome. However, when the variable is measured at the patient level either additional decisions have to be made in the modelling process as to in which submodel the variable should be included or some model selection algorithm has to be invoked. These uncertainties have limited the practical application of these models.
Methods
We have conducted simulation studies based around realistic parameter values of situations where a putative patientlevel variable is being considered for inclusion in an EU model and/or the mechanistic interpretation from the submodel assignment is of interest. Firstly we explore various strategies for inference for a variable of interest where the submodel is either prespecified or considered unknown. Secondly we explore the use of information criteria to select the appropriate submodel and the strength of evidence for that assignment. These are demonstrated in a reanalysis of a previously published dataset.
Results
In the absence of prior evidence for potential prognostic factors measured at the patient level, two single degreeoffreedom likelihood ratio tests with a Bonferroni correction including the variable of interest in first the E then the U submodel performs well as a statistical test for association with outcome. For model building the information criteria can be used, but large differences are required (⪆6) to provide reasonable evidence of submodel assignment. Previous interpretations have been overoptimistic.
Conclusions
These results suggest simple strategies and should enable these models to be used more confidently in practical applications.
Background
Invitro Fertilization (IVF) is a treatment for infertility in which embryos are created outside of the prospective mother and, after culture for 3–6 days, one or more embryos are transferred to her uterus. Analysis of data arising from IVF treatment often includes prognostic factors observed at an embryo level. However the outcome at this embryo level is often only partially observed. This partial observability arises due to the fact that individual embryos cannot be tracked after transfer and, unless all or none of the transferred embryos develop, it is not possible to determine which embryo(s) implanted. Analysis has to either be conducted at the patient level using aggregated embryo data or nonstandard methodology is required.
IVF treatments have a hierarchical structure with effects on the embryo from both mother and father and can also have a complex nested structure [1] if effects across multiple treatment cycles are included and donor eggs or sperm are utilised. Considering just the embryo transfer process in these treatments, Spiers et al.[2] suggested a model based on a plausible biological mechanism. These models are named “embryouterus” (EU) models and are based on the idea that the successful development of an embryo depends independently on two binary factors, the embryos own inherent viability (E) and the receptivity of the uterus (U). For an embryo to develop it must both be viable (E = 1) and placed in a receptive uterus (U = 1). Spiers et al.[2] assumed E and U had constant probabilities (e and u) over the whole population. In later work, by Baeten et al.[3] an EUL model was proposed, assuming the e probability was constant over the whole population and modelling U through logistic regression. A more general approach was developed by Zhou and Weinberg [4] which replaced both E and U with logistic regression submodels allowing the inclusion of patient and embryo covariates. Although the original derivation assumed a very specific biological mechanism, the interpretation has evolved somewhat, and in particular it is acknowledged that the U component may be only partially related to uterine receptivity and can include contributions from a range of factors related to treatment [5, 6]. These approaches have been utilised for predictive and inferential analyses in a number of practical applications for the assessment of prognostic factors [5–7] and developing predictive models for assessing alternative treatment pathways [8].
The EU model has an explicit multilevel structure with the transferred embryos nested within the uteri. A prognostic factor measured at the patient level, can validly be included in either the E or the U submodels (or both) and therefore assumed or inferred to be operating through either the uterine receptivity of the patient or the viability of the embryos. Whilst it may seem counterintuitive to model a Ulevel covariate at the Elevel, the implied mechanism does suggest that this is what can be required, for instance donor egg data [9] suggests strongly that embryo viability declines with age whilst maternal receptivity does not decline as strongly. As noted by Roberts et al.[10] the level at which a covariate acts is identifiable (albeit not strongly) through the twin rate if there are multiple embryos transferred per cycle.
The potential to choose in which of the two submodels a covariate should be included adds an extra level of complexity to the practical use of these models. The natural biological interpretation means that this choice is not merely a statistical convenience, but that identification of the appropriate level may have a biological interpretation with clinical consequences. For example, age has been shown to primarily affect the embryo with a much weaker effect on the uterine component and so the effects of ageing can be offset by the use on egg donation or preservation. The presence of a substantial uterine component means that the loss in pregnancy rates associated with a move from double to single embryo transfer is more modest than would otherwise be expected [11]. In practical applications some authors have avoided any statisticallybased decision and instead made the choice of submodel arbitrarily [7, 10] or based on “previous knowledge” [5]. Roberts et al.[5] used the Akaiki Information Criterion (AIC) criterion comparing the model with the effect in the E and the model with the effect in the U. In this work the authors noted that while the AIC distinguished a “best” model, it was difficult to provide any clear statement as to the weight of evidence in favour of the particular chosen model and its biological interpretation. Although suggestions on the interpretation of BIC differences have been made previously in different contexts [12, 13], it is not clear how or whether such suggestions have relevance here. Others have chosen the patient covariates to be considered for inclusion in the model by constructing two multivariate logistic regression models, a pregnancy model and a twin pregnancy model, using a backward stepwise elimination [7]. Subsequently a number of models were considered consisting of a number or all the covariates included in the final logistic models in either the E and U with the choice of the final model being made by the models predictive abilities for pregnancy and multiple pregnancy.
As these models are now being utilised for practical applications, there is a need for guidance on how inference and model selection should be performed. When using these models to determine whether a variable is associated with outcome (inference), unless the submodel can be prespecified, there will be explicit or implicit multiple hypotheses tests associated with each variable. Thus it will be generally anticonservative to test the parameter only in the submodel selected (by whatever procedure), or to perform two tests at the nominal level. On the other hand the nonindependence of the tests suggests it may be overconservative to test in both submodels with a standard Bonferroni correction. When developing prognostic or predictive models an information criterion approach to model selection is attractive, particularly as some of the comparisons of interest are not naturally nested. However there is a need for guidance on how to interpret the information criteria in terms of the weight of evidence that an effect can be properly ascribed to the E or U component.
Motivating example
This work arose out of issues encountered in the towardSET project, which involved the use of EU models on large datasets to predict outcomes for various options for the implementation of a single embryo transfer policy. Details of this dataset have been published elsewhere [5, 11]. The main aim of the towardSET project was pragmatic, to develop a useable predictive model to allow modelling of potential treatment policies. However, given that the dataset was large and comprehensive, there was also interest in the prognostic model itself: which factors were predictive of outcome and in particular was their effect mediated through the embryo or the recipient mother. This particular analysis is described in [5]. The dataset comprised 12,487 fresh treatments from 8775 couples across 5 UK centres and included 16 categorical patient and treatment factors along with two measures of embryo quality.
Outline of present work
This present work presents a series of simulation studies which aims to clarify the issues underpinning the use of EU models for practical applications. Firstly we assess the performance of alternative strategies for the inference problem with some consideration as to power and sample size issues. Secondly we evaluate the use of the information criteria AIC and BIC to make a choice as to which submodel these effects should be included in and more generally selecting between the 4 alternative covariate inclusion models. Following this we present a reworking of a previously analysed dataset to illustrate the methodology and offer some guidance for the use of these models in real applications.
Methods
The EU model
In this work we will use the EU model of Zhou and Weinberg [4]. Following the notation of Roberts [14], an EU model each cycle i has a u _{ i } uterine receptivity probability and each embryo j of cycle i has a survival probability e _{ ij }. The u _{ i } and e _{ ij } are represented as logistic regression submodels:
where U _{ i } and E _{ ij } are the covariate design matrices of uterus and embryo covariates respectively, and ${\underset{\u02dc}{{\mathit{\beta}}^{\prime}}}_{\mathit{U}}$ and ${\underset{\u02dc}{{\mathit{\beta}}^{\prime}}}_{\mathit{E}}$ are their corresponding parameter vectors. So now the probability of a kfold pregnancy for cycle i with n _{ i } embryos transferred becomes:
where δ(k) is 1 if k = 0 and 0 otherwise and ${\mathbf{P}}_{j=1,{n}_{i}}^{k}\left({e}_{\mathit{ij}}\right)$ is the sum over all subsets of size k of the product of e _{ ij } in the subset and the complement (1 − e _{ ij }) of the e _{ ij } not in the subset. In terms of an indicator vector, s, the elements s _{ l } of the vector take the values of 0 and 1 such that their sum is equal to k. So we can write
The assumptions of the EU model are that, conditional on the covariates, embryo viabilities and uterine receptivity are independent, and embryo viability is assumed independent among the embryos produced in a single IVF cycle. In this work we assume that the cycles are independent, but this assumption can be relaxed and random effects included to allow, for example, for repeat cycles from the same couples [15, 16].
Fitting the model
The model can be readily fit to data using direct maximisation of the likelihood and the details are available in [14, 15]. This has been implemented in Splus [7, 17], R [14, 18] and Stata [15, 19]. The simulation work here used the Stata implementation, whilst the worked example was fitted using R.
Simulation design
Simulated datasets were created assuming an EU model. The simulated datasets were designed to have similar outcomes (both in terms of pregnancy rates and twin rates) to that of the large multicentre UK dataset from the towardSET project which motivated this work (see above). This simplified model only considers four covariates representing, in each submodel, a putative prognostic variable which is to be tested in the analysis and fixed covariates which are prespecified as being included in the analytical models:
We include covariates in the embryo submodel which are measured at both the embryo and the patient level. The U, E _{ p } and E _{ e } covariates are based on the distributions of the two linear predictors from a fitted EU model in the motivating dataset:
and have, by definition, coefficients β _{ u }, β _{ ee }, β _{ ep } equal to 1. The intercept terms have values determined by the outcomes in the motivating dataset:
P represents a new patient level covariate which is being considered for inclusion and is arbitrarily assigned a standard normal distribution P ~ N(0,1).
The final parameters β _{ up } and β _{ ep2 } represent the effects of the putative patient variable under investigation (P) and a range of values are considered:
In the case where β _{ up } and β _{ ep2 } are zero this reproduces an overall pregnancy rate of 19.7% and an overall twin rate of 3.2% (4.6% in those cycles with two embryos transferred) close to those observed in the towardSET dataset. Over the range of parameters considered the overall success rate varies from 19.7 to 24.8% and the twin rate between 3.2 and 6.9%.
All combinations of the coefficients of the variables being considered (β _{ up } and β _{ ep2 }) were simulated, leading to 121 different cases. Four explicit cases are of particular interest:

1.
The prognostic variable under consideration did not affect the treatment outcome (β _{ up } = 0 and β _{ ep2 } = 0).

2.
The prognostic covariate under consideration is associated with the treatment outcome at an embryo level (embryo submodel, β _{ up } = 0 and β _{ ep2 } = 0.1, 0.2,…,1).

3.
The prognostic covariate under consideration is associated with the treatment outcome at a recipient level (uterus submodel, β _{ ep2 } = 0 and β _{ up } = 0.1, 0.2,…,1).

4.
The prognostic covariate under consideration is associated with the treatment outcome at both levels (β _{ up } = 0.1, 0.2,…,1 and β _{ ep2 } = 0.1, 0.2,…,1).
Simulations were performed at 5 different sample sizes, 400, 800, 1600, 3200 and 6400 treatments all with 30% single and 70% double embryo transfers, reflecting European practice.
For each simulated cycle, values for U, E _{ p } and E _{ e } are sampled from the covariate distributions. Using these values equation (3) is used to determine the probabilities u _{ i } and e _{ ij } and these probabilities realised by sampling from a Bernoulli distribution to determine the cycle outcome.
6000 replications were performed giving estimates with an estimated precision (half width of twosided 95% CI) on a 5% test size of ±0.6% and ±1.3% on a 50% misclassification rate.
Inference: Is a variable associated with outcome?
The statistical significance of the new variable measured at the patient level is tested using a likelihood ratio (lr) test comparing the full model with a reduced model excluding the tested parameter. A lrtest is used in preference to the Wald test as previous work has suggested that for these models the Wald test performs less well for small sample sizes [14]. Since the variable can be included and tested in more than one submodel the following alternative strategies were employed and compared:

1.
“FoundinEither” strategy: Two single degree of freedom (df) lrtests are performed, one on the model including the new variable in the uterus submodel and another on the model including the variable in the embryo submodel both against the reduced model that does not include the new variable. The new patient level variable is regarded as significant in this strategy if either of the lrtests exceeds the nominal significance level.

2.
“FoundinBoth” strategy: The same single df lrtest tests are performed as in FoundinEither strategy, but both tests need to reach the nominal significance level to regard the variable as significant.

3.
“Bonferroni” strategy: This strategy is the identical as the FoundinEither strategy, but a Bonferroni correction is applied to the significance levels of the two tests to adjust for multiple testing. This strategy regards the new variable as significant at the nominal α significance level, if either of the two tests is found to be significant at an α/2 significance level.

4.
“Global” strategy: A single two df lrtest is performed on the model including the new variable in both submodels against the model not containing the variable in either submodel. This strategy regards the new variable as significant, if the lrtest is found significant at the nominal significance level.
All hypothesis tests used the conventional P < 0.05 significance level. The test size and hence type I error rates are estimated for each scenario and compared to the nominal rates, and power estimated for the nonnull cases.
For comparative purposes we also computed the estimated power based on a naive approximation where the individual submodels were considered in isolation with the same effect size as the full EU model, a sample size based on the number of cycles and an event rate given by the pregnancy rate. This reflects a conservative power estimate (using the pregnancy rate rather than the E or U probabilities) that is potentially available from standard software, although the estimates presented here were computed using simulation.
Model selection
Model selection is performed using Information Criteria (IC), the two IC considered here are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). The two IC are defined as
where $L\left(\underset{\u02dc}{\widehat{\mathit{\theta}}}\right)$ is the maximum of the likelihood function, p is the number of parameters in the model and n is the sample size [20]. These are compared in terms of their ability to correctly determine in which submodel the effect should be included. Note that these two IC are equivalent when the models considered have the same number of parameters (the nonnested case), as when comparing the inclusion of a variable in either of the two submodels (embryo or uterus).
Since BIC is a sample size dependent statistic, as with other multilevel modelling situations, the sample size can be measured at any of the levels of the hierarchy [21]. In EU models the number of cycles in the dataset and the number of embryos in the dataset could both be regarded as appropriate measures of sample size. As the outcome is only observed at the patient level the BIC used here uses the number of cycles as sample size. This issue is not a concern when choosing to include a patient effect in either the uterus or the embryo submodel (but not in both) since the BIC difference is independent of sample size.
Unlike the AIC where the difference between two models has no direct interpretation, the BIC difference can be considered as an approximation of the Bayes factor [13]. Jeffreys [12] proposed a rule of thumb for interpreting Bayes factors, and this was slightly modified by Raftery [13]. This rule of thumb is shown in Table 1. The interpretation of the BIC as a Bayes factor has
proved controversial, and the Raftery ’s interpretation is based on a very different scenario from the EU model. The proportion of simulations in which the correct assignment is made is estimated in each scenario and compared to the proportions suggested by Raftery.
In this work we pragmatically consider AIC and BIC as commonly applied. However it should be noted that there are subtle differences in the assumptions behind these two criteria. The AIC is designed to obtain the optimal model available assuming the true model is not one of the models considered; whereas the BIC assumes that the true model is one of the models considered [22].
Results and discussion
Hypothesis tests for association: type 1 error and power
Table 2 show the type 1 error for the strategies investigated here. As Table 2 and Figure 1 show; the strategies which had a type 1 error close to the design test size (5%) were the Global strategy and the somewhat conservative Bonferroni strategy. The FoundinEither strategy was found to be an overly liberal analysis with the type 1 error reaching up to 8.7%. Finally, the FoundinBoth strategy was found to be a conservative analysis having a type 1 error (2.3%2.7%) approximately half of the designed test size. As noted previously [14], the tests were slightly more liberal for the smallest sample size.
Figure 1 shows the power to detect an effect as a function of effect size for the two strategies that have close to the nominal type I error rate (Bonferroni and Global
strategies). The two cases where the true effect is in the embryo submodel (left hand panels) and uterus submodel (right) are shown. Also shown in Figure 1 are naive logistic power estimates based on the component submodels, with a probability of success equal to the birth rate. The Bonferroni strategy and the Global strategy were almost identical in terms of power, with the Bonferroni strategy being slightly more powerful for all but the smallest effect sizes. Also it is clear that a naive power calculation based on the component logistic models will severely overestimate the power.
The simulations also allow us to investigate the scenario in which the submodel is prespecified. Table 2 shows that the type 1 error was close to the nominal level (4.9%5.8%) when a uterusmodel effect was tested in the uterus submodel and similarly (4.7%5.7%) when an embryomodel effect was tested in the embryo submodel. Figures 2 and 3, show power curves when variables are correctly and incorrectly specified as acting through the embryo or uterus submodels. The Bonferroni strategy with no assumption about the correct submodel is presented as reference. As would be expected the power curves show that when the correct submodel was assumed correct the power of these strategies was better than the Bonferroni strategy but if the incorrect submodel was assumed correct then the test has less power than the Bonferroni strategy.
Thus if it is known which submodel the effect is acting in, testing in that prespecified submodel can be regarded
as an optimal strategy having a type 1 error close to the designed test size and also a higher power than it would if no such assumptions were made. However if the model specification is incorrect there is a not insignificant loss of power incurred by misspecification and a conservative Bonferroni approach would be preferable. Since in practice even when biological evidence suggests inclusion of a patient prognostic variable in one submodel, if the data is observational it can often be argued that the same prognostic variable may affect the other submodel due to patient selection effects. If this may be so then the strategy of testing the variable only in the prespecified submodel might be advised against. For example, male infertility naturally would be included in the embryo submodel since there is no plausible mechanism for the male to directly affect the uterine or maternal component. However as these models are usually applied to a population of infertile couples, only one of which is expected to be necessarily infertile, the fact that there is a known male cause will probably itself lead to the female partners of infertile males having greater fertility than those with fertile males.
Submodel selection
Firstly we consider the simple case where a variable of interest has been prespecified and one wants to know in which of the submodels it should be included, possibly in order to investigate mechanistic hypotheses. Tables 3 and 4 show the proportion of the correct classifications by the AIC/BIC criteria (the information criteria AIC and BIC are equivalent in this case) when selecting a single submodel for a prognostic variable when the true effect is in just the embryo or uterus submodel respectively. As would be
expected, the proportion of correct classifications increases from 50% (when there is no true effect) as the effect of the parameter and/or the sample size increases. Rather large sample sizes compared to typical current datasets or effect sizes are required to reliably determine the correct assignment of effect to submodel. There is no detectable bias toward selecting either of the submodels when there is no true effect in either model; over the range of simulation scenarios considered, the selection rates are equal at 50% to within the simulation error.
Table 5 shows the proportion of correct classifications when choosing between the submodels according to the difference in the BIC (or equivalently AIC in this context) between the two models. The ranges and expected
proportion of true classifications using the ranges from Raftery [13] are shown for comparison. Despite the very different application here from the original work, the proportions of correct model choices are in good agreement with Raftery, and differences in BIC between models of ≥6 are required to provide strong evidence (>95% probability of correct assignment) in favour of one or the other submodel. The previously suggested criterion of an AIC difference of >2 [5] corresponds to what Raftery termed “positive evidence” and a probability of >75% that the assignment is correct. From the table we also observe that the BIC differences proportions of correct classifications increases slightly as the sample size increases, this may be due to the fact that BIC is a large sample Bayes factor approximation and as the sample size increases the estimate becomes closer to the true Bayes factor estimate [13].
A more realistic case is that where a variable is being considered for inclusion in a model, maybe for prognostic or predictive purposes, and there is no evidence as to which submodel, if any it should be included in. Thus there are four possible models to be considered: omit the variable, include it in E, include it in U or include it in both submodels. Figure 4 shows, for an illustrative sample size of 800, the proportion of simulated datasets which are assigned to the 4 potential models as a function of the effect sizes in each of the models using the AIC (left hand panels) or BIC (right hand panels). As with other applications [23, 24], the AIC is more likely to choose a larger model than the BIC but conversely is less able to correctly identify a true effect. However, as can be seen from the margins in the lower panels of Figure 4, when the true effect is in only one of the submodels the AIC selects the model that includes the covariate in both submodels 1115% of the times (with the larger misclassification rates observed on larger sample sizes), whereas the BIC incorrectly selects the larger model <1% of the times. Both AIC and BIC become more likely to select the correct model as the sample size increases (data not shown).
Tables 3 and 4 and Figure 4 also indicate that, for the same effect size, if the information criteria are used for
model selection then it is slightly more likely that a prognostic variable will be correctly included if it truly acts through the embryo submodel than if it belongs in the uterus submodel. This reflects the greater amount of information available at the embryo level compared to that at the patient level.
Worked example
As the dataset was large, all the potential factors were included in the model. To determine in which of the submodels (E, U or both) a factor would be included in a mixture of prespecification based on other work and selection on the basis of AIC was used. A simple likelihoodratio test comparing the fitted model with a reduced model with that parameter excluded was used to give an indicative Pvalue for each factor, acknowledging that this test does not allow for the model selection process. Although the AIC allowed statements to be made as to the best fitting model and therefore the best supported mechanism (E or U) for the mediation of the effects, that work struggled to provide any indication as to the strength of the evidence and noted the need for further research.
Table 6 shows the AICs for the fitted model where each factor in turn is permuted to be in E, U, both or neither submodel whilst the remaining variables are specified as per the final selected model. Age was prespecified to act through both E and U as previous data from donated eggs suggests that both maternal age and egg age are prognostic, although the AIC would suggest that it should only be included in the E submodel. Centre and year as proxies for treatment and population changes were also included in both models. Several factors are included which would not be selected by AIC.
Whilst we would not necessarily advocate formal hypothesis tests for prognostic factors in such a large dataset, preferring to focus on effect sizes, nevertheless such tests were offered in the original work. The possible tests comparing each factor in turn with a null model for that factor, holding all other factors at their selected locations and performing a 1 or 2df likelihood ratio test are shown in Table 7. The highlighted values indicate the Pvalues (conditional on model selection) used in the original analysis,
The Bonferroni approach would give identical results at a critical value of P < 0.025 for the E and U tests, but we note that the test used previously does overstate the significance. For transfer day the P = 0.024 would be considered borderline significant with the recommended method. We note also in this example with a large sample size, that the 2df test of E + U gives overall similar conclusions, but that the significance levels from the two 1df Bonferroni test and the 2df global tests can diverge appreciably (eg Attempt number). As always, there is no substitute for carefully formed prespecified hypotheses.
With regard to the interpretation of the assignment of effects to the E or U submodel, the previous work tentatively used an arbitrary AIC difference of >2 to make statements that there was reasonable evidence that one mechanism should be preferred. The work here suggests that this statement is perhaps too optimistic, with a misclassification rate of ~25% and to make strong statements AIC differences of ⪆6 are required. If this more stringent level were used, only one of the five ascribed
assignments would be considered to have strong support. This underlines the fact that there is very limited statistical power to determine these assignments.
Finally we note that the earlier work observed that the year and centre effects, presumed surrogates for variation in clinical practice and population not captured by the available covariates, were difficult to interpret. If, rather than AIC, the BIC criteria were used and the model selected accordingly, then these effects would not be included. This suggests that these complex effects could be considered as resulting from over fitting in the large dataset.
Guidance for practice
EU models due to their biological derivation can in principle separate the effect of the embryo’s viability and the receptivity of the uterus, and yield statistical models with a causal interpretation. Even if the mechanistic basis is oversimplified or the data subject to confounding, the models have proved rich enough to be of practical utility whilst solving the statistical issues of partialobservability.
When attempting to determine whether a variable is associated with outcome, there will often be no prior evidence from which one can prespecify in which submodel a variable of interest should be included. The simulations indicate that for practical purposes a reasonable strategy would be to perform two single df likelihood ratio tests; one for the model including the variable in the uterus submodel and one for the model which includes the variable in the embryo submodel. The variable is then considered to be statistically significant if either of the lrtests is found to be significant at half the nominal level. The commonly used strategy of testing the effect in both submodels without the Bonferroni correction leads to an overliberal test. Over the range of parameters and models simulated the concern that this approach may be overconservative due to the tests being nonindependent is not borne out.
If prior knowledge reliably allows the prespecification of the submodel, then a single test at the nominal level should be preferred. However if this model is incorrectly specified then this test has less power than the Bonferroni approach with no assumptions as to the submodel: indeed the loss from missspecification exceeds the gain from correct specification. The strong patient selection effects in most observational IVF datasets mean that a priori assumptions based on biological and clinical arguments can be misleading. Thus unless the evidence for an appropriate submodel is strong a conservative approach of testing in both models with a Bonferroni correction would be advised.
The second major application of these models has been in the development of prognostic models of patient outcomes. When a patientlevel variable is to be included, the AIC and BIC can be used to make the distinction in which submodel to include that variable. The criteria suggested by Raftery [13] and summarised in Table 1 provide a good basis for assessing the weight of evidence for the resultant assignment. When considering including a patient level prognostic variable in an EU model there is a need to select between the 4 possible alternative models; BIC seems to perform well with the number of patients being used as the sample size parameter. However large sample sizes are required to determine the assignments with any degree of reliability and so to draw causal inferences from the assignment.
To date EU models have only been used to analyze observational datasets (not clinical trials) and there is little guidance as how such studies can be adequately powered, with most studies relying on heuristic or feasibility arguments. The power curves and the percentages of correct classifications presented here can be used to aid in the design of future studies.
Extensions to the EU model
In this paper we investigated, using simulation, the performance of various strategies for model selection and significance testing for a patient prognostic variable in an EU model. Whilst these simulations inevitably cover only a small subset of the parameter space, they do, we believe, provide insight as to the performance of the EU model and guidance as to its use for practical data analysis.
The work here has focussed on the case where there is only a single treatment for each couple and has not considered the potentially complex hierarchical structure that can be introduced in real data with repeat treatments. The EU model can in principal be extended to include such correlations, and the nested cases with random intercepts in the E and U submodels have been considered elsewhere [15, 16]. Simulation studies with these extended models are not yet computationally feasible. The work so far suggests that in practical applications including random effects has a negligible impact on the fixed effect estimates and inference: although the variance estimates are useful in themselves, model selection and inference for fixed effect parameters may be adequately performed whilst ignoring the higher level structure.
The major caveat is the underlying assumption in the simulation work that the EU model does reflect the true behaviour and further work is required to relax this assumption. Earlier work [14] included some additional sources of variability and in that case no qualitatively different behaviour was observed. A collaborative model has been proposed [25], but to date does not consider covariates and its properties are not yet well understood. Although the simplistic mechanistic basis of the EU model can be contested and practical interpretations have gone beyond the putative mechanism, the multilevel structure does reflect the structure of the data and account for the major statistical issue of the partial observability of the outcome. The strong assumptions of conditional independence between embryos and between the embryo and uterus effects do need to be acknowledged and further work is required to determine the extent to which the model estimates and their interpretation depend on these assumptions. For datasets with one cycle per patient, additional correlations will be indistinguishable from those induced by additional covariates and as noted above approaching these questions using simulation is as yet infeasible. Real datasets with multiple cycles per patient are subject to strong selection biases as a complex interaction between health care provider policy, patient choice and resource availability will determine of the number of cycles as well as the outcomes of previous cycles. Given that the EU model (particularly if patient random effects are included) has a rich structure which can accommodate a range of potential correlations, and that real datasets have only limited numbers of repeat cycles, it is unlikely that any realistic dataset will have sufficient power to identify departures from the EU model.
Conclusion
We believe that the EU model approach, despite its limitations, is currently the only practical approach that can properly account for the data structure encountered in the analysis of IVF data with multiple embryo transfer. There is now a sufficient body of methodological and practical work to support its more widespread use in real applications.
References
 1.
Ecochard R, Clayton DG: Multivariate parametric random effect regression models for fecundability studies. Biometrics. 2000, 56: 10231029. 10.1111/j.0006341X.2000.01023.x.
 2.
Speirs A, Lopata A, Gronow M, Kellow G, Johnston W: Analysis of the benefits and risks of multiple embryo transfer. Fertil Steril. 1983, 39: 468471.
 3.
Baeten S, Bouckaert A, Loumaye E, Thomas K: A regression model for the rate of success of in vitro fertilization. Stat Med. 1993, 12: 15431553. 10.1002/sim.4780121703.
 4.
Zhou HB, Weinberg CR: Evaluating effects of exposures on embryo viability and uterine receptivity in in vitro fertilization. Stat Med. 1998, 17: 16011612. 10.1002/(SICI)10970258(19980730)17:14<1601::AIDSIM870>3.0.CO;22.
 5.
Roberts SA, Hirst WM, Brison DR, Vail A: Embryo and uterine influences on IVF outcomes: an analysis of a UK multicentre cohort. Human Reproduction. 2010, 25: 279210.1093/humrep/deq213.
 6.
Stylianou C, Critchlow D, Brison D, Roberts SA: Embryo morphology as a predictor of IVF success: an evaluation of the proposed UK ACE grading scheme for cleavage stage embryos. Human Fertility. 2012, 15: 1117. 10.3109/14647273.2011.652251.
 7.
Hunault CC, Eijkemans MJC, Pieters MHEC, te Velde ER, Habbema JDF, Fauser BCJM: A prediction model for selecting patients undergoing in vitro fertilization for elective single embryo transfer* 1. Fertil Steril. 2002, 77: 725732. 10.1016/S00150282(01)032435.
 8.
Roberts SA, McGowan L, Mark Hirst W, Vail A, Rutherford A, Lieberman BA: Reducing the incidence of twins from IVF treatments: predictive modelling from a retrospective cohort. Human Reproduction. 2011, 26: 56910.1093/humrep/deq352.
 9.
Legro RS, Wong I, Paulson RJ, Lobo RA, Sauer MV: Recipient’s age does not adversely affect pregnancy outcome after oocyte donation. American journal of obstetrics and gynecology. 1995, 172: 96100. 10.1016/00029378(95)900918.
 10.
Roberts SA, Fitzgerald CT, Brison DR: Modelling the impact of single embryo transfer in a national health service IVF programme. Human Reproduction. 2009, 24: 122
 11.
Roberts SA, McGowan L, Hirst WM, Brison DR, Vail A, Lieberman BA: Towards single embryo transfer? Modelling clinical outcomes of potential treatment choices using multiple data sources: predictive models and patient perspectives. Health Technol Assess. 2010, 14: 1237.
 12.
Jeffreys H: Theory of probability. 1961, Oxford: Oxford University Press
 13.
Raftery AE: Bayesian model selection in social research (with discussion by Andrew Gelman & Donald B. Rubin, and Robert M. Hauser, and a rejoinder). Sociological methodology. 1995, 111196.
 14.
Roberts SA: Models for assisted conception data with embryospecific covariates. Stat Med. 2007, 26: 156170. 10.1002/sim.2525.
 15.
Stylianou C: Predictive modelling of assisted conception data with embryolevel covariates. Statistical issues and application. 2011, University of Manchester, PhD Thesis
 16.
Roberts SA, Stylianou C: The nonindependence of treatment outcomes from repeat IVF cycles: estimates and consequences. Human Reproduction. 2012, 27: 436443. 10.1093/humrep/der420.
 17.
TIBCO Software, Inc: 2012, Somerville, MA, USA
 18.
R Core Team: R: A language and environment for statistical computing. 2013, Vienna, Austria: R Foundation for Statistical Computing, URL http://www.Rproject.org/, 3900051070
 19.
StataCorp: Stata Statistical Software: Release 11. 2009, College Station, TX: StataCorp LP
 20.
Liddle AR: Information criteria for astrophysical model selection. Monthly Notices of the Royal Astronomical Society: Letters. 2007, 377: L74L78. 10.1111/j.17453933.2007.00306.x.
 21.
Schwarz G: Estimating the dimension of a model. The annals of statistics. 1978, 6: 461464. 10.1214/aos/1176344136.
 22.
Yang Y: Can the strengths of AIC and BIC be shared?. A conflict between model indentification and regression estimation. Biometrika. 2005, 92: 937
 23.
Kuha J: AIC and BIC. Sociological Methods & Research. 2004, 33: 18810.1177/0049124103262065.
 24.
Saidane M, Lavergne C: A new HMM learning algorithm for event studies: empirical evidence from the French stock market. Applied Economics Research Bulletin. 2008, 1: 30
 25.
Matorras R, Matorras F, Mendoza R, Rodriguez M, Remohi J, RodriguezEscudero FJ: The implantation of every embryo facilitates the chances of the remaining embryos to implant in an IVF programme: a mathematical model to predict pregnancy and multiple pregnancy rates. Human Reproduction. 2005, 20: 29232931. 10.1093/humrep/dei129.
Prepublication history
The prepublication history for this paper can be accessed here:http://www.biomedcentral.com/14712288/13/73/prepub
Acknowledgements
CS: Was supported by a Medical Research Council studentship during the research that lead to this paper.
AP: Final work on this study was supported by the NIHR Biomedical Research Centre for Mental Health.
SAR: Was supported by the NIHR Manchester Biomedical research centre and Central Manchester University Hospitals NHS Foundation Trust.
The original work on the towardSET project providing the example data was funded by the NIHR Health Technology Assessment programme (project number 05/43/01) and was be published in full in the Health Technology Assessment journal [11]. The views and opinions expressed are those of the authors and do not necessarily reflect those of the Department of Health or our advisors. We also acknowledge the towardSET collaboration for providing the data and Mark Hirst for collating the data and performing much of the original analysis. The reanalysis presented here was conducted as part of the EU FP7 EpiHealth project FP7HEALTH2011278418.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare they have no competing interests.
Authors’ contributions
CS designed and performed the simulation work. AP and SR supervised this work. SR conceived the project and added the worked example. All authors contributed to the writing of the manuscript. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Stylianou, C., Pickles, A. & Roberts, S.A. Using Bonferroni, BIC and AIC to assess evidence for alternative biological pathways: covariate selection for the multilevel EmbryoUterus model. BMC Med Res Methodol 13, 73 (2013). https://doi.org/10.1186/147122881373
Received:
Accepted:
Published:
Keywords
 Embryo uterus models
 Invitro fertilization
 Hypothesis testing
 Model selection
 Information criteria
 Simulation