Meta-analytic estimation of measurement variability and assessment of its impact on decision-making: the case of perioperative haemoglobin concentration monitoring

Charpentier, Emmanuel; Looten, Vincent; Fahlgren, Björn; Barna, Alexandre; Guillevin, Loïc

doi:10.1186/s12874-016-0107-5

Research Article
Open access
Published: 19 January 2016

BMC Medical Research Methodology volume 16, Article number: 7 (2016)

Abstract

Background

As a part of a larger Health Technology Assessment (HTA), the measurement error of a device used to monitor the hemoglobin concentration of a patient undergoing surgery, as well as its decision consequences, were to be estimated from published data.

Methods

A Bayesian hierarchical model of measurement error, allowing the meta-analytic estimation of both central and dispersion parameters (under the assumption of normality of measurement errors) is proposed and applied to published data; the resulting potential decision errors are deduced from this estimation. The same method is used to assess the impact of an initial calibration.

Results

The posterior distributions are summarized as mean ± sd (credible interval). The fitted model exhibits a modest mean expected error (0.24 ± 0.73 (−1.23 1.59) g/dL) and a large variability (mean absolute expected error 1.18 ± 0.92 (0.05 3.36) g/dL). The initial calibration modifies the bias (−0.20 ± 0.87 (−1.99 1.49) g/dL), but the variability remains almost as large (mean absolute expected error 1.05 ± 0.87 (0.04 3.21) g/dL). This entails a potential decision error (“false positive” or “false negative”) for about one patient out of seven.

Conclusions

The proposed hierarchical model allows the estimation of the variability from published aggregates, and allows the modeling of the consequences of this variability in terms of decision errors. For the device under assessment, these potential decision errors are clinically problematic.

Background

The CEDIT¹ is a Health Technology Assessment (HTA) agency within the University Hospitals in Paris (AP-HP²). Since 1982, it has been in charge of advising the senior management about the adoption and use of innovative medical technologies in AP-HP’s hospitals.

We had to assess, in a limited time frame, the possible impact of introducing a device³ that monitors the hemoglobin concentration of patients undergoing surgery. This device produces a measurement (SpHb) of the current hemoglobin concentration by means of a sensor that is a variation of pulse oximetry sensors; this measurement is intended to replace the measurement (tHb) produced by a laboratory analyzer, thus avoiding the wait for laboratory results (which could matter in a surgical context) and the disruption of the laboratory workflow caused by unplanned requests.

Previous studies of this device in various clinical settings showed that its measurement errors were large but almost symmetric around 0. A recent meta-analysis [1] aggregated the results reported in 32 papers, 13 of which reported results of operating room use; the average mean error (bias) in this surgical subgroup was 0.4 g/dL, but the measurement error standard deviation was larger than 1 g/dL in 15 of the 16 measurement series reported by these 13 papers.

The authors report a bias whose confidence interval includes 0, but they state “We have not found any publications that provide statistical methods to quantify the uncertainty of SD in meta-analysis”. Therefore, their clinical conclusions rest on hypotheses about the possible standard deviation of the measurement errors, without estimating it. The authors complete their conclusion on the bias by warning that “the wide LOA [limits of agreement] mean clinicians should be cautious when making clinical decisions based on these devices”.

In order to assess the usability of this device, our HTA therefore required the assessment of decision error risks, hence the need to estimate not only the bias (which can be done by a variety of methods, see [2] for an example), but also the variability of the measurements used in this decision. In other words, the use of this device requires not only the assessment of a (possibly “significant”) bias (i.e. an average error whose confidence/credible interval does not contain 0), but also of its variability (e.g. by estimating its standard deviation). This allows us to estimate the probability of a potential clinical decision error.

However, as pointed out by [1], such methods for meta-analytic assessment of variability are almost nonexistent in the field (see Discussion), hence our proposal.

We also wanted to assess the impact of an initial calibration of the device (proposed by some authors in order to remove patient-specific systematic errors) which consists in the subtraction from a given measure SpHb of an initial error SpHb₀−tHb₀ obtained from initial calibrating measurements of SpHb and tHb:

$${~}_{c}\text{SpHb}\,=\,\text{SpHb}-\left(\text{SpHb}_{0}-\text{tHb}_{0}\right) $$

Therefore, we propose a Bayesian model that pools the information given in various papers about the distribution of measurement errors, and we use the resulting estimates to model the risks of clinical decision error for these two modes of use of the device.

Methods

Literature review

We repeated the published search strategy of [1] on the Pubmed and Embase databases, and augmented this search by a manual search of the references marked as “Related to” by Pubmed; we then obtained the full texts of a first selection of papers, whose “References” sections were used to complete the search. Our selection was driven by the following criteria:

The device whose operating characteristics were reported in the paper had to use the same operating principle as our target device.
The paper had to report clinical use during a surgical intervention.
The paper had to report an estimate of both the mean and the standard deviation of the differences between paired reference (tHb) and device-derived (SpHb) measurements made at the same time, or at least to quote some indicator (such as Bland & Altman’s LOA [3]) from which these quantities could be reconstructed.

The selected papers were analyzed to extract and/or reconstruct, for each study population, the sample size and the observed point estimates of the mean and standard deviation.

Modeling

For the intended use case (monitoring of hemoglobin concentration in the operating room), the measurement given by reference methods is the only available reference, and the anesthesiologists’ methods are built against this measure. Therefore, we ignored its possible errors and chose to consider tHb as our standard.

In the selected papers, the same patient may have paired tHb/SpHb measurements on one or more occasions. In most papers (see Table 1), these occasions are merged into a single series, without information about intra- and inter-patient variability; other papers report measurements made on different occasions separately, but without information on the possible correlation of measurement errors within the same patient.

Table 1 Data extracted from the literature

Therefore, when a paper reported more than one series of measurement errors (i.e. set of assessments of this error made in the same circumstances on independent patients), these series were kept separate, and analyzed as independent: these series were usually characterized by a factor (e.g. operating phase) strongly linked to hemoglobin concentration, overwhelming the (weak) patient-related factors.

In other words, we ignored a possible “paper” level in our model.

Raw SpHb

We postulated that in each series $i$ in the literature, the individual measurement errors $e_{i,j,k}=\text{SpHb}_{i,j,k}-\text{tHb}_{i,j,k}$ in patient $j$ of the series $i$ at occasion $k$ are normally distributed (Eq. (1) below). We also postulated that the series-specific means $\mu_{i}$ of measurement errors (i.e. the series-specific biases) are normally distributed in the (hypothetical) population of all possible repetitions of such studies, with a population-level mean $\mu_{m}$ (overall bias) and a population-level standard deviation $\sigma_{m}$ (2); similarly, the series-specific standard deviations $\sigma_{i}$ are supposed to have a lognormal $(\mu_{ls},\sigma_{ls})$ distribution in the population (3).

$$\begin{array}{*{20}l} \mathrm{e}_{\text{\textit{i,j,k}}} & \sim {\mathcal{N}}\left(\mu_{i}, {\sigma_{i}^{2}}\right) \end{array} $$

((1))

$$\begin{array}{*{20}l} \mu_{i} & \sim {\mathcal{N}}\left(\mu_{m}, {\sigma_{m}^{2}}\right) \end{array} $$

((2))

$$\begin{array}{*{20}l} \sigma_{i} & \sim {\mathcal{L}\mathcal{N}}\left(\mu_{ls}, \sigma_{ls}^{2}\right),~\text{which we shall use as:} \\ \log\sigma_{i}\ & \sim {\mathcal{N}}\left(\mu_{ls}, \sigma_{ls}^{2}\right) \end{array} $$

((3))

The postulate of normality of measurement errors (1) allows us to use two well-known results of the sampling theory from normal distributions to derive the likelihoods of the usual m and s estimators of μ and σ from a sample of size n:

$$\begin{array}{*{20}l} \sqrt{n_{i}-1}\,\frac{m_{i}-\mu_{i}}{s_{i}} & \sim t_{n_{i}-1}\qquad\text{and, independently,} \end{array} $$

((4))

$$\begin{array}{*{20}l} (n_{i}-1)\,\frac{{s_{i}^{2}}}{{\sigma_{i}^{2}}} & \sim \chi^{2}_{n_{i}-1} \end{array} $$

((5))

(4) and (5) allow us to compute the likelihoods of the published series-level estimators $m_{i}$ and $s_{i}$ instead of requiring patient-level data $e_{i,j,k}$.
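As an illustration of how Eqs. (4) and (5) turn published aggregates into a likelihood, the following Python sketch evaluates the log-density of one reported pair $(m_{i}, s_{i})$ for candidate values of $\mu_{i}$ and $\sigma_{i}$. It is a plain transcription of the two equations as written above, multiplied together, and not the authors' Stan program; the function names and the example series are hypothetical, and the change-of-variable terms are added here so that each factor is a density of the reported statistic itself.

```python
import math

def log_t_pdf(x, df):
    """Log-density of Student's t with df degrees of freedom."""
    return (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi)
            - (df + 1) / 2 * math.log1p(x * x / df))

def log_chi2_pdf(x, df):
    """Log-density of a chi-square with df degrees of freedom (x > 0)."""
    return ((df / 2 - 1) * math.log(x) - x / 2
            - (df / 2) * math.log(2) - math.lgamma(df / 2))

def series_log_lik(m, s, n, mu, sigma):
    """Log-likelihood of a published (mean, SD, size) triple given mu and sigma."""
    df = n - 1
    t_stat = math.sqrt(df) * (m - mu) / s        # statistic of Eq. (4), as written
    chi2_stat = df * s * s / (sigma * sigma)     # statistic of Eq. (5)
    # Change of variable from each statistic back to the reported m and s.
    log_p_m = log_t_pdf(t_stat, df) + math.log(math.sqrt(df) / s)
    log_p_s = log_chi2_pdf(chi2_stat, df) + math.log(2 * df * s / sigma ** 2)
    return log_p_m + log_p_s

# Hypothetical series: n = 30, reported mean error 0.4 g/dL, reported SD 1.2 g/dL.
ll_close = series_log_lik(0.4, 1.2, 30, mu=0.4, sigma=1.2)
ll_far = series_log_lik(0.4, 1.2, 30, mu=1.5, sigma=1.2)
print(ll_close > ll_far)
```

Only the three published numbers per series enter the computation, which is what makes a meta-analysis of the dispersion possible without patient-level data.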

Calibrated SpHb

The error for occasion $k$ in patient $j$ in series $i$ is defined by $e_{i,j,k}=\text{SpHb}_{i,j,k}-\text{tHb}_{i,j,k}$. The error of ${}_{c}\text{SpHb}$ (the “calibrated error”), ${}_{c}e_{i,j,k}$, will be:

$$\begin{array}{*{20}l} {~}_{c}e_{\text{\textit{i,j,k}}} & = {~}_{c}\text{SpHb}_{\text{\textit{i,j,k}}}-\text{tHb}_{i,j,k} \\ {} & = \text{SpHb}_{\text{\textit{i,j,k}}}-\left(\text{SpHb}_{i,j,0}-\text{tHb}_{i,j,0}\right)-\text{tHb}_{i,j,k}\\ {} & = \left(\text{SpHb}_{\text{\textit{i,j,k}}}-\text{tHb}_{i,j,k}\right) - \left(\text{SpHb}_{i,j,0}-\text{tHb}_{i,j,0}\right)\\ & = e_{\text{\textit{i,j,k}}}-e_{i,j,0}\,. \end{array} $$

Now, in each series $i$, we can decompose $e_{i,j,k}$ as the sum of a series-specific bias $\mu_{i}$, a patient-specific random effect $f_{i,j}$ distributed with mean 0 and variance ${\tau _{i}^{2}}$, and an occasion-specific random residual $g_{i,j,k}$ distributed with mean 0 and variance $\upsilon _{i,j}^{2}$.

Suppose further that these terms are independent and, for simplicity, homoscedastic in each series⁴ (i.e. for all patients j of the series i, $\upsilon _{i,j}^{2}={\upsilon _{i}^{2}}$). Then, $\forall i, \text {Var}\left (e_{\text {\textit {i,j,k}}}\right)={\sigma _{i}^{2}}=\text {Var}\left (\mu _{i}+f_{i,j}+g_{\text {\textit {i,j,k}}}\right)={\tau _{i}^{2}}+{\upsilon _{i}^{2}}$. However,

$$\begin{array}{*{20}l} {~}_{c}e_{\text{\textit{i,j,k}}} & = \mu_{i}+f_{i,j}+g_{\text{\textit{i,j,k}}}-\left(\mu_{i}+f_{i,j}+g_{i,j,0}\right)\\ & = g_{\text{\textit{i,j,k}}}-g_{i,j,0} \end{array} $$

((6))

Therefore, $\text {Var}\left ({~}_{c}e_{\text {\textit {i,j,k}}}\right)=2{\upsilon _{i}^{2}}$. The ratio of corrected to raw measurement standard errors is:

$$\theta_{i}\,= \sqrt{\frac{2{\upsilon_{i}^{2}}}{{\tau_{i}^{2}}+{\upsilon_{i}^{2}}}}\,. $$

Under our assumptions, this ratio can take values between 0 (all error is patient-specific, with no residual, υ=0) and $\sqrt {2}$ (all error is random, with no patient-specific component, τ=0). Calibration therefore reduces the standard deviation (θ<1) only when υ<τ. Both extreme cases make sense in the current context.
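The behaviour of this ratio can be checked numerically. The short Python sketch below evaluates θ for a few combinations of τ and υ; the function name and the chosen values are illustrative only.

```python
import math

def sd_ratio(tau, upsilon):
    """theta = sqrt(2 * upsilon^2 / (tau^2 + upsilon^2))."""
    return math.sqrt(2 * upsilon ** 2 / (tau ** 2 + upsilon ** 2))

# (tau, upsilon) pairs: purely patient-specific error, mixed cases, purely random error.
for tau, upsilon in [(1.0, 0.0), (1.0, 0.5), (1.0, 1.0), (0.0, 1.0)]:
    print(tau, upsilon, round(sd_ratio(tau, upsilon), 3))
```

The ratio equals 1 when τ=υ, so calibration is neutral for the standard deviation at that point and inflates it when the occasion-specific residual dominates.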

The definition of the calibrated error implies (6) that it is (positively) correlated to the raw error; therefore, their difference should be (negatively) correlated to the raw error, and so should be their means.

Estimating τ and υ is equivalent to estimating σ and θ. The latter parametrization allows us, as we shall see, to model series with and without calibrated errors in the same way.

We model the impact of calibration as variations of the measurement error’s mean and standard deviation (modeled, as before, as being normally distributed):

$$\begin{array}{*{20}l} {~}_{c}e_{\text{\textit{i,j,k}}} & \sim {\mathcal{N}}\left(\mu\text{\scriptsize c}_{i}, \sigma\text{\scriptsize c}_{i}^{2}\right) \end{array} $$

((7))

$$\begin{array}{*{20}l} \mu\text{\scriptsize c}_{i} & = \mu_{i} + \delta_{i} \end{array} $$

((8))

$$\begin{array}{*{20}l} \sigma\text{\scriptsize c}_{i} & = \sigma_{i}\,\theta_{i} \end{array} $$

((9))

We model the position parameters $\mu_{i}$ and $\delta_{i}$ of individual series as having a bivariate normal distribution; similarly, we model their (suitably transformed) spread parameters $\sigma_{i}$ and $\theta_{i}$ as bivariate normally distributed:

$${} {\fontsize{9pt}{9.3pt}\begin{aligned} {\mu_{i} \choose \delta_{i}} \sim\mathcal{MVN} \left({\mu_{m} \choose \mu_{\delta}}, \left({{\sigma_{m}^{2}} \atop \rho_{p}\sigma_{m}\sigma_{\delta}} \ \ \ {\rho_{p}\sigma_{m}\sigma_{\delta} \atop \sigma_{\delta}^{2}} \right)\right) \end{aligned}} $$

((10))

$${} {\fontsize{9pt}{9.3pt}\begin{aligned} {\log\sigma_{i} \choose \log\frac{\theta_{i}}{\sqrt{2}-\theta_{i}}} \sim\,\mathcal{MVN} \left({\mu_{ls} \choose \mu_{lt}}, \left({\sigma_{ls}^{2} \atop \rho_{s}\sigma_{ls}\sigma_{lt}}\ \ \ {\rho_{s}\sigma_{ls}\sigma_{lt} \atop \sigma_{lt}^{2}} \right)\right)\end{aligned}} $$

((11))

and, as before, (7) allows us to use (4) and (5), mutatis mutandis, to compute the likelihoods from the published data.

From (10)–(11) and the properties of the multivariate normal distribution, it follows that the marginal distribution of $\mu_{i}$ is given by (2) and that the marginal distribution of $\log\sigma_{i}$ is given by (3); therefore, despite appearances, (2)–(3) describe the same model as (10)–(11) when the calibrated data are unknown.
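The transformation applied to θ in (11) maps the interval (0, √2) onto the real line, which is what allows a normal model for it. A minimal Python sketch of this transformation, its inverse, and a draw of one series' spread parameters follows; the hyperparameter values are placeholders and not estimates from this paper, and the function names are hypothetical.

```python
import math
import random

SQRT2 = math.sqrt(2.0)

def theta_to_real(theta):
    """Map theta in (0, sqrt(2)) to the real line, as in Eq. (11)."""
    return math.log(theta / (SQRT2 - theta))

def real_to_theta(z):
    """Inverse map: sqrt(2) times the logistic function of z."""
    return SQRT2 / (1.0 + math.exp(-z))

def draw_bivariate_normal(rng, mean1, mean2, sd1, sd2, rho):
    """One draw from a bivariate normal, built from two independent N(0, 1)."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    x1 = mean1 + sd1 * z1
    x2 = mean2 + sd2 * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return x1, x2

rng = random.Random(1)
# Placeholder hyperparameters (mu_ls, mu_lt, sigma_ls, sigma_lt, rho_s): NOT estimates.
log_sigma, z = draw_bivariate_normal(rng, 0.2, 0.0, 0.3, 0.5, rho=-0.3)
sigma_i, theta_i = math.exp(log_sigma), real_to_theta(z)
print(sigma_i, theta_i, sigma_i * theta_i)  # raw SD, ratio, calibrated SD (Eq. (9))
```

The first coordinate of the draw depends only on its own mean and standard deviation, which is the marginalization property used in the paragraph above.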

Model implementation and fitting

A Bayesian implementation of this model was fitted by MCMC methods, using the Stan [4] modeling language through the rstan [5] interface to R [6]. The model uses Eqs. (4) and (5) to compute the likelihood of the data and directly implements Eqs. (2) and (3) for series without calibrated SpHb and (8) to (11) for series with calibrated SpHb.

Using (1) and (7), we also sampled the relevant parameters of a new study and of a new observation within this study at each iteration of the MCMC sampling, thus obtaining a sample representative of the (predictive) distribution of measurement errors without being constrained by the particulars of any study. This simulation of the characteristics of the device in a new setting is the basis of our inferences on its performance.
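The predictive simulation described above can be sketched as follows for raw SpHb: for each posterior draw of the population-level parameters, a new series is drawn with (2) and (3), then a new observation with (1). This Python sketch is an illustration, not the authors' Stan code; the list `posterior_draws` is a stand-in for real MCMC output, and the values 0.7 and 0.3 used for $\sigma_{m}$ and $\sigma_{ls}$ are arbitrary assumptions.

```python
import math
import random

def predictive_error(rng, mu_m, sigma_m, mu_ls, sigma_ls):
    """Draw (bias, sd, error) for a new series and a new observation in it."""
    mu_new = rng.gauss(mu_m, sigma_m)                  # Eq. (2): bias of the new series
    sigma_new = math.exp(rng.gauss(mu_ls, sigma_ls))   # Eq. (3): SD of the new series
    e_new = rng.gauss(mu_new, sigma_new)               # Eq. (1): one new error
    return mu_new, sigma_new, e_new

rng = random.Random(42)
# Stand-in for MCMC output: each tuple is one hypothetical posterior draw of
# (mu_m, sigma_m, mu_ls, sigma_ls); real draws would differ at each iteration.
posterior_draws = [(0.23, 0.7, 0.23, 0.3)] * 2000
sims = [predictive_error(rng, *draw) for draw in posterior_draws]
errors = [e for _, _, e in sims]
mean_abs_error = sum(abs(e) for e in errors) / len(errors)
print(round(mean_abs_error, 2))
```

Because one new series and one new observation are drawn per iteration, the resulting sample reflects both the uncertainty about the population-level parameters and the between-series heterogeneity.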

Since our data (means and log-standard deviations of errors in the published series) were already more or less centered around 0 and scaled about 1, we followed [7, 8] and chose a Cauchy(0,3) density as a weakly informative prior distribution for the location parameters $\mu_{m}$, $\mu_{\delta}$ and the transformed spread parameters $\mu_{ls}$ and $\mu_{lt}$, a half-Cauchy(0,3) density (i.e. a Cauchy(0,3) truncated to positive values, T[0,] in Stan notation) for the standard deviations $\sigma_{m}$, $\sigma_{\delta}$, $\sigma_{ls}$ and $\sigma_{lt}$, and a Uniform(−1,1) distribution for the correlation coefficients $\rho_{p}$ and $\rho_{s}$. This choice yields a weakly informative prior that is robust to a few outlying values without expressing unreasonable a priori beliefs in very large values of the parameters.
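To give a sense of what “weakly informative” means for a Cauchy(0,3) prior, the following Python sketch computes the prior probability that a parameter lies within a given distance of 0, using the closed-form Cauchy distribution function. The function names and the chosen bounds are illustrative.

```python
import math

def cauchy_cdf(x, loc=0.0, scale=3.0):
    """Cumulative distribution function of a Cauchy(loc, scale) variable."""
    return 0.5 + math.atan((x - loc) / scale) / math.pi

def central_mass(bound, scale=3.0):
    """Prior probability that a Cauchy(0, scale) variable lies in [-bound, bound]."""
    return cauchy_cdf(bound, 0.0, scale) - cauchy_cdf(-bound, 0.0, scale)

for bound in (3, 10, 30):
    print(bound, round(central_mass(bound), 3))
```

Half of the prior mass lies within ±3, while the heavy tails leave some probability for much larger values, which is the robustness property invoked above.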

The resulting program is available as Additional file 1; it is also part of the noweb source of the present paper (see Additional file 2 for instructions).

The convergence of the MCMC chains was checked by visual assessment of the MCMC traces (see Additional file 3), the ratios of MCMC standard deviation to standard deviation for each parameter of the model (see Additional file 4) and the chain convergence indicator $\widehat {\mathrm {R}}$ (see [9]). The quality of the model was assessed by placing each observed quantity in the a posteriori distribution of the parameter it estimates (see Additional file 5).
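For readers unfamiliar with the $\widehat {\mathrm {R}}$ indicator, the following Python sketch computes the basic (non-split) Gelman-Rubin potential scale reduction factor from several chains of equal length. It is an illustration of the principle under that simplifying assumption; the value reported by rstan may be computed with a refined variant, and the function name and toy chains are hypothetical.

```python
from statistics import mean, variance

def potential_scale_reduction(chains):
    """Basic (non-split) Gelman-Rubin R-hat for equal-length chains."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean(variance(c) for c in chains)       # W: mean within-chain variance
    between = n * variance(chain_means)              # B: between-chain variance
    var_plus = (n - 1) / n * within + between / n    # pooled variance estimate
    return (var_plus / within) ** 0.5

# Toy chains: the first pair explores the same region, the second pair does not.
mixed = [[0.1, -0.2, 0.3, 0.0, -0.1], [0.0, 0.2, -0.3, 0.1, -0.1]]
stuck = [[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]]
print(potential_scale_reduction(mixed), potential_scale_reduction(stuck))
```

Values close to 1 indicate that the between-chain variability is negligible relative to the within-chain variability; values well above 1 signal chains that have not converged to a common distribution.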

Diagnostic impact assessment

We used the bias and standard deviation values generated during model parameter estimation to assess the impact of measurement errors in terms of decision errors. We postulated that the true values tHb of hemoglobin concentration were uniformly distributed on the [4, 12] g/dL range.

Let f be the density of the measurement error E (whose realizations are the $e_{i,j,k}$ observations whose mean and standard deviation estimates are reported), and g the density of tHb (F and G being their respective cumulative distribution functions). The probability of observing a measurement SpHb lower than some threshold t (a “positive” reading in our case) is:

$$\begin{array}{*{20}l} \Pr(\text{SpHb}<t) & = \Pr(\text{tHb}+E<t) \\ & = \int_{x}\Pr(x+E<t)\,g(x)\,dx \\ & = \int_{x}\Pr(E<t-x)\,g(x)\,dx \\ & = \int_{x}\left(\int_{e<t-x}\!f(e,x)\,de\right)\,g(x)\,dx \\ & = \int_{x}\int_{-\infty}^{t-x}\!f(e,x)\,de\,g(x)\,dx \end{array} $$

((12))

Similarly, the probability of a “true positive” is:

$$\begin{array}{*{20}l} {}\Pr(\text{SpHb}<t\wedge{}\text{tHb}<t) & = \int_{x<t}\int_{-\infty}^{t-x}\!f(e,x)\,de\,g(x)\,dx \end{array} $$

((13))

Since we modeled errors as independent of the “true” values tHb, these expressions simplify to:

$$\begin{array}{*{20}l} \Pr(\text{SpHb}<t) & = \int_{x} F(t-x)\,g(x)\,dx \end{array} $$

((14))

$$\begin{array}{*{20}l} \Pr(\text{SpHb}<t\wedge{}\text{tHb}<t) & = \int_{-\infty}^{t} F(t-x)\,g(x)\,dx \end{array} $$

((15))

The probability of a “positive” case being G(t) by definition, (14) and (15) are sufficient to compute the sensitivity, specificity and positive and negative predictive values.
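To make the use of (14) and (15) concrete, the Python sketch below computes the decision metrics for one threshold t, assuming a single normal error distribution and a uniform distribution of tHb on [4, 12] g/dL. This is a simplification for illustration: it plugs in one (bias, sd) pair, here the predictive summary 0.27 and 1.47 g/dL quoted in the Results, instead of propagating the whole posterior sample, and the function names and the threshold of 8 g/dL are assumptions.

```python
import math

def normal_cdf(x, mean, sd):
    """F: cumulative distribution function of the measurement error."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def integrate(func, a, b, steps=4000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))

def decision_metrics(t, bias, sd, low=4.0, high=12.0):
    """Decision metrics for threshold t, with low < t < high."""
    g = 1.0 / (high - low)                                   # uniform density of tHb
    reading_pos = lambda x: normal_cdf(t - x, bias, sd) * g  # F(t - x) g(x)
    p_pos = integrate(reading_pos, low, high)                # Eq. (14)
    p_true_pos = integrate(reading_pos, low, t)              # Eq. (15); g = 0 below low
    prevalence = (t - low) * g                               # G(t)
    p_false_pos = p_pos - p_true_pos
    p_false_neg = prevalence - p_true_pos
    p_true_neg = 1.0 - prevalence - p_false_pos
    return {
        "sensitivity": p_true_pos / prevalence,
        "specificity": p_true_neg / (1.0 - prevalence),
        "ppv": p_true_pos / p_pos,
        "npv": p_true_neg / (1.0 - p_pos),
        "p_error": p_false_pos + p_false_neg,
    }

metrics = decision_metrics(t=8.0, bias=0.27, sd=1.47)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Repeating this computation for each draw of the predictive sample, rather than for a single plug-in pair, would yield a distribution of each metric and hence credible intervals.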

The diagnostic impact of measurement errors depends on the distribution of the true values tHb. For reasons discussed below, we chose to assess this impact by postulating a uniform distribution of tHb on a range spanning the clinically useful range of threshold values. According to the literature, this range is about 6 to 10 g/dL [10–12]. Therefore, our impact assessment used a uniform distribution over the range from 4 to 12 g/dL.

Results

The literature review led us to select 21 papers [13–33] reporting 34 distinct estimations of the mean and standard deviation of measurement error; among these, four papers [24, 27, 28, 32] report the characteristics of measurement error after initial calibration in five series. The data extracted from the literature are listed in Table 1.

Model fit

In the text, posterior distributions are summarized as mean ± sd (credible interval) unit; the bounds of the credible intervals are the 0.025 and 0.975 quantiles. The full set of summary statistics for the MCMC sample can be found in Additional file 4.

Analysis of raw SpHb measurement errors

The population-level results of the model fitting for raw SpHb measurement errors are depicted in Fig. 1 and summarized in Table 2; Table 3 summarizes predictive error results, i.e. bias and standard deviation in a new study (new setting), and mean error, squared error and absolute error for a new observation.

Table 2 Estimates of the population-level distribution of measurement errors of raw SpHb

Table 3 Replication simulation results for raw SpHb

The overall mean error (bias) of raw SpHb has mean 0.23 ± 0.12 (−0.02 0.46) g/dL; the measurement error of raw SpHb is distributed around this mean with log-standard deviation 0.23 ± 0.04 (0.15 0.30) g/dL.

The mean expected bias (systematic error expected in a new study) is 0.24 ± 0.73 (−1.23 1.59) g/dL. The mean expected error (new measurement error in a new study) is 0.27 ± 1.47 (−2.56 3.26) g/dL, whereas the mean expected absolute error is 1.18 ± 0.92 (0.05 3.36) g/dL, and the root of the mean quadratic expected error is 1.50 g/dL.
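These summaries can be cross-checked with elementary identities. The Python sketch below recomputes the root mean square error from the reported mean and standard deviation of the expected error, and the mean absolute error that a single normal distribution with the same mean and standard deviation would give. The normal approximation is an assumption made here for illustration; the actual predictive distribution is a mixture, so only approximate agreement is expected.

```python
import math

mean_error, sd_error = 0.27, 1.47   # predictive summary quoted in the text (g/dL)

# Root mean square error: sqrt(E[e^2]) = sqrt(mean^2 + variance).
rmse = math.sqrt(mean_error ** 2 + sd_error ** 2)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def folded_normal_mean(mu, sigma):
    """E|X| for X ~ N(mu, sigma^2)."""
    density_term = sigma * math.sqrt(2.0 / math.pi) * math.exp(-mu ** 2 / (2 * sigma ** 2))
    shift_term = mu * (1.0 - 2.0 * normal_cdf(-mu / sigma))
    return density_term + shift_term

mae_if_normal = folded_normal_mean(mean_error, sd_error)
print(round(rmse, 2), round(mae_if_normal, 2))
```

The first value is about 1.49 g/dL, in line with the reported 1.50 g/dL given rounding of the inputs; the second is about 1.19 g/dL, close to the reported mean absolute error of 1.18 g/dL.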

Impact of calibration

The population-level estimates of the impact of calibration are presented in Table 4 and Fig. 2 and the simulation-based estimates of the resulting measurement errors are presented in Table 5, which also reports the expected bias correction and expected ratio of raw and calibrated standard deviations (inflation/deflation factor).

Table 4 Estimates of the population-level distribution of corrections to measurement error allowed by calibration

Table 5 Replication simulation results for calibrated SpHb

One notes that, whereas the bias correction is almost systematically negative (−0.42 ± 0.20 (−0.83 0.02) g/dL), the impact of calibration on standard error and expected errors is modest (the mean expected absolute error is 1.05 ± 0.87 (0.04 3.21) g/dL, which is not much less than in the non-calibrated case), and has a non-negligible probability of enlarging the standard error (actually, for a new study, Pr(θ>1)≈ 0.102).

Estimation of clinical impact

The decisional impact of measurement errors of raw SpHb is summarized in Table 6 in terms of sensitivity, specificity, positive and negative predictive values (conditional probabilities) as well as accuracy and probability of a decision error (absolute probabilities); these results are illustrated in Fig. 3. Similarly, Table 7 and Fig. 4 summarize the diagnostic impact of measurement errors of calibrated SpHb. The resultant risks of decision errors and their credible regions are graphically compared in Fig. 5.

Table 6 Clinical impact of measurement errors of raw SpHb

Table 7 Clinical impact of measurement errors of calibrated SpHb
