Proportional odds ratio model for comparison of diagnostic tests in meta-analysis

Background: Consider a meta-analysis where a 'head-to-head' comparison of diagnostic tests for a disease of interest is intended. Assume there are two or more tests available for the disease, where each test has been studied in one or more papers. Some of the papers may have studied more than one test, hence the results are not independent. Also, the collection of tests studied may change from one paper to the other, hence incomplete matched groups.
Methods: We propose a model, the proportional odds ratio (POR) model, which makes no assumptions about the shape of ORp, a baseline function capturing the way OR changes across papers. The POR model does not assume homogeneity of ORs, but merely specifies a relationship between the ORs of the two tests. One may expand the domain of the POR model to cover dependent studies, multiple outcomes, multiple thresholds, multi-category or continuous tests, and individual-level data.
Results: In the paper we demonstrate how to formulate the model for a few real examples, and how to use widely available or popular statistical software (like SAS, R or S-Plus, and Stata) to fit the models and estimate the discrimination accuracy of tests. Furthermore, we provide code for converting ORs into other measures of test performance like predictive values, post-test probabilities, and likelihood ratios, under mild conditions. We also provide code to convert numerical results into graphical ones, like forest plots, heterogeneous ROC curves, and post-test probability difference graphs.
Conclusions: The flexibility of the POR model, coupled with the ease with which it can be estimated in familiar software, suits the daily practice of meta-analysis and improves clinical decision-making.


Background
A diagnostic test, in its simple form, tries to detect presence of a particular condition (disease) in a sample. Usually there are several studies where performance of the diagnostic test is measured by some statistic. One may want to combine such studies to get a good picture of performance of the test, a meta-analysis. Also, for a particular disease there may be several diagnostic tests invented, where each of the tests is subject of one or more studies. One may also want to combine all such studies to see how the competing tests are performing with respect to each other, and choose the best for clinical practice.
To pool several studies and estimate a summary statistic, some assumptions are made. One such assumption is that differences seen between individual study results are due to chance (sampling variation). Equivalently, this means all study results reflect the same "true" effect [1]. However, meta-analyses of studies for some diagnostic tests show that this assumption, in some cases, is not empirically supported. In other words, there is more variation between the studies than could be explained by random chance alone, the so-called "conflicting reports". One solution is to relax the assumption that every study is pointing to the same value. In other words, one accepts explicitly that different studies may correctly give "different" values for performance of the same test.
For example, sensitivity and specificity are a pair of statistics that together measure the performance of a diagnostic test. One may want to compute an average sensitivity and an average specificity for the test across the studies, hence pooling the studies together. Instead, one may choose to extract the odds ratio (OR) from each paper (as the test performance measure), and then estimate the average OR across the studies. The advantage is that widely different sensitivities (and specificities) can point to the same OR. This means one relaxes the assumption that all the studies are pointing to the same sensitivity and specificity, and accepts that different studies are reporting "truly different" sensitivity and specificity, and that the between-study variation in them is not due to random noise alone, but to differences in the choice of decision threshold (the cutoff value used to dichotomize the results). Therefore the major advantage of the OR, and its corresponding receiver-operating-characteristic (ROC) curve, is that it provides measures of diagnostic accuracy unconfounded by decision criteria [2]. An additional problem when pooling sensitivities and specificities separately is that it usually underestimates the test performance [3, p.670].
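The claim that widely different sensitivity and specificity pairs can share one OR can be checked numerically. The following is a minimal sketch, not part of the original analysis: the OR value of 16 and the three false positive rates are hypothetical, and the helper names are introduced here for illustration only.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def diagnostic_or(sensitivity, specificity):
    """Diagnostic odds ratio: odds(TPR) divided by odds(FPR)."""
    return odds(sensitivity) / odds(1.0 - specificity)


def tpr_on_homogeneous_roc(or_value, fpr):
    """TPR at a given FPR on the ROC curve implied by one constant OR."""
    return or_value * fpr / (1.0 - fpr + or_value * fpr)


COMMON_OR = 16.0  # hypothetical value, for illustration only

# Three hypothetical decision thresholds, expressed through their FPR.
operating_points = [
    (fpr, tpr_on_homogeneous_roc(COMMON_OR, fpr)) for fpr in (0.05, 0.20, 0.50)
]

for fpr, tpr in operating_points:
    print(f"specificity={1 - fpr:.2f}  sensitivity={tpr:.3f}  "
          f"OR={diagnostic_or(tpr, 1 - fpr):.1f}")
```

Each printed row has a different sensitivity and specificity, yet the same OR, which is the sense in which the OR is unconfounded by the decision threshold.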
The above process may be used once more to relax the assumption that every study is pointing to the same OR, thus relaxing the "OR-homogeneity" assumption. In other words, in some cases, the remaining variation between studies, after utilizing OR as the summary performance measure, is still too much to be attributed to random noise. This suggests OR may vary from study to study. Therefore one explicitly assumes different studies are measuring different ORs, and that they are not pointing to the same OR. This difference in test performance across studies may be due to differences in study design, patient population, case difficulty, type of equipment, abilities of raters, and dependence of OR on the threshold chosen [4]. Nelson [5] explains how to generate ROC curves that allow for the possibility of "inconstant discrimination accuracy", a heterogeneous ROC curve (HetROC). This means the ROC curve represents different ORs at different points. This contrasts with the homogeneous ROC, which is completely characterized by one single OR.
There are a few implementations of the heterogeneous ROC. One may classify them into two groups. The first group is exemplified by Tosteson and Begg [6]. They show how to use ordinal regression with two equations that correspond to location and scale. The latent scale binary logistic regression of Rutter and Gatsonis [4] belongs to this group. The second group contains the implementations of Kardaun and Kardaun [7], and Moses et al [8]. Moses et al explain a method to plot such a heterogeneous ROC curve under some parametric assumptions, and they call it summary ROC (SROC).
When comparing two (or more) diagnostic tests, where each study reports results on more than one test, the performance statistics (in the study results) are correlated. Then standard errors computed by SROC are invalid. Toledano and Gatsonis [9] use the ordinal regression model, and account for the dependency of measurements by generalized estimating equations (GEE). However, to fit the model they suggest using FORTRAN code.
We propose a regression model that accommodates more general heterogeneous ROC curves than SROC. The model accommodates complex missing patterns, and accounts for correlated results [10]. Furthermore, we show how to implement the model using widely available statistical software packages. The model relaxes the OR-homogeneity assumption. In the model, when comparing two (or more) tests, each test has its own trend of ORs across studies, while the trends of two tests are (assumed to be) proportional to each other, the "proportional odds ratio" assumption. We avoid the dilemma of choosing weighting schemes that do not bias the estimates [11, p.123] by fitting the POR model to 2-by-2 tables. The model assumes a binomial distribution, which is more realistic than the Gaussian used by some implementations of HetROC. Also, it is fairly easy to fit the model to (original) patient-level data (if available).
Besides accounting better for between-study variation, we show how to use the POR model to "explain why" such variation exists. This potentially gives valuable insights and may have direct clinical applications. It may help define when, where, how, and on what patient population to use which test, to optimize performance.
We show how to use the "deviation" contrast, in the parameterization of categorical variables, to relax the restriction that a summary measure may be reported only if the respective interaction terms in the model are insignificant. This is similar to using the grand mean in a "factor effects" ANOVA model (compared to a "cell means" ANOVA model).
We show how to use nonparametric smoothers, instead of parametric functions of true positive rate (TPR) and/or false positive rate (FPR), to generate heterogeneous ROC for a single diagnostic test across several studies. Our proposed POR model assumes the shape of the heterogeneous ROC curve is the same from one test to the other, but they differ in their locations in the ROC space. This assumption facilitates the comparison of the tests. However, one may want to relax the POR assumption, where each test is allowed to have a heterogeneous ROC curve with a different shape. One may implement such generalized comparison of the competing diagnostic tests by a mixed effects model. This may improve generalizability of meta-analysis results to all (unobserved) studies. Also, a mixed effects model may take care of remaining between-study variation better.

Average difference in performances
To compare two diagnostic tests i and j, we want to estimate the difference in their performance. However, in reality such a difference may vary from one paper (study) to the other. Therefore $\Delta_{i,j,p} = \mathrm{PERF}_{i,p} - \mathrm{PERF}_{j,p}$, where the difference $\Delta$ depends on the paper index p, and $\mathrm{PERF}_{i,p}$ is the observed performance of test i in paper p. To simplify notation, assume that a single number measures the performance of each test in each paper. We relax this assumption later, allowing for the distinction between the two types of mistakes (FNR and FPR, or equivalently TPR and FPR). We decompose the differences as

$$\Delta_{i,j,p} = \delta_{i,j} + \delta_{i,j,p} \qquad (1)$$

where $\delta_{i,j}$ is the 'average' difference between the two tests, and $\delta_{i,j,p}$ is the deviation of the observed difference within paper p from the average $\delta_{i,j}$. The $\delta_{i,j}$ is an estimator for the difference between the performance of the two tests. Note that by using the deviation parameterization (similar to an ANOVA model) [12, pp.51 & 45] we explicitly accept and account for the fact that the observed difference varies from one paper to the other, while estimating the 'average' difference. This is similar to a random-effects approach where a random distribution is assumed for the $\Delta_{i,j,p}$ and then the mean parameter of the distribution is estimated. In other words, one does not need to assume a 'homogeneous' difference of the two tests across all the papers, and then estimate the 'common' difference [13].
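The decomposition of an observed difference into an average part and a paper-specific deviation can be illustrated with a few numbers. This is a minimal sketch under stated assumptions: the performance values are hypothetical, and the 'average' is taken as a simple unweighted mean, which is a simplification of what a fitted model would estimate.

```python
# Hypothetical performance values (for example, LORs) of tests i and j in three papers.
perf_i = {"paper1": 3.1, "paper2": 2.4, "paper3": 4.0}
perf_j = {"paper1": 2.5, "paper2": 2.0, "paper3": 3.0}

papers = sorted(perf_i)

# Observed difference within each paper.
observed_diff = {p: perf_i[p] - perf_j[p] for p in papers}

# 'Average' difference between the two tests (unweighted mean, an assumption).
average_diff = sum(observed_diff.values()) / len(papers)

# Deviation of each paper's difference from the average; these sum to zero.
deviation = {p: observed_diff[p] - average_diff for p in papers}

for p in papers:
    print(f"{p}: observed={observed_diff[p]:.3f}  "
          f"average={average_diff:.3f}  deviation={deviation[p]:+.3f}")
```

The deviations sum to zero by construction, so the average remains interpretable even though the paper-specific differences are not homogeneous.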
The observed test performance, PERF, may be measured in several different scales, such as the paired measures sensitivity and specificity, positive and negative predictive values, likelihood ratios, post-test odds, and post-test probabilities for normal and abnormal test results; as well as single measures such as accuracy, risk or rate ratio or difference, Youden's index, area under the ROC curve, and odds ratio (OR). When using OR as the performance measure, the marginal logistic regression model

$$\begin{aligned}
\mathrm{logit}(\mathrm{Result}_{pt}) = {} & \beta_0 + \beta_1\,\mathrm{Disease}_{pt} + \beta_2\,\mathrm{PaperID}_{pt} + \beta_3\,\mathrm{Disease}_{pt}\,\mathrm{PaperID}_{pt} \\
& + \beta_4\,\mathrm{TestID}_{pt} + \beta_5\,\mathrm{Disease}_{pt}\,\mathrm{TestID}_{pt} + \beta_6\,\mathrm{TestID}_{pt}\,\mathrm{PaperID}_{pt} \\
& + \beta_7\,\mathrm{Disease}_{pt}\,\mathrm{TestID}_{pt}\,\mathrm{PaperID}_{pt}
\end{aligned} \qquad (2)$$

implements the decomposition of the performance. Model (2) is fitted to the (repeated measures) grouped binary data, where the 2-by-2 tables of gold-standard versus test results are extracted from each published paper. In model (2), Result is an integer-valued variable for positive test result (depending on software choice, for grouped binary data, Result is usually replaced by the number of positive test results over the total sample size, for each group); Disease is an indicator for actual presence of disease, ascertained by the gold standard; PaperID is a categorical variable for papers included in the meta-analysis; and TestID is a categorical variable for tests included.
Regression coefficients β 2 to β 7 can be vector valued, meaning having several components, so the corresponding categorical variables should be represented by suitable number of indicator variables in the model. Indexes p and t signify paper p and test t. They define the repeated measures structure of the data [10]. Note model (2) fits the general case where there are two or more tests available for the disease, where each test has been studied in one or more papers. Some of the papers may have studied more than one test; hence the results are not independent. Also the collection of tests studied may change from one paper to the other, hence incomplete matched groups.
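The grouped binary data structure described above can be sketched as follows: each 2-by-2 table yields one row for the diseased group and one for the non-diseased group, each carrying the number of positive results over the group total. The counts, the paper and test labels, and the names `TwoByTwo` and `grouped_rows` are hypothetical and introduced only for illustration.

```python
from typing import NamedTuple


class TwoByTwo(NamedTuple):
    paper: str
    test: str
    tp: int  # diseased, test positive
    fn: int  # diseased, test negative
    fp: int  # non-diseased, test positive
    tn: int  # non-diseased, test negative


def grouped_rows(table):
    """Return one events/trials row per disease group for a single 2-by-2 table."""
    return [
        {"paper": table.paper, "test": table.test, "disease": 1,
         "positives": table.tp, "total": table.tp + table.fn},
        {"paper": table.paper, "test": table.test, "disease": 0,
         "positives": table.fp, "total": table.fp + table.tn},
    ]


# Hypothetical counts; paper "P1" studied two tests, paper "P2" only one,
# which gives the incomplete matched groups described in the text.
tables = [
    TwoByTwo("P1", "TestA", tp=45, fn=5, fp=10, tn=90),
    TwoByTwo("P1", "TestB", tp=40, fn=10, fp=8, tn=92),
    TwoByTwo("P2", "TestA", tp=30, fn=10, fp=6, tn=54),
]

dataset = [row for table in tables for row in grouped_rows(table)]

for row in dataset:
    print(row)
```

Rows sharing the same paper label form one cluster of repeated measures, which is the grouping a marginal (GEE) fit would need to be told about.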
If there is an obvious and generally accepted diagnostic test that can serve as a reference category (RefCat) to which other tests can be compared, then a "simple" parameterization for tests is sufficient. However, this is usually not the case. When there is no perceived referent test to which the other tests are to be compared, a "deviation from means" coding is preferred for the tests. Using the deviation parameterization for both TestID and PaperID in model (2), one can show that $\beta_5\,\mathrm{TestID}_{pt}$ is the average deviation of the LOR of test t from the overall LOR (the $\beta_1$), where the overall LOR is the average over all tests and all papers. Therefore $\beta_5\,\mathrm{TestID}_{pt}$ of model (2) will be equivalent to the $\delta_{i,j}$ of the decomposition model (1), and $\beta_7\,\mathrm{TestID}_{pt}\,\mathrm{PaperID}_{pt}$ equivalent to $\delta_{i,j,p}$.
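A "deviation from means" (sum-to-zero) coding can be sketched in a few lines. This is an illustrative assumption about how such coding is commonly constructed, not the authors' code: a factor with k levels gets k-1 columns, and the last level is coded -1 in every column, so that the implied level effects sum to zero. The level names and coefficient values are hypothetical.

```python
def deviation_coding(levels):
    """Map each level to k-1 sum-to-zero columns; the last level is coded -1 throughout."""
    k = len(levels)
    coding = {}
    for idx, level in enumerate(levels):
        if idx < k - 1:
            coding[level] = [1 if col == idx else 0 for col in range(k - 1)]
        else:
            coding[level] = [-1] * (k - 1)
    return coding


levels = ["TestA", "TestB", "TestC"]  # hypothetical test names
coding = deviation_coding(levels)

# Hypothetical coefficients for the k-1 columns (deviations from the overall mean).
coefficients = [0.5, -0.2]

# Implied deviation of each level from the overall mean.
level_effect = {
    level: sum(c * x for c, x in zip(coefficients, columns))
    for level, columns in coding.items()
}

for level in levels:
    print(level, coding[level], f"{level_effect[level]:+.2f}")
```

Because the level effects sum to zero, the intercept-like term keeps its interpretation as an overall average rather than the value for an arbitrary reference test.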

Proportional odds ratio model
Model (2) expands each study to its original sample size, and uses patients as primary analysis units. Compared to a random-effects model where papers are the primary analysis units, it has more degrees of freedom. However, in a real case, not every test is studied in every paper; rather, the majority of tests are not studied in each paper. Therefore the data structure of tests-by-papers is incomplete, with many unmeasured cells, and the three-way interaction model (2) may become over-parameterized. One may want to drop the three-way interaction term $\beta_7\,\mathrm{Disease}_{pt}\,\mathrm{TestID}_{pt}\,\mathrm{PaperID}_{pt}$. Then, in the reduced model, the LOR takes the form

$$\mathrm{LOR}_{pt} = \beta_1 + \beta_3\,\mathrm{PaperID}_{pt} + \beta_5\,\mathrm{TestID}_{pt}$$

where the paper and test effects are completely separate. We call this reduced model the Proportional Odds Ratio (POR) model, where the ratio of the odds ratios of two tests is assumed to be constant across papers, while the odds ratio of each test is allowed to vary across the papers. Note the difference from the proportional odds model, where the ratio of odds is assumed to be constant [14]. In the POR model, t is an index for the k diagnostic tests, p is an index representing the m papers included in the analysis, and $\mathrm{OR}_p$ is a function capturing the way OR changes across papers. When comparing two diagnostic tests i and j, the ratio of the two ORs depends only on the difference between the effect estimates of the two tests, and is independent of the underlying $\mathrm{OR}_p$ across the papers. Thus the model makes no assumptions about the shape of $\mathrm{OR}_p$ (and in particular about homogeneity of ORs) but merely specifies a relationship between the ORs of the two tests.
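The proportionality can be made explicit with a short derivation. This is a sketch under an assumed notation: $\beta_{3,p}$ and $\beta_{5,t}$ denote the components of the vector-valued coefficients $\beta_3$ and $\beta_5$ that belong to paper p and test t, so that the LOR of test t in paper p is $\beta_1 + \beta_{3,p} + \beta_{5,t}$ once the three-way interaction is dropped.

```latex
\begin{align*}
\mathrm{LOR}_{t,p} &= \beta_1 + \beta_{3,p} + \beta_{5,t}, \\
\mathrm{OR}_{t,p}  &= \underbrace{\exp(\beta_1 + \beta_{3,p})}_{\mathrm{OR}_p}\;\exp(\beta_{5,t}), \\
\frac{\mathrm{OR}_{i,p}}{\mathrm{OR}_{j,p}}
                   &= \frac{\mathrm{OR}_p\,\exp(\beta_{5,i})}{\mathrm{OR}_p\,\exp(\beta_{5,j})}
                    = \exp(\beta_{5,i} - \beta_{5,j}).
\end{align*}
```

The paper-specific factor $\mathrm{OR}_p$ cancels in the ratio, which is why no assumption about its shape is needed to compare tests i and j.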
One may want to replace the PaperID variable with a smooth function of FPR or TPR, such as natural restricted cubic splines. There are two potential advantages. First, this may preserve some degrees of freedom, which one can spend on adding covariates to the model to measure their potential effects on the performance of the diagnostic tests. Thus one would be able to explain why performance of the same test varies across papers. Second, this allows plotting a ROC curve where the OR is not constant across the curve, a flexible ROC (HetROC) curve.
To test the POR assumption one may use model (2), where the three-way interaction of Disease and TestID with PaperID is included. However, in the majority of real datasets this would mean an over-parameterized model. Graphics can be used for a qualitative check of the POR assumption. For instance, the y-axis can be LOR, while the x-axis is paper number. To produce such a plot, it may be better to have the papers ordered in some sense. One choice is to compute an unweighted average of the (observed) ORs of all the tests the paper studied, and use it as the OR of that paper. Then sort the papers based on such ORs. The OR of a test may vary from one paper to the other (with no restriction), but the POR assumption is that the ratio of ORs of two tests remains the same from one paper to another. If one shows the ORs of a test across papers by a smooth curve, then one expects that the two curves of the two tests are proportional to each other. In the log-OR scale, this means the vertical distance between the two curves remains the same across the x-axis. To compute the observed LOR for a test in a paper one may need to add some value (like 1/2) to the cell counts, if some cell counts are zero. However, this could introduce some bias into the estimates.
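The observed LOR with a zero-cell correction, and the vertical distance between two tests across papers, can be computed as in the following minimal sketch. The cell counts are hypothetical, and adding 1/2 to all four cells only when a zero is present is one possible convention, assumed here for illustration.

```python
import math


def observed_lor(tp, fp, fn, tn, correction=0.5):
    """Observed log odds ratio; `correction` is added to every cell only if a cell is zero."""
    if min(tp, fp, fn, tn) == 0:
        tp, fp, fn, tn = (cell + correction for cell in (tp, fp, fn, tn))
    return math.log((tp * tn) / (fp * fn))


# Hypothetical (tp, fp, fn, tn) counts for two tests in three papers.
counts = {
    "paper1": {"TestA": (45, 5, 5, 45), "TestB": (40, 10, 10, 40)},
    "paper2": {"TestA": (30, 2, 10, 38), "TestB": (25, 5, 15, 35)},
    "paper3": {"TestA": (20, 0, 5, 25), "TestB": (18, 2, 7, 23)},
}

lor_gap = {}
for paper, by_test in counts.items():
    lor_a = observed_lor(*by_test["TestA"])
    lor_b = observed_lor(*by_test["TestB"])
    # Under the POR assumption this gap is roughly constant across papers.
    lor_gap[paper] = lor_a - lor_b

# A crude summary of how far the gaps are from being constant.
spread = max(lor_gap.values()) - min(lor_gap.values())

for paper, gap in lor_gap.items():
    print(f"{paper}: LOR(TestA) - LOR(TestB) = {gap:.3f}")
print(f"spread of the gap across papers = {spread:.3f}")
```

A plot of the two LOR series against sorted papers conveys the same information; the numeric spread is only a rough companion to the graphical check.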
Among the approaches for modeling repeated-measures data, we use generalized estimating equations to estimate the marginal logistic regression [15]. Software is widely available for estimation of parameters of a marginal POR model. These include SAS (genmod procedure), R (function geese), and STATA (command xtgee), with R being freely available open source software [16].
One may use a non-linear mixed effects modeling approach on the cell-count data for estimation of parameters of the POR model. The Paper effect is declared as random, and the interaction of the random effect with Disease is included in the model, as indicated in model (2). However, such mixed effects non-linear models are hard to converge, especially for datasets where there are many papers studying only one or a small number of the included tests (such as the dataset presented as example in this paper).
One may avoid dichotomizing results of the diagnostic test by using the 'likelihood ratio' as the performance measure, and fitting a PP model to such continuous outcome. For a scenario where performance of a single test has been measured multiple times within the same study, for example with different diagnostic calibrations (multiple thresholds), the POR estimated by the GEE incorporates data dependencies. When there is a multi-layer and/or nested clustering of repeated measures, software to fit a mixed-effects POR model may be more available than an equivalent GEE POR.
When POR is implemented by a logistic regression on 2-by-2 tables, it uses a grouped binary data structure. It takes minimal effort to fit the same logistic model to the "ungrouped" binary data, the so-called "individual level" data.
Methods of meta-analysis that allow for different outcomes (and different numbers of outcomes) to be measured per study, such as that of Gleser and Olkin [18], or DuMouchel [19], may be used to implement the POR model. This would avoid conducting parallel meta-analyses, which is usually less efficient.
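The relation between grouped and "individual level" data can be shown by expanding one 2-by-2 table into one record per patient. This is a minimal sketch with hypothetical counts; the function name `expand_table` and the record layout are assumptions made for illustration.

```python
def expand_table(paper, test, tp, fp, fn, tn):
    """Expand the cell counts of a 2-by-2 table into one record per patient."""
    rows = []
    cells = ((1, 1, tp), (1, 0, fn), (0, 1, fp), (0, 0, tn))
    for disease, result, count in cells:
        rows.extend(
            {"paper": paper, "test": test, "disease": disease, "result": result}
            for _ in range(count)
        )
    return rows


# Hypothetical table: 3 true positives, 1 false positive, 2 false negatives, 4 true negatives.
patients = expand_table("P1", "TestA", tp=3, fp=1, fn=2, tn=4)

n_diseased_positive = sum(1 for r in patients if r["disease"] == 1 and r["result"] == 1)
print(len(patients), "records;", n_diseased_positive, "diseased with a positive result")
```

The expanded records carry exactly the same information as the cell counts, so a logistic model fitted to either form targets the same coefficients; the ungrouped form simply leaves room for patient-level covariates when they are available.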

Deep vein thrombosis
To demonstrate how to fit the POR model, we use a recent meta-analysis of diagnostic tests for deep vein thrombosis (DVT) by Heim et al. [20]. In this meta-analysis there are 23 papers and 21 tests, comprising 483 potential performance measurements, while only 66 are actually observed; thus 86% of cells are not measured. We fitted the reduced marginal logistic regression model (3).
[Figure 1: Comparing performance of each diagnostic test to the overall LOR]
A forest plot may be used to present the results of the modeling in a graphical way. This may connect better with a clinically oriented audience. In Figure 1 we have sorted the 21 tests based on their LOR estimate. Note that in order to compute some of the performance measures, one needs to assume a prevalence and a sensitivity or specificity. We assumed a disease prevalence of 40%, and a specificity of 90%, for Table 2, as the tests are mainly used for ruling out DVT.
We suggest graphs to compare tests when using such "prevalence-dependent paired performance measures" [21]. In Figure 2 we have used a pair of measures, 'probability of disease given a normal test result' and 'probability of disease given an abnormal test result', the dashed red curve and the dot-and-dash blue curve respectively.
[Table footnotes: * estimate of deviation from overall LOR; ** p-value for null hypothesis of Deviation = 0; *** p-value for null hypothesis of LOR = 0; † LOR(Result_pt) = β1 + β3*PaperID_pt + β5*TestID_pt]
The way one may read the graph is that, given a particular population with a known prevalence of disease like 40%, we perform the diagnostic test on a person picked randomly from the population. If the test turns normal, the probability the person has disease decreases from the average 40% to about 4% (draw a vertical line from point 0.4 on x-axis to the dashed red curve, then draw a horizontal line from the curve to the y-axis). If the test turns abnormal, the probability the person is diseased increases from 40% to about 57%. The dotted green diagonal line represents a test no better than flipping a coin, an uninformative test. The farther the two curves from the diagonal line, the more informative the test is. In other words, the test performs better.
One can summarize the two curves of a test in a single curve, by computing the vertical distance between the two. The solid black curve in the figure is such "difference" curve. It seems this particular test is performing the best in populations with disease prevalence of around 75%.
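The two post-test probability curves and their difference can be computed from an OR, an assumed specificity, and a grid of prevalences. The following is a minimal sketch, not the code of the additional file: the OR of 20 is hypothetical and is not an estimate from the DVT data, the 90% specificity follows the assumption stated earlier in the text, and the function names are introduced here for illustration.

```python
def sensitivity_from_or(or_value, specificity):
    """Sensitivity implied by an OR at a given specificity."""
    fpr = 1.0 - specificity
    return or_value * fpr / (1.0 - fpr + or_value * fpr)


def post_test_probabilities(prevalence, sensitivity, specificity):
    """Return P(disease | normal result) and P(disease | abnormal result) by Bayes' rule."""
    p_abnormal = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
    p_normal = prevalence * (1 - sensitivity) + (1 - prevalence) * specificity
    after_abnormal = prevalence * sensitivity / p_abnormal
    after_normal = prevalence * (1 - sensitivity) / p_normal
    return after_normal, after_abnormal


ASSUMED_OR = 20.0            # hypothetical value, not an estimate from the paper
ASSUMED_SPECIFICITY = 0.90   # assumption taken from the text

sens = sensitivity_from_or(ASSUMED_OR, ASSUMED_SPECIFICITY)

# One row per prevalence: (prevalence, after normal, after abnormal, difference).
curve = []
for percent in range(1, 100):
    prevalence = percent / 100
    after_normal, after_abnormal = post_test_probabilities(
        prevalence, sens, ASSUMED_SPECIFICITY)
    curve.append((prevalence, after_normal, after_abnormal,
                  after_abnormal - after_normal))

# Prevalence at which the vertical distance between the two curves is largest.
best_prevalence = max(curve, key=lambda row: row[3])[0]
print(f"difference curve peaks near prevalence {best_prevalence:.2f}")
```

For an uninformative test (sensitivity equal to one minus specificity) both post-test probabilities equal the prevalence, which corresponds to the diagonal line in the figure.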
One can use the difference curve to compare several tests, and study the effect of prevalence on the way the tests compare to each other. In Figure 3 two tests from the DVT example, VIDAS and D-Dimer, are compared. From the model estimates we know that both tests perform better than average, and that VIDAS performs better than D-Dimer.
The black solid curve is comparing the two tests. For populations with low disease prevalence (around 17%), the D-Dimer is performing better than VIDAS. However, when the prevalence is higher (around 90%), VIDAS is preferred. Simultaneous confidence bands around the comparison curve would make formal inference possible.

Meta-analysis of a single test: the baseline OR p function
Sometimes one may be interested in constructing the ROC curve for the diagnostic test. A homogeneous ROC curve assumes the performance of the test (as measured by LOR) is the same across the whole range of specificity. However, this assumption may be relaxed in a HetROC. We fitted a simplified version of model (5) for test SimpliRED,

$$\mathrm{logit}(\mathrm{Result}_{pt}) = \beta_0 + \beta_1\,\mathrm{Disease}_{pt} + \beta_2\,S(\mathrm{FPR}_{pt}) + \beta_3\,\mathrm{Disease}_{pt}\,S(\mathrm{FPR}_{pt})$$

where index t is fixed, and then used estimates of the coefficients to plot the corresponding HetROC, Figure 4.
[Figure 2: Post-test probability difference for diagnostic test VIDAS]
[Figure 3: Comparing post-test probability difference for VIDAS - D-Dimer]
The eleven papers that studied test SimpliRED are shown by circles whose area is proportional to the sample size of the study. The black dashed curve is the ROC curve assuming a homogeneous OR. The red solid curve relaxes the assumption, hence a heterogeneous ROC curve. The amount of smoothing of the curve can be controlled by the "degree-of-freedom" (DF) parameter. Here we have used a DF of 2. Code to make such plots is presented in additional file 1.

Model checking
To check the POR assumption, model (2) may be used to test the significance of the three-way interaction term. However, the dataset gathered for the DVT meta-analysis is such that no single paper covers all the tests. Moreover, out of 21, there are 7 tests that have been studied in only one paper. For Figure 5 we chose tests that have been studied in at least 5 of the 23 papers. There are 5 such tests. Note that even for such "popular" tests, out of 10 pairwise comparisons, 3 are based on only one paper (so there is no way to test POR). Four comparisons are based on 4 papers, one is based on 3 papers, and the remaining two comparisons are based on 2 papers.
We sorted the papers, the x-axis, based on the average LOR within each paper. We fitted Lowess smooth lines to the observed LORs of each test separately. Figure 5 shows the smooth curves are relatively parallel. Note the range of LORs of a single test. The LORs vary considerably from one paper to the other. Indeed the homogeneity-of-ORs assumption is violated in four of the five tests.
[Figure 4: Heterogeneous ROC curve for diagnostic test SimpliRED]
[Figure 5: Observed log-odds-ratios of each diagnostic test]
Also, to verify how well the model fits the data, one may use an observed-versus-fitted plot. Plots or lists of standardized residuals may be helpful in finding papers or tests that are not fitted well. This may provide a starting point for further investigation.
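How many papers support each pairwise comparison can be tabulated directly from the paper-by-test layout. The following is a minimal sketch on a hypothetical layout (four papers, four tests labelled A to D), not the DVT dataset.

```python
from collections import Counter
from itertools import combinations

# Hypothetical layout: which tests each paper studied.
tests_by_paper = {
    "paper1": {"A", "B"},
    "paper2": {"A", "C"},
    "paper3": {"A", "B", "C"},
    "paper4": {"D"},
}

# Number of papers in which both tests of a pair were studied together.
shared = Counter()
for tests in tests_by_paper.values():
    for pair in combinations(sorted(tests), 2):
        shared[pair] += 1

all_tests = sorted(set().union(*tests_by_paper.values()))

# Pairs never studied in the same paper: their comparison is purely indirect.
pairs_without_overlap = [
    pair for pair in combinations(all_tests, 2) if shared[pair] == 0
]

for pair in combinations(all_tests, 2):
    print(pair, shared[pair])
```

Pairs supported by a single paper give no within-data check of proportionality, and pairs with no shared paper rely entirely on the model to be compared.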

Discussion
A comparison of the relative accuracy of several diagnostic tests should ideally be based on applying all the tests to each of the patients, or randomly assigning tests to patients, in each primary study. Obtaining diagnostic accuracy information for different tests from different primary studies is a weak design [3]. Comparison of the accuracy of two or more tests within each primary study is more valid than comparison of the accuracy of two or more tests between primary studies [22]. Although a head-to-head comparison of diagnostic tests provides more valid results, there are real-world practical questions for which meta-analysis provides an answer that is more timely and efficient than a single big study [23]. Meta-analysis can potentially provide better understanding by examining the variability in estimates, hence the validity versus generalizability (applicability). Also, there may be tests that have never been studied simultaneously in a single study, hence meta-analysis can "reconstruct" such a study of diagnostic tests.

Relaxing the assumption of OR homogeneity
In a meta-analysis of two (or more) diagnostic tests, where attention is mainly on the difference between the performances of two tests, having a homogeneous estimate of the performance of each single test is of secondary importance, and it may be treated as a nuisance. The POR model assumes differences between LORs of two tests are the same across all papers, but does not assume the OR of a test is the same in every paper. Hence there is no need for homogeneity of the OR of a test across the papers that reported it; the assumption is shifted one level higher, to POR.

Common versus average effect size
The POR model uses the "deviation from means" parameterization. Then one does not need to drop the interaction coefficient β3 in the model logit(Result) = β0 + β1*Disease + β2*PaperID + β3*Disease*PaperID to interpret β1, the overall LOR. This means the POR model explicitly accepts that performance of the diagnostic test varies across the papers, but at the same time estimates its mean value.
McClish explains that if a test for OR homogeneity shows heterogeneity, there may be no 'common' measure to report, but there is still an 'average' measure one can report [13].

Advantages of using 2-by-2 tables
We demonstrated how to fit the POR model to the cell counts, rather than to the OR values. This, we believe, has several advantages.
1. One does not need to assume normality of some summary measure. This results in a binomial distributional assumption, which is more realistic.
2. Different study sample sizes are incorporated into the POR model without faulty bias-introducing weighting schemes, as shown by Mosteller & Chalmers [25]. Also, extension of the POR model to individual-level patient data is much easier.
3. The effective sample size for a meta-analysis by a random-effects model is the number of papers included, which is usually quite small. There is a great danger of overfitting, and the number of explanatory variables one could include in the model is very restricted. Since we use the grouped binary data structure, the patients are the effective sample size, hence many more degrees of freedom.
The way the random-effects model is usually implemented is by extracting the OR from each paper, and assuming the LOR is normally distributed. Then the distinction between the two types of mistakes (FNR and FPR, or equivalently TPR and FPR) is lost, since one enters the LOR as datapoints into the model. The bivariate model by Houwelingen et al [26] tries to fix this, by entering two datapoints into the model for each test from each paper. A fourth advantage of fitting the POR model to the cell counts is that the two types of mistakes are included in the model. Consider the logistic regression logit(Result) = β0 + β1*Disease + β2*PaperID. Then we have log(true positive/false negative) = β0 + β1 + β2*PaperID. Substituting a value for the covariate (here PaperID), such as a modal or average value, and using the model estimates for the betas, one gets the log-odds. One then exponentiates it to get TP/FN, call it Q. Now it is easy to verify that sensitivity = Q/(1+Q). Likewise we have log(false positive/true negative) = β0 + β2*PaperID, which we call log(W). Then specificity = 1/(1+W). Also, one can apply separate weights to the log(true positive/false negative) and log(false positive/true negative), to balance the true positive and false positive rates for decision making in a particular clinical practice.
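The conversion from logistic regression coefficients to sensitivity and specificity described above can be sketched as follows. The coefficient values are hypothetical, and `paper_value` stands for whatever value one substitutes for the PaperID covariate; both are assumptions made for illustration.

```python
import math


def sens_spec_from_logit(beta0, beta1, beta2, paper_value):
    """Sensitivity and specificity from logit(Result) = b0 + b1*Disease + b2*PaperID."""
    log_q = beta0 + beta1 + beta2 * paper_value  # log(TP/FN): Disease = 1
    log_w = beta0 + beta2 * paper_value          # log(FP/TN): Disease = 0
    q = math.exp(log_q)
    w = math.exp(log_w)
    sensitivity = q / (1.0 + q)
    specificity = 1.0 / (1.0 + w)
    return sensitivity, specificity


# Hypothetical estimates, for illustration only.
sensitivity, specificity = sens_spec_from_logit(
    beta0=-2.0, beta1=3.5, beta2=0.0, paper_value=1.0)

# The LOR recovered from the two rates equals beta1, the Disease coefficient.
recovered_lor = math.log(
    (sensitivity / (1.0 - sensitivity)) / ((1.0 - specificity) / specificity))

print(f"sensitivity={sensitivity:.3f}  specificity={specificity:.3f}  "
      f"LOR={recovered_lor:.3f}")
```

The round trip back to the LOR is a convenient sanity check that the two rates were derived from the same fitted model.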
When collecting papers from biomedical literature for meta-analysis of a few diagnostic tests, it is hard to come up with a complete square dataset, where every paper has included all the tests of interest. Usually the dataset contains missing values, and a case-wise deletion of papers with missing tests means a lot of data is thrown away. A method of analysis that can utilize incomplete matched groups may be helpful. The POR model allows complex missing patterns in data structure. Convergence of marginal POR model seems much better than non-linear mixed model, when fitted to cell counts of incomplete matched groups. This is an advantage for using GEE to estimate POR.
The fact that one can use popular free or commercial software to fit the proposed models facilitates incorporation of POR modeling into the practice of meta-analysis.

Unwanted heterogeneity versus valuable variability
The POR model utilizes the variation in the observed performance of a test across papers. Explaining when and how the performance of the test changes, and finding the influential factors, is an important step in advancing science. In other words, rather than calling it 'heterogeneity', treated as 'unwanted' and unfortunate, one calls it 'variability' and utilizes the observed variability to estimate and explain when and how to use the agent or the test in order to optimize their effects.
Victor [32] emphasizes that results of a meta-analysis can only be interpreted if existing heterogeneities can be adequately explained by methodological heterogeneities. The POR model estimates effect of potential predictors on between-study variation, hence trying to 'explain' why such variation exists.
The POR model incorporates risk of events in the control group via a predictor, such as observed prevalence, hence a 'control rate regression' [26].

ROC curve
Although implementing the HetROC means that one accepts the diagnostic test performs differently at different FPRs along the ROC curve, in some implementations of HetROC, such as the method of summary ROC, one compares tests by a single point of their respective ROCs. This is not optimal. (The Q test of the SROC method is a single-point test, where that point on the ROC may not be the point for a specific cost-benefit case.) In such a method, although one produces a complete SROC, one does not use it in comparing the diagnostic tests. In the POR model, one uses LOR as the measure of diagnostic discrimination accuracy, and builds the statistical test based on the LOR-ratio, hence the test corresponds to whole ROCs (of general form).
The ROC graph was designed in the context of the theory of signal detectability [27,28]. A ROC can be generated in two ways, by assuming probability distribution functions (PDFs) for the two populations of 'diseased' and 'healthy', or by algebraic formulas [29]. Nelson claims the (algebraic) ROC framework is more general than the signal detection theory (and its PDF-based ROC) [5]. The location-scale regression models implement ROC via PDFs, while the method of summary ROC uses the algebraic approach. The POR model uses a hybrid approach. While POR may be implemented by logistic regression, the smoothing covariate resembles the algebraic method. Unlike location-scale regression models that use two equations, POR uses one equation, hence it is easier to fit with the usual statistical packages. One may use a five-parameter logistic to implement the HetROC. However, the model cannot be linearized, so according to McCullagh [14] it would not have good statistical properties. The POR model relaxes not only the assumption of Var1/Var2 = 1, where Var1 and Var2 are the variances of the two underlying distributions for the two populations, but even monotonicity of the ROC. Hence the model can be used to represent both asymmetric ROCs and non-regular ROCs (singular detection).
In building the HetROC curve, the POR model accommodates more general heterogeneous ROCs than SROC, because it uses a nonparametric smoother instead of the arbitrary parametric functions used in the SROC method. When the smoother covariate in the POR model is replaced by log{TPR*FPR/[(1-TPR)*(1-FPR)]}, a HetROC similar to the SROC of Moses et al is produced.
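The covariate log{TPR*FPR/[(1-TPR)*(1-FPR)]} equals logit(TPR) + logit(FPR), usually written S, while the LOR is D = logit(TPR) - logit(FPR). The following is a minimal sketch of a D-on-S regression of the kind associated with the SROC method, not the POR fit itself: it uses unweighted least squares as a simplifying assumption, the (TPR, FPR) pairs are hypothetical, and the function names are introduced here for illustration.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def fit_sroc(studies):
    """Unweighted least-squares fit of D = a + b*S over (TPR, FPR) pairs."""
    d = [logit(tpr) - logit(fpr) for tpr, fpr in studies]  # log odds ratio
    s = [logit(tpr) + logit(fpr) for tpr, fpr in studies]  # threshold proxy
    mean_d = sum(d) / len(d)
    mean_s = sum(s) / len(s)
    sxx = sum((x - mean_s) ** 2 for x in s)
    sxy = sum((x - mean_s) * (y - mean_d) for x, y in zip(s, d))
    b = sxy / sxx
    a = mean_d - b * mean_s
    return a, b


def sroc_tpr(fpr, a, b):
    """TPR on the fitted curve: logit(TPR) = (a + (1 + b)*logit(FPR)) / (1 - b)."""
    z = (a + (1.0 + b) * logit(fpr)) / (1.0 - b)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical (TPR, FPR) pairs from four studies of one test.
studies = [(0.90, 0.30), (0.80, 0.15), (0.70, 0.08), (0.60, 0.04)]

a, b = fit_sroc(studies)
curve = [(fpr, sroc_tpr(fpr, a, b)) for fpr in (0.05, 0.10, 0.20, 0.40)]

print(f"intercept a={a:.3f}, slope b={b:.3f}")
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.3f}")
```

A slope b of zero reduces the curve to the homogeneous ROC with LOR equal to a; a non-zero slope lets the OR change along the curve, which is the parametric form that a nonparametric smoother generalizes.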
When one uses a smooth function of FPR in the POR model, it is equivalent to using a function of outcome as predictor. This resembles a 'transition model'. Ogilvie and Creelman [30] claim that for estimating parameters of a best fitting curve going through observed points in the ROC space, least squares is not good since both axes are dependent variables and subject to error. They claim maximum likelihood is a preferred method of estimation. Crouchley and Davies [31] warn that, although GEE is fairly robust, it becomes inconsistent if any of the covariates are endogenous, like a previous or related outcome or baseline outcome. They claim a mixed model is better for studying microlevel dynamics. We have observed that the smooth HetROC curve may become decreasing at right end, due to some outlier points. Using less smoothing in the splines may be a solution.
When there is only one diagnostic test, and one is mainly interested in pooling several studies of the same test, the POR model estimates effect sizes that are more generalizable. By using the smoother (instead of PaperID), one fits a sub-saturated model that allows inclusion of other covariates, hence it is possible to estimate the effect of study-level factors on performance and explain the heterogeneity. Also, it does not assume any a priori shape of the ROC, including monotonicity. In addition, it enables graphing of the HetROC. It does not need omission of interaction terms to estimate the overall performance, and it does not need the assumption of OR homogeneity. If several performance measurements of the same test are done in a single study, like evaluating the same test with different diagnostic calibrations, the POR model provides more accurate