No need for a gold-standard test: on the mining of diagnostic test performance indices merely based on the distribution of the test value
BMC Medical Research Methodology volume 23, Article number: 30 (2023)
Diagnostic tests are important in clinical medicine. To determine the test performance indices (test sensitivity, specificity, likelihood ratio, predictive values, etc.), the test results should be compared against a gold-standard test. Herein, a technique is presented through which the aforementioned indices can be computed merely based on the shape of the probability distribution of the test results, given an educated guess about the underlying distributions.
We present the application of the technique to the probability distribution of hepatitis B surface antigen measured in a group of people in Shiraz, southern Iran. We assumed that the distribution had two latent subpopulations: one for those without the disease, and another for those with the disease. We used a nonlinear curve fitting technique to estimate the parameters of these two latent populations, based on which we calculated the performance indices.
The model could explain > 99% of the variance observed. The results were in good agreement with those obtained from other studies.
We concluded that if we have an appropriate educated guess about the distributions of test results in the populations with and without the disease, we may harvest the test performance indices merely based on the probability distribution of the test value, without the need for a gold standard. The method is particularly suitable for conditions where there is no gold standard or the gold standard is not readily available.
Diagnostic tests are important means for the diagnosis of diseases. The reference range of a given marker, the test sensitivity (Se, the probability that a diseased person becomes test-positive) and specificity (Sp, the probability that a disease-free person becomes test-negative) are important test performance characteristics. Positive and negative likelihood ratios (LRs) are other test performance indices used in clinical decision making. Depending on the prior probability (prevalence, if no other information is available) of the disease (pr), the positive (PPV) and negative (NPV) predictive values (the probability that a person with a positive test result has the disease, and the probability that a person with a negative test result does not have it, respectively) are two other important test performance indices that are very useful for clinicians. The area under the receiver operating characteristic (ROC) curve (AUC) and the number needed to misdiagnose (NNM) are other indices [3, 4].
No matter whether the test result is dichotomous (binary results, positive or negative) or continuous (where we need to use a cut-off value [also depending on the pr] to dichotomize the result), measuring all the above-mentioned indices requires comparing our test results against the results of a gold-standard test. For certain disease conditions such as prostate cancer, we have a well-defined pathological definition of the disease and the gold-standard test is thus available. There is, however, no gold-standard test for the diagnosis of some diseases; latent tuberculosis infection is one example. Hypertension is another example: it is in fact not a well-defined disease; we just know that those with higher blood pressure carry a higher risk of mortality and morbidity, and thus we revise the definition of hypertension periodically to minimize the risk incurred. Sometimes, there is a gold standard, but it is invasive and costly or out of reach of many people, for example, pulmonary angiography for the diagnosis of pulmonary emboli. Herein, we would like to present a method that makes it possible to compute the above-mentioned test performance indices merely based on the shape of the test results distribution, without any need for a gold-standard test. We also present the results of applying the method to a dataset of hepatitis B surface antigen (HBs Ag) measured in a representative sample of people residing in Shiraz, southern Iran.
Suppose that we know the probability distribution of a diagnostic marker in a group of disease-free and diseased people in a representative sample of a population (Fig. 1).
Knowing the distribution of the marker in disease-free people (gray curve, Fig. 1), we can easily determine the reference range of the marker, commonly defined as the interval between the 2.5th and 97.5th percentiles of the distribution of the marker (the interval between the vertical solid lines, Fig. 1) in a healthy population.
Let us set a cut-off value of d (the vertical dashed line, Fig. 1). Then, any test value ≥ d is considered a positive test result (T+), and according to the definition, the Se is:
$$\mathrm{Se} = \int_{d}^{\infty} f_2(x)\,dx \tag{1}$$
where f2(x) is the probability density function of the marker distribution in diseased people (Fig. 1). In a similar way, if f1(x) is the probability density function of the marker distribution in the disease-free people (Fig. 1), it can be shown that the Sp of the marker is:
$$\mathrm{Sp} = \int_{-\infty}^{d} f_1(x)\,dx \tag{2}$$
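A minimal sketch, in Python rather than the R used in this study, of how Se and Sp follow from the two densities: each is the area under the corresponding curve on one side of the cut-off. The Gaussian shapes, the parameter values, and the helper names (`integrate`, `sensitivity`, `specificity`) are illustrative assumptions, not values taken from the dataset.

```python
from statistics import NormalDist

# Hypothetical marker distributions (illustrative values, not from the study).
disease_free = NormalDist(mu=1.0, sigma=0.3)  # plays the role of f1
diseased = NormalDist(mu=2.0, sigma=0.4)      # plays the role of f2


def integrate(pdf, lower, upper, steps=20_000):
    """Trapezoidal approximation of the integral of pdf over [lower, upper]."""
    width = (upper - lower) / steps
    total = 0.5 * (pdf(lower) + pdf(upper))
    for i in range(1, steps):
        total += pdf(lower + i * width)
    return total * width


def sensitivity(cutoff):
    """Area under f2 at or above the cut-off (upper limit truncated at mu + 10 SD)."""
    upper = diseased.mean + 10 * diseased.stdev
    return integrate(diseased.pdf, cutoff, upper)


def specificity(cutoff):
    """Area under f1 below the cut-off (lower limit truncated at mu - 10 SD)."""
    lower = disease_free.mean - 10 * disease_free.stdev
    return integrate(disease_free.pdf, lower, cutoff)


d = 1.5  # hypothetical cut-off
print(f"Se = {sensitivity(d):.4f}, Sp = {specificity(d):.4f}")
```

Numerical integration is used here only to mirror the definitions; for Gaussian densities the same areas are available directly from the cumulative distribution function.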
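There is a trade-off between the test Se and Sp. Given the test Se and Sp corresponding to each cut-off value, we can construct an ROC curve, which is a graphical representation of this trade-off. Knowing the probability distributions (f1 and f2, Fig. 1), we can also compute the likelihood ratios (LRs) for a certain value of the marker, say x = r, as follows:
$$\mathrm{LR}(x = r) = \frac{P(x = r \mid D^{+})}{P(x = r \mid D^{-})} = \frac{f_2(r)}{f_1(r)} \tag{3}$$
where D+ and D– represent presence and absence of the disease. We may also calculate the LR for a range of the marker value, say for values between s and r, using the equation:
$$\mathrm{LR}(s \le x < r) = \frac{\int_{s}^{r} f_2(x)\,dx}{\int_{s}^{r} f_1(x)\,dx} = \frac{\mathrm{Se}_s - \mathrm{Se}_r}{\mathrm{Sp}_r - \mathrm{Sp}_s} \tag{4}$$
(the subscripts denote the cut-off value at which Se and Sp are evaluated), and for positive and negative test results, assuming a cut-off value of d:
$$\mathrm{LR}^{+} = \frac{\mathrm{Se}}{1 - \mathrm{Sp}}, \qquad \mathrm{LR}^{-} = \frac{1 - \mathrm{Se}}{\mathrm{Sp}} \tag{5}$$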
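The three kinds of LR described here (for a single value, for an interval, and for a positive or negative result) can be sketched as follows. This is an illustration in Python under assumed Gaussian densities with made-up parameters; the function names are hypothetical and do not come from the study's R code.

```python
from statistics import NormalDist

# Hypothetical marker distributions (illustrative values, not from the study).
disease_free = NormalDist(mu=1.0, sigma=0.3)  # f1
diseased = NormalDist(mu=2.0, sigma=0.4)      # f2


def se(cutoff):
    """Sensitivity: probability that a diseased person has a value >= cutoff."""
    return 1.0 - diseased.cdf(cutoff)


def sp(cutoff):
    """Specificity: probability that a disease-free person has a value < cutoff."""
    return disease_free.cdf(cutoff)


def lr_at_value(r):
    """LR for one marker value: ratio of the two densities at r."""
    return diseased.pdf(r) / disease_free.pdf(r)


def lr_for_range(s, r):
    """LR for s <= x < r, written in terms of Se and Sp at the two bounds."""
    return (se(s) - se(r)) / (sp(r) - sp(s))


def lr_positive(d):
    """LR of a positive result (value >= d)."""
    return se(d) / (1.0 - sp(d))


def lr_negative(d):
    """LR of a negative result (value < d)."""
    return (1.0 - se(d)) / sp(d)


print(lr_at_value(1.5), lr_for_range(1.2, 1.6), lr_positive(1.5), lr_negative(1.5))
```

As the interval shrinks around a point, `lr_for_range` approaches `lr_at_value`, which is one way to see that the single-value LR is the limiting case of the interval LR.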
Using the theory of finite mixture models, we may combine the two above-mentioned distributions of the marker in the disease-free and diseased populations with different weights to construct the distribution of the marker in the general population. For example, if we combine the two distributions with weights of 0.85 and 0.15 (corresponding to a disease pr of 15%), we can compute the probability distribution of the marker in the general population (Fig. 2, the yellow curve) using the following equation:
$$f(x) = 0.85\,f_1(x) + 0.15\,f_2(x) \tag{6}$$
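The weighted combination can be illustrated with a short Python sketch. The two Gaussian components and their parameters are assumptions chosen for illustration; only the weights (0.85 and 0.15) come from the example in the text.

```python
from statistics import NormalDist


def mixture_pdf(x, f1, f2, prevalence):
    """Density in the general population: weighted sum of the two components."""
    return (1.0 - prevalence) * f1.pdf(x) + prevalence * f2.pdf(x)


# Hypothetical component distributions (not the study's estimates).
disease_free = NormalDist(mu=1.0, sigma=0.3)
diseased = NormalDist(mu=2.0, sigma=0.4)
prevalence = 0.15  # weights of 0.85 and 0.15, as in the example in the text

xs = [i * 0.01 for i in range(-200, 601)]  # grid from -2.00 to 6.00
curve = [mixture_pdf(x, disease_free, diseased, prevalence) for x in xs]

# Because the weights sum to 1, the mixture is itself a probability density.
area = sum(curve) * 0.01
print(f"area under the mixture curve ~ {area:.4f}")
```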
Reversing the process
Suppose that we have the distribution of a diagnostic marker in the general population (i.e., the yellow curve, Fig. 2). If we have a biologically plausible educated guess about the number and shape of the latent subpopulations (in our example, two components of disease-free (gray curve, Fig. 2) and diseased (red curve, Fig. 2) subpopulations), we may find the subpopulations. If we succeed, we can then compute all the test performance indices, as described above. Let us examine the method through its application to the distribution of HBs Ag in a representative sample of people residing in Shiraz, southern Iran.
Source of data
We analyzed the HBs Ag values taken from the database of a general clinical lab in Shiraz, southern Iran. The lab performs an average of 9000 tests each day on samples taken from about 850 people referred to the lab in different health states coming from various parts of Fars province. Data were those measured in samples received between March 2019 and March 2021 using electrochemiluminescence immunoassay (Elecsys HBsAg II, cobas® e 411 analyzer, Roche Diagnostics, Switzerland). The measured HBs Ag was reported as cut-off index value, equal to test signal/cut-off.
R software version 4.2.0 (R Project for Statistical Computing) was used for data analysis. To eliminate outliers, we only included the samples having HBs Ag values between 0.05 and 1.2. Using the default values of the R density function, the probability density curve for the HBs Ag values was constructed. By default, the function uses a Gaussian kernel, 512 equally spaced evaluation points, and a bandwidth calculated according to Silverman's rule.
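For readers who want to see what such a density estimate involves, here is a minimal Python sketch of a Gaussian kernel density estimate with a Silverman-type rule-of-thumb bandwidth, evaluated on 512 equally spaced points. It is an approximation of the idea rather than a port of R's `density`: R's implementation uses a faster binned computation, so values may differ slightly. The extension of the grid by three bandwidths beyond the data range (`cut=3.0`) mirrors R's default, and the simulated values are hypothetical.

```python
import math
import random
from statistics import quantiles, stdev


def silverman_bandwidth(values):
    """Rule-of-thumb bandwidth: 0.9 * min(SD, IQR / 1.34) * n ** (-1/5)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = min(stdev(values), (q3 - q1) / 1.34)
    return 0.9 * spread * len(values) ** -0.2


def gaussian_kde(values, grid_size=512, cut=3.0):
    """Evaluate a Gaussian kernel density estimate on an equally spaced grid."""
    bw = silverman_bandwidth(values)
    lo = min(values) - cut * bw
    hi = max(values) + cut * bw
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    norm = 1.0 / (len(values) * bw * math.sqrt(2.0 * math.pi))
    density = [
        norm * sum(math.exp(-0.5 * ((x - v) / bw) ** 2) for v in values)
        for x in grid
    ]
    return grid, density


# Simulated marker values (hypothetical; not the study data).
random.seed(1)
values = [random.gauss(0.38, 0.10) for _ in range(425)]
values += [random.gauss(0.72, 0.05) for _ in range(75)]

grid, density = gaussian_kde(values)
print(f"bandwidth = {silverman_bandwidth(values):.4f}, grid points = {len(grid)}")
```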
Examination of the probability distribution of HBs Ag obtained from our dataset (green curve, Fig. 3) suggested that there were two latent subpopulations: one for those without the disease, and another for patients with the disease. Visual examination of the distribution of HBs Ag implied that it might be a mixture of at least two normally distributed latent subpopulations. We used the fviz_nbclust function from the factoextra R package and the clara function from the cluster package to determine the optimal number of latent subpopulations (eFig 1, Supplementary Materials), which confirmed the presumed number of two subpopulations. The functions also provided the first estimates for initializing the curve fitting function. We thus assumed a Gaussian mixture model with two components with the following parametric equation:
$$f(x) = (1 - \mathit{pr})\,\varphi(x;\, \mu_1, \sigma_1) + \mathit{pr}\,\varphi(x;\, \mu_2, \sigma_2) \tag{7}$$
where µ1, σ1, µ2, and σ2 represent the mean and the standard deviation (SD) of HBs Ag in the disease-free and diseased people, respectively; pr represents the prior probability (prevalence, if no other information is available) of the disease; and φ represents the probability density function of the Gaussian distribution.
A nonlinear curve fitting function (nlsLM from the minpack.lm package for R) was used to compute the optimal values of the parameters of a binormal equation (Eq. 7) that best fit the probability distribution. The function is based on the Levenberg-Marquardt nonlinear least-squares algorithm. Constraints were imposed on the parameters σ1 and σ2 in Eq. 7: they could only assume non-negative values; pr, the prior probability (or the prevalence) of the disease, could only assume values between 0 and 1, inclusive.
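The quantity that a least-squares fit of this kind minimizes can be sketched as follows. This Python illustration covers only the objective (the residual sum of squares between a density curve and the two-component Gaussian mixture), the parameter constraints, and the resulting r²; the Levenberg-Marquardt optimizer itself is not reimplemented. All function names and the synthetic curve are assumptions made for illustration, and the SDs are required to be strictly positive here so that the densities are defined.

```python
from statistics import NormalDist


def binormal_density(x, mu1, sigma1, mu2, sigma2, pr):
    """Two-component Gaussian mixture with weight pr on the diseased component."""
    return ((1.0 - pr) * NormalDist(mu1, sigma1).pdf(x)
            + pr * NormalDist(mu2, sigma2).pdf(x))


def within_constraints(mu1, sigma1, mu2, sigma2, pr):
    """Constraints on the parameters: positive SDs and 0 <= pr <= 1."""
    return sigma1 > 0 and sigma2 > 0 and 0.0 <= pr <= 1.0


def residual_sum_of_squares(params, xs, observed):
    """Objective minimized by the fit: squared gaps between curve and model."""
    return sum((y - binormal_density(x, *params)) ** 2
               for x, y in zip(xs, observed))


def r_squared(params, xs, observed):
    """Share of the variance of the observed curve explained by the model."""
    mean_y = sum(observed) / len(observed)
    total = sum((y - mean_y) ** 2 for y in observed)
    return 1.0 - residual_sum_of_squares(params, xs, observed) / total


# Synthetic, noise-free "observed" curve generated from known parameters.
true_params = (0.38, 0.10, 0.72, 0.05, 0.15)
xs = [0.05 + i * 0.0025 for i in range(461)]  # 0.05 .. 1.20
observed = [binormal_density(x, *true_params) for x in xs]

initial_guess = (0.35, 0.12, 0.75, 0.06, 0.20)
print(r_squared(initial_guess, xs, observed), r_squared(true_params, xs, observed))
```

An optimizer would start from something like `initial_guess` and move the parameters, within the constraints, toward values that drive the residual sum of squares down.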
Having the distributions' parameters, we can then calculate all the test performance indices: the reference range, and the test Se, Sp, and LRs. Assuming a binormal distribution (Eq. 7), Eqs. 1 and 2 become:
$$\mathrm{Se} = 1 - \Phi\!\left(\frac{d - \mu_2}{\sigma_2}\right) \tag{8}$$
$$\mathrm{Sp} = \Phi\!\left(\frac{d - \mu_1}{\sigma_1}\right) \tag{9}$$
where Φ represents the cumulative distribution function of the standard normal distribution. We can construct the ROC curve and calculate the AUC. The prior probability of the disease (pr) can be directly derived from the fitting procedure. Given the pr, we may also calculate the PPV and NPV.
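The chain from fitted parameters to Se, Sp, the ROC curve, the AUC, and the predictive values can be sketched in Python as below. The parameter values are the rounded estimates reported in the Results, used here only for illustration; because of the rounding, the output is not expected to match the published figures exactly. The function names are hypothetical, and the closed-form binormal AUC is included as a cross-check, not as the method used in the study.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # STD_NORMAL.cdf plays the role of Phi


def se(d, mu2, sigma2):
    """Sensitivity at cut-off d under a Gaussian diseased component."""
    return 1.0 - STD_NORMAL.cdf((d - mu2) / sigma2)


def sp(d, mu1, sigma1):
    """Specificity at cut-off d under a Gaussian disease-free component."""
    return STD_NORMAL.cdf((d - mu1) / sigma1)


def roc_points(mu1, sigma1, mu2, sigma2, n=2000):
    """(1 - Sp, Se) pairs for cut-offs swept from high to low."""
    lo = min(mu1 - 6 * sigma1, mu2 - 6 * sigma2)
    hi = max(mu1 + 6 * sigma1, mu2 + 6 * sigma2)
    cutoffs = [hi - i * (hi - lo) / n for i in range(n + 1)]
    return [(1.0 - sp(d, mu1, sigma1), se(d, mu2, sigma2)) for d in cutoffs]


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def predictive_values(se_, sp_, pr):
    """PPV and NPV from Se, Sp, and the prior probability (Bayes' theorem)."""
    ppv = se_ * pr / (se_ * pr + (1.0 - sp_) * (1.0 - pr))
    npv = sp_ * (1.0 - pr) / (sp_ * (1.0 - pr) + (1.0 - se_) * pr)
    return ppv, npv


mu1, sigma1, mu2, sigma2 = 0.38, 0.10, 0.72, 0.05  # rounded estimates
auc = auc_trapezoid(roc_points(mu1, sigma1, mu2, sigma2))
auc_closed_form = STD_NORMAL.cdf((mu2 - mu1) / sqrt(sigma1 ** 2 + sigma2 ** 2))
print(f"AUC (trapezoid) = {auc:.4f}, AUC (closed form) = {auc_closed_form:.4f}")
```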
The studied dataset included 14 222 records. Excluding records with HBs Ag ≤ 0.05 (considered the lower limit of detection of the assay in our lab) or > 1.2 (leading to omission of the highest 1% of the data) left 9698 records for analyses. There were 5777 (59.6%) samples taken from females and 3921 (40.4%) from males. The mean age of study participants was 36 (SD 12) years. The probability distribution of HBs Ag was clearly bimodal (green curve, Fig. 3). The technique identified the two latent Gaussian subpopulations: one with a mean of 0.38 (SD 0.10) for disease-free people (gray curve, Fig. 3), and another with a mean of 0.72 (SD 0.05) for diseased people (red curve, Fig. 3). The reference range for HBs Ag was thus between 0.18 and 0.58 (µ1 ± 1.96 σ1, assuming the Gaussian distribution of the results; the region outlined by the two vertical solid lines, Fig. 3). The cut-off value corresponding to the maximum Youden's J index (Se + Sp – 1) was 0.59 (vertical dashed line, Fig. 3). This cut-off value corresponds to a Se and Sp of 99.1% and 98.2%, respectively (Fig. 4).
The model could explain almost all the observed variance in the HBs Ag distribution (r² = 0.997). The pr derived from the curve fitting on the subset of data (after omitting the outliers) was 11.6%; however, taking all the data into account, the pr corresponding to a cut-off value of 0.59 was 10.1%. The pr corresponding to a cut-off value of 0.9, the value suggested by the manufacturer of the diagnostic kit, was 1.2%. That cut-off corresponds to a Se near zero (many false-negative results) and a Sp of almost 1 (no false-positive results).
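The reference range and the Youden-optimal cut-off can be reproduced approximately from the reported component parameters. The Python sketch below uses the rounded means and SDs given in this section and a simple grid search (a hypothetical choice of method and step size); because the inputs are rounded, the cut-off it finds is close to 0.60 and differs slightly from the published 0.59.

```python
from statistics import NormalDist

# Rounded estimates reported in this section.
disease_free = NormalDist(mu=0.38, sigma=0.10)
diseased = NormalDist(mu=0.72, sigma=0.05)


def youden_j(d):
    """Youden's J index (Se + Sp - 1) at cut-off d."""
    se = 1.0 - diseased.cdf(d)
    sp = disease_free.cdf(d)
    return se + sp - 1.0


# Reference range: mean +/- 1.96 SD of the disease-free component.
reference_range = (
    disease_free.mean - 1.96 * disease_free.stdev,
    disease_free.mean + 1.96 * disease_free.stdev,
)

# Grid search over the analyzed HBs Ag range in steps of 0.001.
cutoffs = [i / 1000 for i in range(50, 1201)]
best_cutoff = max(cutoffs, key=youden_j)

print(f"reference range = {reference_range[0]:.2f} to {reference_range[1]:.2f}")
print(f"cut-off maximizing J = {best_cutoff:.3f}, J = {youden_j(best_cutoff):.3f}")
```

The maximum of J falls where the two component densities are equal, which is why the optimal cut-off does not depend on the prevalence.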
Different types of LRs can be calculated: for a certain HBs Ag value (Fig. 5), for a given range of HBs Ag, and for a positive or negative test result. For example, according to Eq. 3, LR(HBs Ag = 0.7) is:
$$\mathrm{LR}(\text{HBs Ag} = 0.7) = \frac{f_2(0.7)}{f_1(0.7)} \approx 260$$
which means that an HBs Ag value of 0.7 is 260 times more likely to be observed in a diseased person than in a disease-free person (Fig. 5).
To compute the LR for an interval of the test results, say 0.6 ≤ HBs Ag < 0.7, we need to first calculate the Se and Sp corresponding to these values (Eq. 4), which can be done easily using Eqs. 8 and 9. The Se and Sp corresponding to a cut-off value of 0.6 are 98.7% and 98.5%, respectively; the values are 64.6% and 99.9%, respectively, for a cut-off of 0.7. Then:
$$\mathrm{LR}(0.6 \le \text{HBs Ag} < 0.7) = \frac{0.987 - 0.646}{0.999 - 0.985} \approx 24$$
which means that an HBs Ag value between 0.6 and 0.7 is 24 times more likely to be observed in a diseased person than in a disease-free person. Finally, substituting the values for Se and Sp corresponding to a cut-off value of 0.59 in Eq. 5, the positive and negative LRs are approximately 55 and 0.01, respectively.
Assuming a cut-off of 0.59 (which corresponds to a pr of 10.1% in the whole population) gives a PPV of 86% (the probability of presence of the disease if the test is positive), an NPV of almost 100% (the probability of absence of the disease if the test is negative), and an NNM of 58.
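The arithmetic behind these figures can be checked with a few lines of Python, using the Se, Sp, and pr reported for the 0.59 cut-off. The variable names are illustrative; the NNM is taken here as the reciprocal of the expected fraction of misclassified tests, which is the usual definition and is assumed to be the one used in the study.

```python
# Values reported for a cut-off of 0.59.
se, sp, pr = 0.991, 0.982, 0.101

# Expected fractions of all tests that are false-negative or false-positive.
false_negative = pr * (1.0 - se)
false_positive = (1.0 - pr) * (1.0 - sp)

ppv = se * pr / (se * pr + false_positive)
npv = sp * (1.0 - pr) / (sp * (1.0 - pr) + false_negative)
nnm = 1.0 / (false_negative + false_positive)

lr_positive = se / (1.0 - sp)
lr_negative = (1.0 - se) / sp

print(f"PPV = {ppv:.3f}, NPV = {npv:.4f}, NNM = {nnm:.1f}")
print(f"LR+ = {lr_positive:.1f}, LR- = {lr_negative:.4f}")
print(f"false-positive share = {false_positive:.4f}, "
      f"false-negative share = {false_negative:.4f}")
```

In this calculation the false-positive term is the larger of the two contributions to the misclassified fraction.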
Using the presented method, we could harvest the test performance indices for a diagnostic test (in our example, HBs Ag) with acceptable accuracy, merely based on the shape of the frequency distribution of a biomarker, provided that we have an educated guess about the frequency distributions of the test values in those with and without the disease. The cut-off value of 0.59 derived from our model was less than that commonly used in practice. The manufacturer suggests that HBs Ag values < 0.9 be interpreted as non-reactive (negative test result); those > 1.0, positive; and those between 0.9 and 1.0, borderline or equivocal. The pr corresponding to a cut-off value of 0.9, the value suggested by the manufacturer of the diagnostic kit, was 1.2%, consistent with the pr of HBs Ag reported in various seroepidemiologic studies conducted in Shiraz, Fars province [14,15,16]. That cut-off corresponds to a Se near zero (many false-negative results) and a Sp of almost 1 (no false-positive results), and it seems that the cut-off value of 0.59 derived by our model (corresponding to a Se and Sp of 99.1% and 98.2%, respectively) is more reasonable (Fig. 4).
Seroprevalence studies commonly use diagnostic tests that are not perfect: the results may be false-positive or false-negative. Therefore, the pr calculated is just the apparent prevalence, not the true prevalence. The important thing to note is that the pr derived through the method presented in this paper gives the true prevalence, not the apparent prevalence, which is an advantage of the method presented.
The LR corresponding to each HBs Ag value is in fact the slope of the line tangent to the ROC curve at the point corresponding to that HBs Ag value. This value usually cannot be calculated readily in practice because the ROC curve is typically constructed from a finite number of discrete values; the curve is thus not differentiable and the slope of the tangent line cannot be computed [2, 4]. Finding the parameters of the distribution components (f1 and f2) through curve fitting enables us to directly calculate the slope and thus the LR (Fig. 5), which is another advantage of the method proposed.
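The link between the LR at a given value and the slope of the ROC curve can be checked numerically. The Python sketch below compares a finite-difference estimate of the slope with the ratio of the two densities, using the rounded component parameters reported in the Results; the function names and the step size `h` are illustrative choices.

```python
from statistics import NormalDist

# Rounded estimates of the two components reported in the Results.
disease_free = NormalDist(mu=0.38, sigma=0.10)  # f1
diseased = NormalDist(mu=0.72, sigma=0.05)      # f2


def roc_point(d):
    """ROC coordinates (1 - Sp, Se) for cut-off d."""
    return 1.0 - disease_free.cdf(d), 1.0 - diseased.cdf(d)


def roc_slope(d, h=1e-5):
    """Central finite-difference estimate of the ROC slope at cut-off d."""
    x_hi, y_hi = roc_point(d + h)
    x_lo, y_lo = roc_point(d - h)
    return (y_hi - y_lo) / (x_hi - x_lo)


def likelihood_ratio(d):
    """LR for the marker value d: ratio of the two densities."""
    return diseased.pdf(d) / disease_free.pdf(d)


for d in (0.60, 0.65, 0.70):
    print(f"d = {d:.2f}: slope = {roc_slope(d):.3f}, LR = {likelihood_ratio(d):.3f}")
```

The two columns agree because both Se and 1 - Sp change with the cut-off at rates given by the respective densities, so their ratio is f2(d)/f1(d).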
A cut-off value of 0.59, derived by the model, gave a PPV of 86%, an NPV of almost 100%, and an NNM of 58, which means that on average, one out of 58 independent tests performed would be either false-positive or false-negative. Given that the NPV is almost 100%, there would be almost no false-negative results. Therefore, a false result is most likely false-positive.
The method has been applied to distributions of other biomarkers including the prostate-specific antigen and antibody against severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) with very good results [11, 18]. The only difference was that in previous works the variables were transformed to give a better fit result; for HBs Ag, no transformation was necessary.
The method presented relies heavily on the educated guess used in constructing the model. The shapes of the probability distributions of the latent subpopulations (not necessarily the same; they may have two completely different distributions) should be reasonable and biologically plausible. We may estimate the optimum number of latent subpopulations (as we did in our study), but the number ultimately chosen for the model should be biologically plausible too. For example, if we are going to study the distribution of hemoglobin in women, we expect to have three subpopulations: those with low (anemia), normal, and high hemoglobin concentration (polycythemia). Finally, it is important to emphasize that a model is neither correct nor wrong; it may be good or bad. A good model may be, but is not necessarily, correct.
Based on the technique presented, we could compute all test performance indices with clinically acceptable accuracy merely based on the distribution of the test value, without the need for a gold-standard test. This technique could be of particular importance for disease conditions where no clear pathologic definition has been provided (e.g., hypertension). A diagnostic test is technically a binary classifier. The technique presented can have a wide range of applications in many scientific fields.
Availability of data and materials
All data generated or analyzed during this study as well as the R codes are included in this published article and its supplementary information files.
Abbreviations
PPV: Positive predictive value
NPV: Negative predictive value
ROC: Receiver operating characteristic
AUC: Area under the curve
HBs Ag: Hepatitis B surface antigen
SARS-CoV-2: Severe acute respiratory syndrome coronavirus 2
1. Mallett S, Halligan S, Thompson M, Collins GS, Altman DG. Interpreting diagnostic accuracy studies for patient care. BMJ. 2012;345:e3999.
2. Habibzadeh F, Habibzadeh P. The likelihood ratio and its graphical representation. Biochem Med (Zagreb). 2019;29(2):020101.
3. Habibzadeh F, Yadollahie M. Number needed to misdiagnose: a measure of diagnostic test effectiveness. Epidemiology. 2013;24(1):170.
4. Habibzadeh F, Habibzadeh P, Yadollahie M. On determining the most appropriate test cut-off value: the case of tests with continuous results. Biochem Med (Zagreb). 2016;26(3):297–307.
5. Perez-Porcuna TM, Pereira-da-Silva HD, Ascaso C, Malheiro A, Buhrer S, Martinez-Espinosa F, Abellana R. Prevalence and diagnosis of latent tuberculosis infection in Young Children in the absence of a Gold Standard. PLoS One. 2016;11(10):e0164181.
6. Chobanian AV. Guidelines for the management of hypertension. Med Clin North Am. 2017;101(1):219–27.
7. McRae SJ, Ginsberg JS. The diagnostic evaluation of pulmonary embolism. Am Heart Hosp J. 2005;3(1):14–20.
8. Habibzadeh P, Yadollahie M, Habibzadeh F. What is a “Diagnostic Test Reference Range” Good for? Eur Urol. 2017;72(5):859–60.
9. Kyoya S, Yamanishi K. Summarizing Finite Mixture Model with Overlapping Quantification. Entropy (Basel). 2021;23(11):1503. https://doi.org/10.3390/e23111503.
10. Silverman BW. Density estimation and data analysis. London: Chapman & Hall/CRC; 1986.
11. Habibzadeh F, Habibzadeh P, Yadollahie M, Roozbehi H. On the information hidden in a classifier distribution. Sci Rep. 2021;11(1):917.
12. Moré JJ. The Levenberg-Marquardt algorithm: implementation and theory. In: Watson GA, ed. Lecture Notes in Mathematics 630: Numerical Analysis. Berlin: Springer-Verlag; 1978. p. 105–16.
13. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837–45.
14. Motamedifar M, Amini E, Talezadeh Shirazi P, Sarvari J. The prevalence of HBsAg and HBsAb among pregnant women referring to Zeinabiyeh Hospital, Shiraz Iran. Shiraz E-Medical J. 2012;13(4):187–96.
15. Abedi F, Madani H, Asadi A, Nejatizadeh A. Significance of blood-related high-risk behaviors and horizontal transmission of hepatitis B virus in Iran. Arch Virol. 2011;156(4):629–35.
16. Askarian M, Mansour Ghanaie R, Karimi A, Habibzadeh F. Infectious diseases in Iran: a bird’s eye view. Clin Microbiol Infect. 2012;18(11):1081–8.
17. Habibzadeh F, Habibzadeh P, Yadollahie M. The apparent prevalence, the true prevalence. Biochem Med (Zagreb). 2022;32(2):020101.
18. Habibzadeh F, Habibzadeh P, Sajadi MY. Determining the SARS-CoV-2 serological immunoassay test performance indices based on the test results frequency distribution. Biochem Med (Zagreb). 2022;32(2):020705.
No financial support.
Ethics approval and consent to participate
The study was conducted in accordance with the Declaration of Helsinki Code of Ethics. The study protocol was approved by the Petroleum Industry Health Organization Institutional Review Board. All those who attend the lab were informed that the results of their tests might be used anonymously in non-interventional research studies. Informed consent was obtained from the study participants or their legal guardians to use their data for such purposes. The authors did not have access to identifiable data records.
Consent for publication
The authors declare that they have no competing interests.
Additional file 1:
Figure 1. Optimal number of clusters derived from fviz_nbclust. The vertical dashed line corresponds to the optimal number of clusters, here 2.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Habibzadeh, F., Roozbehi, H. No need for a gold-standard test: on the mining of diagnostic test performance indices merely based on the distribution of the test value. BMC Med Res Methodol 23, 30 (2023). https://doi.org/10.1186/s12874-023-01841-8
- Diagnostic test
- Data mining
- Statistical methods
- Classification and taxonomy