Visual inspection with acetic acid as a cervical cancer test: accuracy validated using latent class analysis
© Gaffikin et al. 2007
Received: 02 January 2007
Accepted: 31 July 2007
Published: 31 July 2007
The purpose of this study was to validate the accuracy of an alternative cervical cancer test – visual inspection with acetic acid (VIA) – by addressing possible imperfections in the gold standard through latent class analysis (LCA). The data were originally collected at peri-urban health clinics in Zimbabwe.
Conventional accuracy (sensitivity/specificity) estimates for VIA and two other screening tests, using colposcopy/biopsy as the reference standard, were compared to LCA estimates based on results from all four tests. For the conventional analysis, negative colposcopy was accepted as a negative reference outcome when no biopsy was available. With LCA, local dependencies between tests were handled by adding direct-effect parameters or additional latent classes to the model.
Two models yielded good fit to the data: a 2-class model with two adjustments and a 3-class model with one adjustment. The definition of latent disease associated with the latter was more stringent, backed by three of the four tests. Under that model, sensitivity for VIA (abnormal+) was 0.74 compared to 0.78 with conventional analyses. Specificity was 0.568 compared to 0.639 with conventional analyses. By contrast, the LCA-derived sensitivity for colposcopy/biopsy was 0.63.
VIA sensitivity and specificity under the 3-class LCA model were within the range of published data and relatively consistent with conventional analyses, thus validating the original assessment of test accuracy. LCA probably yielded more plausible estimates of the true accuracy than did conventional analysis with in-country colposcopy/biopsy as the reference standard. Colposcopy with biopsy can be problematic as a study reference standard, and LCA offers the possibility of obtaining estimates adjusted for imperfections in the reference standard.
Cervical cancer, the second most commonly diagnosed cancer among women worldwide, can be prevented. Although the Pap smear remains the most common screening test for cervical cancer, many less developed countries do not have adequate resources to implement cytology-based prevention programs. An alternative, low-cost test, visual inspection with acetic acid (VIA), has emerged for use in low-resource settings, where it can be performed by auxiliary health professionals [1–3]. VIA is similar to colposcopy in that acetic acid is applied and any acetowhite lesion is visualized, although VIA involves no magnification.
VIA accuracy studies have yielded sensitivity and specificity values ranging from approximately 60 percent to over 90 percent [4–14]. While this range is narrower than that observed for other tests, including cytology (23% to 99% for sensitivity and 7% to 97% for specificity), it is important to investigate possible reasons for inter-study variability. Some have questioned whether the variability of results across studies is due, at least in part, to imperfections in the reference standard used. For cervical cancer, the "gold" standard for establishing a diagnosis is biopsy. The VIA studies cited above have used a variety of reference standard measures. These include: 100 percent biopsy sampling, a combined colposcopy/biopsy reference standard for all participants, biopsy for colposcopically suspicious lesions only, and colposcopy with histology only for women test-positive on all screening tests [i.e., visual inspection, Pap, human papillomavirus (HPV) and cervicography] [4–6, 8, 10, 12–14]. Even among studies with similar reference standard measures, another source of variability could be differences in the quality of the reference standard. Subjective (human) error may have affected the quality of colposcopy or the quality of tissue collection, slide fixing and biopsy interpretation, which could have led to misclassification by the reference standard [17, 18].
Most published studies on VIA use conventional methods and a 2 × 2 table to assess test accuracy (i.e., sensitivity and specificity). In recent years, several statistical methods have been used to evaluate new tests when no gold/reference standard, or only an imperfect one, is available [19–21]. Latent class analysis (LCA) is a statistical technique, originally developed in the early 1950s, that allows the accuracy of a new test to be assessed in the absence of a gold standard. It does this by using the statistical associations among various tests performed on the same individual to define unobserved (latent) disease. The likelihood of a model linking latent disease to the new test under investigation and to the other tests is then maximized to yield sensitivity and specificity estimates. Historically, LCA has been used in biomedical applications to identify disease based on observable traits [23–25]. More recently, there has been increased interest in using LCA to evaluate diagnostic or screening tests [26–30].
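The maximum-likelihood step described above can be illustrated with a minimal sketch, assuming dichotomized (0/1) test results, two latent classes and conditional independence of the tests within each class. It maximizes the latent class likelihood with the EM algorithm, one common fitting approach; it is not the Latent Gold implementation used in this study, and the function names, the true parameter values and the synthetic data produced by `simulate` are all hypothetical.

```python
import random
from collections import Counter


def fit_lca_2class(patterns, n_iter=1000, seed=0):
    """EM fit of a two-class latent class model for binary test results.

    Assumes the tests are conditionally independent within each class.
    Returns (prevalence, p_pos): prevalence[c] = P(class c) and
    p_pos[c][j] = P(test j positive | class c).
    """
    counts = Counter(patterns)  # identical response patterns share one term
    n = sum(counts.values())
    k = len(next(iter(counts)))
    rng = random.Random(seed)
    prevalence = [0.5, 0.5]
    # Class 1 starts with higher positivity so it can be read as "disease".
    p_pos = [[rng.uniform(0.05, 0.3) for _ in range(k)],
             [rng.uniform(0.7, 0.95) for _ in range(k)]]
    for _ in range(n_iter):
        # E-step: posterior class probabilities for each observed pattern.
        post = {}
        for pat in counts:
            joint = []
            for c in range(2):
                lik = prevalence[c]
                for j, y in enumerate(pat):
                    lik *= p_pos[c][j] if y else 1.0 - p_pos[c][j]
                joint.append(lik)
            total = sum(joint)
            post[pat] = [v / total for v in joint]
        # M-step: update prevalences and class-conditional positivity rates.
        for c in range(2):
            weight = sum(counts[pat] * post[pat][c] for pat in counts)
            prevalence[c] = weight / n
            for j in range(k):
                positive = sum(counts[pat] * post[pat][c]
                               for pat in counts if pat[j])
                p_pos[c][j] = positive / weight
    return prevalence, p_pos


def simulate(n, prev, sens, spec, seed=1):
    """Synthetic data from hypothetical true parameters (not study data)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        diseased = rng.random() < prev
        data.append(tuple(int(rng.random() < (se if diseased else 1.0 - sp))
                          for se, sp in zip(sens, spec)))
    return data


data = simulate(20000, prev=0.15, sens=[0.80, 0.70, 0.95, 0.75],
                spec=[0.70, 0.90, 0.80, 0.90])
prevalence, p_pos = fit_lca_2class(data)
print("estimated prevalence of class 1:", round(prevalence[1], 3))
print("estimated sensitivities:", [round(p, 3) for p in p_pos[1]])
```

With four tests and two classes the model has 9 free parameters against the 15 degrees of freedom of the 16 response patterns, so it is identifiable; with fewer than three conditionally independent tests it would not be.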
The objectives of this analysis were two-fold: 1) to assess test accuracy using LCA, assuming no gold standard, and to compare those values with conventional estimates to explore the effect of any gold standard imperfections on the latter; 2) to assess whether the assumption of independence between VIA and colposcopy, as a component of the reference standard, was met (a prerequisite for valid test accuracy assessment). The second objective addresses concerns that have been raised about the appropriateness of using colposcopy as a reference standard for VIA accuracy studies, since the two tests are similar in nature: both involve visual observation of the cervix after an acetic acid wash.
The dataset used in this exercise is a subset of data from a previously published, cross-sectional study. All subjects participating in that study gave informed consent and the study was approved by the Institutional Review Board at Johns Hopkins Bayview Medical Center, Baltimore, USA.
The original study involved 2203 women, aged 25 to 55 years, who attended 15 primary-care clinics in two peri-urban areas near Harare, Zimbabwe between October 1996 and August 1997. Details of the original study design, data collection and biological sample collection procedures are available elsewhere. All women enrolled in the study were offered and scheduled to receive four tests: VIA, a Pap smear, colposcopy with biopsy (the latter only when clinically indicated) and an HPV DNA test [Hybrid Capture II (HC2), Digene Corporation, Gaithersburg, USA].
Specially trained nurse-midwives performed VIA using a 4 percent acetic acid solution. A cytological specimen from the same woman was independently assessed by a local cytopathologist (and later reviewed by a board-certified cytopathologist at the Johns Hopkins Bayview Medical Center, Baltimore, USA). In addition, HPV testing was independently performed for all women (at Johns Hopkins University, Bloomberg School of Public Health, Baltimore, USA) using the B probe of HC2, which targets 13 high-risk HPV types. A colposcopic examination was performed by one of two local faculty gynecologists, blinded to all other test results, shortly after VIA testing and after the cytology/HPV samples had been obtained. Almost all (97 percent) women received colposcopy. Biopsy was performed only for cases suspicious on colposcopy (n = 595).
In this study, analyses were performed on the subset of women (n = 2073) for whom all four test results (VIA, Pap, HPV and colposcopy/biopsy, here considered as one combined test) were available. A combined reference standard was developed, incorporating biopsy results when available; negative colposcopy was accepted as a negative outcome when no biopsy was taken. Such a combined reference standard has been considered appropriate for cervical cancer test accuracy studies, given the ethical and other issues involved in performing biopsy on test-negative women. Because colposcopy and/or biopsy results were available to form a combined reference standard for 97 percent of all study participants, conventional estimates of sensitivity/specificity could be calculated with negligible risk of verification bias. All tests were dichotomized for the sensitivity/specificity estimates. Cervical intraepithelial neoplasia grade 2 or worse (CIN2+) on biopsy, or high-grade squamous intraepithelial lesion or worse (HGSIL+) on colposcopy, was used as the cutoff defining disease in all analyses. This threshold was chosen because it is the severity level usually treated in Zimbabwe.
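The combined reference standard and the conventional 2 × 2 estimates with binomial standard errors can be sketched as follows. The sketch assumes each woman's results are stored as 0/1 (or True/False) values and that a missing biopsy is encoded as `None`; the function names and toy counts are hypothetical, and the simple Wald interval p ± 1.96·SE is an assumption made for illustration.

```python
import math


def combined_reference(colpo_hgsil_plus, biopsy_cin2_plus):
    """Reference outcome for one woman under the combined standard.

    biopsy_cin2_plus is None when no biopsy was taken (hypothetical
    encoding); the colposcopy result is then used instead.
    """
    if biopsy_cin2_plus is not None:
        return biopsy_cin2_plus
    return colpo_hgsil_plus


def proportion_with_se(successes, total):
    """Proportion, binomial standard error and Wald 95% interval."""
    p = successes / total
    se = math.sqrt(p * (1.0 - p) / total)
    return p, se, (p - 1.96 * se, p + 1.96 * se)


def sensitivity_specificity(test, reference):
    """Conventional 2 x 2 estimates from paired 0/1 results."""
    pairs = list(zip(test, reference))
    tp = sum(1 for t, r in pairs if t and r)
    fn = sum(1 for t, r in pairs if not t and r)
    tn = sum(1 for t, r in pairs if not t and not r)
    fp = sum(1 for t, r in pairs if t and not r)
    return proportion_with_se(tp, tp + fn), proportion_with_se(tn, tn + fp)


# Toy data (hypothetical): 10 reference-positive and 90 reference-negative women.
reference = [1] * 10 + [0] * 90
test = [1] * 8 + [0] * 2 + [0] * 60 + [1] * 30
sens, spec = sensitivity_specificity(test, reference)
print("sensitivity %.3f (SE %.3f)" % sens[:2])
print("specificity %.3f (SE %.3f)" % spec[:2])
```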
In the present analysis, LCA (as implemented in Latent Gold Version 3.0 software) was applied to four "manifest" variables (VIA, Pap smear, HC2 and colposcopy/biopsy) to construct a "latent" variable that could serve as the reference standard defining true disease. Because LCA estimates the conditional probability of each latent class (e.g., presence or absence of high-grade lesions or worse) for each level of the observed variables, sensitivity and specificity relative to the latent disease variable could be calculated. Maximum-likelihood-based standard errors of the various probability estimates were also calculated to derive confidence intervals. We then compared the LCA-derived sensitivity and specificity estimates to those conventionally derived from 2 × 2 tables with binomial standard errors (for VIA, Pap and HC2 against colposcopy/biopsy as the reference standard).
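The conditional probability of each latent class given an observed pattern of results follows from Bayes' rule once class prevalences and class-conditional response probabilities have been estimated. The sketch below assumes local independence and, for brevity, uses only dichotomized VIA with values quoted in this paper for the 3-class model (class prevalences 0.682, 0.200 and 0.118; P(VIA negative) of 0.429 + 0.276 for class 1 and 0.323 + 0.117 for class 2; P(VIA positive | class 3) = 0.74). It is an illustration only, since the study's model used all four tests, and the function name is hypothetical.

```python
def posterior_class_probs(prevalence, cond_probs, pattern):
    """Bayes' rule: P(class c | observed results) for each latent class.

    prevalence[c]    : P(class c)
    cond_probs[c][j] : dict mapping a level of test j to P(level | class c)
    pattern          : observed level of each test
    Assumes local independence, so the per-test probabilities multiply.
    """
    joint = []
    for pi_c, probs in zip(prevalence, cond_probs):
        lik = pi_c
        for j, level in enumerate(pattern):
            lik *= probs[j][level]
        joint.append(lik)
    total = sum(joint)
    return [v / total for v in joint]


# 3-class values quoted in the text; VIA dichotomized (positive = abnormal or CA).
prevalence = [0.682, 0.200, 0.118]
via_pos = [1 - (0.429 + 0.276), 1 - (0.323 + 0.117), 0.74]
cond = [[{"pos": p, "neg": 1 - p}] for p in via_pos]
post = posterior_class_probs(prevalence, cond, ["pos"])
print([round(p, 3) for p in post])
```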
A fundamental assumption of LCA is that the manifest variables are locally independent. That is, within (or local to) a given latent class, the manifest variables should be statistically unrelated to or independent from each other. Bivariate residuals, which reflect any remaining association among each pairing of manifest variables after estimation of the latent classes, indicate whether this assumption was in fact met.
Vermunt and Magidson suggest that bivariate residuals above 1.0 indicate possible local dependencies, which can be adjusted for in LCA models through the introduction of "direct effects" representing the excess association between two variables. Adding a direct effect increases the log-likelihood (LL) of the model, indicating better fit of the model to the data. The bivariate residual value approximates the increase in LL observed with the addition of the direct effect. The software we used supports adjustment for local dependencies through the introduction of direct effects [34–36].
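A bivariate residual for one pair of tests can be sketched as follows, assuming a Pearson-type definition: the chi-square statistic comparing the observed cross-table of the two tests with the counts expected under the fitted latent class model, divided by the table's degrees of freedom. Latent Gold's exact computation may differ in detail. For brevity the tests are dichotomous (one degree of freedom; a pair of trichotomous tests would have four), and the fitted parameters, observed counts and function names are hypothetical.

```python
def expected_pair_table(n, prevalence, p_pos_a, p_pos_b):
    """Expected 2 x 2 counts for two binary tests under local independence."""
    table = [[0.0, 0.0], [0.0, 0.0]]
    for pi_c, pa, pb in zip(prevalence, p_pos_a, p_pos_b):
        for a in (0, 1):
            for b in (0, 1):
                prob_a = pa if a else 1.0 - pa
                prob_b = pb if b else 1.0 - pb
                table[a][b] += n * pi_c * prob_a * prob_b
    return table


def bivariate_residual(observed, expected, df=1):
    """Pearson chi-square over the pair's cross-table, divided by its df."""
    chi2 = sum((observed[a][b] - expected[a][b]) ** 2 / expected[a][b]
               for a in (0, 1) for b in (0, 1))
    return chi2 / df


# Hypothetical fitted two-class model and observed cross-table (not study data).
expected = expected_pair_table(1000, [0.9, 0.1], [0.1, 0.8], [0.2, 0.9])
observed = [[670, 160], [60, 110]]
bvr = bivariate_residual(observed, expected)
flag = "  (> 1.0: possible local dependence)" if bvr > 1.0 else ""
print("BVR = %.2f%s" % (bvr, flag))
```

Here the observed table has more concordant results than the model predicts, which is the pattern a direct effect between the two tests would absorb.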
Model fit is also indexed by the likelihood ratio chi-square (L²), which decreases as the fit of the model improves. A significant p-value associated with L² indicates that the manifest variables (here, the different tests) have associations with each other that are not accounted for by the model. LCA modeling continues with the addition of latent classes and/or adjustment for local dependencies until adequate fit is reached, as indicated by a non-significant p-value.
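L² and its p-value can be computed as below, assuming the usual likelihood-ratio form L² = 2 Σ n_obs ln(n_obs / n_exp) over all response patterns, with degrees of freedom equal to the number of patterns minus 1 minus the number of free parameters. The chi-square tail probability uses a closed-form recurrence for integer degrees of freedom so that only the Python standard library is needed; the function names and the example L² value are hypothetical. With four trichotomous tests there are 3^4 = 81 response patterns, and the chi-square approximation can be unreliable when many of them are sparse.

```python
import math


def l_squared(observed, expected):
    """Likelihood-ratio chi-square over response patterns; empty cells add 0."""
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)


def chi2_sf(x, df):
    """Upper-tail chi-square probability P(X > x) for integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        k, q = 2, math.exp(-half)             # Q(x, 2) = exp(-x/2)
    else:
        k, q = 1, math.erfc(math.sqrt(half))  # Q(x, 1) = erfc(sqrt(x/2))
    while k < df:
        # Recurrence: Q(x, k + 2) = Q(x, k) + (x/2)^(k/2) exp(-x/2) / Gamma(k/2 + 1)
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def lca_df(n_patterns, n_params):
    """Residual degrees of freedom: patterns - 1 - free parameters."""
    return n_patterns - 1 - n_params


# Hypothetical example: four trichotomous tests give 3**4 = 81 patterns; a
# 2-class model without direct effects has 1 + 2 * 4 * 2 = 17 free parameters.
df = lca_df(3 ** 4, 17)
print("df =", df, "p =", round(chi2_sf(70.0, df), 3))
```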
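Trichotomous coding scheme for the four tests
Pap smear, lowest level: Normal, Inflammation, ASCUS, AGUS
HPV (HC2), level 1: < 1.0 RLU compared to control
HPV (HC2), level 2: >= 1.0 RLU and < 20.0 RLU compared to control
HPV (HC2), level 3: >= 20.0 RLU compared to control
Colposcopy/biopsy, lowest level: Normal, Inflammation, Pure HPV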
With latent class modeling, it is often possible to produce several different models that adequately fit the data. In the present exercise, two such models were generated: one with two latent classes and one with three. No single statistical tool can definitively support the selection of one model over the other; final model selection must therefore be based on knowledge of the biological processes under study, as well as the particular characteristics of the study sample. However, information criterion statistics such as the Akaike Information Criterion (AIC), available in Latent Gold Version 3.0, have been developed to indicate the relative distance of two or more models from a theoretical best model. The AIC is a function of the LL and the number of parameters (K) in the model: AIC = -2LL + 2K. Given two or more models fitted to the same dataset, the one with the lower AIC value is considered better. From the equation, it is apparent that as LL increases, AIC decreases (improves), whereas as the number of parameters K increases, AIC increases (worsens). Thus, all other things being equal, the AIC favors a more parsimonious model (lower K).
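AIC = -2LL + 2K can be applied to competing models as sketched below. The parameter count assumes a basic latent class model with nominal indicators: (C - 1) class-size parameters plus, for each class, (levels - 1) response probabilities per test; direct-effect parameters are supplied separately because their number depends on the parameterization. The log-likelihood values are hypothetical, not those of the models in this study, and the function names are illustrative.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion, AIC = -2 LL + 2 K; lower is better."""
    return -2.0 * log_likelihood + 2.0 * n_params


def lca_param_count(n_classes, levels, extra=0):
    """Free parameters of a basic latent class model with nominal indicators.

    (n_classes - 1) class sizes, plus (n_levels - 1) response probabilities
    per test and class. Direct-effect parameters are passed via `extra`
    because their number depends on the parameterization.
    """
    per_class = sum(n_levels - 1 for n_levels in levels)
    return (n_classes - 1) + n_classes * per_class + extra


# Hypothetical log-likelihoods for four trichotomous tests (not this study's values).
models = {
    "2-class": (-5000.0, lca_param_count(2, [3, 3, 3, 3])),
    "3-class": (-4985.0, lca_param_count(3, [3, 3, 3, 3])),
}
scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
best = min(scores, key=scores.get)
for name, score in scores.items():
    print(name, "AIC =", score)
print("lower AIC:", best)
```

In this hypothetical case the 3-class model's gain of 15 log-likelihood units outweighs its 9 extra parameters; with equal log-likelihoods the smaller model would win.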
Comparative statistics for various LCA models involving all four tests
2-class with one adjustment
2-class with two adjustments
3-class with one adjustment
Bivariate residuals for various LCA models
2-class with adjustment for VIA * Colpo/Biop
2-class with additional adjustment for Pap * Colpo/Biop
3-class with adjustment for VIA * Colpo/Biop
VIA * Pap
VIA * Colpo/Biop
VIA * HPV
Pap * Colpo/Biop
Pap * HPV
Colpo/Biop * HPV
As described earlier, sensitivity and specificity under the two models were calculated by recoding each test into its standard dichotomous categories. This was done by combining the appropriate conditional probabilities from Figures 1 and 2. For example, in the 3-class model, the sensitivity of VIA was 0.74 (the probability of VIA = abnormal or cancer, given class 3) and the specificity was 0.57. In that model, class 1 (p = 0.682) and class 2 (p = 0.200) combine to form non-disease. Specificity was calculated as (0.323 + 0.117) × 0.200 + (0.429 + 0.276) × 0.682, where 0.323 and 0.117 are the probabilities of VIA inflammation and VIA normal given class 2, 0.429 and 0.276 are the corresponding probabilities given class 1, and 0.200 and 0.682 are the probabilities of class 2 and class 1.
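The specificity arithmetic above can be reproduced from the quoted probabilities. The weighted sum evaluates to about 0.569, matching the reported 0.57 (0.568 in the comparative table) up to rounding of the inputs. Because the sum weights each class by its prevalence without dividing by the combined prevalence of the two non-disease classes (0.682 + 0.200 = 0.882), it corresponds to the joint probability of a negative VIA and non-disease; dividing by 0.882 gives the conditional probability P(VIA negative | class 1 or 2), about 0.645. The sketch computes both quantities; the variable names are illustrative.

```python
# Quantities quoted in the text for the 3-class model.
class_prevalence = {1: 0.682, 2: 0.200}   # the two non-disease classes
p_via_negative = {
    1: 0.429 + 0.276,  # P(VIA inflammation) + P(VIA normal) given class 1
    2: 0.323 + 0.117,  # P(VIA inflammation) + P(VIA normal) given class 2
}

# Weighted sum exactly as described in the text.
weighted_sum = sum(p_via_negative[c] * class_prevalence[c] for c in class_prevalence)

# Same sum renormalized over the non-disease classes: P(VIA negative | class 1 or 2).
conditional_specificity = weighted_sum / sum(class_prevalence.values())

print("weighted sum:             %.3f" % weighted_sum)
print("renormalized specificity: %.3f" % conditional_specificity)
```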
Comparative values: conventional versus LCA model results
Disease prevalence (± SE)
Sensitivity (± SE) (*)
Specificity (± SE)
1. Colposcopy/Biopsy LGSIL+
VIA Abnormal, CA
Pap LGSIL+
HPV >= 1.0 RLU
2. Colposcopy/Biopsy HGSIL+
VIA Abnormal, CA
Pap LGSIL+
HPV >= 1.0 RLU
3. LCA disease derived from trichotomous (†) VIA, Pap, HPV, including colposcopy/biopsy: 2-class solution
VIA Abnormal, CA
Pap LGSIL+
HPV >= 1.0 RLU
Colposcopy/Biopsy LGSIL+ (‡)
4. LCA disease derived from trichotomous (†) VIA, Pap, HPV, including colposcopy/biopsy: 3-class solution
VIA Abnormal, CA
HPV >= 1.0 RLU
Various aspects of study design and implementation can affect the validity of test accuracy studies. Regarding the accuracy of VIA as a test for (pre)cancer of the cervix, the majority of published studies to date have been conducted in low-resource settings. Differences in the threshold defining a positive VIA test, the threshold defining disease (LGSIL, HGSIL or cancer), the intensity and timing of provider training, the background experience and qualifications of the providers, sample sizes and the sexually transmitted disease risk of the participating women, among other factors, could all potentially account for differences in observed VIA accuracy estimates [40, 41].
Two problems in particular, the sometimes variable quality of gold/reference standard tests and verification bias, have the potential to substantially bias test accuracy results [42–44]. This analysis focused on clarifying the potential problem of imperfections in the gold/reference standard. It also addressed another purported issue with VIA accuracy studies: the potential lack of independence between VIA and colposcopy when the latter is used as, or as part of, the reference standard.
In our opinion, the latent class model with three classes gives a more realistic assessment of the true sensitivities and specificities of VIA, Pap smear and HC2 testing than do results from the conventional model with colposcopy/biopsy as the reference standard. This model showed good fit to the data and likely yields a more accurate study reference standard. The 3-class model offers a more stringent definition of disease, backed by three of the four tests (not the Pap test), and a prevalence of disease (CIN2+/HGSIL+ or cancer) for class three that is more consistent with that calculated using colposcopy/biopsy as the reference standard (around 10 percent). Prevalence values for the three latent classes in this model were 0.682 (SE = 0.029), 0.200 (SE = 0.035) and 0.118 (SE = 0.022) for latent classes one, two and three, respectively (Table 4). Given the high probabilities of more severe VIA, HC2 and colposcopy/biopsy results, class three in the 3-class model can most plausibly be interpreted as true "disease" (CIN2+/HGSIL+). This interpretation is supported by the high sensitivity of all tests in this class, by the similar probability profiles for VIA and colposcopy/biopsy, and by the high HC2 RLU levels (suggestive of high viral loads).
With this subset of the Zimbabwe data, the LCA-derived sensitivity and specificity for VIA were fairly consistent with the conventionally derived estimates as well as with the range of published values [4–13]. For comparison, the LCA-derived sensitivity of HC2 was close to 97 percent for both the 3- and 2-class models. This is considerably higher than the conventionally derived estimate (0.80) and more consistent with the ranges cited in some industrialized-country meta-analyses. However, other reviews indicate a slightly wider variation in HC2 sensitivity, and reports from developing countries show a sensitivity lower than that commonly reported for developed countries [32, 46–49]. In the recent scientific literature, the sensitivity of cytology at cutoff LGSIL+ for an outcome of CIN2+ ranged from 23% to 99% and the specificity from 7% to 97%. Our LCA-derived cytology sensitivity and specificity differed slightly from the conventionally derived values and fell within these published ranges.
In this analysis, HC2 sensitivity showed the greatest change (gain) of the three tests when the conventional results were compared with those using the latent reference standard, with class three defining disease. Specificity for HC2, on the other hand, was relatively low for the 3-class model. Only the Pap test in this model continued to perform "counter-intuitively", with higher probabilities of latent disease class three for less severe Pap test results. However, this finding is consistent with the results of the initial, conventionally derived cytology analysis, which indicated a low ability of Pap smears in this setting to identify true disease.
Given the conditions of the Zimbabwe study, where the colposcopist was blinded to all other test results, our LCA findings may reflect the more subjective nature of colposcopy and the consequences of a colposcopist judging a lesion to be insignificant and electing not to take a biopsy. The latent class model, which takes into account the additional information provided by the HC2, VIA and Pap results, classifies as true disease some lesions assessed on colposcopy as "insignificant", rendering those colposcopy/biopsy results "false negatives". These translate into a reduced LCA-derived sensitivity for the combined colposcopy/biopsy test.
Although the subjective nature of colposcopy has been commented on by many authors, the data from this exercise show the degree to which such subjectivity can affect sensitivity or specificity estimates when a test is evaluated with colposcopy/biopsy as a combined reference standard [50–52]. Mitchell et al (1998) summarized the sensitivity and specificity of colposcopy (compared to biopsy as the reference standard) through a meta-analysis of 9 studies. They found sensitivities ranging from 0.30 to 0.99 and specificities ranging from 0.39 to 0.93, with a weighted mean sensitivity and specificity of 0.85 and 0.69, respectively. Together, these findings suggest that colposcopy may have limitations when used as a "reference" standard, alone or in combination with biopsy, for cervical cancer test accuracy studies.
In this study, despite apparent imperfections in the reference standard, the conventionally derived VIA results fell within the range of published data and were relatively consistent with those from the 3-class LCA model (0.775 versus 0.744, respectively, for sensitivity and 0.639 versus 0.568, respectively, for specificity). HC2, however, proved to have higher sensitivity in this study when measured using LCA. This may explain, in part, the discrepancy observed between study results from industrialized countries and those originally reported from Zimbabwe. However, as noted earlier, lower HC2 sensitivity has also been observed in other developing-country studies [47–49]. This could similarly be due to inadequacies in the gold standard used, to imperfections in sample collection, transport or processing, or to a combination of these.
The age range of the women (25–55 years) was limited in this study for two important reasons. First, because large numbers of women can be infected with HPV without having persistent disease, we wanted to maximize the chance that any identified lesions would represent real disease rather than squamous metaplasia, inflammation or transient HPV infection. Second, as women age, especially after menopause, the squamo-columnar junction (which is used as an anatomical landmark for VIA assessments) recedes into the cervical canal and sometimes cannot be visualized. In such women, VIA is likely to be incomplete or unsatisfactory, affecting the accuracy of the test. This study criterion limits the generalizability of the results to women of the same age range. However, for developing countries seeking more affordable cancer prevention strategies, and given the natural history of the disease and the intrinsic limitations of VIA among older women, focusing on this age range has been shown to be cost-effective.
To have greater confidence in the results of conventional accuracy studies for existing or proposed cervical cancer screening tests, especially where there are questions about the reference standard, analyses involving LCA merit more attention. In this particular setting, LCA yielded accuracy estimates which, after adjustment, fell within the range of values observed in studies where high-quality colposcopy and biopsy were used [45, 46]. Additionally, LCA made it possible to account for any correlation between VIA and colposcopy, which both rely on visual interpretation of the cervix after acetic acid application. If not adjusted for, such dependency between the test and the gold standard artificially inflates sensitivity.
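The inflation of sensitivity through test-reference dependence can be shown with a small calculation. The sketch assumes a hypothetical test T and an imperfect reference R, and adds a conditional covariance term to the probability that both are positive within each true-disease stratum, a common way of parameterizing conditional dependence; all parameter values and names are hypothetical. Apparent sensitivity is P(T+ | R+) computed from the resulting joint distribution.

```python
def apparent_sensitivity(prev, sens_t, spec_t, sens_r, spec_r, cov_d=0.0, cov_n=0.0):
    """P(test positive | reference positive) when the reference is imperfect.

    cov_d / cov_n: conditional covariance between test and reference results
    among truly diseased / non-diseased subjects (0 = conditional independence).
    """
    both_pos_d = sens_t * sens_r + cov_d              # P(T+, R+ | diseased)
    both_pos_n = (1 - spec_t) * (1 - spec_r) + cov_n  # P(T+, R+ | non-diseased)
    # Check that every cell of each stratum's 2 x 2 table is a valid probability.
    for both, pt, pr in ((both_pos_d, sens_t, sens_r),
                         (both_pos_n, 1 - spec_t, 1 - spec_r)):
        cells = (both, pt - both, pr - both, 1 - pt - pr + both)
        if min(cells) < -1e-12:
            raise ValueError("covariance too large for the given accuracies")
    ref_pos = prev * sens_r + (1 - prev) * (1 - spec_r)
    return (prev * both_pos_d + (1 - prev) * both_pos_n) / ref_pos


params = dict(prev=0.10, sens_t=0.75, spec_t=0.65, sens_r=0.80, spec_r=0.95)  # hypothetical
independent = apparent_sensitivity(**params)
dependent = apparent_sensitivity(**params, cov_d=0.10, cov_n=0.02)
print("true %.3f, apparent (independent) %.3f, apparent (dependent) %.3f"
      % (params["sens_t"], independent, dependent))
```

With these values, apparent sensitivity is 0.606 under conditional independence and 0.83 with positive dependence, against a true sensitivity of 0.75: the imperfect reference alone pulls the estimate down, while correlated results push it up, here above the true value.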
LCA also has limitations, however, since a "diamond" standard is ultimately required to verify the LCA-defined truth. Under this approach, disease is not formally defined; rather, latent disease (truth) is a mathematically defined entity that does not necessarily correspond to a clinically relevant status. Moreover, LCA modeling requires specification of the joint distribution of test results conditional on disease status, and this model cannot be fully tested with the observed data. Statistical associations between tests should therefore be interpretable biologically; otherwise the meaning of the resulting estimates may be unclear. Consequently, researchers designing screening/diagnostic test studies should allocate resources for verifying the true disease outcome using an improved gold standard on at least a representative subset of study subjects. Alternative approaches to evaluating diagnostic/screening test accuracy (e.g., a composite reference standard) can also be considered when a gold standard does not exist [56, 57].
Additionally, given that a perfect reference standard for cervical cancer may be unattainable, even in a controlled clinical setting, efforts to determine the relative usefulness of new tests should also consider how consistent study results are with the weight of existing data, rather than trying to identify a single, best "truth". In this regard, a recent cost-effectiveness analysis of cervical cancer screening strategies showed that (under model assumptions, including industrialized-country accuracy values for HC2) VIA with immediate treatment of test-positive women at the first visit was about as effective in reducing cancer incidence over the lifetime of the simulated cohort as HC2 screening with treatment at a second follow-up visit (26 percent versus 27 percent reduction in cancer incidence, respectively). The HC2-based approach was, however, less cost-effective than VIA with immediate treatment, the main factor being the number of women in resource-limited settings who often drop out when testing and treatment require more than one visit. Cytology, followed by treatment of test-positive women at a second visit, was the least effective (19 percent reduction in cancer incidence) and the least cost-effective strategy.
In 1994, the World Health Organization recommended exploring the benefits of VIA as an alternative screening test for cervical cancer in underserved developing countries. This study confirms the accuracy of VIA, performed by nurse-midwives in such a low-resource setting, in detecting lesions requiring treatment. This finding, together with the facts that VIA is simple to administer, can be performed by nurse-midwives and gives immediately available results, makes it a particularly valuable option for many resource-poor settings.
Funding for the original data collection in Zimbabwe was through a United States Agency for International Development (USAID) cooperative agreement with JHPIEGO (Number CCP-3069-A-00-3020-00). Funding for this analysis and the preparation of the manuscript was provided in part by the Bill and Melinda Gates Foundation through the Alliance for Cervical Cancer Prevention (ACCP). SD Walter contributed to the conception of this analysis and to the rationale supporting LCA as an analysis option in this situation with a potentially imperfect gold standard. Credit also goes to Saifuddin Ahmed for his contribution to the study. Finally, the authors are grateful to all contributors to the initial studies upon which this analysis was based and to the women of Zimbabwe who took time to investigate their health and generously allowed these data to be used for furthering cervical cancer prevention worldwide. MA received financial support for his time from 1) the European Commission (Directorate of SANCO, Luxembourg, Grand-Duché du Luxembourg) through the European Cancer Network; 2) the DWTC/SSTC (Federal Services for Scientific, Cultural and Technical Affairs of the Federal Government, Belgium); and 3) the Gynaecological Cancer Cochrane Review Collaboration (Bath, UK).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.