Robustness of the Tariff method for diagnosing verbal autopsies: impact of additional site data on the relationship between symptom and cause

Background Verbal autopsy (VA) is increasingly being considered as a cost-effective method to improve cause of death information in countries with low quality vital registration. VA algorithms that use empirical data have an advantage over expert derived algorithms in that they use responses to the VA instrument as a reference instead of physician opinion. It is unclear how stable these data driven algorithms, such as the Tariff 2.0 method, are to cultural and epidemiological variations in populations where they might be employed. Methods VAs were conducted in three sites as part of the Improving Methods to Measure Comparable Mortality by Cause (IMMCMC) study: Bohol, Philippines; Chandpur and Comila Districts, Bangladesh; and Central and Eastern Highlands Provinces, Papua New Guinea. Similar diagnostic criteria and cause lists as the Population Health Metrics Research Consortium (PHMRC) study were used to identify gold standard (GS) deaths. We assessed changes in Tariffs by examining the proportion of Tariffs that changed significantly after the addition of the IMMCMC dataset to the PHMRC dataset. Results The IMMCMC study added 3512 deaths to the GS VA database (2491 adults, 320 children, and 701 neonates). Chance-corrected cause specific mortality fractions for Tariff improved with the addition of the IMMCMC dataset for adults (+ 5.0%), children (+ 5.8%), and neonates (+ 1.5%). 97.2% of Tariffs did not change significantly after the addition of the IMMCMC dataset. Conclusions Tariffs generally remained consistent after adding the IMMCMC dataset. Population level performance of the Tariff method for diagnosing VAs improved marginally for all age groups in the combined dataset. These findings suggest that cause-symptom relationships of Tariff 2.0 might well be robust across different population settings in developing countries. Increasing the total number of GS deaths improves the validity of Tariff and provides a foundation for the validation of other empirical algorithms.


Background
Reliable knowledge of the distribution of causes of death (COD) in populations is critically important for national and sub-national public health surveillance and planning [1]. However, a substantial proportion of deaths in developing countries occur outside health facilities, leading to low quality COD data [1][2][3][4]. Verbal autopsy (VA) has emerged as a cost-effective solution to determining COD in rural areas with limited contact with medical services [5,6].
VA involves interviewing the family of the deceased to collect information regarding the signs and symptoms surrounding the death. This information can be analyzed by physicians or computer algorithms to assign an individual COD which can then be aggregated to yield population level COD estimates. Currently, most computer algorithms for classifying the COD from verbal autopsy interviews (VAI) rely on a matrix describing the relationship between a set of predictors and a set of causes [7][8][9][10][11][12]. These associations can be determined either purely empirically, purely through expert opinion, or through a combination of both. Previous studies have shown that methods that rely purely on empirically derived associations, such as random forest and Tariff 2.0, outperform both physician coding and methods in which the associations are derived by expert opinion, such as InterVA [7,12,13].
Methods that use empirically derived associations require high quality COD data where the true underlying cause is known as accurately as possible ('gold standards') [5]. Automated diagnostic algorithms process data in which the true COD is known with a reasonable degree of certainty to identify predictive patterns in the responses from VAIs. Misclassifications in the underlying COD will likely result in the algorithms learning patterns that are wrong and would result in low-quality COD predictions.
The Population Health Metrics Research Consortium (PHMRC) previously collected VAs matched with COD based on medical record review with strict ex-ante diagnostic criteria from six sites in four different countries [14]. Only cases with definitive clinical diagnostic results were included in this study. This ensured that the underlying COD was known with the highest possible degree of certainty. These data were made publicly available in 2013. In this paper, we report on a new study that collected additional gold-standard VAs, adhering to the same strict diagnostic criteria and procedures as for the PHMRC study [15]. These data, collected as part of the Improving Methods to Measure Comparable Mortality by Cause (IMMCMC) study, include gold standard VAs from three sites, two of which were not included in the previous study. We report on the effect of including these data on the stability of the cause-symptom relationship that underlies the Tariff 2.0 diagnostic method.

Data
The original PHMRC study sites included Andhra Pradesh, India; Bohol, Philippines; Dar es Salaam, Tanzania; Mexico City, Mexico; Pemba Island, Tanzania; and Uttar Pradesh, India [14]. In total, 7836 adults, 2075 children, 1629 neonates, and 1002 stillbirths were collected. The study attempted to gather a similar number of cases (at least 100) for each major COD (as reflected in the Global Burden of Disease Study) in representative low and middle income country sites.
The IMMCMC study gathered gold standard VAs between 2011 and 2014 using the PHMRC long form VAI. Cases were identified using the same diagnostic criteria and cause list as the PHMRC study. The IMMCMC study was conducted at three sites: Chandpur and Comilla Districts in Bangladesh, Central and Eastern Highlands Provinces in Papua New Guinea, and Bohol Province in the Philippines. The data collection methodology is described in Additional file 1. Gold standards have been classified as having attained level 1, 2A, 2B, or 3 level of certainty, depending on the amount of information contained in the medical records [14]. Levels 1 and 2 represent cases where the certainty about the diagnosis of the underlying COD is greatest. In this report, we only analyze the 3512 cases that met gold standard level 1 or 2 criteria. This included 2491 adults, 320 children and 701 neonates. The majority of the cases were from Bohol, Philippines, which contributed 2384 cases; 1070 VAs were from Bangladesh and 58 were from Papua New Guinea.
The cause composition of the IMMCMC dataset is less balanced than the PHMRC data because the IMMCMC gathered all deaths at the study sites while the PHMRC study attempted to gather at least 100 cases for pre-determined causes. Table 1 shows the number of cases in each dataset by age module and cause. The IMMCMC database contributed a substantial number of additional deaths due to stroke, pneumonia, acute myocardial infarction, road traffic accidents, stillbirth, and preterm delivery.

Tariff Method
Tariff 2.0 is based on the determination of a set of Tariffs, based on a matrix of cause-symptom pair endorsement rates [16]. A cause-symptom pair is endorsed if the interviewee responded "yes" to a particular question, for which the true COD was known, or reports a duration greater than a pre-specified cutoff on questions that ask about length of time. Other values, including "Don't Know", are considered unendorsed and are counted in the denominator of the endorsement rate. The Tariff for any cause-symptom pair is calculated from the endorsement rate of that given cause-symptom pair and the distribution of endorsement rates for the symptom across all causes. Specifically, it is: where x i,j is the endorsement rate for symptom i across all cases in which the true cause was j, Median i is the median endorsement rate of symptom i across all causes, and IQR i is the interquartile range of endorsement rates of symptom i across all causes. For example, if 50% of respondents answered "yes" to chest pain for Acute Myocardial Infarction, and the median and interquartile range for chest pain across all causes were 18 and 20%, respectively, then the Tariff for chest pain for Acute Myocardial Infarction my would be 1.6. Tariffs are then tested for significance using a Monte Carlo experiment. The original symptom data is resampled with replacement, stratified by cause, to create 500 datasets. In each dataset, the total number of observations for each cause is constant, but the given rows, and thus the endorsements and endorsement rate, varies. Tariffs are calculated from each of these datasets and are then used to create 99% uncertainty intervals around each Tariff estimate. Tariffs where the uncertainty interval includes zeros, i.e. where we are not certain about the directionality of the association between the symptom and the cause, are removed. Lastly, Tariffs are rounded to the nearest 0.5. This prevents overfitting by treating similar values as containing the same amount of information instead of prioritizing minor, insignificant difference in predictive value.

Analysis
We assessed changes in the Tariffs by comparing the proportion of significant Tariffs before and after the addition of the IMMCMC dataset. We also measured the proportion of Tariffs that changed significantly or changed directionality (i.e. from positive to negative). We performed the out-of-sample validation procedure described in Murray et al. to assess changes in individual and population-level COD performance [17]. Individual-level performance was assessed using chance-corrected concordance (CCC). CCC is a measure of agreement between the predicted and gold standard cause assignment, adjusted for chance. Population-level performance was assessed using causespecific mortality fraction (CSMF) accuracy and chancecorrected cause-specific mortality fraction (CCCSMF) accuracy [18]. CSMF accuracy is a summary measure of performance between the predicted and gold standard cause assignment, and CCCSMF adjusts for chance.

Results
Adding the IMMCMC dataset did not change the directionality of the association between any cause-symptom pairs. In other words, the sign of all 2852 significant Tariffs was the same before and after adding the new data.
In the original study using only PHMRC data, 2852 of the 19,401 (14.7%) cause-symptom pairs across the three modules were statistically significant. After combining the PHMRC and IMMCMC datasets and recalculating Tariffs, 2563 of the original 2852 (89.9%) values remained significant. 97.2% of Tariffs did not change significantly after the addition of the IMMCMC dataset; less than 3% did. Table 2 shows the cause-specific change in performance measured by CCC, with individual causes ranked in decreasing order according to the magnitude of the difference in CCC before and after the addition of the new cases. For adults, CCC was highest for injuries and lowest for residual categories for both the PHMRC and combined datasets. For children, CCC was highest for injuries and lowest for infectious diseases and residual categories for both the PHMRC and combined datasets. For neonates, CCC was much higher for stillbirth than all other causes in both datasets. CCC for pneumonia and birth asphyxia were low for both datasets. In short, addition of the new cases did not alter the comparative performance of the Tariff method for various causes of death, as measured by CCC. Interestingly, nearly all adult and child causes experienced an increase in CCC, with the largest decrease for other injuries. Neonate causes mainly experienced an increase or little change, except for a decrease in meningitis/sepsis.
While the changes in Tariffs with the addition of new data were generally small, it is important to understand for which cause-symptom pairs the incorporation of new data had greatest effect. Table 3 shows the ten largest increases and decreases in Tariffs (all Tariffs shown in Additional file 2). Tariffs represent the strength (positive or negative) of the relationship between a particular cause and a given symptom, so large increases in Tariffs indicate increased importance of that symptom for the given cause, while large decreases signify the opposite. The largest changes in Tariffs were mainly associated with injury and maternal deaths where Tariffs would expect to be high because of the likelihood that symptoms for these conditions would be more clearly distinguishable and remembered. All symptoms associated with large increases in Tariffs had a strong association with a particular cause. The same pattern was also observed for large decreases. Although the absolute change in Tariffs for these pairs might have been large, the strength of the association was sufficiently clear that the change did not distort the predictive ability of the algorithm in selecting the correct underlying COD. Table 4 which shows the overall change in predictive performance for Tariff 2.0 before and after the IMMCMC dataset was added to the PHMRC dataset (full performance details are shown in Additional file 3). In fact, overall diagnostic predictive accuracy for adults and children increased marginally for both populations (CSMF) and individuals (CCC), and for neonatal CSMFs, but decreased slightly when assessing diagnostic accuracy for individual neonatal deaths (CCC).

Discussion
Automated diagnostic methods such as Tariff 2.0 have the potential to revolutionize national mortality surveillance system by facilitating huge improvements in the availability and quality of data on causes of death in hitherto underserved populations. But are these methods reliable and generalizable and likely to perform similarly in different populations? This study has confirmed the robustness of the Tariff 2.0 method when new goldstandard data from different populations were incorporated. The addition of the IMMCMC data to the publicly available PHMRC dataset confirmed the results of the original Tariffs derived solely from the PHMRC dataset, and led to a slight overall improvement in the diagnostic performance of the algorithm.
Adding additional deaths to the PHMRC dataset to calculate Tariffs further clarified the relationship between various symptoms and causes. No Tariffs changed direction (i.e. went from positive to negative, or the converse) and the vast majority of Tariffs that were significant using the PHMRC dataset were also significant when using the combined dataset. Some Tariffs which were statistically significant in the original PHMRC data were not, when using the combined dataset. These differences likely reflect instances where the Tariffs were over-fit to 'noise' in the raw data, and the addition of new data served to create more generalizable Tariffs.
Given that the majority of causes experienced an increase in CCC, most changes in the cause-symptom relationship as a result of adding new data led to improved predictive performance. Decreases in the CCC for some causes were likely due to spurious associations between symptom and cause that were a result of relatively few deaths present in both the PHMRC and IMMCMC datasets for certain causes. For example, only 6 neonatal pneumonia deaths (7% increase) and 32 meningitis/sepsis deaths (19% increase) were added to the PHMRC dataset from the IMMCMC study. Rather, 80-90% of neonatal deaths were attributed to stillbirth, preterm delivery, or birth asphyxia. The similar symptoms of pneumonia and meningitis/sepsis, together with the comparatively few cases, provided insufficient information for Tariff to distinguish between the causes, resulting in a decrease in CCC for neonates. A greater number of deaths attributed to more causes in the adult and child modules contributed to the increased diagnostic performance of Tariff when applied to the combined dataset.
Large changes in the Tariffs tended to be limited to certain maternal and injury causes. These changes reflect the addition of a diversity of new cases that were not in the PHMRC dataset and suggest the likelihood of cultural differences affecting responses to some questions pertinent to maternal and injury deaths. Difficulties with describing the intent of the question, or communication skills, are also likely to affect the quality of the interview process. Given the multiple requirements of interviewers when conducting a VAI, it is hardly surprising that incomplete or misleading data will be collected in some cases. This makes it much harder for the algorithm to correctly predict the most probable COD, and likely lead to unsubstantiated changes in Tariffs.
The diversity of study populations present in the combined dataset suggests that the associations generated by the algorithm are likely to be generalizable. Otherwise, the associations may reflect a cultural bias in which symptoms are noticed, communicated, remembered and reported from a limited set of study populations.
Previous studies have shown that respondents often report different information at repeat visits regarding the same death, but key symptoms are often remembered and are sufficient to properly classify the COD [19,20]. With a large enough training data set, algorithms should be able to distinguish between these key predictors and background noise. It is also necessary to include VAs which include missing data or where the pattern of responses may not seem consistent with the true COD, as long as they were collected under real survey conditions. These observations represent 'noise' in the data that are propagated when the algorithm is applied to deaths notified to vital registration systems. Properly calibrated computer algorithms such as Tariff 2.0 will be able to account for this 'noise' and adjust the predictions accordingly.
The addition of deaths from the IMMCMC study to the PHMRC GS database is an important step in the continual validation of empirical VA algorithms. There has been some criticism of empirical methods that are derived and tested on gold standard datasets, but it is important to recognize the benefits of a GS database [9,[21][22][23]. First, GS deaths provide evidence that the responses provided during a VA do not necessarily make sense in a clinical context. For example, 147 deaths in the PHMRC dataset were reported as stillbirth but described as neonatal deaths, which is impossible because stillbirth implies the birth did not occur [21]. Second, GS deaths provide a basis for assessing the validity for text items in open-ended responses. The potential of these "open narrative" responses to improve diagnostic accuracy has yet to be fully realized [16]. Third, deaths that occur in-hospital are different than deaths that occur at home because the terminal events are prolonged by therapeutic activity, but the signs and symptoms which precipitated the hospital admission are what are asked in a VA [9]. Some diseases may have different presentations at home than in the hospital: e.g. families generally have much less chance to observe a woman dying in labor or a neonate dying in a special nursery in a hospital than they would have of observing these events at home. Collecting GS data for such conditions would set standards for such data collection environments. While GS databases have limitations, they provide a valuable basis for VA validation research and implementation. They also set the foundation for adding additional cases of deaths that can provide empirical evidence about the generalizability of VA methods. We have categorized the IMMCMC dataset as gold standard, but we recognize its limitations. The sampling strategy of collecting deaths in the IMMCMC study was different from that of the PHMRC study. All deaths in study hospitals were collected for the IMMCMC dataset, while approximately 100 deaths per cause were intended for collection in the PHMRC dataset. This difference may bias the cause-symptom relationship of less frequent causes in the IMMCMC dataset towards that of the PHMRC dataset. Furthermore, while the IMMCMC dataset added 3513 death cases, some causes (e.g. AIDS and lung cancer) had less than 20 deaths. Changes in the Note: Symptoms include non-word symptoms where the change in Tariff value was at least 1.0 and the Tariff value using either the PHMRC dataset or the combined PHMRC/IMMCMC dataset is at least 1.0. Tariff values of zero represent Tariffs that are not statistically significant