A coarsened multinomial regression model for perinatal mother to child transmission of HIV

Background: In trials designed to estimate rates of perinatal mother to child transmission of HIV, HIV assays are scheduled at multiple points in time. Still, infection status for some infants at some time points may be unknown, particularly when interim analyses are conducted.

Methods: Logistic regression models are commonly used to estimate covariate-adjusted transmission rates, but their methods for handling missing data may be inadequate. Here we propose using coarsened multinomial regression models to estimate cumulative and conditional rates of HIV transmission. Through simulation, we compare the proposed models to standard logistic models in terms of bias, mean squared error, coverage probability, and power. We consider a range of treatment effect and visit process scenarios, while including imperfect sensitivity of the assay and contamination of the endpoint due to early breastfeeding transmission. We illustrate the approach through analysis of data from a clinical trial designed to prevent perinatal transmission.

Results: The proposed cumulative and conditional models performed well when compared to their logistic counterparts. Performance of the proposed cumulative model was particularly strong under scenarios where treatment was assumed to increase the risk of in utero transmission but decrease the risk of intrapartum and overall perinatal transmission and under scenarios designed to represent interim analyses. Power to estimate intrapartum and perinatal transmission was consistently higher for the proposed models.

Conclusion: Coarsened multinomial regression models are preferred to standard logistic models for estimation of perinatal mother to child transmission of HIV, particularly when assays are missing or occur off-schedule for some infants.


Background
In trials designed to evaluate the efficacy of an intervention to prevent perinatal mother to child transmission (PMTCT) of human immunodeficiency virus (HIV), infants are usually tested within 48 hours after birth, with a second visit scheduled 4 to 8 weeks after birth, to determine their HIV status. Test results from these two visit windows are used to ascertain the three main outcomes of scientific interest:
A1 The probability of in utero transmission, estimated by the fraction of infants testing positive at birth;
A2 The probability of perinatal transmission, estimated by the fraction of infants testing positive by 8 weeks;
A3 The probability of intrapartum transmission, estimated by the fraction of infants testing positive by 8 weeks who tested negative at birth.
Subsequent visits may take place (i.e., after 8 weeks) but tests at these visits only contribute information about the outcomes of interest if the infant has missed earlier scheduled visits.
In primary analyses, we are usually interested in obtaining unadjusted estimates of A1, A2, and A3. In secondary analyses, covariate-adjusted estimates are often desired. If every infant was tested in every visit window, we could use a binary endpoint approach such as logistic regression to obtain adjusted estimates for each of the three main outcomes. However, missed and off-schedule visits are not uncommon in PMTCT trials. And, even if there are no missed visits, interim analyses may occur when only a fraction of the infants are old enough for the second visit. Table 1 lists a selection of primary papers from trials aimed at reducing PMTCT of HIV and summarizes their methods for unadjusted and adjusted analysis. These methods are among the more commonly used for estimating PMTCT of HIV. In adjusted analyses, the endpoint is generally modeled as either binary or right-censored continuous using logistic or Cox proportional hazards (PH) regression, respectively. For both logistic and Cox PH models, methods currently used for handling missing data may be inadequate. For example, when the logistic model is used and a test result is missing for an infant who has not previously tested positive, the observation is dropped, although if subsequent tests are negative, the missing test result may be imputed to be negative. When the Cox PH model is used, an infant's time to HIV infection is right censored at his or her last negative test; however, approaches for addressing timing of infection when a missing visit is followed by a positive test and there have been no previous positive tests may be inadequate. Some authors use the time of the first positive test as the time of infection while others use the midpoint between the last negative and first positive tests (or birth and the first positive test). Cox PH models do not specifically address treatment effects on A1-A3 but instead estimate the average treatment effect over the observation period. 
While we generally think of using Cox PH models to model time to an event, in the case of PMTCT, the event of interest has already occurred (or not occurred) when an infant is first tested. Because of the imperfect sensitivity of the test, however, we may not have been able to detect it.
We propose a coarsened multinomial regression model for analyzing PMTCT of HIV that accommodates missing and off-schedule test data and allows the effect of treatment to depend upon time. The approach is motivated by the HIV Prevention Trials Network (HPTN) 024 study, a multi-site placebo-controlled trial of antibiotics to prevent chorioamnionitis and, therefore, PMTCT of HIV. In HPTN 024, of the 2,052 liveborn infants, 1,813 (88%) had HIV tests within 48 hours of delivery and 1,696 (83%) had tests 4 to 8 weeks after delivery. Only 1,584 (77%) had tests both within 48 hours of delivery and 4 to 8 weeks after delivery. While missed visits were sometimes due to the infant's death, in some cases the mother simply forgot or was unable to bring the infant in for follow-up. Many mothers did not deliver at the study hospital and had to bring their infants in later for the birth HIV test, resulting in visits that occurred off schedule.
Table 1: Methods for unadjusted and adjusted analysis in selected PMTCT trials

Study                 Unadjusted            Adjusted                              Censoring
Wiktor et al. [4]     KM                    -                                     at last negative test
Guay et al. [16]      KM                    PH                                    at last negative test
Dabis et al. [17]     KM                    PH                                    not stated
Shaffer et al. [6]    KM                    logistic regression                   not stated
Kuhn et al. [18]      KM                    PH and logistic regression            at last follow-up
Fawzi et al. [19]     Chi-square tests      PH                                    not stated
Dorenbaum et al. [7]  Fisher's exact tests  logistic regression                   -
Moodley et al. [8]    KM                    PH and logistic regression            at last follow-up
Magder et al. [2]     likelihood approach   logistic regression (coarsened data)  -

KM: Kaplan-Meier; PH: proportional hazards.

Bertolli et al. [1] use a coarsened data approach to estimate unadjusted rates of in utero and intrapartum transmission based on the assumption that infants with missing test data are distributed among transmission groups in the same proportions as infants with non-missing test data. Magder et al. [2] expand on this approach, using logistic regression to estimate covariate-adjusted associations between various risk factors and presumed time of transmission, while allowing for misclassification. Little and Rubin [3, pp. 169-70] describe an approach for estimating the parameters of a coarsened multinomial model using the Expectation Maximization algorithm. While it addresses the problem of incomplete data, it does so only for a single sample or independent samples. Here we allow for regression models of the probabilities for the three outcomes of interest. We begin by describing the coarsened multinomial model, then lay out strategies to adjust for covariates. Next, we describe a simulation study designed to evaluate the performance of the proposed regression estimators and compare them to more commonly used approaches in PMTCT trials. We illustrate our approach with an analysis of HPTN 024 study data, then follow with discussion and conclusions.
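The single-sample Expectation Maximization idea mentioned above can be illustrated with a short sketch. This is not the procedure from Little and Rubin's text or from this paper; it is a minimal illustration under the assumption that each infant contributes an indicator vector with a 1 in every outcome cell consistent with the observed tests. The function name `em_coarsened_multinomial` and the counts in `data` are hypothetical.

```python
from typing import List, Sequence


def em_coarsened_multinomial(
    ys: Sequence[Sequence[int]], n_iter: int = 200
) -> List[float]:
    """Estimate cell probabilities from coarsened indicator vectors via EM.

    Each row of `ys` has a 1 in every cell the infant's true outcome could
    occupy and a 0 elsewhere (a fully observed infant has a single 1).
    """
    k = len(ys[0])
    pi = [1.0 / k] * k  # start from equal cell probabilities
    for _ in range(n_iter):
        expected = [0.0] * k
        for y in ys:
            # E-step: split this infant across the cells it is consistent
            # with, in proportion to the current probabilities.
            mass = sum(p for p, flag in zip(pi, y) if flag)
            for j, flag in enumerate(y):
                if flag:
                    expected[j] += pi[j] / mass
        # M-step: the new probabilities are the expected cell fractions.
        pi = [e / len(ys) for e in expected]
    return pi


# Hypothetical sample: 90 fully observed infants and 10 infants known only
# to have first tested positive at the first or the second visit.
data = (
    [(1, 0, 0)] * 10 + [(0, 1, 0)] * 20 + [(0, 0, 1)] * 60 + [(1, 1, 0)] * 10
)
pi_hat = em_coarsened_multinomial(data)
print([round(p, 4) for p in pi_hat])
```

In this made-up sample the ten ambiguous infants end up split between the first two cells in the same 1:2 ratio as the fully observed positives, so the estimates settle near 0.133, 0.267, and 0.6.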

The coarsened multinomial model
In this section, we present the coarsened multinomial model. In this general presentation, we assume that there are J visit windows. Usually, when estimating PMTCT, J = 2, corresponding to birth and 4 to 8 weeks. However, depending upon the study design, J may be larger as in Wiktor et al. [4], where the main endpoint was infection status at three months, and infants were tested at birth, four weeks, and three months.
We begin by dividing the follow-up time into windows as follows: $[t_{11}, t_{12}), [t_{21}, t_{22}), \ldots, [t_{J1}, t_{J2})$. Here $t_{j1}$ and $t_{j2}$ indicate the times at which the jth visit window starts and ends, respectively. These intervals do not have to be and usually are not contiguous. In other words, $t_{j2}$ is not necessarily equal to $t_{j+1,1}$. Unscheduled or off-schedule visits result in tests that occur in the interval $[t_{j2}, t_{j+1,1})$.
We define a complete response vector for the ith, i = 1,...,

To illustrate how the observed vector Y relates to the unobserved but true outcome Y*, we look at two possible visit and outcome patterns. First, we examine the effect of a missed visit assuming J = 2 visit windows. We consider infant A, who was not tested until the second visit, at which point he or she tested positive. We do not know if infant A would have tested positive had he or she come in for the first visit. We can say, however, that the infant would have tested positive for the first time at the first visit or at the second visit if tested at both. In other words, Y* for this infant may be (1, 0, 0)' or (0, 1, 0)' but is not (0, 0, 1)'. Therefore, by equations (1), (2), and (3), Y = (1, 1, 0)'. Next, we consider infant B, who missed the first visit and tested negative at the second visit. We assume that if an infant tests negative at the end of the study he or she was negative throughout the study; therefore, Y* = (0, 0, 1)' and Y = (0, 0, 1)' for this infant. In this case, even though the infant was not tested in every visit window, we have complete information regarding his or her outcome. This illustrates one difficulty involving missed visits. If an infant is uninfected and misses all visits except the last (the Jth visit), we still have complete information about him or her (as in B); however, if the infant is infected at the last visit (as in A), we have incomplete information about him or her.
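The two examples can be reproduced with a small sketch. The rule coded here is inferred from the examples for infants A and B rather than taken from the formal definitions, so it should be read as an assumption: a negative test in a window rules out a first positive in that window or any earlier one, and a positive test in a window means the first positive falls in that window or an earlier one. The function name `coarsen` and the encoding of tests (`None` for a missed visit, `0` for negative, `1` for positive) are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def coarsen(tests: Sequence[Optional[int]]) -> Tuple[int, ...]:
    """Return a coarsened indicator vector of length J + 1.

    tests[j] is the result in visit window j + 1: None (missed), 0, or 1.
    The last element of the result stands for "no positive test in any
    window". An element is 1 when that outcome is consistent with the
    observed tests.
    """
    J = len(tests)
    # Latest window with a negative test: the first positive comes later.
    last_neg = max((j for j, t in enumerate(tests) if t == 0), default=-1)
    # Earliest window with a positive test: the first positive is no later.
    first_pos = min((j for j, t in enumerate(tests) if t == 1), default=J)
    return tuple(1 if last_neg < j <= first_pos else 0 for j in range(J + 1))


infant_a = coarsen([None, 1])  # missed birth visit, positive at visit 2
infant_b = coarsen([None, 0])  # missed birth visit, negative at visit 2
interim = coarsen([0, None])   # negative at birth, visit 2 not yet reached
print(infant_a, infant_b, interim)
```

Under these assumptions an infant who is negative at birth but has not yet reached the second visit, as at an interim analysis, is consistent with both a later first positive and no positive at all. A contradictory pattern such as a positive followed by a negative yields a vector of zeros in this sketch and would need separate handling.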

Modeling cumulative rates of transmission
We begin by considering regressions on cumulative probabilities in order to estimate endpoints A1 and A2, the in utero and perinatal transmission rates. We define the probability that the ith infant's first positive test occurs in the jth visit window as $\pi_{ij}$ and the probability that the ith infant's first positive test occurs after the last visit window as

$$\pi_{i,J+1} = 1 - \sum_{j=1}^{J} \pi_{ij}. \qquad (4)$$

To examine the relationship between a set of predictors, $(X_{i1}, \ldots, X_{im})$, and the probability that infant i tests positive at or before the jth visit, we define the following regression model:

$$g\left(\sum_{k=1}^{j} \pi_{ik}\right) = X_i'\beta_j, \quad j = 1, \ldots, J, \qquad (5)$$

where g(·) is a link function that specifies the relationship between the predictors $X_i = (1, X_{i1}, \ldots, X_{im})'$ and the response, through the parameter vector $\beta_j = (\beta_{j0}, \ldots, \beta_{jm})'$ of length m + 1. For ease of exposition, we assume that a predictor is relevant for all visit windows; however, this is not necessary, as will be illustrated in the HPTN 024 analysis.
When modeling cumulative rates or probabilities, two appropriate choices for the link function are the log link, where g(p) = log(p), and the logit link, where g(p) = logit(p) = log{p/(1 -p)}. When using the logit (log) link, β jl , l = 1...,m, is interpreted as the log odds ratio (log relative risk) for testing positive at or before the jth visit window per one unit increase in X il , l = 1,..., m.
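We combine and re-write equations (4) and (5) to obtain the following expressions for $\pi_{ij}$:

$$\pi_{i1} = g^{-1}(X_i'\beta_1), \qquad \pi_{ij} = g^{-1}(X_i'\beta_j) - g^{-1}(X_i'\beta_{j-1}), \quad j = 2, \ldots, J, \qquad \pi_{i,J+1} = 1 - g^{-1}(X_i'\beta_J).$$

The log-likelihood, written in terms of the coarsened data, is given by

$$\ell(\beta_1, \ldots, \beta_J) = \sum_{i} \log\left(\sum_{j=1}^{J+1} Y_{ij}\,\pi_{ij}\right), \qquad (6)$$

where $Y_{ij}$ is the jth element of the observed vector Y for infant i, and maximum likelihood methods are used to estimate the parameters (see Maximum likelihood estimation of parameters below).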
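As a rough illustration of how the cumulative model feeds the coarsened-data likelihood, the sketch below assumes the logit link, takes each cell probability as the difference of adjacent cumulative probabilities, and takes each infant's likelihood contribution as the sum of the cell probabilities consistent with the observed tests. The function names, the two-infant data set, and the coefficient values are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def expit(z: float) -> float:
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def cumulative_cell_probs(
    x: Sequence[float], betas: Sequence[Sequence[float]]
) -> List[float]:
    """Cell probabilities (pi_1, ..., pi_J, pi_{J+1}) for one infant.

    x is the covariate vector with a leading 1; betas[j] holds the
    coefficients for the cumulative probability through window j + 1.
    """
    cum = [expit(sum(b * v for b, v in zip(beta, x))) for beta in betas]
    pi = [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]
    pi.append(1.0 - cum[-1])  # never tests positive within the windows
    return pi


def coarsened_loglik(xs, ys, betas) -> float:
    """Sum over infants of log(sum of cell probabilities flagged in y)."""
    total = 0.0
    for x, y in zip(xs, ys):
        pi = cumulative_cell_probs(x, betas)
        total += math.log(sum(p for p, flag in zip(pi, y) if flag))
    return total


# Hypothetical data: intercept plus a treatment indicator, J = 2 windows.
xs = [(1.0, 0.0), (1.0, 1.0)]
ys = [(1, 1, 0), (0, 0, 1)]              # coarsened outcome vectors
betas = [(-3.0, -0.2), (-2.0, -0.4)]     # made-up coefficients
print(coarsened_loglik(xs, ys, betas))
```

The first infant, known only to be positive by the second visit, contributes the cumulative probability through that visit; the second, negative at the last visit, contributes its complement.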

Modeling conditional rates of transmission
We now consider regressions on conditional probabilities to estimate endpoint A3, the intrapartum transmission rate. To examine the relationship between a set of predictors $(X_{i1}, \ldots, X_{im})$ and $\pi^{c}_{ij}$, the conditional transmission rate or probability of first testing positive at the jth visit given a negative test result at the (j - 1)th visit, we define the following regression model:

$$g(\pi_{i1}) = X_i'\beta_1, \qquad g(\pi^{c}_{ij}) = X_i'\beta^{c}_j, \quad j = 2, \ldots, J,$$

where g(·) is a link function that specifies the relationship between the predictors $X_i = (1, X_{i1}, \ldots, X_{im})'$ and the response, through the parameter vectors $\beta_1 = (\beta_{10}, \ldots, \beta_{1m})'$ and $\beta^{c}_j = (\beta^{c}_{j0}, \ldots, \beta^{c}_{jm})'$, each of length m + 1. When using the logit (log) link, $\beta^{c}_{jl}$, l = 1,..., m, is interpreted as the log odds ratio (log relative risk) for testing positive at the jth visit, given a negative result at the (j - 1)th visit, per one unit increase in $X_{il}$, l = 1,..., m.
We calculate $\pi_{ij}$, j = 2,..., J, for use in the log-likelihood as

$$\pi_{ij} = \pi^{c}_{ij}\left(1 - \sum_{k=1}^{j-1} \pi_{ik}\right),$$

where $\pi_{i1} = g^{-1}(X_i'\beta_1)$ and $\pi^{c}_{ij} = g^{-1}(X_i'\beta^{c}_j)$. Equation (6) then provides the likelihood for the coarsened data.
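The step from conditional to unconditional probabilities can be sketched as follows, assuming the logit link and assuming that each conditional probability is multiplied by the probability of not having tested positive in any earlier window. The function names and coefficient values are hypothetical.

```python
import math
from typing import List, Sequence


def expit(z: float) -> float:
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def linear(beta: Sequence[float], x: Sequence[float]) -> float:
    """Linear predictor x'beta."""
    return sum(b * v for b, v in zip(beta, x))


def conditional_cell_probs(
    x: Sequence[float],
    beta1: Sequence[float],
    beta_cond: Sequence[Sequence[float]],
) -> List[float]:
    """Cell probabilities implied by a conditional model for one infant.

    beta1 models the first-window probability; beta_cond[j] models the
    probability of a first positive in window j + 2 given no positive in
    any earlier window.
    """
    pi = [expit(linear(beta1, x))]
    at_risk = 1.0 - pi[0]  # probability of no positive so far
    for beta in beta_cond:
        cond = expit(linear(beta, x))
        pi.append(cond * at_risk)
        at_risk *= 1.0 - cond
    pi.append(at_risk)  # never tests positive within the windows
    return pi


# Hypothetical treated infant with J = 2 windows and made-up coefficients.
x = (1.0, 1.0)
pi = conditional_cell_probs(x, beta1=(-3.0, 0.3), beta_cond=[(-2.0, -0.5)])
print(pi)
```

With the logit link every conditional probability lies strictly between 0 and 1, so the resulting cell probabilities are positive and sum to 1 without further restriction in this sketch.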

Maximum likelihood estimation of parameters
To obtain maximum likelihood estimates of the regression parameters, we maximize equation (6) using numerical optimization techniques. The optimization procedure requires that $\pi_{i1}, \ldots, \pi_{iJ}$ lie between 0 and 1 and that $\sum_{j=1}^{J} \pi_{ij}$ be less than 1 for all i. If these constraints are not met by the form of the regression models, we impose them through the optimization procedure via non-linear constraints on the coefficients. Further implications of the constraints are presented in the discussion.
For the analyses presented here, numerical optimization was carried out using a quasi-Newton algorithm. The algorithm is an efficient modification of Powell's Variable Metric Constrained WatchDog algorithm, which is available in SAS PROC NLP [5]. Additional details regarding our implementation, including SAS macros for fitting the cumulative and conditional models with the logit link for an arbitrary number of visit windows, are available from the authors upon request.
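To make the constraint on the cumulative model concrete: with a strictly increasing link such as the logit, a difference of adjacent cumulative probabilities is positive exactly when the later linear predictor exceeds the earlier one. The sketch below checks that condition for every infant in a data set. It is an illustrative feasibility check, not the authors' SAS implementation, and the function name and values are hypothetical.

```python
from typing import Sequence


def cumulative_constraints_ok(
    xs: Sequence[Sequence[float]], betas: Sequence[Sequence[float]]
) -> bool:
    """True if every infant's linear predictors increase across windows.

    For a strictly increasing link, increasing linear predictors give
    positive cell probabilities pi_i1, ..., pi_iJ under the cumulative
    model. With the logit link the cumulative probability through the
    last window is automatically below 1.
    """
    for x in xs:
        etas = [sum(b * v for b, v in zip(beta, x)) for beta in betas]
        if any(later <= earlier for earlier, later in zip(etas, etas[1:])):
            return False
    return True


# Hypothetical covariates (intercept, treatment) and coefficient sets.
xs = [(1.0, 0.0), (1.0, 1.0)]
feasible = [(-3.0, -0.2), (-2.0, -0.4)]
infeasible = [(-3.0, 1.5), (-2.0, -0.4)]  # treated infant: -1.5 then -2.4
print(cumulative_constraints_ok(xs, feasible))
print(cumulative_constraints_ok(xs, infeasible))
```

A check of this kind could be used to reject or penalize candidate coefficient values inside an optimizer; how the constraint is actually enforced depends on the optimization routine.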

Simulation study
We performed simulations to assess the properties of the proposed regression estimators and compare them to more commonly used regression approaches. We considered the case of two visit windows corresponding to birth and 4 to 8 weeks. For each simulated data set, we randomly generated covariates that informed the infant's simulated mode of transmission (in utero, during delivery, or neither). We allowed for imperfect sensitivity of the assay shortly after transmission by simulating time of detectable infection and allowed the simulations to reflect additional positive test results at the 4 to 8 week visit due to breastfeeding. We randomly generated a set of visit times for each infant, independent of covariates but dependent upon the endpoint, with infants having simulated time of detectable infection equal to 0 days slightly more likely to attend the 4 to 8 week visit. We determined each infant's observed endpoints by comparing his or her simulated time of detectable infection to his or her simulated visit times. Additional details regarding the simulation of time of detectable infection and visit process are provided in Appendices A and B (Additional files 1 and 2), respectively.
We fit the cumulative and conditional regression models using standard logistic regression and the proposed coarsened multinomial (CM) regression models with the logit link. We considered two sets of logistic models: the first (L-CUM) modeled infection at birth and infection at 4 to 8 weeks, and the second (L-COND) modeled infection at birth and infection at 4 to 8 weeks among infants known to be HIV negative at birth. The logistic models considered all infants for whom HIV status at birth and HIV status at 4 to 8 weeks could be determined and were chosen to represent those commonly used in the analysis of PMTCT of HIV [6][7][8]. Cox PH models, although also used, do not specifically address treatment effects on A1-A3 but instead estimate the average effect of treatment over the observation period; therefore, we did not assess them in our simulations. We compared the effects of treatment as obtained from the regression models to the true effects of treatment according to which the data were generated, determining bias, mean squared error (MSE), 95 percent coverage probability (CP), and power for each estimator.
We considered several scenarios, allowing for different treatment effects (TEs) and different visit processes (VPs) that resulted in varying amounts of missing data. Results are provided for each scenario based on 1000 data sets of 1500 observations each.
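The four performance measures can be computed from the per-data-set estimates as in the sketch below. It assumes Wald-type 95 percent intervals and a two-sided Wald test of no treatment effect, which may differ from the interval and test actually used in the study; the function name and the four example estimates are hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize(
    estimates: Sequence[float],
    std_errors: Sequence[float],
    truth: float,
    z: float = 1.96,
) -> Dict[str, float]:
    """Bias, MSE, coverage probability, and power across simulated data sets."""
    pairs = list(zip(estimates, std_errors))
    return {
        # Average estimate minus the value used to generate the data.
        "bias": mean(estimates) - truth,
        # Average squared distance from the true value.
        "mse": mean([(e - truth) ** 2 for e in estimates]),
        # Share of Wald intervals e +/- z * se that contain the truth.
        "coverage": mean([1.0 if abs(e - truth) <= z * s else 0.0 for e, s in pairs]),
        # Share of data sets in which the Wald test rejects a zero effect.
        "power": mean([1.0 if abs(e) > z * s else 0.0 for e, s in pairs]),
    }


# Hypothetical log odds ratio estimates from four simulated data sets.
result = summarize(
    estimates=[-0.5, -0.3, -0.7, -0.1],
    std_errors=[0.2, 0.2, 0.2, 0.2],
    truth=-0.4,
)
print(result)
```

In a full simulation the lists would hold one estimate and standard error per simulated data set for a given scenario and estimator.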
In carrying out numerical optimization, we chose the convergence criteria for the proposed cumulative and conditional regression models to coincide with the convergence criteria for the logistic regression models in SAS PROC LOGISTIC [9].

HPTN 024
To illustrate our approach, we analyzed data from HPTN 024, a multi-site, double-blinded, placebo-controlled trial of antibiotics to prevent chorioamnionitis and, therefore, perinatal transmission of HIV. The trial enrolled pregnant, HIV positive women receiving care in hospitals and clinics in Malawi, Tanzania, and Zambia. Women were randomized to receive either treatment or placebo. Treatment consisted of two courses of antibiotics, with the first course administered at enrollment (20 to 24 weeks gestation) and the second at the onset of contractions and/or premature rupture of membranes. All women and their liveborn infants were offered single dose nevirapine per World Health Organization recommendation [10]. Women were followed during their pregnancies, and their infants were followed postnatally. Visit windows for determining in utero and delivery/early postnatal transmission in this breastfeeding population were 0 to 48 hours and 4 to 6 weeks, respectively. Because over half of the visits scheduled to occur between 4 and 6 weeks actually took place between 6 and 8 weeks, we extended the second visit window to 4 to 8 weeks for analysis purposes. We also extended the birth visit window to 0 to 7 days.
At the first scheduled interim analysis, the NIAID Vaccine and Prevention Data and Safety Monitoring Board reviewed trial progress and concluded that, while the statistical evidence established neither benefit nor harm, the available evidence ruled out targeted levels of benefit. They further recommended that HPTN 024 stop recruitment and administration of study drug and continue follow-up of enrolled women and infants. Administration of the study drug was halted on March 5, 2003. Additional details regarding the 024 study are provided by Taha et al. [11].
In this analysis, we examined the association between PMTCT and antibiotics, comparing outcomes for infants born to mothers randomized to antibiotics who delivered prior to March 5, 2003 to infants born to mothers randomized to placebo or to mothers randomized to antibiotics who delivered after March 5, 2003. Additional covariates of interest were log maternal viral load, maternal CD4 count, and infant gender. In the birth model, we adjusted for mother's use of nevirapine and, in the 4 to 8 week model, for mother's and infant's use of nevirapine.
To account for unmeasured differences between hospitals and clinics, we included study site in both models.

Results
Table 2 provides simulation results for the six combinations of treatment effect and visit process that we examined. These combinations illustrate the impact of treatment effect on estimator performance for a given visit process as well as the impact of visit process (i.e., varying levels of missingness) on estimator performance for a given treatment effect. Because we allowed for imperfect sensitivity and early breastfeeding transmission, we would not expect to see zero bias in the estimates from our simulations. Relative bias (not shown in table) of estimators of treatment effect on perinatal and intrapartum transmission ranged from 0.007 (conditional CM model) to 0.711 (cumulative logistic model), corresponding to the scenario where treatment was assumed to increase the risk of in utero transmission but decrease the risk of intrapartum and overall perinatal transmission (TE4) and the visit process resulted in the most missing data (VP3).

Cumulative regression models
The proposed cumulative regression model (CM-CUM) performed comparably to, or better than, the logistic regression model (L-CUM) across all performance measures for all scenarios except where treatment was assumed to reduce the odds of in utero and intrapartum transmission by roughly equal amounts (TE1) and the visit process resulted in the least amount of missing data (VP1). For this scenario, the birth estimate was less biased in the logistic model. Across all scenarios, MSE was consistently lower (albeit only slightly in most cases) for the CM model than for its logistic counterpart. The CPs for the competing cumulative models were similar while power at 4 to 8 weeks was higher for the CM model for all scenarios where power was assessed. The higher power observed at 4 to 8 weeks compared to birth is not surprising given the smaller probability of infection at birth.

Conditional regression models
On the whole, the proposed conditional regression model (CM-COND) performed comparably to its logistic counterpart (L-COND), with the CM model having less bias at birth for all scenarios where a positive effect of treatment on intrapartum and overall perinatal transmission was assumed (TE1, TE3, and TE4). MSE was consistently lower, although only slightly, for the CM model. While power tended to be low for the conditional models, power at birth and at 4 to 8 weeks was slightly higher for the CM model for the TE4 scenarios.

Table 3. Figure 1 provides the complete testing profile for the 1,758 infants with complete covariate data, according to treatment group.
We used the proposed regression methods to analyze two outcomes: infection, and infection or death. We estimated the odds of perinatal transmission using the proposed cumulative model and the odds of in utero and intrapartum transmission using the proposed conditional model. Results are provided in Table 4. We found that treatment was not significantly associated with a reduction in any of the modes of PMTCT. These findings are consistent with the intent-to-treat analysis, as is the trend in the estimates suggesting that treatment decreases the odds of in utero transmission while increasing the odds of perinatal transmission [11]. When we defined first positive test or death at or before a given visit as the endpoint, we observed the same trend, although it was slightly weaker.

Discussion
Many statistical techniques are available for estimating PMTCT of HIV while adjusting for covariates. Among the more commonly used are logistic regression models and Cox proportional hazards models. While these methods are relatively straightforward to implement, they do not easily accommodate missed or unscheduled visits while allowing for a time-varying treatment effect. Cox models can be modified to allow the effect of treatment to depend upon time but do not fully solve the problem of how to handle missed or unscheduled visits. Interval censored models, which use an infant's time to last negative test and time to first positive test to form an interval around his or her (unknown) time of infection, may better accommodate the missing data, but software is not generally available for regression with interval censored data unless we are willing to make parametric assumptions about the distribution of the event times.
Recently, Bang and Spiegelman [12] proposed a likelihood approach for a dichotomous outcome to estimate mother to child transmission when infection status is missing for some infants due to fetal loss. However, this approach does not address all three of our endpoints of interest or missing data due to incomplete follow-up. Balasubramanian and Lagakos [13] provide methods for estimating the continuous distribution of the timing of in utero and peripartum transmission that account for the imperfect sensitivity of the HIV assay. The authors developed the approach for settings in which there is no risk for infection following birth and, therefore, do not address the potential impact of breastfeeding.
We propose a coarsened multinomial approach for estimating PMTCT that accommodates missing test result data, regression on the three outcomes of interest, and time-varying treatment effects. Through simulation, we investigated the performance of the estimators obtained from the more commonly used logistic regression approaches and compared them to the proposed estimators, including imperfect sensitivity of the assay and contamination of the endpoint due to early breastfeeding transmission. We found that both the proposed cumulative model and the proposed conditional model performed well when compared to their logistic counterparts. Performance of the proposed cumulative model was particularly strong under scenarios where treatment was assumed to increase the risk of in utero transmission but decrease the risk of intrapartum and overall perinatal transmission and under scenarios designed to represent interim analyses. Power for the proposed models was consistently higher at 4 to 8 weeks, which is to be expected given that the logistic models used only data for infants whose endpoints were non-missing or could be imputed based on subsequent negative tests.
The coarsened multinomial regression approach is not without limitations. Both the proposed cumulative and conditional models impose non-linear constraints on the coefficients, which can complicate interpretation of the estimates if maximization of the likelihood occurs on the boundary of the parameter space. In the case of the conditional model, however, only a single constraint is imposed, which is no more than would be imposed for a general multinomial model [14, p. 21]. In numerous simulations (beyond those presented here), we saw no evidence of bias due to maximization on the boundary. Our approach relies on the assumption that missingness is non-informative and, thus, may be more appropriate for some endpoints (infection/death) than for others (infection). While the model can be used in a breastfeeding population, it does not allow us to separate intrapartum transmission from early transmission due to breastfeeding.
While we have attempted to assess the impact of misclassification in our simulations, our approach uses probabilities of first testing positive to estimate transmission probabilities and, in doing so, does not formally account for misclassification due to the imperfect sensitivity of testing. A possible extension of this approach would involve introduction of a latent variable representing an infant's true infection status. One could link an infant's probability of infection at a given time point to his or her coarsened test result via his or her complete (unobserved) test result and, in doing so, incorporate information about the sensitivity of testing in the manner of Magder and Hughes [15]. Given the multivariate nature of the outcome and missingness in the test result, such an extension would introduce considerable complexity.

Conclusion
Here we have studied the problem of estimating the effect of treatment on PMTCT of HIV when outcome data are incomplete. We describe methods that give consistent and asymptotically normal estimators using maximum likelihood theory. Through simulation, we have shown that the proposed models outperform standard logistic models in terms of bias, mean squared error, coverage probability, and power under a range of treatment effect and visit process scenarios designed to reflect a PMTCT setting. Given their strong performance, coarsened multinomial regression models are to be preferred to standard logistic models for estimation of perinatal mother to child transmission of HIV, particularly when assays are missing or occur off-schedule for some infants.