K-means for shared frailty models

Background The purpose of this research was to examine how the k-means algorithm can be applied to survival analysis with single events per subject in order to define groups, which can then be modeled in a shared frailty model to capture unmeasured confounding not already explained by the covariates in the model. Methods For this purpose we developed our own k-means survival grouping algorithm. We compared a shared frailty model using a regular grouping variable with a shared frailty model using a k-means grouping variable, both in simulations and in the analysis of a real dataset. Results In both the simulations and the real data, our k-means clustering performed no differently from typical frailty clustering, even under varied case rates and censoring. Our k-means algorithm therefore appears to be a trustworthy mechanism for creating groups from data for inclusion in a frailty term of a survival model when no grouping variable exists, or for comparison against an existing grouping variable available in the data.


Introduction
The k-means algorithm was designed to find natural groupings among bivariate data, essentially creating order from disorder. The method works by computing distances to initial group means and iteratively moving those means around until the distances are minimized. The idea goes back to Hugo Steinhaus [1], a Polish mathematician who first proposed it in a 1956 paper. MacQueen described a k-means clustering algorithm by 1967 that is now also used extensively [2]; it sets initial means and finds the centroid of each partition by minimizing the sums of squares to the cluster centers, although how to set the initial means was never specified. Later, borne out of signal processing in engineering, the first standard algorithm for partitioning n observations into k clusters was proposed by Stuart Lloyd of Bell Labs [3].
Lloyd's work is also known as the Voronoi iteration. His iteration focused on finding evenly spread sets of points in Euclidean partitions: it repeatedly computes the centroid of each partition and reassigns each point to its closest centroid, repeating these operations until the assignments stabilize. This idea helped form the basis of the k-means algorithm.
The k-means algorithm works to minimize the squared Euclidean distances between observations and their cluster centers (the within-cluster sum of squares). Different variations of this idea have come to fruition over time, and several of the proposed algorithms are still used; even the Expectation-Maximization algorithm has been adapted for this purpose. Of course, different methods of minimizing the distances may produce different results.
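The assign-update loop just described can be sketched in a few lines. The following is an illustrative Python version for bivariate data (not code from this paper, whose analyses were done in R), with the initial centroids drawn at random from the data:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's iteration: assign each point to its nearest centroid,
    recompute each centroid as the mean of its cluster, and repeat
    until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k existing data points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            d2 = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[d2.index(min(d2))].append((x, y))
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # stable partition: done
            break
        centroids = new
    return centroids, clusters
```

Because the within-cluster sum of squares can only decrease at each step, the iteration always terminates, though possibly at a local rather than global minimum.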
In survival analysis, frailty models allow incorporating unexplained heterogeneity at the individual level and at the grouping level [4,5]. At the grouping level specifically, there may be unmeasured heterogeneity or confounding between groups or clusters of individuals; the models for this are called shared frailty models. Sometimes a grouping of individuals is available in the data at hand, and that natural grouping can be used in the shared frailty model. Often, however, no such term is available to cluster the individuals, yet some kind of grouping should be imposed on them in order to model the frailty between them. This motivated us to use the k-means algorithm to create a natural grouping between individuals, which can then be used in a shared frailty model to allow modeling the heterogeneity between them.

K-means algorithm for survival
In general, the k-means algorithm typically starts by defining a priori the number of clusters to partition. In our k-means algorithm for survival, we allowed this decision to be anywhere between 3 and 5 groups or clusters. Our algorithm then utilized the Euclidean distances from a given pair, x and y, which are randomly chosen existing data points that become the initial centroids from which the algorithm starts. A typical Euclidean distance formula for the k-means algorithm has been [1-3]

d(x, y) = √[(x₁ − y₁)² + (x₂ − y₂)²],

and we modified this formula for survival data, where we calculate the Euclidean distance between x, a continuous main predictor, and y, a survival time t_i, at each ith observation relative to a centroid (μ_x, μ_t):

d_i = √[(x_i − μ_x)² + (t_i − μ_t)²].

We then calculated the distance from each data pair of x and survival time to each centroid. Next, we sorted the distances and reassigned data points to the initial centroids to which they were closest. We then calculated the mean centroid of each group, which became the new centroid for the main continuous predictor and survival time, μ_x and μ_t respectively. We then repeated the step of calculating the Euclidean distance between each data-point pair and the current centroids, which starts the loop.
We would then calculate this distance and find the (x, t) pair corresponding to the minimum distance. Those points became the new centroids, and the step was repeated until the distance between the current centroids and the newly selected centroids became zero. If this was not achieved within 10 iterations, the loop stopped at 10 iterations and the last set of centroids was used. This final set of centroids defines the grouping variable to be used in a shared frailty model, which we then refer to as a k-means frailty model.
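Putting the steps above together, a minimal sketch of the modified procedure might look like the following. This is illustrative Python, not the authors' R implementation, and the random choice of initial centroids from the data is an assumption:

```python
import math
import random

def survival_kmeans(x, t, k, max_iter=10, seed=0):
    """K-means on (predictor, survival time) pairs: start from k
    randomly chosen data points as centroids, reassign each (x_i, t_i)
    to its nearest centroid, recompute each centroid as the cluster
    means (mu_x, mu_t), and stop when the centroids no longer move
    or after 10 iterations."""
    rng = random.Random(seed)
    pairs = list(zip(x, t))
    centroids = rng.sample(pairs, k)
    labels = [0] * len(pairs)
    for _ in range(max_iter):
        # Euclidean distance from each (x_i, t_i) pair to each centroid.
        labels = [min(range(k),
                      key=lambda c: math.hypot(xi - centroids[c][0],
                                               ti - centroids[c][1]))
                  for xi, ti in pairs]
        new = []
        for c in range(k):
            members = [p for p, lab in zip(pairs, labels) if lab == c]
            # Mean centroid of the cluster; keep the old one if empty.
            new.append((sum(m[0] for m in members) / len(members),
                        sum(m[1] for m in members) / len(members))
                       if members else centroids[c])
        if new == centroids:  # zero movement between iterations: converged
            break
        centroids = new
    return labels, centroids
```

The returned labels are the cluster memberships that would serve as the grouping variable in the frailty term.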

Cox shared frailty model with smoothing
Survival models, in general, are used to model time to some event. In this application we focus, in particular, on the Cox proportional hazards (PH) regression model with single events per subject. We incorporated a random effect term (frailty) into a shared frailty model, where frailty was modeled between groups. The frailty term accounts for the unexplained heterogeneity between groups; this is accomplished by specifying a frailty distribution, which is typically non-negative, such as a gamma or log-Gaussian distribution. We chose a log-normal frailty distribution for these applications. Using the model we described in a previous manuscript [6], we again incorporated a dichotomous treatment effect as a covariate and again included a continuous prognostic variable, with a smoothing function on it to handle non-linearity as we have previously shown. The Cox PH regression model is then [6]

h_ij(t) = h_0(t) exp(β_T x_ij + s(y_ij) + w_i),

where h_0(t) is the baseline hazard rate, i = 1,…,c indexes groups and j = 1,…,p_i indexes persons per group, x_ij is the dichotomous treatment variable, y_ij is the prognostic variable (age in our analyses), s(y_ij) is the smoothed function of y_ij, and w_i is the frailty of the ith group. As before, we included a log-normal frailty density for the grouping effect [6], similar to Duchateau et al. [7]. The ith cluster was generated as the grouping variable either via the regular method, with a variable already available in the dataset to cluster in a shared frailty model, or through our modified k-means clustering procedure described above. These models were estimated through a penalized regression, where the frailty terms are treated like additional covariates but penalized by a penalty term added to the log-likelihood, following Therneau et al. [8].
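Concretely, a subject's hazard under this model is the baseline hazard multiplied by the exponentiated sum of the treatment effect, the smoothed prognostic effect, and the shared frailty of the subject's group. A small illustrative Python computation (all parameter values below are hypothetical, chosen only to show the arithmetic):

```python
import math

def hazard(t, h0, beta_trt, x, s_y, w):
    """h_ij(t) = h_0(t) * exp(beta_T * x_ij + s(y_ij) + w_i): the
    baseline hazard scaled by the treatment effect, the smoothed
    prognostic effect, and the shared frailty w_i of the subject's
    group."""
    return h0(t) * math.exp(beta_trt * x + s_y + w)

# Hypothetical values: constant baseline hazard 0.05, treatment
# coefficient beta_T = 1 for a treated subject (x = 1), smoothed-age
# contribution s(y) = 0.3, and group frailty w = 0.2.
h = hazard(5.0, lambda t: 0.05, 1.0, 1, 0.3, 0.2)
```

A group with w_i > 0 is frailer than average (higher hazard), and one with w_i < 0 is more robust, which is what the frailty variance in the fitted models summarizes.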
From our prior research, we learned that there are many options for smoothing; we previously looked extensively at restricted cubic splines, penalized splines, and even fractional polynomials in much detail. However, we again opted to use a natural spline (NS) function as the smoothing function, s(y_ij), because it is easy to implement and did well in our simulations [9,10]. We have previously described and implemented it with the ns function in R. Natural splines are essentially restricted cubic splines, and the ns function uses B-splines in the basis expansion of s(y_ij) [6,9,10], with B_h(y_ij) denoting the B-spline basis functions; these are described in further detail in [9]. We again employed the ns function in R with df = 4, which happens to be the default degrees of freedom (df) in the software. We then compared the shared frailty model using groupings from our modified k-means algorithm to groupings generated in simulations, and we also compared these separate models in a real dataset application with groups defined by a pre-existing grouping variable already available in the dataset. In the next section, we describe our simulations.
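To illustrate what makes a natural (restricted cubic) spline "natural": the basis is constrained so the fitted curve is linear beyond the boundary knots. The following illustrative Python function builds the truncated-power parameterization of such a basis; note that R's ns uses an equivalent but different B-spline parameterization, so individual basis functions differ even though they span the same space:

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis (truncated-power form) at x for
    ordered knots t_1 < ... < t_k: the linear term x plus k-2 cubic
    terms, each built so the curve is linear beyond the boundary
    knots."""
    pos = lambda u: max(u, 0.0) ** 3  # truncated cubic (u)_+^3
    tk, tk1 = knots[-1], knots[-2]    # last two knots
    basis = [x]                       # linear term
    for tj in knots[:-2]:
        # The two correction terms cancel the cubic and quadratic
        # parts of (x - tj)_+^3 beyond the last knot.
        term = (pos(x - tj)
                - pos(x - tk1) * (tk - tj) / (tk - tk1)
                + pos(x - tk) * (tk1 - tj) / (tk - tk1))
        basis.append(term)
    return basis
```

With 4 knots this yields 3 basis columns (plus an intercept absorbed by the model), matching the df = 4 flexibility of a cubic fit in the interior while behaving linearly in the tails, where data are sparse.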

Simulation framework
In this section, we utilized the simulation framework described in our previous manuscripts [6,9,10], where we generated the survival data using methods similar to those of Bender et al. [11], described in more detail in [9,12], but adjusted to incorporate a grouping effect in the model. The true generating model from the Cox PH model is again [6]

h_ij(t) = h_0(t) exp(β_T x_ij + s(y_ij) + w_i).

As in our previous manuscripts, we allowed for the following parameters [6,9,10]. The subscripts i and j represent the group and person, respectively [6]. We again let the coefficient β_T for the dichotomous treatment effect x_ij take two treatment scenarios, β_T = 1 and β_T = −1. We again set s(y_ij) = log(y_ij), because this was the way to handle any non-linearity in the prognostic factor y_ij. We again sampled y_ij randomly with replacement, using an age range of 30 to 66 for the subjects. The grouping effect, w_i, was again generated from a N(0, σ²) distribution with σ² = 0.25 [2].
Again following Bender et al. [11], similar to before [6,9,10], we allowed for a baseline Weibull hazard [11,13], where h_0(t) = θν t^(θ−1). The survival times are found from this generating distribution by solving the equation for the survival time, using the relationship between the hazard, survival, and cumulative distribution functions; the way this is obtained has been well described in our previous manuscripts [6,9,10].
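Concretely, the Weibull baseline hazard h_0(t) = θν t^(θ−1) has cumulative hazard ν t^θ, so the survival function with linear predictor η is S(t) = exp(−ν t^θ e^η), and inverting S(T) = U for U ~ Uniform(0,1) gives T = (−log U / (ν e^η))^(1/θ). A short illustrative Python draw (the parameter values are hypothetical, not those of our simulations):

```python
import math
import random

def weibull_survival_time(u, theta, nu, eta):
    """Invert S(t) = exp(-nu * t**theta * exp(eta)), the survival
    function implied by the Weibull baseline hazard
    h0(t) = theta * nu * t**(theta - 1), at S(T) = u."""
    return (-math.log(u) / (nu * math.exp(eta))) ** (1.0 / theta)

# Hypothetical parameters: shape theta = 1.5, scale nu = 0.01, and a
# fixed linear predictor eta = 0.5 standing in for beta_T*x + s(y) + w.
rng = random.Random(42)
times = [weibull_survival_time(rng.random(), 1.5, 0.01, 0.5)
         for _ in range(1000)]
```

Because U and S(T) have the same Uniform(0,1) distribution, the drawn times follow the intended Cox-Weibull model exactly.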
Similar to our previous simulation setups [6,9,10], we again included a competing risk in the simulations. As described previously, the competing-risk time, t_cr, was generated from an exponential distribution. The observed survival time was then the minimum of t_0, t_cr, and a pre-specified end-of-study time, τ, which we set to 20 years as previously described [6,9,10]. To define right censoring, an observation was censored if t_0 was larger than the minimum of t_cr and τ. The final simulated datasets contained the treatment variable, the prognostic factor (age), the survival times, the event indicators, and the group numbers. The regular shared frailty model was then compared to a k-means shared frailty model in simulations by varying several parameters: number of groups (3, 4, or 5, with an additional 6, 7, or 8), rate of censoring (10% or 70%), case rate (0.0125 or 0.1), and coefficient of treatment (−1 or 1). These variations led to 24 possible scenarios plus an additional 6 (25-30), as shown in Table 1. In the comparisons, we used Akaike's information criterion (AIC) to judge model goodness-of-fit, along with the root mean square error (rMSE) [6,9,10], calculated between the smoothed predictions of the prognostic variable and its observed values for a given model in order to assess the bias between predicted and observed.
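The censoring rule above can be sketched in a few lines; this illustrative Python helper (not the authors' R code) returns the observed time and event indicator for one subject:

```python
def observe(t0, t_cr, tau=20.0):
    """Apply the censoring rule: the observed time is the minimum of
    the event time t0, the competing-risk time t_cr, and the
    end-of-study time tau; the event indicator is 1 only when t0
    occurs before both t_cr and tau (otherwise the observation is
    right-censored)."""
    t_obs = min(t0, t_cr, tau)
    event = 1 if t0 <= min(t_cr, tau) else 0
    return t_obs, event
```

For example, a subject whose competing risk occurs first, or who is still event-free at τ = 20 years, contributes a censored observation at that earlier time.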
All of the above programming was done in the R language [14]. We developed our modified k-means algorithm for survival in R and conducted all other analyses in R as well.

Simulation results
We have shown coefficients, standard errors, and p-values from the regular shared frailty model (Table 2) and from the k-means frailty model (Table 3). We have presented the coefficients and p-value for treatment (trt), the main predictor. Since a natural spline was fit on age in the model, to summarize this fit, instead of presenting the coefficients from each basis function, we have presented the average p-values from amongst the 4 basis functions.
We can see that the trt coefficients, standard errors, and p-values do not differ much between Tables 2 and 3, nor do the average basis-coefficient p-values. However, the p-values for the frailty variances (Table 4) do differ between the models: not initially, but with group sizes of 5 or greater the frailty variances start showing more statistical significance, and for the k-means frailty model the p-values became even smaller than for the regular shared frailty model. Somehow there was more heterogeneity in the k-means frailty models, which is interesting since k-means works to minimize error. In Table 4 we also lay out side-by-side comparisons of the AIC and MSE for each scenario between the regular shared frailty model and the k-means shared frailty model. The goodness-of-fit and bias appear similar between the two models for a given scenario, for the most part. Though the groupings used in the frailty term of the separate Cox models were derived from different processes, k-means versus simulated grouping creation, the results show many similarities between the two. A higher case rate with more censoring seems to have led to better model fit according to the MSE, which improved with increased group size, while the AIC reflected better model fit with lower censoring and a lower case rate. In general, however, the fits of the regular frailty model and the k-means frailty model relative to the true curve for each scenario did not differ much, as seen in Figs. 1, 2, 3 and 4. This reflects what is seen in Table 4 in the MSE and AIC values for both models per scenario.

Real dataset example
We used a lung cancer dataset, which is freely available among the datasets of the survival library in the R software [14]. This dataset contains survival data on patients with advanced lung cancer from the North Central Cancer Treatment Group [15].
Performance scores rated how well a patient could perform usual daily activities. We fit a shared frailty model using the grouping variable, institution, and a k-means frailty model with grouping provided by our modified k-means algorithm. We ran both models for k = 3, 4, 5, 6, 7, or 8 groups, to match the groupings used in the simulations. We can see in Tables 5 and 6 no differences in treatment coefficients between groups or between models, and only very minor differences in basis coefficients between groups and models. Various estimates computed for these models and groups further reflect minor to no differences between the methods, similar to the simulations (Table 6); however, there did appear to be more heterogeneity in the regular shared model than in the k-means shared model, as reflected in the frailty variance. This was the opposite of what we had seen in the simulations, but it seems more consistent with how k-means works to minimize variability when creating groupings. In addition, the AIC reflected a somewhat better fit for the regular shared frailty models, while the MSE reflected a closer fit between observed and predicted for the k-means model than for the regular shared frailty model. The graphs of the log(HR)s of the prognostic factor, age, in Figs. 5 and 6 also reflect the close fits between the observed age and the predicted HRs from the two different frailty models. All curves overlapped for each cluster size, 3-8.

Discussion
Through our research, we attempted to bridge a gap between unsupervised learning and statistical modeling of time-to-event data by combining the k-means algorithm with survival models, namely shared frailty models. We first created our own version of the k-means algorithm by adapting it to use survival time and one main continuous predictor. We then compared a regular shared frailty model with a regular grouping variable against a shared frailty model with a grouping variable created by our modified k-means algorithm for survival, first through simulations and then on the analysis of a real dataset. We found that our modified k-means clustering appeared no different from typical frailty clustering, even under varied case rates and censoring, and perhaps created groups with roughly the same or less heterogeneity between groups. Therefore, our modified k-means algorithm could be employed as a mechanism for creating groups from data for inclusion in a frailty term of a survival model when no grouping variable is available, which is the case in many survival datasets, or for comparison against an existing grouping variable used in a frailty model. One limitation is knowing how many groups or clusters to create, but one can always run several iterations of our k-means algorithm and then decide using mean squared errors and information criteria, similar to our analyses.

Conclusions
We demonstrated that our modified k-means algorithm for survival data can be used to create groupings in data where no pre-existing grouping variable exists, and that this grouping term can then be used in a shared frailty model to capture unexplained heterogeneity not captured by the covariates. We compared this with a regular grouping term in simulated data as well as a real dataset and found no significant differences between our approach and the more conventional approach. Therefore, we recommend our approach when an investigator would like to fit a frailty model with survival data but does not have a clear grouping term available in the data to run such a model.