
Table 4 Commonly used data-driven population segmentation methods

From: A systematic review of the clinical application of data-driven population segmentation analysis

Columns: Methods#, No. of studies, Advantages, Disadvantages, Notes

Unsupervised Classification

Latent class/profile/transition/growth analysis
 No. of studies: 96
 Advantages:
  1. Can handle missing data [75]
  2. Goodness-of-fit measures are available to assess model fit and determine the appropriate number of segments (e.g. Akaike Information Criterion, Bayesian Information Criterion, standardized entropy) [57, 58, 59]
  3. No need to standardize variables [76]
 Disadvantages:
  Can be computationally intensive, especially with datasets that contain thousands of observations [76]
 Notes:
  1. Segmenting variables need to be categorical for latent class analysis, continuous for latent profile analysis, and categorical at multiple time points for latent transition analysis [77]
  2. Users need to pre-specify the desired number of segments

K-means cluster analysis
 No. of studies: 60
 Advantages:
  1. Can deal with very large datasets [45, 78]
  2. Able to handle both continuous and categorical variables [79, 80]
 Disadvantages:
  1. Solutions are not guaranteed to be reproducible (a different solution may result from each set of specified seed points) [81]
  2. Sensitive to outliers [82, 83]
  3. Limited statistical assistance in determining the optimal number of clusters [76]
 Notes:
  Users need to pre-specify the desired number of segments

Hierarchical analysis
 No. of studies: 50
 Advantages:
  1. Stopping rules are readily available (e.g. Duda's pseudo T-squared statistic and Calinski's pseudo F statistic) to determine ideal cluster solutions [70, 84, 85, 86]
  2. The dendrogram offers a simple and comprehensive visual presentation of segmentation solutions [87]
  3. Can handle variables of different kinds (e.g. continuous, binary, nominal)
 Disadvantages:
  1. Difficult to handle large datasets (sample size is preferably under 300–400, not exceeding 1000) [88]
  2. Sensitive to outliers [82, 83]

Supervised Classification

Decision Tree Methods (CHAID/CART)
 No. of studies: 10
 Advantages:
  1. Can handle outliers and missing data [89]
  2. Computationally fast [90]
 Disadvantages:
  Models are based on splits that depend on previous splits; an error made in a higher split will propagate down [90]
 Notes:
  Users need to pre-specify dependent (or target) variables

Abbreviations: CHAID, Chi-square Automatic Interaction Detector; CART, Classification and Regression Tree
# Some studies applied multiple methods in tandem or in combination
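
The reproducibility caveat listed for K-means in the table (a different solution for each set of specified seed points) can be illustrated with a minimal sketch. The `kmeans` helper, the four toy points, and the two seed choices below are all hypothetical and chosen only for illustration; they are not taken from any of the reviewed studies, and the helper is a plain Lloyd-style iteration rather than any particular package's implementation.

```python
from math import dist


def kmeans(points, seeds, max_iter=100):
    """Lloyd-style k-means started from explicit seed points (illustrative helper)."""
    centroids = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Within-cluster sum of squared distances for the final solution.
    sse = sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, sse


# Toy data (assumed): four corners of a 4 x 1 rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Seed set A: one seed on each short side of the rectangle.
labels_a, centroids_a, sse_a = kmeans(points, seeds=[(0.0, 0.0), (4.0, 0.0)])

# Seed set B: both seeds on the same short side.
labels_b, centroids_b, sse_b = kmeans(points, seeds=[(0.0, 0.0), (0.0, 1.0)])

print("Seeds A:", labels_a, "SSE =", sse_a)
print("Seeds B:", labels_b, "SSE =", sse_b)
```

On this toy input the two seed sets converge to different partitions with different within-cluster sums of squares, which is why analyses often rerun K-means from several seed sets and keep the solution with the lowest sum of squares.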