
Table 4 Commonly used data-driven population segmentation methods

From: A systematic review of the clinical application of data-driven population segmentation analysis

| Methods # | No. of studies | Advantages | Disadvantages | Notes |
| --- | --- | --- | --- | --- |
| Unsupervised classification | | | | |
| Latent class/profile/transition/growth analysis | 96 | (1) Can handle missing data [75]; (2) Goodness-of-fit measures are available to assess model fit and determine the appropriate number of segments (e.g., Akaike Information Criterion, Bayesian Information Criterion, standardized entropy) [57, 58, 59]; (3) No need to standardize variables [76] | Can be computationally intensive, especially with datasets that contain thousands of observations [76] | (1) Segmenting variables must be categorical for latent class analysis, continuous for latent profile analysis, and categorical at multiple time points for latent transition analysis [77]; (2) Users need to pre-specify the desired number of segments |
| K-means cluster analysis | 60 | (1) Can deal with very large datasets [45, 78]; (2) Able to handle both continuous and categorical properties [79, 80] | (1) Solutions may not be reproducible: a different set of specified seed points can yield a different solution [81]; (2) Sensitive to outliers [82, 83]; (3) Limited statistical assistance in determining the optimal number of clusters [76] | Users need to pre-specify the desired number of segments |
| Hierarchical analysis | 50 | (1) Stopping rules are readily available (e.g., Duda's pseudo T-squared statistic and Calinski's pseudo-F statistic) to determine ideal cluster solutions [70, 84, 85, 86]; (2) The dendrogram offers a simple and comprehensive visual presentation of segmentation solutions [87]; (3) Can handle variables of different kinds (e.g., continuous, binary, nominal) | (1) Difficult to handle large datasets (sample size is preferably under 300–400, not exceeding 1000) [88]; (2) Sensitive to outliers [82, 83] | |
| Supervised classification | | | | |
| Decision tree methods (CHAID/CART) | 10 | (1) Can handle outliers and missing data [89]; (2) Computationally fast [90] | Models are based on splits that depend on previous splits, so an error made in a higher split will propagate down [90] | Users need to pre-specify dependent (or target) variables |

Abbreviations: CHAID, Chi-square Automatic Interaction Detector; CART, Classification and Regression Tree

# Some studies applied multiple methods in tandem or in combination.
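
Two of the K-means entries in the table (the number of segments must be chosen up front, and the solution can depend on the seed points) can be illustrated with a minimal sketch. It is a plain Lloyd's-algorithm K-means written for illustration only, not the procedure used by any study in the review. The function names (`kmeans`, `wcss`), the synthetic two-feature data, and the choice of three segments are all assumptions introduced here.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, seed, max_iter=100):
    """Plain Lloyd's K-means. k (the number of segments) must be given up front."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # seed points: k observations drawn at random
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # New centroid is the coordinate-wise mean of the members
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Keep the old centroid if a cluster ends up empty
                updated.append(centroids[j])
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids, assign(points, centroids)


def wcss(points, centroids, labels):
    """Within-cluster sum of squares of a solution (lower means tighter clusters)."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


# Hypothetical data: 90 synthetic observations around three assumed centres.
data_rng = random.Random(42)
centres = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
points = [(data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0))
          for cx, cy in centres for _ in range(30)]

# Same data, same k, different seed points: the solutions may or may not agree.
for seed in range(5):
    cents, labs = kmeans(points, k=3, seed=seed)
    print(seed, round(wcss(points, cents, labs), 3))
```

Fixing the seed makes a single run repeatable. Comparing the within-cluster sum of squares across several seeds is one simple way to see whether a solution is stable.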