A systematic review of the clinical application of data-driven population segmentation analysis

Yan, Shi; Kwan, Yu Heng; Tan, Chuen Seng; Thumboo, Julian; Low, Lian Leng

doi:10.1186/s12874-018-0584-9

BMC Medical Research Methodology

Table 4 Commonly used data-driven population segmentation methods

Methods#	No. of studies	Advantages	Disadvantages	Notes
Unsupervised Classification
Latent class/profile/transition/growth analysis	96	1. Can handle missing data [75] 2. Goodness-of-fit measures are available to assess model fit and determine the appropriate number of segments (e.g. Akaike Information Criterion, Bayesian Information Criterion, standardized entropy) [57, 58, 59] 3. No need to standardize variables [76]	Can be computationally intensive, especially with datasets that contain thousands of observations [76]	1. Segmenting variables need to be categorical for latent class analysis, continuous for latent profile analysis, and categorical at multiple time points for latent transition analysis [77] 2. Users need to pre-specify the desired number of segments
K-means cluster analysis	60	1. Can deal with very large datasets [45, 78] 2. Able to handle both continuous and categorical variables [79, 80]	1. Solutions may not be reproducible (each set of specified seed points may yield a different solution) [81] 2. Sensitive to outliers [82, 83] 3. Limited statistical assistance in determining the optimal number of clusters [76]	Users need to pre-specify the desired number of segments
Hierarchical analysis	50	1. Stopping rules are readily available (e.g. Duda’s pseudo T square statistic and Calinski’s pseudo F statistic) to determine ideal cluster solutions [70, 84, 85, 86] 2. Dendrograms offer a simple and comprehensive visual presentation of segmentation solutions [87] 3. Can handle variables of different kinds (e.g. continuous, binary, nominal)	1. Difficult to handle large datasets (sample size is preferably under 300–400, not exceeding 1000) [88] 2. Sensitive to outliers [82, 83]
Supervised Classification
Decision Tree Methods (CHAID/CART)	10	1. Can handle outliers and missing data [89] 2. Computationally fast [90]	Models are based on splits that depend on previous splits; an error made in a higher split will propagate down [90]	Users need to pre-specify dependent (or target) variables

Abbreviations: CHAID Chi-square Automatic Interaction Detector, CART Classification and Regression Tree
# Some studies applied multiple methods in tandem or in combination
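
Two of the K-means limitations listed in the table, seed-dependent solutions and limited guidance on the number of clusters, can be illustrated with a minimal sketch. It is not taken from the review. It assumes plain Lloyd's algorithm on continuous variables, uses several restarts and keeps the run with the lowest within-cluster sum of squares, and then applies the Calinski pseudo F statistic (listed in the table as a stopping rule for hierarchical analysis) to compare candidate numbers of segments. The function names (`kmeans`, `best_of_restarts`, `pseudo_f`) and the toy data are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(pts):
    """Coordinate-wise mean of a non-empty list of tuples."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))


def kmeans(points, k, seed, max_iter=100):
    """Lloyd's algorithm; returns (centroids, labels, within-cluster SS)."""
    rng = random.Random(seed)          # the seed fixes the starting points
    centroids = rng.sample(points, k)  # seed points drawn from the data
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:       # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                # an empty cluster keeps its old centroid
                centroids[j] = mean_point(members)
    wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, wss


def best_of_restarts(points, k, seeds):
    """Run k-means once per seed and keep the lowest within-cluster SS."""
    return min((kmeans(points, k, s) for s in seeds), key=lambda run: run[2])


def pseudo_f(points, labels):
    """Calinski-Harabasz pseudo F: (BSS / (k - 1)) / (WSS / (n - k))."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    n, k = len(points), len(groups)
    if not 1 < k < n:
        raise ValueError("need 2 <= k <= n - 1 non-empty clusters")
    grand = mean_point(points)
    bss = sum(len(g) * sq_dist(mean_point(g), grand) for g in groups.values())
    wss = sum(sq_dist(p, mean_point(g)) for g in groups.values() for p in g)
    return (bss / (k - 1)) / (wss / (n - k))  # undefined if wss == 0


# Toy data (hypothetical): two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
for k in (2, 3):
    _, labels, wss = best_of_restarts(points, k, seeds=range(10))
    print(k, round(wss, 3), round(pseudo_f(points, labels), 1))
```

Fixing the seed makes a single run repeatable, and taking the best of several seeds reduces the dependence on any one set of starting points. A larger pseudo F suggests better-separated segments, but it is only one heuristic, and the outlier sensitivity noted in the table applies to both the centroids and the statistic.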

ISSN: 1471-2288