
Table 3 Example R-Functions and Their Links to the Data Quality Framework

From: Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R

| R-function name | Implementations within the function | Linked with the following indicators |
| --- | --- | --- |
| pro_applicability_matrix() | Checks the correspondence of study data with the metadata and the accessibility of files. Each study data variable is examined regarding its data type and cross-checked against the data type specified in the metadata. | Unexpected data elements; data type mismatch |
| com_unit_missingness() | Evaluates at the level of entire observational units whether all measurements are missing. | Missing measurements (Unit level) |
| com_segment_missingness() | Evaluates whether all associated measurements at the level of study segments (e.g. single examinations or instruments) are missing for an observational unit. A pattern plot is provided as a descriptor. | Missing measurements (Segment level) |
| com_item_missingness() | Examines, for each variable of the study data, the amount and type of missing data according to specified missing/jump codes, including a count of data fields without any data entry, such as NA in R. | Missing measurements (Item level); specific missingness; uncertain missingness status |
| con_limit_deviations() | Assesses limit deviations with regard to inadmissible and improbable values and counts deviations above/below the specified thresholds. Limits may comprise hard limits to identify inadmissible values, soft limits to identify improbable values, and detection limits, which refer to censoring based on the properties of the measurement devices used. | Inadmissible numerical values; inadmissible time-date values; uncertain numerical values; uncertain time-date values |
| con_inadmissible_categorical() | Compares single data values with the admissible categories, summarizes observed vs. expected data values, and counts the violations. | Inadmissible categorical values |
| con_contradictions() | Compares two data values of the same observational unit using one of 16 logical comparisons. Counts the number of contradictions. | Logical contradictions; empirical contradictions |
| acc_distributions() | Creates distributional plots (bar chart or histogram) for numerical measurements (float, integer). If a grouping variable is provided, stratified empirical cumulative distribution functions (ecdf) are plotted as well [20]. | Indicators within the unexpected distributions domain |
| acc_univariate_outlier() | Computes distributional characteristics of numerical measurements (e.g. mean, standard deviation, skewness) and applies four different rules to identify univariate outliers, e.g. Tukey, Hubert, and six sigma [44, 45, 46]. Counts the number of outliers and indicates their direction (low/high). | Univariate outliers |
| acc_multivariate_outlier() | Computes the Mahalanobis distance of at least two variables and counts the number of extreme measurements. To reduce computational costs, outliers are identified heuristically by applying simple univariate rules [44, 45, 46] to the Mahalanobis distance. | Multivariate outliers |
| acc_shape_or_scale() | Tests the observed distribution of measurements against a predefined distributional assumption (normal, gamma, uniform). Deviations from the expected distribution are visualized using the idea of rootograms [44, 47]. | Unexpected shape parameter; unexpected scale parameter |
| acc_end_digits() | Assesses end-digit preferences in manually collected data. The function assumes a uniform distribution of end digits and applies a rootogram-like visualization [44, 47]. | Unexpected shape |
| acc_margins() | Compares the marginal distributions of different classes (e.g. examiners, devices) using measurements adjusted for covariates (e.g. age, sex). Adjusted linear, logistic, or Poisson regression models are used to model the marginal means of continuous, binary, and count data, respectively [48]. | Unexpected location; unexpected proportion |
| acc_varcomp() | Computes the proportion of variance explained by different classes (e.g. examiners, devices) in relation to the overall variance of the measurement. Depending on the data, ANOVA or mixed effects models are applied [49, 50]. | Unexpected location |
| acc_loess() | Computes and displays, as a descriptor, LOESS-smoothed trends of measurements across different classes over time. The raw measurements can be adjusted for covariates such as age or sex, and the resulting residuals are smoothed over time using LOESS [42]. | Indicators within the unexpected distributions domain, foremost unexpected location; unexpected proportion |
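
To make three of the checks in the table more concrete, the following is a minimal Python sketch of the underlying ideas: item-level missingness with missing codes, hard and soft limit deviations, and the Tukey rule for univariate outliers. It is not the implementation of the R functions listed above and does not mirror their arguments or return values. All function names, the missing codes 99980 and 99983, the example blood pressure values, and the limits are hypothetical and chosen for illustration only.

```python
from statistics import quantiles


def item_missingness(values, missing_codes):
    """Classify each entry of one variable as empty, coded missing, or observed."""
    counts = {"no_entry": 0, "coded_missing": 0, "observed": 0}
    for v in values:
        if v is None:                # data field without any entry (like NA in R)
            counts["no_entry"] += 1
        elif v in missing_codes:     # entry matches a specified missing/jump code
            counts["coded_missing"] += 1
        else:
            counts["observed"] += 1
    return counts


def limit_deviations(values, hard, soft):
    """Count values outside hard (inadmissible) and soft (improbable) limits."""
    hard_lo, hard_hi = hard
    soft_lo, soft_hi = soft
    counts = {"below_hard": 0, "above_hard": 0, "below_soft": 0, "above_soft": 0}
    for v in values:
        if v < hard_lo:
            counts["below_hard"] += 1
        elif v > hard_hi:
            counts["above_hard"] += 1
        elif v < soft_lo:            # admissible, but improbably low
            counts["below_soft"] += 1
        elif v > soft_hi:            # admissible, but improbably high
            counts["above_soft"] += 1
    return counts


def tukey_outliers(values, k=1.5):
    """Count values beyond the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return {
        "low": sum(v < lower for v in values),
        "high": sum(v > upper for v in values),
    }


# Hypothetical systolic blood pressure readings (mmHg) with assumed missing codes.
MISSING_CODES = {99980, 99983}
raw = [118, 125, None, 99980, 131, 99983, 260, 45, 122, 185]

missing_summary = item_missingness(raw, MISSING_CODES)
observed = [v for v in raw if v is not None and v not in MISSING_CODES]
limit_summary = limit_deviations(observed, hard=(50, 250), soft=(90, 180))
outlier_summary = tukey_outliers(observed)

print(missing_summary)
print(limit_summary)
print(outlier_summary)
```

Missing codes are removed before the limit and outlier checks in this sketch, because a code such as 99980 would otherwise be counted as an extreme measurement. Each function returns counts split by direction (below/above, low/high), in line with the table's description of counting deviations and indicating the direction of outliers.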