Fast and optimal algorithm for case-control matching using registry data: application on the antibiotics use of colorectal cancer patients

Background: In case-control studies, most algorithms allow the controls to be sampled several times, which is not always optimal. If many controls are available and adjustment for several covariates is necessary, matching without replacement might increase statistical efficiency. Comparing similar units in observational data is of the utmost importance, since confounding and selection bias are present. The aim was twofold: first, to create a method that accommodates the option that a control is not resampled, and second, to display several scenarios that identify changes in Odds Ratios (ORs) while increasing the balance of the matched sample.

Methods: The algorithm was derived iteratively, starting from the pre-processing steps that derive the data and ending with its application in a study investigating the risk of antibiotics on colorectal cancer in the INTEGO registry (Flanders, Belgium). Different scenarios were developed to investigate the fluctuation of ORs using combinations of exact and varying variables, with or without replacement of controls. To achieve balance in the population, we introduced the Comorbidity Index (CI) variable, the sum of chronic diseases, as a means to have comparable units for drawing valid associations.

Results: The algorithm is fast and optimal. We simulated data and demonstrated that the run-time of matching is minimal even with millions of patients. It is optimal because the closest control is always captured (using the appropriate ordering and some auxiliary variables), and when a case has only one control, we ensure that this control is matched to this case, thus maximizing the number of cases used in the analysis. In total, 72 different scenarios were displayed, indicating the fluctuation of ORs and revealing patterns, especially a drop in ORs when balancing the population.
Conclusions: We created an optimal and computationally efficient algorithm to derive a matched case-control sample with or without replacement of controls. The code and the functions are publicly available as an open-source R package. Finally, we emphasize the importance of displaying several scenarios and assessing the differences in ORs while using an index to balance the population in observational data.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12874-021-01256-3.


Setup
Load the ccoptimalmatch package functions
The ccoptimalmatch package functions have been provided to you (Additional file 1). Loading these functions into the R environment allows the package to be reproduced.
The main function needed for this tutorial is optimal_matching.

Datasets
We have two data-sets in this package, namely the "not_processed" and the "being_processed" data-set. These two data-sets are provided as additional files 3 and 4. The "not_processed" data-set, as the name reveals, is a raw data-set containing the cases, controls, Patient_Id and other relevant variables. After several pre-processing steps, we end up with the "being_processed" data-set, which is the one used by the algorithm. To see the first 6 rows of the "being_processed" data, enter:

# load("yourpath/being_processed.csv")
head(being_processed)

If you wish to investigate the data-set and its attributes, use the help function:

help("being_processed")
You can directly access a variable within this data frame as follows:

being_processed$case_control
For example, let us tabulate this variable and investigate the number of cases and controls:
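For instance, on a small toy stand-in for the data set (the real one ships with the package; the rows below are made up for illustration), tabulating "case_control" with table() gives the counts of cases and controls:

```r
# Toy stand-in for the "being_processed" data set; in the tutorial the real
# data frame comes from the package itself.
being_processed <- data.frame(
  Patient_Id   = paste0("patient_", 1:8),
  case_control = c("case", "control", "control", "case",
                   "control", "control", "control", "case")
)

# Tabulate cases versus controls:
table(being_processed$case_control)
#    case control
#       3       5
```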

Raw data
To see the first 6 rows of the "not_processed" data:

head(not_processed)

Let us tabulate the "case_control" variable and investigate the number of cases and controls, this time in the "not_processed" data-set:

table(not_processed$case_control)

The following steps are necessary to pre-process the "not_processed" data-set into a format that can be used by our algorithm.

Step 1: Exact Matching on several variables

We start by defining the subsets. To define the subsets, we filter the "cases", take the distinct combinations of the exact variables (Gender, JCG and Practice_Id), and create the new variable "subset". Finally, we select only the 4 relevant variables (Gender, JCG, Practice_Id, subset):

create_subset <- not_processed %>%
  filter(case_control == "case") %>%
  arrange(Practice_Id, Gender, JCG) %>%
  distinct(Gender, JCG, Practice_Id, .keep_all = TRUE) %>%
  mutate(subset = 1:n()) %>%
  select(Gender, JCG, Practice_Id, subset)

This creates n (= 383) subsets, where a subset is defined as a factorial combination of the exact variables. For example, subset 1 contains females that visited practice 1 in year 2010, subset 2 contains females that visited practice 1 in year 2011, subset 3 contains females that visited practice 1 in year 2012, and so on up to subset n, the last factorial combination of the exact variables.

We merge the data that contain the "subset" variable with the data that contain the cases only:

case_with_subset <- not_processed %>%
  filter(case_control == "case") %>%
  full_join(create_subset, by = c("Gender", "JCG", "Practice_Id"))

We merge the data that contain the "subset" variable with the data that contain the controls only:

control_with_subset <- not_processed %>%
  filter(case_control == "control") %>%
  right_join(create_subset, by = c("Gender", "JCG", "Practice_Id"))

Finally, we bind the cases and the controls, which now carry the new variable "subset":
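To make Step 1 concrete, here is a minimal base-R sketch of the subset logic on made-up data (the tutorial itself uses the dplyr pipeline; base R's !duplicated() plays the role of distinct(), and merge() with its default inner join plays the role of the join):

```r
# Toy data mirroring the tutorial's variable names; all values are invented.
toy <- data.frame(
  Patient_Id   = paste0("patient_", 1:6),
  case_control = c("case", "case", "control", "control", "control", "control"),
  Gender       = c("F", "F", "F", "F", "M", "F"),
  JCG          = c(2010, 2011, 2010, 2011, 2010, 2011),
  Practice_Id  = 1
)

cases <- toy[toy$case_control == "case", ]
# One row per distinct (Gender, JCG, Practice_Id) combination among the cases:
create_subset <- cases[!duplicated(cases[, c("Gender", "JCG", "Practice_Id")]), ]
create_subset$subset <- seq_len(nrow(create_subset))
create_subset <- create_subset[, c("Gender", "JCG", "Practice_Id", "subset")]

# Attach the subset label to every case and control sharing the combination;
# controls with no matching case (patient_5, the only male) drop out:
toy_with_subset <- merge(toy, create_subset,
                         by = c("Gender", "JCG", "Practice_Id"))
```

On this toy data two subsets are created, and one of the four controls is excluded because no case shares its combination of exact variables, mirroring the exclusion described below.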

not_processed <- rbind(case_with_subset, control_with_subset)
Let us tabulate the "case_control" variable again. As we observe, the number of controls has decreased to 441,670 and the number of unique controls to 175,018. The gain from exact matching is that by shifting the analysis from one big data-set to several small sub-sets, the computational burden decreases substantially. There were 224,909 - 175,018 = 49,891 controls that could not be matched to any of the cases and are thus excluded.
Step 2: Create artificial observations and select the range of variables

First, we split the data-set into cases and controls and create a variable "cluster_case" to depict the cases separately. The "cluster_case" variable will have as many levels as the total number of cases, i.e. 1,718 in our example. For that purpose, "cluster_case" will be empty in the controls data-set but hold the names of the cases in the cases data-set:

bdd_controls <- not_processed[not_processed$case_control == "control", ]
bdd_controls$cluster_case <- 0
bdd_cases <- not_processed[not_processed$case_control == "case", ]
bdd_cases$cluster_case <- paste("case", 1:nrow(bdd_cases), sep = "_")

Next, we bind the cases and the controls, which now carry the new variable "cluster_case", and create the variable age:

not_processed <- rbind(bdd_cases, bdd_controls)
not_processed$age <- not_processed$JCG - not_processed$Birth_Year

After creating the variable "cluster_case", we split the cases and controls into two different data-sets again:

bdd_cases <- not_processed[not_processed$case_control == "case", ]
bdd_control <- not_processed[not_processed$case_control == "control", ]

Next, we create an empty data frame and a unique list of the levels of "cluster_case":

bdd_temp <- data.frame()
list_p <- unique(bdd_cases$cluster_case)

Below is the loop that generates the pseudo-observations for the controls, explained in detail. We start by identifying the subset each case belongs to. Next, we check which controls are in the same subset and bind those controls to the case. For example, subset 1 has 2 cases and 2,217 controls. By creating pseudo-observations for the controls, subset 1 will have 2 cases and 4,434 controls. Finally, we select the range for age and follow-up. For demonstration purposes, we decided that an absolute age difference smaller than 2 is acceptable and that the follow-up time between cases and controls must match exactly.
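A base-R sketch of that loop on toy data follows (it is not the package's exact code; the variables subset, age, fup and cluster_case are assumed to exist from the previous steps, and all values are made up):

```r
# Toy inputs standing in for the real bdd_cases / bdd_control:
bdd_cases <- data.frame(
  Patient_Id = c("patient_1", "patient_2"), case_control = "case",
  cluster_case = c("case_1", "case_2"), subset = 1,
  age = c(60, 70), fup = c(5, 3)
)
bdd_control <- data.frame(
  Patient_Id = paste0("patient_", 3:6), case_control = "control",
  cluster_case = "0", subset = 1,
  age = c(61, 59, 71, 80), fup = c(5, 5, 3, 3)
)

bdd_temp <- data.frame()
for (j in unique(bdd_cases$cluster_case)) {
  case_j <- bdd_cases[bdd_cases$cluster_case == j, ]
  # Controls of the same subset become pseudo-observations for this case:
  pool <- bdd_control[bdd_control$subset == case_j$subset, ]
  pool$cluster_case <- j
  pool$age_diff <- abs(case_j$age - pool$age)
  pool$fup_diff <- abs(case_j$fup - pool$fup)
  # Calipers for this scenario: |age difference| < 2, identical follow-up
  pool <- pool[pool$age_diff < 2 & pool$fup_diff == 0, ]
  case_j$age_diff <- 0
  case_j$fup_diff <- 0
  bdd_temp <- rbind(bdd_temp, case_j, pool)
}
```

On this toy data, case_1 keeps 2 of the 4 controls and case_2 keeps 1, so each case ends up with a different pool, exactly the situation described next.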
Since the 2 cases in subset 1 differ in age and follow-up, each case ends up with a different number of controls available to pool from. The number of duplicated controls has decreased to 75,473 and the number of unique controls to 28,458. All the remaining controls now differ by less than 2 years of age from a case in the same subset and have the exact same follow-up.

Step 3: Create the frequency of controls
Next, we count how many times each control appears in the data-set, creating the variable "freq_of_controls":

bdd_temp <- bdd_temp %>% group_by(Patient_Id) %>% mutate(freq_of_controls = n())
Let us have a glimpse of the data by looking at the first 10 rows:
1. For the cases, case_1 has 25 available controls to pool from. Suppose that another case (e.g. case_2) has fewer controls available and the same control is a candidate for both cases with the same characteristics. This control would be assigned to case_2, which has fewer controls.
2. For the controls, look at control "patient_5766". His/her frequency is 2, indicating that he/she is available for 2 cases and appears in the data-set 2 times. This is important because the controls with the lowest frequency will be matched first, leaving the controls with the highest frequency available for the next cases.
3. Lastly, we observe that the ordering is not yet optimal, since the controls are not as close as they should be to the cases in terms of "age-difference" and "frequency of controls", which brings us to the next step.
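The frequency bookkeeping can be illustrated on made-up rows: a control that is a candidate for two cases appears twice in the data, so its per-patient row count is its frequency. Base R's ave() is an analogue of the dplyr group_by()/mutate(n()) call used above:

```r
# Made-up rows: patient_5766 is a candidate for two cases, so it appears
# twice and gets frequency 2; the other controls appear once.
bdd_temp <- data.frame(
  cluster_case = c("case_1", "case_1", "case_2", "case_2"),
  Patient_Id   = c("patient_10", "patient_5766", "patient_5766", "patient_11")
)
bdd_temp$freq_of_controls <- ave(seq_len(nrow(bdd_temp)),
                                 bdd_temp$Patient_Id, FUN = length)
bdd_temp$freq_of_controls
# 1 2 2 1
```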

Step 4: Order variables
Ordering the variables correctly is of the utmost importance. For simplicity, assume there are three variables, namely "age-difference", "follow-up difference" and "frequency of controls". The data-set should be ordered by "case", "control", "follow-up difference", "age-difference" and lastly "frequency of controls". The variable "follow-up difference" appears before "age-difference" because it carries more weight (importance) than "age-difference".

bdd_temp <- bdd_temp[order(bdd_temp$cluster_case, bdd_temp$case_control,
                           bdd_temp$fup_diff, bdd_temp$age_diff,
                           bdd_temp$freq_of_controls), ]
By checking the first 10 rows, we can see that the closest controls are ordered right after the case, indicating that they are optimal (they have the smallest age-difference). We also observe that the "frequency of controls" is ordered, which allows the control with the lowest frequency to be matched first:

Analysis of the data
We now have the data ready for the algorithm. The "optimal_matching" function generates an optimal match between cases and controls in an iterative and computationally efficient way. For demonstration purposes, we select 4 controls per case and perform the matching without replacement:

final_data <- optimal_matching(bdd_temp, n_con = 4, cluster_case, Patient_Id,
                               total_control_per_case, case_control,
                               with_replacement = FALSE)

Below we summarise the steps that explain how the algorithm works.
1. Start of round 1. Select one control per case per iteration. We select the first control, which is the closest and thus the most optimal.
2. Split between duplicated and unique controls, and assign each duplicated control to the case that has the fewest available controls to pool from.
3. Exclude cases that already have at least 1 control, and exclude controls that are matched to a case.
4. Repeat steps 1-3 until all cases have at least 1 control (where applicable).
5. End of round 1. Continue to round 2 and up to round n, where n is the number of controls the user wants to match. If 4 controls are needed, there are 4 rounds; if 10 controls are needed, there are 10 rounds.
We can see the first 20 rows with head(final_data, 20).
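The rounds above can be sketched in base R on toy data as follows. This is an illustration of the greedy logic only, not the package's optimal_matching() implementation; it assumes the candidate rows are already ordered so that a case's first row is its closest control:

```r
# Toy candidate rows: case_1 can use controls A and B, case_2 only B.
toy_bdd_temp <- data.frame(
  cluster_case = c("case_1", "case_1", "case_2"),
  Patient_Id   = c("patient_A", "patient_B", "patient_B"),
  stringsAsFactors = FALSE
)

match_rounds <- function(candidates, n_con) {
  matched <- candidates[0, ]                      # empty frame, same columns
  for (round in seq_len(n_con)) {
    if (nrow(candidates) == 0) break
    # Cases with the fewest candidates pick first, so a contested control
    # goes to the case that can least afford to lose it:
    n_avail <- sort(table(candidates$cluster_case))
    for (cc in names(n_avail)) {
      rows <- which(candidates$cluster_case == cc)
      if (length(rows) == 0) next                 # this case ran out of controls
      pick <- rows[1]                             # first row = closest control
      matched <- rbind(matched, candidates[pick, ])
      pid <- candidates$Patient_Id[pick]
      # Without replacement: drop every pseudo-observation of that control
      candidates <- candidates[candidates$Patient_Id != pid, ]
    }
  }
  matched
}

match_rounds(toy_bdd_temp, n_con = 1)
# case_2 receives patient_B (its only candidate); case_1 receives patient_A
```

Note how patient_B, a candidate for both cases, goes to case_2 because case_2 has fewer candidates, which is exactly the tie-breaking rule of step 2.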

Extensions and Summary
In the clinical case described in the previous sections, we used 3 exact variables (Gender, JCG, Practice_Id), plus age and follow-up. The algorithm is very flexible in accommodating even more continuous and exact variables; in fact, the more criteria are added, the lower the computational burden becomes. It is very useful to run different matching scenarios by adjusting the age range, the follow-up time and the Comorbidity Index.
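In this sense a "scenario" boils down to the calipers applied when pooling controls in Step 2. The helper below is hypothetical (not part of the package API) and merely shows how widening the age tolerance admits more controls:

```r
# Hypothetical helper: keep the controls whose age lies within age_tol years
# of the case's age (the "varying" part of a scenario).
pool_controls <- function(case_age, control_ages, age_tol) {
  control_ages[abs(control_ages - case_age) < age_tol]
}

controls <- c(58, 60, 63, 66)
length(pool_controls(60, controls, age_tol = 2))   # strict scenario -> 1
length(pool_controls(60, controls, age_tol = 5))   # relaxed scenario -> 3
```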
The user (epidemiologist, researcher) thus has an optimal, efficient and fast 1:n case-control matching algorithm, based on a multi-step (iterative) procedure, for traditional and nested case-control studies.