Forecasting COVID-19 cases using time series modeling and association rule mining
BMC Medical Research Methodology volume 22, Article number: 281 (2022)
Abstract
Background
The aim of this study was to evaluate the most effective combination of autoregressive integrated moving average (ARIMA), a time series model, and association rule mining (ARM) techniques to identify meaningful prognostic factors and predict the number of cases for efficient COVID-19 crisis management.
Methods
The 3685 COVID-19 patients admitted to Thailand’s first university field hospital during the four waves of infection from March 2020 to August 2021 were analyzed using the autoregressive integrated moving average (ARIMA), its extension with exogenous variables (ARIMAX), and association rule mining (ARM).
Results
The ARIMA (2, 2, 2) model with an optimized parameter set predicted the number of COVID-19 cases admitted to the hospital with acceptable error scores (R^{2} = 0.5695, RMSE = 29.7605, MAE = 27.5102). Key features from ARM (symptoms, age, and underlying diseases) were selected to build an ARIMAX (1, 1, 1) model, which yielded better performance in predicting the number of admitted cases (R^{2} = 0.5695, RMSE = 27.7508, MAE = 23.4642). The association analysis revealed that hospital stays of more than 14 days were related to healthcare worker patients and patients presenting with underlying diseases. Worsening cases that required referral to the hospital ward were associated with patients admitted with symptoms, pregnancy, metabolic syndrome, and age greater than 65 years.
Conclusions
This study demonstrated that the ARIMAX model has the potential to predict the number of COVID-19 cases by incorporating the most strongly associated prognostic factors, identified by the ARM technique, into the ARIMA model. The model could be used for preparation and optimal management of hospital resources during pandemics.
Background
The outbreak of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), started in Wuhan, Hubei Province, China in December 2019 [1]. The COVID-19 pandemic has required governments around the world to implement new policies under pressure from vulnerable people and communities [2]. Since the first outbreak, SARS-CoV-2 has mutated into many variants, including the alpha, beta and delta variants, which have been associated with new waves of infection [3]. The catastrophic effect across the entire world had resulted in more than six million deaths by 2022 [4]. In addition, COVID-19 can cause patients’ conditions to deteriorate rapidly, and the number of patients requiring hospitalization has increased significantly, resulting in a high demand for hospital resources [1].
Data mining is an efficient analytical methodology for exploring very large data sets to extract meaningful information [5]. In the medical field, the large numbers of medical records (including demographic information, diagnoses, clinical notes, etc.) in healthcare information systems are ideal targets for data mining to improve the analysis and prognosis prediction of various diseases [6,7,8]. Examples include using an Artificial Neural Network (ANN) and a Support Vector Machine (SVM) algorithm to predict cardiovascular disease [9], and using data mining classification algorithms such as Decision Tree and Naive Bayes to identify liver disease [10] and to predict the recovery outcome of Middle East Respiratory Syndrome Coronavirus (MERS-CoV) [11]. With the unprecedented increase in COVID-19 cases worldwide, there is a need for effective prediction models to identify the associated prognostic factors and forecast the number of COVID-19 cases so that hospital resources can be organized optimally.
Time series analysis and association rule mining (ARM) models have been widely used to predict trends, structural breaks, cycles, and unobserved values, and have proven to be useful in the medical field [12,13,14]. The autoregressive integrated moving average (ARIMA), a time series analysis model, has shown promising accuracy for forecasting infectious diseases [15, 16]. ARIMA has been used to forecast the number of new COVID-19 cases, deaths, and recoveries based on daily reported data from different countries in order to assess future outbreaks [17,18,19,20]. ARM was originally presented by Agrawal et al. as an algorithm for marketing data analysis [21]. ARM has since been used to extract medical health information, and is currently being applied to develop classification and prediction models that identify and forecast the development and progression of a disease from the rules associated with it [22]. ARM was demonstrated to be an effective model for mining frequent symptom patterns in COVID-19 patients, which could assist clinicians in decision making [23]. Another study used ARM to analyze the patterns of different non-pharmaceutical interventions to manage the infection growth rate in the United States [24]. Even though many advanced data-driven time series methods are used to predict the future number of COVID-19 patients, a new and more accurate prediction model is important in a pandemic crisis, and the associated contributing factors should be considered to improve model performance. Therefore, combining ARM and ARIMA models, by selecting the most strongly associated prognostic rules and integrating them into ARIMA models, could increase the accuracy of predicting new cases. This would improve understanding of the current situation and the progression of COVID-19, and could be used by society, organizations, or governments to assess and manage the crisis during future outbreaks.
The aim of this study was to evaluate the most effective combination of ARM techniques and ARIMA models to identify prognostic factors and predict the number of COVID-19 patients. These models are expected to allow better preparation and organization of hospital resources in similar units, and more optimal use of medical personnel and equipment, enhancing healthcare decision-making for managing COVID-19 patients in this crisis situation.
Methods
Administration protocol and data collection
The study was conducted at Thailand’s first university-based field hospital. The field hospital was converted from a 14-story, serviced-apartment-style university dormitory building into a 494-bed facility for non-critical COVID-19 patients [25]. The field hospital was managed by the main university hospital and admitted patients referred from the project’s five university hospitals and from hospitals in the central area of Thailand. Funding came mainly from donations by university alumni, community groups and non-governmental organizations. Upon admission, a nurse records patient data in the COVID-19 screening section of the field hospital information system; the patient undergoes a chest X-ray and blood tests for complete blood count (CBC), liver function tests (LFTs), electrolytes, blood urea nitrogen (BUN), and creatinine (Cr). The doctor interprets the laboratory results and chest X-ray, and records the results in the admission note. Patients are only admitted to the field hospital if they meet all of the following criteria: 1) asymptomatic, mild or moderate symptoms; 2) normal activities of daily living; 3) no important organ dysfunction; 4) no psychiatric history; and 5) resting pulse oxygen saturation (SpO_{2}) > 95%. To avoid unnecessary contact between patients and medical personnel, the patient reports signs and symptoms, wants and needs via an internal field hospital application. Any consultation with the attending physician is done through a notification form. If the attending physician wishes to speak to the patient, the patient’s telephone number is obtained from the respective patient’s floor. All prescriptions must be made using a prescription form, which is then processed by the attending nurse and recorded in the progress note in the field hospital information system and in the university hospital electronic medical record system.
In this field hospital system, laboratory and radiographic examinations are performed on symptomatic COVID-19 patients with a history of taking Favipiravir and for severity assessment of symptomatic COVID-19 patients.
For Favipiravir-naive patients: 1) a follow-up chest X-ray may be considered in patients with worsening signs and symptoms (body temperature (BT) > 38.0 °C, cough, fatigue, SpO_{2} < 96%, or a decrease in SpO_{2} of more than 3% after a stress test); and 2) if the chest X-ray suggests pneumonia with respiratory signs and symptoms (as mentioned in 1), the patient is referred to the originating hospital for continued treatment with Favipiravir.
For patients previously treated with Favipiravir: 1) follow up with chest X-ray and LFTs; 2) if LFTs increase, consider consulting an infectious disease (ID) specialist to terminate or adjust medication use; and 3) if the chest X-ray suggests progression of the infiltration accompanied by respiratory signs and symptoms (cough, fatigue, SpO_{2} < 96% and an SpO_{2} drop of more than 3% after a stress test), consider referring the patient to the hospital of origin.
Asymptomatic patients who have been hospitalized for at least 14 days after a positive COVID-19 test are discharged home. Patients who received Favipiravir should fulfil all of the following criteria: 1) the patient’s signs and symptoms have improved without progression of infiltration on chest X-ray; 2) BT < 37.8 °C continuously for 24–48 hours; 3) respiratory rate (RR) < 20/min; and SpO_{2} > 96% at rest. If a patient’s condition deteriorates, the patient is quickly transferred to a designated higher-level hospital.
The criteria for transfer are 1) meeting the criterion of severe or critical disease, and 2) lung imaging showing a greater than 50% progression of lesions. Patients do not need real-time polymerase chain reaction (RT-PCR) or antigen/antibody detection for COVID-19 prior to discharge. One day before discharge, the attending nurse informs the attending physician of the number of potential discharges, so that the physician can prepare medical certificates and insurance documents according to the patient’s needs. Upon discharge, the attending physician updates the patient’s progress and discharge summary in the electronic medical record system of the university hospital.
A total of 3685 patient records were retrieved from the electronic hospital information systems of the referral hospitals and the field hospital information system. In this study, we included all patients confirmed with asymptomatic and mild-to-moderate COVID-19 conditions from March 2020 to August 2021 (four waves of COVID-19 in Thailand). Collected data included patient demographics, comorbidities, body mass index (BMI), job, place of exposure to coronavirus, symptoms before field hospital admission, sign of pneumonia in chest X-ray, field hospital length of stay, and the field hospital discharge destination. Table 1 shows the preliminary analysis of the dataset, including attributes, values, and frequency of each attribute-value pair.
Time-series analysis and association analysis
In this work, we present a study that combines time series analysis and association analysis to forecast admitted COVID-19 cases as well as to analyze their potential factors and characteristics. To estimate the number of new cases and to predict the prognosis for better understanding of the current situation and progression of COVID-19, we used the autoregressive integrated moving average (ARIMA) model and its subclasses (i.e., AR, MA, ARMA) [12, 17, 26], and association rule mining (ARM) [21, 24] as tools for investigation (Fig. 1).
The autoregressive (AR) model
In the AR model, the predicted value at the time period t is modeled by the observed values at the previous time periods t − 1, t − 2, …, t − p. The impact of the value at each previous time period on the value at the current time is determined by the coefficient for that particular period. Under this assumption, the model performs a regression on the past values of the time series and then calculates the present or future values in the series, which is commonly known as an autoregression (AR) model. It can be modeled as follows.
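Assuming the standard AR(p) form, which matches the symbols defined next, the model reads:

\[ y_{t} = \beta_{0} + \beta_{1} y_{t-1} + \beta_{2} y_{t-2} + \dots + \beta_{p} y_{t-p} + \varepsilon_{t} \]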
Here, y_{t} is the value at the current time t, and y_{t − 1}, y_{t − 2}, …, y_{t − p} are the observed values at the previous p time points with their corresponding coefficients β_{1}, β_{2}, …, β_{p}, respectively, β_{0} is the intercept, and ε_{t} is the residual error at the time t. Therefore, y_{t} − ε_{t} is the expected value at the current time t. In this work, the value y_{t} can be modeled as the number of inpatients, incoming patients, or outgoing patients at the time period t.
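As a minimal sketch of the idea, the simplest case AR(1) can be fitted by ordinary least squares on the pairs (y_{t − 1}, y_{t}). This is an illustration only and not necessarily the estimation procedure used in this study; the function names and the noise-free series are hypothetical.

```python
def fit_ar1(series):
    """Estimate beta_0 and beta_1 in y_t = beta_0 + beta_1 * y_{t-1} + e_t by OLS."""
    x = series[:-1]  # lagged values y_{t-1}
    y = series[1:]   # current values y_t
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


def forecast_next(series, beta0, beta1):
    """One-step-ahead forecast: the expected value y_t - e_t."""
    return beta0 + beta1 * series[-1]


# Hypothetical noise-free series generated from y_t = 2 + 0.5 * y_{t-1}
series = [10.0]
for _ in range(7):
    series.append(2.0 + 0.5 * series[-1])

beta0, beta1 = fit_ar1(series)
next_value = forecast_next(series, beta0, beta1)
```

Because the toy series contains no noise, the fit recovers the generating coefficients up to floating-point error; with real daily counts the residual ε_{t} would be nonzero.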
The moving-average (MA) model
Since the value at the time period t may be affected by unexpected external factors, i.e., noise, we can alleviate this impact by means of the moving average method. Analogous to AR, the predicted value at the time period t can be modeled by the previous q lagged forecast errors ε_{i} as follows.
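Assuming the standard MA(q) form, with the coefficient symbols defined next, the model reads:

\[ y_{t} = \phi_{0} + \varepsilon_{t} + \phi_{1} \varepsilon_{t-1} + \phi_{2} \varepsilon_{t-2} + \dots + \phi_{q} \varepsilon_{t-q} \]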
Here, y_{t} is the value at the current time t and the lagged errors ε_{t − 1}, ε_{t − 2}, …, ε_{t − q} are the residual errors of the q autoregressive models at time t − 1 to t − q with ϕ_{1}, ϕ_{2}, …, ϕ_{q} as their corresponding coefficients, ϕ_{0} is the intercept, and ε_{t} is the residual error at the time t. The residual error at each of these time points can be derived from the autoregressive (AR) model as follows.
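Rearranging the AR(p) form for the error term gives, for each lag j = 1, …, q (a sketch that assumes the same coefficients β_{0}, …, β_{p} apply at every time point):

\[ \varepsilon_{t-j} = y_{t-j} - \left( \beta_{0} + \beta_{1} y_{t-j-1} + \beta_{2} y_{t-j-2} + \dots + \beta_{p} y_{t-j-p} \right) \]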
Although the standard AR and MA models may use the autocorrelation function (ACF), which takes into account all of the points between two observations, it is also possible to apply the partial autocorrelation function (PACF), which measures the correlation between observations at a given lag after accounting for the values at the intervals in between.
The autoregressive moving average (ARMA) model
The autoregressive moving average (ARMA) model combines the AR and MA models. In ARMA, the impact of previous lags along with the residuals is considered for forecasting the future values of the time series as follows.
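Combining the AR and MA forms, and assuming that the two intercepts are merged into a single constant β_{0}, the ARMA(p,q) model can be written as:

\[ y_{t} = \beta_{0} + \beta_{1} y_{t-1} + \dots + \beta_{p} y_{t-p} + \varepsilon_{t} + \phi_{1} \varepsilon_{t-1} + \dots + \phi_{q} \varepsilon_{t-q} \]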
Here, β_{i} represents the coefficients of the AR model, ϕ_{i} represents the coefficients of the MA model, and ε_{t} is the residual error at the time t. We assume only one significant value from the AR model and one significant value from the MA model, so the ARMA model will be obtained from the combined values of these two models, denoted as the order of ARMA (1,1).
The autoregressive integrated moving average (ARIMA) model
As a generalization of AR, MA, and ARMA, the ARIMA model introduces differencing (integration) into the ARMA model to make the series stationary, and then forecasts future values from the previous lagged values and residual errors. Besides manipulating the time lag and alleviating noise by smoothing, it is also possible to decompose a series into trend, seasonal, and residual components by assuming an additive model. With this decomposition, the series can be transformed into a stationary time series. To achieve the transformation, the differencing method is applied. For example, we can subtract the value at t − 1 from the value at t. If the series is still not stationary after first-order differencing, second-order differencing can be applied. The ARIMA model thus extends the ARMA model with one more factor, known as integrated (i.e., differencing), which is the I in ARIMA. The ARIMA model, denoted by ARIMA (p,d,q), can be formulated as follows:
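Assuming the standard form, in which the ARMA(p,q) model is applied to the series y′_{t} obtained after d rounds of differencing, the model reads:

\[ y'_{t} = \beta_{0} + \beta_{1} y'_{t-1} + \dots + \beta_{p} y'_{t-p} + \varepsilon_{t} + \phi_{1} \varepsilon_{t-1} + \dots + \phi_{q} \varepsilon_{t-q} \]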
Here, p is the order of the autoregressive process, d (set to 1 in this case) is the degree of differencing (the number of times the series was differenced), and q is the order of the moving average component. In this model, the first-order difference (d = 1) between consecutive observations y′_{i} was computed and used, instead of the original observed value y_{i} as shown below.
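By the definition of a first-order difference between consecutive observations:

\[ y'_{i} = y_{i} - y_{i-1} \]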
Differencing removes the changes in the level of a time series, eliminating trend and seasonality and, consequently, stabilizing the mean of the time series.
In some situations, we may need to difference the series data a second time (d = 2) to obtain a stationary time series, which is referred to as second order differencing as follows:
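Applying the first-order difference twice gives:

\[ y''_{i} = y'_{i} - y'_{i-1} = y_{i} - 2 y_{i-1} + y_{i-2} \]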
Higher-order differencing can be carried out analogously.
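The differencing step can be illustrated with a short sketch. The helper name `difference` and the toy series are hypothetical; the series has a quadratic trend, so it only becomes constant (and hence trend-free) after second-order differencing.

```python
def difference(series, d=1):
    """Apply d rounds of first-order differencing: y'_i = y_i - y_{i-1}."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


# Hypothetical counts with a quadratic trend (illustrative only).
counts = [t * t for t in range(6)]  # 0, 1, 4, 9, 16, 25
first = difference(counts, d=1)     # still trending upward
second = difference(counts, d=2)    # constant, so the trend is removed
```

Each round of differencing shortens the series by one observation, which is why d is kept as small as possible in practice.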
The autoregressive integrated moving average with exogenous covariates (ARIMAX) model
When an ARIMA model includes other time series as input variables, the model is referred to as an Autoregressive Integrated Moving Average with Exogenous Covariates (ARIMAX) model. An ARIMAX model can be viewed as a multiple regression model that takes the impact of covariates on the forecasting into account, improving the comprehensiveness and accuracy of the prediction. The ARIMAX(p,d,q) extends the ARIMA(p,d,q) model by including the linear effect that one or more exogenous series has on the stationary response series y_{t}. This method is suitable for forecasting when data is stationary/nonstationary, and multivariate with any type of data pattern, i.e., level/trend/seasonality/cyclicity. The ARIMAX(p,d,q) model can be formulated as follows:
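Assuming the standard form, in which a linear term for each exogenous covariate is added to the ARIMA model on the differenced series y′_{t}, the model reads:

\[ y'_{t} = \beta_{0} + \beta_{1} y'_{t-1} + \dots + \beta_{p} y'_{t-p} + \varepsilon_{t} + \phi_{1} \varepsilon_{t-1} + \dots + \phi_{q} \varepsilon_{t-q} + \theta_{1} (X_{1})_{t} + \dots + \theta_{m} (X_{m})_{t} \]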
Here, d is set to 1, (X_{i})_{t} is the value at the time t of the i-th exogenous covariable (X_{i}), θ_{i} is the corresponding coefficient for the covariable X_{i}, and m is the number of exogenous covariables to be considered, while p, d, and q indicate the same parameters as in the ARIMA model.
Association rule mining
Besides the time-series analysis, association rule mining (ARM) can be used as a multivariate analysis to help us understand the correlation among factors [24]. Given a dataset containing a collection of records or transactions, each record comprises a set of categorical attributes. An association rule can be denoted by A → B, where A (the antecedent or LHS) and B (the consequent or RHS) are disjoint sets of attribute-value pairs (also called itemsets). The rule represents the hypothesis that when the variables in A occur in the dataset, the variables in B also occur. Association mining generates a large number of rules from a given dataset. In a dataset with m attributes n − 1 antecedents and one consequent, each with n values, each can generate a maximum of nm^{n − 1} − 1 rules. However, not all rules are significant. The goal of this approach is to find rules that have high practical significance. To eliminate spurious rules, we use three measures: support, confidence, and lift. In addition, we use the chi-squared test to measure the statistical significance of the association between the antecedent and the consequent. Given two disjoint sets of attribute-value pairs A and B, and an association rule A → B, the support of the rule is the number of records in which the attribute-value pairs of both A and B appear, relative to the total number of records (transactions or instances). This denotes the prevalence of the rule in the dataset. By definition, the support value is symmetric, that is Support (A → B) = Support (B → A), and it equals the ratio of the number of records containing both A and B to the total number of records in the dataset. The confidence of the rule A → B measures the conditional probability of B, given A. Thus, the confidence measure for a given rule is asymmetric, that is, in general Confidence (A → B) ≠ Confidence (B → A).
The lift measure is the ratio between the observed support and the support that would be expected if A and B were independent. Implicitly, lift > 1 means positive dependence (the greater the value, the stronger the dependence), lift < 1 specifies negative dependence, and lift = 1 indicates independence between A and B. Lift is also a symmetric measure between the itemsets A and B, that is Lift (A → B) = Lift (B → A).
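In their standard form, and writing N for the total number of records (a symbol assumed here), the three measures are:

\[ \mathrm{Support}(A \rightarrow B) = \frac{|A \cap B|}{N}, \qquad \mathrm{Confidence}(A \rightarrow B) = \frac{|A \cap B|}{|A|}, \qquad \mathrm{Lift}(A \rightarrow B) = \frac{N \, |A \cap B|}{|A| \, |B|} \]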
Here, ∣A∣ and ∣B∣ are the numbers of records that include A and B, respectively, while ∣A ⋂ B∣ is the number of records that contain both A and B. In this paper, the antecedent A can be any of the following contributing factors: patient demographics (male or female), age (< 24, 25–44, 45–64, and > 65), body mass index or BMI (< 25, 25–29, and > 29), underlying diseases (none, respiratory, hypertension, metabolic, dyslipidemia, diabetes mellitus, pregnant, or others), job (healthcare or non-healthcare patient), infection source (community infection, family infection, or hospital infection), symptoms before field hospital admission (asymptomatic, mild, or moderate), sign of pneumonia in chest X-ray (no lesion or pneumonia), length of stay in the field hospital (1–14 or > 14), and patient discharge (home discharge or referral to a general hospital). For the consequent B, we focus on (1) the length of stay (either 1–14 or > 14), (2) the patient discharge (either home discharge or hospital discharge), (3) the chest X-ray result, and (4) current incidence (wave 1, 2, 3 or 4). Since ARM assumes that all attribute values are discrete, we translate the numerical data used in the study into discrete labels, and split the continuous infection growth curve into four phases.
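The three measures can be computed directly from record counts, as in the following sketch. The records, the attribute-value labels, and the function name `rule_metrics` are hypothetical and do not come from the study dataset.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Each record and each itemset is a set of attribute-value labels.
    """
    n = len(records)
    count_a = sum(1 for r in records if antecedent <= r)
    count_b = sum(1 for r in records if consequent <= r)
    count_ab = sum(1 for r in records if (antecedent | consequent) <= r)
    support = count_ab / n
    confidence = count_ab / count_a if count_a else 0.0
    lift = confidence / (count_b / n) if count_b else 0.0
    return support, confidence, lift


# Hypothetical discretized patient records (illustrative only).
records = [
    {"age>65", "symptom=moderate", "discharge=refer"},
    {"age>65", "symptom=moderate", "discharge=refer"},
    {"age>65", "symptom=mild", "discharge=home"},
    {"age<24", "symptom=asymptomatic", "discharge=home"},
    {"age<24", "symptom=asymptomatic", "discharge=home"},
    {"age25-44", "symptom=mild", "discharge=home"},
    {"age25-44", "symptom=asymptomatic", "discharge=home"},
    {"age45-64", "symptom=moderate", "discharge=refer"},
]

support, confidence, lift = rule_metrics(
    records, {"age>65"}, {"discharge=refer"}
)
reverse = rule_metrics(records, {"discharge=refer"}, {"age>65"})
```

In this toy data the rule has support 2/8, confidence 2/3 and lift 16/9, so referral is more likely among the oldest group than in the data overall. Swapping antecedent and consequent leaves support and lift unchanged, which reflects their symmetry.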
Experiment settings
Data collection and parameter settings
The dataset includes 3685 records registered with the electronic hospital information systems of the field hospital from March 2020 to August 2021. Table 1 displays the characteristics of the dataset, including attributes, values, and the frequency of each attribute-value pair. Each of the nine attributes contains 2–8 possible values. Most attributes have imbalanced numbers in their values, except gender (Table 1). In our time series analysis, the target of prediction is the number of patients in the field hospital for each day during the observation period, that is March 2020 to August 2021. Based on our preliminary tests, we explored the three ARIMA parameters over p ∈ {1, 2, 3}, d ∈ {1, 2}, and q ∈ {1, 2, 3}. In addition, we applied association rule mining to find the most influential factors among the eleven factors, that is patient demographics, age, body mass index, underlying diseases, job, infection source, symptoms before field hospital admission, sign of pneumonia in chest X-ray, length of stay in the field hospital, patient discharge, and current incidence. For the ARIMAX model, we extend the ARIMA(p,d,q) model to include, as additional series, the factors that are most influential for predicting the number of patients in the hospital. These included series are known as exogenous series and are expected to influence the stationary response series that we are predicting.
Performance metrics and evaluation
Given a data set of n values, denoted by y_{1}, …, y_{n}, each associated with a predicted value f_{1}, …, f_{n}, the following three metrics can be formulated. The coefficient of determination (R^{2}) is the proportion of the variation in the dependent variable that is predictable from the independent variable(s) as follows:
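In its standard form, using the sums of squares defined next:

\[ R^{2} = 1 - \frac{SS_{r}}{SS_{t}}, \qquad SS_{r} = \sum_{i=1}^{n} (y_{i} - f_{i})^{2}, \qquad SS_{t} = \sum_{i=1}^{n} (y_{i} - \overline{y})^{2} \]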
Here, SS_{r} is the sum of squares of residuals, SS_{t} is the total sum of squares, proportional to the variance of the data, and \(\overline{y}\) is the mean of the observed data. Ranging from 0 to 1, R^{2} provides a measure of how well observed outcomes are replicated by the model. The higher the value, the more closely the predicted values match the observed values.
Root mean square error (RMSE) is the standard deviation of the prediction errors [27], which measure the distance of the data from the regression line; it indicates how concentrated the data are around the line of best fit, as follows:
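In its standard form, with y_{i} the observed values and f_{i} the predicted values:

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{i} - f_{i})^{2}} \]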
It expresses the dispersion of these errors.
Mean absolute error (MAE) allows measurement of the average magnitude of the errors for a set of predictions, regardless of their direction.
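In its standard form, with y_{i} the observed values and f_{i} the predicted values:

\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_{i} - f_{i} \right| \]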
It represents the mean of the absolute difference in the sample between the prediction and the actual observation, taking into account that all individual differences are of equal significance. Therefore, compared to RMSE, MAE is less sensitive to outliers.
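The three metrics can be computed as in the following sketch, which uses the standard definitions. The function names and the actual and predicted values are hypothetical and are not taken from the study results.

```python
import math


def r_squared(y, f):
    """Coefficient of determination: 1 - SS_r / SS_t."""
    mean_y = sum(y) / len(y)
    ss_r = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    ss_t = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_r / ss_t


def rmse(y, f):
    """Root mean square error of predictions f against observations y."""
    return math.sqrt(sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / len(y))


def mae(y, f):
    """Mean absolute error of predictions f against observations y."""
    return sum(abs(yi - fi) for yi, fi in zip(y, f)) / len(y)


# Hypothetical daily patient counts and forecasts (illustrative only).
actual = [300, 310, 320, 330]
predicted = [305, 305, 325, 325]

scores = (
    r_squared(actual, predicted),
    rmse(actual, predicted),
    mae(actual, predicted),
)
```

Here every error has the same magnitude, so RMSE and MAE coincide; a single large error would raise RMSE more than MAE, which is the sensitivity to outliers noted above.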
Results
Time series analysis
This section presents a time series analysis to forecast the number of patients admitted to the field hospital. Figure 2 shows the number of patients from 26 March 2020 to 22 July 2021. The three time series are related: the number of patients residing in the hospital equals the cumulative difference between admitted and discharged patients. The graph presents the four waves of the pandemic in terms of the number of patients in hospital. The four waves are as follows: The first wave (Wave 1), the emergence of SARS-CoV-2, is the shortest period (34 days), from 26 March 2020 to 16 May 2020. The second wave (Wave 2) was from 11 January 2021 to 14 March 2021 (44 days). After that, the third wave (Wave 3) and fourth wave (Wave 4) were continuous periods from 11 April 2021 to 31 May 2021 (51 days) and 1 June 2021 to 22 July 2021 (52 days), respectively. Finally, the forecasting models are validated on a test dataset from 1 August 2021 to 30 August 2021 (30 days).
In this study, the time series models were trained using six training datasets. The first training set (All Wave) covers all datasets Wave 1 to Wave 4 of 228 days; the second training set, Wave 1 of 34 days; the third training set, Wave 2 of 45 days; the fourth training set, Wave 3 of 51 days; the fifth training set, Wave 4 of 52 days; the sixth training set, Wave 3 and Wave 4 of 103 days.
In this work, we tested the estimated model using autocorrelation function (ACF) and partial autocorrelation function (PACF) plots to ensure that the model fits the data [17]. Figure 3 presents the steady-state prediction of the time-series models. For each model estimate, we examined the coefficient (Coef.), the standard error (Std err.) and z. The first estimate was for the AR model, which gave a coefficient of 0.3808, standard error of 0.243 and z of 1.565. The second was for the MA model, which gave a coefficient of − 0.5287, standard error of 6.841 and z of − 0.077. The sigma value, or constant value, had a coefficient of − 0.5287, standard error of 6.841 and z of − 0.077. Moreover, further diagnostics of the model gave a Jarque-Bera value of 7.70, heteroskedasticity of 0.57 and skew of 0.68.
For the data set, the time series method was applied using Python (PyFlux library) for time series analysis and prediction to compare the criteria of each setting. The ARIMAX (p,d,q) + X models were parameterized with X ∈ {∅, x_{1}, x_{2}}, p ∈ {0, 1, 2, 3}, q ∈ {0, 1, 2, 3}, d ∈ {0, 1, 2}, where X is the set of additional exogenous variables, with 51 combinations. We selected key features from association rule mining, such as symptoms, age, and underlying diseases. X = ∅ specifies that no additional exogenous variable is used. X = x_{1} indicates 15 additional exogenous variables, composed of three attributes in the symptom feature, four attributes in the age feature, and eight attributes in the underlying diseases feature. X = x_{2} represents four variables of the selected attributes, that is the ‘moderate’ symptom, the ‘more-than-65’ age, and the underlying diseases of ‘diabetes mellitus’ and ‘pregnant.’
The forecasting-accuracy metrics of the 51 models on the six datasets, evaluated with RMSE and MAE, are summarized in Table 2. The forecasts for the admitted patients with prediction confidence intervals (CI) between 5 and 95% are presented in Fig. 4 for ARIMA (2,2,2) and Fig. 5 for ARIMAX (1,1,1) + x_{2}. Overall, the most accurate estimation was obtained by improving from ARIMA (2, 2, 2) to ARIMAX (1, 1, 1) + x_{2} for the training set in Wave 4, covering from 11 April 2021 to 31 May 2021. For the first setting (All Wave), the best model is ARIMA (1,2,1) with an RMSE of 22.8141 and MAE of 19.4133, which was closer to the actual data. For Wave 1, ARIMAX (2,2,2) + x_{2} performs the best with an RMSE of 277.9974 and MAE of 273.4644, although these were the highest errors relative to the actual data among all models. For Wave 2, the AR(1) + x_{1} model is the best, with the smallest RMSE and MAE. Based on RMSE and MAE, ARIMAX (1,1,1) + x_{1} was the closest to the actual data in Wave 3. Based on RMSE and MAE, ARIMAX (1,1,1) + x_{2} appeared to be the best predictive model.
The comparisons among forecasting models are shown in Tables 3, 4 and 5. The models numbered 12–17 in Table 2 are defined to be the baseline models. The models with x_{1} are the models numbered 29–34, while the models with x_{2} are the models numbered 46–51. The compared pairs were (baseline vs x_{1}), (x_{1} vs x_{2}), and (baseline vs x_{2}). The comparison was done under the same parameter setting. The R^{2}, RMSE and MAE results (Tables 3, 4 and 5) indicated that the time series forecasting models could improve the coefficient of determination when exogenous variables were added.
The predicted values, CI 5% (lower confidence interval) and CI 95% (upper confidence interval), and actual data of the models are shown in Table 6 and Fig. 4. In addition, the improved predictive values of the models by adding exogenous variables are shown in Table 7 and Fig. 5. For example, ARIMA (2, 2, 2) predicted that the number of cumulative confirmed cases for the next 30 days could be 291 to 334 cases. ARIMAX (1, 1, 1) + x_{2} predicted that the number of cumulative confirmed cases for the next 30 days could be 293–330 cases.
Association rule mining
This section presents the results of applying association rule mining. We present the significant rules for the values of four attributes in the dataset. Table 1 shows the preliminary analysis of the dataset, which was extracted for a total of 3685 patients. The patient data consist of eleven attributes and 35 attribute values. In addition, an attribute code is defined as the itemset name, together with the frequency of each attribute code. We extracted 595 significant rules from the data.
The association rules grouped by the four attributes related to managing hospital resources are shown in Table 8. A length of stay of more than 14 days was related to healthcare workers and to three underlying diseases (other, pregnant, and dyslipidemia) that have the same value of 1.017. A length of stay of less than 14 days gave notable results for moderate symptoms (Lift of 6.464), three underlying diseases, and age more than 65 years.
The notable rules for discharge involved two attribute values. The results showed that referral to hospitals was strongly related to moderate symptoms (Lift of 9.127). In addition, four features in this attribute showed high Lift values: underlying diseases (5.655), metabolic syndrome (4.098), length of stay more than 14 days (3.613), and age more than 65 years (5.515). Chest X-ray with no lesion presented the same level of Lift. However, two features that showed high numbers of patients were age less than 24 years (1148) and asymptomatic status (2295). Moreover, chest X-ray with pneumonia showed high values for all of the following: moderate symptoms (3.287), age more than 65 (3.271), underlying disease diabetes mellitus (2.169), and underlying disease metabolic (2.062). For current incidence, Wave 1 showed strong associations with length of stay more than 14 days, source of infection from hospital, and healthcare worker patients. Wave 2 was also related to healthcare workers, asymptomatic status and source of infection from hospital, as was Wave 3. In Wave 4, underlying diseases, age more than 65 and moderate symptoms showed strong relationships. The key attributes selected through these association rules were used as the exogenous variables of the time series analysis.
Discussion
The first wave of SARS-CoV-2 occurred in early 2020, and the second, third and fourth waves spread rapidly from early to mid-2021, representing an unprecedented phenomenon for the medical services, society and economy of Thailand. The number of COVID-19 patients in this study increased from just 55 in the first wave to 311, 1779 and 1540 in the second, third and fourth waves, respectively, a more than 30-fold increase in the number of patients admitted to the field hospital. Most of the patients were at least 44 years old and were predominantly female. Patients included in this study were mostly asymptomatic and had no sign of pneumonia in the chest X-ray, because the field hospital system focused on patients who did not require advanced treatment. During the third and fourth waves, however, the number of COVID-19 patients with mild to moderate symptoms and pneumonia increased significantly because of the greater severity of the delta variant of SARS-CoV-2. The huge number of patients was a burden on the limited resources of Thailand’s healthcare system. Therefore, this study presented the use of time series modeling and association rule mining to forecast the COVID-19 pandemic outbreak as well as to analyze its associated prognostic factors. The method is a data-oriented approach that applies time-series analysis and association analysis to reveal meaningful hidden patterns for efficient handling of another pandemic crisis.
ARIMA models have been successfully applied to predicting disease outbreaks. Several studies have used the ARIMA model to forecast the spread of COVID19 in many countries, including the US, Brazil, India, Russia and Spain [28, 29]. Studies using ARIMA models to predict COVID19 cases relative to total confirmed cases reported an average RMSE of 144.81 across 6 geographic regions [28], an MAE of 787 to 1506 in the USA and 82 to 570 in Italy [18], and an MAE of 2967 in Indonesia [20]. In this work, ARIMA (2, 2, 2) was selected as the most accurate ARIMA model for predicting the number of admitted COVID19 cases in the field hospital, achieving R^{2} = 0.5695, RMSE = 29.7605 and MAE = 27.5102 (Fig. 4). The forecasts of admitted cases on August 15 and August 30, 2021 were 335 and 294, respectively. The actual values reported on the same dates fell within the upper and lower bounds of the 95% confidence intervals of the selected ARIMA model. This indicated an acceptable accuracy of the model for estimating admitted cases in the field hospital.
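The three error scores compared above can be illustrated with a short, self-contained sketch. It assumes the textbook definitions of MAE, RMSE and the coefficient of determination (one minus the ratio of residual to total sum of squares); the exact R^{2} convention used in the study may differ. The daily counts below are hypothetical and are not the hospital data.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical admitted-case counts and forecasts (NOT the study data).
actual_cases = [300, 320, 310, 290]
forecast_cases = [295, 330, 305, 300]

print("MAE :", mae(actual_cases, forecast_cases))
print("RMSE:", rmse(actual_cases, forecast_cases))
print("R^2 :", r_squared(actual_cases, forecast_cases))
```

Because RMSE squares each error before averaging, it is never smaller than MAE and penalizes occasional large misses more heavily, so the two scores are usually read together.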
ARM is a structured method of discovering frequent patterns in a data set and forming noticeable rules among regular patterns. In the COVID19 crisis, many nations, including Thailand, have given the highest priority to saving lives and protecting their economies. A previous study using ARM to mine COVID19 data for factors related to COVID19 situation management showed that face mask mandates combined with mobility reduction through moderate stay-at-home orders were most effective in reducing the number of COVID19 cases in the United States [24]. In this study, the ARM technique was used to identify factors related to the length of stay and prognosis of COVID19 patients. The top five factors related to hospital stays longer than 14 days were healthcare workers; uncommon underlying diseases such as thalassemia, thyroid diseases, gout and G6PD deficiency; pregnancy; dyslipidemia; and signs of pneumonia on chest x-ray. This study also identified clinical factor rules related to the worsening condition of inpatients. Among those who needed more advanced medical treatment, the rules included mild to moderate COVID19 symptoms, pregnancy, metabolic syndrome, length of hospital stay of more than 14 days, and age over 65 years. These factors are consistent with those in previous studies, which reported similar conditions among patients who had a poor prognosis in COVID19 infections [1, 30].
In any prediction task, more data are needed to achieve better performance from the models. This study combined the ARM technique with the ARIMA model to form the ARIMAX model. The model selected the rules related to COVID19 prognosis from the ARM technique, namely mild to moderate COVID19 symptoms, metabolic syndrome and age over 65 years, and integrated them into the ARIMA model as exogenous variables. Experimental results showed that the ARIMAX (1, 1, 1) model improved the accuracy of forecasting the number of admitted COVID19 cases, with lower error scores than the ARIMA model (R^{2} = 0.5695, RMSE = 27.7508, MAE = 23.4642) (Fig. 5). The forecast of this model for August 30, 2021 was 259 to 327 cases, and the actual number of cases on that date was 291. The actual value was therefore also within the lower and upper prediction bounds of the 95% confidence intervals. To the best of our knowledge, this is the first study to combine the ARM technique with the ARIMA model for forecasting COVID19 cases by integrating the optimal exogenous variables from the ARM rules into a predictive model. This ARIMAX model has the potential to predict the number of COVID19 patients and could be one of the reliable forecasting-based models for future outbreaks. These predictive models are intended to support better decision-making when planning an effective management system if the virus outbreak has not subsided.
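For readers less familiar with the notation, one common way of writing an ARIMAX (1, 1, 1) model with three exogenous variables is as a regression with ARIMA errors. The sketch below is an assumption about the general form only: the text does not state which ARIMAX formulation or which encoding of the ARM-selected variables was used, and the symbols x_{1,t}, x_{2,t}, x_{3,t} (for example, daily counts of symptomatic patients, patients with metabolic syndrome and patients older than 65 years), \beta, \phi_1 and \theta_1 are introduced here purely for illustration.

```latex
% y_t: number of admitted cases on day t; B: backshift operator (B u_t = u_{t-1})
\begin{aligned}
y_t &= \beta_1 x_{1,t} + \beta_2 x_{2,t} + \beta_3 x_{3,t} + u_t, \\
(1 - \phi_1 B)(1 - B)\, u_t &= (1 + \theta_1 B)\, \varepsilon_t,
\qquad \varepsilon_t \sim \mathcal{N}(0, \sigma^2).
\end{aligned}
% Equivalently, with \Delta u_t = u_t - u_{t-1}:
% \Delta u_t = \phi_1 \Delta u_{t-1} + \varepsilon_t + \theta_1 \varepsilon_{t-1}
```

In this form the orders (p, d, q) = (1, 1, 1) correspond to one autoregressive term, one round of differencing and one moving-average term, while the exogenous variables enter through the regression coefficients.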
Limitations
The limitation of this study is that the dataset was based on retrospective data from a single COVID19 field hospital in Thailand with a limited number of cases and clinical variables of COVID19 patients.
Future directions
In future work, collaboration among multiple medical centers to obtain a larger number of COVID19 cases and a wider range of variables, including clinical, laboratory and treatment data from the medical records of various COVID19 centers, would improve the forecasting performance of this AI model and allow it to predict COVID19 events more accurately. Additionally, geographic data related to the pandemic area could be used as a variable for alternative time series models such as space-time ARIMA models [31], which could be more reliable in predicting future COVID19 outbreaks.
Conclusion
This study demonstrated that the ARIMAX model has the potential to increase the accuracy of predicting the number of COVID19 cases by incorporating the most strongly associated prognostic factors identified by the ARM technique into the ARIMA model. The resulting AI model was effective both in predicting the number of admitted COVID19 patients and in identifying their prognostic factors. This work is expected to serve as a novel AI-based decision-making model for preparing and organizing hospital resources and for making better use of medical personnel and equipment, in order to enhance healthcare decision-making and to manage the COVID19 pandemic as well as other epidemic crises.
Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Abbreviations
COVID19: Coronavirus disease 2019
SARSCoV2: Severe Acute Respiratory Syndrome Coronavirus 2
MERSCoV: Middle East Respiratory Syndrome Coronavirus
CBC: Complete blood count
LFTs: Liver function tests
BUN: Blood urea nitrogen
Cr: Creatinine
SpO_{2}: Pulse oxygen saturation
BT: Body temperature
BMI: Body mass index
G6PD: Glucose-6-Phosphate Dehydrogenase
ANN: Artificial Neural Network
SVM: Support Vector Machine
ARM: Association Rule Mining
ARIMA: Autoregressive Integrated Moving Average
ARIMAX: Autoregressive Integrated Moving Average with Exogenous Covariates
R^{2}: Coefficient of determination
RMSE: Root mean square error
MAE: Mean absolute error
CI: Confidence intervals
References
Huang C, Wang Y, Li X, Ren L, Zhao J, Hu Y, et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet. 2020;395(10223):497–506.
Wolkewitz M, Puljak L. Methodological challenges of analysing COVID19 data during the pandemic. BMC Med Res Methodol. 2020;20(1):81.
Tao K, Tzou PL, Nouhin J, Gupta RK, de Oliveira T, Kosakovsky Pond SL, et al. The biological and clinical significance of emerging SARSCoV2 variants. Nat Rev Genet. 2021;22(12):757–73.
World Health Organization: COVID19 Weekly Epidemiological Update, Edition 95. 2022.
Yoo I, Alafaireet P, Marinov M, PenaHernandez K, Gopidi R, Chang JF, et al. Data mining in healthcare and biomedicine: a survey of the literature. J Med Syst. 2012;36(4):2431–48.
Huang F, Wang S, Chan C. Predicting disease by using data mining based on healthcare information system. In: 2012 IEEE International Conference on Granular Computing: 11–13 Aug. 2012, vol. 2012; 2012. p. 191–4.
Koh HC, Tan G. Data mining applications in healthcare. J Healthc Inf Manag. 2005;19(2):64–72.
Kriston L. Predictive accuracy of a hierarchical logistic model of cumulative SARSCoV2 case growth until May 2020. BMC Med Res Methodol. 2020;20(1):278.
Ayatollahi H, Gholamhosseini L, Salehi M. Predicting coronary artery disease: a comparison between two data mining algorithms. BMC Public Health. 2019;19(1):448.
Alfisahrin SNN, Mantoro T. Data Mining Techniques for Optimization of Liver Disease Classification. In: 2013 International Conference on Advanced Computer Science Applications and Technologies: 23–24 Dec. 2013, vol. 2013; 2013. p. 379–84.
AlTuraiki I, Alshahrani M, Almutairi T. Building predictive models for MERSCoV infections using data mining techniques. J Infect Public Health. 2016;9(6):744–8.
Abonazel M, Ibrahim A. Forecasting Egyptian GDP using ARIMA models. Rep Econ Finance. 2019;5:35–47.
Cryer JD, Chan KS. Time series analysis with applications in R. 2nd ed. New York: Springer; 2008.
Zaki MJ. Scalable algorithms for association mining. IEEE Trans Knowl Data Eng. 2000;12(3):372–90.
Heisterkamp SH, Dekkers AL, Heijne JC. Automated detection of infectious disease outbreaks: hierarchical time series models. Stat Med. 2006;25(24):4179–96.
Zhang GP. Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing. 2003;50:159–75.
Abonazel M, Darwish N. Forecasting confirmed and recovered Covid19 cases and deaths in Egypt after the genetic mutation of the virus: ARIMA boxJenkins approach. Commun Math Biol Neurosci. 2022;2022:17.
Gecili E, Ziady A, Szczesniak RD. Forecasting COVID19 confirmed cases, deaths and recoveries: revisiting established time series modeling through novel applications for the USA and Italy. PLoS One. 2021;16(1):e0244173.
Singh S, Parmar KS, Makkhan SJS, Kaur J, Peshoria S, Kumar J. Study of ARIMA and least square support vector machine (LSSVM) models for the prediction of SARSCoV2 confirmed cases in the most affected countries. Chaos, Solitons Fractals. 2020;139:110086.
Aditya Satrio CB, Darmawan W, Nadia BU, Hanafiah N. Time series analysis and forecasting of coronavirus disease in Indonesia using ARIMA model and PROPHET. Proc Comput Sci. 2021;179:524–32.
Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on management of data. Washington, D.C.: Association for Computing Machinery; 1993. p. 207–16.
K S L, G DV. Extracting association rules from medical health records using multicriteria decision analysis. Proc Comput Sci. 2017;115:290–5.
Tandan M, Acharya Y, Pokharel S, Timilsina M. Discovering symptom patterns of COVID19 patients using association rule mining. Comput Biol Med. 2021;131:104249.
Katragadda S, Gottumukkala R, Bhupatiraju RT, Kamal AM, Raghavan V, Chu H, et al. Association mining based approach to analyze COVID19 response and case growth in the United States. Sci Rep. 2021;11(1):18635.
Amasiri W, Warin K, Mairiang K, Mingmalairak C, Panichkitkosolkul W, Silanun K, et al. Analysis of characteristics and clinical outcomes for crisis management during the four waves of the COVID19 pandemic. Int J Environ Res Public Health. 2021;18(23):12633.
Time Series Models AR, MA, ARMA, ARIMA; 2020 [cited 2021 7 December] Available from: https://towardsdatascience.com/timeseriesmodelsd9266f8ac7b0.
Barnston AG. Correspondence among the correlation, RMSE, and Heidke forecast verification measures; refinement of the Heidke score. Weather Forecast. 1992;7(4):699–709.
HernandezMatamoros A, Fujita H, Hayashi T, PerezMeana H. Forecasting of COVID19 per regions using ARIMA models and polynomial functions. Appl Soft Comput. 2020;96:106610.
Darapaneni N, Reddy D, Paduri AR, Acharya P, Nithin HS. Forecasting of COVID19 in India Using ARIMA Model. In: 2020 11th IEEE Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON): 28–31 Oct. 2020, vol. 2020; 2020. p. 0894–9.
Noor FM, Islam MM. Prevalence and associated risk factors of mortality among COVID19 patients: a Metaanalysis. J Community Health. 2020;45(6):1270–82.
Awwad FA, Mohamoud MA, Abonazel MR. Estimating COVID19 cases in Makkah region of Saudi Arabia: spacetime ARIMA modeling. PLoS One. 2021;16(4):e0250149.
Acknowledgements
The authors would like to thank Supasek Sanmano from Thammasat Field Hospital and Kunch Ringrod from Thai Network for Disaster Resilience (TNDR) for data preparation. We thank Mr. Terrance J. Downey, English Editor for Thammasat University Office of Research and Innovation for English language editing.
Funding
This work was supported by the Thammasat University Research Fund (CovidTU03/2564), Center of Excellence in Intelligent Informatics, Speech and Language Technology and Service Innovation (CILS), and Intelligent Informatics and Service Innovation (IISI).
Author information
Contributions
Conceptualization: K.W., S.N., S.S.; Methodology: R.S., K.W., T.T., S.S.; Formal analysis and investigation: R.S., K.W., W.A., W.P., T.T., S.S.; Fund acquisition: W.A., T.T.; Writing - original draft preparation: K.W., S.S.; Writing - review and editing: K.W., S.S.; Resources: W.A., C.M., K.M., K.S.; Supervision: S.S., S.N. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Ethics approval and consent to participate
The study protocol and the exemption from the need to obtain informed consent were approved by the Ethics Committee of Thammasat University (COE 008/2564) in accordance with the 1964 Declaration of Helsinki.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Somyanonthanakul, R., Warin, K., Amasiri, W. et al. Forecasting COVID19 cases using time series modeling and association rule mining. BMC Med Res Methodol 22, 281 (2022). https://doi.org/10.1186/s1287402201755x
DOI: https://doi.org/10.1186/s1287402201755x
Keywords
 COVID 19
 Pandemic
 Data mining
 Time series analysis
 Association rule mining