
Prediction of acute appendicitis among patients with undifferentiated abdominal pain at emergency department



Early screening for and accurate identification of Acute Appendicitis (AA) among patients with undifferentiated symptoms associated with appendicitis during their emergency visit will improve patient safety and health care quality. The aim of the study was to compare models that predict AA among patients with undifferentiated symptoms at emergency visits using both structured data and free-text data from a national survey.


We performed a secondary data analysis on the 2005-2017 United States National Hospital Ambulatory Medical Care Survey (NHAMCS) data to estimate the association between a diagnosis of AA among emergency department (ED) patients and the demographic and clinical factors recorded during the ED visit. We used binary logistic regression (LR) and random forest (RF) models incorporating natural language processing (NLP) to predict AA diagnosis among patients with undifferentiated symptoms.


Among the 40,041 ED patients with an assigned International Classification of Diseases (ICD) code of AA or appendicitis-related symptoms between 2005 and 2017, 655 adults (2.3%) and 256 children (2.2%) had AA. For the LR model identifying AA diagnosis among adult ED patients, the c-statistic was 0.72 (95% CI: 0.69–0.75) for structured variables only, 0.72 (95% CI: 0.69–0.75) for unstructured variables only, and 0.78 (95% CI: 0.76–0.80) when including both structured and unstructured variables. For the LR model identifying AA diagnosis among pediatric ED patients, the c-statistic was 0.84 (95% CI: 0.79–0.89) for structured variables only, 0.78 (95% CI: 0.72–0.84) for unstructured variables only, and 0.87 (95% CI: 0.83–0.91) when including both structured and unstructured variables. The RF models showed c-statistics similar to those of the corresponding LR models.


We developed models that predict the AA diagnosis for adult and pediatric ED patients, and the predictive accuracy improved with the inclusion of NLP elements and approaches.



AA is one of the most common surgical emergencies but has a high rate of misdiagnosis in the United States [1]. It is also the second most common condition among pediatric malpractice claims and third for adult malpractice claims [2, 3]. The lifetime risk of developing appendicitis is approximately 7% and usually requires surgical treatment [4, 5]. The annual national rate of AA is up to 13/100,000 patients [6], but the diagnosis of AA is missed at a rate of 3.8-15% for children and 5.9-23.5% for adults during ED visits [7,8,9,10,11]. While the clinical diagnosis may be straightforward in patients who present with classic signs and symptoms, atypical presentations may result in diagnostic confusion and delay in treatment. The diagnosis of AA can be challenging even in the most experienced hands. Abdominal pain is the primary presenting complaint of patients with AA. Accurately identifying AA among patients with undifferentiated symptoms at emergency visits can potentially improve the patient safety and health care quality.

Technological innovations that employ NLP and machine learning (ML) techniques can be used to extract useful features from the complex structured and unstructured retrospective electronic health records (EHRs) data to potentially replicate the clinician’s thought process at ED presentation. These features can be used to accurately identify a patient’s diagnosis, which has the potential to improve ED patient safety [12]. Among ED patients, the ML and NLP techniques have proven useful in better understanding the associated factors related to ED health outcomes, such as hospitalization and medical resource utilization, and thus, they can be used to improve predictive performance for these outcomes [13,14,15,16]. However, few studies have focused on using NLP and ML to identify a patient’s diagnosis and potential misdiagnosis [17].

The aim of the study was to develop ML and NLP models as an assistive technique to predict AA among patients with undifferentiated symptoms at ED visits. We hypothesize that the prediction accuracy can be improved with the inclusion of NLP elements.


Study design and setting

We carried out the study on combined data from the ED component of the NHAMCS datasets (2005-2017). The NHAMCS, which the Centers for Disease Control and Prevention (CDC) has published annually since 1992, collects data on the utilization and provision of ambulatory care services in hospital emergency and outpatient departments. The ED component of NHAMCS is a multistage, stratified probability sample of ED visits from 300 hospital-based EDs each year, randomly selected from about 1900 geographically defined areas across the United States and administered by the National Center for Health Statistics (NCHS) [18]. The NHAMCS is a public use dataset that does not require ethical committee or institutional review board approval.

Definition of appendicitis

AA in this study was defined by the ICD, 9th and 10th Revision, Clinical Modification (ICD-9-CM and ICD-10-CM) diagnosis codes from category 540-542 (ICD-9-CM) and K35-K37 (ICD-10-CM), which refers specifically to essential (or primary) appendicitis [19]. Along with the implementation of ICD-10-CM since 2015, an ICD-10-CM category of K35-K37 was used to define the diagnosis of primary appendicitis, which is equivalent to the ICD-9-CM category 540-542, according to the ICD-10-CM General Equivalence Mapping (GEM), a crosswalk between the two code standards maintained by the Centers for Medicare and Medicaid Services (CMS) and the CDC.

Study patients

A total of 356,333 patient visits were included in the ED component of the survey datasets from 2005 to 2017. Using the ICD-9-CM and ICD-10-CM codes, we selected 40,041 patients who were assigned an ICD code of AA or showed at least one symptom associated with appendicitis (abdominal pain, constipation, diarrhea, fever, and nausea and/or vomiting) during the ED visit (Tables S1 and S2). We then divided the patients into two groups by age (>=18 years old or <18 years old): the adult group (N = 28,657, 71.57%) and the pediatric group (N = 11,384, 28.43%).

Study variables


The primary outcome variable for this study was whether the eventual diagnosis during an ED visit was AA. The outcome variable was assigned a value of 1 if the eventual diagnosis was appendicitis, and a value of 0 if the patient had symptoms associated with appendicitis but was not assigned an ICD code of AA.


The predictors for ML models were chosen from routinely available data at ED components using a priori knowledge [20,21,22]. This study classified predictors into two categories, structured variables and unstructured variables.

Specifically, the structured predictors included: sex, race, ethnicity, type of residence, insurance, visit year, month and day, arrival time, initial vital signs (body temperature, respiratory rate, systolic and diastolic blood pressure, pulse oximetry), 5-point triage level (immediate, emergent, urgent, semi-urgent, nonurgent), pain scale (mild, moderate, very severe), 72-hour revisit, whether the visit was related to an injury, poisoning, or adverse effect of medical treatment, whether the injury/poisoning was intentional, and the diagnostic services (any laboratory tests or imaging tests) provided.

Unstructured data included up to three reasons for visiting the ED and three causes of injury recorded by the providers for each patient in the triage notes; the limit of three was by design of the NHAMCS. The reason for visit classification system derived by the NCHS is a modular framework into which the reason for visit is broadly categorized as a type of complaint (e.g., symptoms, diseases, injury) and a methodology for systematically recording these complaints within a specific organ or area of the body. The system then records the complaint in a pre-specified fashion according to an alphabetical index of complaints (for example, “eye pain” is changed to “pain, eye”) while maintaining the emphasis on the patient’s lay terminology rather than a clinician’s translation of the patient’s reason for the visit.

Missing values

Before statistical modelling, the k-nearest neighbors (k-NN) approach was used to impute missing data for most predictors. For a given patient with missing values, the k-NN method identified the k-nearest patients based on Euclidean distance. Using these patients, missing values were then replaced using a majority vote for discrete variables and weighted means for continuous features. One advantage of using this method is that missing values in all features are imputed simultaneously without the need to treat features individually [23].
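The imputation rule described above can be sketched in a few lines of pure Python. This is a minimal illustration, not the study's code: the function name `knn_impute`, the choice of inverse-distance weights, and the convention of measuring distance only over features observed in both patients are assumptions, and the toy values are invented. It also assumes that discrete variables are numerically coded and that features are on comparable scales.

```python
import math
from collections import Counter


def knn_impute(rows, discrete_cols, k=3):
    """Fill None entries using the k nearest rows (Euclidean distance)."""
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        missing = [j for j, value in enumerate(row) if value is None]
        if not missing:
            continue
        # Rank all other rows once; every missing feature of this row is
        # then filled from the same ranking (assumption: distance is taken
        # over the features observed in both rows).
        neighbours = []
        for m, other in enumerate(rows):
            if m == i:
                continue
            shared = [j for j, value in enumerate(row)
                      if value is not None and other[j] is not None]
            if shared:
                dist = math.sqrt(sum((row[j] - other[j]) ** 2 for j in shared))
                neighbours.append((dist, m))
        neighbours.sort()
        for j in missing:
            # Keep the k closest rows that actually have feature j observed.
            donors = [(dist, rows[m][j]) for dist, m in neighbours
                      if rows[m][j] is not None][:k]
            if not donors:
                continue
            if j in discrete_cols:
                # Majority vote for discrete variables.
                imputed[i][j] = Counter(v for _, v in donors).most_common(1)[0][0]
            else:
                # Inverse-distance weighted mean for continuous variables
                # (the weighting scheme is an assumption).
                weights = [1.0 / (dist + 1e-9) for dist, _ in donors]
                imputed[i][j] = (sum(w * v for w, (_, v) in zip(weights, donors))
                                 / sum(weights))
    return imputed


# Hypothetical toy data: [temperature, heart rate, coded binary variable].
patients = [
    [37.0, 80.0, 1],
    [37.2, 82.0, 1],
    [39.5, 120.0, 0],
    [37.1, None, None],
]
filled = knn_impute(patients, discrete_cols={2}, k=2)
```

In this toy example the last patient is closest to the first two, so the missing heart rate is filled with a value near their mean and the missing binary variable takes their shared value.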

Statistical analysis


NLP is a field of Artificial Intelligence (AI) that gives machines the ability to read, understand and derive meaning from human languages; NLP offers many techniques to vectorize human language, whether a word, a sentence, a paragraph, or even a document [24]. Since the unstructured variables in this study were all in sentence form, we used the Doc2Vec method, an embedding-based encoding method, in Python for vectorization.

We first pre-processed the unstructured data, including word segmentation and removal of stop words. Then we used TaggedDocument in the gensim package to wrap each input sentence and convert it to the input sample format required by Doc2Vec [25, 26]. After that, we initialized the Doc2Vec model with a window size of 3 and trained it, and finally we mapped the unstructured data into 128-dimensional paragraph vectors for use in further predictions.

The ML methods are data-driven and therefore rely on accurate data. Although there may be some misclassification in the survey data, the coding error rate in the 10% quality control sample of NHAMCS was less than 1% [27]. We therefore established two main types of ML models, standard binary LR and RF, in Python to compare their predictive accuracy for whether ED patients would be diagnosed with AA, using information available at the time of triage.


LR is a member of the generalized linear model (GLM) family. It has the underlying assumption that the output follows a Bernoulli distribution with parameter p, where p is the probability of success (in our case the probability of appendicitis). This assumption is consistent with our 0/1 appendicitis outcome. LR also uses a canonical link function in the form of: \(\log\left(\frac{p_i}{1-{p}_i}\right)={x}_i\beta\). With a transformation we get \({p}_i=\frac{1}{1+{e}^{-{x}_i\beta }}\). Since the expectation of a Bernoulli distribution is p, the predicted outcome for patient i is \({p}_i\).

The parameter β is fitted by Maximum Likelihood Estimation (MLE); once the betas are estimated, the predicted values can be calculated using the equation \({p}_i=\frac{1}{1+{e}^{-{x}_i\beta }}\). In this study, the model building strategy for LR was direct (i.e., full, standard, or simultaneous): all predictors were entered into the equation at the same time.
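The relationship between the linear predictor \({x}_i\beta\), the predicted probability \({p}_i\), and the quantity that MLE maximizes can be checked numerically. The following is a small sketch, not the study's code: the function names and the coefficient values are hypothetical, and the first element of each feature vector is assumed to be a constant 1 for the intercept.

```python
import math


def predict_proba(x, beta):
    """p_i = 1 / (1 + exp(-x_i . beta)) for one patient."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))  # linear predictor x_i . beta
    return 1.0 / (1.0 + math.exp(-eta))


def logit(p):
    """Log-odds, the inverse of the transformation in predict_proba."""
    return math.log(p / (1.0 - p))


def log_likelihood(X, y, beta):
    """Bernoulli log-likelihood of the 0/1 outcomes; MLE picks beta to maximize it."""
    total = 0.0
    for x, yi in zip(X, y):
        p = predict_proba(x, beta)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


# Hypothetical coefficients: intercept -3.0 and one predictor with weight 0.5.
beta = [-3.0, 0.5]
patient = [1.0, 2.0]                  # leading 1.0 multiplies the intercept
p = predict_proba(patient, beta)      # linear predictor is -3.0 + 0.5 * 2.0 = -2.0
```

Applying `logit` to `p` recovers the linear predictor of -2.0, which is the link-function relationship stated above.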

In this study, we separately fitted three LR models for adults and children to determine the model’s predictive performance in identifying the eventual diagnosis: (1) models with structured variables only; (2) models with unstructured data; and (3) models with both structured and unstructured variables.


We then employed an RF classifier, which has been widely used for classification and prediction in medicine and bioinformatics, to build prediction models of appendicitis in adults and children during ED visits [28,29,30]. The RF classifier is an ensemble of decision trees, and each tree learns from a randomly selected subset of the training data. A decision tree classifier derives its information content from the attributes in the dataset, so the algorithm first splits on the attribute that carries the most information for classification. The training samples for each tree are drawn at random with replacement, so that every random sample has the same total size. For prediction, each decision tree is applied to the test set and the error is evaluated, and the final classification decision is made by majority voting over all decision trees.
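The two ensemble ingredients described above, random resampling of the training data and majority voting, can be illustrated with a toy sketch. It is not the study's implementation: `fit_stump` is a deliberately simple one-feature stand-in for a real decision tree, the resampling is assumed to be with replacement, and all names and data values are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) rows with replacement, so every sample has the same size."""
    return [rng.choice(data) for _ in data]


def fit_stump(sample):
    """Toy stand-in for a decision tree: one cut on a single feature."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:
        # Degenerate resample with a single class: always predict that class.
        label = 1 if pos else 0
        return lambda x: label
    pos_mean = sum(pos) / len(pos)
    neg_mean = sum(neg) / len(neg)
    cut = (pos_mean + neg_mean) / 2.0
    high_label = 1 if pos_mean > neg_mean else 0
    return lambda x: high_label if x > cut else 1 - high_label


def majority_vote(predictions):
    """Final class is the most common class among the per-tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


def forest_predict(trees, x):
    return majority_vote([tree(x) for tree in trees])


# Hypothetical (feature, label) pairs; an odd number of trees avoids tied votes.
data = [(1.0, 0), (1.2, 0), (0.9, 0), (3.0, 1), (3.3, 1), (2.9, 1)]
rng = random.Random(0)
trees = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]
```

Each tree sees a different resample, so individual trees may disagree, but the vote across trees is what determines the final class.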

Because of this non-parametric model setting, RF can be used in problems that are not linearly separable. However, this property is also problematic because it makes the model very sensitive to noise. Therefore, before classification, we cleaned the unstructured data. First, Principal Component Analysis (PCA) was used to convert the original features to orthogonal ones. Then, based on the p-value of Welch's approximate t-test, we kept those components that were statistically significant at a level of p<0.01, selecting 24 principal components derived from the original 128 features. Based on the 20 structured and 24 unstructured features, we applied the standard RF classifier in Scikit-learn (Sklearn) to the same three models as in LR, using 1000 trees in the RF implementation [31, 32]. The number of jobs to run in parallel was 90. The number of features selected at random at each tree node was set to \(\log_2(n)\), where n was the total number of features [33].

Model evaluation

For both LR and RF models, we used 5-fold cross-validation to evaluate model performance. Patients were randomly divided into 5 sets; 4 of the 5 sets were used to train the models while the remaining set was used as the testing set. In the testing set, we measured the prediction performance of each model by computing (1) the C-statistic (the area under the receiver operating characteristic curve, AUC) and (2) prospective prediction results (sensitivity, specificity, threshold, and accuracy). To address the class imbalance in the outcome, we chose the threshold for the prospective prediction results from the Receiver Operating Characteristics (ROC) curve, as the value with the shortest distance to the perfect model [15]. The C-statistic summarizes the overall diagnostic accuracy of the index test in a single numerical value. It ranges from 0.50 to 1.00, with higher values indicating better predictive models: values above 0.80 indicate very good models, values between 0.70 and 0.80 good models, and values between 0.50 and 0.70 weak models. The average ROC curve was derived by comparing the prediction values from all 5 cross-validated testing sets. The ROC curve shows the overall performance of a specific model. As the threshold varies from 0 to 1, we calculate the corresponding False Positive Rate (FPR) (\(\frac{FP}{FP+ TN}\)) and True Positive Rate (TPR) (\(\frac{TP}{TP+ FN}\)). We then plot each point in a rectangular coordinate system with the FPR as the horizontal coordinate and the TPR as the vertical coordinate. The closer the curve comes to the upper-left corner of the plot, the better the performance of the model. The perfect model has a ROC curve that is a line linking (0,0), (0,1), and (1,1). The AUC is the probability that, when one positive patient and one negative patient are chosen at random, the score of the positive patient is greater than that of the negative patient. So, the larger the value, the better the model separates the two classes of patients.
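The evaluation quantities in this section can be made concrete with a short pure-Python sketch. It is illustrative only: the function names and the toy labels and scores are hypothetical, a patient is assumed to be classified positive when the score is at or above the threshold, and ties in the ranking-based AUC are assumed to count as one half.

```python
import math


def roc_points(y_true, scores):
    """(threshold, FPR, TPR) for each distinct score used as a cut-off."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((t, fp / neg, tp / pos))  # FPR = FP/(FP+TN), TPR = TP/(TP+FN)
    return points


def auc_by_ranking(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_threshold(y_true, scores):
    """ROC point with the shortest distance to the perfect corner (FPR=0, TPR=1)."""
    return min(roc_points(y_true, scores),
               key=lambda point: math.hypot(point[1], 1.0 - point[2]))


# Hypothetical test-fold labels (1 = AA) and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]

auc = auc_by_ranking(y_true, scores)
threshold, fpr, tpr = best_threshold(y_true, scores)
sensitivity, specificity = tpr, 1.0 - fpr
```

For these toy values, 14 of the 15 positive-negative pairs are ranked correctly, so the AUC is 14/15, and the closest point to the perfect corner is the cut-off 0.4, where sensitivity is 1.0 and specificity is 0.8.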


The recall (sensitivity) depicts the ability of the model to identify all positive cases. The calculation function is \(R=\frac{TP}{TP+ FN}\).


The specificity depicts the ability of the model to identify all negative cases. The calculation function is \(S=\frac{TN}{TN+ FP}\).


Among the 40,041 ED patients with appendicitis-related symptoms between 2005 and 2017, 655 of 28,657 adults (2.3%) and 256 of 11,384 pediatric patients (2.2%) had appendicitis (Table 1). The proportion of patients with appendicitis was higher among males (3.5% for adults and 3.1% for pediatric patients) than among females (1.7% for adults and 1.5% for pediatric patients). The proportion of appendicitis patients was highest among Asian adults (4.4%) and among white pediatric patients (2.7%). By triage level, the proportion of appendicitis patients was highest at the immediate level in both adults and pediatric patients (5.6 and 10.0%). By pain level, the proportion was highest for very severe pain in both adults and pediatric patients (2.7 and 5.7%). A total of 2.4% of adult patients and 3.2% of pediatric patients who were provided diagnostic services were diagnosed with AA, which is higher than among adult (1.3%) and pediatric (0.5%) patients who did not have diagnostic services.

Table 1 Baseline characteristics of the United States appendicitis patients presenting to the ED NHAMCS 2005–2017

The crude and adjusted odds ratios of adult and pediatric ED patients with acute appendicitis (vs. non-appendicitis) for each predictive factor using binary LR are presented in Table 2. The adjusted analysis showed that the odds of being diagnosed with AA were higher in adult males (aOR=2.327; 95% CI: 1.984-2.728) and pediatric males (aOR=2.759; 95% CI: 2.102-3.622) than in females. Compared with patients with private insurance, adults (aOR=0.462; 95% CI: 0.370-0.578) and pediatric patients (aOR=0.691; 95% CI: 0.517-0.923) with Medicaid or Children's Health Insurance Program (CHIP) or other state-based program had lower odds of being diagnosed with AA. Adults and pediatric patients with immediate triage levels were more likely to be diagnosed with AA. Adults with moderate (aOR=2.016; 95% CI: 1.513-2.687) and very severe (aOR=2.527; 95% CI: 1.915-3.335) pain levels had greater odds of being diagnosed with AA than those with mild pain. Similarly, pediatric patients with moderate (aOR=5.291; 95% CI: 3.587-7.805) and very severe (aOR=8.094; 95% CI: 5.414-12.099) pain levels had greater odds of being diagnosed with AA than those with mild pain. Adults (aOR = 2.268; 95% CI: 1.445-3.560) and pediatric patients (aOR = 3.385; 95% CI: 2.106-5.441) who received diagnostic services had greater odds of AA than those who did not receive diagnostic services.

Table 2 Adjusted odds ratio (aOR) of characteristics of adult and pediatric patients during the emergency department visit (appendicitis vs. non-appendicitis), NHAMCS 2005–2017

In Fig. S1, we show the contribution (weights) of each of the 128 Doc2Vec outputs to the first 24 principal components for the unstructured data, computed before applying the LR and RF approaches.

As shown in Table 3 and Fig. 1, for the LR model identifying AA diagnosis among adult ED patients, the AUC was 0.72 (95% CI: 0.69–0.75) for structured variables only, and 0.72 (95% CI: 0.69–0.75) for unstructured variables only, and 0.78 (95% CI: 0.76–0.80) when including both structured and unstructured variables. For the LR model identifying AA diagnosis among pediatric ED patients, the AUC was 0.84 (95% CI: 0.79–0.89) for structured variables only, 0.78 (95% CI: 0.72–0.84) for unstructured variables, and 0.87 (95% CI: 0.83–0.91) when including both structured and unstructured variables.

Table 3 Predictive performance of LR and RF models with 5-fold cross-validation in identifying diagnosed appendicitis ED patients, NHAMCS 2005-2017

Fig. 1 ROC curves for the LR and RF models for predicting the diagnosed appendicitis (adult and pediatric)

For the RF model identifying AA diagnosis among adult ED patients, the AUC was 0.71 (95% CI: 0.65–0.77) for structured variables, 0.68 (95% CI: 0.64–0.72) for unstructured variables, and 0.75 (95% CI: 0.71–0.79) for structured and unstructured variables. For the RF model identifying AA diagnosis among pediatric ED patients, the AUC was 0.84 (95% CI: 0.83–0.85) for structured variables, 0.78 (95% CI: 0.76–0.80) for unstructured variables, and 0.86 (95% CI: 0.84–0.88) for structured and unstructured variables. The discrimination ability of different models, as represented by ROC curves, is shown in Fig. 1.

The standardized and non-standardized coefficients of structured variables were used as modeling examples (Tables S3 and S4) to determine whether to diagnose AA among adult and pediatric ED patients. The standardized coefficients can be used to compare which variables have the greater influence on the prediction of confirmed AA. Among adult ED patients, the standardized coefficients were highest for insurance and triage level. Among pediatric ED patients, they were highest for insurance and pain level.


In this study, we used data from the 2005-2017 NHAMCS ED survey and applied statistical models to predict whether adult and pediatric patients were diagnosed with AA. A novel part of this study was the pairing of a traditional statistical and ML approach (the LR algorithm) with a more advanced ML modeling technique (the RF algorithm) to identify appendicitis, and the comparison of the predictive performance of the two techniques across a series of indicators. In addition, to preprocess the unstructured text, we used the Doc2Vec NLP technique to extract features from the text and used them for modeling and prediction, so as to improve the predictive ability of the two ML models. In general, the performance of both models was significantly improved when the predictors combined structured data with NLP-derived unstructured data.

To our knowledge, this is the first time that the Doc2Vec NLP technique has been used for unstructured text analysis of the reason for the patient visit and the cause of injury to predict AA diagnosis using NHAMCS ED survey data. This study also serves as a teaching case to help physicians, nurses, researchers, and others learn about NLP technologies. Combined with the structured data, the LR and RF algorithms were used to build prediction models of AA diagnosis among ED patients. Many other studies have shown that in the fields of electronic case mining and bioinformatics, the predictive performance of models can be greatly improved by incorporating textual information [34,35,36,37]. There are several potential explanations for the incremental gains in predictive ability from NLP. First, NLP can capture more word and context information from the unstructured text than traditional text analysis approaches, such as word spotting and manual rules [38]. Additionally, end-to-end training and learning of representations differentiate deep learning from traditional ML methods and make it a powerful tool for NLP [39]. Moreover, Doc2Vec allows us to extract or infer features for both the word and the paragraph, which word2vec cannot do. Our results show that the AUC is highest when both structured and unstructured data are included in the prediction model.

Although many previous studies have shown that an RF algorithm performs better than an LR algorithm [40,41,42], in our study, where both algorithms were applied to adult and pediatric patients, the predictive performance of the LR algorithm was no different from that of the RF algorithm in either group. This may be because an LR model works well as a classifier if the relationship between the input variables (structured variables) and the output variable (AA) is linear and the data are relatively balanced between classes. If the relationship between the input and the output variable is linear, the RF algorithm will only approximate linear methods such as LR in the limit case of an infinite number of trees. The RF algorithm exchanges a high degree of variance between each tree for a low bias in predicting the outcome variable. A more unbiased estimate may be given if other methods are assumed not to violate the linearity, collinearity, and homogeneity of the parameters [43,44,45].

Compared with ED patients with private insurance, patients with Medicaid or CHIP or other state-based programs and self-pay patients had a significantly lower risk of being diagnosed with appendicitis. The reasons for these differences should be further explored in future studies to determine the appropriateness of including or excluding these variables in predictive models, which is important to determine whether such predictive models can be used as a more objective tool to predict whether a patient has appendicitis based on the clinical context [46]. Sex, race, ethnicity, triage level, pain level and diagnostic services provided were also found to be important predictors for identifying patients with appendicitis. As expected, patients with immediate triage level were more likely to be diagnosed with appendicitis than those with other triage levels. Patients with moderate and very severe pain levels were generally more likely to be diagnosed with AA than those with mild pain levels.

The clinical practice of the adult ED is quite different from that of the pediatric ED. In particular, the diagnosis of appendicitis in pediatric populations is more complex and time-consuming than in adults because of physiological and developmental differences [47]. Relative to patients with an immediate triage level, the risk of an AA diagnosis at the urgent, semi-urgent and nonurgent levels was lower in pediatric patients than in adult patients. However, relative to patients with mild pain, pediatric patients with moderate and very severe pain levels had a higher risk of an AA diagnosis than adults.

Since the prediction model is based on whether ED patients will eventually be diagnosed with AA, it can not only predict AA but also help doctors, nurses and triage personnel choose more helpful examinations in advance, so as to make more efficient use of medical resources. Previous studies have shown that, because the ED is a critical staging area for critically ill patients, developing more efficient tools can help avoid overcrowding, increase the efficiency of healthcare resource use in the ED, and ultimately improve the quality of care and health outcomes for ED patients [48,49,50]. The prediction model developed in our study for adult and pediatric ED patients with diagnosed appendicitis is consistent with the goal of establishing a better decision system in the ED [51, 52].

The prediction model of AA diagnosis among ED patients produced in this study is designed to help doctors, nurses, and triage personnel make decisions and cannot replace their roles. Although we developed an improved prediction model for diagnosing AA in ED patients, it still needs to be tested in actual clinical work. The model remains imperfect at present, so it may increase the possibility of misdiagnosis of AA if clinicians rely on it as more than an assistive tool.


Our study has several limitations. First, because the survey spans many years, the questionnaire variables are inconsistent across years, so some available variables, such as complications and arrival by ambulance, were not included in the prediction model, which may affect its predictive ability [53]. Second, the NHAMCS data did not gather more useful clinical variables for the diagnosis of appendicitis, such as hyperbilirubinemia, white blood cells (WBCs) count, and absence of inflammatory changes. However, the goal of this study was not to build predictive models from a large number of predictors, but to build ML models from a limited number of predictors, which are often easier to put into practice. Even so, the results of this study still lack clinical operability and need to be further verified and improved. Third, we did not try other dimensions for the Doc2Vec feature extraction. The dimension values used in this paper were mainly based on previous literature, which may affect the predictive ability of the model [38, 54]. Fourth, the dataset is a large administrative dataset that may have further limitations, such as the sampling techniques used to generate the data, the decreasing number of AA cases over the years, and the lack of the clinical context that only more robust clinical data can provide [55, 56]. Finally, the low incidence of AA in the study population suggests that the number of patients actually considered for AA was much smaller than the inclusion criteria suggest. A positive rate of only 2-3% is very low compared with other studies, which may affect the predictive performance of the model.


Based on the analysis of 40,041 patients with AA-related symptoms in the NHAMCS ED survey, we examined the information relating to the patients' socioeconomic, demographic and clinical factors during their ED visits, including unstructured free text such as the reason for visit and the cause of injury, and developed a prediction model to diagnose AA for adults and children. Although external prospective validation is necessary, these observations suggest an opportunity to apply advanced predictive methods to routinely available triage data, as an assistive technique, to enhance clinicians' diagnostic decisions, which in turn will lead to more accurate and effective clinical identification of AA in the ED.

Availability of data and materials

The datasets and code generated during and/or analysed during the current study are available from the corresponding author on reasonable request.



Abbreviations

AA: Acute Appendicitis
NHAMCS: National Hospital Ambulatory Medical Care Survey
ED: Emergency Department
NLP: Natural Language Processing
ICD: International Classification of Diseases
ML: Machine Learning
EHRs: Electronic Health Records
CDC: Centers for Disease Control and Prevention
ICD-9-CM: ICD, 9th Revision, Clinical Modification
ICD-10-CM: ICD, 10th Revision, Clinical Modification
GEM: General Equivalence Mapping
NCHS: National Center for Health Statistics
k-NN: k-Nearest Neighbors
AI: Artificial Intelligence
LR: Logistic Regression
GLM: Generalized Linear Model
MLE: Maximum Likelihood Estimation
RF: Random Forests
PCA: Principal Component Analysis
CHIP: Children’s Health Insurance Program
AUC: Area Under the Receiver Operating Curve
ROC: Receiver Operating Characteristics
FPR: False Positive Rate
TPR: True Positive Rate
OR: Odds Ratio
CI: Confidence Interval
WBCs: White Blood Cells


  1. Mahajan P, Basu T, Pai C-W, et al. Factors associated with potentially missed diagnosis of appendicitis in the emergency department. JAMA Netw Open. 2020;3(3):e200612.

    Article  PubMed  PubMed Central  Google Scholar 

  2. Brown TW, McCarthy ML, Kelen GD, Levy F. An epidemiologic study of closed emergency department malpractice claims in a national database of physician malpractice insurers. Acad Emerg Med. 2010;17(5):553–60.

    Article  PubMed  Google Scholar 

  3. Selbst SM, Friedman MJ, Singh SB. Epidemiology and etiology of malpractice lawsuits involving children in US emergency departments and urgent care centers. Pediatr Emerg Care. 2005;21(3):165–9.

    PubMed  Google Scholar 

  4. Ahmed HO, Muhedin R, Boujan A, Aziz AH, Muhamad Abdulla A, Hardi RA, et al. A five-year longitudinal observational study in morbidity and mortality of negative appendectomy in Sulaimani teaching Hospital/Kurdistan Region/Iraq. Sci Rep. 2020;10(1):1–7.

    Google Scholar 

  5. Daldal E, Dagmura H. The correlation between complete blood count parameters and appendix diameter for the diagnosis of acute appendicitis. Healthcare. 2020;8(1):39 Multidisciplinary Digital Publishing Institute.

    Article  PubMed Central  Google Scholar 

  6. Ferris M, Quan S, Kaplan BS, et al. The global incidence of appendicitis: a systematic review of population-based studies. Ann Surg. 2017;266(2):237–41.

    Article  PubMed  Google Scholar 

  7. Galai T, Beloosesky OZ, Scolnik D, Rimon A, Glatstein M. Misdiagnosis of acute appendicitis in children attending the emergency department: the experience of a large, tertiary care pediatric hospital. Eur J Pediatr Surg. 2017;27(2):138–41.

    Article  PubMed  Google Scholar 

  8. Naiditch JA, Lautz TB, Daley S, Pierce MC, Reynolds M. The implications of missed opportunities to diagnose appendicitis in children. Acad Emerg Med. 2013;20(6):592–6.

    Article  PubMed  Google Scholar 

  9. Chang YJ, Chao HC, Kong MS, Hsia SH, Yan DC. Misdiagnosed acute appendicitis in children in the emergency department. Chang Gung Med J. 2010;33(5):551–7.

    PubMed  Google Scholar 

  10. Graff L, Russell J, Seashore J, et al. False-negative and false-positive errors in abdominal pain evaluation: failure to diagnose acute appendicitis and unnecessary surgery. Acad Emerg Med. 2000;7(11):1244–55.

    Article  CAS  PubMed  Google Scholar 

  11. Leung YK, Chan CP, Graham CA, Rainer TH. Acute appendicitis in adults: Diagnostic accuracy of emergency doctors in a university hospital in Hong Kong. Emerg Med Australas. 2017;29(1):48–55.

    Article  PubMed  Google Scholar 

  12. Levin S, Toerper M, Hamrock E, et al. Machine-learning-based electronic triage more accurately differentiates patients with respect to clinical outcomes compared with the emergency severity index. Ann Emerg Med. 2017.

  13. Claster W, Shanmuganathan S, Ghotbi N. Text mining of medical records for radiodiagnostic decision-making; 2008.

    Book  Google Scholar 

  14. Demner-Fushman D, Chapman WW, McDonald CJ. What can natural language processing do for clinical decision support? J Biomed Inform. 2009;42(5):760–72.

    Article  PubMed  PubMed Central  Google Scholar 

  15. Huhdanpaa HT, Tan WK, Rundell SD, et al. Using natural language processing of free-text radiology reports to identify type 1 modic endplate changes. J Digit Imaging. 2017.

  16. Shin B, Chokshi F, Lee T, Choi J. Classification of radiology reports using neural attention models; 2017.

    Book  Google Scholar 

  17. Goto T, Camargo CA, Faridi MK, Freishtat RJ, Hasegawa K. Machine learning–based prediction of clinical outcomes for children during emergency department triage. JAMA Netw Open. 2019;2(1):e186937.

    Article  PubMed  PubMed Central  Google Scholar 

  18. McCaig LF, Burt CW. Understanding and interpreting the National Hospital Ambulatory Medical Care Survey: key questions and answers. Ann Emerg Med. 2012;60(6):716–721.e711.

    Article  PubMed  Google Scholar 

  19. Singer DD, Thode HC Jr, Singer AJ. Effects of pain severity and CT imaging on analgesia prescription in acute appendicitis. Am J Emerg Med. 2016;34(1):36–9.

    Article  PubMed  Google Scholar 

  20. Raita Y, Goto T, Faridi MK, Brown DF, Camargo CA, Hasegawa K. Emergency department triage prediction of clinical outcomes using machine learning models. Crit Care. 2019;23(1):1–3.

    Article  Google Scholar 

  21. Griffin JL, Yersin M, Baggio S, Iglesias K, Velonaki VS, Moschetti K, et al. Characteristics and predictors of mortality among frequent users of an Emergency Department in Switzerland. Eur J Emerg Med. 2018;25(2):140–6.

    PubMed  Google Scholar 

  22. Krieg C, Hudon C, Chouinard MC, Dufour I. Individual predictors of frequent emergency department use: a scoping review. BMC Health Serv Res. 2016;16(1):1–10.

    Article  Google Scholar 

  23. Ye C, Fu T, Hao S, et al. Prediction of incident hypertension within the next year: prospective study using statewide electronic health records and machine learning. J Med Internet Res. 2018;20(1):e22.

    Article  PubMed  PubMed Central  Google Scholar 

  24. Zheng T, Gao Y, Wang F, et al. Detection of medical text semantic similarity based on convolutional neural network. BMC Med Informatics Decis Mak. 2019;19(1):156.

    Article  Google Scholar 

  25. Song M, Kang KY, Timakum T, Zhang X. Examining influential factors for acknowledgements classification using supervised learning. PLoS One. 2020;15(2):e0228928.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  26. Zeng J, Banerjee I, Henry AS, Wood DJ, Shachter RD, Gensheimer MF, et al. Natural language processing to identify cancer treatments with electronic medical records. JCO Clin Cancer Informatics. 2021;5:379–93.

    Article  Google Scholar 

  27. Panackal AA, Halpern EF, Watson AJ. Cutaneous fungal infections in the United States: analysis of the national ambulatory medical care survey (NAMCS) and national hospital ambulatory medical care survey (NHAMCS), 1995–2004. Int J Dermatol. 2009;48(7):704–12.

    Article  PubMed  Google Scholar 

  28. Boulesteix AL, Janitza S, Kruppa J, König IR. Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdiscip Rev Data Mining Knowl Discov. 2012;2(6):493–507.

    Article  Google Scholar 

  29. Qi Y. Random forest for bioinformatics. Ensemble machine learning: methods and applications. Berlin: Springer; 2012.

    Google Scholar 

  30. Strobl C, Boulesteix A-L, Zeileis A, Hothorn T. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics. 2007;8(1):25.

    Article  PubMed  PubMed Central  Google Scholar 

  31. Rahman QA, Janmohamed T, Pirbaglou M, et al. Defining and predicting pain volatility in users of the manage my pain app: analysis using data mining and machine learning methods. J Med Internet Res. 2018;20(11):e12001.

    Article  PubMed  PubMed Central  Google Scholar 

  32. Yang F, et al. Transformers-sklearn: a toolkit for medical language understanding with transformer-based models. BMC Med Informatics Decis Mak. 2021;21(2):1–8.

    Google Scholar 

  33. Probst P, Wright MN, Boulesteix AL. Hyperparameters and tuning strategies for random forest. Wiley Interdiscip Rev Data Mining Knowl Discov. 2019;9(3):e1301.

    Article  Google Scholar 

  34. Funk B, Sadeh-Sharvit S, Fitzsimmons-Craft EE, et al. A framework for applying natural language processing in digital health interventions. J Med Internet Res. 2020;22(2):e13855.

    Article  PubMed  PubMed Central  Google Scholar 

  35. Korach ZT, Yang J, Rossetti SC, et al. Mining clinical phrases from nursing notes to discover risk factors of patient deterioration. Int J Med Inform. 2020;135:104053.

    Article  PubMed  Google Scholar 

  36. Fernandes AC, Dutta R, Velupillai S, Sanyal J, Stewart R, Chandran D. Identifying suicide ideation and suicidal attempts in a psychiatric clinical research database using natural language processing. Sci Rep. 2018;8(1):1–10.

    Article  Google Scholar 

  37. Jonnagaddala J, Liaw S-T, Ray P, Kumar M, Chang N-W, Dai H-J. Coronary artery disease risk assessment from unstructured electronic health records using text mining. J Biomed Inform. 2015;58:S203–10.

    Article  PubMed  PubMed Central  Google Scholar 

  38. Yang X, Yang S, Li Q, Wuchty S, Zhang Z. Prediction of human-virus protein-protein interactions through a sequence embedding-based machine learning method. Comput Struct Biotechnol J. 2020;18:153–61.

    Article  CAS  PubMed  Google Scholar 

  39. Li H. Deep learning for natural language processing: advantages and challenges [J]. Natl Sci Rev. 2017.

  40. Couronné R, Probst P, Boulesteix A-L. Random forest versus logistic regression: a large-scale benchmark experiment. BMC Bioinformatics. 2018;19(1):270.

    Article  PubMed  PubMed Central  Google Scholar 

  41. Buskirk TD, Kolenikov S. Finding respondents in the forest: a comparison of logistic regression and random forest models for response propensity weighting and stratification. Survey Methods: Insights from the Field. 2015:1-17.

  42. Singh V, Gupta RK, Sevakula RK, Verma NK. Comparative analysis of Gaussian mixture model, logistic regression and random forest for big data classification using map reduce. Paper presented at: 2016 11th International Conference on Industrial and Information Systems (ICIIS) 2016.

  43. Muchlinski D, Siroky D, He J, Kocher M. Comparing random forest with logistic regression for predicting class-imbalanced civil war onset data. Polit Anal. 2016;24(1):87–103.

    Article  Google Scholar 

  44. Ruiz A, Villa N. Storms prediction: logistic regression vs random forest for unbalanced data. arXiv preprint arXiv:08040650. 2008.

  45. Pranckevičius T, Marcinkevičius V. Comparison of naive bayes, random forest, decision tree, support vector machines, and logistic regression classifiers for text reviews classification. Baltic J Modern Comput. 2017;5(2):221.

    Article  Google Scholar 

  46. Payne NR, Puumala SE. Racial disparities in ordering laboratory and radiology tests for pediatric patients in the emergency department. Pediatr Emerg Care. 2013;29(5):598–606.

    Article  PubMed  Google Scholar 

  47. Fallon SC, Kim ME, Hallmark CA, et al. Correlating surgical and pathological diagnoses in pediatric appendicitis [J]. J Pediatr Surg. 2015;50(4):638–41.

    Article  PubMed  Google Scholar 

  48. Farion KJ, Michalowski W, Rubin S, Wilk S, Correll R, Gaboury I. Prospective evaluation of the MET-AP system providing triage plans for acute pediatric abdominal pain. Int J Med Inform. 2008;77(3):208–18.

    Article  PubMed  Google Scholar 

  49. Kharbanda AB, Dudley NC, Bajaj L, et al. Validation and refinement of a prediction rule to identify children at low risk for acute appendicitis. Arch Pediatr Adolesc Med. 2012;166(8):738–44.

    Article  PubMed  PubMed Central  Google Scholar 

  50. Laurell H, Hansson L-E, Gunnarsson U. Manifestations of acute appendicitis: a prospective study on acute abdominal pain. Dig Surg. 2013;30(3):198–206.

    Article  CAS  PubMed  Google Scholar 

  51. Oncel M, Degirmenci B, Demirhan N, Hakyemez B, Altuntas YE, Aydinli M. Is the use of plain abdominal radiographs (PAR) a necessity for all patients with suspected acute appendicitis in emergency services? Curr Surg. 2003;60(3):296–300.

    Article  PubMed  Google Scholar 

  52. Alshebromi MH, Alsaigh SH, Aldhubayb MA. Sensitivity and specificity of computed tomography and ultrasound for the prediction of acute appendicitis at King Fahad Specialist Hospital in Buraidah, Saudi Arabia. Saudi Med J. 2019;40(5):458.

    Article  PubMed  PubMed Central  Google Scholar 

  53. Zhang X, Bellolio MF, Medrano-Gracia P, Werys K, Yang S, Mahajan P. Use of natural language processing to improve predictive models for imaging utilization in children presenting to the emergency department. BMC Med informatics Decis Mak. 2019;19(1):287.

    Article  Google Scholar 

  54. Zheng T, Gao Y, Wang F, Fan C, Fu X, Li M, et al. Detection of medical text semantic similarity based on convolutional neural network. BMC Med informatics Decis Mak. 2019;19(1):1–11.

    Google Scholar 

  55. McNaughton CD, Self WH, Pines JM. Observational health services studies using nationwide administrative data sets: understanding strengths and limitations of the National Hospital Ambulatory Medical Care Survey: answers to the May 2013 Journal Club questions. Ann Emerg Med. 2013;62(4):425–30.

    Article  PubMed  Google Scholar 

  56. McCaig LF, Burt CW. Understanding and interpreting the National Hospital Ambulatory Medical Care Survey: key questions and answers. Ann Emerg Med. 2012;60(6):716–721.e1.

    Article  PubMed  Google Scholar 

Download references


The authors would like to thank the National Natural Science Foundation of China and the National School of Development, Peking University, University of Michigan, and other members for their support and cooperation.


This study was supported by the Michigan Institute for Clinical and Health Research (MICHR No. UL1TR002240) and the National Natural Science Foundation of China (No. 71473096; No. 71673101; No. 71974066). This study was also supported by the Thomas E. Starzl Transplantation Institute, University of Pittsburgh Medical Center. These funders had no role in study design, data collection, analysis, decision to publish, or manuscript preparation.

Author information

Authors and Affiliations



D.S. and X.Z. contributed to the conception and design of the project; D.S., T.Z., K.H., Y.C., X.Z., and Q.L. contributed to the analysis and interpretation of the data; P.V. and P.M. contributed to the data acquisition and provided statistical analysis support; D.S. and X.Z. drafted the article. D.S. and X.Z. are the guarantors. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Xingyu Zhang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

None declared.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Xingyu Zhang - Previous institutional affiliation during the course of the study: Department of Systems, Populations, and Leadership, University of Michigan School of Nursing, Ann Arbor, Michigan, United States.

Supplementary Information

Additional file 1: Table S1. Diagnosis and Procedure Codes. Table S2. Sample size of Diagnosis and Procedure Codes from 2005 to 2017. Table S3. Parameter estimation with structured variables of the logistic regression for adult ED patients, NHAMCS 2005-2017. Table S4. Parameter estimation with structured variables of the logistic regression for pediatric ED patients, NHAMCS 2005-2017. Figure S1. The contribution (weights) of each of the 128 Doc2Vec outputs to the first 24 principal components.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Su, D., Li, Q., Zhang, T. et al. Prediction of acute appendicitis among patients with undifferentiated abdominal pain at emergency department. BMC Med Res Methodol 22, 18 (2022).



Keywords

  • Acute appendicitis
  • Emergency department
  • Machine learning
  • Prediction modelling
  • Precision health