This study described which statistical methods have been referenced in the literature using the Canadian Community Health Survey (CCHS) dataset. Descriptive statistics and regression methods dominate this literature, whereas elementary statistics are reported far less often. The high prevalence of descriptive statistics is unsurprising, since almost all papers using the CCHS present some analysis or description of the data, and descriptive statistics usually appear in the first table of a quantitative paper.
However, the magnitude of the difference in prevalence of reported elementary statistics versus regression techniques is striking.
There are two possible reasons for the large difference in prevalence between elementary statistics and regression techniques: either modelling has become the only step of analysis in many empirical papers, or authors do not specifically mention elementary statistics when employing them. The first point, that models are the main analysis step taken by most researchers, is controversial among some practitioners. Nevertheless, the epidemiological paradigm is to analyze exposures and outcomes while considering effect measure modification and confounding; thus, modelling is a natural tool to employ [13]. In the social sciences, estimating statistical effects necessitates controlling for covariates to curtail omitted variable bias, which is mathematically the same issue as confounding. Thus, a series of elementary statistics might be unnecessary or redundant when the research question naturally lends itself to modelling. The second point, that elementary statistics are not explicitly reported, likely explains a large share of the gap between regression techniques and elementary statistics. Elementary statistics are likely not reported in their own right but are used in regression interpretation and coefficient hypothesis testing. For instance, any linear regression that reports a statistically significant coefficient is actually reporting a t-test of the hypothesis that the coefficient equals zero; the authors simply do not identify it as such, probably because most readers understand which hypothesis test is implied. This is important, since a reader would need to understand elementary statistical tools to engage with a study using regression techniques, even though that paper does not specifically mention those elementary statistics.
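The link between a regression coefficient and an elementary test can be illustrated with a minimal sketch. It assumes a simple linear regression with a single binary predictor and uses made-up outcome values (not CCHS data); the function names are hypothetical. Under these assumptions, the t statistic for the slope is identical to the classical pooled two-sample t statistic.

```python
import math

def ols_slope_t(x, y):
    """t statistic for H0: slope = 0 in the simple regression y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and the standard error of the slope
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    return slope / se_slope

def pooled_two_sample_t(group0, group1):
    """Classical two-sample t statistic with a pooled variance estimate."""
    n0, n1 = len(group0), len(group1)
    m0, m1 = sum(group0) / n0, sum(group1) / n1
    ss0 = sum((v - m0) ** 2 for v in group0)
    ss1 = sum((v - m1) ** 2 for v in group1)
    pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
    return (m1 - m0) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))

# Illustrative (made-up) outcome values for an unexposed and an exposed group
unexposed = [4.1, 5.0, 3.8, 4.6, 5.2]
exposed = [5.9, 6.3, 5.1, 6.8, 6.0]

# Code exposure as a 0/1 predictor and stack the outcomes
x = [0] * len(unexposed) + [1] * len(exposed)
y = unexposed + exposed

t_reg = ols_slope_t(x, y)
t_two = pooled_two_sample_t(unexposed, exposed)
print(t_reg, t_two)
```

The two printed values agree up to floating-point error, which is why a reader who understands the t-test already understands the significance test attached to the coefficient.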
Of the regression techniques utilized, this study shows that logistic regression is the most common technique employed by studies using the CCHS. In most health-condition-outcome models, an individual either has a health condition or does not, and in that case logistic regression is a popular choice because of its mathematical properties, such as the log of the odds being unbounded even though the underlying probability lies between 0 and 1. There is also a tendency to dichotomize continuous variables so that they may serve as the outcome variable in a logistic regression model, either because the dichotomized variable is easier to interpret medically (e.g., dichotomizing body mass index into “obese” or “not obese”) or because researchers are more comfortable with logistic regression (working with it or presenting it to an audience familiar with it). It is notable that a single regression technique, logistic regression, is present in nearly 70% of the papers included in our literature search, which indicates how important it is for anyone interested in health research to understand this technique.
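The two ideas in this paragraph, the unbounded log odds and the dichotomization of a continuous variable, can be sketched in a few lines. The function names are hypothetical, the body mass index values are made up, and the cutoff of 30 is used here only as an illustrative assumption about how “obese” might be defined.

```python
import math

def logit(p):
    """Log odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Map any real-valued log odds back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def is_obese(bmi, cutoff=30.0):
    """Dichotomize BMI into 1 (obese) or 0 (not obese); the cutoff is an assumption."""
    return 1 if bmi >= cutoff else 0

# Probabilities near 0 or 1 map to large negative or positive log odds
for p in (0.001, 0.5, 0.999):
    print(p, logit(p))

# Made-up BMI values turned into a binary outcome suitable for logistic regression
bmi_values = [22.5, 31.2, 29.9, 35.0]
outcomes = [is_obese(b) for b in bmi_values]
print(outcomes)
```

Because the log odds can take any real value, a linear predictor can be placed on that scale without producing fitted probabilities outside the unit interval.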
Figure 2 shows time trends of the three broad types of methods observed in the data. One notable point is that the number of papers using the CCHS increased from 5 in 2002 to over 100 in 2010, which represents an impressive feat of data dissemination by Canadian academic institutions and Statistics Canada. The statistical methods reported each year followed a steady pattern, with descriptive statistics present in most papers, regression techniques closely tracking descriptive statistics, and elementary statistics trending much lower. The year 2010 appears to have been a peak in research using the CCHS; the number of publications plateaus after 2010, with the three methodological categories maintaining their relative ranks.
The statistical software preferences illustrate the popularity of SAS in the analysis of the CCHS datasets. This paper does not explain the prevalence of SAS; however, it might be related to SAS being a prominent application in government environments or to its availability in the Statistics Canada Research Data Centres (RDC), where the CCHS microdata can be accessed. Additionally, SAS provides macros for tasks such as bootstrap variance estimation, which is often applied in the analysis of the CCHS. Even though this paper focused on the use of statistical methods and the associated statistical software, other software was identified during our review, such as software for the analysis of geographical information (i.e. ESRI ArcGIS) [17], dietary analysis (i.e. SIDE-IML, SIDE, C-SIDE) [18] and discrete event simulation (i.e. Arena) [19].
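For readers unfamiliar with bootstrap variance estimation in a survey setting, the following is a minimal sketch of the general idea, not a reproduction of any SAS macro. It assumes the survey file supplies a full-sample weight plus a set of replicate (bootstrap) weights for each respondent, and it uses one common convention: the variance is the average squared deviation of the replicate estimates from the full-sample estimate. All values, weights, and function names are made up for illustration.

```python
def weighted_mean(values, weights):
    """Survey-weighted mean (for a 0/1 variable, a weighted prevalence)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def bootstrap_variance(values, full_weights, replicate_weights):
    """Average squared deviation of replicate estimates from the full-sample estimate."""
    theta = weighted_mean(values, full_weights)
    replicates = [weighted_mean(values, w) for w in replicate_weights]
    return sum((r - theta) ** 2 for r in replicates) / len(replicates)

# Made-up binary outcome (1 = has condition) for five respondents
has_condition = [1, 0, 1, 1, 0]

# Hypothetical full-sample weights and four sets of replicate weights
full_w = [100, 150, 120, 80, 50]
rep_w = [
    [110, 140, 130, 70, 50],
    [90, 160, 115, 85, 50],
    [105, 145, 125, 80, 45],
    [95, 155, 110, 90, 50],
]

estimate = weighted_mean(has_condition, full_w)
variance = bootstrap_variance(has_condition, full_w, rep_w)
print(estimate, variance, variance ** 0.5)
```

In practice the number of replicate weight sets is far larger than four; the small number here only keeps the example readable.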
Our study offers a new approach to the description of statistical methods presented in the literature. Whereas previous reviews of statistical methods used in the health literature have focused on articles indexed in specific journals or sets of journals, we investigated which statistical methods were reported in articles using a specific dataset. Articles were drawn from 233 journals and spanned many specialties. This study describes the statistical techniques that a reader must be familiar with in order to understand the information within articles using a popular population-level dataset. With the growing popularity of open access journals, the statistical literacy of the general public might be an issue that needs to be addressed in knowledge translation. When regression models are the norm for presenting results, the message within the papers might not be conveyed with sufficient nuance, or be clearly understandable, to the media or interested individuals.
Even though this study focused only on the statistical methods reported in the literature and not the actual statistics used in the analysis, this information may be useful in identifying research gaps. For example, we could not identify any machine learning methods such as Artificial Neural Networks, Support Vector Machines or Random Forests in the analysis of CCHS data, even though these machine learning algorithms are used in other areas of medical research such as mortality prediction and health services utilization.
There are several limitations to our study. First, we only reviewed references from the Ovid databases Medline and Embase, Web of Knowledge, and Scopus. The CCHS dataset could have been applied in other disciplines, such as computer science, that are not fully indexed in the selected bibliographic databases, and that literature would not be reflected in our analysis. Additionally, our study only looked at bibliographic databases and would not include grey literature such as government reports. Second, in our retrieval from the bibliographic databases, an article that identified the CCHS in its full text but not in its abstract would not be included in our analysis. This limitation is common to all literature searches that use the abstract as the basis of the search, such as a search in PubMed. In terms of the analysis, the search function only finds textual information, so if a statistical method appeared in a table formatted as a picture, or was referenced solely in a footnote that was a picture, it may not have been found by the software. We felt this was unlikely, as most methods sections of articles tend to report the statistical techniques and datasets used. Another limitation was in the development of the terms and phrases used for the automated identification of statistical concepts, as there are many ways to phrase things in the English language and some statistical terms can be used in other contexts. For example, the word “mean” was identified as an arithmetic average, but the word can appear in the text without referring to an average, and the software would not be able to discriminate between the two uses. Further complicating this issue, there is no standardized reporting tradition in health statistics. For instance, awkward turns of phrase that describe a mean, while obviously referring to a mean, were missed by the algorithm. An example would be a phrase like “respondents were averaged 20 years of age”, which reports a descriptive statistic but is not captured by the search process.
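Both failure modes of term-based identification can be reproduced with a minimal sketch. The pattern and the first two sentences below are hypothetical and are not the actual search terms or software used in this study; they only illustrate how a keyword match on “mean” yields a false positive and misses the awkward phrasing quoted above.

```python
import re

# Hypothetical search term: the whole word "mean", case-insensitive
MEAN_PATTERN = re.compile(r"\bmean\b", re.IGNORECASE)

sentences = [
    "The mean age of respondents was 20 years.",    # reports a mean, matched
    "These findings mean that access is unequal.",  # not a statistic, still matched
    "Respondents were averaged 20 years of age.",   # reports a mean, not matched
]

# True where the keyword search would flag the sentence as reporting a mean
flags = [bool(MEAN_PATTERN.search(s)) for s in sentences]

for sentence, flagged in zip(sentences, flags):
    print(flagged, sentence)
```

Adding further terms such as “averaged” would capture the third sentence but would in turn introduce new false positives, which illustrates the trade-off involved in developing the term list.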