We performed an observational bibliometric analysis of animal research published in critical care medicine journals, following the PRISMA and STROBE guidelines [14, 15]. Journals were selected based on their inclusion in the Thomson Reuters™ Journal Citation Reports® subject category “Critical Care Medicine” [16]. A PubMed search was performed to identify animal experimental studies published in 2005 and 2015. Our primary inclusion criterion was that the article reported an experimental study conducted in animals. Animals were further defined as: “any of a kingdom of living things composed of many cells typically differing from plants in capacity for active movement, in rapid response to stimulation, in being unable to carry on photosynthesis, and in lack of cellulose cell walls” [17]. We excluded meta-analyses, case reports, historical articles, letters, review articles, and editorials. One investigator manually assessed the PubMed search results for animal experimental studies. The PubMed filter “other animals” was then applied to the initial search results to detect any animal experimental studies not found in the manual search. Journals that did not publish at least ten animal studies in both 2005 and 2015 were excluded from the analysis (Fig. 1). To assess consistency in the identification of manuscripts reporting on animal experimental research, a second investigator, blinded to the results of the first, independently searched two journals randomly selected from the seven journals included in this study.
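For illustration, one journal-year slice of such a search could be scripted against the NCBI E-utilities as sketched below in Python (Biopython). The contact address, journal name, and the rendering of PubMed's “other animals” species filter in query syntax are assumptions for this sketch, not the study's archived search string.

```python
from Bio import Entrez  # Biopython wrapper around the NCBI E-utilities

Entrez.email = "investigator@example.org"  # NCBI asks for a contact address

# One journal-year slice of the search. The exact query syntax for
# PubMed's "other animals" species filter is an assumption here.
handle = Entrez.esearch(
    db="pubmed",
    term='"Critical Care Medicine"[ta] AND 2005[dp] AND "other animals"[Filter]',
    retmax=1000,
)
pmids = Entrez.read(handle)["IdList"]  # PMIDs to screen manually
handle.close()
```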
Next, we rated all selected animal studies. A computer-generated randomization scheme (Excel, Microsoft Corp., Redmond, WA) was used to randomize the order of articles across both year and journal before the analysis. Studies were analyzed using their full-text Portable Document Format (PDF) versions. Reporting of power analysis, randomization, and blinding was graded on a 0–3 point scale (0 = not mentioned, 1 = mentioned but specified as not performed, 2 = performed but no details given, 3 = performed and details given) [18]. To assess inter-rater agreement for the criterion ratings, we randomly selected 10% of the total articles for re-rating by a second investigator blinded to the results of the first investigator.
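A minimal sketch of this randomization and re-rating sample in Python (the study itself used Excel; the article records and seed below are hypothetical):

```python
import random

# Hypothetical article records; in the study these were the included
# PubMed results, identified by PMID, journal, and year.
articles = [
    {"pmid": "16000001", "journal": "Journal A", "year": 2005},
    {"pmid": "26000002", "journal": "Journal B", "year": 2015},
    # ... remaining included articles
]

rng = random.Random(42)      # fixed seed for a reproducible order
rating_order = list(articles)
rng.shuffle(rating_order)    # random rating order across year and journal

# 10% of the total articles, re-rated by the blinded second investigator
n_rerate = max(1, round(0.10 * len(rating_order)))
rerate_sample = rng.sample(rating_order, n_rerate)
```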
Statistical analysis
To address the primary hypothesis, ordinal rating scores were collapsed into binary (performed/not performed) variables. Chi-square tests were used to examine overall differences in the reporting of quality metrics between 2005 and 2015. Simple logistic regression with time as a continuous covariate was used to estimate the effect of time on the odds that a quality metric was performed and reported in published articles. The reference group was “not performed”, and odds ratios were calculated for the entire 10-year increment in time.
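These analyses were run in SAS; the Python sketch below illustrates the same two steps on made-up counts: a chi-square test of the 2 × 2 year-by-reporting table, then logistic regression with time coded in decades so that exp(β) is directly the odds ratio for the full 10-year increment.

```python
import numpy as np
from scipy.stats import chi2_contingency
import statsmodels.api as sm

# Illustrative data only: counts of articles reporting vs. not reporting
# a given quality metric in 2005 and 2015 (values are made up).
table = np.array([[30, 111],   # 2005: performed, not performed
                  [55,  86]])  # 2015: performed, not performed

# Chi-square test of the overall difference in reporting between years
chi2, p, dof, _ = chi2_contingency(table, correction=False)

# Simple logistic regression with time as a continuous covariate.
# Coding time in decades (2005 -> 0, 2015 -> 1) makes exp(beta) the
# odds ratio for the entire 10-year increment.
time = np.repeat([0.0, 1.0], table.sum(axis=1))
performed = np.concatenate([
    np.repeat([1, 0], table[0]),   # outcomes for 2005
    np.repeat([1, 0], table[1]),   # outcomes for 2015
])
fit = sm.Logit(performed, sm.add_constant(time)).fit(disp=False)
or_decade = np.exp(fit.params[1])  # odds ratio per 10-year increment
```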
To assess the relationship between year of study and the degree of reporting of quality metrics (as ordinal variables), the Wilcoxon rank-sum test was used. Proportional odds models for ordinal logistic regression were used to calculate odds ratios for the increase in reporting of metrics in 2015 compared with 2005. The proportional odds assumption was verified with the score test.
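Again for illustration only (the study used SAS), these two analyses could be expressed in Python on made-up 0–3 ratings as follows; statsmodels' OrderedModel fits the proportional odds model, while the score test of its assumption has no direct statsmodels equivalent and is not shown.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Illustrative 0-3 ordinal ratings for one metric (values are made up)
scores_2005 = np.array([0, 0, 0, 1, 2, 0, 3, 0, 2, 0])
scores_2015 = np.array([0, 2, 3, 3, 1, 2, 0, 3, 2, 2])

# Wilcoxon rank-sum test (equivalent to the Mann-Whitney U test) for
# the association between study year and the ordinal rating
stat, p = mannwhitneyu(scores_2005, scores_2015, alternative="two-sided")

# Proportional odds (ordinal logistic) model: exp(beta) is the odds
# ratio for higher reporting scores in 2015 relative to 2005
year = np.concatenate([np.zeros_like(scores_2005), np.ones_like(scores_2015)])
ratings = np.concatenate([scores_2005, scores_2015])
fit = OrderedModel(ratings, year[:, None], distr="logit").fit(
    method="bfgs", disp=False)
odds_ratio = np.exp(fit.params[0])  # first parameter is the year effect
```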
Inter-rater agreement was assessed for each of the three metrics (power, randomization, and blinding) using Cohen’s kappa and Gwet’s AC1 [19]. Gwet’s AC1 is an alternative inter-rater reliability coefficient to Cohen’s kappa that is more stable in the presence of high prevalence and unbalanced marginal probabilities [19, 20]. Inter-rater agreement for the identification of animal study articles was assessed using the kappa coefficient. The level of agreement was interpreted using the standard scale for interpretation of kappa [21]. Statistical analyses were performed in SAS 9.4 (SAS Institute, Cary, NC). Statistical tests were adjusted for multiple comparisons using the Bonferroni method to maintain an overall 0.05 level of significance.
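Both agreement coefficients are straightforward to compute; the Python sketch below uses scikit-learn for kappa and implements Gwet's first-order agreement coefficient directly from its definition [19], on made-up paired ratings.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Illustrative paired ratings (collapsed to performed=1 / not performed=0)
# from the two blinded investigators; values are made up.
rater1 = np.array([0, 1, 1, 0, 0, 1, 0, 0, 1, 0])
rater2 = np.array([0, 1, 0, 0, 0, 1, 0, 1, 1, 0])

kappa = cohen_kappa_score(rater1, rater2)

def gwet_ac1(r1, r2):
    """Gwet's AC1 for two raters and categorical ratings.

    Its chance-agreement term stays stable when one category is far
    more prevalent than the others, where kappa becomes paradoxical.
    """
    categories = np.union1d(r1, r2)
    n = len(r1)
    p_a = np.mean(r1 == r2)  # observed agreement
    # pi_c: mean proportion of all ratings falling in category c
    pi = np.array([(np.sum(r1 == c) + np.sum(r2 == c)) / (2 * n)
                   for c in categories])
    p_e = np.sum(pi * (1 - pi)) / (len(categories) - 1)  # chance agreement
    return (p_a - p_e) / (1 - p_e)

ac1 = gwet_ac1(rater1, rater2)
```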
Power analysis
For the power-analysis metric, we assumed a 12% absolute increase in reporting incidence over the 10-year interval, compared as two independent proportions [18]. We anticipated a baseline reporting level of 5% in 2005 and a reporting level of 17% in 2015. A total of 141 studies in each year (282 total) would yield 80% power to detect an absolute difference of at least 12% in the proportion of studies reporting this metric as significant.
For the randomization metric, we assumed a 13% absolute increase in reporting incidence over the 10-year interval, compared as two independent proportions [18]. We anticipated a baseline reporting level of 41% in 2005 and a reporting level of 54% in 2015. A total of 307 studies in each year (614 total) would yield 80% power to detect an absolute difference of at least 13% in the proportion of studies reporting this metric as significant.
For the blinding metric, we assumed a 21% absolute increase in reporting incidence over the 10-year interval, compared as two independent proportions [18]. We anticipated a baseline reporting level of 26% in 2005 and a reporting level of 47% in 2015. A total of 109 studies in each year (218 total) would yield 80% power to detect an absolute difference of at least 21% in the proportion of studies reporting this metric as significant.
All power calculations were performed using G*Power, version 3.1.9.2. To maintain a 0.05 significance level across the three outcome metrics, the Bonferroni method for multiple comparisons was used to adjust the alpha to 0.017 (0.05/3).
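These sample sizes can be approximately reproduced with any two-proportion power routine; the sketch below uses statsmodels' normal-approximation solver at the adjusted alpha. This is a sketch only: the normal-approximation result can differ noticeably from G*Power's exact calculation.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Assumed 2005 and 2015 reporting proportions for each metric, as above
assumptions = {
    "power analysis": (0.05, 0.17),
    "randomization":  (0.41, 0.54),
    "blinding":       (0.26, 0.47),
}

solver = NormalIndPower()
for metric, (p_2005, p_2015) in assumptions.items():
    h = proportion_effectsize(p_2015, p_2005)  # Cohen's h effect size
    n = solver.solve_power(effect_size=h, alpha=0.017, power=0.80,
                           alternative="two-sided")
    print(f"{metric}: about {n:.0f} studies per year")
```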