The aim of our study is to demonstrate how data analysis techniques can be used to address the issues of data reduction, prediction and explanation using online available public health data, in order to provide a sound basis for informing public health policy. In relation to this aim, our main methodological result is a set of procedures that involves reducing the set of public health indicators and analysing the importance of predictors by prediction and/or explanation. Our main substantive result is the identification of a small set of predictors of suicide rate which can be considered in public health policy-making.

Here, we first discuss the trade-off between predictive power and interpretability, followed by our results from a methodological perspective. We then proceed with a discussion of our substantive results in terms of suicide predictors. Next, we discuss informatics challenges of public health data. Finally, we present recommendations and future work regarding analysis of public health complex data from our findings.

### Trade-off between predictive power and interpretability

Our results demonstrate the need to make informed decisions about the approach to take in modelling. In the *prediction* approach, as predictors are added to the model, the model fit in terms of variance explained in the outcome will normally increase, but never decrease. However, statistical supervised-learning techniques such as multiple regression penalise the addition of poor predictors in two ways. First, poor predictors are by definition not statistically significant (e.g., as evaluated by the *t*-ratio for each regression parameter). Second, adding poor predictors reduces the model's improvement in predicting the outcome relative to the model's inaccuracy (as evaluated by the *F*-ratio).
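
These two penalties can be made concrete with a small numerical sketch. The following Python fragment uses synthetic data (not the study's indicators; all variable names are ours) to fit a multiple regression with one strong and one deliberately poor predictor, then computes the per-coefficient *t*-ratios and the overall-model *F*-ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: one strong predictor and one deliberately poor (pure-noise) one.
n = 200
x_strong = rng.normal(size=n)
x_poor = rng.normal(size=n)
y = 2.0 * x_strong + rng.normal(size=n)

X = np.column_stack([np.ones(n), x_strong, x_poor])   # intercept + 2 predictors
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta
p = X.shape[1] - 1                                    # number of predictors
sigma2 = resid @ resid / (n - p - 1)                  # residual variance
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_ratios = beta / se                                  # per-coefficient t-ratios

r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
f_ratio = (r2 / p) / ((1 - r2) / (n - p - 1))         # overall-model F-ratio

print(t_ratios)    # the poor predictor's |t| should be far below the strong one's
print(r2, f_ratio)
```

Adding further noise predictors leaves *R*² essentially unchanged while eroding the *F*-ratio, which is the second penalty described above.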

Stepwise multiple regression (Tables 4 and 5) uses statistical rules to avoid the problem caused by adding poor predictors. However, this has two potentially undesirable consequences. First, as before, the models are less likely to be generalisable across samples [11]; in other words, they are more likely not to generalise between public health data sets. Second, the results may be difficult to interpret, as the analyst has no control over the entry of predictors and their order of entry into the final model. For example, when new predictors are added to improve model fit in analyses for prediction, existing predictors may suffer from reversal paradoxes such as suppression [45]. The remedy is to use substantive knowledge to assist in variable selection and specify a theoretically credible model [45]. Therefore, even in data analysis with automated procedures (e.g., automated construction of predictor variables, [35]), a domain expert needs to take part to ensure a meaningful analysis [37]. Moreover, Rudin [37] warns against attempting to explain ‘black-box models’ – seen as inherently ‘non-interpretable’ in their original form – through ‘explainable’ model versions, because this ‘is likely to perpetuate bad practices and can potentially cause catastrophic harm to society’ (p. 1). Instead, the proposed solution is to create models that are interpretable to start with. Another consideration is that complex ‘black-box models’ do not always outperform simpler (interpretable) models [20].

In the *explanatory* approach, the analyst has full control over the entry of predictors and their order of entry into the final model. In addition, the analyst has the responsibility to specify a priori a model to be tested, or to specify different models to be tested against each other (Table 6). This specification is based on theory or pragmatic considerations (such as potential for intervention). The advantage of this approach is the promise of cumulative science, building on existing theory and results of theory-testing, to gain a continually increasing understanding of the outcome that is being studied (e.g., suicide) and, based on this, policy decision-making. Testing models against each other allows us to rule out certain explanations for behaviour and support other explanations. An advantage of analyses for explanation is that their results can be interpreted in the framework of the relevant theories of which the models are instantiations. In contrast, the results from analysis for prediction are based on statistical criteria and therefore do not have this advantage; moreover, the results may not be generalisable.

In sum, predictive research aims to produce the most powerful model to predict outcome data from available predictor data. However, because this analysis is atheoretical it can produce results that are not generalisable and difficult to interpret. Explanatory research tests an a priori model or tests alternative models against each other, with the aim of theoretical understanding. Although this supports cumulative science and interpretability of results as a basis for policy decision-making, it does not necessarily maximise predictive power. Explanatory research is important to test theories and develop a coherent body of theoretical knowledge. In disciplines where theory is scarce and data are plentiful, predictive research can help develop causal theory as a basis for subsequent explanatory research [41].

### Methods

From a methods perspective, the main findings of our data analyses and associated considerations are as follows. The square-root and logarithmic transformations substantially improved the distributions of the dependent variable (suicide rate) and some predictors. Moreover, data transformations substantially improved the distribution of residuals from all regression analyses. Multi-collinearity analysis was effective in identifying, and subsequently removing, redundant variables for multiple regression. A further benefit of multi-collinearity analysis, beyond reducing the predictor set, is that reversal paradoxes such as suppression [45] become less likely. PCA was effective in further reducing the suicide predictor variables to a three-dimensional solution with interpretable components. Although PCA and exploratory factor analysis are unsupervised learning techniques, confirmatory factor analysis [43] offers supervised learning to test the significance and generalisability of factor structures. This could be beneficial for testing the generalisability of, for example, higher-order predictors (such as relatedness dysfunction) of suicide in public health data.
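
As an illustrative sketch of these two data-reduction steps (synthetic data; the `vif` helper is our own, not the routine used in the study), the fragment below computes variance inflation factors to flag a redundant predictor and then inspects the variance explained by the principal components:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic predictors: x2 is nearly a copy of x1 (collinear); x3 is independent.
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2_j)."""
    out = []
    for j in range(X.shape[1]):
        yj = X[:, j]
        Z = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, yj, rcond=None)
        r = yj - Z @ beta
        r2 = 1 - (r @ r) / ((yj - yj.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

vifs = vif(X)
print(vifs)       # x1 and x2 are flagged as redundant (large VIF); x3 stays near 1

# PCA via the correlation matrix: proportion of variance per component.
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
explained = eigvals[::-1] / eigvals.sum()
print(explained)  # the collinear pair collapses into one dominant component
```

A common rule of thumb treats VIF values above 5 or 10 as signalling problematic collinearity, after which one of the offending variables can be dropped before regression.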

After multi-collinearity analysis, stepwise regression to predict suicide rate was effective in reducing the predictor set further to four statistically significant predictors. Stepwise regression using the component scores of principal-component analysis to predict suicide rate was effective at reducing the predictor set further to two statistically significant components. Stepwise linear regression analysis is advantageous in identifying the smallest set of predictors. Nonetheless, it requires assumptions [43], such as a linear model and normality of variable distributions, which may not be appropriate for all data sets. However, non-linear regression allows other functional forms, and bootstrapping provides a distribution-free alternative to significance testing.
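
A minimal forward-stepwise sketch (synthetic data; the entry threshold, helper names and greedy rule are illustrative, not the exact procedure or software used in the study) shows how statistical rules reduce a candidate set:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: the outcome depends on 2 of 6 candidate predictors.
n, k = 250, 6
X = rng.normal(size=(n, k))
y = 1.5 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=n)

def rss(cols, X, y):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return r @ r

def forward_stepwise(X, y, f_enter=8.0):
    """Greedily add the predictor with the largest partial F, while it exceeds f_enter."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        rss_cur = rss(selected, X, y)
        best_j, best_f = None, 0.0
        for j in remaining:
            rss_new = rss(selected + [j], X, y)
            df = len(y) - (len(selected) + 1) - 1
            f = (rss_cur - rss_new) / (rss_new / df)   # partial F-to-enter for j
            if f > best_f:
                best_j, best_f = j, f
        if best_j is None or best_f < f_enter:
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

print(forward_stepwise(X, y))   # should recover columns 0 and 3
```

The `f_enter` threshold plays the role of the statistical entry rule discussed above; lowering it admits weaker predictors and increases the risk of a model that does not generalise.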

Other techniques to consider for reducing the predictor set include public health expert opinion in variable selection, grouping of variables into larger groups and automated statistical methods for linear model selection and regularization [20]. The latter include subset selection methods (e.g., best subset selection), shrinkage methods (e.g., ridge regression and the lasso) and ‘integrated’ dimension reduction methods (principal components regression and partial least squares). All these methods are integrated in the sense that, in contrast to data analysis in the current study, they do not separate (automated) data reduction and (automated) model testing.
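
Of the shrinkage methods mentioned, ridge regression has a closed form that can be sketched directly. The fragment below (synthetic collinear data; a deliberately simple implementation, not a production routine) shows the penalty pulling unstable coefficients towards zero:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two highly collinear predictors make the OLS coefficients unstable.
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge on centred data: (X'X + lam*I)^-1 X'y (intercept not penalised)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)

beta_ols = ridge(X, y, 0.0)     # lam = 0 recovers ordinary least squares
beta_ridge = ridge(X, y, 10.0)  # the penalty shrinks the coefficient vector

print(beta_ols, beta_ridge)
```

The lasso behaves similarly but can shrink coefficients exactly to zero, which performs variable selection and model fitting in one integrated step, as noted above [20].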

Theory-based hierarchical regression for explanation was effective at establishing moderation (by statutory homelessness) of the effect of a predictor variable (low happiness) on suicide rate. Intervention-oriented standard regression for explanation was effective at establishing two significant predictors related to the universal human need of relatedness in social-care services. In addition to the assumptions of stepwise regression analysis, regression analysis for explanation also requires the analyst to specify one or more a priori models, based on domain knowledge. Expected pay-offs are model generalisability and cumulative science.
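
The moderation test can be sketched as a hierarchical comparison of models with and without an interaction term. In the synthetic fragment below, `x` and `m` merely stand in for a predictor (e.g., low happiness) and a moderator (e.g., statutory homelessness); the data and effect sizes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic moderation: the effect of x on y depends on the moderator m.
n = 400
x = rng.normal(size=n)
m = rng.normal(size=n)
y = 0.5 * x + 0.3 * m + 0.8 * x * m + rng.normal(size=n)

def r2(Z, y):
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return 1 - (r @ r) / ((y - y.mean()) ** 2).sum()

# Hierarchical steps: main effects first, then add the interaction term.
Z_main = np.column_stack([np.ones(n), x, m])
Z_int = np.column_stack([Z_main, x * m])

delta_r2 = r2(Z_int, y) - r2(Z_main, y)  # variance uniquely due to moderation
print(delta_r2)
```

A non-trivial increment in *R*² when the product term enters at the final hierarchical step is the standard evidence for moderation.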

The methods that were presented in this research were specifically applied to data analysis with multiple regression. However, these methods may be applicable to statistical learning and machine learning more generally.

### Predictors of suicide

From a substantive perspective, our data analysis produced the following results and related considerations. The findings of the regression analysis indicate that evidence or a history of self-harm could be used as an important indicator for targeted interventions to reduce suicide. This result supports a previous meta-analysis that established prior non-suicidal self-injury as a top-five predictor of suicide attempt [13]. However, this is correlational evidence between suicide and self-harm at the unitary-authority level, and stronger evidence would be provided if data at the individual level were available for analysis. Specifically, the prominence of self-harm as a predictor of suicide may be partially or fully an artefact. At the individual level (for which no data were available in the data set that was analysed), suicide cases and self-harm cases may be quite distinct, with few or limited connections. For example, those who commit suicide may not engage in self-harm, and those who engage in self-harm often do not commit suicide.

In the intervention-based regression for explanation, both social care users’ social contact need fulfilment and carers’ need fulfilment were significant suicide predictors. These represent and provide further evidence for the universal human need of relatedness as a requirement for human thriving [40] in social care.

### The informatics challenges of public health data

While public health data have great potential to shape public health policy, there are several informatics challenges that, if not considered, may introduce bias into the decision-making process or have practical implications for policy delivery. Two main challenges are (1) a practical one, data quality, and (2) a person-centred one, public health leadership. Regarding data quality, the available data may be insufficiently detailed or impossible to disaggregate to allow policy decisions to be made. For example, if data related to age, gender or social class (or other moderating or mediating variables) are unavailable, targeting services at those most in need, or most likely to benefit, will be difficult to achieve. Furthermore, given the range of services that contribute to public health, integrating datasets can be difficult.

Alongside data-related issues, workforce issues are also a key component in the use of health informatics: developing effective policies through the use of public health informatics data is, ultimately, down to public health leadership. Given the increased downward pressure on public health budgets, it is necessary to improve understanding among policymakers and commissioners of services of how such data can be used, as well as of the questions that public health data can (and cannot) answer, as part of a wider move towards the implementation of information systems that can support public health functions [9].

While public health informatics continues to expand in areas including surveillance and workforce issues, in other areas, such as communication and coordination, the field remains relatively under-developed. Without greater coordination between services and data, silos are likely to persist in public health information systems [27]. In response to the need for a more systemic approach, population health informatics is a growing topic among developed countries; it takes a broader view and targets not only the total population (as public health informatics does), but also target populations, provider organisations and healthcare systems [22].

### Recommendations and future work

Based on our findings, we present the following recommendations for future work. Effect size and its interpretation are an important consideration in regression modelling and classification [7]. Effect sizes should also routinely be interpreted in the analysis of suicide data and Fingertips data more generally. Moreover, minimum or worthwhile effect sizes have an important role to play as input into statistical inference regarding obtained effects, in techniques such as minimum-effect tests [30] and magnitude-based inference [19]. The use of worthwhile effect sizes as input to inference should also routinely be considered in the analysis of suicide data and public health data more generally.
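
For regression models, one widely used effect-size metric is Cohen's *f*², computed directly from the model's *R*². A minimal helper (our own naming, shown only to make the metric concrete):

```python
def cohens_f2(r_squared):
    """Cohen's f^2 effect size for a regression model: R^2 / (1 - R^2)."""
    return r_squared / (1.0 - r_squared)

# Cohen's conventional benchmarks: 0.02 small, 0.15 medium, 0.35 large.
print(cohens_f2(0.26))  # about 0.35, i.e. a 'large' effect by convention
```

Reporting such a value alongside significance tests gives readers a sense of whether a statistically significant predictor is also practically worthwhile.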

Although regression techniques proved to be effective in the current study for the analysis of public health data, further data analysis techniques should be considered in future work. For predictive research, these include statistical learning techniques for prediction such as decision trees and random forests [21], support vector machines [5, 20], gradient boosting [35] and neural networks [17]. It is important to note that these techniques suffer from some of the same problems as stepwise regression analysis, in particular a potential lack of model generalisability [15] and a potential lack of interpretability [46]. Moreover, because their loss functions are similar, the results of the support vector classifier and logistic regression can often be highly similar [20]. For explanatory research, techniques to consider also include minimum-effect tests [30], magnitude-based inference [19] and Bayesian regression [24]. These further techniques can complement or replace regression techniques, depending on the aim of data analysis.

Specifically, first, mediation analysis can be used to provide evidence for the causal process (the ‘why’) of the treatment effect [18, 26]. To gain a better understanding of public health outcomes (e.g., suicide) from a process perspective, analysts should identify potential mediators in their models and then conduct appropriate mediation analysis. For example, further analysis can be carried out, using self-harm as a mediator, to better understand the factors influencing self-harm and thereby indirectly suicide.
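
A common way to estimate such an indirect effect is the product-of-coefficients approach with a percentile bootstrap [18]. The sketch below uses synthetic data with a known mediated path (helper names are ours; in the application above, `m` would be self-harm):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic mediation chain x -> m -> y with a known indirect effect a*b = 0.3.
n = 300
x = rng.normal(size=n)
m = 0.6 * x + rng.normal(size=n)             # path a = 0.6
y = 0.5 * m + 0.2 * x + rng.normal(size=n)   # path b = 0.5, direct effect 0.2

def paths(x, m, y):
    """OLS slopes for path a (x -> m) and path b (m -> y, controlling for x)."""
    a = np.polyfit(x, m, 1)[0]
    Z = np.column_stack([np.ones(len(x)), m, x])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return a, beta[1]

a, b = paths(x, m, y)
indirect = a * b                             # point estimate of the indirect effect

# Percentile-bootstrap confidence interval for the indirect effect.
boot = []
for _ in range(1000):
    idx = rng.integers(0, n, n)
    ab = paths(x[idx], m[idx], y[idx])
    boot.append(ab[0] * ab[1])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(indirect, (lo, hi))  # an interval excluding 0 supports mediation
```

The bootstrap is preferred here over normal-theory tests because the product of two coefficients is generally not normally distributed.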

Moderation analysis can be used to provide evidence for the conditions under which (the ‘when’) a treatment effect exists [18]. To gain a better understanding of public health outcomes (e.g., suicide) from the perspective of boundary conditions, analysts should identify potential moderators in their models and then conduct appropriate moderation analysis (see, e.g., Table 6). The combination of mediation and moderation analysis (conditional process analysis; [18]) can provide further insights into the conditions (moderation) under which the mechanisms (mediation) that explain (suicide or other) outcomes operate. For example, this analysis can establish whether the mediated effect of a suicide prevention intervention is moderated by baseline score (the conditions under which mediation occurs).

Second, future work could use time-series data analysis to identify local authorities that have shown (positive or negative) significant change in suicide rate in recent years. Recommendations could then be made to conduct field work to investigate the causes of this change and possible interventions.
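
Such screening could, for instance, fit a linear trend per authority and flag large trend *t*-ratios. The sketch below uses synthetic yearly rates for two hypothetical authorities (all numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic yearly suicide rates for two hypothetical authorities:
# one with no built-in trend, one rising by 0.5 per year.
years = np.arange(2010, 2020)
flat = 10 + rng.normal(scale=0.3, size=len(years))
rising = 10 + 0.5 * (years - 2010) + rng.normal(scale=0.3, size=len(years))

def trend_t(years, rates):
    """Slope of a linear trend over time and its t-ratio."""
    t = years - years.mean()
    slope = (t @ (rates - rates.mean())) / (t @ t)
    resid = rates - rates.mean() - slope * t
    se = np.sqrt((resid @ resid) / (len(t) - 2) / (t @ t))
    return slope, slope / se

print(trend_t(years, flat))     # t-ratio driven by noise only
print(trend_t(years, rising))   # clearly positive t-ratio
```

Authorities whose trend *t*-ratios exceed a chosen threshold would then be candidates for the follow-up field work described above.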

Third, our data analysis was at the level of the local authority. However, the predictors of public-health outcomes may vary across different levels of analysis (for example, general medical practice, unitary authority and region). Therefore, future work should identify available data at different levels and analyse the data accordingly in an integrated fashion through multi-level analysis [8], allowing increasingly complex models to be tested.
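
Why the level of analysis matters can be illustrated with a classic aggregation effect. In the synthetic two-level data below (five hypothetical authorities; all numbers invented), the pooled slope and the average within-authority slope point in opposite directions:

```python
import numpy as np

rng = np.random.default_rng(7)

# Two-level synthetic data: within each authority the x-y slope is negative,
# but authority means rise together, so a single-level (pooled) analysis
# reverses the sign of the effect.
groups, per = 5, 40
xs, ys = [], []
for g in range(groups):
    gx = 2.0 * g + rng.normal(size=per)                       # x around the group mean
    gy = 2.0 * g - 0.8 * (gx - 2.0 * g) + rng.normal(scale=0.2, size=per)
    xs.append(gx)
    ys.append(gy)
x, y = np.concatenate(xs), np.concatenate(ys)

def slope(x, y):
    return np.polyfit(x, y, 1)[0]

pooled = slope(x, y)                                          # ignores the grouping
within = float(np.mean([slope(xs[g], ys[g]) for g in range(groups)]))
print(pooled, within)  # opposite signs: the level of analysis changes the answer
```

Multi-level models avoid this pitfall by estimating within-group and between-group effects explicitly rather than pooling across levels.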

In the analysis of suicide behaviour, the currently available data set allows no meaningful joint analysis of suicide indicators and demographic variables. This is because, first, a breakdown by demographics (gender, age) was not available for some indicators. Second, the breakdown was inconsistent among the remaining variables (e.g., different age brackets were used for different indicators). Therefore, the current analysis did not include demographics. Accordingly, a recommendation for future data collection is that data are consistently broken down by demographics and recorded in public health data sets.

Public health interventions to reduce suicide (e.g., men’s sheds; [49]) may influence outcomes (e.g., suicide rate). However, the current data sets do not include information about interventions (e.g., type of intervention, target population, duration). Future work should therefore collect data on interventions and integrate these with the data that are already collected, in a way that facilitates evidence-based analysis of theory-based interventions [29].