Socio-environmental predictors of diabetes incidence disparities in Tanzania mainland: a comparison of regression models for count data

Background Diabetes is one of the top four non-communicable diseases that cause death and illness among many people around the world. This study aims to use an efficient count data model to estimate socio-environmental factors associated with diabetes incidence in Tanzania mainland, addressing the lack of evidence on which count data model is efficient for estimating factors associated with disparities in disease incidence. Methods This study analyzed diabetes counts collected in 2020 from 184 councils in Tanzania mainland. The study applied generalized Poisson (GP), negative binomial (NB), and Poisson count data models and evaluated their adequacy using information criteria and Pearson chi-square values. Results The data were over-dispersed, as evidenced by the mean and variance values and the positively skewed histograms. The results revealed an uneven distribution of diabetes incidence across geographical locations, with northern and urban councils having more cases. Factors such as population, GDP, and the number of hospitals were associated with diabetes counts. The GP model performed better than the NB and Poisson models. Conclusion The occurrence of diabetes can be attributed to geographical location. To address this public health issue, environmental interventions can be implemented. Additionally, the generalized Poisson model is an effective tool for analyzing health information system count data across different population subgroups.


Background
To date, non-communicable diseases (NCDs), including diabetes, remain a global health challenge affecting people of all ages; however, elderly people are at higher risk [1,2]. In 2016, statistics showed that NCDs were responsible for 80% of all deaths worldwide. The risk of death from NCDs is notably higher in Sub-Saharan Africa, Central Asia, and Eastern Europe [3]. In Tanzania, as elsewhere in Sub-Saharan Africa, there is evidence of a high prevalence of NCDs, including diabetes [4,5].
The emergence of NCDs in humans is influenced by a complex combination of factors, including environmental conditions, cultural beliefs, self-management, socio-demographic factors, genetics, and biology [1]. These diseases are sometimes referred to as behavioural diseases because, apart from other factors, self-management, which is linked to a person's behavioural practices in daily life, can increase the likelihood of developing NCDs [1,6]. Cultural norms and values can also influence human behaviour, resulting in regional and national variations in the prevalence of NCDs [7]. This study aims to investigate the impact of socio-environmental factors, namely the council's zone and residence, along with other factors, on the total diabetes incidence in the council. The link between environment and human behaviour is well explained in some behavioural theories, including the reciprocal determinism concept of social cognitive theory [8].
Currently, many scenarios in public health and official statistics involve count data. Examples include the number of cases of a specific disease reported in a particular geographical unit and the total number of fatalities occurring within a given timeframe. The Poisson model is a well-known method for modeling count data and has been applied in many situations [9-14]. However, it assumes that events occur randomly and at a constant rate, which implies that the mean equals the variance; this is often unrealistic in real-life situations. When data exhibit over-dispersion, the negative binomial (NB) model is often used as an alternative to the Poisson model [11,15,16]. Occasionally, under-dispersion also occurs in count data, especially for rare events. To tackle this issue, researchers have developed models that can handle count data exhibiting over-, under-, or equal dispersion. These models were obtained by generalizing the Poisson model or mixing it with other distributions. Examples are the Generalized Poisson (GP) [17], the Weighted Poisson, the Conway-Maxwell-Poisson (CMP), the Hyper-Poisson (HP) [18], the Extended Bi-parametric Waring (EBW) [19], and the Complex Tri-parametric Pearson (CTP) [20].
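As an informal illustration of the equal mean and variance assumption, the sketch below computes the variance-to-mean ratio (dispersion index) of a set of counts. Values near 1 are consistent with a Poisson model, values well above 1 suggest over-dispersion, and values below 1 suggest under-dispersion. The function name `dispersion_index` and both example lists are hypothetical and are not taken from the study data.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return the sample variance divided by the sample mean of the counts."""
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return variance(counts) / m


# Hypothetical council-level counts (illustrative only, not study data).
roughly_equidispersed = [0, 1, 2, 3, 4]
over_dispersed = [0, 0, 1, 2, 50]

print("ratio for first list:", dispersion_index(roughly_equidispersed))
print("ratio for second list:", dispersion_index(over_dispersed))
```

A ratio is only a descriptive check; it does not replace the model-based comparison reported later in the paper.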
Many of the distributions mentioned above have complex functional forms, which can lead to significant computational challenges and make them difficult to use. For this reason, the GP model was selected for this study. This model has a well-defined functional form and allows straightforward parameter estimation [21-24]. The GP model is a suitable option for health and behavioural studies for several reasons. First, the population being studied is not uniform: individuals tend to cluster or aggregate into groups with similar characteristics. Second, observations are dependent because of environmental factors: high incidences of diabetes occur within the same geographic area owing to similar socio-cultural factors. Both features cause unequal dispersion in the data [15,21].
Numerous studies on NCDs, including diabetes, have been conducted [4,5,25-29]. However, none have utilized the GP model or quantified the role of socio-environmental factors (such as zone and council residence) in NCD occurrence in mainland Tanzania. This study aims to establish a model that can be adopted for modeling NCD counts associated with socio-environmental and other risk factors. Hence, it emphasizes environment-based approaches to eradicating NCDs in Tanzania, and the model can also be adopted in similar scenarios. Many research articles elaborate on the application of GP regression in modeling over-dispersed data [24,30-35]. However, those articles do not describe or quantify how underestimation of the standard error occurs when the standard Poisson model is used for over-dispersed data, as the current study does.

Design and settings
This study utilized a cross-sectional research design. Secondary data collected in 2020 by the District Health Information System (DHIS2) and the National Bureau of Statistics (NBS) were used for analysis. The response variable is the number of patients diagnosed with diabetes mellitus and admitted to all health facilities within a council, except for regional referral and zonal hospitals. The data cover 184 councils in Tanzania mainland.

Models description
Generalized linear models (GLMs) extend linear models (LMs) to response variables that are not normally distributed. In GLMs, the response variable can be a count, categorical, ordinal, or of many other types, as long as its distribution belongs to the exponential family. This family includes several well-known distributions, such as the Poisson distribution and its generalizations, the binomial distribution, and the gamma distribution. A GLM can be described by the following equation:

$$g(\mu_i) = X_i^{T}\beta$$

It is characterized by three components: (1) a random component, which describes the outcome variable $Y_i$ of the i-th observation by its probability density function; (2) a linear component $X_i^{T}\beta$, where $X_i^{T}$ is the vector of predictors and $\beta$ is a column vector of model coefficients; and (3) a differentiable link function $g(\mu)$, which relates the mean of the response variable to the linear function of the predictor variables [9,36].

The Poisson process is often used to explain the variation of count data around a predicted average [9]. However, the model rests on certain assumptions, including that the data are equally dispersed, that is, the mean must always equal the variance. Poisson regression is a well-known model for the means of n non-negative count responses $y_1, y_2, \ldots, y_n$. Let $Y_i$, taking values $0, 1, 2, \ldots$, be the response variable representing the number of diabetic patients admitted in a specific council in 2020, and let $X_i' = (X_{1i}, \ldots, X_{ki})$ be the k-dimensional vector of predictors associated with the response variable. The Poisson regression of the response variable given the predictors is written as:

$$P(Y_i = y_i \mid x_i) = \frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!}, \quad y_i = 0, 1, 2, \ldots$$

The logarithm of the likelihood of the equation above can be written as:

$$\ln L(\mu) = \sum_{i=1}^{n}\left[y_i\ln\mu_i - \mu_i - \ln(y_i!)\right]$$

By substituting $\mu_i = e^{x_i'\beta}$, we obtain the logarithm of the likelihood function in terms of the $\beta$'s:

$$\ln L(\beta) = \sum_{i=1}^{n}\left[y_i x_i'\beta - e^{x_i'\beta} - \ln(y_i!)\right]$$

Maximum likelihood estimates of the $\beta$'s can be obtained by differentiating the logarithm of the likelihood with respect to the $\beta$'s and setting the results equal to zero.
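The maximum likelihood step can be sketched numerically. The code below evaluates the Poisson log-likelihood in terms of the coefficients and solves the score equations by Newton-Raphson for the simplest case of an intercept and one predictor. Newton-Raphson is only one possible solver and is an assumption here, since the paper does not state which algorithm was used; the function names and the toy data are hypothetical.

```python
import math


def poisson_loglik(beta, xs, ys):
    """Sum of y_i*eta_i - exp(eta_i) - ln(y_i!) with eta_i = b0 + b1*x_i."""
    b0, b1 = beta
    total = 0.0
    for x, y in zip(xs, ys):
        eta = b0 + b1 * x
        total += y * eta - math.exp(eta) - math.lgamma(y + 1)
    return total


def fit_poisson(xs, ys, iters=50, tol=1e-10):
    """Newton-Raphson for a Poisson regression with intercept and one slope."""
    # Start from the intercept-only estimate: log of the sample mean.
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            mu = math.exp(b0 + b1 * x)
            g0 += y - mu            # score for the intercept
            g1 += (y - mu) * x      # score for the slope
            h00 += mu               # entries of the information matrix
            h01 += mu * x
            h11 += mu * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Toy data (hypothetical): a binary predictor with group means 2 and 6.
xs = [0, 0, 1, 1]
ys = [1, 3, 4, 8]
b0, b1 = fit_poisson(xs, ys)
print("intercept:", b0, "slope:", b1)
print("log-likelihood at the estimate:", poisson_loglik((b0, b1), xs, ys))
```

With a single binary predictor the fitted means equal the two group means, which gives a simple way to check the solver by hand.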
Thus, the Poisson regression model for the mean parameter $\mu_i$ is written as

$$\ln(\mu_i) = x_i'\beta \qquad (1)$$

Because the data are collected from councils across various geographical locations, including areas with differing behavioural patterns, there is a high probability of unequal dispersion; that is, the data may be under- or over-dispersed. If Poisson regression is used to model such data, it could lead to incorrect conclusions because the standard errors are misestimated: they tend to be underestimated under over-dispersion and overestimated under under-dispersion [15,37]. The negative binomial model, also known as the Poisson-gamma mixture, is considered a better alternative to Poisson regression when dealing with over-dispersed count data. Its variance is a quadratic function of its mean, which is why it is named NB2 [11].
The NB model was formulated as an extension of the Poisson model by allowing for the possibility that the modeled outcomes do not occur at a constant rate, which leads to heterogeneity in the outcomes. The extended model can be formulated as follows. A negative binomial distribution is generated by a series of Bernoulli trials with a constant success probability p. Let Y be the number of failed attempts before the k-th success (k > 0); then Y follows a negative binomial distribution with probability mass function (pmf):

$$P(Y = y) = \binom{y + k - 1}{y} p^{k}(1 - p)^{y}, \quad y = 0, 1, 2, \ldots \qquad (2)$$

The mean and variance of Y are $k(1-p)/p$ and $k(1-p)/p^{2}$, respectively. In the negative binomial regression model, the interest is in modeling the mean of the outcome variable Y, with realizations $y_1, y_2, \ldots, y_n$, where $X_i' = (X_{1i}, \ldots, X_{ki})$ denotes the vector of predictors. Parametrizing Eq. (2) in terms of $\mu$ and the dispersion parameter $\alpha$ yields the NB regression model described below. Let $p = \alpha/(\alpha + \mu_i)$, where $\alpha = k$; furthermore, it is known that $\binom{y_i + \alpha - 1}{y_i} = \frac{\Gamma(y_i + \alpha)}{y_i!\,\Gamma(\alpha)}$. Then the pmf of Y in Eq. (2) can be written as:

$$P(Y_i = y_i) = \frac{\Gamma(y_i + \alpha)}{y_i!\,\Gamma(\alpha)}\left(\frac{\alpha}{\alpha + \mu_i}\right)^{\alpha}\left(\frac{\mu_i}{\alpha + \mu_i}\right)^{y_i} \qquad (3)$$

where $\Gamma$ represents the gamma function and $\alpha$ is a dispersion index that takes positive values only. The NB distribution can also be obtained as a Poisson-gamma mixture. Then $Y \sim NB(\mu, \alpha)$, and the mean and variance of Y are $\mu_i$ and $\mu_i + \mu_i^{2}/\alpha$, respectively. When $\alpha \to \infty$, the mean and variance of Y tend to be equal, which implies that the Poisson model is a special case of the negative binomial model [9,11,15,36,38].
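The likelihood of Eq. (3) is proportional to:

$$L(\beta, \alpha) \propto \prod_{i=1}^{n}\frac{\Gamma(y_i + \alpha)}{\Gamma(\alpha)}\left(\frac{\alpha}{\alpha + \mu_i}\right)^{\alpha}\left(\frac{\mu_i}{\alpha + \mu_i}\right)^{y_i}$$

It is known that:

$$\frac{\Gamma(y_i + \alpha)}{\Gamma(\alpha)} = \prod_{j=0}^{y_i - 1}(j + \alpha)$$

It follows that:

$$\ln\frac{\Gamma(y_i + \alpha)}{\Gamma(\alpha)} = \sum_{j=0}^{y_i - 1}\ln(j + \alpha)$$

and the log-likelihood, up to an additive constant, is given by:

$$\ln L(\beta, \alpha) = \sum_{i=1}^{n}\left[\sum_{j=0}^{y_i - 1}\ln(j + \alpha) + \alpha\ln\alpha + y_i\ln\mu_i - (\alpha + y_i)\ln(\alpha + \mu_i)\right]$$

It is known that $\mu_i = e^{x_i'\beta}$. Estimates of the regression coefficients $\beta$ and the dispersion index $\alpha$ are obtained by substituting this expression into the log-likelihood, differentiating it with respect to the $\beta$'s and $\alpha$, and setting the results equal to zero.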
Then, the negative binomial regression model can be written as:

$$\ln(\mu_i) = x_i'\beta$$

The NB model cannot be used to model equally dispersed or under-dispersed data. Previous findings show that the NB model faces convergence issues if it is inappropriately used to model count data that do not exhibit over-dispersion [38].
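As a numerical check on the NB2 parametrization, the sketch below evaluates the pmf with mean mu and dispersion parameter alpha, that is, Gamma(y + alpha) / (y! Gamma(alpha)) times (alpha / (alpha + mu))^alpha times (mu / (alpha + mu))^y, and sums it over a long range of counts to recover the total mass, the mean, and the variance mu + mu^2 / alpha. The function names, the truncation point, and the parameter values are hypothetical choices for illustration.

```python
import math


def nb_pmf(y, mu, alpha):
    """NB2 probability P(Y = y) for mean mu > 0 and dispersion alpha > 0."""
    log_p = (math.lgamma(y + alpha) - math.lgamma(alpha) - math.lgamma(y + 1)
             + alpha * math.log(alpha / (alpha + mu))
             + y * math.log(mu / (alpha + mu)))
    return math.exp(log_p)


def moments(pmf, upper=500):
    """Approximate total mass, mean, and variance by summing over 0..upper-1."""
    probs = [pmf(y) for y in range(upper)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, var


mu, alpha = 5.0, 2.0  # hypothetical values, not estimates from the study
total, mean, var = moments(lambda y: nb_pmf(y, mu, alpha))
print("total mass:", total)
print("mean:", mean, "target:", mu)
print("variance:", var, "target:", mu + mu ** 2 / alpha)
```

Working on the log scale with `math.lgamma` avoids overflow in the gamma functions for large counts. Increasing alpha in this sketch moves the variance toward the mean, which is the Poisson limit described above.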
Many articles use the latest generalizations of count models; however, the GP model remains beneficial and user-friendly [39]. It can model stochastic processes whose count data are equally, under-, or over-dispersed. Moreover, estimating its parameters is simple compared with other generalized models. For these reasons, this study employs the model introduced by Consul and Jain [17,40]. Let $Y_i$ represent the inpatient diabetes incidence recorded in a council in 2020. Then $Y_i$ is the response variable, with observed values $y_1, y_2, \ldots, y_n$ associated with several explanatory variables. $Y_i$ follows a GP distribution, and its probability mass function can be written as:

$$P(Y_i = y_i) = \frac{\alpha(\alpha + \delta y_i)^{y_i - 1}e^{-(\alpha + \delta y_i)}}{y_i!}, \quad y_i = 0, 1, 2, \ldots$$
Suppose the explanatory variables are represented by the (K − 1)-dimensional vector $X_i' = (X_{1i}, \ldots, X_{ki})$. The conditional distribution of $Y_i$ for a given value of $x_i$ follows a GP distribution with the mean value given by:

$$E(Y_i \mid x_i) = \mu_i = C_i f(x_i, \beta)$$

where $f(x_i, \beta) > 0$ represents a differentiable function, $C_i$ represents a measure function, and $\beta$ is the K-dimensional vector of regression parameters.
From the mean of the GP distribution, $\mu = \alpha/(1-\delta)$, and the dispersion factor $\vartheta = 1/(1-\delta)$, the generalized Poisson regression (GPR) model can be deduced as

$$P(Y_i = y_i \mid x_i) = \frac{\mu_i\left[\mu_i + (\vartheta - 1)y_i\right]^{y_i - 1}}{y_i!}\,\vartheta^{-y_i}\exp\left(-\frac{\mu_i + (\vartheta - 1)y_i}{\vartheta}\right), \quad y_i = 0, 1, 2, \ldots$$

with $P(Y_i = y_i \mid x_i) = 0$ for $y_i > m$ when $\vartheta < 1$. Here $\vartheta$ stands for the square root of the dispersion index, and m is the largest positive integer for which $\mu + m(\vartheta - 1) > 0$ when $\vartheta < 1$.

When $\vartheta = 1$, the GP distribution reduces to the standard Poisson regression (appropriate for equally dispersed data); when $\vartheta > 1$, GPR is appropriate for over-dispersed data; and when $\vartheta < 1$, GPR is used to fit under-dispersed data [34,40].

Similar to the standard Poisson regression model, GPR uses a log link to connect the mean of the response variable and the explanatory variables, as shown below:

$$\ln(\mu_i) = x_i^{T}\beta$$

where $\mu_i = \mu(x_i) = \alpha/(1-\delta)$ is the mean, $x_i^{T}$ represents the (k − 1)-dimensional vector of explanatory variables, and $\beta$ is the k-dimensional vector of regression parameters.
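The behaviour of the dispersion factor can be checked numerically. The sketch below evaluates the GP pmf written in terms of the mean mu and the dispersion factor (called `theta` in the code), namely mu (mu + (theta - 1) y)^(y - 1) theta^(-y) exp(-(mu + (theta - 1) y) / theta) / y!, and confirms that it reduces to the Poisson pmf when theta is 1 and that, for theta above 1, the mean stays at mu while the variance is theta squared times mu. The function names, the truncation point, and the parameter values are hypothetical.

```python
import math


def gp_pmf(y, mu, theta):
    """Generalized Poisson probability P(Y = y) for mean mu and dispersion factor theta."""
    base = mu + (theta - 1.0) * y
    if base <= 0.0:
        # Only reachable when theta < 1: counts beyond the bound m get probability 0.
        return 0.0
    log_p = (math.log(mu) + (y - 1) * math.log(base) - y * math.log(theta)
             - base / theta - math.lgamma(y + 1))
    return math.exp(log_p)


def moments(pmf, upper=1000):
    """Approximate total mass, mean, and variance by summing over 0..upper-1."""
    probs = [pmf(y) for y in range(upper)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, var


mu, theta = 5.0, 2.0  # hypothetical values, not estimates from the study
total, mean, var = moments(lambda y: gp_pmf(y, mu, theta))
print("total mass:", total)
print("mean:", mean, "target:", mu)
print("variance:", var, "target:", theta ** 2 * mu)
```

The early return for a non-positive base implements the truncation used for under-dispersed data; in that case the probabilities need not sum exactly to one, so the check above is run only for theta above 1.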
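In this study, the diabetes count in each council in 2020 is the response variable, regressed on the explanatory variables reported in the Results: council population, number of health facilities, percentage of people living with HIV, GDP per capita, percentage of male diabetes patients, council residence (rural or urban), and council zone. The model can be written as:

$$\ln(\mu_i) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki}$$

Then,

$$\mu_i = \exp\left(\beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki}\right)$$

The model coefficients $\beta$ were estimated by the maximum likelihood method. Additionally, the goodness of fit of the GP model relative to the NB and Poisson models was evaluated using AIC, AICc, and BIC.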
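The three information criteria can be computed from a model's maximized log-likelihood, its number of estimated parameters k, and the sample size n, using the standard definitions AIC = 2k − 2 ln L, BIC = k ln n − 2 ln L, and AICc = AIC + 2k(k + 1)/(n − k − 1). The sketch below is a minimal illustration: n = 184 is the number of councils, while the log-likelihood values and parameter counts are hypothetical placeholders, not results from Table 4.

```python
import math


def information_criteria(loglik, k, n):
    """Return AIC, BIC, and AICc for a model with k parameters fitted to n observations."""
    aic = 2 * k - 2 * loglik
    bic = k * math.log(n) - 2 * loglik
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
    return {"AIC": aic, "BIC": bic, "AICc": aicc}


n_councils = 184
# Hypothetical log-likelihoods and parameter counts, for illustration only.
# The GP and NB fits carry one extra dispersion parameter compared with Poisson.
candidates = {
    "Poisson": (-5000.0, 12),
    "NB": (-1010.0, 13),
    "GP": (-1005.0, 13),
}
for name, (loglik, k) in candidates.items():
    print(name, information_criteria(loglik, k, n_councils))
```

Smaller values indicate a better trade-off between fit and complexity, so the comparison is made within each criterion across models fitted to the same data.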

Results
The histogram in Fig. 1 describes the dispersion of diabetes incidence across councils. The plot indicates a strong positive skew, with many small counts, including zeros, and few large counts, suggesting over-dispersion of diabetes incidence in both age groups, namely ages 5 to 59 and 60 or older. This is common in disease incidence datasets, since disease severity is sometimes triggered by behavioural patterns that vary among the subpopulations sampled, leading to unequal dispersion. Since the data reveal unequal dispersion, the GP model may give precise estimates with meaningful inference [19,20,24,39,40].
The beeswarm plots in Fig. 2 display diabetes records by geographical location. The plots indicate a concentration of diabetes in areas with similar traits. Categories are arranged in ascending order of the number of diabetes cases reported. Councils in the northern zone have more diabetes cases than those in the other zones, while councils in the southern zone have fewer. Furthermore, councils in high-count zones record fewer zeros and low counts than councils in low-count zones. Additionally, there is a marked difference in diabetes records between rural and urban councils: rural councils record many zeros and small counts, while urban councils record substantially larger numbers of diabetes cases.
Fig. 1 Histogram showing diabetes counts per age group

Tables 1 and 2 summarize the association between diabetes count categories and categorical predictors in Tanzania mainland for patients aged 5-59 years and 60 years and above, respectively. The dataset consists of diabetes records from 184 councils, with minimum and maximum recorded counts of 0 and 958, respectively, across the two age groups. The chi-square test of independence was used to test for a statistically significant association between diabetes count and two categorical predictors representing environmental factors: council residence (rural or urban) and council zone (northern, eastern, lake, southern, southern highland, western, and central) in the datasets for the two age groups. In both cases, the p-value is less than 0.001, indicating an association among the categories. Moreover, Tables 1 and 2 show larger counts of diabetes incidence among councils in urban areas than among councils in rural areas for diabetes patients aged 5-59 years. The large number of diabetes records among people aged 5-59 years indicates a high chance of premature mortality and morbidity due to diabetes, contrary to Sustainable Development Goal target 3.4. For the age group of 60 years and older, 60.8% of councils in rural areas recorded fewer than 50 diabetes patients, whereas 8.7% of councils in urban areas recorded fewer than 50 patients. Moreover, in the southern zone, none of the councils recorded more than 200 diabetes cases in either age group. Table 3 describes the log of expected diabetes counts as a function of selected predictor variables using the GP model (at the top of the table), the negative binomial model (in the middle of the table), and the standard Poisson model (at the bottom of the table).
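The chi-square test of independence applied to Tables 1 and 2 compares observed cell counts with the counts expected if the row and column classifications were unrelated. The sketch below computes the test statistic and its degrees of freedom for a contingency table; the function name and the example table are hypothetical and do not reproduce the counts in Tables 1 and 2. Converting the statistic to a p-value requires the chi-square distribution, which is left out to keep the example within the standard library.

```python
def chi_square_independence(table):
    """Return the Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Hypothetical table: rows are residence (rural, urban),
# columns are count categories (fewer than 50 cases, 50 or more cases).
example = [[10, 20],
           [30, 40]]
statistic, df = chi_square_independence(example)
print("chi-square:", statistic, "degrees of freedom:", df)
```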
Based on the GP model's results, the number of diabetes cases in each council is influenced by its population. Thus, more populated councils are anticipated to have more cases of diabetes than less populated ones. The log of the expected diabetes count in a council would be expected to increase by 0.2264 (p-value < 0.0001) when the council's population increases by one unit. The number of health facilities is significantly associated with the number of diabetes cases in the councils. This may be because the availability of health facilities accelerates disease tracking and recording. A one-unit increase in the number of health facilities in a council increases the log of expected diabetes counts by 0.0132 (p-value < 0.0001). Moreover, the results suggest a positive association between the percentage of people living with HIV and diabetes incidence in the council. Conversely, GDP per capita shows a significantly negative association with the log of expected diabetes counts in the councils. This implies that diabetes cases occur more often in councils with lower GDP per capita. On the other hand, there is no significant association between the percentage of male diabetes patients and the number of diabetes cases in the councils.
Predictors representing environmental factors are significantly associated with diabetes counts in the councils. When other model covariates are held constant, the log of diabetes counts is predicted to be 1.172 (p-value < 2e-16) larger for councils located in urban areas than for those in rural areas. Compared to the northern zone, councils located in the central, eastern, lake, southern, and southern highlands zones have lower log diabetes counts, with coefficients of −0.8480 (p-value = 7.72e-05), −0.6483 (p-value = 0.00024), −0.7265 (p-value = 1.17e-06), −0.8064 (p-value = 3.51e-05), and −0.7467 (p-value = 4.78e-05), respectively. This is because councils in the northern zone contribute more diabetes cases than the other zones. Additionally, there is no significant difference in log diabetes cases between councils in the western zone and those in the northern zone.

There are slight differences between the estimates and standard errors (SEs) obtained by the GP and NB models, resulting in different inferences for the western zone category. The GP model shows that this category's contribution to the log of diabetes counts did not differ from that of the northern zone. In contrast, the NB model shows a log diabetes count that is significantly lower, by 0.6395 (p-value = 0.00899), than in the northern zone when other factors in the model are kept constant. Additionally, the results in Table 3 indicate that the SEs in the Poisson model were underestimated, as the values were visibly smaller than those obtained in the GP and NB models. This occurs because the Poisson model cannot handle the over-dispersion present in the analyzed datasets. Underestimating SEs leads to incorrect inferences about some predictors and factors.
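Because the models use a log link, a coefficient is a difference in the log of expected counts, and exponentiating it gives a multiplicative ratio of expected counts with the other covariates held fixed. The sketch below applies this to two reported GP coefficients: 1.172 for urban versus rural councils and −0.8480 for the central zone versus the northern zone. The function name `rate_ratio` is hypothetical, and the calculation is an interpretive aid rather than part of the original analysis.

```python
import math


def rate_ratio(coefficient):
    """Convert a log-link regression coefficient into a ratio of expected counts."""
    return math.exp(coefficient)


urban_vs_rural = rate_ratio(1.172)
central_vs_northern = rate_ratio(-0.8480)

print("urban vs rural ratio of expected counts:", round(urban_vs_rural, 3))
print("central vs northern ratio of expected counts:", round(central_vs_northern, 3))
print("percent change for central zone:", round(100 * (central_vs_northern - 1), 1))
```

On this reading, the expected count in an urban council is roughly 3.2 times that of a comparable rural council, and a council in the central zone is expected to record a little under half the count of a comparable northern council.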
The GP model finds one predictor variable (percentage of males hospitalized for diabetes) and one category (western zone) to be non-significant, whereas the NB model finds only the percentage of males hospitalized for diabetes to be non-significant. In the Poisson model, by contrast, all predictors are deemed significant. This shows how the GP model excels in controlling over-dispersion and producing precise estimates compared to the NB and Poisson models.
Based on the GP regression results in Table 3, a prediction equation for the log of the average diabetes count is obtained from the estimated coefficients, and its antilogarithm gives the expected number of diabetes cases. Table 4 gives the goodness-of-fit results obtained using different information criteria. The results show that the GP model attains the smallest information criteria values, which means that it outperforms the NB and Poisson models on the data used. Moreover, the values obtained by the NB and GP models differ only slightly, which may indicate that the two models perform similarly on over-dispersed data. However, the major difference between them is that the GP model is appropriate for equally, over-, and under-dispersed data, while the NB model is used only for over-dispersed data.
Table 4 also reports Pearson chi-square (Pearson-χ²) and Pearson-χ²/DF values for the GP, NB, and Poisson models. A Pearson-χ²/DF value greater than one indicates over-dispersion, and a value equal or close to one indicates that over-dispersion is well controlled. The GP model has the value closest to one among the three models, making it the best choice for modeling the over-dispersed diabetes count data.
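The Pearson statistic sums the squared differences between observed and fitted counts, each divided by the variance the model assigns to that observation, and the result is divided by the residual degrees of freedom (number of observations minus number of estimated parameters). The sketch below is a minimal illustration with a pluggable variance function: the mean itself for the Poisson model, mu + mu^2 / alpha for the NB model, and the squared dispersion factor times mu for the GP model. The function name, the toy counts, the fitted means, and the dispersion values are hypothetical.

```python
def pearson_dispersion(ys, mus, n_params, variance=lambda mu: mu):
    """Return the Pearson chi-square and its ratio to the residual degrees of freedom."""
    chi2 = sum((y - mu) ** 2 / variance(mu) for y, mu in zip(ys, mus))
    df = len(ys) - n_params
    return chi2, chi2 / df


# Hypothetical observed counts and fitted means (not study data).
ys = [0, 2, 15, 40, 3, 90]
mus = [4.0, 6.0, 10.0, 25.0, 8.0, 60.0]

alpha = 1.5   # hypothetical NB dispersion parameter
theta = 2.5   # hypothetical GP dispersion factor

print("Poisson:", pearson_dispersion(ys, mus, n_params=2))
print("NB:", pearson_dispersion(ys, mus, n_params=3,
                                variance=lambda mu: mu + mu ** 2 / alpha))
print("GP:", pearson_dispersion(ys, mus, n_params=3,
                                variance=lambda mu: theta ** 2 * mu))
```

The same fitted means give a smaller ratio once the variance function allows for extra dispersion, which is why the ratio is read as a sign of how well a model accounts for over-dispersion.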

Discussion
This paper suggests utilizing the GP model to model socio-environmental and other risk factors associated with diabetes incidence in Tanzania mainland. The GP model's performance was compared to that of the NB and Poisson models, as these three models are related. The NB model is obtained through a parametrization known as the Poisson-gamma mixture, which can model over-dispersed data that the standard Poisson model cannot. Additionally, the NB model reduces to the Poisson model when the dispersion parameter tends to infinity. Similarly, the GP model was obtained as a limit of the NB model and can model over-, under-, and equally dispersed count data. Like the NB model, the GP model reduces to the Poisson model, in this case when its dispersion parameter equals zero. These models belong to the GLM category and are widely used in analyzing the relationship between a response variable that follows an exponential family distribution and one or more predictor variables. Linear models are a specific type of GLM with an identity link function [9,41,42]. The link function transforms the mean of the response variable so that it can be connected to a linear combination of the predictor variables.

This study's findings reveal that the unequal burden of diabetes cases is associated with the type of council residence. Both descriptive and inferential analyses show that urban areas have more diabetes cases than rural areas, probably due to differences in lifestyle between the two. Urban areas showed a strong positive contribution to diabetes cases, supporting the view that environmental factors, including urbanization, are a significant risk factor for diabetes and other NCDs [5,43]. The findings also show a significant difference in the predicted log of diabetes cases among zones. This indicates heterogeneity of the burden across socio-environmental attributes. The northern zone, the reference category, appears to make a large contribution, causing the estimated coefficients for the log of diabetes counts in the other zones to be negative. According to the GP model, the western zone did not differ significantly from the northern zone in the log of diabetes cases. This finding is related to those of Stanifer et al. [44], who observed that hypertension is environmentally clustered since people living together share socio-cultural norms such as eating habits, crops produced, and other behavioural patterns that affect NCDs.
The study also investigated the contribution of other factors to diabetes cases, and the findings revealed that an increased log of diabetes counts is also associated with the council's population and the number of health facilities. In contrast, GDP at market price is negatively associated with the log of diabetes counts. This indicates that diabetes is also more common in low-income societies. Several researchers have observed a high NCD rate in low- and middle-income countries (LMICs), which aligns with these findings [3]. On the other hand, the total number of patients who attended hospitals for HIV care is not associated with diabetes cases. This result differs from that obtained by Castilho et al. [45]. The percentage of male diabetes cases is not significantly related to total diabetes cases in the councils. This factor is used to measure the contribution of sex to diabetes incidence, as other studies concluded that the prevalence of NCDs is higher among males than females in Africa [3]. There is also empirical evidence of a high economic burden caused by NCDs among poor households in Tanzania [46]. This study's findings reveal a higher burden of diabetes among councils with low GDP, which may increase poverty, contrary to Sustainable Development Goal 1.
The GP model performs better than both the NB and traditional Poisson regression models based on the log-likelihood value, AIC, BIC, AICc, and Pearson-χ² values. The model achieves the lowest value on all information criteria, suggesting that GP is better at controlling over-dispersion among diabetes counts than its competitors. To determine the dispersion of the data, one can also divide the Pearson chi-square value by its degrees of freedom. This value should be close to or equal to 1 for equally dispersed data in the Poisson model. In this study, the value of Pearson-χ²/DF for the Poisson model is far greater than 1, indicating the presence of over-dispersion. The problem is well handled by the GP model.

Conclusion
It is crucial to consider the variability of count data when conducting statistical modeling. Ignoring it can lead to incorrect estimates of the standard error, which in turn affect the test statistic and p-value. The dispersion of the data should therefore be examined to avoid incorrect inferences when modeling count data.
Hence, this study recommends the use of the GP model in modeling risk factors associated with disease counts, specifically for data collected among population subgroups with varying social and environmental characteristics. The model can accommodate count data with equal or unequal dispersion collected in such subgroups. The model is advantageous because it does not impose a heavy computational burden, does not suffer from convergence issues, and gives precise results compared to the more commonly applied NB and Poisson models.

Limitations of the study
The data in DHIS2 are recorded for very broad age groups which hinders further comparison regarding disease incidences.Additionally, the system does not include important patient information, which also limits model variables.

Fig. 2
Beeswarm plots of the distribution of diabetes counts within councils by environmental location

Table 1
Distribution of diabetes counts for patients aged 5-59 years within councils and associated environmental predictors

Table 2
Distribution of diabetes counts for patients aged 60 years and above within councils and associated environmental predictors

Table 3
Model fit results from generalized Poisson, negative binomial, and standard Poisson models

Table 4
Information criteria