Researchers have used several statistical models for count data. The Poisson model has been the most popular and is widely used as the basis for modeling counts. Count variables may, however, display substantial variation, which has led researchers to use alternative models (Cox et al., 1983; Cohen, West and Aiken, 2003; Holford et al., 1983). This variation can appear as overdispersion or underdispersion in the data. Such problems are usually analyzed within the generalized linear models framework (Ver Hoef & Boveng, 2007). Several count data models have been proposed to accommodate overdispersion and underdispersion. Depending on the extent to which the observed data are dispersed, the Negative Binomial, Poisson-Inverse Gaussian and Quasi-Poisson models, each of which has more parameters than the Poisson model, have become the most frequently used alternatives.
1.1 Background of the Study
Assessing the predictive performance of a model, and providing suitable measures of the uncertainty associated with its predictions, is one of the major purposes of statistical analysis. Count data record the occurrences of an event in a given time period and arise in ecological, environmental, climatological, epidemiological, demographic and economic settings; in analyzing such data there is a need to assess the predictive performance of the models used (Christensen and Waagepetersen, 2002; Gotway and Wolfinger, 2003). A count variable takes non-negative integer values, that is, zero or a positive integer, recording how often an event occurs in a fixed time period. Examples of count variables that researchers have modeled include the number of alcoholic drinks consumed per day (Armeli et al., 2005), the number of cigarettes smoked by adolescents (Siddiqui, Mott, Anderson & Flay, 1999) and harbor seal counts from aerial surveys (Ver Hoef & Boveng, 2007).
According to Erlander et al. (1972), when analyzing the outcome of a random experiment it is important to note that the observations are affected by external variables. This calls for models that can accommodate these effects so that the analysis is fair. A regression model is appropriate because the external variables, called predictors, are assumed to influence the mean of the observed random variable. In this study the outcome variable is the number of Malaria cases, and the effects are measured by Sex, Age, Year, Month, Educational level and Residence status over a given period of time. The expected number of Malaria cases therefore depends on the predictor variables, and knowing this mean helps in understanding how Malaria cases develop. Since the dependent variable is discrete and non-negative, it is assumed to be Poisson distributed, which makes the Poisson model the first-line choice.
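The regression structure described above can be sketched in the standard log-linear form. The symbols are notation assumed here for illustration, not taken from the study: Y_i is the Malaria count in class i, x_{i1}, ..., x_{ip} are the coded predictors, t_i is an exposure term, and the beta_j are regression coefficients.

```latex
Y_i \mid \mathbf{x}_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \log t_i + \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}
```

Under this form each coefficient acts multiplicatively on the expected count, so exp(beta_j) can be read as a rate ratio for a one-unit change in the j-th predictor.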
1.2 Research Problems
It is important, however, to note the limitations of the Poisson model for count data. As the basic count data model, the Poisson model has a single parameter that expresses both the mean and the variance. In most cases individual counts show substantial variability, so that the actual variance is larger than the variance the model assumes. This is typical of count data with excess zeros or non-independent observations, which complicate estimation and bias inference. The resulting extra-Poisson variation, called overdispersion, produces undesirable results such as biased standard errors, unreliable significance tests and misleading conclusions (Gardner, Mulvey, & Shaw, 1995). Because of these problems, statistical researchers have over the years used different modeling approaches for count data in order to obtain a better assessment of predictive performance (Pepe, 2003; Jolliffe and Stephenson, 2003; Clement, 2005; Hammer & Landau, 1981; Cox et al., 1983; Alkin & Gallop, 2007; Cohen, West and Aiken, 2003).
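The single-parameter restriction discussed above can be stated compactly. This is the textbook form of the Poisson distribution; the symbol mu for the mean is notation assumed here.

```latex
P(Y = y) = \frac{e^{-\mu}\,\mu^{y}}{y!}, \quad y = 0, 1, 2, \dots,
\qquad \mathrm{E}(Y) = \mathrm{Var}(Y) = \mu
```

Overdispersion is then the case Var(Y) > E(Y), and underdispersion the case Var(Y) < E(Y).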
The Negative Binomial model, the standard generalization of the Poisson model, was introduced by Greenwood and Yule (1920) to account for apparent contagion effects arising from unobserved heterogeneity. It is the first-line alternative count data model because the Poisson assumption that the mean equals the variance is too restrictive, and the Negative Binomial model relaxes it. It has two parameters, a mean and a dispersion parameter, so that variation beyond the Poisson variance can be captured. When overdispersion is ignored, the variance of the estimated parameters may be underestimated. For example, in the analysis of infectious disease count data the Negative Binomial model has been favoured for its flexibility and the less strict assumptions of its gamma component (Hofmann et al., 2006; Held, Höhle et al., 2005). In addition, Whitaker (1914) and Cameron and Trivedi (1998) support the use of the Negative Binomial model to address overdispersion in count data.
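As a sketch of how the gamma component relaxes the Poisson restriction, the Negative Binomial can be written as a Poisson whose rate varies across units according to a gamma distribution. The parameterization below (mean mu, shape k) is one common convention and is an assumption here; the study may use a different one.

```latex
Y \mid \lambda \sim \mathrm{Poisson}(\lambda), \quad
\lambda \sim \mathrm{Gamma}\!\left(\text{shape } k,\ \text{mean } \mu\right)
\;\Longrightarrow\;
\mathrm{E}(Y) = \mu, \qquad \mathrm{Var}(Y) = \mu + \frac{\mu^{2}}{k}
```

The variance exceeds the mean for every finite k, and the Poisson model is recovered as k grows without bound.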
Another count data model, the Quasi-Poisson model, is also widely used. It is similar to both the Poisson and the Negative Binomial models, but its variance is a linear function of the mean: a dispersion parameter multiplied by the mean gives the variance, so the variance is not restricted to equal the mean as in the Poisson model. The following studies motivated the choice of the Quasi-Poisson model in this study. Ver Hoef and Boveng (2007) preferred the Quasi-Poisson model to the Negative Binomial model for overdispersed harbor seal data. Luma et al. (2014) added that the Quasi-Poisson approach was the preferred choice when modeling an accident hazard index for urban road segment data.
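The role of the dispersion parameter can be illustrated with a minimal sketch for an intercept-only model. The function name and the toy counts are hypothetical and are not taken from the study data; the dispersion is estimated with the usual Pearson chi-square divided by the residual degrees of freedom.

```python
import math


def quasi_poisson_intercept_only(counts):
    """Mean, dispersion and standard errors for an intercept-only log-link model."""
    n = len(counts)
    mu_hat = sum(counts) / n  # fitted mean, identical to the Poisson estimate
    # Pearson chi-square: squared residuals scaled by the Poisson variance (mu_hat)
    pearson = sum((y - mu_hat) ** 2 / mu_hat for y in counts)
    phi_hat = pearson / (n - 1)  # dispersion = Pearson chi-square / residual df
    se_poisson = math.sqrt(1.0 / (n * mu_hat))  # SE of log(mu_hat) under Poisson
    se_quasi = se_poisson * math.sqrt(phi_hat)  # quasi-Poisson inflates the SE
    return mu_hat, phi_hat, se_poisson, se_quasi


# Hypothetical toy counts, chosen only to show overdispersion
toy_counts = [0, 1, 0, 2, 9, 0, 3, 14, 1, 0]
mu_hat, phi_hat, se_poisson, se_quasi = quasi_poisson_intercept_only(toy_counts)
print(f"mean={mu_hat:.2f} dispersion={phi_hat:.2f}")
print(f"SE Poisson={se_poisson:.3f} SE quasi-Poisson={se_quasi:.3f}")
```

The point estimate of the mean is unchanged; only the standard error is scaled by the square root of the estimated dispersion, which is why the Quasi-Poisson and Poisson models give the same coefficients but different inference.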
Having considered the Poisson, Negative Binomial and Quasi-Poisson models, it is also reasonable to look at the Poisson-Inverse Gaussian (PIG) model, which is suited to highly dispersed count data. The PIG model can have a fixed or a varying dispersion parameter. According to Vincent Moshi Ouma et al. (2016), there is clear evidence that the inverse Gaussian mixing distribution of the PIG model is more flexible than the gamma distribution of the Negative Binomial model when dealing with overdispersion in infectious disease count data. Count data sets characterized by too many zeros or too much dispersion cannot be analyzed efficiently by the Negative Binomial model (Stein et al., 1987). In such infectious disease data the Poisson-Inverse Gaussian model has been reported to provide a better fit than the Negative Binomial model.
1.3 Objective of the Study
The argument above shows that there is a gap in knowledge about which model is best for modeling count data.
The purpose of this study is to compare the Poisson model with alternative count data models, namely the Negative Binomial, Poisson-Inverse Gaussian and Quasi-Poisson models.
1.4 Significance of the Study
Given the shortcomings and restrictive assumption of the basic model, and the substantial variation and other challenges of count data, there is a need to introduce the Negative Binomial, Poisson-Inverse Gaussian and Quasi-Poisson models, which are also members of the generalized linear model family (Agresti et al., 2007; Cox et al., 1983; Engel et al., 1984; Lawless et al., 1987). Because we have the opportunity to use a data set from Africa, the findings of this study will throw more light on the performance of the Poisson model and its alternatives on count data. The study will show how the choice of count data model affects the regression coefficients and will compare the performance of the models.
1.5 Research Structure
The study will review the specific problems that may occur when the Poisson model is used to model count data with low means. We will also observe the outcome for variables with high counts, where the Poisson model produces undesirable results. We will test for overdispersion in the Poisson model against the alternative count data models in order to assess and compare the adequacy of the models. We will then introduce and discuss the alternatives, namely the Negative Binomial, Poisson-Inverse Gaussian and Quasi-Poisson models, which are more appropriate for overdispersed count data.
For model evaluation, information-theoretic criteria, namely the Akaike Information Criterion (AIC; Akaike, 1973) and the Bayesian Information Criterion (BIC; Schwarz, 1978), will be used to ascertain which of the models best fits the data. These criteria depend on the likelihood, the distributional form, or both. The study will use monthly recorded malaria cases from the Records Department of Mampong Government Hospital, Ghana, for the 4-year period 2009-2012. The SAS and R statistical packages will be used for the data analysis, including an examination of the non-normally distributed variables to check whether the data set meets the assumptions of linear regression. These are:
• Linearity: the relationship between the predictors and the response variable should be linear.
• Normality: the errors should be normally distributed.
• Homogeneity of variance: the error variance should be constant.
• Independence: the errors associated with different observations should be independent and uncorrelated.
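The information criteria named for model evaluation can be sketched as follows, using the standard definitions AIC = 2k - 2 log L and BIC = k log(n) - 2 log L, where k is the number of estimated parameters and n the number of observations. The helper names and the toy counts are hypothetical and serve only as an illustration on an intercept-only Poisson fit.

```python
import math


def poisson_loglik(counts, mu):
    """Log-likelihood of i.i.d. Poisson(mu) counts."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in counts)


def aic(loglik, k):
    """Akaike Information Criterion for a model with k parameters."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian Information Criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * loglik


# Hypothetical toy counts; the Poisson maximum-likelihood mean is the sample mean
toy_counts = [0, 1, 0, 2, 9, 0, 3, 14, 1, 0]
mu_mle = sum(toy_counts) / len(toy_counts)
ll = poisson_loglik(toy_counts, mu_mle)
print(f"logLik={ll:.2f} AIC={aic(ll, 1):.2f} BIC={bic(ll, 1, len(toy_counts)):.2f}")
```

Smaller values indicate a better trade-off between fit and complexity. Both criteria require a full likelihood, so they apply directly to the Poisson, Negative Binomial and Poisson-Inverse Gaussian models; the Quasi-Poisson model specifies only a mean-variance relationship and has no likelihood of its own to plug in.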
1. Data Description
The data set used in the analysis consists of monthly reported malaria cases, based on the number of suspected malaria patients who visited Mampong Government Hospital over the four-year period from 2009 to 2012, and was supplied by the Records Department of the hospital. The hospital is the referral point for 74 communities, including Mampong, the municipal capital. The data have not been used extensively in any previous study. Table 1 shows the predictor variables and their classes for the Malaria counts and the exposure. There were 2 cross-classified observations. The 3760 classes with zero exposure and one missing observation were removed from the data set. The data set then included 6799 observed Malaria count cases over the 4 years from 2009 to 2012.
Table 2 summarizes the distribution of the data set, giving the skewness, the kurtosis and the variance-to-mean ratio. The table shows evidence of overdispersion: the skewness is 2.7 and the variance-to-mean ratio is 13.4, which indicates high overdispersion. As a rule of thumb supported by Zha et al. (2014), a distribution is highly skewed when the absolute value of its skewness is larger than 1; the positive value recorded in Table 2 therefore indicates a highly right-skewed distribution.
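The summary statistics discussed above can be computed as in the following minimal sketch. The function name and the toy counts are hypothetical and do not reproduce the values in Table 2; skewness is taken as the moment coefficient (third central moment divided by the 1.5 power of the second), which is one of several conventions.

```python
def summarize_counts(counts):
    """Mean, sample variance, variance-to-mean ratio and moment skewness."""
    n = len(counts)
    mean = sum(counts) / n
    m2 = sum((y - mean) ** 2 for y in counts) / n  # second central moment
    m3 = sum((y - mean) ** 3 for y in counts) / n  # third central moment
    variance = m2 * n / (n - 1)  # unbiased sample variance
    return {
        "mean": mean,
        "variance": variance,
        "vmr": variance / mean,  # equals 1 under a Poisson model
        "skewness": m3 / m2 ** 1.5,
    }


# Hypothetical toy counts, not the Malaria data
toy_counts = [0, 1, 0, 2, 9, 0, 3, 14, 1, 0]
summary = summarize_counts(toy_counts)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A variance-to-mean ratio well above 1 together with a skewness above 1 is the pattern described for the Malaria counts, and it is the informal signal that the Poisson assumption of equal mean and variance is unlikely to hold.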