Tackling Overdispersion with SAS

Overdispersion

In statistics, overdispersion is the presence of greater variability (statistical dispersion) in a data set than would be expected based on a given statistical model. Overdispersion is often encountered when fitting very simple parametric models, such as those based on the Poisson distribution and this will usually lead to inflated Type I error rates. The Poisson distribution has one free parameter and does not allow for the variance to be adjusted independently of the mean. The choice of a distribution from the Poisson family is often dictated by the nature of the empirical data. For example, Poisson regression analysis is commonly used to model count data. If overdispersion is a feature, an alternative model with additional free parameters may provide a better fit. In the case of count data, a Poisson mixture model like the negative binomial distribution can be proposed instead, in which the mean of the Poisson distribution can itself be thought of as a random variable drawn – in this case – from the gamma distribution thereby introducing an additional free parameter (note the resulting negative binomial distribution is completely characterized by two parameters).

Example: Extramarital Affairs

Description: Data set includes various social and marital factors along with the count of extramarital affairs for 600 people.

naffairs: Number of affairs within last year.
kids: 1=have children 0=no children.
vryunhap: Very unhappily married (1/0).
unhap: Unhappily married (1/0).
avgmarr: Average married (1/0).
hapavg: Happily married (1.0).
vryhap: Very happily married (1/0).
antirel: Anti-religious (1/0)
notrel: Not religious (1/0).
slghtrel: Slightly religious (1/0).
smerel: Somewhat religious (1/0).
vryrel: Very religious (1/0).
yrsmarr1: Year of marriage > 0.75 yrs (1/0).
yrsmarr2: Year of marriage > 1.5 yrs (1/0).
yrsmarr3: Year of marriage > 4.0 yrs (1/0).
yrsmarr4: Year of marriage > 7.0 yrs (1/0).
yrsmarr5: Year of marriage > 10.0 yrs (1/0).
yrsmarr6: Year of marriage > 15.0 yrs (1/0).

Source: https://vincentarelbundock.github.io/Rdatasets/doc/COUNT/affairs.html
Download the data from here

Task: Design a model for estimating number of affairs.

Let's try PROC GENMOD with poisson distribution option:

							
								PROC GENMOD DATA=tutorial.affairs; 

								CLASS kids--yrsmarr6;

								MODEL naffairs = kids--yrsmarr6 / DIST=POISSON;

								RUN;

Criteria For Assessing Goodness Of Fit
Criterion	DF	Value	Value/DF
Deviance	586	2303.1626	3.9303
Scaled Deviance	586	2303.1626	3.9303
Pearson Chi-Square	586	4050.5305	6.9122
Scaled Pearson X2	586	4050.5305	6.9122
Log Likelihood		-235.1784
Full Log Likelihood		-1398.5763
AIC (smaller is better)		2827.1525
AICC (smaller is better)		2827.9730
BIC (smaller is better)		2893.1314

‘Criteria For Assessing Goodness Of Fit’ section of output suggests that, because value/df for both deviance and Pearson Chi-Square statistics is much higher than 1, Poisson model is not quite adequate to describe the counts of hypotensions and crampings. It also suggests that there is overdispersion.

Negative-binomial distribution is a benchmark model to account for overdisperion on count responses. We can test for overdispersion in a Poisson model by using the DIST=NEGBIN, SCALE=0, and NOSCALE options in the MODEL statement of PROC GENMOD. When used together, these options test whether overdispersion of the form μ+kμ2 exists by testing whether the negative binomial dispersion parameter, k, is zero. When k=0, the negative binomial distribution is equivalent to the Poisson distribution. Let's test whether there is overdispersion in our data.

							
								PROC GENMOD DATA=tutorial.affairs; 

								CLASS kids--yrsmarr6;

								MODEL naffairs = kids--yrsmarr6 / DIST=NEGBIN SCALE=0 NOSCALE;

								RUN;

Lagrange Multiplier Statistics
Parameter	Chi-Square	Pr > ChiSq
Dispersion	259.7461	<.0001	*
* One-sided p-value

Our suspicions were correct; there is overdispersion. There are various distributions to implement instead of Poisson:

Negative binomial
Zero-inflated poisson
Zero-inflated negative binomial
Hurdle poisson
Hurdle negative binomial
Generalized poisson

Another option is account for overdispersion with respect to the Poisson model by introducing a dispersion parameter φ into the relationship between the variance and the mean. For example, the code below uses deviance to account for overdispersion.

							
								PROC GENMOD DATA=tutorial.affairs; 

								CLASS kids--yrsmarr6;

								MODEL naffairs = kids--yrsmarr6 / DIST=POISSON DSCALE;

								RUN;

Criteria For Assessing Goodness Of Fit
Criterion	DF	Value	Value/DF
Deviance	586	2303.1626	3.9303
Scaled Deviance	586	586.000	1.000
Pearson Chi-Square	586	4050.5305	6.9122
Scaled Pearson X2	586	1030.5876	1.7587
Log Likelihood		-59.8371
Full Log Likelihood		-1398.5763
AIC (smaller is better)		2827.1525
AICC (smaller is better)		2827.9730
BIC (smaller is better)		2893.1314

Scaled Pearson X2 is still well above 1, therefore we will try negative binomial distribution next:

							
								PROC GENMOD DATA=tutorial.affairs; 

								CLASS kids--yrsmarr6;

								MODEL naffairs = kids--yrsmarr6 / DIST=NEGBIN;

								RUN;

Criteria For Assessing Goodness Of Fit
Criterion	DF	Value	Value/DF
Deviance	586	339.8629	0.5800
Scaled Deviance	586	339.8629	0.5800
Pearson Chi-Square	586	566.8541	0.9673
Scaled Pearson X2	586	566.8541	0.9673
Log Likelihood		439.3546
Full Log Likelihood		-724.0433
AIC (smaller is better)		1480.0865
AICC (smaller is better)		1481.0180
BIC (smaller is better)		1550.4640

Parameter estimates (not shown) tells us that only unhap, antirel, notrel, slghtrel, yrsmarr1, yrsmarr2, yrsmarr3 are significant. We run our model with only these variables:

							
								PROC GENMOD DATA=tutorial.affairs; 

								CLASS kids--yrsmarr6;

								MODEL naffairs = unhap antirel notrel slghtrel yrsmarr1 yrsmarr2 yrsmarr3 / DIST=NEGBIN;

								RUN;

Criteria For Assessing Goodness Of Fit
Criterion	DF	Value	Value/DF
Deviance	593	338.5604	0.5709
Scaled Deviance	593	338.5604	0.5709
Pearson Chi-Square	593	530.2979	0.8943
Scaled Pearson X2	593	530.2979	0.8943
Log Likelihood		435.7320
Full Log Likelihood		-727.6659
AIC (smaller is better)		1473.3317
AICC (smaller is better)		1473.6363
BIC (smaller is better)		1512.9191

Algorithm converged.

Analysis Of Maximum Likelihood Parameter Estimates
Parameter		DF	Estimate	Standard Error	Wald 95% Confidence Limits		Wald Chi-Square	Pr > ChiSq
Intercept		1	0.6383	0.9363	-1.1968	2.4735	0.46	0.4954
unhap	0	1	-1.0441	0.3624	-1.7544	-0.3339	8.30	0.0040
antirel	0	1	-1.3962	0.4489	-2.2760	-0.5164	9.67	0.0019
notrel	0	1	-1.1257	0.3141	-1.7413	-0.5101	12.85	0.0003
slghtrel	0	1	-1.2156	0.3302	-1.8629	-0.5683	13.55	0.0002
yrsmarr1	0	1	1.3507	0.4769	0.4160	2.2853	8.02	0.0046
yrsmarr2	0	1	1.6580	0.3850	0.9034	2.4126	18.54	<.0001
yrsmarr3	0	1	0.8858	0.3404	0.2187	1.5529	6.77	0.0093
Dispersion		1	7.0132	0.7824	5.6358	8.7272

Let's also use SAS graphics to see some of the effects of these parameters. Prior to this we need to reformat the data set and create new columns:

							
								DATA DATA=tutorial.affairs; 

								SET tutorial.affairs;

								IF vryunhap = 1 THEN  happiness = 'Very unhappy';

								ELSE IF unhap = 1 THEN  happiness = 'Unhappy';

								ELSE IF hapavg = 1 THEN  happiness = 'Happy';

								ELSE IF vryhap = 1 THEN  happiness = 'Very happy';

								ELSE IF avgmarr = 1 THEN  happiness = 'Very unhappy';

								IF antirel = 1 THEN  religiosity = 'Anti religious';

								ELSE IF notrel = 1 THEN  religiosity = 'Not religious';

								ELSE IF slghtrel = 1 THEN  religiosity = 'Slightly religious';

								ELSE IF smerel = 1 THEN  religiosity = 'Somewhat religious';

								ELSE IF vryrel = 1 THEN  religiosity = 'Very religious';

								IF yrsmarr1 = 1 THEN  years = '0.75';

								ELSE IF yrsmarr2 = 1 THEN  years = '1.5';

								ELSE IF yrsmarr3 = 1 THEN  years = '4';

								ELSE IF yrsmarr4 = 1 THEN  years = '7';

								ELSE IF yrsmarr5 = 1 THEN  years = '10';

								ELSE IF yrsmarr6 = 1 THEN  years = '15';

								cnt=1;

								RUN;

Now let's use a few graphs to understand these variables:

							
								PROC SGPLOT DATA=tutorial.affairs; 

								VBAR religiosity / RESPONSE=cnt STATSUM;

								XAXIS VALUES =('Anti religious' 'Not religious' 'Slightly religious' 'Somewhat religious' 'Very religious');

								RUN;
								
								PROC SGPLOT DATA=tutorial.affairs; 

								VBAR happiness / RESPONSE=cnt STATSUM;

								XAXIS VALUES =('Very unhappy' 'Unhappy' 'Average' 'Happy' 'Very happy');

								RUN;
								
								PROC SGPLOT DATA=tutorial.affairs; 

								VBAR years / RESPONSE=cnt STATSUM;

								XAXIS VALUES =('0.75' '1.5' '4' '7' '10' '15');

								RUN;