Correlation with SAS PROC CORR

PROC CORR

The CORR procedure computes Pearson correlation coefficients, three nonparametric measures of association, polyserial correlation coefficients, and the probabilities associated with these statistics.

Example: Galton's data on the heights of parents and their children

Description: Galton (1886) presented these data in a table, showing a cross-tabulation of 928 adult children born to 205 fathers and mothers, by their height and their mid-parent's height. He visually smoothed the bivariate frequency distribution and showed that the contours formed concentric and similar ellipses, thus setting the stage for correlation, regression and the bivariate normal distribution.

parent: height of the mid-parent (average of father and mother)
child: height of the child.

Source: https://vincentarelbundock.github.io/Rdatasets/doc/HistData/Galton.html
The data can be downloaded from the source page above.
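Once the CSV file has been downloaded, it needs to be loaded into the tutorial.galton data set that the examples on this page use. The following is a minimal sketch, not the only way to do it: the library location and the file path are hypothetical placeholders, and the CSV may contain an extra row-index column that can simply be ignored.

```sas
/* Hypothetical paths: adjust both to your own environment. */
LIBNAME tutorial "/path/to/tutorial";

/* Read the CSV into tutorial.galton, taking variable names from the header row. */
PROC IMPORT DATAFILE = "/path/to/Galton.csv"
    OUT = tutorial.galton
    DBMS = CSV
    REPLACE;
    GETNAMES = YES;
RUN;

/* Confirm that parent and child were read as numeric variables. */
PROC CONTENTS DATA = tutorial.galton;
RUN;
```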

Task: Are the heights of parents and their children related?

One way to see whether two variables are related is to check their correlation. Correlation is a bivariate analysis that measures both the strength and the direction of the association between two variables. The correlation coefficient ranges from -1 to +1. A value of ±1 indicates a perfect association between the two variables, and the closer the coefficient is to 0, the weaker the relationship. The sign of the coefficient gives the direction: a + sign indicates a positive relationship and a - sign indicates a negative relationship.
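To make the two extremes concrete, here is a small sketch on made-up data (the data set work.toy and the variables x, y_pos and y_neg are hypothetical and are not part of the Galton data). y_pos rises exactly in step with x and y_neg falls exactly in step with x, so PROC CORR should report coefficients of 1 and -1 respectively.

```sas
/* Toy data: y_pos = 2*x (perfect positive), y_neg = -2*x (perfect negative). */
DATA work.toy;
    INPUT x y_pos y_neg;
    DATALINES;
1 2 -2
2 4 -4
3 6 -6
4 8 -8
;
RUN;

/* Correlate x with each of the two y variables; NOSIMPLE hides the summary table. */
PROC CORR DATA = work.toy NOSIMPLE;
    VAR x;
    WITH y_pos y_neg;
RUN;
```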

In SAS, correlation analysis is run with PROC CORR. Let's start with a basic analysis:

								
	PROC CORR DATA = tutorial.galton;
		VAR parent child;
	RUN;

The CORR Procedure

2 Variables:	parent child

Simple Statistics
Variable	N	Mean	Std Dev	Sum	Minimum	Maximum
parent	928	68.30819	1.78733	63390	64.00000	73.00000
child	928	68.08847	2.51794	63186	61.70000	73.70000

Pearson Correlation Coefficients, N = 928
Prob > |r| under H0: Rho=0
	parent	child
parent	1.00000	0.45876
		<.0001
child	0.45876	1.00000
	<.0001	

By default, PROC CORR produces Pearson correlation coefficients (Pearson's r). Pearson's r is the most widely used statistic for measuring the strength of a linear relationship between two variables. Inference on Pearson's r assumes that both variables are approximately normally distributed; other assumptions include linearity and homoscedasticity. Let's see whether our data are normally distributed. We can check normality with the NORMAL option of PROC UNIVARIATE:

								
	PROC UNIVARIATE DATA = tutorial.galton NORMAL;
		VAR parent child;
	RUN;

Tests for Normality (parent)
Test	Statistic		p Value
Shapiro-Wilk	W	0.966104	Pr < W	<0.0001
Kolmogorov-Smirnov	D	0.130528	Pr > D	<0.0100
Cramer-von Mises	W-Sq	2.738985	Pr > W-Sq	<0.0050
Anderson-Darling	A-Sq	14.11431	Pr > A-Sq	<0.0050

Tests for Normality (child)
Test	Statistic		p Value
Shapiro-Wilk	W	0.980656	Pr < W	<0.0001
Kolmogorov-Smirnov	D	0.103744	Pr > D	<0.0100
Cramer-von Mises	W-Sq	1.276303	Pr > W-Sq	<0.0050
Anderson-Darling	A-Sq	6.917662	Pr > A-Sq	<0.0050
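With a sample of this size, formal normality tests can flag even small departures from normality, so it can help to look at the data as well. The sketch below is one possible way to do that, not part of the original analysis: it adds histograms and Q-Q plots to the PROC UNIVARIATE run, and uses PROC SGPLOT to draw a scatter plot with a fitted line as a rough check of linearity and homoscedasticity. The transparency setting is an illustrative choice, made because many observations share the same pair of heights.

```sas
/* Histograms with a fitted normal curve, and normal Q-Q plots. */
PROC UNIVARIATE DATA = tutorial.galton NORMAL;
    VAR parent child;
    HISTOGRAM parent child / NORMAL;
    QQPLOT parent child / NORMAL(MU=EST SIGMA=EST);
RUN;

/* Scatter plot with a regression line to inspect linearity and spread. */
PROC SGPLOT DATA = tutorial.galton;
    SCATTER X = parent Y = child / TRANSPARENCY = 0.8;
    REG X = parent Y = child / NOMARKERS;
RUN;
```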

All four normality tests reject normality for both variables (every p-value is below 0.01), so we turn to nonparametric measures of association. The Spearman rank-order correlation uses the ranks of the data values, and Kendall’s tau-b uses the number of concordant and discordant pairs of observations. Either measure can be requested by adding the SPEARMAN or KENDALL option to the PROC CORR statement. Here's how:

								
	PROC CORR DATA = tutorial.galton SPEARMAN KENDALL;
		VAR parent child;
	RUN;

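Spearman Correlation Coefficients, N = 928
Prob > |r| under H0: Rho=0
	parent	child
parent	1.00000	0.42513
		<.0001
child	0.42513	1.00000
	<.0001	

Kendall Tau b Correlation Coefficients, N = 928
Prob > |tau| under H0: Tau=0
	parent	child
parent	1.00000	0.33158
		<.0001
child	0.33158	1.00000
	<.0001	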

Kendall’s tau is usually smaller in magnitude than Spearman’s rho, which is consistent with the results above. Kendall’s p-values tend to be more accurate than Spearman’s when the sample is small. Our sample (N = 928) is large enough that Spearman is fine to use. The Spearman coefficient (0.425) indicates a moderate positive correlation between mid-parent and child heights.
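To compare the three coefficients in a single run, the PEARSON, SPEARMAN and KENDALL options can be listed together. This is a small sketch of that idea; NOSIMPLE is added here only to suppress the Simple Statistics table, which was already shown above.

```sas
/* Request all three measures at once; PEARSON must be named explicitly
   when other measures are requested alongside it. */
PROC CORR DATA = tutorial.galton PEARSON SPEARMAN KENDALL NOSIMPLE;
    VAR parent child;
RUN;
```

For these data the run should reproduce the coefficients reported earlier: 0.45876 (Pearson), 0.42513 (Spearman) and 0.33158 (Kendall).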