PROC CORR

The CORR procedure computes Pearson correlation coefficients, three nonparametric measures of association, polyserial correlation coefficients, and the probabilities associated with these statistics.

Example: Galton's data on the heights of parents and their children

Description: Galton (1886) presented these data in a table, showing a cross-tabulation of 928 adult children born to 205 fathers and mothers, by their height and their mid-parent's height. He visually smoothed the bivariate frequency distribution and showed that the contours formed concentric and similar ellipses, thus setting the stage for correlation, regression and the bivariate normal distribution.

parent: height of the mid-parent (average of father and mother)
child: height of the child.

Source: https://vincentarelbundock.github.io/Rdatasets/doc/HistData/Galton.html
Download the data from here
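The examples below read the data from a SAS data set named tutorial.galton. A minimal sketch of one way to create it, assuming the CSV has been saved locally; both paths are hypothetical placeholders, and the KEEP= option assumes the file contains columns named parent and child (it may also carry a leading row-index column, which is dropped here):

```sas
/* Hypothetical locations: replace both paths with your own. */
libname tutorial '/path/to/tutorial';

/* Read the CSV and keep only the two height variables. */
proc import datafile='/path/to/Galton.csv'
    out=tutorial.galton(keep=parent child)
    dbms=csv
    replace;
    getnames=yes;
run;

/* Quick check: the description above implies 928 rows. */
proc contents data=tutorial.galton;
run;
```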

Task: Are the heights of the parents and child related?

One way to see whether two variables are related is to check their correlation. Correlation is a bivariate analysis that measures both the strength of the association between two variables and the direction of the relationship. The correlation coefficient ranges from -1 to +1. A value of ±1 indicates a perfect association between the two variables; the closer the coefficient is to 0, the weaker the relationship. The sign of the coefficient gives the direction: a + sign indicates a positive relationship and a - sign indicates a negative relationship.
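For reference, the most common such coefficient, Pearson's r, can be written as follows for n paired observations. The symbols x_i and y_i (the two measurements on observation i) and their sample means are notation introduced here, not taken from the text above:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```

The numerator is positive when the two variables tend to lie on the same side of their means and negative when they tend to lie on opposite sides, which is where the sign of r comes from.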

In SAS, correlation analysis is run with PROC CORR. Let's start with a basic analysis:

PROC CORR DATA = tutorial.galton;
VAR parent child;
RUN;

The CORR Procedure

2 Variables: parent child

Simple Statistics
Variable    N      Mean        Std Dev    Sum      Minimum     Maximum
parent      928    68.30819    1.78733    63390    64.00000    73.00000
child       928    68.08847    2.51794    63186    61.70000    73.70000

Pearson Correlation Coefficients, N = 928
Prob > |r| under H0: Rho=0
            parent     child
parent      1.00000    0.45876
                       <.0001
child       0.45876    1.00000
            <.0001
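A coefficient alone does not show what the relationship looks like. As a sketch, assuming ODS Graphics is available in your SAS session, the PLOTS=SCATTER option of PROC CORR draws the scatter plot behind the coefficient, here with a prediction ellipse:

```sas
/* Scatter plot of child against parent height with a prediction ellipse. */
ods graphics on;

proc corr data=tutorial.galton plots=scatter(ellipse=prediction);
    var parent child;
run;

ods graphics off;
```

Because the heights in this data set were recorded in a table of height classes, many observations share the same coordinates, so the plot shows far fewer distinct points than 928.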

By default, PROC CORR reports Pearson correlation coefficients. Pearson's r is the most widely used statistic for measuring the strength of the relationship between linearly related variables. The significance test for Pearson's r assumes that both variables are normally distributed; other assumptions include linearity and homoscedasticity. Let's see whether our data are normally distributed. We can check normality with the NORMAL option of PROC UNIVARIATE:

PROC UNIVARIATE DATA = tutorial.galton NORMAL;
VAR parent child;
RUN;

Tests for Normality (parent)
Test                  Statistic           p Value
Shapiro-Wilk          W       0.966104    Pr < W       <0.0001
Kolmogorov-Smirnov    D       0.130528    Pr > D       <0.0100
Cramer-von Mises      W-Sq    2.738985    Pr > W-Sq    <0.0050
Anderson-Darling      A-Sq    14.11431    Pr > A-Sq    <0.0050

Tests for Normality (child)
Test                  Statistic           p Value
Shapiro-Wilk          W       0.980656    Pr < W       <0.0001
Kolmogorov-Smirnov    D       0.103744    Pr > D       <0.0100
Cramer-von Mises      W-Sq    1.276303    Pr > W-Sq    <0.0050
Anderson-Darling      A-Sq    6.917662    Pr > A-Sq    <0.0050
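With 928 observations, formal normality tests can flag even small departures from normality, so it may help to look at the distributions as well. A minimal sketch, assuming ODS Graphics is available, that adds a histogram with a fitted normal curve and a normal Q-Q plot for each variable:

```sas
/* Graphical normality checks to read alongside the test results. */
ods graphics on;

proc univariate data=tutorial.galton;
    var parent child;
    /* Histogram with a normal density fitted from the sample mean and std dev. */
    histogram parent child / normal;
    /* Q-Q plot with a reference line using the estimated mean and std dev. */
    qqplot parent child / normal(mu=est sigma=est);
run;

ods graphics off;
```

Points that follow the reference line in the Q-Q plot suggest approximate normality; systematic curvature or steps suggest otherwise.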

All four normality tests reject normality for both variables (every p-value is below 0.01), so we turn to nonparametric measures of association. Spearman rank-order correlation uses the ranks of the data values, and Kendall’s tau-b uses the number of concordant and discordant pairs of observations. We can request either or both for our data set with the SPEARMAN and KENDALL options. Here's how:

PROC CORR DATA = tutorial.galton SPEARMAN KENDALL;
VAR parent child;
RUN;

Spearman Correlation Coefficients, N = 928
Prob > |r| under H0: Rho=0
            parent     child
parent      1.00000    0.42513
                       <.0001
child       0.42513    1.00000
            <.0001

Kendall Tau b Correlation Coefficients, N = 928
Prob > |tau| under H0: Tau=0
            parent     child
parent      1.00000    0.33158
                       <.0001
child       0.33158    1.00000
            <.0001
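To make "uses the ranks of the data values" concrete, the following sketch ranks both variables with PROC RANK and then computes an ordinary Pearson correlation on the ranks. The data set name work.galton_ranks and the variable names parent_rank and child_rank are made up for this illustration. With tied values given their average rank, the result should agree with the Spearman coefficient reported by PROC CORR:

```sas
/* Replace each height by its rank; tied values receive the mean rank. */
proc rank data=tutorial.galton out=work.galton_ranks ties=mean;
    var parent child;
    ranks parent_rank child_rank;
run;

/* Pearson correlation of the ranks. */
proc corr data=work.galton_ranks pearson nosimple;
    var parent_rank child_rank;
run;
```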

Kendall’s tau-b is usually smaller in absolute value than Spearman’s rho for the same data, which explains the difference above (0.332 versus 0.425). P-values for Kendall’s tau are more accurate than those for Spearman’s rho when the sample size is small. Our sample (N = 928) is sufficiently large, so Spearman is fine to use. The Spearman correlation coefficient (0.425) shows a moderate positive correlation between parent and child heights: taller parents tend to have taller children.
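A point estimate such as 0.425 is easier to interpret alongside an interval. As a sketch, the FISHER option of PROC CORR requests confidence limits based on Fisher's z transformation; it applies to the Pearson and Spearman coefficients, not to Kendall's tau-b, so KENDALL is left out here:

```sas
/* Confidence limits for the Pearson and Spearman coefficients. */
proc corr data=tutorial.galton pearson spearman fisher nosimple;
    var parent child;
run;
```

NOSIMPLE only suppresses the Simple Statistics table, which was already shown in the first run.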