PROC CORR
The CORR procedure computes Pearson correlation coefficients, three nonparametric measures of association, polyserial correlation coefficients, and the probabilities associated with these statistics.
Example: Galton's data on the heights of parents and their children
Description: Galton (1886) presented these data in a table, showing a cross-tabulation of 928 adult children born to 205 fathers and mothers, by their height and their mid-parent's height. He visually smoothed the bivariate frequency distribution and showed that the contours formed concentric and similar ellipses, thus setting the stage for correlation, regression and the bivariate normal distribution.
parent: height of the mid-parent (average of father and mother)
child: height of the child.
Source: https://vincentarelbundock.github.io/Rdatasets/doc/HistData/Galton.html
Task: Are the heights of parents and their children related?
To see whether two variables are related, we check their correlation. Correlation is a bivariate analysis that measures both the strength of the association between two variables and the direction of the relationship. The correlation coefficient ranges from -1 to +1. A value of ±1 indicates a perfect association between the two variables, and the closer the coefficient is to 0, the weaker the relationship. The sign of the coefficient gives the direction: a plus sign (+) indicates a positive relationship and a minus sign (-) indicates a negative relationship.
In SAS, correlation analysis is run with PROC CORR. Let's start with a basic analysis:
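As a reference for the range described above, the Pearson coefficient can be written as follows. The notation is an assumption introduced here for illustration: x_i and y_i stand for the parent and child heights in the i-th of n pairs, and x-bar and y-bar are their sample means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r \le 1
```

The bounds follow from the Cauchy-Schwarz inequality: the numerator can never exceed the denominator in absolute value, and the two are equal only when every point lies exactly on a straight line.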
PROC CORR DATA = tutorial.galton;
VAR parent child;
RUN;
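The CORR Procedure
Simple Statistics

| Variable | N | Mean | Std Dev | Sum | Minimum | Maximum |
|---|---|---|---|---|---|---|
| parent | 928 | 68.30819 | 1.78733 | 63390 | 64.00000 | 73.00000 |
| child | 928 | 68.08847 | 2.51794 | 63186 | 61.70000 | 73.70000 |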
By default, PROC CORR reports Pearson correlation coefficients. Pearson's r is the most widely used statistic for measuring the strength of the relationship between linearly related variables. The usual significance test for Pearson's r assumes that both variables are normally distributed; other assumptions include linearity and homoscedasticity. Let's see whether our data are normally distributed. We can check normality with the NORMAL option of PROC UNIVARIATE:
PROC UNIVARIATE DATA = tutorial.galton NORMAL;
VAR parent child;
RUN;
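Tests for Normality: parent

| Test | Statistic | p Value |
|---|---|---|
| Shapiro-Wilk | W = 0.966104 | Pr < W: <0.0001 |
| Kolmogorov-Smirnov | D = 0.130528 | Pr > D: <0.0100 |
| Cramer-von Mises | W-Sq = 2.738985 | Pr > W-Sq: <0.0050 |
| Anderson-Darling | A-Sq = 14.11431 | Pr > A-Sq: <0.0050 |

Tests for Normality: child

| Test | Statistic | p Value |
|---|---|---|
| Shapiro-Wilk | W = 0.980656 | Pr < W: <0.0001 |
| Kolmogorov-Smirnov | D = 0.103744 | Pr > D: <0.0100 |
| Cramer-von Mises | W-Sq = 1.276303 | Pr > W-Sq: <0.0050 |
| Anderson-Darling | A-Sq = 6.917662 | Pr > A-Sq: <0.0050 |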
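Linearity and homoscedasticity, the other assumptions mentioned alongside normality, are usually easier to judge from a plot than from a test. A minimal sketch, assuming ODS Graphics is available in your SAS session and that the tutorial libref is already assigned:

```sas
/* Turn on ODS Graphics so that PROC CORR can draw plots. */
ODS GRAPHICS ON;

/* Scatter plot of child against parent with a prediction ellipse overlaid. */
PROC CORR DATA = tutorial.galton PLOTS = SCATTER(ELLIPSE = PREDICTION);
VAR parent child;
RUN;

ODS GRAPHICS OFF;
```

Because Galton recorded the heights as a cross-tabulation, many of the 928 observations are likely to fall on the same plotted point, so the plot may look sparser than the sample size suggests.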
Based on the p-values of the normality tests, neither variable is normally distributed, so we use nonparametric measures of association instead. Spearman rank-order correlation is computed from the ranks of the data values, and Kendall's tau-b is computed from the number of concordant and discordant pairs of observations. Either measure can be requested by adding the corresponding option to the PROC CORR statement. Here's how:
PROC CORR DATA = tutorial.galton SPEARMAN KENDALL;
VAR parent child;
RUN;
Kendall's tau usually takes smaller values than Spearman's rho, which explains the difference between the two coefficients in the output of this step. With small samples, p-values for Kendall's tau are more accurate than those for Spearman's rho. Our sample (N = 928) is large enough that Spearman's rho is fine to use. The Spearman correlation coefficient (0.425) indicates a moderate positive correlation between parent and child heights.
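To make the "uses the ranks" idea concrete, Spearman's coefficient can be reproduced by ranking each variable and then computing an ordinary Pearson correlation on the ranks. The following is a sketch under assumptions: the output data set work.galton_ranks and the variables parent_rank and child_rank are names invented here for illustration.

```sas
/* Replace each height with its rank; tied values receive the average rank. */
PROC RANK DATA = tutorial.galton OUT = work.galton_ranks TIES = MEAN;
VAR parent child;
RANKS parent_rank child_rank;
RUN;

/* Pearson correlation computed on the ranks instead of the raw heights. */
PROC CORR DATA = work.galton_ranks PEARSON;
VAR parent_rank child_rank;
RUN;
```

The coefficient reported for parent_rank and child_rank should match the Spearman coefficient from the SPEARMAN option, since Spearman's rho is defined as the Pearson correlation of the ranks.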