Exploratory Statistics
First step after data preparation is exploratory statistics. Major SAS procedures to do so are MEANS, UNIVARIATE and FREQ.
PROC MEANS
We will use cake sample data from SAS website. You can download the data from here. This dataset is from a cake-baking contest: each participant's last name, age, score for presentation, score for taste, cake flavor, and number of cake layers. The number of cake layers is missing for two observations. The cake flavor is missing for another observation.
LastName | Age | PresentScore | TasteScore | Flavor | Layers |
---|---|---|---|---|---|
Orlando | 27 | 93 | 80 | Vanilla | 1 |
Ramey | 32 | 84 | 72 | Rum | 2 |
Goldston | 46 | 68 | 75 | Vanilla | 1 |
Roe | 38 | 79 | 73 | Vanilla | 2 |
Larsen | 23 | 77 | 84 | Chocolate | . |
Davis | 51 | 86 | 91 | Spice | 3 |
Strickland | 19 | 82 | 79 | Chocolate | 1 |
Nguyen | 57 | 77 | 84 | Vanilla | . |
Hildenbrand | 33 | 81 | 83 | Chocolate | 1 |
Byron | 62 | 72 | 87 | Vanilla | 2 |
Sanders | 26 | 56 | 79 | Chocolate | 1 |
Jaeger | 43 | 66 | 74 | 1 | |
Davis | 28 | 69 | 75 | Chocolate | 2 |
Conrad | 69 | 85 | 94 | Vanilla | 1 |
Walters | 55 | 67 | 72 | Chocolate | 2 |
Rossburger | 28 | 78 | 81 | Spice | 2 |
Matthew | 42 | 81 | 92 | Chocolate | 2 |
Becker | 36 | 62 | 83 | Spice | 2 |
Anderson | 27 | 87 | 85 | Chocolate | 1 |
Merritt | 62 | 73 | 84 | Chocolate | 1 |
Let's get the basic statistics from this data by PROC MEANS:
By default PROC MEANS calculates N, Mean, Standard Deviation, Minimum and Maximum for all numerical variables. MAXDEC=2 option limits the number of decimals to 2. To calculate only certain variables, we can state them with a new statement:
PROC MEANS can calculate various statistics other than default ones. To do that we add keywords to the PROC MEANS statement:
We can also group variables with CLASS statement. Let's say we want to find out average scores for different flavors. Following will achieve that.
Number of observations (N Obs) is printed by default. To omit printing N Obs, we need to use the keyword NONOBS in the PROC MEANS statement:
Output from PROC MEANS can be saved with OUTPUT statement:
Here meanvalues is the name of the output dataset and n, avg and std are the column names for number of observations, mean and standard deviation, respectively.
PROC UNIVARIATE
Our second weapon in SAS arsenal for exploratory statistics is UNIVARIATE procedure. PROC UNIVARIATE provides descriptive statistics such as skewness and kurtosis, histogram, goodness-of-fit tests, probability plots etc. First, the data set. We will use cake sample data from SAS website. You can download the data from here. This dataset is from a cake-baking contest: each participant's last name, age, score for presentation, score for taste, cake flavor, and number of cake layers. The number of cake layers is missing for two observations. The cake flavor is missing for another observation.
LastName | Age | PresentScore | TasteScore | Flavor | Layers |
---|---|---|---|---|---|
Orlando | 27 | 93 | 80 | Vanilla | 1 |
Ramey | 32 | 84 | 72 | Rum | 2 |
Goldston | 46 | 68 | 75 | Vanilla | 1 |
Roe | 38 | 79 | 73 | Vanilla | 2 |
Larsen | 23 | 77 | 84 | Chocolate | . |
Davis | 51 | 86 | 91 | Spice | 3 |
Strickland | 19 | 82 | 79 | Chocolate | 1 |
Nguyen | 57 | 77 | 84 | Vanilla | . |
Hildenbrand | 33 | 81 | 83 | Chocolate | 1 |
Byron | 62 | 72 | 87 | Vanilla | 2 |
Sanders | 26 | 56 | 79 | Chocolate | 1 |
Jaeger | 43 | 66 | 74 | 1 | |
Davis | 28 | 69 | 75 | Chocolate | 2 |
Conrad | 69 | 85 | 94 | Vanilla | 1 |
Walters | 55 | 67 | 72 | Chocolate | 2 |
Rossburger | 28 | 78 | 81 | Spice | 2 |
Matthew | 42 | 81 | 92 | Chocolate | 2 |
Becker | 36 | 62 | 83 | Spice | 2 |
Anderson | 27 | 87 | 85 | Chocolate | 1 |
Merritt | 62 | 73 | 84 | Chocolate | 1 |
Let's get goodness-of-fit tests for normal distribution and a simple histogram of Age with PROC UNIVARIATE:
By default PROC UNIVARIATE calculates Kolmogorov-Smirnov, Cramer-von Mises and Anderson-Darling statistics for normality. If any of these values are below 0.05 (which we can change by specifying ALPHA=0.xx in PROC UNIVARIATE statement) we don't believe the variable follows a normal distribution. In our case each of these values are above 0.05 therefore we don't have any suspicion regarding normality of Age. In other words, we can assume that Age is normally distributed.
We can also get Shapiro-Wilk test in addition to those if NORMAL option is specified in the PROC UNIVARIATE statement. According to some statisticians, Shapiro-Wilk test is preferred over others.
Note that normality tests are somewhat sensitive to sample size; in large samples small deviations from normality may result in conclusions of non-normality. Take a look at the chart below which is generated by creating various size random samples from a pseudo-normal distribution and testing each for normality. Each result is averaged over thousands of trials to eliminate the random noise. It can be seen that all 4 tests suggest normality for samples smaller than n~90 and non-normality for larger samples. Note the particular sensitivity of Shapiro-Wilk to the sample size. Kolmogorov-Smirnov seems to be the least sensitive of normality tests to the sample size.
To get an overlay of normal curve over the histogram, we can simply add NORMAL option, which is preceded by a slash / after HISTOGRAM statement:
We can further specify options for the histogram. x-axis limits and y-axis type (count or percent) can be specified as follows:
In order to get distribution of Age with differing flavors we can declare Flavor as a classification variable. We can also specify how we would like to see the histogram. In our case, there are 4 different flavors so we can display each value within a different column by specifying number of columns and rows in HISTOGRAM options. Note that you don't have to specify this value as SAS automatically calculates number of columns.
PROC FREQ
The FREQ procedure produces one-way to n-way frequency and contingency (crosstabulation) tables. For two-way tables, PROC FREQ computes tests and measures of association. For n-way tables, PROC FREQ provides stratified analysis by computing statistics across, as well as within, strata.
Data set marbles records observations about 10 random draws of marbles from two bags filled with 3 different color marbles. Let's analyze this data sets via PROC FREQ.
Bag | Color |
---|---|
1 | blue |
1 | red |
1 | green |
1 | blue |
1 | blue |
1 | green |
1 | green |
1 | red |
1 | blue |
1 | green |
1 | green |
1 | red |
1 | green |
1 | green |
1 | red |
1 | blue |
1 | green |
1 | red |
1 | blue |
1 | blue |
1 | red |
1 | green |
1 | green |
1 | green |
1 | red |
1 | green |
1 | blue |
1 | red |
1 | green |
1 | green |
2 | blue |
2 | blue |
2 | green |
2 | blue |
2 | red |
2 | red |
2 | blue |
2 | green |
2 | blue |
2 | red |
2 | blue |
2 | blue |
2 | red |
2 | green |
2 | green |
2 | green |
2 | blue |
2 | blue |
2 | red |
2 | blue |
2 | blue |
2 | blue |
2 | red |
2 | green |
2 | blue |
2 | green |
2 | blue |
2 | blue |
2 | red |
2 | blue |
Let's get the frequencies of colors by PROC FREQ:
TABLE is major statement in PROC FREQ, it is somewhat similar to VAR statement in PROC MEANS. By requesting a table of Color, PROC FREQ calculates the frequecies of each color in the whole data set. If we instead want to see the frequency of colors in each bag we simply add Bag to our TABLE statement followed by a star * before Color argument.
If you find the output too crowded and eliminate some of the information, you can type these options in the TABLE statement. First, let's find out what information we do have. Take a look at the leftmost column in the ouput: Frequency, Percent, Row Pct, Col Pct. You can eliminate any of these with the following:
What if our data does not include individual marbles but exist as a contingency table like below?
Bag | Color | Quantity |
---|---|---|
1 | blue | 8 |
1 | green | 14 |
1 | red | 8 |
2 | blue | 16 |
2 | green | 7 |
2 | red | 7 |
In this case, we have to specify the quantity, i.e. WEIGHT:
Leave a Comment