Data Manipulation
At the center of data manipulation lies the prominent SAS DATA step. SAS DATA is one of the best, if not the best, data manipulation tool available out there.
First thing to learn about SAS DATA step is that it process the data row-wise. What I mean with that will become more apparent with each example, so without further ado, let's begin working on our examples.
Creating a column
We have a single column dataset listing ages of different people. Here's what it looks like (download it from here).
Obs | age |
---|---|
1 | 11 |
2 | 21 |
3 | 27 |
4 | 72 |
5 | 27 |
6 | 32 |
7 | 45 |
8 | 51 |
9 | 73 |
10 | 13 |
To create a column we simply specify the new column and the value:
DATA tutorial.data_age_new;
SET tutorial.data_age;
isadult = 1;
First, we specified the name of data set that will be created in the DATA statement. If we specified a the same original data set name, i.e. tutorial.data_age, it would have overwritten original data set. Also note the convention we are using for saving data: library.dataset. In our case, a new data set with the name data_age_new will be created in the tutorials library. Original data set, data_age won't be modified.
SET statement copies a temporary version of the specified data set to memory so that the original version will not change.
We create a new column by specifying a column name isadult and giving a value of 1. Here's how DATA will process this command:
- • Header Row: Create a column named isadult
- • Row 1: Read the value of age, write 1 to isadult column
- • Row 1: Read the value of age, write 1 to isadult column
- • ...
- • Row 10: Read the value of age, write 1 to isadult column
Here's our result:
Obs | age | isadult |
---|---|---|
1 | 11 | 1 |
2 | 21 | 1 |
3 | 27 | 1 |
4 | 72 | 1 |
5 | 27 | 1 |
6 | 32 | 1 |
7 | 45 | 1 |
8 | 51 | 1 |
9 | 73 | 1 |
10 | 13 | 1 |
Probably this is not exactly what we wanted. We would like the value of isadult reflect the value of age in the first column. To do this, we use an IF-ELSE statement:
DATA tutorial.data_age_new;
SET tutorial.data_age;
IF age < 18 THEN isadult = 1;
ELSE isadult = 0;
and we get finally what we wanted:
Obs | age | isadult |
---|---|---|
1 | 11 | 1 |
2 | 21 | 0 |
3 | 27 | 0 |
4 | 72 | 0 |
5 | 27 | 0 |
6 | 32 | 0 |
7 | 45 | 0 |
8 | 51 | 0 |
9 | 73 | 0 |
10 | 13 | 1 |
Appending data sets
Let's say we have two data sets like this:
Table-I |
Obs | fname | lname |
---|---|---|
1 | Alicia | Silverstone |
2 | Tom | Cruise |
3 | Uma | Thurman |
Table-II |
Obs | fname | lname |
---|---|---|
1 | Tom | Hanks |
2 | Sharon | Stone |
3 | Orlando | Bloom |
Combining these data sets into one table is as easy as this:
DATA table3;
SET table1 table2;
And here's the result:
Table-III |
Obs | fname | lname |
---|---|---|
1 | Alicia | Silverstone |
2 | Tom | Cruise |
3 | Uma | Thurman |
4 | Tom | Hanks |
5 | Sharon | Stone |
6 | Orlando | Bloom |
We learned the simplest case of combining tables. Now let's see a more complicated example:
Table-I |
Obs | fname | lname | country |
---|---|---|---|
1 | Alicia | Silverstone | US |
2 | Tom | Cruise | US |
3 | Uma | Thurman | UK |
Table-II |
Obs | fname | lname | age |
---|---|---|---|
1 | Alicia | Silverstone | 40 |
2 | Tom | Cruise | 50 |
3 | Uma | Thurman | 45 |
Let's repeat the previous example and see what happens:
DATA table3;
SET table1 table2;
Table-III |
Obs | fname | lname | country | age |
---|---|---|---|---|
1 | Alicia | Silverstone | US | . |
2 | Tom | Cruise | US | . |
3 | Uma | Thurman | UK | . |
4 | Alicia | Silverstone | 40 | |
5 | Tom | Cruise | 50 | |
6 | Uma | Thurman | 45 |
OK, this is not we want. Clearly we want to get rid of duplicates. To get what we want we need to use MERGE operator instead of SET:
DATA table3;
MERGE table1 table2;
BY fname lname;
Table-III |
Obs | fname | lname | country | age |
---|---|---|---|---|
1 | Alicia | Silverstone | US | 40 |
2 | Tom | Cruise | US | 50 |
3 | Uma | Thurman | UK | 45 |
Transposing data sets
There are times we have longitudinal data and we would like to transpose it to wide data and vice versa. SAS has a tool for this purpose: PROC TRANSPOSE. Let's see a complicated example where we can combine the power of DATA and PROC TRANSPOSE. Here's our initial table:
Obs | class | trade | sober | drinks | wage |
---|---|---|---|---|---|
1 | A | papercutter | 1 | 1 | 24 |
2 | A | cabmen | 1 | 10 | 18.416666667 |
3 | A | goldbeater | 2 | 1 | 21.5 |
4 | A | stablemen | 1 | 5 | 21.166666667 |
5 | A | millworker | 2 | 0 | 19 |
6 | A | porter | 9 | 8 | 20.5 |
7 | A | warehouse | 8 | 2 | 24.333333333 |
8 | A | maltmen | 3 | 5 | 23 |
9 | A | chimneysweep | 0 | 7 | 17.333333333 |
10 | A | carter | 12 | 23 | 22.166666667 |
and we want to covert it to this:
Obs | class | trade | wage | freq | drinks |
---|---|---|---|---|---|
1 | A | barman | 23.666666667 | 1 | yes |
2 | A | barman | 23.666666667 | 0 | no |
3 | A | billposter | 18 | 1 | yes |
4 | A | billposter | 18 | 0 | no |
5 | A | brassfounder | 21.5 | 2 | yes |
6 | A | brassfounder | 21.5 | 4 | no |
7 | A | cabmen | 18.416666667 | 1 | yes |
8 | A | cabmen | 18.416666667 | 10 | no |
9 | A | carter | 22.166666667 | 12 | yes |
10 | A | carter | 22.166666667 | 23 | no |
This conversion multiple steps of DATA procedure and PROC TRANSPOSE. First, we would like to keep columns class, trade, wage in place - no change there. Second, instead of having two columns with sober and drinks, we would like to have one column drinks (yes or no). Finally the number of drinkers and smokers will be in another column, freq. Let's built our TRANSPOSE procedure:
PROC TRANSPOSE DATA = table1;
VAR sober drinks;
BY class trade wage;
RUN;
Before running this we have one technical requirements. In order for SAS to process any BY statement, variables declared after BY have to be sorted. I am not sure why it has to be so but it is. So let's sort these variables first:
PROC SORT DATA = table1 OUT = table1_sort;
BY class trade wage;
RUN;
Now we can run our TRANSPOSE procedure:
PROC TRANSPOSE DATA = table1_sort OUT = table2;
VAR sober drinks;
BY class trade wage;
RUN;
Obs | class | trade | wage | _NAME_ | COL1 |
---|---|---|---|---|---|
1 | A | barman | 23.666666667 | sober | 1 |
2 | A | barman | 23.666666667 | drinks | 0 |
3 | A | billposter | 18 | sober | 1 |
4 | A | billposter | 18 | drinks | 0 |
5 | A | brassfounder | 21.5 | sober | 2 |
6 | A | brassfounder | 21.5 | drinks | 4 |
7 | A | cabmen | 18.416666667 | sober | 1 |
8 | A | cabmen | 18.416666667 | drinks | 10 |
9 | A | carter | 22.166666667 | sober | 12 |
10 | A | carter | 22.166666667 | drinks | 23 |
PROC TRANSPOSE created two columns with names _NAME_ and COL1. We can specify these names in PROC TRANSPOSE statement but no need for it since we will use DATA step to make the changes.
DATA table2;
SET table2 (RENAME=(col1=freq));
IF _name_ EQ "sober" THEN drinks="yes";
ELSE drinks="no";
DROP _name_;
and we get our final table:
Obs | class | trade | wage | freq | drinks |
---|---|---|---|---|---|
1 | A | barman | 23.666666667 | 1 | yes |
2 | A | barman | 23.666666667 | 0 | no |
3 | A | billposter | 18 | 1 | yes |
4 | A | billposter | 18 | 0 | no |
5 | A | brassfounder | 21.5 | 2 | yes |
6 | A | brassfounder | 21.5 | 4 | no |
7 | A | cabmen | 18.416666667 | 1 | yes |
8 | A | cabmen | 18.416666667 | 10 | no |
9 | A | carter | 22.166666667 | 12 | yes |
10 | A | carter | 22.166666667 | 23 | no |
Creating random data
We use the RAND function to create random numbers:
DATA randdata;
CALL STREAMINIT(123);
DO i=1 TO 10;
randcol = rand("UNIFORM");
OUTPUT;
END;
CALL STREAMINIT function simply specify the seed for random number generator. If we don't specify this with a number, we cannot reproduce our results.
This is our first case that we use a DO loop. DO i=1 to 10 tells SAS to repeat whatever statement follows preceding the END statement for each i value from 1 to 10. RAND function simply generates a random number between 0 and 1. Here is what happens with each loop:
- randcol = RAND("UNIFORM"): Set i=1, create a random number between 0 and 1, assign it to randcol column
- OUTPUT: Create a new row
- randcol = RAND("UNIFORM"): Set i=2, create a random number between 0 and 1, assign it to randcol column
- OUTPUT: Create a new row
- ...
- randcol = RAND("UNIFORM"): Set i=10, create a random number between 0 and 1, assign it to randcol column
- OUTPUT: Create a new row
Output is below:
Obs | i | randcall |
---|---|---|
1 | 1 | 0.58170 |
2 | 2 | 0.03562 |
3 | 3 | 0.07818 |
4 | 4 | 0.38785 |
5 | 5 | 0.32917 |
6 | 6 | 0.36159 |
7 | 7 | 0.33759 |
8 | 8 | 0.16920 |
9 | 9 | 0.05670 |
10 | 10 | 0.07983 |
Some of the other distributions available for RAND function are:
- RAND("Bernoulli",p), where p specifies the chance of success. For example, if we create 20 rows with p=0.5, about 10 of the rows will be 1 and 10 of the rows will be 0.
- RAND("Bernoulli",p,n), where p specifies the chance of success and n specifies the number of independent Bernoulli trials. For example, if we create 100 rows with p=0.5 and n=3, about pn=0.53=0.125*100=12.5 of the rows will be 0 or 3 and so on.
- RAND("Normal",θ,λ) creates a normal distribution where θ represents the mean and λ represents the standard deviation.
Extracting Characters from Data
Many times we come across a dataset where our point of interest is buried in data with extra unwanted information. For example, take a look at the table below:
Obs | agegp | alcgp | tobgp | ncases | ncontrols |
---|---|---|---|---|---|
1 | 25-34 | 0-39g/day | 0-9g/day | 0 | 40 |
2 | 25-34 | 0-39g/day | 10-19 | 0 | 10 |
3 | 25-34 | 0-39g/day | 20-29 | 0 | 6 |
4 | 25-34 | 0-39g/day | 30+ | 0 | 5 |
5 | 25-34 | 40-79 | 0-9g/day | 0 | 27 |
Second and third column in the dataset includes unwanted characters (g/day). How do we remove them easily? Believe it or not SAS has a function for this: COMPRESS. See the code below.
DATA tutorial.epoch;
SET tutorial.epoch;
alcgp = compress(alcgp,'g/day');
tobgp = compress(topgp,'g/day');
Obs | agegp | alcgp | tobgp | ncases | ncontrols |
---|---|---|---|---|---|
1 | 25-34 | 0-39 | 0-9 | 0 | 40 |
2 | 25-34 | 0-39 | 10-19 | 0 | 10 |
3 | 25-34 | 0-39 | 20-29 | 0 | 6 |
4 | 25-34 | 0-39 | 30+ | 0 | 5 |
5 | 25-34 | 40-79 | 0-9 | 0 | 27 |
Another but much more powerful option is utilizing regular expressions. I will put a lecture about this in the near future but for the time being simply know that following code will achieve the same result as above and more.
DATA tutorial.epoch;
SET tutorial.epoch;
re = PRXPARSE('/(\d{1,3}.[\d]{0,2})(?:.*)/');
match = PRXMATCH(re, alcgp);
alcgp = PRXPOSN(re,1,alcgp);
DROP re match;
DO Loops
One of the most powerful features of SAS DATA (and command line tools in general) is the power of loops for repetitive tasks. Let's say we have table which lists days of a month (i.e. 1,2,..30) and we want to add a column called 'day' which lists day of the week (i.e. Mon, Tue and so on). So we should start with day1 and
Arrays
If you are familiar with programming you already know what an array is. If you are not, an array is a data structure, which can store a fixed-size collection of elements of the same data type. Here is an example to demonstrate the power of arrays. Take a look at the table below. Say we want to convert 'N/A's to missing. How do we do it?
Obs | fname | lname | age | department | city | state | role | pay |
---|---|---|---|---|---|---|---|---|
1 | Smith | N/A | N/A | Sales | Boston | MA | Salesper | 112 |
2 | Mark | Adams | 36 | IT | Houston | TX | Manager | 105 |
3 | Jennifer | Miska | N/A | Sales | Boston | MA | N/A | 95 |
4 | Adam | Taylor | 41 | N/A | N/A | N/A | Director | N/A |
One primitive way to do it is as follows:
DATA example;
SET example;
IF fname EQ 'N/A' THEN fname = .;
IF lname EQ 'N/A' THEN lname = .;
IF age EQ 'N/A' THEN age = .;
IF department EQ 'N/A' THEN department = .;
IF city EQ 'N/A' THEN city = .;
IF state EQ 'N/A' THEN state = .;
IF role EQ 'N/A' THEN role = .;
IF pay EQ 'N/A' THEN pay = .;
A better way of doing this is with arrays. We will define an array and call it 'var' and assign this to all columns such that var[1] represents fname, var[2] represents lname and so on. Then we will use this array in our IF statements. Code below goes over each var and checks the validity of IF statement.
DATA example;
SET example;
ARRAY var name -- pay;
DO OVER var;
IF var EQ 'N/A' THEN var = .;
END;