Data Manipulation

At the center of data manipulation lies the prominent SAS DATA step. SAS DATA is one of the best, if not the best, data manipulation tool available out there.

First thing to learn about SAS DATA step is that it process the data row-wise. What I mean with that will become more apparent with each example, so without further ado, let's begin working on our examples.

Creating a column

We have a single column dataset listing ages of different people. Here's what it looks like (download it from here).

Obs age
1 11
2 21
3 27
4 72
5 27
6 32
7 45
8 51
9 73
10 13

To create a column we simply specify the new column and the value:

DATA tutorial.data_age_new;
SET tutorial.data_age;
isadult = 1;

First, we specified the name of data set that will be created in the DATA statement. If we specified a the same original data set name, i.e. tutorial.data_age, it would have overwritten original data set. Also note the convention we are using for saving data: library.dataset. In our case, a new data set with the name data_age_new will be created in the tutorials library. Original data set, data_age won't be modified.

SET statement copies a temporary version of the specified data set to memory so that the original version will not change.

We create a new column by specifying a column name isadult and giving a value of 1. Here's how DATA will process this command:

  • • Header Row: Create a column named isadult
  • • Row 1: Read the value of age, write 1 to isadult column
  • • Row 1: Read the value of age, write 1 to isadult column
  • • ...
  • • Row 10: Read the value of age, write 1 to isadult column

Here's our result:

Obs age isadult
1 11 1
2 21 1
3 27 1
4 72 1
5 27 1
6 32 1
7 45 1
8 51 1
9 73 1
10 13 1

Probably this is not exactly what we wanted. We would like the value of isadult reflect the value of age in the first column. To do this, we use an IF-ELSE statement:

DATA tutorial.data_age_new;
SET tutorial.data_age;
IF age < 18 THEN isadult = 1;
ELSE isadult = 0;

and we get finally what we wanted:

Obs age isadult
1 11 1
2 21 0
3 27 0
4 72 0
5 27 0
6 32 0
7 45 0
8 51 0
9 73 0
10 13 1
Appending data sets

Let's say we have two data sets like this:

Table-I
Obs fname lname
1 Alicia Silverstone
2 Tom Cruise
3 Uma Thurman
Table-II
Obs fname lname
1 Tom Hanks
2 Sharon Stone
3 Orlando Bloom

Combining these data sets into one table is as easy as this:

DATA table3;
SET table1 table2;

And here's the result:

Table-III
Obs fname lname
1 Alicia Silverstone
2 Tom Cruise
3 Uma Thurman
4 Tom Hanks
5 Sharon Stone
6 Orlando Bloom

We learned the simplest case of combining tables. Now let's see a more complicated example:

Table-I
Obs fname lname country
1 Alicia Silverstone US
2 Tom Cruise US
3 Uma Thurman UK

Table-II
Obs fname lname age
1 Alicia Silverstone 40
2 Tom Cruise 50
3 Uma Thurman 45

Let's repeat the previous example and see what happens:

DATA table3;
SET table1 table2;
Table-III
Obs fname lname country age
1 Alicia Silverstone US .
2 Tom Cruise US .
3 Uma Thurman UK .
4 Alicia Silverstone   40
5 Tom Cruise   50
6 Uma Thurman   45

OK, this is not we want. Clearly we want to get rid of duplicates. To get what we want we need to use MERGE operator instead of SET:

DATA table3;
MERGE table1 table2;
BY fname lname;
Table-III
Obs fname lname country age
1 Alicia Silverstone US 40
2 Tom Cruise US 50
3 Uma Thurman UK 45
Transposing data sets

There are times we have longitudinal data and we would like to transpose it to wide data and vice versa. SAS has a tool for this purpose: PROC TRANSPOSE. Let's see a complicated example where we can combine the power of DATA and PROC TRANSPOSE. Here's our initial table:

Obs class trade sober drinks wage
1 A papercutter 1 1 24
2 A cabmen 1 10 18.416666667
3 A goldbeater 2 1 21.5
4 A stablemen 1 5 21.166666667
5 A millworker 2 0 19
6 A porter 9 8 20.5
7 A warehouse 8 2 24.333333333
8 A maltmen 3 5 23
9 A chimneysweep 0 7 17.333333333
10 A carter 12 23 22.166666667

and we want to covert it to this:

Obs class trade wage freq drinks
1 A barman 23.666666667 1 yes
2 A barman 23.666666667 0 no
3 A billposter 18 1 yes
4 A billposter 18 0 no
5 A brassfounder 21.5 2 yes
6 A brassfounder 21.5 4 no
7 A cabmen 18.416666667 1 yes
8 A cabmen 18.416666667 10 no
9 A carter 22.166666667 12 yes
10 A carter 22.166666667 23 no

This conversion multiple steps of DATA procedure and PROC TRANSPOSE. First, we would like to keep columns class, trade, wage in place - no change there. Second, instead of having two columns with sober and drinks, we would like to have one column drinks (yes or no). Finally the number of drinkers and smokers will be in another column, freq. Let's built our TRANSPOSE procedure:

PROC TRANSPOSE DATA = table1;
VAR sober drinks;
BY class trade wage;
RUN;

Before running this we have one technical requirements. In order for SAS to process any BY statement, variables declared after BY have to be sorted. I am not sure why it has to be so but it is. So let's sort these variables first:

PROC SORT DATA = table1 OUT = table1_sort;
BY class trade wage;
RUN;

Now we can run our TRANSPOSE procedure:

PROC TRANSPOSE DATA = table1_sort OUT = table2;
VAR sober drinks;
BY class trade wage;
RUN;
Obs class trade wage _NAME_ COL1
1 A barman 23.666666667 sober 1
2 A barman 23.666666667 drinks 0
3 A billposter 18 sober 1
4 A billposter 18 drinks 0
5 A brassfounder 21.5 sober 2
6 A brassfounder 21.5 drinks 4
7 A cabmen 18.416666667 sober 1
8 A cabmen 18.416666667 drinks 10
9 A carter 22.166666667 sober 12
10 A carter 22.166666667 drinks 23

PROC TRANSPOSE created two columns with names _NAME_ and COL1. We can specify these names in PROC TRANSPOSE statement but no need for it since we will use DATA step to make the changes.

DATA table2;
SET table2 (RENAME=(col1=freq));
IF _name_ EQ "sober" THEN drinks="yes";
ELSE drinks="no";
DROP _name_;

and we get our final table:

Obs class trade wage freq drinks
1 A barman 23.666666667 1 yes
2 A barman 23.666666667 0 no
3 A billposter 18 1 yes
4 A billposter 18 0 no
5 A brassfounder 21.5 2 yes
6 A brassfounder 21.5 4 no
7 A cabmen 18.416666667 1 yes
8 A cabmen 18.416666667 10 no
9 A carter 22.166666667 12 yes
10 A carter 22.166666667 23 no
Creating random data

We use the RAND function to create random numbers:

DATA randdata;
CALL STREAMINIT(123);
DO i=1 TO 10;
    randcol = rand("UNIFORM");     OUTPUT;
END;

CALL STREAMINIT function simply specify the seed for random number generator. If we don't specify this with a number, we cannot reproduce our results.

This is our first case that we use a DO loop. DO i=1 to 10 tells SAS to repeat whatever statement follows preceding the END statement for each i value from 1 to 10. RAND function simply generates a random number between 0 and 1. Here is what happens with each loop:

  • randcol = RAND("UNIFORM"): Set i=1, create a random number between 0 and 1, assign it to randcol column
  • OUTPUT: Create a new row
  • randcol = RAND("UNIFORM"): Set i=2, create a random number between 0 and 1, assign it to randcol column
  • OUTPUT: Create a new row
  • ...
  • randcol = RAND("UNIFORM"): Set i=10, create a random number between 0 and 1, assign it to randcol column
  • OUTPUT: Create a new row

Output is below:

Obs i randcall
1 1 0.58170
2 2 0.03562
3 3 0.07818
4 4 0.38785
5 5 0.32917
6 6 0.36159
7 7 0.33759
8 8 0.16920
9 9 0.05670
10 10 0.07983

Some of the other distributions available for RAND function are:

  • RAND("Bernoulli",p), where p specifies the chance of success. For example, if we create 20 rows with p=0.5, about 10 of the rows will be 1 and 10 of the rows will be 0.
  • RAND("Bernoulli",p,n), where p specifies the chance of success and n specifies the number of independent Bernoulli trials. For example, if we create 100 rows with p=0.5 and n=3, about pn=0.53=0.125*100=12.5 of the rows will be 0 or 3 and so on.
  • RAND("Normal",θ,λ) creates a normal distribution where θ represents the mean and λ represents the standard deviation.
Extracting Characters from Data

Many times we come across a dataset where our point of interest is buried in data with extra unwanted information. For example, take a look at the table below:

Obs agegp alcgp tobgp ncases ncontrols
1 25-34 0-39g/day 0-9g/day 0 40
2 25-34 0-39g/day 10-19 0 10
3 25-34 0-39g/day 20-29 0 6
4 25-34 0-39g/day 30+ 0 5
5 25-34 40-79 0-9g/day 0 27

Second and third column in the dataset includes unwanted characters (g/day). How do we remove them easily? Believe it or not SAS has a function for this: COMPRESS. See the code below.

DATA tutorial.epoch;
SET tutorial.epoch;
alcgp = compress(alcgp,'g/day');
tobgp = compress(topgp,'g/day');
Obs agegp alcgp tobgp ncases ncontrols
1 25-34 0-39 0-9 0 40
2 25-34 0-39 10-19 0 10
3 25-34 0-39 20-29 0 6
4 25-34 0-39 30+ 0 5
5 25-34 40-79 0-9 0 27

Another but much more powerful option is utilizing regular expressions. I will put a lecture about this in the near future but for the time being simply know that following code will achieve the same result as above and more.

DATA tutorial.epoch;
SET tutorial.epoch;
re = PRXPARSE('/(\d{1,3}.[\d]{0,2})(?:.*)/');
match = PRXMATCH(re, alcgp);
alcgp = PRXPOSN(re,1,alcgp);
DROP re match;
DO Loops

One of the most powerful features of SAS DATA (and command line tools in general) is the power of loops for repetitive tasks. Let's say we have table which lists days of a month (i.e. 1,2,..30) and we want to add a column called 'day' which lists day of the week (i.e. Mon, Tue and so on). So we should start with day1 and

Arrays

If you are familiar with programming you already know what an array is. If you are not, an array is a data structure, which can store a fixed-size collection of elements of the same data type. Here is an example to demonstrate the power of arrays. Take a look at the table below. Say we want to convert 'N/A's to missing. How do we do it?

Obs fname lname age department city state role pay
1 Smith N/A N/A Sales Boston MA Salesper 112
2 Mark Adams 36 IT Houston TX Manager 105
3 Jennifer Miska N/A Sales Boston MA N/A 95
4 Adam Taylor 41 N/A N/A N/A Director N/A

One primitive way to do it is as follows:

DATA example;
SET example;
IF fname EQ 'N/A' THEN fname = .;
IF lname EQ 'N/A' THEN lname = .;
IF age EQ 'N/A' THEN age = .;
IF department EQ 'N/A' THEN department = .;
IF city EQ 'N/A' THEN city = .;
IF state EQ 'N/A' THEN state = .;
IF role EQ 'N/A' THEN role = .;
IF pay EQ 'N/A' THEN pay = .;

A better way of doing this is with arrays. We will define an array and call it 'var' and assign this to all columns such that var[1] represents fname, var[2] represents lname and so on. Then we will use this array in our IF statements. Code below goes over each var and checks the validity of IF statement.

DATA example;
SET example;
ARRAY var name -- pay;
DO OVER var;
   IF var EQ 'N/A' THEN var = .;
END;