Data Manipulation with SAS DATA

Data Manipulation

At the center of data manipulation lies the prominent SAS DATA step. SAS DATA is one of the best, if not the best, data manipulation tool available out there.

First thing to learn about SAS DATA step is that it process the data row-wise. What I mean with that will become more apparent with each example, so without further ado, let's begin working on our examples.

Creating a column

We have a single column dataset listing ages of different people. Here's what it looks like (download it from here).

Obs	age
1	11
2	21
3	27
4	72
5	27
6	32
7	45
8	51
9	73
10	13

To create a column we simply specify the new column and the value:

								
									DATA tutorial.data_age_new;

									SET tutorial.data_age;

									isadult = 1;

First, we specified the name of data set that will be created in the DATA statement. If we specified a the same original data set name, i.e. tutorial.data_age, it would have overwritten original data set. Also note the convention we are using for saving data: library.dataset. In our case, a new data set with the name data_age_new will be created in the tutorials library. Original data set, data_age won't be modified.

SET statement copies a temporary version of the specified data set to memory so that the original version will not change.

We create a new column by specifying a column name isadult and giving a value of 1. Here's how DATA will process this command:

• Header Row: Create a column named isadult
• Row 1: Read the value of age, write 1 to isadult column
• Row 1: Read the value of age, write 1 to isadult column
• ...
• Row 10: Read the value of age, write 1 to isadult column

Here's our result:

Obs	age	isadult
1	11	1
2	21	1
3	27	1
4	72	1
5	27	1
6	32	1
7	45	1
8	51	1
9	73	1
10	13	1

Probably this is not exactly what we wanted. We would like the value of isadult reflect the value of age in the first column. To do this, we use an IF-ELSE statement:

								
									DATA tutorial.data_age_new;

									SET tutorial.data_age;

									IF age < 18 THEN isadult = 1;

									ELSE isadult = 0;

and we get finally what we wanted:

Obs	age	isadult
1	11	1
2	21	0
3	27	0
4	72	0
5	27	0
6	32	0
7	45	0
8	51	0
9	73	0
10	13	1

Appending data sets

Let's say we have two data sets like this:

Table-I

Obs	fname	lname
1	Alicia	Silverstone
2	Tom	Cruise
3	Uma	Thurman

Table-II

Obs	fname	lname
1	Tom	Hanks
2	Sharon	Stone
3	Orlando	Bloom

Combining these data sets into one table is as easy as this:

								
									DATA table3;

									SET table1 table2;

And here's the result:

Table-III

Obs	fname	lname
1	Alicia	Silverstone
2	Tom	Cruise
3	Uma	Thurman
4	Tom	Hanks
5	Sharon	Stone
6	Orlando	Bloom

We learned the simplest case of combining tables. Now let's see a more complicated example:

Table-I

Obs	fname	lname	country
1	Alicia	Silverstone	US
2	Tom	Cruise	US
3	Uma	Thurman	UK

Table-II

Obs	fname	lname	age
1	Alicia	Silverstone	40
2	Tom	Cruise	50
3	Uma	Thurman	45

Let's repeat the previous example and see what happens:

								
									DATA table3;

									SET table1 table2;

Table-III

Obs	fname	lname	country	age
1	Alicia	Silverstone	US	.
2	Tom	Cruise	US	.
3	Uma	Thurman	UK	.
4	Alicia	Silverstone		40
5	Tom	Cruise		50
6	Uma	Thurman		45

OK, this is not we want. Clearly we want to get rid of duplicates. To get what we want we need to use MERGE operator instead of SET:

								
									DATA table3;

									MERGE table1 table2;

									BY fname lname;

Table-III

Obs	fname	lname	country	age
1	Alicia	Silverstone	US	40
2	Tom	Cruise	US	50
3	Uma	Thurman	UK	45

Transposing data sets

There are times we have longitudinal data and we would like to transpose it to wide data and vice versa. SAS has a tool for this purpose: PROC TRANSPOSE. Let's see a complicated example where we can combine the power of DATA and PROC TRANSPOSE. Here's our initial table:

Obs	class	trade	sober	drinks	wage
1	A	papercutter	1	1	24
2	A	cabmen	1	10	18.416666667
3	A	goldbeater	2	1	21.5
4	A	stablemen	1	5	21.166666667
5	A	millworker	2	0	19
6	A	porter	9	8	20.5
7	A	warehouse	8	2	24.333333333
8	A	maltmen	3	5	23
9	A	chimneysweep	0	7	17.333333333
10	A	carter	12	23	22.166666667

and we want to covert it to this:

Obs	class	trade	wage	freq	drinks
1	A	barman	23.666666667	1	yes
2	A	barman	23.666666667	0	no
3	A	billposter	18	1	yes
4	A	billposter	18	0	no
5	A	brassfounder	21.5	2	yes
6	A	brassfounder	21.5	4	no
7	A	cabmen	18.416666667	1	yes
8	A	cabmen	18.416666667	10	no
9	A	carter	22.166666667	12	yes
10	A	carter	22.166666667	23	no

This conversion multiple steps of DATA procedure and PROC TRANSPOSE. First, we would like to keep columns class, trade, wage in place - no change there. Second, instead of having two columns with sober and drinks, we would like to have one column drinks (yes or no). Finally the number of drinkers and smokers will be in another column, freq. Let's built our TRANSPOSE procedure:

								
									PROC TRANSPOSE DATA = table1;

									VAR sober drinks;

									BY class trade wage;

									RUN;

Before running this we have one technical requirements. In order for SAS to process any BY statement, variables declared after BY have to be sorted. I am not sure why it has to be so but it is. So let's sort these variables first:

								
									PROC SORT DATA = table1 OUT = table1_sort;

									BY class trade wage;

									RUN;

Now we can run our TRANSPOSE procedure:

								
									PROC TRANSPOSE DATA  =  table1_sort OUT = table2;

									VAR sober drinks;

									BY class trade wage;

									RUN;

Obs	class	trade	wage	_NAME_	COL1
1	A	barman	23.666666667	sober	1
2	A	barman	23.666666667	drinks	0
3	A	billposter	18	sober	1
4	A	billposter	18	drinks	0
5	A	brassfounder	21.5	sober	2
6	A	brassfounder	21.5	drinks	4
7	A	cabmen	18.416666667	sober	1
8	A	cabmen	18.416666667	drinks	10
9	A	carter	22.166666667	sober	12
10	A	carter	22.166666667	drinks	23

PROC TRANSPOSE created two columns with names _NAME_ and COL1. We can specify these names in PROC TRANSPOSE statement but no need for it since we will use DATA step to make the changes.

								
									DATA table2;

									SET table2 (RENAME=(col1=freq));

									IF _name_ EQ "sober" THEN drinks="yes";

									ELSE drinks="no";

									DROP _name_;

and we get our final table:

Obs	class	trade	wage	freq	drinks
1	A	barman	23.666666667	1	yes
2	A	barman	23.666666667	0	no
3	A	billposter	18	1	yes
4	A	billposter	18	0	no
5	A	brassfounder	21.5	2	yes
6	A	brassfounder	21.5	4	no
7	A	cabmen	18.416666667	1	yes
8	A	cabmen	18.416666667	10	no
9	A	carter	22.166666667	12	yes
10	A	carter	22.166666667	23	no

Creating random data

We use the RAND function to create random numbers:

								
									DATA randdata;

									CALL STREAMINIT(123);

									DO i=1 TO  10;

									    randcol = rand("UNIFORM");
									    OUTPUT;

									END;

CALL STREAMINIT function simply specify the seed for random number generator. If we don't specify this with a number, we cannot reproduce our results.

This is our first case that we use a DO loop. DO i=1 to 10 tells SAS to repeat whatever statement follows preceding the END statement for each i value from 1 to 10. RAND function simply generates a random number between 0 and 1. Here is what happens with each loop:

randcol = RAND("UNIFORM"): Set i=1, create a random number between 0 and 1, assign it to randcol column
OUTPUT: Create a new row
randcol = RAND("UNIFORM"): Set i=2, create a random number between 0 and 1, assign it to randcol column
OUTPUT: Create a new row
...
randcol = RAND("UNIFORM"): Set i=10, create a random number between 0 and 1, assign it to randcol column
OUTPUT: Create a new row

Output is below:

Obs	i	randcall
1	1	0.58170
2	2	0.03562
3	3	0.07818
4	4	0.38785
5	5	0.32917
6	6	0.36159
7	7	0.33759
8	8	0.16920
9	9	0.05670
10	10	0.07983

Some of the other distributions available for RAND function are:

RAND("Bernoulli",p), where p specifies the chance of success. For example, if we create 20 rows with p=0.5, about 10 of the rows will be 1 and 10 of the rows will be 0.
RAND("Bernoulli",p,n), where p specifies the chance of success and n specifies the number of independent Bernoulli trials. For example, if we create 100 rows with p=0.5 and n=3, about pⁿ=0.5³=0.125*100=12.5 of the rows will be 0 or 3 and so on.
RAND("Normal",θ,λ) creates a normal distribution where θ represents the mean and λ represents the standard deviation.

Extracting Characters from Data

Many times we come across a dataset where our point of interest is buried in data with extra unwanted information. For example, take a look at the table below:

Obs	agegp	alcgp	tobgp	ncontrols
1	25-34	0-39g/day	0-9g/day	40
2	25-34	0-39g/day	10-19	10
3	25-34	0-39g/day	20-29	6
4	25-34	0-39g/day	30+	5
5	25-34	40-79	0-9g/day	27

Second and third column in the dataset includes unwanted characters (g/day). How do we remove them easily? Believe it or not SAS has a function for this: COMPRESS. See the code below.

								
									DATA tutorial.epoch;

									SET tutorial.epoch;

									alcgp = compress(alcgp,'g/day');

									tobgp = compress(topgp,'g/day');

Obs	agegp	alcgp	tobgp	ncontrols
1	25-34	0-39	0-9	40
2	25-34	0-39	10-19	10
3	25-34	0-39	20-29	6
4	25-34	0-39	30+	5
5	25-34	40-79	0-9	27

Another but much more powerful option is utilizing regular expressions. I will put a lecture about this in the near future but for the time being simply know that following code will achieve the same result as above and more.

								
									DATA tutorial.epoch;

									SET tutorial.epoch;

									re = PRXPARSE('/(\d{1,3}.[\d]{0,2})(?:.*)/');

									match = PRXMATCH(re, alcgp);

									alcgp = PRXPOSN(re,1,alcgp);

									DROP re match;

DO Loops

One of the most powerful features of SAS DATA (and command line tools in general) is the power of loops for repetitive tasks. Let's say we have table which lists days of a month (i.e. 1,2,..30) and we want to add a column called 'day' which lists day of the week (i.e. Mon, Tue and so on). So we should start with day1 and

Arrays

If you are familiar with programming you already know what an array is. If you are not, an array is a data structure, which can store a fixed-size collection of elements of the same data type. Here is an example to demonstrate the power of arrays. Take a look at the table below. Say we want to convert 'N/A's to missing. How do we do it?

Obs	fname	lname	age	department	city	state	role	pay
1	Smith	N/A	N/A	Sales	Boston	MA	Salesper	112
2	Mark	Adams	36	IT	Houston	TX	Manager	105
3	Jennifer	Miska	N/A	Sales	Boston	MA	N/A	95
4	Adam	Taylor	41	N/A	N/A	N/A	Director	N/A

One primitive way to do it is as follows:

								
									DATA example;

									SET example;

									IF fname EQ 'N/A' THEN fname = .;

									IF lname EQ 'N/A' THEN lname = .;

									IF age EQ 'N/A' THEN age = .;

									IF department EQ 'N/A' THEN department = .;

									IF city EQ 'N/A' THEN city = .;

									IF state EQ 'N/A' THEN state = .;

									IF role EQ 'N/A' THEN role = .;

									IF pay EQ 'N/A' THEN pay = .;

A better way of doing this is with arrays. We will define an array and call it 'var' and assign this to all columns such that var[1] represents fname, var[2] represents lname and so on. Then we will use this array in our IF statements. Code below goes over each var and checks the validity of IF statement.

								
									DATA example;

									SET example;

									ARRAY var name -- pay;

									DO OVER var;

									   IF var EQ 'N/A' THEN var = .;

									END;