Data Sampling with SAS PROC SURVEYSELECT

Sampling

There are a variety of probability sampling strategies. Frequently used probability sampling strategies are simple, systematic, stratified, and cluster sampling. Simple random sampling may be the best known sampling strategy. Systematic random sampling uses a list of population elements. If elements can be assumed to be randomly listed, then a starting point is randomly identified, and elements are selected using a sampling interval. These intervals are calculated by dividing the desired sample size by the number of elements in the sampling frame. For example, to select a sample size of 50 from a sampling frame with 100 elements, pick a random starting point and select every second element until 50 elements are selected. Stratified sampling uses groups to achieve representativeness, or to ensure that a certain number of elements from each group are selected. In a stratified sample, the sampling frame is divided into nonoverlapping groups or strata (e.g., age groups, gender). Then a random sample is taken from each stratum. This technique, for example, can be used to study a small subgroup of a population that could be excluded in a simple random sample. Cluster sampling enables random sampling from either a very large population or one that is geographically diverse. An important objective of cluster sampling is to reduce costs by increasing sampling efficiency. A problem with cluster sampling is that, although every cluster has the same chance of being selected, elements within large clusters have a greatly reduced chance of being selected in the final sample. Using the probability proportionate to size (PPS) technique corrects this error. PPS takes into account the difference in cluster size and adjusts the chance that clusters will be selected. That is, PPS increases the odds that elements in larger clusters will be selected [1]. All of these methods are available with the SURVEYSELECT procedure.

The SURVEYSELECT procedure provides a variety of methods for selecting probability-based random samples. The procedure can select a simple random sample or can sample according to a complex multistage sample design that includes stratification, clustering, and unequal probabilities of selection.

Let's use an imaginary dataset to demonstrate each sampling method. For example let's assume we have dataset that lists number of accidents per each day of a week and we would like to get a sample out of it. A few observations from the dataset looks like this (you can download the full dataset from here.)

Obs	day	dayofweek	n
1	1	Monday	49
2	2	Tuesday	30
3	3	Wednesday	29
4	4	Thursday	53
5	5	Friday	73
6	6	Saturday	103
7	7	Sunday	66
8	8	Monday	52
9	9	Tuesday	30
10	10	Wednesday	28

Here's a summary of the original dataset:

Analysis Variable : n
dayofweek	N	Mean	Std Dev	Minimum	Maximum
Friday	3000	69.54	18.37	11.00	134.00
Monday	3000	59.55	9.88	20.00	91.00
Saturday	3000	79.61	17.73	4.00	142.00
Sunday	3000	59.88	13.11	19.00	103.00
Thursday	3000	54.34	8.02	28.00	83.00
Tuesday	3000	39.61	7.13	9.00	69.00
Wednesday	3000	34.32	5.99	15.00	56.00

Simple Random Sampling

The method of simple random sampling without replacement (METHOD=SRS) selects units with equal probability and without replacement and also the default sampling method in SAS. Each possible sample of different units out of has the same probability of being selected. After a unit is selected it cannot be selected again. Here's how we do it in SAS:

							
								PROC SURVEYSELECT DATA=tutorial.sampling SAMPSIZE=100 METHOD=SRS SEED=1234 OUT=sample;

								RUN;

Here's a summary of our sample:

Analysis Variable : n
dayofweek	N	Mean	Std Dev	Minimum	Maximum
Friday	12	64.25	16.03	39.00	87.00
Monday	10	54.00	8.97	44.00	75.00
Saturday	18	82.11	21.92	33.00	113.00
Sunday	18	63.44	14.20	44.00	93.00
Thursday	15	54.87	7.37	43.00	65.00
Tuesday	16	42.13	7.64	26.00	53.00
Wednesday	11	33.09	6.20	24.00	42.00

The problem with our sample is the distribution of days; clearly Saturday and Sunday are overrepresented whereas Monday and Wednesday are underrepresented. To fix this we have to specify a strata preceded by sorting the data by specified strata (this is one of the things that I still don't get - why this sorting cannot be done by the process itself but has to be separately conducted):

							
								PROC SORT DATA=tutorial.sampling;

								BY dayofweek;

								RUN;
								
								PROC SURVEYSELECT DATA=tutorial.sampling SAMPSIZE=14 METHOD=SRS SEED=1234 OUT=sample;

								STRATA dayofweek;

								RUN;

Analysis Variable : n
dayofweek	N	Mean	Std Dev	Minimum	Maximum
Friday	14	60.71	20.41	11.00	86.00
Monday	14	61,43	9.43	47.00	81.00
Saturday	14	79.07	21.83	46.00	126.00
Sunday	14	65.57	13.79	38.00	83.00
Thursday	14	52.07	6.79	37.00	65.00
Tuesday	14	37.64	7.63	27.00	49.00
Wednesday	14	33.93	5.59	26.00	42.00

Note we specified the sample size as 14 for each strata so that we have a sample size of 14x7=98.

Strata option worked well in our example because there are equal number of observations for each day of the week but what if we have a sample with unequal distribution of days of week and we represent this proportion in our sample. Assume for example that our original data looked like this instead:

Analysis Variable : n
dayofweek	N	Mean	Std Dev	Minimum	Maximum
Friday	211	74.41	14.77	51.00	134.00
Monday	190	63.56	7.61	51.00	86.00
Saturday	232	81.60	16.70	52.00	131.00
Sunday	162	65.74	10.06	51.00	97.00
Thursday	183	58.89	5.56	51.00	73.00
Tuesday	21	53.81	2.82	51.00	59.00
Wednesday	1	52.00	-	52.00	52.00

To get an accurate sample representing the original distribution of days of the week, we can use METHOD=SEQ, sequential random sampling which would take the original distribution into account without a need to specify strata at all. Here's how we do it:

							
								PROC SURVEYSELECT DATA=tutorial.sampling SAMPSIZE=100 METHOD=SEQ SEED=1234 OUT=sample;

								RUN;

Analysis Variable : n
dayofweek	N	Mean	Std Dev	Minimum	Maximum
Friday	21	73.24	12.07	52.00	95.00
Monday	19	62.47	6.10	51.00	73.00
Saturday	23	80.00	15.37	56.00	110.00
Sunday	16	63.94	11.11	51.00	97.00
Thursday	19	58.42	5.68	52.00	69.00
Tuesday	2	55.00	4.24	52.00	58.00

If we instead want to create a sample with replacement instead, we should use METHOD=URS (unrestricted random sampling). With replacement in place of course, there is always a chance of selecting the same observation more than once. By default SAS removes duplicate observations from the final dataset. Let's demonstrate this with an example:

							
								PROC SURVEYSELECT DATA=tutorial.sampling SAMPSIZE=14 METHOD=URS SEED=4345 OUT=sample;

								STRATA dayofweek;

								RUN;

Analysis Variable : n
dayofweek	N	Mean	Std Dev	Minimum	Maximum
Friday	14	65.50	18.10	26.00	91.00
Monday	14	62.79	9.01	44.00	77.00
Saturday	14	86.71	23.31	45.00	132.00
Sunday	14	59.36	12.24	40.00	79.00
Thursday	13	55.15	5.15	46.00	63.00
Tuesday	14	40.07	5.31	32.00	47.00
Wednesday	14	32.50	5.84	19.00	42.00

As can be seen, we have 13 observations for Thursday instead of 14 we specified. In order to keep the duplicate instead we add OUTHITS option to our code.

							
								PROC SURVEYSELECT DATA=tutorial.sampling SAMPSIZE=14 METHOD=URS OUTHITS SEED=4345 OUT=sample;

								STRATA dayofweek;

								RUN;

Analysis Variable : n
dayofweek	N	Mean	Std Dev	Minimum	Maximum
Friday	14	65.50	18.10	26.00	91.00
Monday	14	62.79	9.01	44.00	77.00
Saturday	14	86.71	23.31	45.00	132.00
Sunday	14	59.36	12.24	40.00	79.00
Thursday	14	55.21	4.95	46.00	63.00
Tuesday	14	40.07	5.31	32.00	47.00
Wednesday	14	32.50	5.84	19.00	42.00

Example: Titanic Passengers

Description: TThe sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This data table describes the survival status of 1309 of the 1324 individual passengers on the Titanic. Information on the 899 crew members is not included

name: Passenger name.
survived: Yes or No.
passenger_class: 1, 2, or 3 corresponding to 1st, 2nd, or 3rd class.
sex: Passenger sex.
age: Passenger age.
siblings_and_spouses: The number of siblings and spouses aboard.
parents_and_children: The number of parents and children aboard.
fare: The passenger fare.
port: Port of embarkment (C = Cherbourg; Q = Queenstown; S = Southampton).
home_dest: The home or final intended destination of the passenger.

Source: Building Better Models with JMP Pro
Download the data from here

Task: Create a data set with %70 train and 30% test, derive a logistic regression model to find factors effecting survival chance with train data and test it with test data.

Let's first create our test and train dataset with SURVEYSELECT procedure:

							
								PROC SURVEYSELECT DATA=tutorial.titanic RATE=0.3 OUTALL METHOD=SRS SEED=1234 OUT=sample;

								RUN;

RATE defines the percent of data to be selected - 30% in our case.
OUTALL tells SAS to output not just the selected but also unselected data in the output. To separate the selected from unselected, SAS will create a column called "selected" with value 1 for selected and 0 for unselected.
METHOD=SRS specifies the selection method = sampling with replacement in this case. Other option is URS.
SEED=1234 is simply a seed number for generating random selection. This can be any number. If you use the same number you will get the same result - it's good for reproducibility.

Now we have a new dataset with a "selected" column and we would like to create two datasets with names titanic_train and titanic_test. To do this we use DATA procedure:

							
								DATA titanic_train titanic_test;

								SET sample;

								IF selected=0 THEN OUTPUT titanic_train;

								ELSE OUTPUT titanic_test;

Next we can run our first logistic regression model with train data:

							
								PROC LOGISTIC DATA= titanic_train;

								CLASS survived passenger_class sex port;

								MODEL survived = passenger_class sex port age siblings_and_spouses parents_and_children fare;

								RUN;

Model Fit Statistics
Criterion	Intercept Only	Intercept and Covariates
AIC	977.400	682.571
SC	981.984	728.405
-2 Log L	975.400	662.571

Testing Global Null Hypothesis: BETA=0
Test	Chi-Square	DF	Pr > ChiSq
Likelihood Ratio	312.8295	9	<.0001
Score	272.0024	9	<.0001
Wald	187.3937	9	<.0001

Type 3 Analysis of Effects
Effect	DF	Wald Chi-Square	Pr > ChiSq
Passenger_Class	2	39.9014	<.0001
Sex	1	129.7333	<.0001
Port	2	22.1606	<.0001
Age	1	24.2463	<.0001
Siblings_and_Spouses	1	7.5274	0.0061
Parents_and_Children	1	0.0452	0.8316
Fare	1	0.0014	0.9697

Analysis of Maximum Likelihood Estimates
Parameter		DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept		1	1.0919	0.3459	9.9635	0.0016
Passenger_Class	1	1	0.9339	0.2050	20.7613	<.0001
Passenger_Class	2	1	0.0812	0.1562	0.2699	0.6034
Sex	female	1	1.2423	0.1091	129.7333	<.0001
Port	C	1	1.1380	0.2474	21.1600	<.0001
Port	Q	1	-1.2993	0.3786	11.7762	0.0006
Age		1	-0.0394	0.00800	24.2463	<.0001
Siblings_and_Spouses		1	-0.3617	0.1318	7.5274	0.0061
Parents_and_Children		1	0.0260	0.1223	0.0452	0.8316
Fare		1	0.000099	0.00260	0.0014	0.9697

Odds Ratio Estimates
Effect	Point Estimate	95% Wald Confidence Limits
Passenger_Class 1 vs 3	7.022	3.648	13.516
Passenger_Class 2 vs 3	2.993	1.861	4.814
Sex female vs male	11.997	7.823	18.397
Port C vs S	2.655	1.600	4.406
Port Q vs S	0.232	0.077	0.695
Age	0.961	0.946	0.977
Siblings_and_Spouses	0.696	0.538	0.902
Parents_and_Children	1.026	0.808	1.304
Fare	1.000	0.995	1.005

Association of Predicted Probabilities and Observed Responses
Percent Concordant	84.8	Somers' D	0.697
Percent Discordant	15.1	Gamma	0.697
Percent Tied	0.0	Tau-a	0.336
Pairs	125852	c	0.849

Variables parents_and_children and fare are relevant so we will remove them from our model. Also we will add a new option to PROC LOGISTIC statement: OUTMODEL=titanic. This will save the model (formula) for making estimations on the test data set.

							
								PROC LOGISTIC DATA=titanic_train OUTMODEL=titanic;

								CLASS survived passenger_class sex port;

								MODEL survived = passenger_class sex port age siblings_and_spouses;

								RUN;

Model Fit Statistics
Criterion	Intercept Only	Intercept and Covariates
AIC	978.434	678.689
SC	983.019	715.367
-2 Log L	976.434	662.689

Testing Global Null Hypothesis: BETA=0
Test	Chi-Square	DF	Pr > ChiSq
Likelihood Ratio	313.7450	7	<.0001
Score	272.7190	7	<.0001
Wald	187.6911	7	<.0001

Type 3 Analysis of Effects
Effect	DF	Wald Chi-Square	Pr > ChiSq
Passenger_Class	2	52.2370	<.0001
Sex	1	137.4998	<.0001
Port	2	22.5883	<.0001
Age	1	24.6579	<.0001
Siblings_and_Spouses	1	8.0404	0.0046

Analysis of Maximum Likelihood Estimates
Parameter		DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept		1	1.1068	0.3281	11.3792	0.0007
Passenger_Class	1	1	0.9380	0.1743	28.9742	<.0001
Passenger_Class	2	1	0.0795	0.1516	0.2746	0.6002
Sex	female	1	1.2482	0.1064	137.4998	<.0001
Port	C	1	1.1421	0.2463	21.5095	<.0001
Port	Q	1	-1.3050	0.3781	11.9122	0.0006
Age		1	-0.0395	0.00796	24.6579	<.0001
Siblings_and_Spouses		1	-0.3523	0.1242	8.0404	0.0046

Odds Ratio Estimates
Effect	Point Estimate	95% Wald Confidence Limits
Passenger_Class 1 vs 3	7.067	4.062	12.296
Passenger_Class 2 vs 3	2.995	1.870	4.796
Sex female vs male	12.138	7.997	18.424
Port C vs S	2.663	1.611	4.399
Port Q vs S	0.230	0.077	0.689
Age	0.961	0.946	0.976
Siblings_and_Spouses	0.703	0.551	0.897

Association of Predicted Probabilities and Observed Responses
Percent Concordant	84.9	Somers' D	0.698
Percent Discordant	15.0	Gamma	0.699
Percent Tied	0.1	Tau-a	0.337
Pairs	126144	c	0.849

Let's now see how good our model is with our test data set.

							
								PROC LOGISTIC INMODEL=titanic;

								SCORE DATA=titanic_test OUT=titanic_score FITSTAT ;

								RUN;

Fit Statistics for SCORE Data
Data Set	Total Frequency	Log Likelihood	Error Rate	AIC	AICC	BIC	SC	R-Square	Max-Rescaled R-Square	AUC	Brier Score
WORK.TITANIC_TEST	320	-150.1	0.2000	316.1415	316.6046	346.2881	346.2881	0.342782	0.461507	0.846468	0.148635

Our error rate is 20%, not bad. Another measure I typically look into is Brier Score. Brier Score is the mean squared difference between the predicted probability and the actual outcome. The lower the Brier score is, the better our model is. A Brier score of zero means perfect fit and one means worst fit possible. For our case it's 0.15, not so bad - not amazing either. If we had randomly estimated survival, our Brier score would have been around 0.25, so 0.15 is not too bad.

When saving scores, SAS puts two columns in the output: f_survived and i_survived for each observation. f_survived is the value of survived in the titanic_test dataset and i_survived is the estimated value of the survival based on the logistic model. Therefore the more agreeement between these columns, the better our model is. We can utilize PROC FREQ to see how good our estimates are:

							
								PROC FREQ DATA=titanic_score;

								TABLE f_survived*i_survived / AGREE;

								RUN;

And here is our output:

McNemar's Test
Statistic (S)	1.5625
DF	1
Pr > S	0.2113

p-value for McNemar test is 0.2113. Higher p-values mean better agreement. We can use this measure to compare different models. Agree plot shows our fit visually. Note that this measure would only be useful for relatively balanced designs.

Preventing Bias in Sampling

It is important to create samples that represent original data accurately otherwise we would have a biased sample. For example, take a look at the carddefault data which shows the credit card defaults with other socio-economic variables. A predictive model would try to estimate the defaulting. Let's take a look at the default and student ratios in the dataset via PROC FREQ:

							
								PROC FREQ DATA=tutorial.carddefault;

								TABLES default student;

								RUN;

default	Frequency	Percent	Cumulative Frequency	Cumulative Percent
No	9667	96.67	9667	96.67
Yes	333	3.33	10000	100.00

student	Frequency	Percent	Cumulative Frequency	Cumulative Percent
No	7056	70.56	7056	70.56
Yes	2944	29.44	10000	100.00

Now let's create a training sample from our dataset via PROC SURVEYSELECT and see the distribution via PROC FREQ:

							
								PROC SURVEYSELECT DATA=tutorial.carddefault OUT=train_carddefault RATE=0.7 SEED=1234;

								RUN;

								PROC FREQ DATA=train_carddefault;

								TABLES default student;

								RUN;

default	Frequency	Percent	Cumulative Frequency	Cumulative Percent
No	6742	96.31	6742	96.31
Yes	258	3.69	7000	100.00

student	Frequency	Percent	Cumulative Frequency	Cumulative Percent
No	4969	70.99	4969	70.99
Yes	2031	29.01	7000	100.00

As can be seen from above default=Yes is oversampled and student=Yes is undersampled. To remedy this we have to use STRATA statement with PROC SURVEYSELECT. Note that sorting the data is required before invoking STRATA.

							
								PROC SORT DATA=tutorial.carddefault;

								BY default student;

								RUN;

								PROC SURVEYSELECT DATA=tutorial.carddefault OUT=train_carddefault RATE=0.7 SEED=1234;

								STRATA default student;

								RUN;

								PROC FREQ DATA=train_carddefault;

								TABLES default student;

								RUN;

default	Frequency	Percent	Cumulative Frequency	Cumulative Percent
No	6767	96.66	6767	96.66
Yes	234	3.34	7001	100.00

student	Frequency	Percent	Cumulative Frequency	Cumulative Percent
No	4940	70.56	4940	70.56
Yes	2061	29.44	7001	100.00

Determining Sample Size: Balancing Power, Precision, and Practicality, P. Dattalo

Sampling

Simple Random Sampling

Example: Titanic Passengers

Preventing Bias in Sampling

ARIMA Models

Generalized Linear Models

Leave a Comment

Lrrr

Popular Posts

Space The Final Frontier

The Amazing Hubble

Astronomy Or Astrology

Asteroids telescope

Post Categories

Sampling

Simple Random Sampling

Example: Titanic Passengers

Preventing Bias in Sampling

ARIMA Models

Generalized Linear Models

Leave a Comment

Lrrr

Popular Posts

Space The Final Frontier

The Amazing Hubble

Astronomy Or Astrology

Asteroids telescope

Post Categories

Newsletter

Tag Clouds