• Graham Whitehead

Using hash tables to split datasets in SAS

A common task in data preparation is splitting a dataset based on the values of a column. Whilst there are a variety of ways of accomplishing this task in SAS, these often require at least two passes through the data: first to identify the unique values, and then to use a where clause with a datastep or proc sql procedure. Here we present an alternative solution using the hash table object in just one pass. This approach should be simpler, and more efficient both in speed and memory requirements.

We present this solution by showing how we could split the dataset Birthwgt (one of the Sashelp example data sets) by the variable ‘race’.

First, we define the dataset and specify the variable we want to split on, as below:

%Let Dset = SASHELP.BIRTHWGT;

%Let SplitVar = race;

Next, we specify the columns that we want to keep in each of the final split output datasets. This can either be done directly like this:

%Let Vars = %Str("LowBirthWgt", "Married", "AgeGroup", "Race", "Drinking", "Death")

or, more dynamically, by collecting all the columns as the macro variable ‘vars’ as in the example below:

proc contents data = &DSet noprint out=conts; run;

proc sort data = &conts; by varnum; run;

proc sql noprint:

select quote(trim(name)) into: vars separated by ',' from conts;

quit;

Before running the hash table split datastep, the dataset must be sorted in ascending value of the variable to be split on as below. (There are more complex ways to do achieve this within the same datastep but instead we choose here to present a simplified version for easier comprehension of the process.)

proc sort data = &Dset out = SplitDset; by &SplitVar; run;

Finally, we set up the hash table with the following piece of code: data _null_;

declare hash split_hash;

split_hash = _new_hash(ordered: 'a');

split_hash.definekey('key');

split_hash.defineData(&vars);

split_hash.definedone();

do until (last.&SplitVar);

set SplitDset;

by &SplitVar notsorted;

key + 1;

split_hash.add();

end;

split_hash.output(dataset: cats("ForAnalysis_",trim(&SplitVar)));

rc = split_hash.delete();

run; The above procedure can be summarised as below:

  1. Declare a new hash object into called ‘split_hash’

  2. Create the hash object and assign to it the variable ‘split_hash’ specifying that the dataset is ordered by ascending value

  3. Define a unique value for each of the values to be split on in the hash object as ‘key’

  4. Pass in the variables that we want to be on each of the output datasets. These are the values that we previously assigned to the variable ‘vars’

  5. Loop through each value of the dataset adding each row to the hash object until the last value of the current &SplitVar is reached

  6. Split_hash.output then creates the dataset required for the current value of &SplitVar

  7. Finally, delete the contents of the hash object to free up the memory taken up


Author: Graham Whitehead, Data Scientist

67 views

ABOUT US

We solve the data problems that seem unsolvable.

GET SOCIAL

  • Jump Data Twitter Account
  • Jump Data Linkedin Account

Jumpdata Limited. Company No. 08241266.

JumpData Logo
Jump Data Linkedin Account
Jump Data Twitter Account