Splitting a Data File Using Batch Logic

CSPro has no simple built-in way to split a data file into several parts. Someone asked me: "How would I split a file with 300 cases into six files, each with 50 cases?" It is possible to do this by writing a recursive batch program, that is, one that launches itself again for each part. This is not a particularly efficient way to split a file, because every run reads the input from the beginning and skips the cases that belong to earlier parts, but it works fine for data files that are not too large. This code is probably not worth using if your data file contains more than a million cases.

The approach uses the skip case statement to write out cases selectively. On the first run, the program does nothing but create a PFF that calls the program again, passing the number of the first block as a parameter. That second run writes out the cases in its block, skips the cases before it, and then calls the program again with the next block number. This continues until the whole file has been processed. In the example above, the program runs seven times: once to create the first PFF, and then six more times, once for each block of 50 cases. See the code below:

PROC GLOBAL

numeric numCasesPerFile = 50;

numeric currentCase,currentIteration,desiredStartCase,desiredEndCase;

file pffFile;

function writeOutPffAndStop(nextStartIteration)

    // give each PFF a unique name in the temp folder: date, time, and iteration number
    setfile(pffFile,maketext("%s%d%d_%d.pff",pathname(temp),sysdate("YYYYMMDD"),systime(),nextStartIteration));

    filewrite(pffFile,"[Run Information]");
    filewrite(pffFile,"Version=CSPro 4.1");
    filewrite(pffFile,"AppType=Batch");

    filewrite(pffFile,"[Files]");
    filewrite(pffFile,"Application=%ssplitFile.bch",pathname(application));
    filewrite(pffFile,"InputData=%s",filename(CEN2000));
    filewrite(pffFile,"OutputData=%s_%d",filename(CEN2000),nextStartIteration);
    filewrite(pffFile,"Listing=%s.lst",filename(pffFile));

    filewrite(pffFile,"[Parameters]");
    filewrite(pffFile,"ViewListing=Never");
    filewrite(pffFile,"ViewResults=Yes");
    // pass the iteration number to the next run, which reads it back with sysparm()
    filewrite(pffFile,"Parameter=%d",nextStartIteration);

    close(pffFile);

    // launch the next run, then end this one
    execpff(filename(pffFile));
    stop();

end;

PROC DICTIONARY_FF

preproc

if sysparm() = "" then // we're on the first run
    writeOutPffAndStop(1);

else
    currentIteration = tonumber(sysparm());
    desiredStartCase = 1 + ( currentIteration - 1 ) * numCasesPerFile;
    desiredEndCase = desiredStartCase + numCasesPerFile - 1;

endif;

PROC QUEST

preproc

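inc(currentCase); // count every case read in this run

if currentCase > desiredEndCase then // past the end of this block, so start the next run
    writeOutPffAndStop(currentIteration + 1);

elseif currentCase < desiredStartCase then // before this block, so do not write this case
    skip case;

endif;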
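
To make the arithmetic in the DICTIONARY_FF preproc concrete, here is what the two formulas work out to for the 300-case example, with numCasesPerFile left at 50. The run number is the value passed through the Parameter line of the PFF. The table was worked out by hand from the formulas; it is not output from CSPro.

```
run (sysparm)    desiredStartCase    desiredEndCase
1                1                   50
2                51                  100
3                101                 150
4                151                 200
5                201                 250
6                251                 300
```

Assuming the input file really does contain exactly 300 cases, currentCase never exceeds 300 in the sixth run, so no further PFF is written and the chain of runs ends there.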

You can use the program above almost exactly as is, with the following modifications:

Change numCasesPerFile from 50 to the number of cases you want in each output file.
Replace "CEN2000" with the name of your dictionary. (There are two places where this appears.)
Replace "DICTIONARY_FF" with the name of your top-level batch PROC. (It will end with _FF.)
Replace "QUEST" with the name of your dictionary's first level.
Replace "splitFile.bch" with the file name of your batch application, unless you save it under that name.
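
For reference, this is roughly what the PFF written for the second block should contain, going by the filewrite calls in writeOutPffAndStop. The parts in angle brackets are placeholders, not real paths: they stand for whatever pathname(application), filename(CEN2000), and filename(pffFile) return on your machine.

```
[Run Information]
Version=CSPro 4.1
AppType=Batch
[Files]
Application=<application folder>splitFile.bch
InputData=<input data file>
OutputData=<input data file>_2
Listing=<name of this PFF>.lst
[Parameters]
ViewListing=Never
ViewResults=Yes
Parameter=2
```

Note that the block number is appended to the full input file name, after any extension. A hypothetical input file named census.dat would therefore produce output files named census.dat_1, census.dat_2, and so on.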

See here for an example of this application using the Popstan dictionary.