Creating Subsets of Data

The following is a list of commands and options that are used when creating subsets of data files:

~execute do_subset
~input/~output sampling=#n
~input/~output sampling=.nnn
~input/~output try_for_sampling=#n
~input/~output try_for_sampling=.nnn
~input/~output select=
~input/~output select=casewritten/not(casewritten)
~input/~output num_sample_cases=
~output file_name #n
>random_seed=

1) ~execute do_subset

The ~execute command do_subset is what launches the subsetting run based on the ~input/~output commands and options you have chosen. It is very much like issuing a “write_now” command for every ~output file. The simplest version of subset is shown in subs1.spx

2) ~input/~output select=casewritten/not(casewritten)

A special variable “casewritten” may be used as part of the select= condition. Most often it is used with not(), as in not(casewritten). It takes effect when one is using multiple ~output files in a subset run. See subs2.spx
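The effect of not(casewritten) across several ~output files can be sketched in Python (this is an illustration, not Mentor syntax). The function name route_records and the predicates are hypothetical, and the sketch assumes the outputs are evaluated in the order they are listed, which this page does not state.

```python
def route_records(records, output_selects):
    """Send each record to every output whose select condition passes.

    output_selects is a list of predicates taking (record, casewritten).
    casewritten is True once an earlier output has written the record.
    Assumption: outputs are checked in listed order.
    """
    outputs = [[] for _ in output_selects]
    for record in records:
        casewritten = False
        for i, select in enumerate(output_selects):
            if select(record, casewritten):
                outputs[i].append(record)
                casewritten = True
    return outputs


# Hypothetical example: the first output takes even values,
# the second takes whatever no earlier output has written.
evens_then_rest = route_records(
    range(6),
    [
        lambda rec, written: rec % 2 == 0,
        lambda rec, written: not written,
    ],
)
```

Here the second output acts as a catch-all, so no record is lost and none is written twice.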

3) ~input/~output sampling=#n

The pound sign version of sampling= gives you a random sample of “n” records from the data. See subs3.spx

4) ~input/~output sampling=.n

When using sampling=.n, the number of records written will be .n times the number of records available to be written (i.e. num_sample_cases). The exact number of cases written will be the result of this calculation rounded to the nearest whole number. See subs4.spx
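The arithmetic can be sketched in Python as follows. The function name cases_to_write is hypothetical, and the sketch assumes that exact halves round up, since this page does not say how ties are handled.

```python
import math


def cases_to_write(fraction, num_sample_cases):
    """Return fraction * num_sample_cases rounded to the nearest whole number.

    Assumption: an exact half rounds up; the source does not specify ties.
    """
    return math.floor(fraction * num_sample_cases + 0.5)


quarter_of_1000 = cases_to_write(0.25, 1000)  # 0.25 * 1000 = 250
tenth_of_37 = cases_to_write(0.1, 37)         # 3.7 rounds to 4
```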

The “try_for” sampling options work the same way that the sampling= options do, except that it is not an error when the number of records requested is not available in the sample.
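The difference between the two families of options might be pictured with this Python sketch. The function sample_records and its try_for flag are hypothetical names, and the sketch assumes that a “try_for” request that cannot be met simply writes every available record, which this page does not confirm.

```python
import random


def sample_records(records, n, try_for=False, rng=None):
    """Draw n records at random; behavior differs when n is too large.

    try_for=False: asking for more records than exist is an error.
    try_for=True:  assumption, fall back to all available records.
    """
    rng = rng or random.Random()
    records = list(records)
    if n > len(records):
        if not try_for:
            raise ValueError(
                f"requested {n} records, only {len(records)} available"
            )
        n = len(records)
    return rng.sample(records, n)
```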

5) >random_seed=

Sampling= picks a random sample from the data. Normally, each time a subsetted output file is created a somewhat different collection of records will be output. By using >random_seed= one can force the starting point of the randomizing process, and thus make it possible to repeatedly create the same “random” sample of records. See subs5.spx
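The idea of a fixed seed is general and can be shown with Python's standard random module. This is an analogy only: the function name seeded_sample and the seed value are hypothetical, and nothing here describes how the program itself generates random numbers.

```python
import random


def seeded_sample(records, n, seed):
    """Draw n records using a generator started from a fixed seed."""
    rng = random.Random(seed)
    return rng.sample(list(records), n)


# Two runs with the same seed pick the same "random" records.
first_run = seeded_sample(range(100), 5, seed=12345)
second_run = seeded_sample(range(100), 5, seed=12345)
```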

6) ~input/~output select=

Select= may now be used on the ~input, the ~output, or both. If select= appears on both, the one on the ~input is executed first, so a record that fails the ~input selection criteria never has the opportunity to be written to the output file. Both select= and sampling= may be used at the same time. See subs6.spx
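The order of evaluation can be sketched in Python. The function name subset and the two predicates are hypothetical and stand in for the select= conditions on the ~input and ~output.

```python
def subset(records, input_select=None, output_select=None):
    """Apply the input condition first, then the output condition."""
    passed_input = [
        r for r in records if input_select is None or input_select(r)
    ]
    # Only records that passed the input condition are considered here.
    return [
        r for r in passed_input if output_select is None or output_select(r)
    ]


# Hypothetical conditions: input keeps values of 5 or more,
# output keeps even values.
written = subset(range(10), lambda r: r >= 5, lambda r: r % 2 == 0)
```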

7) ~input/~output num_sample_cases=

In order to pull a random sample from an existing sample, the number of cases in the existing sample needs to be known. For example, if you wanted five cases out of 10,000 you would want the random cases to be pulled from random locations throughout the file, not just the beginning, middle, or end of the original file.
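One standard technique that shows why the total must be known is selection sampling (Knuth's Algorithm S), sketched below in Python. It is offered as an illustration only; this page does not say which algorithm the program uses, and the names selection_sample and total are hypothetical.

```python
import random


def selection_sample(records, n, total, rng=None):
    """Pick n of total records in one pass, spread over the whole file.

    Each record is kept with probability (still needed) / (still unread),
    which requires knowing total before reading starts.
    """
    rng = rng or random.Random()
    chosen = []
    needed = n
    for seen, record in enumerate(records):
        if needed == 0:
            break
        remaining = total - seen
        if rng.random() * remaining < needed:
            chosen.append(record)
            needed -= 1
    return chosen


five_of_10000 = selection_sample(range(10000), 5, 10000, random.Random(1))
```

If total is larger than the true number of records, the pass can end before n records have been chosen, which is one way to see why an incorrect count is a problem.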

Sometimes it’s very easy to determine the number of cases in a file. For example, CfMC system files contain the number of cases in their header, and MPE ASCII files contain the number of records in their file label. To determine the number of records in a variable length ASCII file on Windows or Unix, a pass must be made through the data to count the records. On relatively small input files (<10,000 records) this counting pass is nearly imperceptible in terms of run time, but on very large samples it may increase run times noticeably. If the ~input/~output files contain select/sample options, this further complicates determining the number of cases available from which to sample and may show a corresponding increase in run times.

Setting num_sample_cases= causes the subsetting process to use this setting as the number of cases from which to draw the sample, and keeps the program from making passes through the data to count the cases available for sampling. This should decrease run times for very large samples. However, if the number provided via num_sample_cases= is not correct, an error will be generated.

See subs7.spx for an example of subsetting a data file into thirds using this feature.

subset2.zip