INTRODUCTION

This chapter describes the most commonly used data manipulation statements and shows how to use arithmetic calculations to generate numeric data for use in tables.

3.1 WHY REFORMAT DATA?

There are several different ways you can change how data is organized in a data file. For example, you may need data in a study that is the combination of two variables (e.g., one variable filtered by the other).

If you want to change the data file type (such as changing a CfMC data file to an ASCII data file), use the CfMC utility COPYFILE or MAKECASE. If you want to recover a corrupted data file, use the RAWCOPY utility. See the Utilities manual for a complete description of COPYFILE, MAKECASE, and RAWCOPY.

THE OVERALL STRUCTURE

You can use data manipulation statements in either the ~CLEANER or ~DEFINE block of the Mentor program. In the ~CLEANER block, data modification takes place immediately and only on the case you have in hand. With ~CLEANER, you can either use each command by itself or execute a procedure. In the ~DEFINE block, you must create procedures which are then executed on all data cases when called in the ~EXECUTE block. We will be concentrating primarily on commands that you use in the ~DEFINE block. For an example of using a procedure in the ~CLEANER block, see 3.1.8 Data Manipulation in the ~CLEANER Block.
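
As a minimal sketch of this overall structure (the data file name MYSTUDY and the procedure name FIXIT are hypothetical, and the TRANSFER statement stands in for any data manipulation command), a procedure defined in the ~DEFINE block and run from the ~EXECUTE block might look like:

EX:
~INPUT MYSTUDY, ALLOW_UPDATE
~DEFINE PROCEDURE={FIXIT:
TRANSFER [26.2] = 4
}
~EXECUTE PROCEDURE=FIXIT
~END

The procedure is executed once for each data case in the file.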

There are five basic commands that you can use for data manipulation:

  • COPY
  • MAKE_DATA
  • PRINT_TO_DATA
  • TRANSFER
  • MODIFY

The data variables used by these commands can be defined as punches, strings (literals), or numeric. They can be variables from a DB file that was created in EZWriter/PREPARE or they can be created with the PREPARE= instruction (see section Creating Variables under 3.1.5 Data Manipulation for Predefined Variables).

Two other useful commands are BLANK and SAY. BLANK allows you to blank the location or data variable referenced (see 3.1.2 Blanking Data), and SAY allows you to display the data associated with any location or data variable (see 3.1.3 Printing Text and Data Fields).

 

3.1.1 Rules For Manipulating Data

Rules Applying to Punch Type Variables

Defining a punch type variable will allow that data location to receive any of the valid punches (1-9,0,X,Y). Because no specific valid punches are defined with the variable, you can add, subtract, and change punches as you wish.

 

[19.2$P]     where 19.2 is the actual data location and each column will allow all punches (1-9,0,X,Y).

 

[city$P]     where CITY can already exist, and enclosing it in square brackets with the $P redefines the data location associated with CITY as a punch type variable.

 

Rules Applying to String Type Variables

Defining a string type variable will allow any valid ASCII string to be put into that data location. The string will be left-justified, and if it is too long, it will be truncated.

 

[21.2$]     where 21.2 is the actual data location and each column accepts only a single ASCII character.

 

[name$]     where NAME can already exist, and enclosing it in square brackets with the $ redefines that data location as a string type variable.

 

Rules Applying to Numeric Type Variables

Defining a numeric type variable means the only valid data that will be recognized in that location is a number. If some other data exists in that location, the program will return the keyword MISSING.

 

[23.2]     where 23.2 is the actual data location and contains only valid numeric values.

 

[gender]     where GENDER can already exist, and enclosing the variable in square brackets redefines it for the program as a numeric variable.

 

Any variable from a DB file that originated in EZWriter/PREPARE, or any variable created with the PREPARE= instruction, uses the same syntax. The program inherently knows the type of variable (CAT, FLD, NUM, VAR, TEX) from its definition and expects certain parameters to be upheld.

CAT types. There are certain punch codes that have been defined and are acceptable for the data location referenced. Any attempt to add/remove a code that does not match the definition will result in an error.

FLD types. These have certain acceptable ASCII codes, just as CAT types have acceptable punch codes. To put in a new code, you must redefine the FLD variable as a string. If the new code does not match any of the acceptable codes in the original variable, that code will not appear when the data location is displayed in its original format.
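
For example, assuming a pre-defined FLD variable named DAY whose acceptable codes are MON through SUN (this variable appears in the examples later in this chapter), a new code could be put in by redefining DAY as a string:

EX:  TRANSFER [DAY$] = "WED"

If WED is one of the acceptable codes, it will appear when DAY is displayed in its original format; if not, it will only be visible with SAY [DAY$].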

NUM types. The numerical range and exception codes have been defined and any number or code not matching the definition will not be a recognized number in that location.

VAR types. The length of the response is defined and any generation into the referenced data location that exceeds the length will result in an error. Anything shorter than the default will be left-justified.

TEX types. You must specify the location of text data when you open the data file, e.g., ~INPUT myfile, TEXT_LOCATION= 3/10.

 

3.1.2 Blanking Data

The BLANK command will blank specific data locations or will blank data referenced by a pre-defined variable. There may be occasions in data manipulation where it will be desirable to blank out existing data from a specific location before beginning any new manipulations.

The command BLANK [6.4] blanks out all data in columns 6-9. If a pre-defined variable called TIMES exists in columns 4-5, then BLANK [TIMES] would blank columns 4-5. The command BLANK [6-80] blanks all data from column 6 through 80. If CITY was a pre-defined CAT type variable with responses of 1, 2, 3, or 4, BLANK CITY would clear responses associated with the CITY variable. A punch 9 in the data location of the CITY variable would not be cleared with BLANK CITY. To clear the data location associated with CITY, use BLANK [CITY].

 

3.1.3 Printing Text and Data Fields

The SAY command prints any text enclosed in quotes that follows the command, or prints the value of a specified variable or both. The SAY command is helpful in checking data manipulations to see the result of the manipulations. Output from the SAY command goes to the list file.

It is important to recognize the difference in results that will be printed depending on how the location is specified (string, numeric or punch).

  • The location specified as [4.2] is printed as a numeric field and leading zeros will be dropped if they exist. Any value in the location that is not a good number will print as MISSING.
  • The location specified as [4.2$] is printed as a string so if leading zeros or blanks exist, they will be printed.
  • The location specified as [4.2$P] will display all the punches in each column separated by backslashes.

 

Contents of Columns:                    SAY Command:      Results Printed:

 

columns 4-5=04                          SAY [4.2]         4

columns 4-5=04                          SAY [4.2$]        04

columns 4-5=04, columns 6-7 are blank   SAY [4.4]         MISSING=”04  ”

columns 4-5=04, columns 6-7 are blank   SAY [4.4$]        04

columns 4-5 are blank, columns 6-7=14   SAY [4.4]         14

columns 4-5 are blank, columns 6-7=14   SAY [4.4$]        14

columns 4-7=ABCD                        SAY [4.4]         MISSING=”ABCD”

columns 4-7=ABCD                        SAY [4.4$]        ABCD

column 4=punches 1234, column 5=4       SAY [4.2$P]       1234\4

 

To display some descriptive text followed by data, enter the text enclosed in quotes.

EX: SAY “ID for this case:” [1.4$]

This will print:

ID for this case: 0001

If you wanted to look at the result of a manipulation for a specific case, you would enter:

IF [1.3#020] THEN SAY [4.2$] ENDIF

If ID# 020 contained the number 26 in columns 4-5, you would see the result “26”. If you wanted to display exactly what was in each column, including the case ID and any multiple punches if they existed, you would enter:

EX: IF [1.3#020] THEN SAY + [4.2$] ENDIF

The ”+” sign before the location allows you to see this additional information:

 

ID: 020     <– this is the case ID number

Card 01:    <– this is the record number

00          <– this is a column template (columns 04 and 05)

45

 

26          <– ASCII display of column contents

2           <– binary display of column contents

6

 

If the case has multiple punches in a column range in Card 2, the command:

EX: IF [1.3#021] THEN SAY + [81.2$] ENDIF

would show the following information:

 

ID: 021       <– this is the case ID number

Card 02:      <– this is the record number

00            <– this is a column template (columns 81 and 82, or columns 01 and 02 of card 2)

12

**            <– ASCII display of column contents

 

55            <– binary display of column contents

8

99

 

If you have pre-defined variables, the SAY command will display additional information for each CAT and FLD type question. No additional information is displayed for NUM, VAR, and TEX type questions. For a pre-defined CAT variable named GENDER:

EX: SAY CASE_ID GENDER

would show the following for case 001:

001 gender(2:1=1=MALE)

The “2” tells how many possible responses there are to the GENDER question. The “1=1” says that Case 001 answered the GENDER question with response 1 and that response 1 was the first item in the response list. “MALE” indicates that the text associated with response 1 is MALE.

For a pre-defined FLD variable named DAY:

EX: SAY CASE_ID DAY

would show the following for case 001:

001 day(7:MON=1=Monday)

 

3.1.4 Data Manipulation for Punch, String, and Numeric Variables

DIRECT DATA MOVES

The COPY command is used to copy data from one location directly to another location, whether that data is in numeric, string, or punch format. The syntax for a COPY command is:

COPY to_datavar = from_datavar

datavar is any location ([col.wid]) or any user-defined variable. An example of a COPY command is:

EX:  COPY [24.2] = [6.2]

If the contents of columns 6-7 were 04, then 04 is copied from columns 6-7 to columns 24-25. The original contents of columns 24-25 have been replaced with what was in columns 6-7, and columns 6-7 still have their original data.

The COPY command is the easiest way to copy the contents of a multiple punched column to another column.

The TRANSFER command is used to alter the data in a location. The syntax for a TRANSFER command is:

TRANSFER to_datavar op= from_expression

op refers to the different operators to add or remove data, i.e.:

 

(none)     Replaces data; blanks location first (only blanks valid pre-defined responses for the to_datavar)

+          Adds data; does not blank location first

–          Removes data

 

The MODIFY command is used to convert data from one type to another, while the TRANSFER command is used to change the contents of a variable. For example, TRANSFER would be used to add two number variables together to get a third number variable. TRANSFER, unlike MODIFY, checks that the data is of the same type on both sides of the operator.

The recommended way to add or remove punches from a data location is the MAKE_DATA command (see section Adding/Removing Punches under 3.1.4 Data Manipulation for Punch, String, and Numeric Variables). Some examples of the TRANSFER command follow:

EX:  TRANSFER [26.2] = 4

This transfers the number 4 to columns 26-27. Since the location 26.2 is defined as numeric (no $ or $P in the brackets), the 4 is right-justified so column 26 is blank and column 27 = 4.

EX:  TRANSFER [32.2$] = “4”

This transfers the literal 4 to columns 32-33. Since the location 32.2 is defined as a string ($ in the bracket), the 4 is left-justified so column 32 = 4 and column 33 is blank.

EX:  TRANSFER [14$P] = “1,2,3”

This TRANSFER command says to treat column 14 like a punch type variable ($P in the bracket) and move the punches 1, 2 and 3 to column 14.

EX:  TRANSFER [15.2$P] = “1,2\3,4”

This TRANSFER command says to treat both columns 15 and 16 like punch type variables ($P in the brackets) and move punches 1 and 2 to column 15 and punches 3 and 4 to column 16. The backslash (\) says to put what’s before it into the first column and what’s after it into the second column. If you were interested in seeing what columns 15 and 16 looked like after this transfer you should enter:

DISPLAY_BINARY [15.2$P]

and Mentor would display:

ID: 001:[ 15.2 ] =12\34

column   ASCII  Binary

15       *       1,2

16       *       3,4

In the following example:

EX: TRANSFER S[36.4$] = “TEST”

TRANSFER S = “R”

a new string type variable is created called S that is located in columns 36-39. The new name is specified immediately in front of the open square bracket. The letters TEST are transferred to columns 36-39. If we assume that the contents of S was TEST from our first TRANSFER command, then the second TRANSFER command changes the contents of the string variable S to the letter R. The letter R is left-justified so column 36 = R and columns 37-39 are blank.

An example of an operator that can be used with the TRANSFER command is the plus sign (+). When you are dealing with a string type variable, the + can be used to replace characters and not blank the receiving location first.

EX:  TRANSFER S[36.4$]= “TEST”

TRANSFER S+= “R”

This first TRANSFER command will move TEST into columns 36-39. The second transfer command will move the R to column 36 and not clear the remaining columns. The result is columns 36-39 = REST.

Using the plus (+) operator with a numeric type variable results in the addition of two values. (See section “Arithmetic Calculations” under 3.1.4 Data Manipulation for Punch, String, and Numeric Variables for more examples.)

EX:  TRANSFER N[40.4] = 1234

TRANSFER N+= 5

The first TRANSFER command will first blank columns 40-43, then move ‘1234’ to the numeric variable called N that is located in columns 40-43. The second command will not clear the contents of columns 40-43 but will add the number 5 to it. The result is columns 40-43 = 1239. You can also use the TRANSFER command to copy the data from more than one location to more than one location.

EX:  TRANSFER [44.2,46.2,48.2] = [22.2,20.2,18.2]

This command will first blank the receiving locations, then copy the contents of columns 22-23 to columns 44-45, the contents of columns 20-21 to columns 46-47, and the contents of columns 18-19 to columns 48-49.

Since the receiving locations are defined as numeric, if the sending locations are not numeric, the receiving locations will be blanked and the non-numeric data will not be transferred.
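
For example (assuming columns 22-23 contain the letters AB rather than a number):

EX:  TRANSFER [44.2] = [22.2]

Columns 44-45 would be blanked, and AB would not be transferred, because the receiving location 44.2 is defined as numeric.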

You can also use the TRANSFER command to change the data in more than one location at a time.

EX:  TRANSFER [44.2,46.2,48.2] = VALUES(1,2,3)

This command would transfer a 1 to column 45, a 2 to column 47 and a 3 to column 49 (with columns 44, 46, and 48 being blanked).

EX:  TRANSFER [50.2,52.2,54.2] = VALUES(4,4,4)

This command would transfer a 4 to columns 51 and 53 and 55.

One unique use of the MODIFY command is to spread multi-punched data into multiple single-punched locations. This might be used to reformat a multi-punched question into a series of mentioned/not mentioned questions or when providing a data file for use in a program that doesn’t accept multiple punches. This process would be accomplished with the following command:

EX:  MODIFY [21,…,26] = [20^1//6]

This MODIFY command spreads the multi-punched data in column 20 into columns 21 to 26 with a series of ones and blanks. For instance, a 2 punch in column 20 would cause a 1 to be placed into column 22, a 6 punch in column 20 would cause a 1 to be placed into column 26 and so on.

NOTE: The number of single-punched columns needed equals the number of possible punches in the multi-punched column.

If zeros are preferred over blanks for representing not mentioned, the following syntax will recode blanks into zeros (0) and put it back into the same field:

EX: TRANSFER [21,…,26] = [!21,…,26]

In this example the exclamation point (!) means return a zero when the location is blank.

Adding/Removing Punches

The MAKE_DATA command is one way to add or remove data in punch type variables. The syntax for the MAKE_DATA command is:

MAKE_DATA op[datavar]

datavar can be any punch type variable.

Here are some examples for using the MAKE_DATA command to add punches:

EX:  MAKE_DATA [8^1.5]

This blanks column 8 and then adds to column 8 the punches 1,2,3,4 and 5.

EX:  MAKE_DATA [9^1,5]

This blanks column 9 and then adds to column 9 the punches 1 and 5.

EX:  MAKE_DATA +[10^X,Y]

This doesn’t blank column 10 but adds the punches X and Y.

Here are some examples for using the MAKE_DATA command to remove punches from punch variables:

EX:  MAKE_DATA -[7^1,0]

This removes only the punches 1 and 0 from column 7. Any other punches in column 7 remain unchanged.

EX:  MAKE_DATA -[11^1.5]

This removes only the punches 1,2,3,4 and 5 from column 11. Any other punches in column 11 remain unchanged.

Arithmetic Calculations

In arithmetic calculations, the following arithmetic operators are available:

 

+   addition

–   subtraction

*   multiplication

/   division

++  when adding, substitute zero for missing elements (if both/all are missing, then the result will be missing)

 

For the purpose of the following example we will use the TRANSFER command to copy the number 6 to columns 4-5 and the number 4 to each pair of columns starting with 6-7 and ending with 20-21. This TRANSFER command looks like:

EX:  TRANSFER[4.2,…,20.2]=VALUES(6,4,4,4,4,4,4,4,4)

With these values assigned, the following TRANSFER and SAY commands can be used:

 

Commands:                              Results:

 

TRANSFER [6.2] += 2       SAY [6.2]     6

TRANSFER [8.2] -= 2       SAY [8.2]     2

TRANSFER [10.2] /= 2      SAY [10.2]    2

TRANSFER [12.2] *= 2      SAY [12.2]    8

TRANSFER [14.2] += [4.2]  SAY [14.2]    10

TRANSFER [16.2] -= [4.2]  SAY [16.2]    -2

TRANSFER [18.2] /= [4.2]  SAY [18.2]    1

TRANSFER [20.2] *= [4.2]  SAY [20.2]    24

 

Since we said nothing about decimal significance, the result of 4 / 6 (the TRANSFER [18.2] /= [4.2] command above) was rounded up to 1.

Continuing with the above example, if we know that columns 22-28 are blank then:

EX:  TRANSFER [26.2] = [22.2] ++ [4.2]   SAY [26.2]   6

TRANSFER [28.2] = [22.2] ++ [24.2]  SAY [28.2]   MISSING

If the result will not fit in the receiving location, that location will be filled with asterisks (*).

Using SAY for a numeric field displays the value without leading blanks or zeroes, so you see the numeric value left-justified, regardless of the field width. If a view of the complete location width is desired, use SAY [loc$], where the dollar sign ($) forces the number to be displayed as a string with all leading blanks and zeroes.
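
For example, if columns 30-33 contained 0042 (a hypothetical location for illustration):

SAY [30.4]     prints 42
SAY [30.4$]    prints 0042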

We could use the arithmetic operator “++” to sum the number of household members in different age groups to make a total of household members:

EX:  TRANSFER [40.6] = [20.2] ++ [22.2] ++ [24.2]

This example would clear the receiving location (40.6), and then place into it the sum of the three locations 20.2, 22.2, 24.2. The “++” would cause any missing value to be treated as zero. If all values were missing the result would be missing. In addition to the above arithmetic operators, the following arithmetic functions are also available:

ABSOLUTE_VALUE AVERAGE EXPONENT LOGARITHM SQUARE_ROOT SUM X

The following examples illustrate some uses of these arithmetic functions (in all examples below, the PRINT_TO_DATA command blanks the receiving location first). (See 3.1.7 Formatting Data Elements for more information on the PRINT_TO_DATA command.)

EX:  PRINT_TO_DATA [30.8] “\Z1_8.2S” ABSOLUTE_VALUE ([20.8])

This example of the ABSOLUTE_VALUE function will put the absolute value of the number in columns 20-27 into columns 30-37. The “Z1” will zero-fill, the underscore separates the “Z1” from the location, the “8.2” specifies an 8 column field with 2 decimal places and the “S” says to print a string of characters.

EX:  PRINT_TO_DATA [30.6] “\Z1_6.3S” AVERAGE ([20.2,22.2,24.2])

This example of the AVERAGE function will put the average of the 3 numbers in columns 20-21, 22-23, and 24-25 into columns 30-35. The “Z1” will zero-fill, the underscore separates the “Z1” from the location, the “6.3” specifies a 6 column field with 3 decimal places and the “S” says to print a string of characters.

EX:  PRINT_TO_DATA [32.6] “\Z1_6.3S” EXPONENT(3)

The example above will return the value of Euler's constant (e=2.71828) raised to the power of 3 (the number in the vector). Therefore this command will place 20.086 into columns 32-37, formatted as in the prior example.

EX:  PRINT_TO_DATA [52.6] “\Z1_6.3S” LOGARITHM(2.71828)

This example of the LOGARITHM function will return the natural log (log base e) of 2.71828 (the number in the vector). In this instance, 01.000 will be placed into columns 52-57.

EX:  PRINT_TO_DATA [40.6] “\Z1_6.3S” SQUARE_ROOT(26)

The example of the SQUARE_ROOT function above will return the square root of 26 (the number in the vector). In this instance, 05.099 will be placed into columns 40-45.

EX:  PRINT_TO_DATA [40.6] “\Z1_6.3S” SUM([20.2,…,24.2])

This example of the SUM function will return the sum of the numbers in columns 20.2, 22.2 and 24.2 (the locations in the vector). The result of the SUM will be placed into columns 40-45 and will be right justified, zero-filled and will have three decimal places of significance. Any missing value will be treated as zero. If all values are missing, the result will be missing.

EX:  PRINT_TO_DATA [40.6] “\Z1_6.3S” X([20.2]) + X([22.2]) + X([24.2])

This example, using the X function, shows how to force a blank location or a location with something other than a valid number to be returned as a zero in an equation or numeric variable. If any (or all) of the locations were blank (or MISSING) or had alpha characters, adding the locations would normally return MISSING. The X function causes the problem location to be treated as a zero.

Netting Punches

The goal of the following setup is to create a new multi-punched variable (in columns 68-70) from data collected in 11 fields of 2 columns each (columns 4-25). The answers collected in the 11 fields are codes 01-26. The ~INPUT command opens a data file called DATAGENS.TR in UPDATE mode. The BLANK command will blank columns 68-70. The TRANSFER command will net the answers of 01-12 from any of the 11 fields into column 68 as punches 1-12. The net of the answers 13-24 will be in column 69 as punches 1-12, and the net of the answers 25 and 26 will be in column 70 as punches 1 and 2. The += means add punches and do not blank the receiving location first.

 

EX:

~INPUT DATAGENS, ALLOW_UPDATE

~DEFINE PROCEDURE={GENS: BLANK [68.3]

 

>REPEAT $COL=04,06,…,24

TRANSFER [68.3^01//26] += [$COL.2#01//26]

>END_REPEAT

 

SAY CASE_ID [4.22$] [68.3$P]

}

 

~EXECUTE PROCEDURE=GENS

~END

 

The SAY command will show the case ID, the contents of the original 11 fields and the resulting netted punches in columns 68-70. An example of the output from this SAY command follows:

 

EX:

001 210102081718 128\569\

002 021721 2\59\

003 19152023 \378X\

004 1522 \30\

005 1807 7\6\

 

Storing Weights in the Data

By storing weights in the data we can speed up subsequent runs since the program will not have to recalculate the weights. To get the weights into the data, define a procedure that uses either the MODIFY, TRANSFER or PRINT_TO_DATA command to insert the weights into an unused location.

Use the MODIFY or TRANSFER command if the weights consist of integers, or if you plan to store the weight as an integer but later reference the weight with decimal significance.

 

EX:

~DEFINE PROCEDURE={GENWTMOD:

MODIFY [12.4] = SELECT([26^1//3],VALUES(86,66,139))

}

 

or:

 

~DEFINE PROCEDURE={GENWTTRAN:

TRANSFER [12.4] = SELECT([26^1//3],VALUES(86,66,139))

}

 

The two procedures above would store 86, 66 and 139 into columns 12-15 based respectively on the punches 1, 2 and 3 in column 26. The weights stored in columns 12-15 would be right-justified and blank-filled.

To reference later with decimal significance (2 decimal places) use:

WEIGHT=: [12.4*F2]

This would change the way the program references the values inserted into the data by the two procedures above to .86, .66, and 1.39.

Use the PRINT_TO_DATA command to insert data with decimal significance directly into the location. The PRINT_TO_DATA command blanks the receiving location first.

EX:   PRINT_TO_DATA [12.4] “\Z1_4.2S” &

SELECT([26^1//3],VALUES(.86,.66,1.39))

This command would put the weights 0.86, 0.66 and 1.39 into columns 12-15 based respectively on the punches 1, 2 and 3 in column 26.
Randomly Selecting Respondents

Defining a procedure that randomly selects respondents is another use of the TRANSFER command.

The following run would create a list of six randomly selected respondents. (See 9.3.1 System Constants for further information on the RANDOM_VALUE constant.)

 

EX:

>DEFINE @HOWMANY 6

>PRINT_FILE RANDOM

~INPUT DATACLN

~OUTPUT RAND LENGTH=80

 

~DEFINE PROCEDURE={ASSIGN:

BLANK [11-80]

TRANSFER RANDVAR[11.5] = RANDOM_VALUE * 10000

WRITE_CASE

}

 

~EXECUTE PROCEDURE=ASSIGN

~INPUT RAND

~OUTPUT SORT

~SORT RANDVAR

~INPUT SORT,STOP_AFTER=@HOWMANY

 

~DEFINE PROCEDURE={WHO:

SAY CASE_ID

PRINT_LINES “\S” CASE_ID

}

 

~EXECUTE PROCEDURE=WHO

~INPUT

~END

 

This run inserts a random number into an arbitrary location ([11.5]) of an intermediate data file copy (leaving the original data file unchanged), sorts the data on that random number into a second intermediate data file, prints the ID numbers of the first 6 (specified by the >DEFINE @HOWMANY) respondents into a print file called RANDOM.PRT, then deletes the disposable intermediate data files.

 

3.1.5 Data Manipulation for Predefined Variables

For the following sections on data manipulation using pre-defined variables, the list of variables below will be used as examples.

 

image.chk

 

Data area ends at 160, Text area starts at 161

col.wid  label     QQ#    qtype  sub other use

4.2      TIMES      3.00  [NUM ]             Z

6.2      OFTEN      4.00  [NUM ]             Z

 

GAP of 47 columns

55.1     DISEASE   12.00  [CAT ]

56.1     ASTHMA    13.00  [CAT ]

 

GAP of 9 columns

66.1     CITY      19.00  [CAT ]

67.1     GENDER    20.00  [CAT ]

 

GAP of 1 columns

69.3     DAY       21.00  [FLD ]

72.9     NAME      22.00  [VAR ]

81.1     VACATION  23.00  [TEXT]             B

 

Text starts at 161

 

Direct Data Moves

The COPY command is used to copy data from one location directly to another location, whether that data is in numeric, string, or punch format. The syntax for a COPY command is:

COPY to_datavar = from_datavar

where datavar is any variable which references some part of the data. It can be a label or a question number (QQ#) defined using PREPARE, or it can be any location ([col.wid]).

For this example we will evaluate five pre-defined variables using the SAY command:

 

Commands        Results

 

SAY CITY        CITY(4:2=2=DENVER)

SAY DAY         DAY(7:MON=1=MONDAY)

SAY TIMES       21

SAY OFTEN       1

SAY ASTHMA      ASTHMA(3:1=1=ASTHMA,2=2=EMPHYSEMA)

 

The following examples of the COPY command use the above pre-defined variables:

 

Commands     Results

 

COPY [8]=CITY        column 8 = 2

COPY [9.3]=DAY       columns 9-11 = MON

COPY [12.2]=TIMES    columns 12-13 = 21

COPY OFTEN=TIMES     OFTEN = 21

COPY [18]=ASTHMA     column 18 = punches 1 and 2

 

Receiving locations can be defined as any variable type; however, the sending variable must be one that was defined in EZWriter/PREPARE or a PREPARE= variable. COPY does a direct copy of the sending location or variable (right-hand side of the operation) to the receiving location or variable (left-hand side). If the length of the variable in the sending location is larger than the receiving location, an error message will print. If the length of the variable in the sending location is shorter than the receiving location, the copy will occur but the results will be left-justified in the receiving location. This would not be desirable when copying NUM type variables.

COPY is the easiest way to move a multiple response question from one location to another.

The TRANSFER command is used to alter the data in a location. The syntax for a TRANSFER command is:

TRANSFER to_datavar op= from_datavar

Operators (op) that can be used with the TRANSFER command are:

 

(none)   Replaces data; blanks location first (only blanks valid pre-defined responses for the to_datavar)

+        Adds data; does not blank location first

–        Removes data

 

The to_datavar can be any of the previously described variable types or an unspecified location. The from_datavar is the actual data to be transferred to the receiving location. The from_datavar can be a number, location, expression, string, or pre-defined variable.

The MODIFY command is used to convert data from one type to another, while the TRANSFER command is used to change the contents of a variable. For example, TRANSFER would be used to add two number variables together to get a third number variable. TRANSFER, unlike MODIFY, checks that the data is of the same type on both sides of the operator.

The recommended way to add or remove punches from a data location is the MAKE_DATA command (see section on Adding/Removing Responses under 3.1.5 Data Manipulation for Predefined Variables).

An example of a TRANSFER command with the to_datavar as an unspecified location and the from_datavar as a number would be:

EX:  TRANSFER [14] = 3

This would blank column 14 first, then put the number 3 in column 14.

An example of a TRANSFER command with the to_datavar as a pre-defined NUM type variable and the from_datavar as an unspecified location would be:

EX:  TRANSFER TIMES = [14]

This would put the contents of column 14 in the TIMES variable.

An example of a TRANSFER command with the to_datavar as a pre-defined NUM type variable and the from_datavar as an expression would be:

EX:  TRANSFER TIMES = [14]+10

This would put the contents of column 14, plus 10, in the TIMES variable.

When using TRANSFER to put a string into a FLD type variable, we must enclose the variable in brackets and add a dollar sign ($) to keep both sides of the equation equal. An example of a TRANSFER command with the to_datavar as a pre-defined FLD type variable and the from_datavar as a string would be:

EX:  TRANSFER [DAY$]=”WED”

This would put the string “WED” into the DAY variable.

Our pre-defined variable DAY is a FLD type question and was constructed with the following answers: MON, TUE, WED, THU, FRI, SAT, and SUN. These are the only valid answers to this question. If the data contained the answer “WEE”, this would not be recognized as a valid answer and would produce an error if we cleaned the DAY variable using the EDIT command. (See “2.4 CLEANING WITH SURVENT VARIABLES”, “Generating A List Of Error Messages” for examples of cleaning pre-defined variables.) We can transfer an answer to the DAY variable that is not a valid answer, but it would not be recognized as an acceptable answer. If we were in a ~CLEANER block evaluating a respondent’s answers, and the case we were looking at had MON as the answer to the DAY variable and we entered:

SAY DAY

Mentor displays:

DAY(7:MON=1=Monday)

If we then were to enter the following TRANSFER and SAY commands:

TRANSFER [DAY$]=”WEE”

SAY DAY

Mentor displays:

DAY(7:)

because WEE is not a recognized answer to the DAY question.

The SAY command only shows valid answers to a question. If we wanted to see this respondent’s answer to the DAY question, even if it was not a valid answer, we could treat the DAY question as a string type variable and enter:

SAY [DAY$]

Mentor displays:

WEE

Here are some examples using the TRANSFER command and pre-defined variables:

EX:  TRANSFER [DAY$] = “TUE”

where DAY becomes TUE and if that is an acceptable code for the original DAY question it will be recognized when the DAY question is referenced.

EX:  TRANSFER TIMES = 1

where TIMES becomes 1 if 1 is within the range with which TIMES was originally defined.

EX:  TRANSFER NAME = “MARTIN”

where NAME will contain the string MARTIN as long as the string will fit NAME’s original definition length. Note that because the variable NAME is a VAR type variable, we can use the variable name without surrounding brackets and following dollar sign ($).

EX:  TRANSFER VACATION = “EATING”

where EATING will replace any text that currently exists in VACATION’s data location.

A new name can be given to any pre-defined variable. The new name is specified immediately in front of the open square bracket (e.g., NEWNAME[INCOME$P]). The data in the location of INCOME can then be referenced as a punch type variable called NEWNAME. INCOME remains intact in its original definition.

Some additional TRANSFER examples are:

EX:  TRANSFER NAME2[8.9$] = NAME

where NAME2 is defined to be columns 8 through 16 and is a string type variable. NAME2 can now be used in place of [8.9$] when it is necessary to refer to that location in that format. NAME’s data will be put into NAME2.

EX:  TRANSFER NAME2 = “SAM”

where columns 8-10 will now contain the ASCII characters SAM and the remaining columns 11-16 are blank.

EX:  TRANSFER [OFTEN$] = “RF”

where OFTEN was originally a NUM type question with a width of 2. Here the location OFTEN is redefined as a string variable for this transfer, and the ASCII characters RF are put into the location. Since the characters RF were not part of the pre-defined variable OFTEN, SAY OFTEN will show MISSING. SAY [OFTEN$] will show RF.

EX:  TRANSFER [GENDER] = 3

where GENDER is redefined as numeric for this transfer and the data location that GENDER references receives a 3. This would be one way to transfer a response to the GENDER data location that is not one of the pre-defined responses. Since the number 3 was not a pre-defined response for the GENDER variable, SAY GENDER will show ‘GENDER(2:)’. SAY [GENDER$] will show 3.

EX:  TRANSFER GENDER = CATS(1)

will add the pre-defined response 1 to the GENDER variable and will not clear out the existing 3 since the 3 is not defined as a valid GENDER response.

A subsequent command:

TRANSFER GENDER = CATS(2)

would clear out the existing valid GENDER response 1 and insert the pre-defined response 2; again, the existing 3 would not be cleared out since it is not defined as a valid GENDER response.

Adding/Removing Responses

The best way to add or remove punches from CAT type questions is to specify which categories are to be added or removed. See 3.1.4 Data Manipulation for Punch, String, and Numeric Variables, Adding/Removing Punches for details on doing this to data locations (as opposed to pre-defined variables). The syntax for using the MAKE_DATA command with pre-defined variables is:

MAKE_DATA op CATtypevar(response code(s))

(For a discussion of operators, see 3.1.5 DATA MANIPULATION FOR PRE-DEFINED VARIABLES, Direct Data Moves.)

EX:  MAKE_DATA GENDER(2)

where GENDER is a single response CAT type and the data location will now hold a response 2.

EX:  MAKE_DATA ASTHMA(1,2)

where ASTHMA is a multi-response CAT type and the data location now holds responses 1 AND 2.

All of the above examples can be changed so that the receiving location is NOT blanked prior to the data move. To do this, simply use the plus (+) sign.

EX:  MAKE_DATA + ASTHMA(1)

where ASTHMA is a multi-response CAT type and the data location now holds response 1 in addition to whatever responses were previously in that location.

Under certain circumstances, you may want to be able to remove a response from a location, without affecting the other responses in that location. To do this use the minus (-) sign as in the following example:

EX:  MAKE_DATA - ASTHMA(1)

where ASTHMA is a multi-response CAT type and if a response 1 exists in that data location it will be removed.

NOTE: The only variable types from which you can remove punches are CAT (single- and multi-response) and punch, as shown in the examples above. It does not make sense to try to remove punches from other types of variables.

Arithmetic Calculations

In arithmetic calculations, the following arithmetic operators are available:

 

+  addition

-  subtraction

*  multiplication

/  division

 

The TRANSFER command can be used in combination with these operators. The equal sign (=) is part of the syntax:

TRANSFER datavar op= numexpr

datavar is any variable which references the data. It can be a label, a question number (QQ#) from PREPARE, or a data location ([col.wid]).

op is an optional arithmetic operator (+,-,/,*)

= is required syntax even if no op is specified

numexpr is a constant, an arithmetic operation, a function, or some combination of these that returns a number

In the example:

TRANSFER TIMES = 3

TIMES becomes 3. If this instruction is followed by:

TRANSFER TIMES +=5

then TIMES becomes 8.

Some additional examples of syntax are:

EX:

TRANSFER [22.2] = times + often

TRANSFER [22.2] = times – often

TRANSFER [22.2] = times / 2

TRANSFER [22.2] = times * 2

Creating Variables

One use of the TRANSFER command is to put data into a new location and either blank or not blank the receiving location first. The data you may want to put into the new location can be an expression that has to be evaluated, an ASCII string, a punch or punches, or a number.

Suppose you have two variables, ASTHMA and GENDER, and you want to create a new variable, ASTHMA2, that will cross those two variables. You want to create a location in the data (we will use column 68) that contains the result of the expression, which can then be defined using the PREPARE= format to assign labeling for use as a column or row variable in a cross tabulation later on.

The setup using the PREPARE= variable would look like this:

 

EX:

~DEFINE PREPARE=&

{ASTHMA2: 68

ASTHMA BY GENDER

!CAT,,1

1 ASTHMA MALE

2 EMPHYSEMA MALE

3 NEITHER MALE

4 ASTHMA FEMALE

5 EMPHYSEMA FEMALE

6 NEITHER FEMALE

}

 

The procedure you would create to define this new variable would look like this:

 

EX:

~DEFINE PROCEDURE={DOIT:

TRANSFER ASTHMA2 = ASTHMA BY GENDER

WRITE_CASE

}

 

If we were interested in seeing the variables ASTHMA, GENDER and the new variable ASTHMA2 we could add the following line to our procedure:

EX:  SAY CASE_ID ASTHMA GENDER ASTHMA2

Our whole setup might look like:

 

>USE_DB IMAGE,READ_WRITE,DUPLICATE=WARN

~INPUT DATACLN

~OUTPUT DATAGENS

 

~DEFINE PREPARE=&

{ASTHMA2: 68

ASTHMA BY GENDER

!CAT,,1

1 ASTHMA MALE

2 EMPHYSEMA MALE

3 NEITHER MALE

4 ASTHMA FEMALE

5 EMPHYSEMA FEMALE

6 NEITHER FEMALE

}

 

PROCEDURE={DOIT:

TRANSFER ASTHMA2 = ASTHMA BY GENDER

SAY CASE_ID ASTHMA GENDER ASTHMA2

WRITE_CASE

}

 

~EXECUTE PROCEDURE=DOIT

~END

 

The following are a few examples we might see from the SAY command:

 

018 ASTHMA (3:3=3=NEITHER) GENDER (2:1=1=MALE) ASTHMA2

(6:3=3=NEITHER MALE)

 

019 ASTHMA (3:2=2=EMPHYSEMA) GENDER (2:1=1=MALE) ASTHMA2

(6:2=2=EMPHYSEMA MALE)

 

020 ASTHMA (3:3=3=NEITHER) GENDER (2:2=2=FEMALE) ASTHMA2

(6:6=6=NEITHER FEMALE)

 

 

3.1.6 Relational Operators

Relational operators (sometimes called “relops”) allow you to compare variables to one another or to a constant. They can be used to set up conditions, to limit your base for certain functions or anything else that you want done to some but not all of your cases. There are six relational operators; each can be expressed with symbols or letters (and some with words):

 

TYPE                        SYMBOL   LETTERS   WORD

Equal to                       =       EQ      MATCHES

Not equal to                  <>       NE        ––

Greater than                   >       GT        ––

Less than                      <       LT        ––

Greater than or equal to      >=       GE      CONTAINS

Less than or equal to         <=       LE        ––

 

All six can be used for arithmetic comparisons.

 

EX:

 

IF OFTEN > TIMES THEN

SAY CASE_ID “OFTEN>TIMES”

ENDIF

IF OFTEN EQ 3 THEN

SAY CASE_ID “OFTEN EQ 3”

ENDIF

IF OFTEN + TIMES >= 42 THEN

SAY CASE_ID “OFTEN+TIMES>=42”

ENDIF

IF OFTEN < 50 THEN

SAY CASE_ID “OFTEN<50”

ENDIF

IF OFTEN NE [6.2] THEN

SAY CASE_ID “OFTEN NE 6.2”

ENDIF

IF OFTEN – TIMES LE OFTEN/TIMES THEN

SAY CASE_ID “OFTEN-TIMES LE OFTEN/TIMES”

ENDIF

IF NAME <> “MARK ” THEN

SAY CASE_ID NAME

ENDIF

 

A cleaning instruction might be:

 

IF [4.2] > 20 THEN

SAY “CASE ID: ” CASE_ID “COLUMNS 4.2 = ” [4.2]

ENDIF

 

Some examples of comparisons follow:

If you compare two numeric fields and they are both blank, they will be equal to one another. If, for example, columns 4 and 5 are blank, and if you used the instruction:

IF [4] = [5] THEN

SAY “EQUAL”

ENDIF

you would see the word EQUAL. If you compared the two columns as strings ($), as in the following example:

IF [4$] = [5$] THEN

SAY “EQUAL”

ENDIF

you would see the word EQUAL here also.

When evaluating numeric fields, preceding blanks are acceptable but trailing blanks are not recognized as part of a number. If columns 2-3 are blank and columns 4-7= 1234, and you used the instruction:

IF [2.6#1234] THEN

SAY “TRUE”

ENDIF

you would see the word TRUE.

If columns 4-7= 1234 and columns 8-9 are blank, and you used the instruction:

IF [4.6#1234] THEN

SAY “TRUE”

ENDIF

you would not see the word TRUE.

When evaluating alphabetic or string fields, preceding blanks are not acceptable while trailing blanks are ignored. If columns 2-3 are blank and columns 4-7= TEST, and you used the instruction:

IF [2.6#TEST] THEN

SAY “TRUE”

ENDIF

you would not see the word TRUE.

If columns 4-7= TEST and columns 8-9 are blank, and you used the instruction:

IF [4.6#TEST] THEN

SAY “TRUE”

ENDIF

you would see the word TRUE.

When evaluating alphabetic or string fields and you want to check for the presence of blanks, you would use quotes (””) to surround the blanks. If columns 2-3 are blank and columns 4-7= TEST, and you used the instruction:

IF [2.6$]=” TEST” THEN

SAY “TRUE”

ENDIF

you would see the word TRUE.

If columns 4-7= TEST and columns 8-9 are blank, and you used the instruction:

IF [4.6$]=”TEST ” THEN

SAY “TRUE”

ENDIF

you would see the word TRUE.

All six relational operators can also be used for alphabetical comparisons, but in most cases it does not make sense to compare words except with equal to (EQ, =) or not equal to (NE, <>). For determining whether one word is greater than another, the program does have a ranking scheme for all ASCII characters based on the standard ASCII character set:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Therefore:

 

“+” LT “/”

“1” LT “9”

“>” LT “E”

“a” LT “v”

 

NOTE: This ranking is case sensitive so A is less than a; this is why we do not recommend using greater than or less than relational operators with alpha characters unless you’re absolutely sure of the case and it is not longer than one character. You should also enclose the characters in quotes so the program knows you’re doing an alphabetic comparison.

Changing Case

Variable Types U, D, and N cause ASCII strings to be treated as upshifted, downshifted, or not shifted, respectively. When displaying a data location, $ is the same as $n.

 

Contents                SAY             Results

of Columns:             Command:        Printed:

 

THIS IS mixed case.     [1.25$]         THIS IS mixed case.

THIS IS mixed case.     [1.25$U]        THIS IS MIXED CASE.

THIS IS mixed case.     [1.25$D]        this is mixed case.

THIS IS mixed case.     [1.25$N]        THIS IS mixed case.

 

By default, these variables are only effective in SAY and PRINT commands. To turn case sensitivity on for ~SORT, ~FREQ or pound sign variables, use the command ~SET CASE_SENSITIVE (see Mentor Volume II). When CASE_SENSITIVE is set, these variable types can be used to force a string to be upshifted or downshifted.
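When CASE_SENSITIVE has been set, the shift types can also appear on the sending side of a data move. A minimal sketch (the receiving location [30.25$] here is only an illustration, not a location from the examples above):

 

EX:

~SET CASE_SENSITIVE

TRANSFER [30.25$] = [1.25$U]

 

where columns 30-54 would receive an upshifted copy (THIS IS MIXED CASE.) of the text in columns 1-25.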

 

3.1.7 Formatting Data Elements

ZERO-FILLING DATA

The PRINT_TO_DATA command prints data into a data file. This command would be useful if you wanted to zero-fill a numeric type variable or if you wanted to insert decimal points in the data. The PRINT_TO_DATA command blanks the receiving location first.

The PRINT_TO_DATA command is similar to the PRINT command. (See 9.1 GENERATING SPECIALIZED REPORTS for additional information on printing options.) The syntax for the PRINT_TO_DATA command is:

PRINT_TO_DATA datavar “format items” variables

Let’s suppose we start with a two column field, with column 4 blank and column 5 = 8. Here are some examples of zero-filling numeric data:

EX:  PRINT_TO_DATA [8.2] “\Z1_2S” [4.2]

This says to print the contents of columns 4-5 with a width of two and no decimal places into columns 8-9. The “Z1” says to zero-fill the two receiving columns. The result would be columns 8-9 = 08.

EX:  PRINT_TO_DATA [10.2] “\Z1_2S” [4.2] + 1

This says to print the results of adding columns 4-5 and the number 1 into columns 10-11 with a width of 2, zero-filled, with no decimal places. The result would be columns 10-11 = 09.

EX:  PRINT_TO_DATA [12.2] “\Z1_2S” [4.2] – [8.2]

This says to print the results of subtracting columns 8-9 from columns 4-5. Assuming we used the preceding PRINT_TO_DATA commands, the results will be columns 12-13 = 00.

Another option for zero-filling data locations is to use the *Z modifier. The syntax for this modifier is as follows:

EX:  TRANSFER [21.5*Z] = [43] + [44] + [45]

If you wanted to zero-fill all data modifications you could use the ~SET ZERO_FILL command (See Appendix B: TILDE COMMANDS for more information on the ZERO_FILL command).

DECIMAL POINTS IN DATA

Here are some examples of inserting decimals in the data:

EX:  TRANSFER [5.5] = 123

This TRANSFER command moves the number 123 to columns 5-9. The number is right-justified so columns 5-6 are blank and columns 7-9 = 123.

EX:  PRINT_TO_DATA [10.5] “\Z1_5.1S” [5.5]

In the above example we are saying print the contents of columns 5-9 to columns 10-14 with a length of 5 and one decimal place. The “Z1” preceding the 5.1 says to zero-fill the receiving columns (the underscore is required to separate the “Z1” from the location). The “S” following the 5.1 says to print a string of characters into the location. The result is columns 10-14 = 123.0. If we wanted to print the contents of columns 5-9 to another location in the data file, say columns 15-19, keeping the length at 5 and still showing one decimal place but moving the decimal point one place to the left, we would say:

EX:  PRINT_TO_DATA [15.5]  “\Z1_5.1S” [5.5*F1]

The result of this command would be columns 15-19 = ‘012.3’.

If we now wanted to print the contents of columns 15-19 into a new location and remove the preceding zeros we would change \Z1_5.1S to \5.1S.

EX:  PRINT_TO_DATA [25.5] “\5.1S” [15.5]

This would make columns 25-29 = ‘ 12.3’.

It is also possible to use the PRINT_TO_DATA command to format the output of arithmetic operations as in:

EX: PRINT_TO_DATA [30.5] “\5.1S” 2 * 5.5

This would make columns 30-34 = ‘ 11.0’.

SPREADING MULTI-PUNCHED DATA

We can use the TRANSFER command to spread out multi-punched data into multiple single-punched fields. This is useful if making an ASCII data file for use by another program.

Syntax for this command would be:

EX:  TRANSFER [6.8$] = [6$P]

By default, Mentor spreads multi-punched data into single punches separated by commas. The SUBSTITUTE command causes the commas to be ignored during output.

NOTE: The receiving location needs to be equal in width to the maximum number of punches possible in the multi-punched location. If the default commas are desired in addition to the single punches themselves, the receiving location needs to be equal in width to twice the maximum number of punches possible in the multi-punched location minus one.
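To make the width rule concrete: if the multi-punched column 6 can hold at most eight punches, a spread without the commas needs a receiving width of 8, as in the example above; keeping the default commas would need a width of (2 x 8) - 1 = 15. A sketch of the with-commas case (the width 15 is just this arithmetic applied to that assumption, not a fixed rule for every location):

EX:  TRANSFER [6.15$] = [6$P]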

Transforming String Digits into Numbers

Using the PRINT_TO_DATA command, we can transform digits left-justified or randomly located in a field into right-justified NUM type data. The following two step procedure will right-justify all the data in the location, zero-fill the location, and give all data two decimal places (i.e., for currency format):

 

EX:

~INPUT DATACLN

~OUTPUT DATAGENS

 

~DEFINE PROCEDURE={GENS:

PRINT_TO_DATA [20.6] “\>6S” [20.6$]

PRINT_TO_DATA [20.6] “\Z1_6.2S” [20.6]

WRITE_CASE

}

 

~EXECUTE PROCEDURE=GENS

~END

 

The first PRINT_TO_DATA command right-justifies any data in columns 20.6. The 6 following the \ in this first PRINT_TO_DATA command is what says this is going into a six column location and the > preceding the 6 says to right justify the data. From this point on the data is in NUM format (right-justified) and we can do modifications on the data that only apply to NUM type data. The second PRINT_TO_DATA command does two of these number modifications. The “\Z1_6.2S” control item does the following:

 

Z1      zero-fill

 

6       the receiving location is 6 columns wide

 

.2      use two decimal places

 

S       print a string of characters into the location

 

So if our original data had three cases with the following data in columns 20.6:

" 78 "

" 2.3 "

" 9.98"

then after running the above procedure GENS on the data it would look like:

“078.00”

“002.30”

“009.98”

TRANSFORMING NUMBERS INTO STRINGS

The function STRING_FROM_NUMBER converts a numeric argument into a string. This function is useful in combination with the PUTID command to assign case IDs.
RECODING 10-POINT SCALES

It is common to have 10-point scales entered into a single-column location with the 10 value entered as a zero. Some statistical software packages, however, cannot convert the 0 to a 10 internally. If you will be exporting CfMC data files with single-column 10-point scales to a third-party statistical software package, the following PRINT_TO_DATA command will reformat your 10-point scales:

 

EX:

PRINT_TO_DATA [120.2] “\Z1_2S”

SELECT([26^1//0],VALUES(1,…,10))

 

This command will format the data into columns 120-121 as zero-filled (“Z1”) right-justified numbers.

RECODING TO EXCLUDE SELECTED RESPONSES

It is sometimes necessary to reformat a data location into another location and exclude some response, usually “Don’t know/No response” (for example, before input into a statistics software package).

The following example would reformat column 42 to contain only punches 1-4. The “=” sign after the location clears the location first, therefore any punch in column 42 other than 1-4 would not be present after the TRANSFER.

EX:  TRANSFER [42]=SELECT([42^1/2/3/4],VALUES(1,2,3,4))

It would be preferable to INPUT your original data file and OUTPUT a second data file, so as to retain your original data file in an unaltered state, as opposed to using your original data file with ALLOW_UPDATE set.
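A minimal sketch of such a run, reusing the DATACLN and DATAGENS file names from the earlier examples (the procedure name FIXIT is only an illustration):

 

EX:

~INPUT DATACLN

~OUTPUT DATAGENS

~DEFINE PROCEDURE={FIXIT:

TRANSFER [42]=SELECT([42^1/2/3/4],VALUES(1,2,3,4))

WRITE_CASE

}

~EXECUTE PROCEDURE=FIXIT

~END

 

This leaves DATACLN in its original state and writes the recoded cases to DATAGENS.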

RECODING TO REVERSE A SCALE QUESTION

Often it is necessary to reverse the values of a scale question, for example, when it is more desirable to have the higher value correlate to the more positive response. We can use the TRANSFER command to accomplish this:

EX:  TRANSFER [43^1/2/3/4]=[43^4/3/2/1]

By reversing the figures in the sending and receiving locations for only those punches we want to change, we reverse the scale and leave any other punches (e.g., a “DON’T KNOW/NO RESPONSE” punch) intact.

 

3.1.8 Data Manipulation in the ~CLEANER Block

In the following example we have defined two procedures in the ~DEFINE block and we will use the ~CLEANER block to execute the procedures on a case by case basis. (See 2.3.2 Correcting Errors, Using Cleaning Screens for a more detailed explanation of the features used in this example). The command line used to run this example would be:

Mentor &STATE.SPX CON                    (DOS)

Mentor “&STATE.SPX” “CON”                (UNIX)

RUN Mentor.CGO.CFMC;INFO=”STATESPX CON”  (MPE)

In the spec file called STATE.SPX, the procedure we have defined as SAYSTATE will show us the contents of columns 69-70 (where the state information is located). The procedure defined as FIXSTATE will change the state to CA for those cases where we want the state changed.

 

~INPUT DATAGENS, ALLOW_UPDATE

~SET PRODUCTION_MODE

~DEFINE

 

PROCEDURE={SAYSTATE: SAY CASE_ID [69.2$]}

 

PROCEDURE={FIXSTATE: TRANSFER [69.2$]=”CA”

}

 

~CLEANER

 

The above spec file will put us in the ~CLEANER block. The screen will look like this:

CLeaNer-->

At this prompt if we type !SAYSTATE, we will see what state is coded in columns 69-70 for the first case in our data file. If we decide we want to change this case, we type !FIXSTATE at the next prompt and the contents of columns 69-70 will now be CA. We can again type !SAYSTATE at the prompt if we want to see that columns 69-70 is now CA. To move to the next case we would type NEXT at the prompt. We can execute either one or both of our procedures or move to the next case.

NOTE: An exclamation point (!) must be entered before the procedure name. We can also issue any other ~CLEANER command. When we are done making modifications, we type ~END to exit Mentor.

 

3.2 Creating Subsets of Data Files

HOLD_OUTPUT_UNTIL_SUBSET

This command ensures that the ~INPUT file is read only twice for a group of ~OUTPUTs rather than potentially twice for each ~OUTPUT. This considerably speeds up the subsetting of a large input file into many output files. The default for this option is “on”. When this option is “on”, the “case_written” status of any particular record cannot be ascertained during the select scanning process. This means that a case slated to be written may later no longer pass the select being used because the case was written by an earlier output.

You can set -HOLD_OUTPUT_UNTIL_SUBSET to make the written/not written state of each case be determined in a linear fashion. Generally speaking, using more than one NOT(CASEWRITTEN) in a subset block is likely not to generate the expected sample, and often will lead to an error.

Examples of subsets of data files

Here is a list of commands and options that are used when creating subsets of data files:

  • ~EXECUTE do_subset
  • ~INPUT/~OUTPUT sampling=#n
  • ~INPUT/~OUTPUT sampling=.nnn
  • ~INPUT/~OUTPUT try_for_sampling=#n
  • ~INPUT/~OUTPUT try_for_sampling=.nnn
  • ~INPUT/~OUTPUT select=
  • ~INPUT/~OUTPUT select=casewritten/not(casewritten)
  • ~INPUT/~OUTPUT num_sample_cases=
  • ~OUTPUT file_name #n
  • >random_seed=

~EXECUTE DO_SUBSET The ~EXECUTE command DO_SUBSET is what launches the subsetting run based on the ~INPUT/~OUTPUT commands and options you have chosen. It is similar to issuing a “write_now” command for every ~OUTPUT file.

SAMPLING=#N AND SAMPLING=.N The pound sign version of sampling = gives you a random sample of “n” records from the data. The .n form gives you a random fraction of the sample (i.e. num_sample_cases * .n).

 

EX:

~INPUT file1 SAMPLING=#10

~OUTPUT file2

~EXC DO_SUBSET

~END

 

In the example above, file2 will be created from a random sample of ten records from file1. The pound sign version of sampling = tells Mentor how many records you want. If instead of 10 records you wanted one tenth of the sample, then sampling=.1 is what you would use. Also note that sampling = can be on either the ~INPUT or ~OUTPUT statement or both. (See the notes on num_sample_cases and sub1.spx for examples of using sampling = .)

When using sampling = .n, the number of records written will be .n times the number of records available to be written (i.e. num_sample_cases). The exact number of cases written will be the result of this calculation rounded to the nearest whole number.

TRY_FOR_SAMPLING=#N AND TRY_FOR_SAMPLING=.N The “try_for” sampling options work the same way the sampling = options do, except that it is not an error when the number of records requested is not available in the sample. For example, you might select the males in your sample and use TRY_FOR_SAMPLING=#100 to get 100 males if possible, but if there are fewer than that, keep the output anyway. When fewer records than the number tried for are available, a warning is generated indicating the number that were found.
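Written out as a complete run, that males example might look like this sketch (file1, file2, and the question reference Q1(M) follow the other examples in this section):

 

EX:

~INPUT file1 SELECT=Q1(M) TRY_FOR_SAMPLING=#100

~OUTPUT file2

~EXC DO_SUBSET

~END

 

If file1 holds fewer than 100 males, file2 is still written and the warning reports how many were found.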

SELECT= Select = may now be used on either the ~INPUT, ~OUTPUT, or both. If the select appears on both the ~INPUT and ~OUTPUT, the one on the ~INPUT is executed first. No record that does not pass the ~INPUT selection criteria will have the opportunity to be written to the output file. Both SELECT = and SAMPLING = may be used at the same time.

CASEWRITTEN A special variable “casewritten” may be used as part of the select = condition. Most often it is used with NOT(), as in NOT(CASEWRITTEN). It takes effect when one is using multiple ~OUTPUT files in a subset run.

 

EX:

~INPUT file1

~OUTPUT file2a #1 SELECT=[1#1-5]

~OUTPUT file2b #2 SELECT=NOT(CASEWRITTEN)

~EXC DO_SUBSET

~END

 

COMBINING OPTIONS If SAMPLING = /SELECT = appear on both the ~INPUT and ~OUTPUT, only those records which make it past the input phase will contribute to the sampling/selection that occurs in the output phase.

For example, note the difference between the following two subset runs:

 

" This run gets all of the males from the input, and then writes

" a random sample of half of them.

~INPUT file1 SELECT=Q1(M)

~OUTPUT file2 SAMPLING=.5

~EXC DO_SUBSET

~END

 

 

" This run gets half of the respondents from the input, and then

" writes out the males.

~INPUT file1 SAMPLING=.5

~OUTPUT file2 SELECT=Q1(M)

~EXC DO_SUBSET

~END

 

In this example, the resulting output files are likely to be similar, but they won’t be the same. (Sub3.spx contains examples of using both SAMPLING = and SELECT = in the same subset run.) When SELECT = and SAMPLING = occur on the same statement (e.g. both on the ~INPUT) the SELECT = is honored before SAMPLING = is done.

Some situations to watch out for:

Suppose you want two output files each containing a random half of your original sample. You might be inclined to do:

 

~INPUT file1

~OUTPUT file2a #1 SAMPLING=.5

~OUTPUT file2b #2 SAMPLING=.5

~EXC DO_SUBSET

~END

 

The two output files in this example will contain many (about half) of the same respondents. What you probably had in mind is:

 

~INPUT file1

~OUTPUT file2a #1 SAMPLING=.5

~OUTPUT file2b #2 SELECT=NOT(CASEWRITTEN)

~EXC DO_SUBSET

~END

 

Having divided your sample into two random halves using the specs above, you might think that dividing the sample into equal thirds would be done with:

 

~INPUT file1

~OUTPUT file2a #1 SAMPLING=.33333

~OUTPUT file2b #2 SAMPLING=.33333 SELECT=NOT(CASEWRITTEN)

~OUTPUT file2c #3 SELECT=NOT(CASEWRITTEN)

~EXC DO_SUBSET

~END

 

However the above specs won’t give you equal-sized samples. For example, if file1 contains 100 records, then file2a will contain 33 records (OK so far), but file2b will contain .33333 of the 67 records not written so far (i.e., 22 records), and file2c will contain the rest of the sample (i.e., 45 records).

What you actually need is:

 

~INPUT file1

~OUTPUT file2a #1 SAMPLING=.33333

~OUTPUT file2b #2 NUM_SAMPLE_CASES=67 SAMPLING=.5 SELECT=NOT(CASEWRITTEN)

~OUTPUT file2c #3 SELECT=NOT(CASEWRITTEN)

~EXC DO_SUBSET

~END

 

NOTE: In the above example file2a will contain 33 (.33333 * 100) records, file2b will contain 34 (67 unwritten records, times .5, rounded to the nearest whole number), and file2c will contain the 33 as yet unwritten records. If “num_sample_cases” does not appear on the second ~OUTPUT statement an error will result. The same run, written out with ASCII files and HOLD_OUTPUT_UNTIL_SUBSET explicitly set, looks like this:

 

~SET HOLD_OUTPUT_UNTIL_SUBSET

~INPUT sub4.asc ASCII=80 ;

~OUTPUT sub4a.asc ASCII #1 SAMPLING=.33333 -TRIMBLANKS;

~OUTPUT sub4b.asc ASCII #2 NUM_SAMPLE_CASES=67 -TRIMBLANKS SAMPLING=.5 SELECT=NOT(CASEWRITTEN) ;

~OUTPUT sub4c.asc ASCII #3 -TRIMBLANKS SELECT=NOT(CASEWRITTEN) ;

~EXC do_subset

~END

 

NUM_SAMPLE_CASES= In order to pull a random sample from an existing sample, the number of cases in the existing sample needs to be known. For example, if you wanted five cases out of 10,000 you would want the random cases to be pulled from random locations throughout the file, not just the beginning, middle, or end of the original file.

Sometimes it’s very easy to determine the number of cases in a file. For example, CfMC system files contain the number of cases in their header, and MPE ASCII files contain the number of records in their file label. To determine the number of records in a variable-length ASCII file on Windows or Unix, a pass must be made through the data to count the records. On relatively small input files (fewer than 10,000 cases) this counting pass is nearly imperceptible in terms of run time, but on very large samples it may increase run times noticeably. If the input/output files contain select/sample options, this further complicates determining the number of cases available from which to sample and may show a corresponding increase in run times. Setting NUM_SAMPLE_CASES = will cause the subsetting process to use this setting as the number of cases from which to draw the sample, and cause the program not to make passes through the data to count the cases available to be sampled from. This should decrease run times for very large samples; however, if the number provided via NUM_SAMPLE_CASES is not correct, an error will be generated.
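For example, if you already know a large input file holds 250,000 cases (an illustrative figure, as are the file names), a sketch that skips the counting pass would be:

 

EX:

~INPUT file1 NUM_SAMPLE_CASES=250000 SAMPLING=#500

~OUTPUT file2

~EXC DO_SUBSET

~END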

REPEATABLE SUBSET RESULTS Sampling = picks a random sample from the data. Normally, each time a subsetted output file is created a somewhat different collection of records will be output. By using >random_seed = one can force the starting point of the randomizing process, and thus make it possible to repeatedly create the same “random” sample of records. Keep in mind that adding or subtracting a step in your spec file that calls a random number will change the results of subsequent random calls even if a random seed is set at the beginning of the run.
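For example, setting a seed (the value 12345 is arbitrary) at the top of an otherwise unchanged subset run makes it write the same ten “random” records every time it is run:

 

EX:

>RANDOM_SEED=12345

~INPUT file1 SAMPLING=#10

~OUTPUT file2

~EXC DO_SUBSET

~END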

3.3 MENTOR EQUIVALENTS TO SPL

Mentor is currently written in the C programming language. The earlier version of Mentor was written in the SPL programming language. The commands and syntax are different between the two versions. If you are already familiar with SPL commands and syntax, here are some of those commands and the current Mentor equivalents. Mentor equivalents to SPL’s ADD, ZAP, MOVE, CLEAR, ONTO, INTO, ZPUT are as follows:

ADD statement adds punches specified in a mask to a variable.

SPL syntax:    ADD [201] \XY\

Mentor syntax: MAKE_DATA +[201^X,Y]

ZAP statement removes punches specified in a mask from a variable.

SPL syntax:    ZAP [201] \X\

Mentor syntax: MAKE_DATA -[201^X]

MOVE statement copies data from one location into a specified receiving location after blanking the receiving location.

SPL syntax:    MOVE [201.4]=”ABCD”

Mentor syntax: TRANSFER [201.4$]=”ABCD”

SPL syntax:    MOVE [205.4]=”1234”

Mentor syntax: TRANSFER [205.4]=1234

SPL syntax:    MOVE [209.2,211.2] FROM [207.2,205.2]

Mentor syntax: TRANSFER [209.2,211.2] =[207.2,205.2]

CLEAR statement blanks the column(s) specified in a location.

SPL syntax: CLEAR [201.12]

Mentor syntax: BLANK [201.12]

ONTO statement moves a value from a sending location to a receiving location, always blanking the receiving location first.

SPL syntax:    ONTO [201] PUNCH \1.4\ [43.2#11/12/21/22]

Mentor syntax: TRANSFER [201^1//4] = [43.2#11/12/21/22]

SPL syntax:    ONTO [201] PUNCH \1.5\ [88,…,92*F^1//5]

Mentor syntax: >REPEAT $COL=88,…,92

TRANSFER [201^1//5] += [$COL^1//5]

>END_REPEAT

SPL syntax:    ONTO [201.2] ZPUT [57.2]

Mentor syntax: TRANSFER [201.2*Z] = [57.2]

SPL syntax:    ONTO [203.2] PUT [43] + [44] + [45]

Mentor syntax: TRANSFER [203.2] = [43] + [44] + [45]

SPL syntax:    ONTO [205.2] PUT SUM([68] WITH [69])

Mentor syntax: TRANSFER [205.2] = [68] ++ [69]

(See 3.1.4 Data Manipulation for Punch, String, and Numeric Variables, Arithmetic Calculations for additional information on the ”++” syntax).

SPL syntax:    ONTO [10.3] ZPUT [43] + [44] + [45]

Mentor syntax: TRANSFER [10.3*Z] = [43] + [44] + [45]

SPL syntax:    ONTO [10.6] \1\ DPUT [44] / [45]

Mentor syntax: TRANSFER [10.6*D1] = [44] / [45]

SPL syntax:    ONTO [10.6] \1\ DZPUT [44] / [45]

Mentor syntax: TRANSFER [10.6*ZD1] = [44] / [45]

INTO statement with the keyword SPREAD spreads out multi-punched data into multiple single-punched fields.

SPL syntax:    INTO [6.8] SPREAD [6]

Mentor syntax: TRANSFER [6.8$] = [6$P]