This chapter provides information on how to prepare and clean a data file. It is important to become familiar with your data in order to write procedures that tell Mentor what the data should look like. Mentor compares the existing data format to your revision of the data format, identifies errors, and provides several ways to correct these errors. You can check and correct the data interactively, one case at a time, or you can have Mentor automatically check and modify all the cases at once. This chapter also describes how to correct data by modifying it, transferring it to a new location, or combining data elements in the file.


If you don’t have any cleaning experience, and you just need to make a few changes to a data file, use the CLEANIT utility, which is described in the Utilities manual. If you already have a list of data corrections to make and you have some experience with the Mentor ~CLEANER block, you can skip to 2.5.1 Quick Reference: Cleaning Commands And Examples.





Once you have collected your data, you should check it for errors. This section summarizes the steps necessary to clean data which are described in detail in the rest of this chapter. If you are using Survent to collect your data, Survent can also automatically generate cleaning specifications for you. See section 2.4 CLEANING WITH SURVENT VARIABLES.

1 Make a backup copy of the data file. Always keep a copy of the data in its original form before you start cleaning.

2 Translate your raw data file into a file that Mentor can read. Mentor data files are referred to as System files and have an extension of “tr”. There are two menu-driven CfMC utility programs you can use to translate raw data files, MAKECASE and COPYFILE. They are described in the Utilities manual. The MAKECASE utility converts 80-column ASCII or binary card image files into Mentor System files. The COPYFILE utility converts files with records longer than 80 columns, as well as other file types. You can also use these utilities to sort the file and check for duplicate case IDs.

3 Generate reports that provide an overview of the data (these reports are commonly called “marginals”). Use the HOLE utility to generate a holecount, a count of the punches in the data for each column. This can help you spot obvious problems with the data. Use the FREQ utility to generate frequency distributions, a list of the ASCII characters found in specified locations in the data. This is useful for looking at short, multiple-column data, such as zip codes.

4 Write and test a procedure (usually referred to as “cleaning specs”) that will check the data. To test the procedure, you can tell Mentor to look at only a certain number of cases and see what errors it produces. You may need to modify your cleaning specs based on these results. Running the procedure on all of the cases in the data file will then generate a list of errors, or locations in the data that need to be checked.

5 Clean the data, based on the errors generated by the cleaning specifications. When data looks incorrect, you can choose to delete it, change it by referring back to the original survey, or change it based on guidelines that you have established.

6 Repeat steps three through five until you consider the data clean. (You can also use a final holecount to check the tables you generate from your data file.)

2.2.1 Other Types of Data Files

Although we recommend that you convert your data file to a CfMC System file, you do not have to. You can write a procedure to have Mentor read and generate a data error report on a different type of data file. Mentor can read ASCII, binary, swapped binary, or DTA files directly. The disadvantages of working with a data file that has not been converted are:

  • Processing will be slower because Mentor must convert each case before reading it.
  • Cases cannot be flagged for errors or deletion.
  • You cannot modify the data interactively using Mentor.

If, however, you have a small ASCII file, it may be easier to write a Mentor procedure to read and check the data file and then edit it in a word processing program. Your procedure can also include some data checking and modification. The basic specifications would look something like this:



~INPUT file, ASCII=length (or other file type option)

~OUTPUT newfile, ASCII=length (or other file type option)

data checking or modification commands

~END
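Filled in, a run of this kind might look like the following sketch. The file names, the field being checked, and the procedure name are hypothetical; the commands are the ones described later in this chapter.

```
~DEFINE PROCEDURE={chkraw:
CHECK [18^1/2/B] "Should be single 1,2 or blank"   ''hypothetical field to validate
}
~INPUT rawdata, ASCII=80    ''read an 80-column ASCII file directly
~OUTPUT newdata, ASCII=80   ''write the cases back out
~EXECUTE PROCEDURE=chkraw
~END
```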




See 2.3 CLEANING SPECIFICATIONS, and 2.3.2 Correcting Errors for examples of procedures that read and modify raw data. CORRECTING ERRORS, Auto-Fixing The Data contains two example procedures that modify data in batch mode. You can also refer to Appendix B: TILDE COMMANDS under the specific command for information.

2.2.2 Why Clean the Data?

The cleaning process is one of the most important steps in data tabulation. Human error occurs during data collection, especially in a paper-and-pencil study that has been manually keypunched. There is considerable advantage in using a CRT-type interviewing package such as Survent, since it makes and checks its own data file. No matter what measures you take to prevent errors, some will occur. Serious problems can jeopardize the validity of a study. If the logic of the questionnaire is flawed (e.g., causing multiple responses to a single-response question), you may have to do replacement interviews or do the entire study over. Let’s assume that your questionnaire has good logic and your data collection methods are conservative.

You must assume any data file you have not already looked at is dirty. A dirty data file is one that may have errors in it. These errors may be isolated to one column or question (more than one answer in a single answer question, number answers outside of the defined range, etc.) or across several columns or questions (skip patterns not followed, respondent asked about brands that they have never heard of, open-ends coded incorrectly, keypunch error, etc.). Data received from the field must be put through a cleaning process to increase its accuracy.

To adequately clean a data file, you must be able to change the information in the file. You need the ability to add or remove characters from columns, move data from one location to another, or blank columns altogether. You may choose to manually change one case at a time (interactive cleaning), or have the computer modify all of the cases automatically (“auto-fixing”). Interactive cleaning is the safest way to clean data, because you refer back to the original survey to make corrections. Auto-fixing is faster, but you must be careful that your modifications do not compound your errors. A combination of interactive cleaning and auto-fixing can give you both accuracy and speed. You can choose to have Mentor auto-fix simple problems and correct complex errors manually.

When you are through with the cleaning process, your data is not 100% accurate; it merely conforms to the cleaning procedures you have written. There is no such thing as data that is completely accurate. The cleaning process is an attempt to minimize errors and increase the accuracy of your data.

2.2.3 Understanding the Questionnaire

Before you do any cleaning, take the time to read through your study to make sure you understand its framework and specifics. Once you have written your cleaning specifications, run them on part of the data file to list out all errors before the data is altered. This will expose any questionnaire execution problems or specification errors. If the same error occurs frequently, make sure that it is actually an error in the data (by examining a few cases) and not a specification error. If you confirm the error, your decision will be whether to discard these questions or questionnaires, or to salvage them. Generating the initial error listing is a key step that can save you time. Below is a sample questionnaire. Make sure you understand the questionnaire’s logic by asking yourself two questions.

  1. Who should answer each question?
  2. How should each question be answered?

The questions below are meant to appear as they would on a self-administered questionnaire, collected with paper and pencil.





Q 1. What is your first name?

(5-14)  __________

Q 2. What day of the week is it today?

(15-17) MON

Q 3. Do you have any brothers or sisters?

(18) 1 YES

     2 NO  (SKIP TO Q5)

Q 4. How many brothers and sisters do you have?

(19-20)  _____

Q 5. (multiple-response question, coded in column 21: answer codes 1-7, with 8 = NONE OF THE ABOVE)


WHO should answer each question?

The respondents who should answer a question are sometimes called the base of the question. During the course of a questionnaire, respondents may be excluded from answering a question and skipped to another section of the questionnaire. The remaining respondents make up the base of the question. In our sample questionnaire, respondents who do not have siblings should not answer the question about the number of siblings.

HOW should each question be answered?

Responses for questions have limits. Word responses must be recoded into a finite number of numeric or letter codes, either by Mentor or manually. Numeric responses usually have some sort of boundaries. Answers outside of these boundaries are either considered errors, or the bounds of the questions are expanded to allow these answers. In the sample questionnaire above, question Q4 has valid answers of 1-10. The cleaning procedure will report errors on any case containing numbers or characters outside of these bounds. The same is true for question Q5; valid punches are 1-8.

NOTE: No more than 7 answers may be recorded and 8 (none of the above) cannot be given with any other answer. All of these parameters must be checked in cleaning for consistency throughout the data file.
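These limits translate directly into cleaning statements of the kind developed in section 2.3 (shown here as a preview):

```
CHECK [19.2#1//10] "Should be a number 1-10"               ''Q4: valid answers 1-10
CHECK [21*P^1//7/(-)8] "Should be multi 1-7, or single 8"  ''Q5: 8 (none) is exclusive
```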


Before beginning the cleaning process, make a copy of your CfMC data file! Always have a copy of the data in its original form, i.e., before you started cleaning. In fact, if you find you are making changes after each subsequent run of your cleaning specifications, it is good practice to make interim copies of your data file. Then if one set of changes must be undone you may only have to go back one or two copies to get to a place before the last set of data modifications. You can use your list file and ~SET LOGGING to keep track of the changes you have made.


2.3 Cleaning Specifications

Cleaning specifications are a set of instructions (called a procedure in Mentor) that checks the data for valid responses based on your description of the data. These data descriptions are called variables. If you do not know how to define variables, refer to 2.3.1 Cleaning Examples. In the cleaning procedure you will write statements that tell Mentor which columns to check and to print an error message if the data does not match your description. For example, you can include statements to indicate where a skip pattern should have been followed, what the number of responses should be, or how they should be ranked. You will use the list of data errors either to clean interactively on a case-by-case basis or to modify your procedure to correct some errors in batch mode.

Here is a cleaning specification file based on the sample questionnaire in section 2.2.3. Refer to 4.2 TABLE BUILDING BASICS for an explanation of other useful spec file commands and 4.5 META COMMANDS for an introduction to meta (>) commands. Explanations in this section will be limited to those commands pertaining specifically to cleaning.

NOTE: Two single apostrophes ('') indicate either a comment or a command that is optional or not always needed. Mentor does not process anything that follows these marks. Note that this is not a double quote ("), but rather two consecutive single apostrophes.


>CREATE_DB procs

''>USE_DB procs

~DEFINE PROCEDURE={showerr:
OK_COLUMNS [1.14]
CHECK [15.3#MON/TUE/WED/THU/FRI/SAT/SUN] "Not a valid code"
CHECK [18^1/2] "Should be single 1,2"
IF [18^1] THEN
CHECK [19.2*Z#1//10] "Should be a number 1-10"
ELSE
CHECK [19.2^B] "If 18 not 1, 19.2 should be blank"
ENDIF
CHECK [21*P^1//7/(-)8] "Should be multi 1-7, or single 8"
CHECK_COLUMNS
}

~INPUT sampl <options>
''~SET CLEAN_ALLOW_BLANKS
~EXECUTE PROCEDURE=showerr
~END





NOTES: 1. Mentor will print its own error message if you do not provide one. See under the heading Program-Generated Error Messages later in this section for examples.

2. You could also say [19.2#" "] instead of [19.2^B] to mean columns 19 and 20 are an empty string (blank). You must be sure that the number of spaces inside the quote marks is the same width as the field. This syntax is especially useful when saying an ASCII variable can also be blank, e.g., [19.2*Z#1//10/" "].

This specification file will only report where the data does not match the description. No data will be changed. Columns that do not need to be checked are specified first.

It then checks that the day of the week in columns 15-17 is one of the seven codes listed. Question Q3 in column 18 must be a code 1 or 2 and not both, designated by the / separating the punch values. If the respondent has siblings (column 18 = code 1), then question Q4 in columns 19-20 must be in the range of 1-10. Values outside of the range are not acceptable. If the respondent does not have any siblings (column 18 = 2), then we expect columns 19-20 to be blank. We expect question Q5 to have any number of responses between 1 and 7, but if it has an 8 (none) then no other answers should appear. The designation of punches 1-7 means multiple codes are acceptable; the (-) preceding the 8 means that the 8 is an exclusive code.

Finally, we check all the remaining columns in the case (22-80 in our example) to make sure they are blank. See Appendix B: TILDE COMMANDS, ~CLEANER for more examples of the CHECK command.

Here is an explanation of the commands used in the example specifications above.

>CREATE_DB procs Tells Mentor to create a DB file called PROCS to store the cleaning procedure. This is optional, but useful during the actual cleaning process if we need to re-execute the procedure on each case. The procedure can be accessed in this file by the name assigned to it.

>USE_DB procs Opens the DB file containing the cleaning procedure. Again this is optional, but useful if you need to use the procedure again either to actually clean cases or to recheck the data after cleaning. By storing your procedure in a DB file it is not necessary to re-define it each time. The entire ~DEFINE section could be eliminated in future runs of these specifications.

If you use the meta command >CREATE_DB, you do not need the meta command >USE_DB in the same run. Once you have created a DB, get items from it in another run with >USE_DB.

~DEFINE PROCEDURE= The ~DEFINE keyword that tells Mentor that you will be specifying a group of commands to be executed at some later point either on a single case or on the entire data set. There are many commands that could be specified inside the procedure structure. Those used in this example are some that you will use frequently. Other useful commands are listed at the end of this chapter.

OK_COLUMNS Tells Mentor that any data in these columns is valid. CHECK_COLUMNS will not look at these columns. These are usually the case ID columns or non-coded open-ended responses, such as name and address, that will be listed out separately.

CHECK Compares the data to the description of the data given here and prints an error message when the data does not match the description. Mentor will print default error messages or you can provide your own inside of “quotes” as shown in the example.

IF-THEN-ELSE-ENDIF Specifies some condition that must be met for the following commands to be executed. See 2.3.1 CLEANING EXAMPLES.

CHECK_COLUMNS Tells Mentor to check every column not specifically examined with a cleaning statement or marked as OK by OK_COLUMNS to make sure it is blank. It checks that columns are blank when conditionals are not met. This command is especially useful when you have many skip patterns in your questionnaire. By using this command it is not necessary to write an additional statement to check for blank columns for each conditional not met, e.g., ELSE CHECK [19.2^B] in our example. You cannot define your own error message and the command will not display the actual data, but it will list the columns you should look at for extraneous data. Refer to the sample error listings later in this section for an example.

CHECK_COLUMNS should usually be the last command in your procedure.

~INPUT sampl Opens the data file. There are many options available with this command. STOP_AFTER= is used to read only the number of cases specified. It is very useful for testing your procedure for syntax errors or errors in data descriptions such as referencing the wrong columns. This option will cause Mentor to stop reading the data file after the number of cases specified. If you have a very large sample or a complex set of cleaning conditions it will save time to test your procedure on a small number of cases first. For instance, you could run your specifications on a 50 case sample and then review the error summary. If a given data error appeared more than a few times, you would want to double check your cleaning specifications for errors perhaps in logic or an incorrect data location. The option ASCII= opens an ASCII data file. After the equal sign you must specify the length (number of columns) of the longest case in your file. Refer to Appendix B: TILDE COMMANDS under ~INPUT for other file types and options allowed on this command.
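For example, assuming an 80-column ASCII data file, the following reads only the first 50 cases for a test run:

```
~INPUT sampl, ASCII=80, STOP_AFTER=50
```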

~SET CLEAN_ALLOW_BLANKS Tells Mentor that data fields specified in either CHECK or CLEAN statements can be blank if they do not otherwise fit the description. This is especially useful for self-administered questionnaires where it is unlikely that every question is answered. If a particular field cannot be blank the cleaning procedure can include a check for that field so that a blank still produces an error. In our example question Q 4. in columns 19 and 20 should be blank (skipped) if the respondent answered No to question Q 3.

~EXECUTE PROCEDURE=showerr Tells Mentor to execute the procedure on the cases in the specified ~INPUT file.

~END Exits Mentor, closing all files opened during the run.

Here are two commands that will help you debug your procedures:

>QUIT ERRORS=# Stops executing after this number (#) of syntax errors is reached. This means all syntax errors reported by Mentor, not just those found in your procedure.

~SET PROCEDURE_DUMP Echoes Mentor’s internal process as it executes a procedure, to help determine the source of the error message.
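A debugging run might begin with both commands (the error limit of 5 is an arbitrary choice):

```
>QUIT ERRORS=5        ''stop after 5 syntax errors
~SET PROCEDURE_DUMP   ''echo each procedure step as it executes
```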

Run your specification file (e.g., CLEAN.SPX) through the Mentor program by typing the following statement from the command line of your operating system:

Mentor CLEAN.SPX CLEAN.LFL                      (DOS/UNIX)


The results of the run will go to the list file (CLEAN.LFL or CLEANLFL). This is your error listing. Errors will be listed out case by case. An error summary prints at the end. The error summary is a count of the occurrence of each error message across all cases.


For the purpose of explaining error messages, we will refer to the data listing printed below. See 9.1 GENERATING SPECIALIZED REPORTS for the specifications that produced this type of formatted listing.


ID   Q1        Q2   Q3  Q4  Q5
---- --------  ---  --  --  -------
0001 MICKEY    MON  1   12  1
0002 OLIVER    TUE      0   2
0003 PURDY     WED  3   1   1, 2, 3
0004           THU  2   ?   8
0005 POLLY          1   1   4, 5
0006 FLYNN     FRI      ?   4, 5
0007 SAMANTHA  SAT  1   ?   8
0008 PERCY     SUN  1   7
0009 WOLFIE    MON  1   3   2, 8
0010 SWEETPEA  WEE  1   2   6


A blank indicates blank punch data (no punch present) and ? indicates missing numeric data. These are defaults for a formatted listing from Mentor.

The list file CLEAN.LFL will contain an error listing similar to the one below.


ID: 0001

error 3: [19.2#] Should be a number 1-10: a valid answer is required [19.2#]="12"

error 14: [22] field should be blank [22]="1"

ID: 0002

error 2: [18^] Should be single 1,2: a valid answer is required [18^]=" "

error 4: [19.2#] If 18 not 1, 19.2 should be blank: extra punches [19.2#]="0"

ID: 0003

error 2: [18^] Should be single 1,2: a valid answer is required [18^]="3"

error 4: [19.2#] If 18 not 1, 19.2 should be blank: extra punches [19.2#]="1"

ID: 0005

error 1: [15.3#] Not a valid code: a valid answer is required [15.3#]=" "

ID: 0006

error 2: [18^] Should be single 1,2: a valid answer is required [18^]=" "

ID: 0007

error 3: [19.2#] Should be a number 1-10: a valid answer is required [19.2#]=" "

ID: 0008

error 5: [21^] Should be multi 1-7, or single 8: a valid answer is required [21^]=" "

ID: 0010

error 1: [15.3#] Not a valid code: a valid answer is required [15.3#]="WEE"

error 14: [22] field should be blank [22]="1"


12 errors in 10 cases


error  1: 2  [15.3#] Not a valid code

error  2: 3  [18^]   Should be single 1,2

error  3: 2  [19.2#] Should be a number 1-10

error  4: 2  [19.2#] If 18 not 1, 19.2 should be blank

error  5: 1  [21^]   Should be multi 1-7, or single 8

error 14: 2  $check_columns


The case ID is followed by a line that has an error number, the data location, your error message, the program-generated error message, and the data location with the actual data in quotes.

In this example, case ID 0001 has an answer of 12 in columns 19-20 for question Q4, regarding the number of siblings, and only answers from 1 to 10 are valid. Case ID 0002 has an error for columns 19-20 because column 18 is not a 1. In this case, you might want to go back to the original survey to see if question Q4 was actually blank, or if the answer was just not entered into the data. Case ID 0003 has an invalid answer (3) in column 18, and this causes a second error for columns 19-20.

If the cleaning specifications did not include the lines


ELSE

CHECK [19.2^B] "If 18 not 1, 19.2 should be blank"


the error listing would be different, because CHECK_COLUMNS would print an error for each column that should be blank, as in cases 0002 and 0003.


ID: 0001

error 3: [19.2#] Should be a number 1-10: a valid answer is required [19.2#]="12"

error 14: [22] field should be blank [22]="1"

ID: 0002

error 2: [18^] Should be single 1,2: a valid answer is required [18^]=" "

error 14: [19.1] field should be blank [19.1]="0"

ID: 0003

error 2: [18^] Should be single 1,2: a valid answer is required [18^]="3"

error 14: [20.1] field should be blank [20.1]="1"

ID: 0005

error 1: [15.3#] Not a valid code: a valid answer is required [15.3#]=" "

ID: 0006

error 2: [18^] Should be single 1,2: a valid answer is required [18^]=" "

ID: 0007

error 3: [19.2#] Should be a number 1-10: a valid answer is required [19.2#]=" "

ID: 0008

error 4: [21^] Should be multi 1-7, or single 8: a valid answer is required [21^]=" "

ID: 0010

error 1: [15.3#] Not a valid code: a valid answer is required [15.3#]="WEE"

error 14: [22] field should be blank [22]="1"


12 errors in 10 cases


error  1: 2  [15.3#] Not a valid code

error  2: 3  [18^]   Should be single 1,2

error  3: 2  [19.2#] Should be a number 1-10

error  4: 1  [21^]   Should be multi 1-7, or single 8

error 14: 4  $check_columns



Here is a list of the standard types of cleaning errors and the error messages Mentor generates:


Error                                           Error Message

A question is blank when an answer              a valid answer is required
should be present.

Single-response question with more than         too many answers
one response.

Punch question with invalid punches             extra punches
in the column(s).

Single-response question with an invalid        a valid answer is required
punch code or number out of range.

Question with an invalid punch.                 too many answers

Multi-response question where an exclusive      exclusive code violation
response appears with other punch(es)
or code(s).

Multi-response question with duplicate          duplicate codes
ASCII codes.

Multi-response question with invalid ASCII      invalid code or blank fields
codes, or leading or embedded blanks.


2.3.1 Cleaning Examples

This section provides examples for cleaning simple types of questions. Many common cleaning situations are covered here. In later sections you will find examples of more complex cleaning situations.

The terms answer, mention, and response are used interchangeably throughout the rest of this chapter. The terms field and data location each refer to the column or set of columns that should contain a valid response.

In order for the Mentor program to check the validity of your data, each question must be defined as a variable. Mentor then compares it to the actual data for errors. A data variable is something defined within square brackets ([ ]). Within those brackets you provide the data’s location, a data modifier, the data type, and finally the data categories. For more information on defining variables, especially as they relate to data tabulation, see sections 4.6 DEFINING DATA , 5.1 Expressions and Joiners and 5.2 Axis Commands/Cross-Case Operations. 4.6.1 Summary of Rules for Defining Data provides example sets that summarize the rules for defining ASCII, punch, and numeric data.

EX:  [2/5.2*F#1//17]


Data locations can be defined in one of two formats: either record number/column (2/5), where each record is 80 columns, or absolute column(85). You may then specify a width using a period and the width (85.2), or specify a range of columns using a dash (85-86).
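For example, since each record is 80 columns, record 2, column 5 is absolute column 85, and the same two-column field can be written either way:

```
CHECK [2/5.2#1//17] "Should be 1-17"   ''record/column format: record 2, column 5, width 2
CHECK [85.2#1//17] "Should be 1-17"    ''absolute-column format: columns 85-86
```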

Variable modifiers used in cleaning are:

*D#  the field contains the number of decimal places specified.

*F   checks unique mentions (i.e., no duplicates allowed) across multiple locations. Leading or embedded blanks in the field are an error.

*F#  this numeric field has <#> decimal places.

*L   checks for all possible mentions across multiple locations. Leading or embedded blanks in the field are an error.

*P=# specifies the maximum number of answers allowed for a punch variable. The default is *P=1, meaning only one answer is allowed. Just *P means that there can be as many answers as there are categories defined. # may be any number 1-255.

*S   the field is multi-punched, but only one of the punches will be retained.

*Z   this field contains leading zeros (0).
Data type is referenced by a caret (^) for punch categories, and a pound sign (#) for either numeric or ASCII categories. Categories are made up of either the string or number in the data, or the punch or punch number. Each category is separated by a single slash (/). A double slash (//) means a set of categories from first to last category (e.g., 1//5 to mean 1/2/3/4/5). A category can be marked exclusive of all others with a minus sign (-) enclosed in parentheses. This means that only this answer should be present in the data, and if it appears with any other it is an error.

You can combine modifiers, but you only need one asterisk (*). Spaces are optional between variable items.
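For example (the data locations here are hypothetical):

```
CHECK [8*P=3^1//6] "Up to 3 answers 1-6"        ''punch field, at most 3 mentions
CHECK [2/5.2*FZ#1//17] "Should be unique 1-17"  ''two modifiers, one asterisk: *F and *Z
```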


Single Column/Single Punch


2B. How often do you use your BANK CARD–

at least once a month, at least every 3 months, a

few times each year, less than that, or never?








This is an example of a single column, single response question. The cleaning statement will check that only one of the allowed punches appears in the data. Because this is the default we do not need to include the modifier *P=1 in the variable definition.

CHECK [10^1//5] "Should be single 1-5"

This means that in column 10 there can be one of five possible punches 1, 2, 3, 4, or 5. If more than one of the allowed punches is found in column 10, the program will print the error message defined inside the double quotes. It is also an error if the response is something other than one of these punches or is blank.

This is how the statement would be rewritten to allow a blank response. A single slash is used to separate two single categories.

CHECK [10^1//5/B] "Should be single 1-5 or blank"

Now this statement says that column 10 may be blank (no punches present) or may contain one of the other allowed punches one through five. You can also use N to mean “not these punches”, or use N with B to mean “not blank”.

You can use the command ~SET CLEAN_ALLOW_BLANKS to globally allow blanks on any question that you verify with a CHECK statement.
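For example, with the statement from above:

```
~SET CLEAN_ALLOW_BLANKS
CHECK [10^1//5] "Should be single 1-5"   ''a blank column 10 no longer produces an error
```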

Single Column/Multiple Punches


2A. Which of the following types of credit cards do you, or your household members have?









This question allows up to six responses. All of the punches will be stored in a single column. Punch data stores up to 12 punches per column. They can be referred to by their punch position from the starting column (1-12) or by the actual punch (1-9, 0, X, Y), where 0, X, and Y mean punch positions 10, 11, and 12 respectively.

This statement will check for any of the allowed punches in column eight. The *P modifier by itself means that column eight may contain any or all of the punches one through six.

CHECK [8*P^1//6] "Should be multi 1-6"

Again blank will be an error unless it is specifically allowed.

CHECK [8*P^1//6/B] "Should be multi 1-6 or blank"

Multiple Columns/Single Punch


Q 1. Where did you acquire this product?

(15) Received as a gift…………..1

Appliance store……………..2

Department store…………….3

Furniture store……………..4

Catalog showroom…………….5

Discount store………………6

Hardware store………………7

Kitchen specialty store………8

Building supply store………..9


Mail order………………….X

Kitchen remodeler……………Y

(16) Plumber…………………….1



In this example we need to check for one punch, but over multiple columns (e.g., a long list of choices where you are looking for the first mention). This question has 14 possible responses. Since punch data stores up to 12 punches per column, this question requires two columns.

The cleaning statement checks columns 15 and 16 for a punch. If more than one punch is found the error message is printed.

CHECK [15.2^1//14] "Should be single 1-14"

15.2 tells the program that the field starts in column 15 and is two columns wide, or columns 15 and 16. The punches are referenced by their position to the starting column. Hence a two punch in column 16 is in punch position 14, starting from column 15.

You can also check a set of questions in a single CHECK statement, for instance a series of rating scales.

CHECK [51,…,55^1//5]

This statement will check each column in the set: 51; 52; 53; 54; and 55, for any valid response one through five. The ellipsis (…) is used after the first column number to abbreviate this list of consecutive data locations. The columns do not need to be consecutive, but you will need to specify each one, e.g., [50,51,2/64,5/26^1//5].

Multiple Columns/Multiple Punches


Q 8. From your own experience and knowledge, what do you especially like about this restaurant?

(Check all that apply)


(15) 1 Good service/prompt service

2 Dependable/continuous service

3 The courteous employees they have/helpful

4 I like the food/good food

5 Food selection

6 Good prices

7 Computerized/accurate billing

8 Helpful in explaining billing questions

9 Good entertainment

0       Variety of entertainment

X Nice family atmosphere

Y       Established place/has been around for awhile

(16) 1 Accessible/Available/They’re everywhere

2 Like everything/good place

8 Other

0       Don’t know/No answer

X Nothing


This question allows more than one response, but if either of the last two responses is chosen then no others should be present.

The statement below checks for any of the allowed punches in columns 15 and 16. It also checks that if either Don’t Know or Nothing was chosen then no other response is allowed.

CHECK [15.2*P^1//14/20/(-)22/(-)23] "Should be multi 1-14,20 or single 22 or 23"

Multiple Columns (non-contiguous)

This cleaning situation would usually occur where you have a question with an Other/specify and no columns or not enough columns were available in the data file to code the ‘Other’ responses in columns consecutive to the original question. You still need to write separate CHECK statements that check for inconsistencies such as an exclusive punch violation code or more than one response when only one is allowed, in addition to valid punches or blank.
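For instance, suppose the main question is punched in column 25 and the ‘Other’ codes are punched in record 2, column 40 (hypothetical locations), with punch 5 in 2/40 an exclusive ‘None’ code. A sketch of the separate checks might look like this:

```
CHECK [25*P^1//12/B] "Should be multi 1-12 or blank"   ''main question
CHECK [2/40*P^1//5/B] "Other codes 1-5 or blank"       ''non-contiguous 'Other' codes
IF ([25^1//12] OR [2/40^1//4]) AND [2/40^5]
ERROR "5 in 2/40 (None) is exclusive across both fields"
```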

Complex Single Punch

EX:   IF NUMBER_OF_ITEMS([1/10^1//10] WITH [2/10^1//5]) <> 1

ERROR "Should only be one answer in cols 1/10 or 2/10 combined"
ERROR "Should only be one answer in cols 1/10 or 2/10 combined"


This example uses a more complex cleaning statement than you have seen in previous examples. There are two data locations and a possible 15 categories, but only one can be present. We need a way to treat the two data locations and all of their categories as one unit in order to count the total number of categories present. The keyword WITH is a vector joiner that connects the categories in record one column 10 with those in record two column 10. This forms a single expression including all the categories. NUMBER_OF_ITEMS is a Mentor function (meaning it acts on the expression given inside the parentheses) that counts the number of categories that are true for the current case. If the result is not equal (<>) to one, then we instruct the program to flag the case and print the error message specified. To allow No Answer you would say > 1.
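The counting logic behind NUMBER_OF_ITEMS over two locations joined by WITH can be mimicked in Python (an illustrative sketch, not Mentor syntax; the helper name is hypothetical): count every category that is on across both fields and require the total to be exactly one.

```python
def number_of_items(*punch_sets):
    """Count categories present across joined locations, mimicking
    NUMBER_OF_ITEMS([1/10^1//10] WITH [2/10^1//5])."""
    return sum(len(s) for s in punch_sets)

rec1_col10 = {7}      # punches present in record 1, column 10
rec2_col10 = set()    # punches present in record 2, column 10
if number_of_items(rec1_col10, rec2_col10) != 1:
    print("Should only be one answer in cols 1/10 or 2/10 combined")
```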

Multiple Punches With An Exclusive Response

EX:   IF ([1/10^1//10] OR [2/10^1//4]) AND [2/10^5]

ERROR "Can be multi in 1/10,2/10 or 2/10^5"


This example is similar to the one described above, but more than one response can be present in either location. In addition, one response is considered exclusive, meaning if it is present then no other response is allowed.

This example uses two other joiners, OR and AND. This IF statement says that if you have any of the allowed categories turned on anywhere in either record one column 10 OR record two column 10, AND record two column 10 is a five punch (and exclusive), then this is an error.

Remember, neither of these examples checks the validity of the actual data in these columns. They only count categories present in the expression formed by the joiner(s). Prior to one of these IF blocks, you would check the data with individual CHECK statements for each column.
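The exclusive-response rule itself can be sketched in Python (illustration only, not Mentor code; assume the punches from both locations have been pooled into one set first): an exclusive punch is an error whenever any other punch accompanies it.

```python
def exclusive_violation(punches, exclusive):
    """True when an exclusive punch appears alongside any other punch,
    the condition the IF ... AND joiner example flags as an error."""
    return exclusive in punches and len(punches) > 1

assert exclusive_violation({5, 2}, exclusive=5)      # error case
assert not exclusive_violation({5}, exclusive=5)     # exclusive alone is fine
assert not exclusive_violation({1, 3}, exclusive=5)  # multi without exclusive
```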

Refer to Appendix B: TILDE COMMANDS under ~DEFINE VARIABLE= for information on other joiners and functions. Joiners are also used in data tabulation to form complex banners and to base the table (see 5.1 Expressions and Joiners).

Cleaning ASCII Data

ASCII data can be collected either as a number or a string. To specify ASCII data, use a pound sign (#). Categories of responses are specified in the same way as for punch data, with a single (/) or double slash (//). ASCII data can also be checked for zero-filled columns. Numeric data can be checked for literals like DK, indicating a response such as Don’t Know or Refused.

Numeric Data

Q 3. How many years have you lived at this address?

(Enter RF for refused)

This question collects a numeric response and includes a literal if the respondent does not provide an answer.

CHECK [7.2#1//20/RF] "Should be a number 1-20 or RF"

The cleaning statement will check for a range of numbers or the literal RF. An error will occur if what is found in the data does not match your description (number out of range, blank field, different literal, etc.).

For this example we have assumed that the range can be 1-20 years. That requires a two column field. If the range were 1-5 years we would still be checking a two column field since the literal RF will be coded into two columns.
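The range-or-literal test can be illustrated in Python (a hypothetical helper, not Mentor syntax): the two-column field passes if it strips to an allowed literal or to a number in range; anything else, including a blank field, is an error.

```python
def check_numeric_field(field, low, high, literals=("RF",)):
    """Validate a fixed-width field as a number in [low, high] or an
    allowed literal -- the rule CHECK [7.2#1//20/RF] applies."""
    text = field.strip()
    if text in literals:
        return True
    return text.isdigit() and low <= int(text) <= high

assert check_numeric_field("12", 1, 20)
assert check_numeric_field("RF", 1, 20)
assert not check_numeric_field("21", 1, 20)   # out of range
assert not check_numeric_field("  ", 1, 20)   # blank field
```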

Zero-Filled Numeric Data

You may be post-processing the data file in a software package that requires a number in every column. It is easy to modify the previous cleaning statement to check for leading zeros rather than blanks. In addition, it is often easier to examine your data if all of the columns in a numeric field contain a number. Here is the same CHECK statement modified to verify that where the response is less than 10, the leading columns are zero-filled.

CHECK [7.2*Z#1//20/RF] "Should be 01-20 or RF"

*Z says that leading columns for numeric responses must contain a zero or else it is an error. The default is to allow blanks.
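The effect of *Z can be sketched in Python (illustrative only; the helper is hypothetical): the whole field must be digits, so a leading blank fails where a leading zero passes.

```python
def check_zero_filled(field, low, high, literals=("RF",)):
    """Like the plain range check, but mimic *Z: leading positions must
    be zeros, so ' 5' fails where '05' passes."""
    if field in literals:
        return True
    # every character must be a digit -- no leading blanks allowed
    return field.isdigit() and low <= int(field) <= high

assert check_zero_filled("05", 1, 20)
assert not check_zero_filled(" 5", 1, 20)   # leading blank: error under *Z
```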

Single Response String Data

Responses coded as literals are referred to as string data. Strings can be alphabetic or numeric, and unlike punch data there is only one code per column. Typical examples would be state or zip codes.

Q 11. Please enter the appropriate 4-character code for your department.

In this question the respondent must fill in a code designating their department within a company. To clean this question you would need to check the field for each legitimate code (presumably from a list of all the possible department codes):

CHECK [5/5.4#A111/P222/M333/S444/T555] "Not a valid dept. code"

Notice how the columns are referenced in the statement above. Record/column is the program default for all column references. Absolute column references are also allowed. We could have written the data column location as 325.4. If we wanted to allow a blank for this field then the CHECK statement would look like this:

CHECK [5/5.4#A111/P222/M333/S444/T555/" "]

The set of double quotes acts like the B used in punch data examples. We have now defined a variable to say 5/5.4 may be any one of these responses or blank.

Multiple Response String Data

Coded Open Ends

You may have an open-end on a survey coded to allow for multiple answers of two or more digits, as in the following example:





(19-20)  ______________________________________

(21-22)  ______________________________________

(23-24)  ______________________________________


Imagine that valid answers are coded from 01 through 17. Code 95 stands for Other and code 99 stands for an exclusive code, such as Don’t Know. No code should appear more than once, there can be no leading or embedded blanks, and the exclusive codes cannot appear with any other code. The CHECK statement to clean this question for all of the parameters given would look like this:

EX: CHECK [19.2,21,23*ZF #1//17/95/(-)99] "Can be 1-17,95 or 99"

All three fields are specified in the same variable. We could have used the ellipsis, e.g., [19.2,...,23], to say the same thing. Also note that fields do not have to be contiguous:

EX: [1/23.2,1/25,1/41,2/43*ZF #1//17/95/(-)99].

*Z says that the first column for each of the codes one through nine must contain a zero (0), e.g., 02, and if not it is an error.

*F will check all three locations for the responses defined in the variable. An error is generated if an invalid code is found; if any code appears more than once; if the exclusive code appears with another code; or if leading blanks or embedded blanks are found (due possibly to keypunch error).
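The combined *ZF rules can be illustrated in Python (a sketch of the logic only; the helper and data layout are hypothetical, not Mentor code): valid zero-filled codes, no duplicates, the exclusive code alone, and no embedded blanks.

```python
def check_coded_fields(fields, valid, exclusive):
    """Apply the *ZF rules to a list of fixed-width code fields: codes
    must be valid and zero-filled, no duplicates, no embedded blanks,
    and an exclusive code must stand alone."""
    codes = [f for f in fields if f.strip()]
    errors = []
    if any(not c.isdigit() or c not in valid for c in codes):
        errors.append("invalid or non-zero-filled code")
    if len(set(codes)) != len(codes):
        errors.append("duplicate code")
    if any(c in exclusive for c in codes) and len(codes) > 1:
        errors.append("exclusive code with other codes")
    # a blank field followed by a coded field is an embedded blank
    if any(not fields[i].strip() and fields[i + 1].strip()
           for i in range(len(fields) - 1)):
        errors.append("embedded blank")
    return errors

valid = {f"{n:02d}" for n in range(1, 18)} | {"95", "99"}
assert check_coded_fields(["01", "95", "  "], valid, {"99"}) == []
assert "duplicate code" in check_coded_fields(["02", "02", "  "], valid, {"99"})
```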

Allowing Duplicate Codes

You might have a situation where duplicate codes in the data are not an error. In this case you would need to use the *L modifier.

EX: CHECK [19.2,21,23 *ZL #1//17/95/(-)99]

*L will still print an error for an invalid code, an exclusive code violation, or if leading or embedded blanks are found, but duplicate valid codes will not be an error.

Multiple Punches in a Single Response Field

You can clean a field that received multiple punches when it should have been a single-punch field. Use the *S modifier to determine which punch to keep, based on the order of categories in your statement.

EX: MODIFY [10^6/5/4/3/2/1] = [9*S^6/5/4/3/2/1]

In this example, if the field contains the punches 6, 4 and 2, only the 6 will be retained. If the field contains punches 4 and 2, only the 4 will be retained.
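The keep-one-punch priority logic of *S can be sketched in Python (illustration only, not Mentor code): walk the category order and keep the first punch found.

```python
def keep_one_punch(punches, priority):
    """Mimic *S: of the punches present, keep only the one listed first
    in the priority order (e.g. MODIFY ... = [9*S^6/5/4/3/2/1])."""
    for p in priority:
        if p in punches:
            return {p}
    return set()

assert keep_one_punch({6, 4, 2}, priority=[6, 5, 4, 3, 2, 1]) == {6}
assert keep_one_punch({4, 2}, priority=[6, 5, 4, 3, 2, 1]) == {4}
```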

Checking Multiple Locations For A Valid Response

To simply check multiple locations for a valid response do not use either the *F or the *L modifier. There is also no need to specify any exclusive responses since only one valid response can appear in each set of columns. There is no relationship amongst the responses in these columns. The only errors we care about are either an invalid code or a single-digit code without a leading zero (0):

EX: CHECK [19.2,21,23 *Z #1//17/95//99]

This statement is the same as writing three separate CHECK statements. It checks each of 19.2, 21.2, and 23.2 for any response 01-17 or 95-99.

Cleaning for Skips or Bases

A common situation you will clean for is whether skip patterns were executed correctly in the questionnaire. This is accomplished by writing an IF-THEN-ELSE-ENDIF block. In our sample questionnaire question 4 should have a valid response only if the response to question 3 was Yes (respondent has siblings).


EX:   IF [18^1] THEN
         CHECK [19.2*Z#1//10] "Should be a number 1-10"
      ELSE
         CHECK [19.2^B] "Should be blank"
      ENDIF



This example says if column 18 contains a 1 punch (answered Yes to question 3) then check columns 19 and 20 (how many siblings) for any valid response 1 through 10, otherwise (ELSE) those columns should be blank (meaning the respondent answered No to question 3). The word THEN is not required. If you use the CHECK_COLUMNS command in your cleaning procedure, the ELSE clause in this IF statement is not needed unless you want to specify your own error message. CHECK_COLUMNS will report any columns that are not blank. Commands are indented and on separate lines for readability only.
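The same skip-pattern check can be sketched in Python (a hypothetical helper illustrating the logic, not Mentor syntax): when Q3 is Yes the sibling count must be 1-10, otherwise the field must be blank.

```python
def check_skip(col18, col19_20):
    """If Q3 = Yes (1 punch in column 18) the sibling count in columns
    19-20 must be 1-10; otherwise the field must be blank."""
    if 1 in col18:
        return col19_20.strip().isdigit() and 1 <= int(col19_20) <= 10
    return col19_20.strip() == ""

assert check_skip({1}, "05")
assert not check_skip({2}, "05")   # should be blank after a No
assert check_skip({2}, "  ")
```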

You can have unlimited levels of IF blocks within IF blocks. You can use the GOTO command to branch into or out of an IF block.

Branching in the Cleaning Procedure

The GOTO command allows you to branch anywhere (forward only) in your cleaning procedure including into or out of IF blocks. GOTO is especially useful to skip large blocks of questions where a conditional statement would contain many statements, or to exit a complex IF block. GOTO’s are often used at the beginning of a section such as check only if male or female. Most of the time you will control spec file execution with IF blocks.

NOTE: Test carefully if you skip into an IF block. You could get unexpected results. An ELSE coming after the GOTO label will be ignored. Below is an example to illustrate this point.




GOTO XX
IF [5^1] THEN
   PRINT "after the if/before XX"
XX: PRINT "at XX/before else"    ''Only this line inside the IF is executed
ELSE
   PRINT "in the else"           ''this ELSE is ignored
ENDIF
PRINT "after XX"




In our example paper and pencil questionnaire let’s say that a No response to question 3, “Do you have any siblings?,” skips to question 11. The IF statement checks column 18 for a 2 punch (respondent does not have siblings), the GOTO command then branches to the label Q11 skipping all of the cleaning statements for questions five through ten.

EX:   IF [18^2] THEN
         GOTO Q11
      ENDIF

      cleaning statements for questions 5-10

Q11:  cleaning statement for question 11

Without the GOTO in the previous example, you would have to include the cleaning statements for questions five through ten in one or more IF blocks to control when they would execute. If your procedure includes a CHECK_COLUMNS command then the program will print an error if any of the columns used by questions five through ten are not blank. See the example cleaning procedure in section 2.3 for the sample paper and pencil questionnaire.

A GOTO statement can have multiple conditions and labels. If the question you are checking includes more than one category, such as male/female respondent, you can write a GOTO with a label for each one.

EX:   CHECK [18^1/2] “Should be single 1,2”

GOTO(*,Q11) [18^1/2]

cleaning statements for questions 5-10

Q11: cleaning statement for question 11

The CHECK statement checks column 18 for a valid response. The GOTO says if the response is a one continue processing (*), and if the response is a two branch to the label Q11. If response one also branched to another point in the cleaning procedure, then we could have provided a label for it, e.g., GOTO (Q20,Q11) [18^1/2]. The comma delimiter is optional, but use at least one space between labels if you do not use a comma.

Complex Cleaning

Mentor provides many features to help you check the validity of your data. You have seen just a few of the most commonly used commands in our example cleaning procedure. In addition to other cleaning commands, there are also functions that allow you to count responses, add fields to check a sum, modifiers to treat the data found in a field differently than the program default, and keywords called system constants that let you quickly get information about your data. We will cover a few of the more common examples here, but refer to sections "9.3.1 System Constants" and "9.3.2 Functions" for information on other useful features.

Counting Number of Responses

In a previous example under Multiple Columns (non-contiguous) we used the NUMBER_OF_ITEMS function to count categories present across a range of columns. The same function can also count categories in a single column.

You may need to verify that a specific number of responses is present in a particular field. If we define the range of valid responses as separate categories, then NUMBER_OF_ITEMS can count the categories that have a response. The result can be compared to the number of responses that must be present.

Here is an example where respondents are asked to circle their three most common sources of news information on a precoded list. The cleaning statements follow.











CHECK [5*P^1//8] "Can be multiple 1-8"

IF NUMBER_OF_ITEMS([5^1//8]) > 3 THEN
   ERROR "Has more than 3 punches" [5.1$P]
ENDIF



The first statement checks the field for valid punches, any response one through eight. The NUMBER_OF_ITEMS function counts the number of categories (5^1, 5^2, 5^3, 5^4, 5^5, 5^6, 5^7, and 5^8) present in the current case. If the result is greater than three then the case is flagged with an error and the message is printed.

Use the ~CLEANER command ERROR whenever you are not using a CHECK command; most often you will use ERROR inside an IF condition structure, as in the example above. ERROR prints the ID of the current case and any text enclosed in “quotes” following the ERROR command. It also flags the case where an error has been found. You can locate flagged cases in the open data file with the FIND_FLAGGED command. Optionally, you can print the contents of a field as part of the ERROR statement: as a string or literal ($); as punches ($P); or as a number [location]. In our example the contents of column five will be printed as punches. Refer to Chapter 3: Reformatting Your Data and Appendix B: TILDE COMMANDS under ~CLEANER PRINT_LINES for other examples.

There are two other commands used to generate text: SAY and PRINT_LINES. These commands are best used for trouble-shooting because they are not included as “hits” in the cleaning summary. SAY does not print the case ID, does not set the error flag on the case, and prints only to the list file, not to the print file.

Here is the previous ERROR statement rewritten with the SAY command. The System constant CASE_ID causes the case id to print before the “text”:

SAY CASE_ID "Has more than 3 punches" [5.1$P]

If you need to send this message to the print file, use the PRINT_LINES command:

PRINT_LINES "/s has more than 3 punches /s" CASE_ID [5.1$P]

Checking For A Constant Sum

An example of a constant sum would be where you have asked respondents what percent of the time they spend on three activities. Your cleaning statement might need to check that the sum of the responses adds up to 100%.

The SUM function returns the sum of values of the numbers that exist in the data. Missing values are ignored, and if all values are missing then the result of SUM is missing.

EX:  IF SUM([5.2,7,9]) < 100
        ERROR "Responses do not add up to 100%"
     ENDIF


In this example, the three data locations have the same length, so it can be expressed as one data reference with multiple locations, and the length only needs to be specified once. A length of one is the default; specifying it is optional. When the lengths are different each location and length must be specified separately, separated by commas:

EX: SUM([5.2],[7.3],[2/10.5]) < 100

You may specify any of the following inside the parentheses ( ):

  1. a data location
  2. an absolute number
  3. a number returning function such as SQUARE_ROOT or NUMBER_OF_ITEMS, refer to “9.3.2 Functions” for other functions that return numbers
  4. a math statement such as 2 * 4
  5. the name of a previously defined numeric variable

EX:   SUM([5.2],123.5,NUMBER_OF_ITEMS([5^1//8]),8*3,AGE)
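The missing-value behavior of SUM can be illustrated in Python (a sketch of the stated semantics, not Mentor code; None stands in for a missing value): missing values are ignored, and an all-missing input yields a missing result.

```python
def mentor_sum(*values):
    """Mimic SUM: ignore missing values (None); if every value is
    missing, the result itself is missing (None)."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None

# constant-sum check: three percentages should total 100
if (mentor_sum(40, 35, 25) or 0) != 100:
    print("Responses do not add up to 100%")
```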

Checking an Aided/Unaided Awareness Grid

Next is an example of an aided/unaided awareness grid that asks about brand recognition and usage. In the cleaning procedure (in addition to checking for valid responses) we want to be sure that the first mention is unique and that the respondent did not mention advertising for brands not mentioned previously. Then we want to check that an aided mention was not previously mentioned (unaided).

Next we check that the aided advertising question was asked, and if so that those mentions appear somewhere previously. Finally the usage question is checked making sure that those mentions also appear in one of the previous mention questions.


Q.1a First Unaided Mention            columns 21-22

Q.1b All Other Unaided Mentions       columns 23-24

Q.1c Advertising Unaided Mentions     columns 25-26

Q.2a Aided Mentions (key brands)      columns 27-28

Q.2b Aided Advertising Mentions       columns 29-30

Q.2c Aided Usage                      columns 31-32




                     Q.1a  Q.1b  Q.1c  Q.2a  Q.2b  Q.2c
COKE ..............  21-1  23-1  25-1  27-1  29-1  31-1
DR. PEPPER ........     2     2     2
GATORADE ..........     3     3     3     3     3     3
MELLO YELLOW ......     4     4     4     4     4     4
MOUNTAIN DEW ......     5     5     5     5     5     5
MUG ROOT BEER .....     6     6     6
MUG CREAM .........     7     7     7
PEPSI-COLA ........     8     8     8     8     8     8
RC COLA ...........     9     9     9     9     9     9
7-UP ..............     0     0     0
SLICE .............     X     X     X     X     X     X
SPRITE ............     Y     Y     Y
SUNDROP ...........  22-1  24-1  26-1  28-1  30-1  32-1
OTHER (SPECIFY:) ..     8     8     8
NONE/NO MORE ......     X           X     X     X     X
DON'T KNOW ........     Y           Y     Y     Y     Y


Here is the cleaning procedure for this grid. The numbers in the left margin correspond to the explanation that follows.




''Unaided First Mention (Q.1a)

1 CHECK [21.2^1//13/20/(-)23/(-)24]

2 IF [21.2*P^1//13/20] THEN

''All Other Unaided Mentions (Q.1b)

2a   CHECK [23.2*P^1//13/20/23]

2b   IF [21.2^1//13] INTERSECT [23.2^1//13] THEN

2c     ERROR "23.2:" [23.2$P] "HAS SOMETHING ALSO IN 21.2:" [21.2$P]



''Unaided Advertising Mentions (Q.1c)

3   CHECK [25.2*P^1//13/20/23/24]

3a   IF [21.2,23.2*F^1//13/20] >= [25.2^1//13/20] ELSE

3b     ERROR "25.2:" [25.2$P] "HAS SOMETHING NOT IN 21.2:" [21.2$P] "OR 23.2" [23.2$P]




''All Aided Mentions (key brands only) (Q.2a)

4 IF NUMITEMS([21.2,23*F^1/3//5/8/9/11/13]) < 8 THEN

4a   CHECK [27.2*P^1/3//5/8/9/11/13/23/24]

4b   IF [27.2^1/3//5/8/9/11/13] INTERSECT [21.2,23.2*F^1/3//5/8/9/11/13] THEN

4c     ERROR "27.2:" [27.2$P] "HAS SOMETHING ALSO IN 21.2:" [21.2$P] "OR 23.2" [23.2$P]




''Aided Advertising Mentions (key brands) (Q.2b)

5 IF NUMITEMS([21.2,23,27*F^1/3//5/8/9/11/13]) >= NUMITEMS([25.2^1/3//5/8/9/11/13]) THEN

5a   CHECK [29.2*P^1/3//5/8/9/11/13/23/24]

5b   IF [21.2,23,27*F^1/3//5/8/9/11/13] >= [29.2^1/3//5/8/9/11/13] ELSE

5c     ERROR "29.2:" [29.2$P] "HAS SOMETHING NOT IN 21.2:" [21.2$P] "23.2:" [23.2$P] &
       "OR 27.2:" [27.2$P]

5d IF [25.2^1/3//5/8/9/11/13] INTERSECT [29.2^1/3//5/8/9/11/13] THEN

5e   ERROR "29.2:" [29.2$P] "HAS SOMETHING ALSO IN 25.2:" [25.2$P]




''Aided Usage Mentions (Q.2c)

6 IF NUMITEMS([21.2,23,27*F^1/3//5/8/9/11/13]) > 0 THEN

6a   CHECK [31.2*P^1/3//5/8/9/11/13/23/24]

6b   IF [21.2,23,27*F^1/3//5/8/9/11/13] >= [31.2^1/3//5/8/9/11/13] ELSE

6c     ERROR "31.2:" [31.2$P] "HAS SOMETHING NOT IN 21.2:" [21.2$P] "23.2:" [23.2$P] &
       "OR 27.2:" [27.2$P]





Indenting embedded IF statements is the recommended style for specifications. This allows you to keep track of which IF’s are still in effect.

1 CHECK first mention for a valid single response. Responses 23 and 24 are exclusive.

2 IF a brand was mentioned

2a THEN CHECK other mentions for a valid response: can be one or more of (1-13,OTHER) or DON’T KNOW. NONE/NO MORE would be an invalid second mention, so code 24 is not included here.

2b IF the first mention appears with other mentions, it must be a duplicate mention.

NOTE: OTHER is the exception.

2c THEN print an error and the contents of the columns as punches ($P).

3 CHECK unaided advertising for a valid response, can be multi (1-13,OTHER) or single NONE, DON’T KNOW.

3a IF the response is in unaided advertising (25.2) then it must also appear in either first mention (21.2) or other mentions (23.2)

3b Otherwise (ELSE) it is an error. Print “error message” and print the contents of the columns as punches (e.g., [25.2$P]).

4 IF all the “key brands” (i.e., the eight brands on the aided list) were mentioned then there are no brands to ask about. This statement counts key brands mentions unaided. If it is less than eight THEN it will check for aided mentions.

4a CHECK unaided mentions for valid codes.

4b IF a key brand appears in either first mention or other mentions, and also in aided mentions

4c THEN print an error and the contents of the columns as punches ($P).

5 IF the number of total key brands mentioned (unaided or aided) is greater than or equal to the number of key brands mentioned in unaided advertising, then there were some brands to ask about in aided advertising mentions.

5a THEN check aided advertising mentions for valid responses. Those can be any of the key brands (multi-punched) or single punched NONE, DON’T KNOW.

5b IF there is a mention in aided advertising (29.2) then it must be mentioned previously in aided or unaided (21.2, 23.2, or 27.2)

5c Otherwise (ELSE) print an error and the contents of the columns as punches ($P).

5d IF aided and unaided advertising mentions share any responses

5e THEN print an error and the contents of the columns as punches ($P).

6 IF the count of unaided and aided key brands mentions is greater than 0

6a THEN CHECK Aided Usage Mentions for valid responses.

6b IF there is a mention in usage (31.2) then it must be mentioned previously in unaided or aided mentions (21.2, 23.2, or 27.2)

6c Otherwise (ELSE) print an error and the contents of the columns as punches ($P).

7 Report all columns that should be blank.

Checking Ranked Responses

There are two Mentor functions to help you clean ranked responses. The one you use depends on what you expect the data to look like: either that all of the ranks present are consecutive starting with the first rank (less restrictive), or that every one of the required ranks is present (more restrictive).






MCDONALD'S ........ ----- (10)
WENDY'S ........... ----- (11)

TACO BELL ......... ----- (13)
PIZZA HUT ......... ----- (14)
JACK IN THE BOX ... ----- (15)
CARL'S JR. ........ ----- (16)


CAESAR'S PIZZA .... ----- (19)


First you would check for a valid response across the columns (including no response). Next you would write an IF statement with either the CASCADE or the COMPLETE function, depending on how you want to check the ranking of those responses.

EX: CHECK [10,...,19^1//5/B]

Using the CASCADE Function

The CASCADE function checks that ranking starts with the first value and without breaks through the end of the list.

EX: IF CASCADE([10,...,19*F^1//5]) ELSE



CASCADE requires that the starting point for ranking matches the variable definition. In this example, the respondent must have ranked one of the restaurants as one since the first category is one. Not all five rankings need to be present in the data, but they must be consecutive. For example, if a restaurant is ranked as fourth, then there must be restaurants ranked as first, second and third. However, just a one ranking or no ranking at all would not be considered an error.

In earlier examples we used the *F modifier to verify data using a CHECK statement. *F actually nets together occurrences of the same mention across the columns specified. In a CHECK statement if responses can be netted then duplicates exist and this is treated as an error. In checking that ranked responses cascade, we are only concerned that the rankings follow the categories defined in the variable.

In the variable defined above, *F creates 5 categories: 10,...,19^1; 10,...,19^2; 10,...,19^3; and so on through 10,...,19^5.

Duplicate rankings are netted together. For instance, if columns 10 through 19 contain two number three rankings, they are treated as one number three ranking. CASCADE then checks that the final set of responses starts with the one response and continues, or cascades, through the end of the list (response five).
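The cascade rule can be sketched in Python (an illustration of the logic on the netted set of ranks, not Mentor code): the ranks used must run 1 through k with no breaks.

```python
def cascade(ranks_present):
    """Mimic CASCADE over netted categories: the ranks used must start
    at 1 and be consecutive (1..k for some k >= 0)."""
    return ranks_present == set(range(1, len(ranks_present) + 1))

assert cascade({1, 2, 3})      # ranks 1-3 used, consecutive
assert cascade(set())          # no ranking at all is allowed
assert not cascade({1, 2, 4})  # rank 3 is missing
assert not cascade({2, 3})     # does not start at 1
```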

Using The COMPLETE Function

Unlike CASCADE, the COMPLETE function requires that all the categories defined be present in the data. In our example, COMPLETE says that only the set one through five is correct and anything else, including no ranking, is an error. Remember that COMPLETE could return many errors for a self-administered questionnaire since it is unlikely that all the required answers will be present.




Ranking must be consecutive, and must contain one through five (duplicate rankings are allowed):


IF COMPLETE([10,...,19*F^1//5]) ELSE
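The stricter COMPLETE rule can likewise be sketched in Python (illustration only, not Mentor code): every required rank must be present, and no ranking at all is also an error.

```python
def complete(ranks_present, required=frozenset(range(1, 6))):
    """Mimic COMPLETE: every required rank (here 1-5) must be present;
    anything else, including no ranking, is an error."""
    return ranks_present == set(required)

assert complete({1, 2, 3, 4, 5})
assert not complete({1, 2, 3})   # incomplete ranking
assert not complete(set())       # no ranking is also an error
```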




If you need to check for duplicate rankings, you will need to add this condition along with either CASCADE or COMPLETE:

IF NUMBER_OF_ITEMS([10,...,19^1//5]) > NUMBER_OF_ITEMS([10,...,19*F^1//5])



The function NUMBER_OF_ITEMS counts all occurrences of the rankings including duplicates and then compares this to the net count. If the total count is greater than the net count, you have duplicates. The error message prints the data in all 10 columns so that you can see the actual duplicates.
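The total-versus-net comparison can be sketched in Python (a hypothetical helper illustrating the logic; each column is modeled as the rank value it holds, with 0 meaning unranked):

```python
def has_duplicate_ranks(columns, valid=range(1, 6)):
    """Compare all occurrences of valid ranks to the netted (unique)
    count, as the two NUMBER_OF_ITEMS calls do."""
    occurrences = [c for c in columns if c in valid]
    return len(occurrences) > len(set(occurrences))

assert has_duplicate_ranks([1, 2, 2, 0, 0])       # rank 2 appears twice
assert not has_duplicate_ranks([1, 2, 3, 0, 0])   # all ranks unique
```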

2.3.2 Correcting Errors

Once you have your error listing, you should examine it to decide if any errors can be cleaned in batch mode. That will mean writing a procedure using commands that will cause the program to alter the data. Under Auto-fixing The Data you will find two examples of batch mode cleaning. This section will explain how to clean your data from the listing on a case by case basis using the procedure that generated the error listing. By using the procedure to clean your data, you will also find any new errors that could be introduced as a result of altering a field.


Start the cleaning section of Mentor by entering:

CLEANIT
This utility starts Mentor, starts the cleaner block, and opens a log file called clean.log. CLEANIT records your commands in the log file so you can have a record of the commands you issued. If a file called clean.log already exists, CLEANIT will append to the existing file. If you do not want to append to an existing log file, you will have to rename or remove clean.log before starting CLEANIT again.

Section 2.5.1 Quick Reference: Cleaning Commands And Examples has a list of commonly used ~CLEANER commands and a brief explanation of each one. For a complete list of ~CLEANER commands, see Mentor, Volume 2, Appendix B. For a description of the CLEANIT utility, see the Utilities manual.

Now, open the DB file for access to SHOWERR (the cleaning procedure defined in 2.3 CLEANING SPECIFICATIONS):

CleaNer--> >USE_DB procs

Load the data file with FILE command:

CleaNer--> FILE sampl

Start the procedure SHOWERR with the HUNT command:

CleaNer--> HUNT showerr

HUNT starts at the beginning of the data file, stops on the first case with an error, and displays all the errors in that case.

For example:

ID0001, error 3: [19.2#] Should be a number 1-10: a valid answer is required [19.2#]="12"

Now you can modify the data with the MODIFY_ASCII command:

CleaNer--> MA 19.2

Mentor will display the two columns and their contents (vertically), and give you the MODIFY_ASCII (MA) prompt:


ID: 0001, #1 1/19.2







This is where you will enter the correct data:

MA ->08

After modifying the data, you can check to see if the data is corrected with the DISPLAY_ASCII command:

CleaNer--> DA 19.2

Mentor will now display:


ID: 0001, #1 1/19.2






The rest of interactive cleaning is just repeating these steps for each case with an error in it. You can use HUNT (start at the beginning of the file) or FIND (move forward from the current case) to find the next case with an error. HUNT is a good way to re-check the cases you have already corrected. If you choose NOT to correct a case and want to move on to the next case with an error, you can move forward one case with the NEXT command, and then enter “FIND showerr.”

Interactive Cleaning Tips

• You can execute a procedure on only the current case by using an exclamation point, for example:


• You can specify more than one command on a line by separating them with semicolons, for example:

NEXT; DA 19.2

• You can use the REDO command to repeat the previous command.

• You can define a data location that you will be modifying in several cases with a name that you can reference with another command, for example:

DEFINE spot[19.2]
MA spot

• You can automate repetitive commands by writing a small procedure and then executing it with an exclamation point, for example:

DEFINE PROCEDURE= {chkit: IF [3^1]; DA 19.2; MA 08; ENDIF}


• If you have a long or complicated command that you want to edit, you can use the >EDIT_PREVIOUS command. This will bring up the previous command in the on-line editor. Correct the command, and when you press ESC, you will exit the editor, return to the CLEANER block, and the command will be re-executed on the current case.

• You can combine commands into a file and instruct Mentor to read the file, for example:

1 Create the file CLNIT.SPX with the following lines:


>USE_DB procs


FILE sampl

FIND showerr

2 Execute Mentor, using an ampersand to tell Mentor to expect input from the keyboard once the spec file has been read, and “echo” to send all program messages to the screen:

MENTOR &clnit.spx  clnit.lfl,echo (DOS/UNIX)



Below is a sample of a clean.log file. In this example, the user opens the Roadrunner data file and changes the first four columns of the first case from “0001” to “abcd”. What the user typed is included in the log file; those lines start with “con:” for console.


con: file rrunr

con: ma 1.4

ID: 0001 (study code=RRUN, int_id=intv):1.4


Display 1/1.4:






con: abcd

con: da 1.4

ID: 0001 (study code=RRUN, int_id=intv):1.4


Display 1/1.4:






con: ~end


Enter (Y)es or (N)o please-->


con: y


Using Cleaning Screens

Mentor can also display a cleaning screen when it finds an error, and you can modify data from this screen. You can set up a controlled cleaning run for someone else to execute. Creating screens for others to use has several advantages:

  • they will only be able to modify the data columns you specify.
  • they don’t need to know data modification commands.
  • error checking is built in.

Referring to our example cleaning specifications in section 2.3, you would substitute the EDIT command for CHECK. Mentor will first list an error, and then present a cleaning screen.

EX: EDIT [18^1/2] "Should be single 1,2"

You can add explanatory text to the data description to display on the cleaning screen.

EX: EDIT [$T="Has Siblings"18^Yes:1/No:2] "Should be single 1,2"

Mentor displays a screen similar to this:

Has Siblings
001 Yes
002 No
--> enter the new values or RES/BLK/TERM ([ 1/18.1^B ])

Mentor automatically generates a three digit zero (0) filled number for each category defined in the variable when it is displayed in a question structure (similar to Survent) as above.

Refer to 2.4.1 Correcting Errors for an explanation of this screen and a sample specification file.

There are limitations to using this method. You could write a valid CHECK statement that could not then be used with the EDIT command. The variable you define must represent a single question in order for Mentor to present it with a cleaning screen. For instance the statement CHECK [10,…,15^1//5] will check columns 10, 11, 12, 13, 14, and 15 separately for a one through five punch. In effect this is six separate questions. In this case you could not substitute EDIT for CHECK. You would need to write separate EDIT statements for each of the six columns.


Auto-fixing The Data

Auto-fixing is essentially a set of rules for cleaning the data without examining each case to determine why the error occurred. An example might be an unverified data file where the data is off by a column due to a keypunch error. Cleaning with this method means your data set is not as reliable, but the cleaning process is less labor intensive, especially as the number of cases increases.

We recommend that you always save the auto-fixed data to a new file, either using a copy of your original System file or by writing the modified cases to a new file. Make periodic copies of the data file especially if you find yourself modifying the procedure in subsequent cleaning runs. In this way, you would only have to go back one or two copies to undo the last set of changes made to the data file.

Another option is to create a new file for the “clean” data. ~OUTPUT creates the new data file and WRITE_CASE writes the case to the new file. Use WRITE_CASE as the last command in your cleaning procedure.
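A minimal sketch of how these commands fit together (the procedure name and the single CLEAN statement are illustrative; SAMPL2 is the clean-file name used elsewhere in this chapter):

```
~OUTPUT SAMPL2
~CLEANER
~DEFINE PROCEDURE={CLEANIT:
CLEAN [18^1/2] "Should be single 1,2"
WRITE_CASE }
```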

What follows is an example of auto-fixing using the sample paper and pencil questionnaire from section 2.2.1. This cleaning procedure uses a new ~CLEANER command, CLEAN. USE THIS COMMAND WITH CAUTION; IT BLANKS THE DATA LOCATION WHEN IT DOES NOT MATCH YOUR DATA DESCRIPTION. Like the CHECK command, you can define your own error message in addition to the one Mentor prints for each case it changes. This message is especially useful in the error summary by telling you how many times a variable was changed to a blank.

An explanation of how each command in the procedure will affect the data follows this example.




CLEAN [15.3#MON/TUE/WED/THU/FRI/SAT/SUN] "Not a valid code"

CLEAN [18^1/2] "Should be single 1,2"

IF [19.2#10//99] THEN
TRANSFER [19.2] = 10
ENDIF

CLEAN [19.2*Z#1//10/" "] "Should be a number 1-10 or blank"

IF [19.2^B] THEN
MAKE_DATA [18^2]
ELSE
MAKE_DATA [18^1]
ENDIF

IF [21^8] AND [21*P^1//7] THEN
MAKE_DATA -[21^8]
ENDIF

CLEAN [21*P^1//7/(-)8] "Should be multi 1-7 or single 8"

BLANK [22-80]

WRITE_CASE

CLEAN [15.3#MON/TUE/WED/THU/FRI/SAT/SUN] “Not a valid code”

Blanks the data associated with the day of the week question when it does not match one of these responses.

CLEAN [18^1/2] “Should be single response 1,2”

Blanks the data in the columns of the has siblings question if the response is not a single response of one or two.

IF [19.2#10//99] THEN TRANSFER [19.2] = 10 ENDIF

Says if the answer to the number of siblings question is more than 10, recode it to be 10.

The TRANSFER command can modify numeric, string ($), or punch ($P) data (though punch data is more often modified with the MAKE_DATA command). In this example we have a very simple modification. You can also modify data based on the result of an expression, such as multiplying one location by another.

TRANSFER also verifies the data on both sides of the equal sign (=) and will return an error if the data types do not match. A dollar sign ($) is required to modify string data such as the codes for day of the week in our example procedure. Refer to the related command, MODIFY, under ~CLEANER in Appendix B: TILDE COMMANDS for more information and examples.
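For instance, an expression-based TRANSFER might look like this (the column locations in the first statement are hypothetical; the second is the string form used in our example procedure):

```
TRANSFER [30.3] = [25.2] * [28.1]
TRANSFER [15.3$] = "THU"
```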

CLEAN [19.2*Z#1//10/” “] “Should be a number 1-10 or blank”

Blanks the data for the number of siblings question if it is not a valid response 1-10 or blank. Given the skip pattern in this questionnaire we know it is possible for this field to be blank. The empty quotes ” ” response in the data definition [19.2#1//10/” “] instructs Mentor that blank is also a valid response.


IF [19.2^B] THEN MAKE_DATA [18^2] ELSE MAKE_DATA [18^1] ENDIF

Says if the number of siblings question is blank, generate a response of two (No) for the has siblings question, and if it is not blank generate a response of one (Yes).

The MAKE_DATA command replaces punch type data; it blanks the location first. It is the recommended way to simply add or remove punches from a data location. The columns to be modified are defined like any other punch variable by specifying the location and the new punch value inside square brackets ([ ]).

IF [21^8] AND [21*P^1//7] THEN MAKE_DATA -[21^8] ENDIF

Says if the other family members question has a none response (code eight) with any of the other valid responses (codes one through seven), then remove the eight response.

In this example, we have used one of two modifiers available on the MAKE_DATA command. These are useful when you do not want to blank the data location. Minus (-) removes a punch. In this procedure we only want to remove the eight code, but leave any others intact. A plus sign (+) would add a punch to the location.
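For instance, to add a punch instead (the punch value three here is illustrative):

```
MAKE_DATA +[21^3]
```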

CLEAN [21*P^1//7/(-)8] “Should be multi response 1-7 or single 8”

Blanks the other family members question if it is not a valid code of 1-7 (any combination) or exclusive code 8.

BLANK [22-80]
Blanks the remaining columns in the case. The BLANK command unconditionally blanks the field specified.

WRITE_CASE

Writes the case to the output file SAMPL2.

Auto-fixing Case By Case

You could correct cleaning errors by writing an instruction to modify the data on a case by case basis correcting those columns that are in error. In the following example, we have written a procedure to correct only the cases with reported errors. The case to correct is identified by its ID number located in columns one through four.




IF [1.4#0001] THEN
TRANSFER [19.2] = 1
ENDIF

IF [1.4#0002] THEN
MAKE_DATA [18^2]
BLANK [19.2]
ENDIF

IF [1.4#0003] THEN
MAKE_DATA [18^1]
ENDIF

IF [1.4#0005] THEN
TRANSFER [15.3$] = "THU"
ENDIF

IF [1.4#0006] THEN
MAKE_DATA [18^2]
ENDIF

IF [1.4#0007] THEN
TRANSFER [19.2] = 5
ENDIF

IF [1.4#0008] THEN
MAKE_DATA [21^1/2]
ENDIF

IF [1.4#0009] THEN
MAKE_DATA [21^8]
ENDIF

IF [1.4#0010] THEN
TRANSFER [15.3$] = "WED"
ENDIF








2.3.3 Subsequent Cleaning Runs

Once the errors have been corrected in the data, the cleaning procedure that produced the original error listing should be rerun to verify that no errors remain. In the following example, we will use the clean data file, SAMPL2, as the input file. If additional errors are found, SAMPL2 can be edited either interactively on a case-by-case basis, or by modifying the auto-fixing procedure and rerunning it.







CHECK [15.3#MON/TUE/WED/THU/FRI/SAT/SUN] “Not a valid code”

CHECK [18^1/2] “Should be single 1,2”

IF [18^1] THEN
CHECK [19.2*Z#1//10] "Should be a number 1-10"
ELSE
CHECK [19.2^B] "Should be blank"
ENDIF

CHECK [21*P^1//7/(-)8] “Should be multi 1-7, or single 8”







If you saved your procedure in a DB file then it is not necessary to redefine it in subsequent runs. Open the DB file with the >USE_DB command. Delete the lines that define the procedure or comment them out by inserting the command ~COMMENT right after ~DEFINE. Specifications falling between a ~COMMENT and the next tilde command will not be processed. See 2.3 CLEANING SPECIFICATIONS, the >CREATE_DB command.
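A sketch of how the top of such a run might look (the DB file name and procedure name are assumptions for illustration):

```
>USE_DB CLEANSPEC
~DEFINE
~COMMENT
(procedure definition lines, now skipped)
~EXECUTE PROCEDURE=CLEANIT
```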


The Survent interviewing program has been designed with features that make data errors highly unlikely, but:

  • The questionnaire could be changed after the study begins. Data collected prior to the change could be invalid.
  • Errors could be made when open-ends are recoded.
  • Someone reviewing the data in the CLEANER utility or in ~CLEANER could make changes to the wrong columns.
  • Interviewer changes (done when a respondent changes a previous response using the ALTER keyword in Survent VIEW mode) could affect skip patterns in the questionnaire.

This section will explain how to use one of Survent’s automatic spec generation options to create basic cleaning specifications. Survent specifications compiled with the ~PREPARE COMPILE CLEANING_SPECS option produce a file of cleaning statements for each CAT, FLD, NUM, VAR, and TEX question. The resulting CLN file can be incorporated into a cleaning procedure to produce an error listing. The same procedure can then be used to clean the data case-by-case.

The Survent program also includes an option to view and/or alter data on a case-by-case basis. Refer to your Survent manual under 4.1.4 VIEWING A PREVIOUS INTERVIEW for more information.

If you know PREPARE syntax you could reproduce your paper and pencil questionnaire, and then generate data cleaning and/or tabulation specs with one of the compile options. See 4.7 USING PREPARE TO GENERATE Mentor SPECIFICATION FILES for details on table building spec files.

You can produce a CLN file (and any other auxiliary file) at any time from a compiled questionnaire file (QFF).

EX:   ~QFF_FILE <studyname>
      ~PREPARE COMPILE CLEANING_SPECS


See Appendix B: TILDE COMMANDS, ~PREPARE for more information. Each cleaning statement checks data validity against the PREPARE question specification (i.e., single/multi punched, exclusive response, range of responses, skip patterns, etc.). When these statements are executed interactively, a cleaning screen similar to what the interviewer saw is presented each time an error is found. You do not need to know any specific data modification commands to clean the data. New data is automatically checked before the procedure continues to the next error. It is possible to suppress the cleaning screens, but then you will need to know basic data modification commands.


When you compile Survent specifications in PREPARE, each question can be saved as a Mentor variable. These variables are stored in a CfMC DB file under the Survent label name. Just as your data file contains punches or characters in a certain column order, the DB file variables contain information that describes the contents of these columns. Each one includes the question description (text, question label, question type, column location, who should have answered this question) and answer descriptions (valid punches, ASCII responses, numbers, exception codes). The program-generated CLN file references these variables to check your data for errors.

Here is our sample paper and pencil questionnaire rewritten as PREPARE specifications. We have added a TEX type question to the original example in section 2.2.1.



{NAME: 5.10


!VAR,,10,1 }


{DAY: 15.3














2 NO }


{NUMSIBS: 19.2



!NUM,,,1-10 }


{OTHERS: 21.1












{DEMOG: 22.1

INTERVIEWER: Please gather the following information from the respondent and type the answers here.


!TEX }



The cleaning specifications below were produced by compiling the sample questionnaire above with ~PREPARE COMPILE CLEANING_SPECS. The resulting file is called SAMPL.CLN. We will reference it in our procedure along with the DB file also created when the sample questionnaire was compiled. SAMPL.DB contains all the information about the variables NAME, DAY, SIBLINGS, NUMSIBS, OTHERS, and DEMOG necessary for cleaning.

Responses to TEX type questions such as DEMOG are usually listed out with the CfMC LIST utility. You would then work from the LIST report to edit TEX responses. Refer to 2.4.1 Correcting Errors, Modifying TEX Question Responses for commands to display and edit this data outside of a cleaning procedure.

NOTE: We are referencing file names using the DOS/UNIX naming convention studyname.extension. In MPE, the convention is studynameextension (e.g., SAMPLCLN). In addition, MPE has an eight character file name limit. To be consistent across platforms we are using the study code of SAMPL.


















The CLN file consists of a cleaning statement for each CAT, FLD, NUM, TEX, and VAR in our sample questionnaire. The EDIT command verifies that the data matches the question definition, generates an error message when it does not, and then presents a Survent-like cleaning screen. The EDIT command performs the same function as the CHECK command (see the sample cleaning procedure in 2.3 CLEANING SPECIFICATIONS). The CHECK command does not display a cleaning screen when errors are found. This is the main difference between the two commands.

EDIT only presents a cleaning screen when an error is found. The ALTER command presents the cleaning screen unconditionally. You might want to use this command for TEX questions.


You have the option to produce a CLN file with CHECK commands and data variables when your PREPARE specifications are compiled. Use the command ~SPEC_RULES CLN_CHECK. Specify this command before ~PREPARE COMPILE CLEANING_SPECS. For example, instead of EDIT OTHERS, the cleaning statement would be CHECK [1/21*P=7^1//7/(-)8]. This would be useful if you want to edit the CLN file later on to add cleaning statements, and you want the references to be to the data locations, independent of the Survent variables which may themselves contain certain logic errors.


Include your own cleaning specifications in your questionnaire specification file with the compiler command {!Mentor_CLN}. When Survent specifications are compiled with the ~PREPARE COMPILE CLEANING_SPECS option, statements specified inside this command will be passed to the CLN file. There is no syntax checking by PREPARE. This option allows an experienced spec writer to include other cleaning commands or complex cleaning instructions in the questionnaire specifications. For example, you might need to check that the responses to three questions add up to 100%. Here is what the cleaning spec might look like.


EX:   { !Mentor_CLN
IF SUM([15.2,17,19]) <> 100
ERROR "Responses do not add up to 100%"
ENDIF }




Cleaning statements are generated for every condition and branching statement in the questionnaire (!IF, SKIPTO, or !GOTO). The labels you see before each EDIT statement (e.g., NAME:) are generated by Mentor for every question. These serve as possible markers for branching caused by a SKIPTO or a !GOTO in the Survent questionnaire. They have no effect on a cleaning run unless there is a preceding GOTO statement that branches to that marker.

A PREPARE !IF statement is converted in the CLN file to a Mentor IF-THEN-ELSE-ENDIF block. The structure closely resembles the PREPARE syntax except that IF statements in Mentor must be closed with an ENDIF statement. For example, the NUMSIBS question was only asked if the response to SIBLINGS was a one (Yes). The cleaning instruction below says to check NUMSIBS for valid responses only if the data for SIBLINGS contains a one response. (THEN is implied and never needs to be specifically stated.) The outside parentheses on Mentor IF statements are added by Mentor for clarity. They are not required for a single condition.
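A sketch of such a generated block, with the condition written in data-location form (the program-generated file may reference the SIBLINGS variable instead):

```
NUMSIBS: IF ([18^1])
EDIT NUMSIBS
ENDIF
```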




Branching in a Survent questionnaire is accomplished with either a SKIPTO or a !GOTO statement. Each one generates a GOTO statement in the CLN file.

Here is the original PREPARE specification rewritten with a SKIPTO instruction on the NO response to SIBLINGS instead of the !IF statement on NUMSIBS.


EX:   {SIBLINGS: 18.1






SKIPTO OTHERS generates a GOTO cleaning command. GOTO tells Mentor to move forward to the place marked in the cleaning procedure and execute the command(s) specified there. A GOTO can conditionally move to different points in the procedure depending on the responses found in the data.










First SIBLINGS is checked for a valid response with the EDIT command. Then based on the response found in the data, the GOTO will execute in one of two ways. A Yes response means Mentor will continue processing, indicated by the asterisk (*). A No response will cause Mentor to go to the label OTHERS and continue processing from there.

In the next example, the same branching is accomplished with a !GOTO statement. While it is unlikely that you would write this with a !GOTO, it illustrates what the cleaning specification would look like.







2 NO }





The cleaning procedure will branch to the label OTHERS when the response to SIBLINGS is two.














There are three optional modifiers that allow you to control how the data is checked by the EDIT command. These modifiers are also available to the CHECK and CLEAN commands. You can edit the program-generated CLN file to add any of these before the variable name.

The modifier is optional, but if you include it:

  • a minus sign (-) allows a blank in addition to other valid answers.
  • an asterisk (*) indicates that the field must be blank.
  • a plus sign (+) means anything in addition to the valid punches is also okay.

The description outlines what the command will be looking for.


Variable Type     Modifier     Description

VAR               (none)       Valid ASCII of at least the minimum length
                  –            Valid answer or blank field
                  *            Must be blank

FLD               (none)       Location(s) has one or more of the codes on the list
                               (checking for an exclusive code and maximum responses)
                  –            Valid answer or blank field
                  *            Must be blank

NUM               (none)       Number within the range, or the exception number, or one
                               of the exception codes
                  –            Valid answer or blank field
                  *            Must be blank

CAT               (none)       Has valid codes (checking for exclusive item and maximum
                               responses) and no other punches in the location
                  –            Valid answer or blank field
                  *            Must be blank
                  +            Anything in addition is okay

TEX               (none)       Internal text pointers are okay
                  *            Must be blank (i.e., not asked)
                  –            Valid answer (good text pointer) or blank


As an example we have rewritten the statements that check the NUMSIBS question and the DEMOG question.
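The rewritten statements might look like this (the IF condition is shown in data-location form; the exact syntax in your CLN file may differ):

```
IF ([18^2])
EDIT *NUMSIBS
ENDIF

EDIT -DEMOG
```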











For NUMSIBS the IF condition was changed to say if the response to the SIBLINGS question was No (2) then the NUMSIBS question must be blank. The asterisk (*) preceding the question label NUMSIBS means the data for this question must be blank. For DEMOG we added the minus sign (-) modifier to allow this question to be blank in addition to a valid response.


Now that we have a CLN file we need to incorporate it into a cleaning procedure and then into a larger specification file to generate an error listing.

The commands used in the example below were covered in some detail in section 2.3 CLEANING SPECIFICATIONS. We will only cover commands or concepts not previously discussed. Please refer back to that section for more information.










The caret (^) in &SAMPL^CLN allows any CfMC-supported operating system to read this file, i.e., as SAMPL.CLN (by DOS or UNIX) or SAMPLCLN (by MPE). Notice that &filename has been indented along with the procedure commands. Normal CfMC processing requires that the ampersand (&) be in column one of the spec file.

The cleaning procedure looks very similar to the example in section 2.3. But instead of writing a CHECK statement and data description for each variable, we read in the program-generated CLN file. The ampersand (&) preceding the file name tells Mentor to read the file referenced from the current directory or group unless otherwise specified. This procedure also writes out each case to a new data file ensuring that the original data file remains intact.
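Put together, the procedure might look like this (a sketch; the procedure name CLEANIT matches the ~EXECUTE example later in this chapter):

```
~DEFINE PROCEDURE={CLEANIT:
   &SAMPL^CLN
   WRITE_CASE }
```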

Here is the above cleaning procedure incorporated into the complete specification file. The DB file must be opened first since the variables referenced in the CLN file are stored here. ~SET ERROR_REVIEW suppresses the cleaning screens displayed by default when an error is found. See 2.3.2 CORRECTING ERRORS, Auto-Fixing The Data for an explanation of ~OUTPUT and WRITE_CASE.
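A sketch of the complete specification file under those rules (the input file name SAMPL is assumed from the study code; the procedure body is the one described above):

```
>USE_DB SAMPL
~SET ERROR_REVIEW
~INPUT SAMPL
~OUTPUT SAMPL2
~CLEANER
~DEFINE PROCEDURE={CLEANIT:
   &SAMPL^CLN
   WRITE_CASE }
~EXECUTE PROCEDURE=CLEANIT
~END
```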



















If the above file is named CLEAN.SPX you would run it by entering:

EX:   Mentor CLEAN.SPX CLEAN.LFL



The list of errors would go to the output file, CLEAN.LFL. Below is an abbreviated error listing that could be generated by this specification file. See 2.3 CLEANING SPECIFICATIONS, Generating a List of Error Messages for more information.


ID: 0001

error 3:  NUMSIBS [19.2#] has error: number in range is required



ID: 0002

error 2:  SIBLINGS [18^] has error: a valid answer is required

SIBLINGS[18^]=” ”


ID: 0003

error 2:  SIBLINGS [18^] has error: extra punches


ID: 0004

error 1:  NAME [5.10$] has error: string too short

NAME[5.10$]=”  ”

ID: 0005

error 5:  DAY [15.3#] has error: a valid answer is required

DAY[15.3#]=” ”


5 errors in 5 cases


error 1:   1 NAME [5.10$] has error

error 2:   2 SIBLINGS [18^] has error

error 3:   1 NUMSIBS [19.2#] has error

error 5:   1 DAY [15.3#] has error



Here are the automatic error messages generated by Mentor for different types of errors found.


Question Type Errors                              Error Message

CAT or FLD question is blank when an              a valid answer is required
answer should be present.

Single response CAT question has an               too many answers
invalid punch with a valid punch.

Single response CAT question has more             extra punches
than one punch per column.

Single response FLD question has an               a valid answer is required
invalid response.

Multi-response FLD question has an                invalid codes or blank field
invalid response, or leading or
embedded blanks.

Multi-response FLD contains duplicate             duplicate codes
codes.

Multi-response CAT or FLD question has            exclusive code violation
an exclusive response with another
punch or response.

VAR question does not contain the                 string too short
minimum number of typed characters.

NUM question is blank when an answer              an answer is required
should be present.

NUM question has an invalid answer                a valid answer is required
(out of range or invalid exception
code).

TEX question has a bad internal                   an answer required
pointer or is blank when an answer is
required.

TEX question is not blank (skipped)               must be blank
and it should be.


You can specify your own error message inside “quotes” after the EDIT command.

EX:   EDIT SIBLINGS “Should be single punched 1 or 2”

2.4.1 Correcting Errors

You have several choices for cleaning your data once you have an error listing.

• Clean each case with one of the data modification commands described in 2.5.1 Quick Reference: Cleaning Commands And Examples.

• Clean using the procedure we wrote to list out the errors. There are three advantages to this approach: you do not need to know data modification commands; EDIT only allows access to columns with errors; changes to the data are checked for errors before you move to the next case.

• Clean the data with an auto-fixing procedure.


The EDIT command in the program-generated CLN file presents a Survent-like screen for data cleaning. It is similar to the screen the interviewer would see during actual interviewing with the C-Survent software. The top of the screen displays the text of the question including any recode table. The bottom of the screen displays the question name, the data location and the current data. If you enter an invalid response an appropriate error message displays, just as it would for an interviewer.

Here is a sample spec file called SCRNEDIT.SPX. This file contains all the commands needed to clean our sample data file using the Survent screens.

















The ~SET ERROR_REVIEW statement was omitted. For interactive cleaning we do not want to suppress the Survent cleaning screens. The data file we wrote out with the error listing procedure is opened in ALLOW_UPDATE mode to save all of our changes.

The CHECK_COLUMNS command does not present a cleaning screen so it has been omitted from this procedure. The columns reported by CHECK_COLUMNS in the error listing should be examined interactively, and either modified or blanked using the BLANK command.

~EXECUTE PROCEDURE=CLEANIT executes the procedure starting at the beginning of the data file. The error lists to the screen and then the cleaning screen for that question is presented. New responses are checked against the question’s parameters just as they would be during an actual interview. A Survent error message displays when an invalid response is entered. The error summary is displayed when Mentor reaches the end of the data file (refer to the sample error list and summary earlier in this section).

There is no ~END command in this file, but you could put one in; otherwise you will enter it from the keyboard when you want to exit Mentor.

The command line to run this spec file would look like this:




In this sample, the first case has an error in the NUMSIBS. The procedure will stop there and display this error message:


ID: 0001


error 3: NUMSIBS [19.2#] has error: number in range is required



ID: 0001

…and now you can modify NUMSIBS[19.2#]=”12″

Press <return> to continue


This shows that case ID 0001 has an answer (12) in columns 19 and 20 which is outside of the allowed numeric range of 1-10 for this question. When you press <Return> (or Enter) a screen similar to the one below will display.



enter the new values or RES/BLK/TERM (NUMSIBS[ 1/19.2=12 ])

RES/BLK/TERM are commands you can enter at the prompt instead of new data.

RES Restores the original data and prompts the user to continue or to begin entering commands from the console. Original refers to the state of the current case when the current command accessed it. If you are using a procedure as we are here then RES will cause the procedure to go on to the next error.

BLK Blanks the field and continues to the next error.

TERM Terminates the cleaning procedure, leaving you at the CLeaNer–> prompt. Type some other command or ~END to exit Mentor.

We will correct the data by entering a new value of 10 at the arrow prompt. The new value is automatically checked against the question’s structure. If it is invalid, an error message will print. You will not be allowed off the screen until you enter either a valid value or one of the allowed commands.

When Mentor reaches the end of the data file you will see a message similar to this:

found the EOF_DATA without finding CLEANIT


Enter ~END to exit or another command.


As we explained earlier, you can use the EDIT cleaning command to check TEX questions, but a Survent screen is only presented when an error is found. It is more likely that you will be correcting typographical errors from the report of verbatims generated by the LIST utility. In that case, you will want to clean TEX responses separately from your regular cleaning run.

You must indicate the start location of the text data in the data file. By default, it is the first column of the next record after the last data column used, or the column specified on the TEXT_START= ~PREPARE header statement option. There are two ways to do this:

  1. Open the QFF file (~QFF_FILE name) before opening the data file. Mentor will determine the text location from the compiled questionnaire file.
  2. Use the TEXT_LOCATION= option on either the ~INPUT or the ~CLEANER FILE statement.

Here are the TEX question modifying command options.

MODIFY_TEXT <location> or <varname> Displays the current response and lets you enter a new one. Specify either the location of the TEX question or the variable name (you must open the DB file first with >USE_DB).

ERASE_TEXT <location> or <varname> Erases the response and the internal text pointers for this question. Use this command instead of BLANK for TEX questions.

ALTER varname Presents the question in a Survent cleaning screen unconditionally. Unlike EDIT, the screen is presented whether or not an error is found.
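For example, using the DEMOG question from the sample questionnaire (a sketch that assumes the SAMPL DB file is open so the question label can be used):

```
MODIFY_TEXT DEMOG
ERASE_TEXT DEMOG
ALTER DEMOG
```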


Refer to 2.3.2 Correcting Errors, Auto-Fixing The Data for an explanation of auto-fixing and the commands used in this procedure. The following is an example of cleaning the data in batch mode.
































The CLEAN command tells Mentor to examine each case using the data descriptions generated from the PREPARE questionnaire specifications, and to blank those data columns whenever they deviate from that description. Where needed, specific conditions are given for changing the data.

The CLEAN command can have an error message associated with it. This message is helpful in the error summary. It provides a tally of how many times this error was found causing the data to be blanked.
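Using the variables from the SAMPL DB file, the core of such a batch procedure might look like this (a sketch; the error messages are illustrative, echoing those in section 2.3):

```
CLEAN DAY "Not a valid code"
CLEAN SIBLINGS "Should be single 1,2"
CLEAN NUMSIBS "Should be a number 1-10 or blank"
CLEAN OTHERS "Should be multi 1-7 or single 8"
```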

You could correct cleaning errors by writing an instruction to modify the data on a case by case basis. This procedure corrects the NUMSIBS and DAY variables using the TRANSFER command, and the SIBLINGS and OTHERS variables using the MAKE_DATA command. If a column should be blank and it is not, it is corrected with the BLANK command.







IF [1.4#0001] THEN




IF [1.4#0002] THEN




IF [1.4#0003] THEN



IF [1.4#0005] THEN



IF [1.4#0006] THEN



IF [1.4#0007] THEN



IF [1.4#0008] THEN



IF [1.4#0009] THEN



IF [1.4#0010] THEN









2.4.2 Subsequent Cleaning Runs

Once the errors have been corrected in the data, the cleaning procedure should be rerun to ensure that no errors remain. In the following example, we will use the corrected file, SAMPL2, as the input file. If additional errors are found, SAMPL2 can be corrected using one of the methods described in the previous section.



















2.5.1 Quick Reference: Cleaning Commands And Examples

This section is a list of the more commonly used cleaning commands and a short explanation of each of them. With these commands, you can access specific cases in your data file and correct them.

NOTE: Before starting the cleaning process, make a backup copy of your data file. You always want to have a copy of your “untouched” data.

Start Mentor by entering:

CLEANIT

This utility starts Mentor, starts the ~CLEANER block, and records your commands in a log file named clean.log. If you do not want to append to an existing log file, rename clean.log before starting CLEANIT again.

Or, you can start Mentor by entering:

Mentor CON CON                        (DOS/UNIX)


CON is short for console; this tells Mentor to expect input from the keyboard and to send messages out to the screen. You could specify a file name to have Mentor keep a record of the messages in a List file, for example:

Mentor CON SAVEIT.LFL,ECHO                    (DOS/UNIX)


ECHO tells Mentor to send messages to the screen as well as the List file. See Utilities, Appendix D: CfMC Conventions for more information on List files.

After Mentor starts, you will be prompted to enter a command. Type ~CLEANER to start the cleaner block. The prompt will change to “CLeaNer–>” and you can then start cleaning your data:

(Enter command)–>~cleaner


A list of common CLEANER commands and short description of them follows. For a complete list of ~CLEANER commands and their syntax, see Mentor, Appendix B: Tilde Commands.


Action          Command                    Description (command abbreviation)


Help            >HELP                      Display on-line help for ~CLEANER.


Open file       FILE <filename> options   Opens a CfMC System file for modification, with other
                                          options if needed. Changes are written to the file when
                                          you move to the next case. Related commands: ~INPUT,
                                          ~OUTPUT, ~CLEANER



Get Variables   USE_DB <studyname>         Opens the DB file containing variables from a Survent study.



Get a case      NEXT “<caseid>”/FIRST/LAST  Gets case with ID of <caseid>. Keywords FIRST and LAST will

get first case or last case,  respectively. (N)


Move            NEXT #,+#                  Goes to record number ‘#’, or forward in the file (+)
                                           the number of cases specified. (N)
                                           Maximum -# is 9 cases.


Set command     ~SET SEQUENTIAL READ       Tells Mentor to deal with cases in case order.


Find case       BACKUP #                   Goes backwards the number of cases specified.


FIND <variable description>  Goes to the next case or procedure name that satisfies either

the variable description or procedure. The FIND command by

itself re-executes the last description specified. (F)


FIND_FLAGGED                Finds the next case with the error flag set on dirty cases

that were written (with a cleaning procedure) to a ~OUTPUT

file, to the ~INPUT file opened in ALLOW_UPDATE mode, or

opened with the FILE command. (FF)


HUNT <variable description>  Like FIND, but starts at the top of the data file. (H)



Use FIND or HUNT to locate the case that contains data defined in the variable.


EX:  DEFINE x[1.4#0023]


This would find the case that had 0023 in columns one through four.
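Having defined x, you would then locate the case by entering (a sketch):

```
FIND x
```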


Action          Command                         Description (command abbreviation)

Display         DA <col.wid>,<variable> or *    Displays data as ASCII text by specifying the data
                                                columns, the variable name, or * for the entire
                                                case.

                DB/DC <col.wid> or <variable>   Displays data in binary or column (punch) mode as
                                                described for the DA command above, but does not
                                                accept *.

                DT <col>,<text variable> or *   Displays text question data (collected in Survent
                                                with a TEX type question) by referencing the text
                                                pointer column, the Survent question label, or * for
                                                all text questions.

                SHOW <varname>, *, *B           Displays the entire question, including text and
                                                response list. * displays the entire case in card
                                                image. *B displays the entire case in column binary
                                                (punch) format.

Refer to your UTILITIES manual under //APPENDIX C: GLOSSARY OF CFMC TERMS// for definitions of ASCII,
binary, text variables, or other unfamiliar terms.
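As a quick sketch of the display commands (the column references are hypothetical, and the variable name SIBLINGS is borrowed from the error-listing example later in this section):

EX:  DA 5.10
     DC 18.1
     SHOW SIBLINGS

DA 5.10 displays columns 5 through 14 as ASCII text, DC 18.1 displays column 18 in punch format, and SHOW SIBLINGS displays the full question text and response list for that variable.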



Action          Command                         Description (command abbreviation)

Modify          BLANK <col.wid> or <variable>   Blanks the data columns. (B)

                ERASE_TEXT <col.wid>            Erases the text area and the internal text pointers
                or <variable>                   for TEX type questions collected in Survent.
                                                (ERASETEX)

                MA <col.wid> or <variable>      Modifies data in ASCII format.

                MB/MC <col.wid> or <variable>   Modifies data in binary or column (punch) format.

                MT <col> or <text variable>     Modifies data from a text question (collected in
                                                Survent with a TEX type question) with the program's
                                                editor. You must specify the TEXT_LOCATION= option
                                                on either the FILE or ~INPUT command.

MA, MB, MC, and MT are interactive data modification commands only.

                MAKE_DATA +/-[loc^punch]        Modifies punch data. (MKDATA)
                or <variable>                   EX: MKDATA +[18^1]

                RESTORE                         Restores the data in this case (i.e., drops current
                                                changes), but only if you have NOT moved to another
                                                case.

                TRANSFER                        Modifies data. (T)

                numeric data:          T [19.2] = 5
                string/ASCII data:     T [15.3$] = "THU"
                punch data:            T [11.2$P] = "1,2,3"
                sum the values:        T [20.3] = SUM([5,6,7])
                add numeric data:      T [10.2] = [5] + [20]

Action          Command                         Description (command abbreviation)

Delete a case   ASSIGN_DELETE_FLAG              Flags this case for deletion. (DELETE)

                UNDELETE                        Removes the delete case flag.

Redo            >EDIT_PREVIOUS                  Displays the last line typed at the program prompt
                                                in the program's editor for modification. Press
                                                ESC to exit and re-execute the command on the
                                                current case. (EP)

                FIND_FLAG_REDO                  Finds the next case with an error flag and
                                                re-executes the last command line. (FFR)

                FIND_REDO                       Finds the next case that satisfies the variable
                                                description and re-executes the last command
                                                line. (FR)

                NEXT_REDO                       Goes to the next case and re-executes the last
                                                command line. (NR)

                REDO                            Re-executes the last command line. (R)

Cases flagged for deletion are not read (i.e., they are ignored) by a procedure that modifies data,
by WRITE_CASE, ~COPY, ~INPUT, etc., unless you instruct Mentor to use deleted cases (USE_DELETED).

You can specify more than one command on the same line if you separate them with semicolons.
Pressing Enter executes all of the commands specified on that line. >EDIT_PREVIOUS displays the
entire line in the editor for modification, and any of the redo commands re-executes all of the
commands on the line.

EX: DA 5.3;DA 10.3;DA 20.4
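The redo commands pair naturally with multi-command lines. As a sketch (the column reference is hypothetical):

EX:  FIND [18^B];DC 18
     FR

The first line finds the next case with column 18 blank and displays that column in punch format; FR (FIND_REDO) then repeats both the search and the display on the next matching case.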
Action          Command                         Description (command abbreviation)

Define          DEFINE name[data description]   Defines a data variable and assigns a name for
                                                future reference with display, find, or modify
                                                commands.

You can attach a name directly to any DISPLAY, FIND, or MODIFY command and then refer to the name
the next time you use the command.

EX:  DA x[15.3]
     DA x

Log             SET LOGGING                     Records interaction with Mentor, either from the
                                                keyboard or from a &file, and Mentor's response to
                                                each action. The log goes to the list file,
                                                providing a record of all data changes. If the ECHO
                                                option is specified after the list file name,
                                                program messages also display on the screen.

Get info        >STATUS INPUT                   Displays information on the data file and the
                                                current case.

NOTE: Refer to your Utilities manual Appendix A: META COMMANDS for other >STATUS options.

Exit            ~END                            Quits the program and saves data changes.

Move to new     ~command                        Moves to another ~tilde block.
tilde block

CHECK [11*P=1^1//5/Y] (*P=1 is the default and does not need to be specified)
CHECK [11^1//5/Y]
CHECK [12*P^1//5/(-)Y] or CHECK [12^1-5/Y]
CHECK [12*P=3^1//5/(-)Y] (*P=3 says this may have up to three valid responses)
CHECK [13.2^1//20/24] (*P=1 is implied)
CHECK [15.2*P^1//20/(-)24] or CHECK [15.2^1-20/24]
CHECK [15.2*P=10^1//20/(-)24]
CHECK [17.2*Z#1//10/99] (*Z means the field must be zero-filled)

CHECK [5.10*P=5$] *P=n specifies the ending column to check (starting from the right-most column in the field) for a non-blank character. This can be a number 1-127. For a Survent VAR question, this is the minimum number of typed (i.e., non-blank, including spaces) characters required for that question.

In this example, Mentor checks the field from column 14 back through column 9 for a non-blank character, so at least five characters (columns 5 through 9) are required; columns 10 through 14 could be blank. If you needed to check for a character anywhere in the field, you would specify *P=1.

CHECK [19.2,…,25*ZF#1//45/(-)99]   Duplicates or skipped blank fields are an error.
CHECK [19.2,…,25*ZL#1//45/(-)99]   Duplicates are allowed, but skipped blank fields are an error.

Multiple Location Variables Without *F or *L

SINGLE PUNCH/SINGLE COLUMN
CHECK [41,…,47,51,52^1//5/Y] (*P=1 is implied)
The example above is shorthand for this longer syntax:

>REPEAT $A=41,…,47,51,52
CHECK [$A^1//5/Y]
CHECK [53,56*P^1//5/(-)Y] or CHECK [53,56^1-5/Y]
CHECK [53,56*P=5^1//5/(-)Y]
The examples above are shorthand for this longer syntax:

>REPEAT $A=53,56
CHECK [$A^1-5/Y]
CHECK [61.2,63.2*P^1//20/(-)24] or CHECK [61.2,63.2^1-20/24]
CHECK [61.2,63.2*P=20^1//20/(-)24]
The examples above are shorthand for this longer syntax:

>REPEAT $A=61,63
CHECK [$A.2^1-20/24]
CHECK [65.2,67,69*Z#1//10/99] or CHECK [65.2,67,69*Z#1-10,99]
The examples above are shorthand for this longer syntax:

>REPEAT $A=65,67,69
CHECK [$A.2*Z#1//10/99]
CHECK [71.3,74#1//100/DK/NA/RF] or
CHECK [71.3,74#1-100,DK,NA,RF]

The examples above are shorthand for this longer syntax:

>REPEAT $A=71,74
CHECK [$A.3#1-100,DK,NA,RF]
Assume four consecutive two-column fields with codes 01-17, 98, and 99, where 98 is No More and 99 is Refused. First, clean with a typical CHECK statement.

EX: CHECK [1.2,…,7.2*F#1//17/98/(-)99]
There are three possible errors that could still exist with the 98 (No More) code.

1 98 is in the first position. The specification to clean would look like this:

IF [1.2#98] THEN
ERROR “1.2 is 1st position and has no more code”
2 98 is in the list, but is not the last code in the field.

This specification returns an error if the number of total answers (NUMITEMS) does not equal the category position (SUBSCRIPT) of the 98 code:

IF NUMITEMS([1.2,…,7.2^NB]) <> SUBSCRIPT([1.2,…,7.2#98]) THEN
ERROR “98 is not last code in the field 1-8:” [1.8$]
3 The code list terminates with a code other than 98 or 99. This is only an error if 98 is always coded as the last mention (except for someone who has the maximum number of responses).

This specification returns an error if there is a blank and there is not a 98 or 99 code:

IF [1.2,…,7.2^B] AND NOT([1.2,…,7.2#98,99]) THEN
ERROR “1-8 has a blank and not a 98 code:” [1.8$]
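As a sketch, the typical CHECK and the three follow-up tests could be combined into one cleaning block. This assumes each single-statement IF is closed with $ENDIF, as in the $ separator example in 2.5.3; the error message wording is illustrative:

EX:  CHECK [1.2,…,7.2*F#1//17/98/(-)99]
     IF [1.2#98] THEN ERROR "98 is in the first position" $ENDIF
     IF NUMITEMS([1.2,…,7.2^NB]) <> SUBSCRIPT([1.2,…,7.2#98]) THEN ERROR "98 is not last code in the field 1-8:" [1.8$] $ENDIF
     IF [1.2,…,7.2^B] AND NOT([1.2,…,7.2#98,99]) THEN ERROR "1-8 has a blank and not a 98 code:" [1.8$] $ENDIF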

2.5.2 Sending Error Messages To A Print File

You can create a separate report of the errors in your System file by opening a print file. This is an ASCII file that excludes most of the processing messages Mentor generates when a specification file is run.

The >PRINT_FILE command creates a file called DIRTY (the default extension is PRT). The ~SET option ERRORS_TO_PRINT_FILE sends the error messages and the error summary to the print file. This listing will look similar to the error list shown in 2.4 CLEANING WITH SURVENT VARIABLES, Generating A List Of Error Messages.

NOTE: You can specify your own extension for the print file, e.g., >PRINT_FILE DIRTY^LFL,USER. Refer to Appendix A: META COMMANDS in your UTILITIES manual for more information.
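The specification file itself is not reproduced here. Based on the options described above, a minimal sketch might look like this (the study name CLEAN and the CHECK locations are hypothetical):

EX:  >PRINT_FILE DIRTY
     ~SET ERRORS_TO_PRINT_FILE
     ~INPUT CLEAN
     ~CLEANER
     CHECK [18^1//5/Y]
     CHECK [19.2#1//10/99]
     ~END

>PRINT_FILE opens the print file, ~SET ERRORS_TO_PRINT_FILE routes the error messages to it, ~INPUT names the data file, and the CHECK statements in the ~CLEANER block do the actual testing.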

If the above file is named CLEAN.SPX, you would run it by entering:

The list of errors would go to the output file, CLEAN.LFL. Below is an abbreviated error listing that could be generated by this specification file. See 2.3 CLEANING SPECIFICATIONS, Generating a List of Error Messages for more information.

ID: 0001
error 3: NUMSIBS [19.2#] has error: number in range is required

ID: 0002
error 2: SIBLINGS [18^] has error: a valid answer is required
SIBLINGS[18^]=" "

ID: 0003
error 2: SIBLINGS [18^] has error: extra punches

ID: 0004
error 1: NAME [5.10$] has error: string too short
NAME[5.10$]=" "

ID: 0005
error 5: DAY [15.3#] has error: a valid answer is required
DAY[15.3#]=" "

5 errors in 5 cases

error 1: 1 NAME [5.10$] has error
error 2: 2 SIBLINGS [18^] has error
error 3: 1 NUMSIBS [19.2#] has error
error 5: 1 DAY [15.3#] has error

2.5.3 Specifying More Than One Command Per Line

When you write cleaning procedures, it is often more convenient to specify more than one command on a line, especially in conditionals. Preceding cleaning commands with a dollar sign ($) helps Mentor interpret a specification line when it is ambiguous whether a command is followed by a variable name or by another command. You do not need a $ on the first command on the line.

EX: IF [5^1] WRITE_CASE $ELSE SAY "Case not written" $ENDIF

In the above example, the commands ELSE and ENDIF could otherwise be interpreted as variable names. Preceding commands with a dollar sign ($) ensures that Mentor can always distinguish between the two.

2.5.4 Additional Commands

The commands listed below are related to data cleaning and generation operations. Refer to Appendix B: TILDE COMMANDS under the tilde commands listed below for information.

~CLEANER Commands:

ASSIGN_DELETE_FLAG   Flags the case for deletion.
ASSIGN_ERROR_FLAG    Turns on the error flag for this case.
CLEAR_ERROR_FLAG     Clears the error flag.
COPY                 Copies data from one field to another.
FIXUP                Blanks the data location when it does not fit the data description given.
                     Mentor does not generate any messages when a data error is found.
MODIFY               Modifies the data. See also COPY and TRANSFER.
TERMINAL_SAY         Prints a message to the screen only (use in a procedure).
UNDELETE             Removes the delete flag.
UPSHIFT              Changes all alpha characters in the specified location to upper case.
~SET Options:

CLEANER_DEFINITION=   Prepends this string to every ~CLEANER command.
ERROR_LIMIT=          Maximum errors allowed in a cleaning run.
ERROR_STOP=           Stops compiling a procedure when this number of syntax errors is reached.
MAXIMUM_PAST_CASES    Sets the maximum number of cases that Mentor can back up to with the
                      BACKUP command.
PROCEDURE_DUMP        Echoes Mentor's internal processing messages as it compiles and executes
                      a procedure.
PRODUCTION_MODE       Updates a case without confirmation. This is the default for the
                      ~CLEANER FILE command.
TESTING_MODE          Does not allow any changes to the data file unless it is opened with
                      ALLOW_UPDATE. Overrides PRODUCTION_MODE on the FILE command.
TRAINING_MODE         Does not allow changes to the data. Use this option for training purposes.
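As a sketch of how these options might appear at the top of a cleaning run (the limit value is illustrative):

EX:  ~SET ERROR_LIMIT=100
     ~SET TESTING_MODE

This caps the run at 100 errors and, with TESTING_MODE, prevents any changes from being written to the data file while you test your cleaning specifications.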

Refer to 9.3.1 System Constants and 9.3.2 Functions for other commands that you can use.