How to drop duplicates from a file

The spec DEDUPER.SPXshows how to have Mentor drop duplicate records. This run is setup to keep the first record if there are duplicates and to print a message when a duplicate record is found. You could modify the procedure to do different things when you have a duplicate. DEDUPER.DAT is the input file used with this example.

This setup sorts the file on the field and uses the Break_Variable to store information about whether this record is a duplicate or not. See below (Break Field Information) for what the Break_Variable specifically does.

The first duplicate is used by default. You can use the define UseLastDupe to have it save the last duplicate.

You will either need a 4-column blank field or extend the length of the file (as in this example) to store the 4-column break variable.

See the list file to see a list and count of records that were dropped because of being duplicates.

Output file will be sorted in the duplicate field order, unless the define originalsort is used.

>Define @Input Deduper.Dat                     ''Name of the input Data file>Define @InputOpt ASCII=20 Length=200         ''Input Options - Default is TR file>Define @Output Deduper.OK                    ''Name of the output Data file>Define @OutputOpt ASCII Length=20 Trimblanks  ''Output Optons - Default is TR file>Define @DupeField 1.10                       ''Field you wish to dedupe off of>Define @BreakField 101.4                      ''Field where Break info is stored. MUST BE BLANK!Use this define if you want to get the file back in the original sort order. This causes a lot of extra processing, so only use it if need be. Also you must supply either a current blank field or extend the case to hold the case number. This field must be big enough to hold the number of cases in the file (6 digits gives you 999,999). This field should either be an existing blank field or added space on the record.>Define @OriginalSort 111.6Use this define if you have put the Break field in an existing blank set of columns and you do NOT want this information saved in the new data set>Define @BlankBreakUse this define if you want to use the LAST duplicate rather than the first '>Define @UseLastDupeUse this define if you wish to save the intermediate data set that is made'>Define @SaveTemp*******************************************************************************Break Field information The 1st 2 columns of the break variable will contain: 00 on the first record in the file 01 if this record does NOT duplicate the previous record 02 if this record DOES duplicate the previous record. The last 2 columns of the break variable will contain: 00 on the last record in the file 01 if this record does NOT duplicate the next record 02 if this record DOES duplicate the next record.>Purgesame>FillDefinesInQuotes~In @Input  @InputOpt>IfDefine @OriginalSort~Out DUPESORD~Def Proc=WriteOrder:If [@BreakField^NB] Error [@BreakField$] Halt "BreakField ""@BreakField"" is not blank. Must STOP!"EndifIf [@OriginalSort^NB] Error [@OriginalSort$] Halt "Original Order field ""@OriginalSort"" is not blank. Must STOP!"EndifModify [@OriginalSort] = CaseNumberWrite }~Exc Proc=WriteOrder~In DUPESORD>Endif~Out DUPESSRT~Sort [@DupeField $] Break_Variable=[@BreakField]~In DUPESSRT>IfDefine @OriginalSort~Out DUPESTMP>Else~Out @Output  @OutputOpt>Endif~DefBreak: [@BreakField]Proc= Finddups:>IfDefine @UseLastDupeIf [(Break) 3.2#0,1] Then>ElseIf [(Break) 1.2#0,1] Then>Endif>IfDefine @BlankBreak Blank [@BreakField $]>Endif WritecaseElse Error [@DupeField $] [@BreakField$] " Dropped Because this is a duplicate"Endif }~Exc Proc= Finddups>IfDefine @OriginalSort~In DupesTmp~Out @Output  @OutputOpt~Sort [@OriginalSort$]>Endif~In>IfDefine @SaveTemp>Else>Delete Dupessrt.tr>IfDefine @OriginalSort>Delete Dupesord.tr>Delete Dupestmp.tr>Endif>Endif~End

  • deduper.spx
  • deduper.dat