Home > Commands > Database > Screen Data > Find Duplicates

Find Duplicates

The Find Duplicates command enables you to determine whether each record in the current database can be uniquely identified by values in the "key" field(s) you specify. Additional screening of data, as well as processing in many other areas of ProVal, is contingent on the presence of a unique identifier in the database(s) used. (Most often a single field, such as SSN or EEID, may serve this purpose, but the need for multiple keys exists when, for example, EEIDs are unique only within a particular DIVISION.)

If you do not have a database open when you select this command, ProVal will prompt you to select one. Under the Key field(s) parameter, select one or more fields to check for duplicate values.

If the key you select is (or includes) a character field, the Match case in key(s) option will be accessible, so that you can indicate whether you want ProVal to regard upper and lower case letters as distinct. The default treatment is to ignore alphabetic case differences when matching key values (so that, for example, "Joe Smith" matches "JOE SMITH"). Check this box to treat upper-case and lower-case letters in character key fields as distinct values (so that "Joe Smith" and "JOE SMITH" will be treated as different values).

The check for duplicate keys can be restricted to a subset of the records in the database by entering a Selection Expression. Click the image/selection_expression_button.gif button to access the Retrieve Selection Expression dialog box. Pick a selection expression to retrieve, by clicking its name from the list of Selection Expressions in the current Project. You will return to the preceding dialog box and the Selection Expression will appear in the Selection Expression box.

Two database Result fields are created during the Find Duplicates process to help you evaluate the results of the process. The “DupFlag” field contains a “1” for records whose key field(s) do not have unique values, and it contains a “0” otherwise. The “DupID” field contains a “1” for the first record encountered with a given key (whether it is a unique or non-unique occurrence), “2” for the second (in RecID order), “3” for the third, and so forth. You can click on the row for either result field to edit the name, if you wish.

The Log duplicates as errors option refers to an action ProVal can take at the end of the Find Duplicates process. If you check this box, then when you save the results of the process, ProVal will append information to the database's error log for records with duplicate key values. You can subsequently review the errors using the View Error Log command.

When you click Run, ProVal will scan the currently open database to identify records with duplicate key values. At the conclusion of the scan, ProVal will display a message with the results (along with a summary of the parameters used for the process). The results reported include the number of records for which the DupFlag field value is "0" (indicating records with unique key values) and "1" (indicating duplicates), as well as the number of records for which DupID is "1", "2", and so on. (The highest DupID value will indicate the number of records in the largest duplicate set identified by the checking process.) In addition, if DupFlag and DupID values already exist in the database from an earlier check for duplicate key values, ProVal's processing message will indicate the number of values that have changed, which can help you determine how your corrective efforts are progressing.

You can use the Print button at the bottom of the results message box to print the message contents. Other buttons at the bottom of the message box enable you to Save the results to the database (and the database's error log, if you chose the Log duplicates as errors option for processing), or Cancel to conclude the process without saving its results to the database.

If duplicate key values are found when the Find Duplicates command is run within Spreadsheet Edit, after you save the results of the process, ProVal will offer to display the duplicate records within the spreadsheet format. The view that is offered has the following characteristics:

You can resolve duplicate key errors by correcting the record(s) for which the wrong value is present. If the reason for the duplicate keys is that your data includes multiple records for the same person, you will need to consolidate those records in order to perform further data screening. You can either resolve them by hand or use the Resolve Duplicates database command (on the Screen tab in Spreadsheet Edit).