Home > Commands > Database > Resolve Duplicates

Resolve Duplicates

The purpose of the Resolve Duplicates command is to allow you to process a database with multiple records per person into a database containing one record for each person. The data within the duplicate records is condensed based on settings you specify. The parameters of the Resolve Duplicates dialog box are as follows.

Input Database is the “source” database (or input file) for the process of resolving duplicate records. Select a database from the drop-down list box.

Output Database is a separate database (i.e., distinct from the Input Database) that is created by the process of resolving the duplicate records. ProVal will automatically assign a “.SF” extension to the database file name entered. If the named output database already exists, ProVal will ask you to confirm before processing begins that you want to replace it.

Although the Input Database and Output Database are distinct, they are related by two important characteristics that enhance data tracking capabilities:

  1. RecIDs in the Output Database are the same as those for the corresponding records in the Input Database and

  2. the Change History for the Output Database carries forward the Change History from the Input Database.

Key Field(s) allows you to specify the field(s) to be used to identify duplicate records in the Input Database. The values in the key field(s) will always be written to the Output Database.

Select one or more key fields from the list of all fields in the Input Database. If you choose a character field, the Match case in key(s) check box becomes accessible. The default treatment is to ignore alphabetic case differences when matching key values (so that, for example, "Joe Smith" matches "JOE SMITH"). Check this box to treat upper-case and lower-case letters in character key fields as distinct values (so that "Joe Smith" and "JOE SMITH" will be treated as different values).

Specify one or more Rules to handle the processing of duplicate records. There can be as few, or as many, rules as needed. At a minimum, one rule must be specified. The parameters that define the rule(s) are contained in the Rule to Resolve Duplicates dialog box and discussed in a separate section of this article (below).

Click the New button to create a rule, or click the name of an existing rule to edit it. For ease of reference, ProVal assigns a sequence number to each rule you specify. You may click the Reorder button to change the sequence in which rules are applied in the processing of duplicate records (and ProVal will revise the assigned references accordingly).

Click Run to initiate processing, which occurs in the following manner:

First, all sets of duplicate records in the Input Database are identified.

Next, the rules are applied in order, one at a time.

All sets of duplicates that meet the conditions of the first rule applied (rule number one), as detailed in its two Selection Expressions (see below), are processed according to that rule's specifications, after which the processing of those duplicate sets is considered complete.

The remaining duplicate sets are then compared against the second rule applied (rule number two), and those sets that meet rule number two's conditions are processed according to its specifications, after which their processing is considered complete.

The same steps are repeated until all rules have been applied or all duplicate sets have been processed.

The Create DupRule field option is useful for reviewing the results of the process of resolving duplicate records. When the Create DupRule field box is checked, ProVal will include, for each record in the Output Database, a field that contains the value:

 

Rule to Resolve Duplicates dialog box parameters

Description may be any descriptive phrase, including spaces, under which to save this Rule.

You may choose one of three types of Action for dealing with sets of duplicate records:

  1. Condense records into one, to condense the records in a duplicate set into one record;

  2. Keep one record and delete the rest, to keep only one record from a set; or

  3. Keep all records (and exclude from further rules); duplicates are legitimate, to keep each, and every, record from a duplicate set.

Details for each Rule are dependent upon the Action selected:

I. Condense records into one:

A condensing Method is specified for each non-key database Field, thus determining how the field values from duplicate records will be combined before the value(s) are written to the Output Database. Depending on field type, up to seven methods are available:

  1. “<skip>“ – do not include this field in the Output Database. If multiple rules are used for the condense process, the use of the “<skip>” method for a field in any rule will take precedence over the method used in any other rule for that same field, because a skipped field will not be written to the output database file.

  2. “keep all values” – preserve the field values from each record in the duplicate set, so that the value from the first record in the duplicate set will be stored in a field with the original field name and the extra values will be stored in additional "target" fields that ProVal generates dynamically. For example, consider a field named Salary that contains values for each of three records (that have an identical key value) in a set of duplicates. In this case, the Salary value for the first record in the set will be saved in the field named Salary and the values for the second and third records in the set will be saved in two newly created fields, which ProVal names Salary_2 and Salary_3. (See also the description of Target Fields, below.)

  3. “first non-missing value” – instead of preserving all values, select just one value. Use the value from the first record in the duplicate set and ignore the values on the other duplicate records. If a field value is missing from the first record, then check the second and subsequent records until a non-missing value is found. Thus, for example, if three records with an identical key field value each contain a field named Salary, with a blank field value on the first record, a value of $50,000 on the second record and a value of $60,000 on the third record, ProVal will preserve the Salary field value from the second record and thus $50,000 will be contained in the Salary field on the Output Database.

  4. “sum” – total the values for all duplicate records; this method is available only for numeric fields.

  5. “average” – total the values for all duplicate records and divide by the number of duplicate records with non-missing values; this method is available only for numeric fields.

  6. “minimum value” – take the minimum value of the duplicate records; this method can be used for both numeric and date fields.

  7. “maximum value” – take the maximum value of the duplicate records; this method can be used for both numeric and date fields.

The “<skip>”, “keep all values” and “first non-missing value” condense methods are the only methods that can be selected for coded, character and Soc. Sec. # (SSN) fields.

In general, the initial setting of an Input Database field's condense method depends on the field type. For numeric fields, the condense method initially is set to "sum" and for all other field types the initial setting is "first non-missing value". An exception to this general rule is made for the two numeric fields that may have been added to the Input Database by the Find Duplicates process. Those fields, DupFlag and DupID (or their substitutes, if you chose to edit the field names when you ran the check for duplicates), are numeric but their initial setting is “<skip>”, rather than "sum". (See Data Dictionary for details about formats and characteristics of each of the five types of ProVal database fields.)

The Target Fields button, which becomes accessible when you select the "keep all values" method for any field, accesses a dialog box by means of which you can customize field names. Instead of using the standard naming convention, such as Salary_2 and Salary_3 (in our example above), you can substitute other names. Under the parameter When condensing a field with “keep all values”, write the results to, there is a row for each field for which this method has been selected and there is a column for each potential duplicate record. You can specify the names to be associated with each field value kept (referring to the field value coming from each record in the duplicate set), or you may accept ProVal’s suggested names. Customize names for duplicate sets of up to: n records controls the number of duplicate records for which you may specify customized field names – for example, if you enter “3”, ProVal will provide two columns in the spreadsheet, labeled “Record 2” and “Record 3” (besides a column labeled “Record 1”, whose rows you cannot alter, because the field name in any row in this column is always the field name on Record 1 in the Input Database). If ProVal finds, during its processing, that there are more values to keep than the number entered for customizing names, it will assign field names for the additional records.

The Starting record parameter allows flexibility in choosing the record that will be used as the basis for the final (i.e., condensed) record, and it is this starting record whose Input Database RecID will be assigned to the condensed record in the Output Database. The choices are “First to meet selection expression #1”, “First on file” and “First in sort order”. If “First in sort order” is selected, the Sort Criteria button (discussed below, after discussion of the three Actions) becomes accessible, for input of the necessary sorting information. The "First to meet selection expression #1" and "First on file" options do not rely on sort parameters, because they are based on ProVal's default sequencing of records, which is RecID order.

II. Keep one record and delete the rest:

One record from a set of duplicates is written to the Output Database and the others are ignored. The Record to keep parameter determines the one record to include in the Output Database. The choices are “First to meet selection expression #1”, “First on file” and “First in sort order”. If “First in sort order” is selected, the Sort Criteria button (discussed below, after discussion of the three Actions) becomes accessible, for input of the necessary sorting information. The "First to meet selection expression #1" and "First on file" options do not rely on sort parameters, because they are based on ProVal's default sequencing of records, which is RecID order.

III. Keep all records (and exclude from further rules); duplicates are legitimate:

All records from a set of duplicates are written to the Output Database. Note that, although the description of this method refers to legitimate duplicates, you might also choose this action for problematic duplicate sets whose characteristics are such that you want to address them by hand in Spreadsheet Edit.

 

Sort criteria

If the “First in sort order” option has been selected, the Sort Criteria button lets you specify the sequence of records within each duplicate set, and thus control which record is first. This matters when the duplicate rule uses the Action:

In the Sort Criteria dialog box, the name of each Field present on the records in the Input Database is shown and the Sort Order column (to the right of the field name) lets you indicate the sort hierarchy. If no sort criteria are specified, records within each duplicate set will be sorted by the RecID on the record in the Input Database. Note that all fields are displayed regardless of whether they are Key Field(s) selected on the (preceding) Resolve Duplicates dialog box, although the values in a particular key field would be the same for all the records within a duplicate set.

To sort by a single field in ascending order, enter a “1” in the row for the sort field. To sort in descending order, enter “-1”. To sort by two fields, enter “1” (or “-1”) in the row for the primary sort key and enter “2” (or “-2”) in the row for the secondary sort key. The records will be sorted according to the primary key field and where there are duplicate values in the primary key field, the records will be sorted by the secondary key field. You may sort by any number of fields, and you may freely mix positive and negative numbers to indicate ascending and descending sort orders. For character and coded fields, a negative sort indicates that reverse alphabetic sorting is used for the field values. Note that for coded fields the values sorted are the field labels, rather than the numeric codes associated with the labels. There is no need to specify a sort order for Key Field(s) because, by definition, records within a duplicate set have the same values for the key field(s).

It is important to keep in mind that the Sort Criteria affect only the sequence of records within each duplicate set, not the sequence of final records in the Output Database. As stated above, RecIDs in the Output Database are the same as those for the corresponding records in the Input Database, and it is those RecIDs that then determine the (default) sequence of records in the Output Database. For example, if there are three records in a duplicate set, with RecIDs 23, 45 and 67, and the sort criteria indicate that RecID 45 is the starting record or record to keep (depending on the rule’s Action type), then the RecID of the condensed record, or record kept, in the Output Database will be 45. Note that RecIDs 23 and 67 will not be present in the Output Database.

 

Selection Expressions applied

For any Action type, under the Apply to duplicate record sets where parameter, one or two Selection Expressions can be used to determine the conditions under which a rule applies to a set of duplicate records. The conditions of selection expression #1 can also be used to determine the starting record (if “First to meet selection expression #1” was selected for the Starting record), under the Action Condense records into one, and to determine the record to keep (if “First to meet selection expression #1” was selected for the Record to keep), under the Action Keep one record and delete the rest.

As elsewhere in ProVal, the absence of coding in a selection expression indicates an absence of restrictions; thus if both selection expressions are blank for a rule, all duplicate sets tested will meet the conditions of that rule. Requirements specified by the parameter One record in the set meets selection expression #1 must be met by one record in the duplicate set (although the rule still will be applied if more than one record meets the requirements of selection expression #1). Additional requirements to apply the rule can be specified for one or more other records in the duplicate set (i.e., for a different record from the record meeting selection expression #1), by coding the parameter … in the set must meet selection expression #2, which becomes accessible when you check the box to its left. Checking that box also allows you to choose either At least one other record or All other records from the drop-down list box, to indicate how the additional requirements are to be applied in determining whether a duplicate record set meets the rule's requirements (i.e., whether it is sufficient for as few as one other record to meet selection expression #2 or whether all other records must meet selection expression #2).

For example, suppose neither selection expression is blank and the At least one other record option is selected for selection expression #2. If there are three duplicate records (in a set), with RecIDs 1, 2 and 3, and RecIDs 1 and 2 meet selection expression #1, then the rule’s conditions are met if any record meets selection expression #2 – because meeting selection expressions #1 and #2 can be “paired” as RecID 1 meeting selection expression #1 with RecID 2 meeting selection expression #2, or RecID 2 meeting selection expression #1 with RecID 1 meeting selection expression #2, or RecID 1 meeting selection expression #1 with RecID 3 meeting selection expression #2, and so forth. If, in our example, the All other records option is selected instead, then the rule’s conditions are met if RecIDs 2 and 3 meet selection expression #2 or if RecIDs 1 and 3 meet selection expression #2 – because both RecIDs 1 and 2 meet selection expression #1.

For either selection expression, clicking the image/selection_expression_button.gif button accesses the Retrieve Selection Expression dialog box. Pick a selection expression to retrieve by clicking its name in the list of Selection Expressions in the current Project. You will return to the preceding dialog box and the Selection Expression will appear in the box for the first or second selection expression, according to whether you were retrieving for selection expression #1 or selection expression #2.