The first step in the duplicate elimination process is to review the duplicate candidates with the appropriate subject matter expert. The subject matter expert will go through the duplicate candidates, confirm which records need to be consolidated, and pick the survivor record/information (the record or information that should be brought forward to the target system). During this review, the subject matter expert may also be able to confirm or determine consistent rules that identify true duplicate information. The actual method and process will differ based on the type and size of the data as well as the number of duplicates that need to be reviewed and resolved. The important part of this process is that the true duplicates are identified.
Once the true duplicates are identified, there are two paths to take with the data. The first is to correct/harmonize the data within the disparate legacy systems. This option is normally not viable, or only partially viable, because it is impossible or too resource intensive to update all of the appropriate data in the legacy systems. However, if duplicate data can be addressed within the legacy systems, the data migration process becomes simpler for those records: the process just needs to monitor for duplicate situations and report them to the team so they can continue to cleanse the data until it is ready for go-live.
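To make the monitoring idea concrete, here is a minimal sketch of checking a legacy extract for remaining duplicates. The match key (here, a normalized customer name) and the record fields are illustrative assumptions, not a prescribed matching rule; real projects will use whatever match logic the subject matter experts confirmed.

```python
from collections import Counter

def find_remaining_duplicates(records, match_key):
    """Return match-key values that still appear more than once.

    records: iterable of dicts representing legacy rows.
    match_key: function that derives the duplicate-matching key from a row.
    """
    counts = Counter(match_key(r) for r in records)
    return {key: n for key, n in counts.items() if n > 1}

# Illustrative data: two spellings of the same customer plus one clean record.
records = [
    {"id": 1, "name": "Acme Corp "},
    {"id": 2, "name": "ACME CORP"},
    {"id": 3, "name": "Beta LLC"},
]
dupes = find_remaining_duplicates(records, lambda r: r["name"].strip().lower())
# dupes -> {"acme corp": 2}
```

A report built on a check like this can be rerun after each cleansing pass until the legacy data comes back clean.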
More than likely, the duplicates will not be able to be addressed within the legacy systems. Instead, they will need to be addressed programmatically at the time of cut-over. When addressing duplicate data during cut-over, there are several important requirements to consider.
- Build a Cross Reference that Maps Legacy to Target - If the duplicate resolution process is manual, where a report is marked up by the business users, the marked-up report will need to be stored in a place where the transformation programs can access it. If there are programmatic rules, or a hybrid set of rules with both a programmatic and a manual component, not only does the procedure need to be documented, but the cross reference is key to being able to easily show where each record ended up for the reconciliation. This cross reference will also be extremely beneficial to the business users as a reference resource. The cross reference should show the key legacy information, the key information that was passed forward to the target system, and any additional fields that are relevant to the business. It will also assist the auditors and the data migration reconciliation. Even on data sets where duplicates are not an issue, it is good practice to show where the legacy data ended up and to have that data readily accessible.
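A cross reference can be as simple as a flat file the transformation programs and business users can both read. The sketch below is one possible shape; the column names (legacy_system, legacy_id, target_id, resolution, rule_applied) are assumptions chosen for illustration.

```python
import csv

def write_cross_reference(resolutions, path="cross_reference.csv"):
    """Write one row per legacy record showing where it ended up.

    resolutions: iterable of dicts with legacy keys, target keys, and
    any business-relevant notes about how the record was resolved.
    """
    fields = ["legacy_system", "legacy_id", "target_id",
              "resolution", "rule_applied"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for row in resolutions:
            writer.writerow(row)

# Two legacy records that were merged into one target record.
rows = [
    {"legacy_system": "ERP-A", "legacy_id": "C1001", "target_id": "T-500",
     "resolution": "survivor", "rule_applied": "manual selection"},
    {"legacy_system": "ERP-B", "legacy_id": "C2044", "target_id": "T-500",
     "resolution": "merged", "rule_applied": "matched on tax ID"},
]
write_cross_reference(rows)
```

Because every legacy key appears exactly once, reconciliation and audit questions reduce to a simple lookup against this file.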
- Determine the Survivorship Rules - There are two levels of survivorship rules: record level and field level. It is important that there are clear rules for determining the survivor at both levels. Sometimes these rules can be programmatic, e.g. take the higher credit limit value between the duplicate candidates; other times the winning value is manually specified on the returned duplicate candidate report. The important part is the ability to apply both a general record-level survivor rule and field-level survivor rules. It is also frequently helpful to explicitly report which survivorship rules were applied at the time of transformation. Reporting this information makes it easier to validate the transformation and to track down any questions about the transformations.
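The two levels can be sketched as follows. The record-level rule here (most recently updated record wins) is an assumption for illustration; the field-level rule mirrors the credit-limit example above. Note how the applied rules are captured on the output record so the transformation can be validated later.

```python
def merge_duplicates(candidates):
    """Apply a record-level rule, then field-level overrides.

    candidates: list of dicts representing confirmed duplicate records.
    """
    # Record-level rule (illustrative): most recently updated record survives.
    survivor = max(candidates, key=lambda r: r["last_updated"]).copy()
    # Field-level rule: take the highest credit limit across all candidates,
    # even if it comes from a non-surviving record.
    survivor["credit_limit"] = max(r["credit_limit"] for r in candidates)
    # Record which rules fired, for reporting and validation.
    survivor["applied_rules"] = ["record: latest last_updated",
                                 "field: max credit_limit"]
    return survivor

# The newer record survives, but the older record holds the higher limit.
dups = [
    {"id": "C1001", "last_updated": "2015-03-01", "credit_limit": 8000},
    {"id": "C2044", "last_updated": "2016-07-15", "credit_limit": 5000},
]
merged = merge_duplicates(dups)
# merged["id"] == "C2044", merged["credit_limit"] == 8000
```

The same structure accommodates manually specified winners: a marked-up report simply becomes another source of field-level overrides applied after the programmatic rules.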
- Map Subordinate Information to the Survivor Record - Make sure that there are rules for handling all subordinate pieces of information tied to the record that is being eliminated or harmonized. When handling subordinate information, there are usually two routes: map it to the survivor record or simply ignore it at cut-over. For example, when eliminating duplicate items, rules for handling the eliminated item number on all sales orders, purchase orders, on-hand quantities, approved supplier lists, etc. will need to be addressed. Depending on the type of data being de-duplicated, there could be many or only a few pieces of additional data affected. Also, some pieces of information might need to be mapped to the survivor while others might need to be eliminated. For example, when merging duplicate customers, the rule for handling contacts is frequently to map new contacts to the survivor record and to ignore any contacts that already exist under the survivor.
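The contacts example above can be sketched like this. The record shape and the use of email as the "already exists" test are assumptions for illustration; the real comparison would use whatever uniquely identifies a contact in the target system.

```python
def remap_contacts(contacts, eliminated_ids, survivor_id):
    """Re-point contacts from eliminated duplicates to the survivor,
    ignoring contacts that already exist under the survivor."""
    existing = {c["email"] for c in contacts if c["parent_id"] == survivor_id}
    remapped = []
    for c in contacts:
        if c["parent_id"] in eliminated_ids:
            if c["email"] in existing:
                continue  # already present under survivor: ignore at cut-over
            c = dict(c, parent_id=survivor_id)  # map to the survivor
            existing.add(c["email"])
        remapped.append(c)
    return remapped

contacts = [
    {"parent_id": "T-500", "email": "alice@acme.com", "name": "Alice"},
    {"parent_id": "C1001", "email": "alice@acme.com", "name": "Alice (dup)"},
    {"parent_id": "C1001", "email": "bob@acme.com",   "name": "Bob"},
]
result = remap_contacts(contacts, {"C1001"}, "T-500")
# The duplicate Alice contact is dropped; Bob is remapped to the survivor.
```

Each type of subordinate data (orders, on-hand quantities, supplier lists) would get its own rule in the same spirit: remap, merge, or ignore.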
Once the survivorship rules for the data migration are in place, there is one more important piece to the data de-duplication process: the ability to continually monitor the data for, and report, new duplicates, duplicates that drop off, and changes in the survivor data. Building this type of report from scratch can be a little tricky, as there are several situations that need to be validated and reported in a clear manner. I will discuss the requirements around ongoing duplicate maintenance reports in the next post.