If duplicate data is not resolved before or during the migration, one of two things happens. Either the duplicates load silently into the new ERP application and cause problems once the application is put to use, or they error out during the load and block the downstream transaction conversions that depend on them.
The basic concept for identifying a duplicate candidate is straightforward: standardize the fields that identify a unique entity (e.g. a single customer, vendor, or item) and compare the results. In practice, depending on the type and cleanliness of the data, the standardization process can be complex.
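The standardize-and-compare idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names, and the helper names `find_duplicate_candidates` and `basic_standardize` are hypothetical, and the standardization shown here (trim, collapse whitespace, ignore case) is a placeholder for whatever rules the data actually needs.

```python
from collections import defaultdict


def basic_standardize(value):
    # Placeholder rule set: trim, collapse repeated whitespace, ignore case.
    return " ".join(str(value).split()).casefold()


def find_duplicate_candidates(records, key_fields, standardize):
    """Group records whose standardized identifying fields are equal."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(standardize(record.get(field, "")) for field in key_fields)
        groups[key].append(record)
    # Only keys shared by two or more records are duplicate candidates.
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical customer records, for illustration only.
customers = [
    {"id": 1, "name": "Acme Supply", "city": "Chicago"},
    {"id": 2, "name": "ACME  SUPPLY ", "city": "chicago"},
    {"id": 3, "name": "Baker Tools", "city": "Chicago"},
]

candidates = find_duplicate_candidates(customers, ["name", "city"], basic_standardize)
for group in candidates:
    print([record["id"] for record in group])  # prints [1, 2]
```

The groups returned are candidates, not confirmed duplicates. Someone who knows the data would normally review them before anything is merged.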
There are several techniques that are commonly used to identify duplicates within master data.
- Noise Word Removal – The process of removing words that don’t add significance to the data or are often incorrect. Common examples of noise words are “the”, “of”, and “inc”.
- Word Substitution – The process of replacing an existing word or phrase with another word or phrase. It is common to substitute names and abbreviations when identifying duplicates. For example, Tim would be replaced with Timothy, OZ with Ounce, a single quote with foot, and a double space with a single space.
- Case Standardization – The process of making everything the same case.
- Punctuation Removal – The process of removing punctuation and other non-alphanumeric characters that don’t add any significance to the field value.
- Phonetic Encoding – The process of encoding words based on how they sound. For example, a phonetic encoder can map “donut” and “doughnut” to the same value. Several phonetic encoding methods are in common use, such as Soundex and Metaphone, and they differ in which spellings they treat as equivalent.
- Address Standardization – The process of standardizing all of the components of an address prior to comparing values. Usually the process that checks for duplicates uses address validation techniques, software, or services to make sure the address is valid and that all of the pieces of the address are formatted uniformly throughout the data.
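Several of the techniques above can be combined into a single standardization function. The following is a minimal sketch, not a production matcher: the `NOISE_WORDS` and `SUBSTITUTIONS` tables are hypothetical and would normally be built from the data being migrated, the punctuation rule assumes plain ASCII text, and the phonetic step uses American Soundex only because it is short enough to show in full. Address standardization is left out because it usually depends on an external validation service.

```python
import re

# Hypothetical lookup tables, for illustration only.
NOISE_WORDS = {"the", "of", "inc"}
SUBSTITUTIONS = {"tim": "timothy", "oz": "ounce"}

# American Soundex digit for each coded consonant.
SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def standardize(value):
    text = value.casefold()                              # case standardization
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # punctuation removal (ASCII only)
    words = text.split()                                 # also collapses repeated spaces
    words = [SUBSTITUTIONS.get(w, w) for w in words]     # word substitution
    words = [w for w in words if w not in NOISE_WORDS]   # noise word removal
    return " ".join(words)


def soundex(word):
    """Encode a single word with American Soundex (letter plus three digits)."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != previous:
            code += digit
        if c not in "hw":  # h and w do not separate letters with the same code
            previous = digit
    return (code + "000")[:4]


print(standardize("The Tim Horton Co., Inc."))  # prints timothy horton co
print(standardize("Timothy  Horton Co"))        # prints timothy horton co
print(soundex("Smith"), soundex("Smyth"))       # prints S530 S530
```

The order of the steps matters. In this sketch punctuation is stripped before substitution, so a symbol-based substitution such as replacing a single quote with foot would have to run earlier in the pipeline. Which spellings collapse to the same phonetic code also depends on the encoder chosen, so it is worth testing the method against a sample of the real data.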
If you have an upcoming data project that you have concerns about, or are involved in one that is currently going sideways, call me at 773.789.9324 and I’ll do everything I can to help your project succeed.