Data cleansing is a critical component of any data migration project and should be one of the first activities the project team starts after the initial data assessment is put together. There are three methods that can be used to clean the data, and often all three will be needed during the course of a project. The first method is correcting the bad data in the legacy system. The second is using logic in the transformation rules to clean the data as it is migrated. The third is a hybrid approach in which the business works out overrides that are applied to the legacy data as it is converted.
The first method, cleansing the legacy data in the legacy system, is the most traditional approach to data cleansing. Analysis and profiling reports identify the bad-data scenarios, and the data is cleaned up in the legacy system either by hand or programmatically. This approach typically yields the cleanest results: the data is cleansed at the source and is scrutinized more closely by the business than with the other two methods, because the changes have a direct impact on the current production system. However, this method is also usually the most difficult, because legacy systems often lack a way to facilitate this type of cleansing, or the changes would introduce too much risk into the current production environment. Frequently, updating a legacy system is a manual effort that requires someone to work through screens for hundreds or thousands of records. This can be a time-consuming and expensive undertaking, which brings us to the second method of cleansing on a data migration project: automated cleansing.
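To make the profiling step concrete, here is a minimal sketch of the kind of report that flags records for correction at the source. The file names, column names, and validity checks are hypothetical; on a real project they would come out of the data assessment.

```python
import pandas as pd

# Hypothetical legacy extract; file and column names are illustrative.
vendors = pd.read_csv("legacy_vendors.csv", dtype=str)

# Simple profiling checks that surface bad-data scenarios.
missing_tax_id = vendors["tax_id"].isna()
bad_postal = vendors["postal_code"].str.len() != 5  # also catches blanks

# Publish a worklist so the business can correct the records at the source.
worklist = pd.concat([
    vendors[missing_tax_id].assign(issue="missing tax ID"),
    vendors[bad_postal].assign(issue="malformed postal code"),
])
worklist.to_csv("vendor_cleansing_worklist.csv", index=False)
print(f"{len(worklist)} vendor records flagged for source-system correction")
```

Note that the output is a worklist, not a fix: under this method the corrections themselves are made in the legacy system, by hand or programmatically.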
Automated cleansing on a data migration project is usually the most cost-effective and efficient way to cleanse data. It involves devising a logical set of rules to handle the bad-data scenarios. For example, in the fixed asset data, the asset acquired date might fall after the depreciation start date, which should never happen. The cleansing rule in this case might be to set the asset acquired date equal to the depreciation start date. There are countless scenarios in which the data can be cleansed through some sort of logical statement. The drawback of this type of cleansing is that the business focuses on what needs to happen to get the data into the target system and does not always think through the full ramifications of the automated rule. That said, the benefits far outweigh the drawback of slightly less business scrutiny. The largest benefits are that logical rules are usually easy to implement and, once they are set up, the team does not need to worry about them.
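The fixed asset rule above translates into a few lines of transformation logic. A minimal sketch, assuming the extract arrives as a flat file with illustrative column names:

```python
import pandas as pd

# Hypothetical fixed asset extract; file and column names are illustrative.
assets = pd.read_csv("fixed_assets_extract.csv",
                     parse_dates=["acquired_date", "depr_start_date"])

# Automated rule: an asset cannot be acquired after depreciation starts,
# so align the acquired date with the depreciation start date.
bad = assets["acquired_date"] > assets["depr_start_date"]
assets.loc[bad, "acquired_date"] = assets.loc[bad, "depr_start_date"]

# Record how many rows the rule touched so the result can be audited.
print(f"Rule applied to {bad.sum()} of {len(assets)} asset records")
```

Once a rule like this is embedded in the transformation code, it runs on every conversion cycle without further attention, which is exactly the benefit described above.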
The last way data cleansing can be handled on a data migration project is a hybrid approach. This approach requires additional input from the business and some additional programming from the data conversion team. It should be used when it is not feasible to clean the data in the legacy system, but no logical rule can be formulated to correct the issue. There are many situations where this type of data cleansing will be needed on a project; handling duplicated data is a common one. There might be duplicate customers across or within the legacy systems and no clear rule for determining which customer record to bring forward. In this case, the data migration team would provide the business with a listing of all the duplicate customer candidates, and the business would mark up the report to indicate which customer records should be brought forward into the target system. The conversion processes then use this information to bring only the designated customer records forward and to map the appropriate subordinate information under the “winning” customer, as sketched below.
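Here is one way the marked-up report might be applied during conversion; the file layouts and column names are hypothetical.

```python
import pandas as pd

# Hypothetical inputs; file and column names are illustrative only.
customers = pd.read_csv("legacy_customers.csv")    # columns: cust_id, name, ...
orders    = pd.read_csv("legacy_orders.csv")       # columns: order_id, cust_id, ...
overrides = pd.read_csv("customer_overrides.csv")  # columns: dup_id, winner_id

# The business-marked-up report: each duplicate points at the record to keep.
winner = dict(zip(overrides["dup_id"], overrides["winner_id"]))

# Drop the losing duplicates from the customer load...
losing_ids = [dup for dup, win in winner.items() if dup != win]
customers = customers[~customers["cust_id"].isin(losing_ids)]

# ...and re-point subordinate records at the winning customer.
orders["cust_id"] = orders["cust_id"].replace(winner)
```

Re-pointing the subordinate records is the step that is easy to miss: if only the customer table is filtered, any orders belonging to the losing duplicates are orphaned in the target system.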
On occasion, there will be a request to load the bad data into the target system and clean it up after the fact. This is usually a bad idea, as “after the fact” never comes. The suggestion should be given merit only if refusing it puts the project at risk and the target business/data owners are comfortable with the decision.