I recently came across a situation where we needed to perform incremental on some data, but the problem we cam across is that there was no date column in the source feed which could help us find the incremental data.
When you have an insert date and update in the source you could simply filer the data of source using the date column. We could surely find out the the new inserts by comparing the business keys but to find out the updates one need to compare each column value. This would make the performance very bad if there many columns in the table.
Scenario:
Suppose we have an employee table in the both source and target table with some 20 columns. Now comparing each column of the table in both source and target db for each row could not be a solution. We decided to generate the checksum for each row in the data and compare the checksum values for each row.
What is Checksum: Checksum is an error detecting algorithm and it generates a unique number for a string provided. There are multiple algorithms to generate checksum and we used CRC32.
I am demonstrating this in Talend, but one could use almost any ETL tool to replicate the same.
For the purpose of demonstration , let us take 3 columns:
Table Employee source has these 3 columns. The table Employee Target would have same structure with one extra column name checksum. Consider the snapshot below to understand the data.
Below is how the talend job looks like.
It reads the data from source employee table and the java code in the component "tJavaRow" adds another column named checksum to the flow. Below is how the java code looks like.
output_row.EmpID = input_row.EmpID;
output_row.Name = input_row.Name;
output_row.Salary = input_row.Salary;
String rowConcat= input_row.EmpID+input_row.Name+input_row.Salary;
CRC32 crc = new CRC32();
crc.update(rowConcat.getBytes());
output_row.Checksum=Long.toString(crc.getValue());
Here we have concatenated all the 3 columns and calculated the checksum on it. Please note that the sequence of column is important.
After that tMap is comparing the checksum value generated in the process and checksum value already available in the target table.
As seen in the out put below, one row went to the insert branch and 1 row went to the update branch.The log window below shows the checksum values generated.
When you have an insert date and update in the source you could simply filer the data of source using the date column. We could surely find out the the new inserts by comparing the business keys but to find out the updates one need to compare each column value. This would make the performance very bad if there many columns in the table.
Scenario:
Suppose we have an employee table in the both source and target table with some 20 columns. Now comparing each column of the table in both source and target db for each row could not be a solution. We decided to generate the checksum for each row in the data and compare the checksum values for each row.
What is Checksum: Checksum is an error detecting algorithm and it generates a unique number for a string provided. There are multiple algorithms to generate checksum and we used CRC32.
I am demonstrating this in Talend, but one could use almost any ETL tool to replicate the same.
For the purpose of demonstration , let us take 3 columns:
- EmpId(PK)
- Name(Changing Attr)
- Salary(Changing Attr)
Table Employee source has these 3 columns. The table Employee Target would have same structure with one extra column name checksum. Consider the snapshot below to understand the data.
Below is how the talend job looks like.
It reads the data from source employee table and the java code in the component "tJavaRow" adds another column named checksum to the flow. Below is how the java code looks like.
output_row.EmpID = input_row.EmpID;
output_row.Name = input_row.Name;
output_row.Salary = input_row.Salary;
String rowConcat= input_row.EmpID+input_row.Name+input_row.Salary;
CRC32 crc = new CRC32();
crc.update(rowConcat.getBytes());
output_row.Checksum=Long.toString(crc.getValue());
Here we have concatenated all the 3 columns and calculated the checksum on it. Please note that the sequence of column is important.
After that tMap is comparing the checksum value generated in the process and checksum value already available in the target table.
As seen in the out put below, one row went to the insert branch and 1 row went to the update branch.The log window below shows the checksum values generated.
great
ReplyDelete