Thursday, June 28, 2018

ETL or ESB : What is better fit for your design

I have often seen people getting confused about, what would be a better fit for movement of data in their system. Their is a lot of content out their about these two technologies. Today i am trying to elaborate these in a very basic example.

Domino's(The pizza giant) logistics is a very good example to understand.
So how domino's works:
They prepare their dough, veggies, spices etc. in their factories than big giant trucks supply these materials to their outlets once or twice a day.

This is what ETL is. Scheduled movement of large quantity data from source to destination. Obviously in real scenario we do some data transformations in between.

The outlets usually have their delivery bikes to deliver the order in smaller chunks as they receive the order. This is ESB. Delivery of small chunks of data to the needy system in near real-time manner.

Would it make sense to use a large truck to deliver the pizzas only once or twice day or vice versa. This will not work at all. This is why it is important plugin in the right component in the system.

I hope i explained it well. Let me know if you have any questions?

Monday, June 18, 2018

Checksum - How to find incremental data when there are no dates column in the source, using Talend DI

I recently came across a situation where we needed to perform incremental on some data, but the problem we cam across is that there was no date column in the source feed which could help us find the incremental data.
When you have an insert date and update in the source you could simply filer the data of source using the date column. We could surely find out the the new inserts by comparing the business keys but to find out the updates one need to compare each column value. This would make the performance very bad if there many columns in the table.

Scenario:
Suppose we have an employee table in the both source and target table with some 20 columns. Now comparing each column of the table in both source and target db for each row could not be a solution. We decided to generate the checksum for each row in the data and compare the checksum values for each row.
What is Checksum: Checksum is an error detecting algorithm and it generates a unique number for a string provided. There are multiple algorithms to generate checksum and we used CRC32.

I am demonstrating this in Talend, but one could use almost any ETL tool to replicate the same.
For the purpose of demonstration , let us take 3 columns:

  • EmpId(PK)
  • Name(Changing Attr)
  • Salary(Changing Attr)


Table Employee source has these 3 columns. The table Employee Target would have same structure with one extra column name checksum. Consider the snapshot below to understand the data.


Below is how the talend job looks like.
Talend job for checksum

It reads the data from source employee table and the java code in the component "tJavaRow" adds another column named checksum to the flow. Below is how the java code looks like.
output_row.EmpID = input_row.EmpID;
output_row.Name = input_row.Name;
output_row.Salary = input_row.Salary;
String rowConcat= input_row.EmpID+input_row.Name+input_row.Salary;

CRC32 crc = new CRC32();
crc.update(rowConcat.getBytes());
output_row.Checksum=Long.toString(crc.getValue());

Here we have concatenated all the 3 columns and calculated the checksum on it. Please note that the sequence of column is important.

After that tMap is comparing the checksum value generated in the process and checksum value already available in the target table.

As seen in the out put below, one row went to the insert branch and 1 row went to the update branch.The log window below shows the checksum values generated.

Monday, June 4, 2018

Talend Introductions for MSBI developers

In this post i would cover the intorduction to talend about product details for the people who already have experience in MSBI and are new to talend.
Talend offers multiple products including an ETL tool call Data Integration. It is a great tool and is built on java so if you have no idea about java at all you might face some challenges during debugging or enhancements.
Apart from this Talend offers other tools for Big Data, Cloud integration and Master Data Management tool.

The open source confusion:
Most of people would tell you Talend is an open source tool but that is not entirely true. Talend offer 2 ETL options:

  1. Open Studio for Data integration: It is a free tool but with limited functionality. It does not offer all the connectors, no Management and Monitoring tool and No Data quality or cloud support. So in short it is just a tool to learn Talend but not business.
  2.  Data Management Platform: It is a licenced product but one could surely download the one month trial.


I will cover more about Talend in my further posts.
If you liked the post, please share it.
Subscribe to your email and receive new articles on your email