Learning and Sharing: Business Analytics

Showing posts with label Business Analytics. Show all posts

Saturday, September 8, 2018

Shrunken Dimension - Very useful for higher level summary

Shrunken Dimension is a very special type of dimension when there is a requirement of higher level of summary and lower grain data is already available in the warehouse.

Before we dive in if you do not understand what is the grain of the dimension, it is important to understand.

Month could be a shrunken dimension of date, which could be connected to a fact which has grain of month like monthly sales. But this is a very classic example, let us deep dive a little with another example.
Considering the Online Movie Ticketing business scenario, we would have 2 tables, a fact for Ticket sales and a related dimension of the Venue for which sold the ticket.
If customer need to view the total consolidated Sales at city level rather than Venue level, we could create a separate dimension for City and a separate fact table for Consolidated Sales. The image below demonstrates the same:

In the example above we have created a base dimension and Shrunken dimension separately. If needed we could also create the Shrunken dimension from the base dimension, once it is loaded. Again it totally depends on the business requirements and the type of source system you are dealing with.

Sunday, August 19, 2018

What is Inferred dimension and Why it is important

Inferred Dimension, also referred to as Early coming Fact or Late coming dimension. These all are different names of same scenario.

If the source systems of your Data warehouse are very well designed, you should never face this scenario. But, sometimes it would happen that the fact data is available in the source but related dimension data does not yet exists.

Consider that Online Movie Ticketing business scenario. Suppose a sales person sold some bulk movie tickets to an organisation at some discount. He provided a discount with a new promotion code, which he is supposed to punch in the system but forgot. When the ETL would process the data it will find that the promo code is not available.

There are 2 ways to handle this scenario:

1st one is to mark the FKey as unknown. This is the simplest option but these sales records could not be used for the reporting purpose unless the dimension record arises or the fact is updated accordingly. Which is an ETL overhead altogether.
The other option is to use Inferred Dimension. For any such record we could create an entry in the dimension table and mark it inferred as shown in the image below.

Using this method, our fact record could be used for reporting and a better picture would be generated. At a later load when the promo code details would be available, the dimension record could be updated automatically, depending on what type of Slowly Changing Dimension it is.

Sunday, August 12, 2018

OLTP vs OLAP explained

Both OLAP and OLTP are databases are designed for different purposes.

OLTP stands for Online Transaction Processing and is optimized for insert. update and delete operations. This is the reason these are used behind the application for regular business transactions.

OLAP stands for Online Analysis Processing and this databases are optimized for read operation and are widely used for analyzing, mining and reporting on large datasets.

It is important to note here that the database in which we store all the Facts and Dimensions is also OLTP. Post this we move the data to create a cube, which is stored in a OLAP database.

The image below explains the major differences of a OLTP and OLAP systems.

Now that we know the basic difference between OLTP and OLAP, lets now dig a little deeper.

The core of any OLAP system is a Cube, which is just a three-dimensional spreadsheet, where as in OLTP data is stored in 2D tables.
A cube consists of Numeric Facts called measures and are categorized by dimension data.
The metadata of any cube is star schema or Snowflake Schema or Fact Constellation in relational(OLTP) database. Measures are derived from the records in the fact table and dimensions are derived from dimension tables.
While we can perform CRUD operation in OLTP, there are a few more operations which we could perform on a cube. These include Slicing, Dicing. Pivoting etc.

Wednesday, August 1, 2018

Snowflake schema

Snowflake Schema: When one fact is surrounded by a chain of more than one dimension, it is called snowflake. It is also called an extension of star schema by adding more dimensions.
In snowflake schema the table are normalized.

If you do not understand the basic concepts of data warehousing, please visit my previous article.

If we consider the data warehousing scenario detailed here, below is how a snowflake schema would look like.

Click here to compare the same design with star schema.

Characteristics of Snowflake schema:

Less storage requirement because of normalized dimensions
Less time to load as compared to star schema.
Better slicing and dicing
Join complexity increase
Time to Query may increase a little depending on the scenario.

Click here to learn more about different types of schema.

Sunday, July 22, 2018

Bridge Tables in datawarehouse

As the name suggests, bridge table is used to link tables when direct linking does not make sense. Usually used in a little complecated senarios when we have many to many relations between 2 tables.

Example: Consider we have multiple customer linked to a single bank account balance. Putting multiple customer keys in that fact would create obvious problems in the fact table.

Bridge table is also used sometimes to manage variable hierarchies. For e.g. to maintain one primary customer and one or more secondary customers.

Bridge table come to rescue in such situation, but just like outrigger dimension don't use it too much. Remember, every query has to pay toll to navigate the bridge.

Although, it is not a dimension but I personally call it Bridge dimension because it has many properties of a dimension.

Click here to read about the other dimensions in datawarehousing.

Tuesday, July 17, 2018

What is Outrigger Dimension and its purpose

Definition: As per Kimball, when in dimension in the DWH is linked with another dimension table, the secondary table is call outrigger dimension. For example a customer dimension referring to the Date dimension for the date of account creation.

Now, it might seem like snowflaked(not complete), but it is not. There are a few differences as detailed below:

Snowflake make the structure highly normalized where as outrigger makes it partially normalized.
Snowflake makes the ETL complicated and reduces the query performance.

While, Now outrigger dimensions are similar to snowflake but are vary limited in nature. They are just one or more levels removed fro the fact table. This is mostly used when there is a business requirement to slice and dice of the particular column.

Consider the following criteria before creating an outrigger dimension:

User Analytics Needed: If the user wants to slice and dice on the column
Data Granularity vs Cardinality : A column with high low granularity and high cardinality is a candidate.
Update Frequency : If a column is updated very frequently, not more than once a month
It is divided in the source itself.
Customer and it location would be a good candidate for outrigger dimension

For list of dimensions in data-warehouse, click here

Wednesday, July 11, 2018

What is grain of a table in data warehousing

In simple terms grain is the lowest level till which you could identify the data in a table.
For example: If you have table which holds the stock inventory and table holds:

One record for each day of the stock : Than the grain would be one day
One record for each hour : than the grain of the table would be hour
One record for each week : than the grain of the week would be week.

How does grain impacts the data warehouse:
It is important that while designeing the databse we design till the lowest grain for which we business needs to analyse or report.
With example above, If your Stock inventory table has grain of day you can easily calculate weekly stock but viceversa is not true.

Similarly, grain of related tables(Facts and Dims) should complement each other. For eg if Fact_Stock_Inventory holds data for each hour and Dim_DateTime has a grain of a day, it will generate distorted results.

Tuesday, July 10, 2018

What is Conformed Dimension and its advantages

People usually finds it difficult to understand the Conformed dimension, I am trying to explain it with an example. To start with, please do not confuse it confirmed / static dimension, because of the name.

Definition: A conformed is defined as a dimension, which has the same meaning to all the facts in a data warehouse, even if the facts belong to different data-marts.
It could one single table or multiple copies of same table across different marts. There is no hard rule of the copies having the same table name. One copy could even be a sub set of another copy.

Let us now try to understand it with a example. Consider online movie tickets data warehouse scenario.
Lets assume we have divided our EDW into 2 data marts

Marketing : Marketing team is more concerned about the no of tickets sold.
Finance : Finance team is more concerned about the profit.

Both teams would require to know the stats for each cinema. Now, both teams might also have different terminologies. Marketing team might call it Cinems whereas finance team might call it Screen.
So in this scenario we could create 2 copies of same table with different names as shown below.

But why should we use it. Why don't we create 2 separate tables if we need. Below are a few disadvantages of not creating Conformed dimensions:

There will more ETL work to generate the logic of different tables.
More ETL will surely mean some extra maintenance effort.
and possible difficulties in joining multiple data marts.

Visit my post on Types of Dimensions to understand more about different types of dimensions.

Monday, July 9, 2018

List of dimensions in a Data Warehouse

Below is the list of dimensions types we should consider while designing any data warehouse. Click on the link below for a detailed definition and example.

Basic dimension
Junk dimension
Conformed dimension
Outrigger Dimension
Bridge table
Snowflaked dimension (Details article in progress)
Role Playing dimension (Details article in progress)
Degenerate Dimension (Details article in progress)
Slowly Changing dimension (Details article in progress)
Inferred dimension
Shrunken Dimensions
Static Dimensions

This list will keep on improving with time, so you can bookmark this page and feel free to subscribe to the blog by entering your email

Tuesday, May 22, 2018

Business Analysis - The Big Picture

There is lots of buzz in the market about analytics and that's the reason there are brurred lines for new comers to understand the overall scenario.
Today i am trying to draw a picture to cover the complete picture of the complete analytics business and related technologies. But before we dig deeper, let us define what is data analytics.

Definition of Data Analytics
The purpose of data analytics is to study at the data and find out the patterns in data. This could be further drilled down to the size of data we are looking at and type of patterns we need a output.
Usually the output of Data Analytics is a reports, because that is what business people understand.

Based on the findings we need we could divide the analytics majorly in the 4 categories. Based on the requirement of the business we could fit the requirement in any of the 4 categories.
But it is important to understand that there are not clear boundaries. Below picture explains complexity of the types and the value it brings in.

The first 2 type, Descriptive and Diagnostics are done after the event has happend. So we deal with the actual data of the business and tells the real picture. These 2 are mostly used together.

Descriptive analysis: What is happening
It is most common type of analytics and tell us what is happening. For e.g. it could tell us the monthly profit of the organisation for different demographic areas.
Effective visualizations are created to ensure right picture is visible to viewer.

Diagnostic Analysis: Why it is happening
After we know what is happening, it is important to understand the reasons. Use the diagnostic tools to drill down and isolate the root cause.

Comes next the more complex and valuable types of analytics Predictive and Prescriptive. This is forecasting of events based on the previous history and are mostly used together.
Predictive Analysis : What is likely to happen
Based of the previous data and predictive models, we forecast what is going to happen on a particular point of time. For eg. Every December the sales of retails companies increases.
As business world is full of uncertainties, it is always good to have certain information before hand. Which makes predictive analysis very powerful and valuable.

Prescriptive analysis : What to do
This is most complex and valuable type of analysis. Here we tell the business that, based on th predictive analysis this is going to happen in this time frame and if we do this the opportunity could be maximized.
A day to day example would be Maps in our phone, which suggests us to take a different route when there is traffic on one route.

Will talk about different tools and models in a seperate post.