Learning and Sharing: August 2018

Sunday, August 19, 2018

What is an Inferred Dimension and why it is important

An inferred dimension is also referred to as an early-arriving fact or a late-arriving dimension. These are all different names for the same scenario.

If the source systems of your data warehouse are very well designed, you should never face this scenario. Sometimes, however, the fact data is available in the source but the related dimension data does not yet exist.

Consider the online movie ticketing business scenario. Suppose a salesperson sold some bulk movie tickets to an organisation at a discount. He gave the discount under a new promotion code, which he was supposed to enter in the system but forgot. When the ETL processes the sales data, it will find that the promo code does not exist in the dimension.

There are two ways to handle this scenario:

The first is to mark the foreign key as unknown. This is the simplest option, but these sales records cannot be reported against the correct promotion until the dimension record arrives and the fact is updated accordingly, which is additional ETL overhead.
The other option is to use an inferred dimension. For any such record we create an entry in the dimension table and mark it as inferred, as shown in the image below.

Using this method, the fact record can be used for reporting right away and gives a more complete picture. At a later load, when the promo code details become available, the dimension record can be updated automatically, depending on which type of Slowly Changing Dimension it is.
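The inferred-dimension approach described above can be sketched in a few lines of Python. This is only a minimal illustration, not a production ETL: the names dim_promotion, fact_sales, is_inferred, and the promo code BULK20 are all hypothetical, and the sketch assumes the late-arriving details simply overwrite the placeholder row in place (the behaviour for your dimension depends on its Slowly Changing Dimension type).

```python
# Hypothetical in-memory stand-ins for a promotion dimension and a sales fact.
dim_promotion = {}   # promo_code -> dimension row
fact_sales = []      # list of fact rows


def get_or_infer_promo_key(promo_code):
    """Return the surrogate key for promo_code, adding an inferred row if it is missing."""
    row = dim_promotion.get(promo_code)
    if row is None:
        row = {
            "promo_key": len(dim_promotion) + 1,  # simple surrogate key (assumption)
            "promo_code": promo_code,
            "discount_pct": None,                 # details not known yet
            "is_inferred": True,                  # flag the placeholder row
        }
        dim_promotion[promo_code] = row
    return row["promo_key"]


def load_fact(sale):
    """Load one sale; the fact always gets a real foreign key."""
    fact_sales.append({
        "promo_key": get_or_infer_promo_key(sale["promo_code"]),
        "tickets": sale["tickets"],
        "amount": sale["amount"],
    })


def load_promo_details(promo_code, discount_pct):
    """Later load: fill in the inferred row in place, keeping its surrogate key."""
    row = dim_promotion.get(promo_code)
    if row is None:
        dim_promotion[promo_code] = {
            "promo_key": len(dim_promotion) + 1,
            "promo_code": promo_code,
            "discount_pct": discount_pct,
            "is_inferred": False,
        }
    else:
        row["discount_pct"] = discount_pct
        row["is_inferred"] = False


# The fact arrives before the promo code exists in the dimension.
load_fact({"promo_code": "BULK20", "tickets": 50, "amount": 400.0})
inferred_key = fact_sales[0]["promo_key"]

# The promo code details arrive in a later load.
load_promo_details("BULK20", 20)
```

Because the surrogate key never changes, the fact row loaded earlier needs no update once the real promotion details arrive.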

Sunday, August 12, 2018

OLTP vs OLAP explained

OLAP and OLTP databases are designed for different purposes.

OLTP stands for Online Transaction Processing and is optimized for insert, update, and delete operations. This is why OLTP databases sit behind applications that handle regular business transactions.

OLAP stands for Online Analytical Processing. These databases are optimized for read operations and are widely used for analyzing, mining, and reporting on large datasets.

It is important to note here that the database in which we store all the facts and dimensions is itself a relational database, like an OLTP system. From there we move the data into a cube, which is stored in an OLAP database.

The image below explains the major differences between OLTP and OLAP systems.

Now that we know the basic difference between OLTP and OLAP, let's dig a little deeper.

The core of any OLAP system is a cube, which can be pictured as a three-dimensional spreadsheet (a real cube may have more dimensions), whereas in OLTP data is stored in two-dimensional tables.
A cube consists of numeric facts, called measures, which are categorized by dimension data.
The metadata of a cube comes from a star schema, snowflake schema, or fact constellation in a relational (OLTP) database. Measures are derived from the records in the fact table and dimensions are derived from the dimension tables.
While we can perform CRUD operations in OLTP, there are a few more operations that we can perform on a cube. These include slicing, dicing, pivoting, etc.

What is a fact table in a data warehouse

In data warehousing, a fact table holds the measurable entities of a business. It is surrounded by multiple dimensions in a star or snowflake schema.
In the ideal scenario it has two types of columns:

Foreign keys to the related dimensions
Measurable columns like price etc.

It usually provides the values to be aggregated (count, sum, etc.), which can be analysed across different dimensions. It is important to define the grain of each fact at the lowest level that reporting needs.

If we consider the data warehouse scenario of online ticket booking, the first fact that comes to mind is tickets.

So, if at any point the business needs to find out how many tickets were sold for each cinema, they can simply join Fact_Tickets with the Dim_Cinemas table, as detailed in the image below.

This shows that 2 tickets were sold for cinema C1 and 1 ticket was sold for cinema C2.
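As a rough sketch, the same join and count can be expressed in Python. The table names Fact_Tickets and Dim_Cinemas come from the text; the column names (ticket_id, cinema_key, price) and the sample rows are assumptions chosen only to reproduce the result described above.

```python
from collections import Counter

# Dim_Cinemas: surrogate key -> cinema name (sample data, assumed).
dim_cinemas = {1: "C1", 2: "C2"}

# Fact_Tickets: one row per ticket sold, with a foreign key to Dim_Cinemas.
fact_tickets = [
    {"ticket_id": 101, "cinema_key": 1, "price": 12.0},
    {"ticket_id": 102, "cinema_key": 1, "price": 12.0},
    {"ticket_id": 103, "cinema_key": 2, "price": 10.0},
]

# Join each fact row to its cinema, then count the rows per cinema.
tickets_per_cinema = Counter(dim_cinemas[row["cinema_key"]] for row in fact_tickets)

print(dict(tickets_per_cinema))  # {'C1': 2, 'C2': 1}
```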

Different operations on an OLAP cube in a data warehouse

The common operations on a cube are slice, dice, drill-down, roll-up, and pivot. These can be used individually or together to generate the desired picture of the data.

Slice is the act of picking a subset of the cube by choosing a single value for one of its dimensions. This creates a new cube with one fewer dimension.
The picture below shows a slicing operation: the sales figures of all categories and all regions are sliced for just the year 2004.

Dicing allows analysts to create and visualize a sub-cube by selecting multiple values from multiple dimensions. The picture below shows a dicing operation:
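To make slicing and dicing concrete, here is a small Python sketch. It assumes a toy cube stored as a dictionary keyed by (year, region, category); the function names slice_cube and dice_cube and all sample figures are invented for illustration and are not taken from the pictures.

```python
# Toy cube: (year, region, category) -> sales. All values are made up.
cube = {
    (2003, "North", "Tents"): 100,
    (2003, "South", "Tents"): 80,
    (2004, "North", "Tents"): 120,
    (2004, "North", "Boots"): 60,
    (2004, "South", "Tents"): 90,
    (2004, "South", "Boots"): 40,
    (2005, "North", "Boots"): 70,
}


def slice_cube(cube, year):
    """Slice: fix a single value of the year dimension, leaving region and category."""
    return {
        (region, category): sales
        for (cell_year, region, category), sales in cube.items()
        if cell_year == year
    }


def dice_cube(cube, years, regions, categories):
    """Dice: keep cells whose value on each dimension is in the selected set."""
    return {
        key: sales
        for key, sales in cube.items()
        if key[0] in years and key[1] in regions and key[2] in categories
    }


sales_2004 = slice_cube(cube, 2004)
sub_cube = dice_cube(
    cube,
    years={2004, 2005},
    regions={"North"},
    categories={"Tents", "Boots"},
)
```

Note that the slice has two-part keys (the year dimension is gone), while the dice keeps all three dimensions and only narrows the values on each.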

Drill down / up allows the analyst to navigate through the levels of data, from the most summarized to the most detailed and back. The picture shows the drill-down operation from the summary category "Outdoor-Schutzausrüstung" (outdoor protective equipment) to the sales figures for the individual products.

Roll-up means summarizing the data along a dimension. The summarization can be a count or a total across a dimension hierarchy, or a more complex formula such as profit from sales. The picture below shows the summation of sales for each category.

Pivot allows analysts to rotate the cube in space to the desired orientation, which gives the data and visualizations a totally different perspective. The image below shows this.
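Roll-up, drill-down, and pivot can be sketched in the same spirit. The rows, the category/product hierarchy, and the helper names roll_up and pivot below are assumptions made for illustration only.

```python
from collections import defaultdict

# Sample detail rows (made up): product belongs to category.
rows = [
    {"category": "Camping", "product": "Tent", "region": "North", "sales": 120},
    {"category": "Camping", "product": "Stove", "region": "North", "sales": 30},
    {"category": "Camping", "product": "Tent", "region": "South", "sales": 90},
    {"category": "Footwear", "product": "Boots", "region": "North", "sales": 60},
    {"category": "Footwear", "product": "Boots", "region": "South", "sales": 40},
]


def roll_up(rows, level):
    """Total the sales measure at the given level of the hierarchy."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[level]] += row["sales"]
    return dict(totals)


def pivot(rows, row_dim, col_dim):
    """Arrange total sales as a table of row_dim values by col_dim values."""
    table = defaultdict(dict)
    for row in rows:
        cells = table[row[row_dim]]
        cells[row[col_dim]] = cells.get(row[col_dim], 0) + row["sales"]
    return dict(table)


by_category = roll_up(rows, "category")   # summarized level
by_product = roll_up(rows, "product")     # drill down to the detailed level

category_by_region = pivot(rows, "category", "region")
region_by_category = pivot(rows, "region", "category")  # same data, rotated
```

Drilling down is simply the roll-up at a finer level, and the two pivots hold the same totals with rows and columns swapped.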

Sunday, August 5, 2018

Fact Constellation schema in Data warehousing

Fact Constellation: A fact constellation allows you to connect multiple facts to shared dimensions. It is a combination of multiple stars connected with each other through dimensions, hence it is called a galaxy or fact constellation schema.

Compared to star and snowflake schemas, it is used for more sophisticated applications. The dimension tables may or may not be normalized, as per the requirement.
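A minimal sketch of the idea, assuming two fact tables that share one dimension: Fact_Tickets and Dim_Cinemas follow the ticket-booking scenario, while fact_snack_sales, its amount column, and all sample values are hypothetical additions used only to show a second fact.

```python
from collections import defaultdict

# Shared dimension used by both facts.
dim_cinemas = {1: "C1", 2: "C2"}

# First star: ticket sales.
fact_tickets = [
    {"cinema_key": 1, "price": 12.0},
    {"cinema_key": 1, "price": 12.0},
    {"cinema_key": 2, "price": 10.0},
]

# Second star (hypothetical): snack sales, keyed by the same cinema dimension.
fact_snack_sales = [
    {"cinema_key": 1, "amount": 5.0},
    {"cinema_key": 2, "amount": 7.5},
    {"cinema_key": 2, "amount": 2.5},
]


def total_by_cinema(fact_rows, measure):
    """Sum one measure of a fact table per cinema via the shared dimension."""
    totals = defaultdict(float)
    for row in fact_rows:
        totals[dim_cinemas[row["cinema_key"]]] += row[measure]
    return dict(totals)


ticket_revenue = total_by_cinema(fact_tickets, "price")
snack_revenue = total_by_cinema(fact_snack_sales, "amount")

# Because both facts share Dim_Cinemas, their results line up by cinema.
combined_revenue = {
    cinema: ticket_revenue.get(cinema, 0.0) + snack_revenue.get(cinema, 0.0)
    for cinema in dim_cinemas.values()
}
```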

If you do not understand the basic concepts of data warehousing, please visit my previous article.

If we consider the data warehousing scenario detailed here, below is what a galaxy or fact constellation schema would look like.

Characteristics of Fact Constellation schema:

Complex design
Different variants of aggregations
Suitable for more sophisticated applications

Wednesday, August 1, 2018

Snowflake schema

Snowflake Schema: When a fact is surrounded by dimensions that are themselves linked to further dimension tables, forming chains, the schema is called a snowflake. It is also described as an extension of the star schema that adds more dimension tables.
In a snowflake schema the dimension tables are normalized.
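The chain of normalized dimensions can be illustrated with a short Python sketch. The dim_cities and dim_states tables, their key columns, and the sample values are assumptions; only Fact_Tickets and Dim_Cinemas come from the ticket-booking scenario.

```python
# Normalized dimension chain (hypothetical): cinema -> city -> state.
dim_states = {
    1: {"state": "State X"},
    2: {"state": "State Y"},
}
dim_cities = {
    1: {"city": "City A", "state_key": 1},
    2: {"city": "City B", "state_key": 2},
}
dim_cinemas = {
    1: {"cinema": "C1", "city_key": 1},
    2: {"cinema": "C2", "city_key": 1},
    3: {"cinema": "C3", "city_key": 2},
}

# One fact row per ticket sold.
fact_tickets = [
    {"cinema_key": 1},
    {"cinema_key": 1},
    {"cinema_key": 2},
    {"cinema_key": 3},
]


def state_of(cinema_key):
    """Follow the chain cinema -> city -> state (two extra lookups per row)."""
    city = dim_cities[dim_cinemas[cinema_key]["city_key"]]
    return dim_states[city["state_key"]]["state"]


tickets_by_state = {}
for row in fact_tickets:
    state = state_of(row["cinema_key"])
    tickets_by_state[state] = tickets_by_state.get(state, 0) + 1
```

Each city and state name is stored once, which saves space, but answering a question at state level needs extra joins.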

If you do not understand the basic concepts of data warehousing, please visit my previous article.

If we consider the data warehousing scenario detailed here, below is what a snowflake schema would look like.

Click here to compare the same design with star schema.

Characteristics of Snowflake schema:

Less storage required because of normalized dimensions
Less time to load compared to star schema
Better slicing and dicing
Join complexity increases
Query time may increase a little, depending on the scenario

Click here to learn more about different types of schema.

Star Schema - The simplest of all

Star Schema: It is the simplest data warehouse schema. It is called a star because it resembles a star with points radiating out from the center. The fact table is the center of the star and the dimensions sit at the points.
Usually the fact table in a star schema is in third normal form and the dimensions are de-normalized. It is the simplest and most commonly used schema.
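A small sketch of a de-normalized dimension, assuming a Dim_Cinemas table that carries city and state directly on every row; the city and state columns and the sample values are hypothetical.

```python
# De-normalized dimension: city and state are repeated on each cinema row.
dim_cinemas = {
    1: {"cinema": "C1", "city": "City A", "state": "State X"},
    2: {"cinema": "C2", "city": "City A", "state": "State X"},
    3: {"cinema": "C3", "city": "City B", "state": "State Y"},
}

# One fact row per ticket sold.
fact_tickets = [
    {"cinema_key": 1},
    {"cinema_key": 1},
    {"cinema_key": 2},
    {"cinema_key": 3},
]

# A single lookup per fact row reaches any attribute of the dimension.
tickets_by_state = {}
for row in fact_tickets:
    state = dim_cinemas[row["cinema_key"]]["state"]
    tickets_by_state[state] = tickets_by_state.get(state, 0) + 1
```

The repeated city and state values cost some space, but every attribute is one join away from the fact table.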

If you do not understand the basic concepts of data warehousing, please visit my previous article.

If we consider the data warehousing scenario detailed here, below is what a star schema would look like.

Although a star schema requires just one fact table at the center, you can certainly have multiple stars in your design.

Characteristics of star schema

Simplest design
Better query performance
A little extra time to load the data into dimensions because of de-normalization
More space because of redundancy in dimensions

Click here to learn more about different types of schema.