
Saturday, September 8, 2018

Shrunken Dimension - Very useful for higher level summary

A Shrunken Dimension is a special type of dimension used when a higher level of summary is required and the lower-grain data is already available in the warehouse.

Before we dive in, it is important to understand what the grain of a dimension is.

Month could be a shrunken dimension of Date, connected to a fact whose grain is monthly, such as monthly sales. But that is the classic example, so let us dig a little deeper with another one.
Considering the Online Movie Ticketing business scenario, we would have 2 tables: a fact for ticket sales and a related dimension for the Venue where the ticket was sold.
If the customer needs to view total consolidated sales at the city level rather than the venue level, we could create a separate dimension for City and a separate fact table for consolidated sales. The image below demonstrates this:
Shrunken Dimension Example

In the example above we created the base dimension and the shrunken dimension separately. If needed, we could also derive the shrunken dimension from the base dimension once it is loaded. Again, it depends entirely on the business requirements and the type of source system you are dealing with.
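Below is a minimal sketch of the second approach, deriving the shrunken City dimension and a city-level summary fact from an already loaded Venue dimension, using SQLite purely for illustration. The table and column names (DIM_VENUE, FACT_TICKET_SALES, etc.) are assumptions, not the exact names from the diagram.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE DIM_VENUE (venue_key INTEGER PRIMARY KEY, venue_name TEXT, city TEXT);
CREATE TABLE FACT_TICKET_SALES (venue_key INTEGER, ticket_amount REAL);

INSERT INTO DIM_VENUE VALUES (1, 'Venue A', 'Delhi'), (2, 'Venue B', 'Delhi'), (3, 'Venue C', 'Mumbai');
INSERT INTO FACT_TICKET_SALES VALUES (1, 250.0), (2, 300.0), (3, 400.0);

-- Shrunken dimension: one row per city, derived from the loaded base dimension.
-- (A real load would also assign a surrogate key to each city.)
CREATE TABLE DIM_CITY AS
SELECT DISTINCT city FROM DIM_VENUE;

-- Higher-grain fact: sales consolidated at city level.
CREATE TABLE FACT_CITY_SALES AS
SELECT v.city, SUM(f.ticket_amount) AS total_sales
FROM FACT_TICKET_SALES f
JOIN DIM_VENUE v ON v.venue_key = f.venue_key
GROUP BY v.city;
""")
print(cur.execute("SELECT * FROM FACT_CITY_SALES").fetchall())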

Static Dimension in Datawarehouse

Static Dimensions are not extracted from the data source but created in the warehouse itself, in the context of the warehouse. They are usually loaded just once, either manually or with the help of a procedure. The data in these dimensions never changes, or changes very rarely.

A few common examples of static dimensions would be "DIM_SOURCE_SYSTEMS", "DIM_STATUS_CODE", or even the date/time dimension.

Please do not confuse this with SCD Type 0. In SCD Type 0 we insert data regularly but never update it, and the data is loaded from the source system.
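As a rough illustration, a one-time static dimension load can be as simple as a hand-written insert script. Only the table name DIM_SOURCE_SYSTEMS comes from the examples above; the columns and values are assumptions.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""
CREATE TABLE DIM_SOURCE_SYSTEMS (
    source_system_key  INTEGER PRIMARY KEY,
    source_system_code TEXT,
    source_system_name TEXT
)""")

# Loaded once, by hand or by a small procedure -- never from a source extract.
cur.executemany(
    "INSERT INTO DIM_SOURCE_SYSTEMS VALUES (?, ?, ?)",
    [(1, "CRM", "Customer Relationship Management"),
     (2, "POS", "Point of Sale"),
     (3, "WEB", "Online Booking Portal")],
)
con.commit()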

Click here to read about the other dimensions in datawarehousing.

Sunday, August 19, 2018

What is an Inferred Dimension and why it is important

Inferred Dimension is also referred to as Early Arriving Fact or Late Arriving Dimension; these are all different names for the same scenario.

If the source systems of your data warehouse are very well designed, you should never face this scenario. But sometimes it happens that the fact data is available in the source while the related dimension data does not yet exist.

Consider the Online Movie Ticketing business scenario. Suppose a salesperson sold bulk movie tickets to an organisation at a discount. He applied the discount with a new promotion code, which he was supposed to punch into the system but forgot. When the ETL processes the data, it will find that the promo code is not available in the dimension.

There are 2 ways to handle this scenario:

  • The first option is to mark the foreign key as unknown. This is the simplest option, but these sales records cannot be used for reporting until the dimension record arrives and the fact is updated accordingly, which is an ETL overhead in itself.
  • The other option is to use an Inferred Dimension. For any such record we create an entry in the dimension table and mark it as inferred, as shown in the image below.
Inferred Dimension
Using this method, our fact record can be used for reporting and a more complete picture is produced. In a later load, when the promo code details become available, the dimension record can be updated automatically, depending on what type of Slowly Changing Dimension it is.
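Here is a minimal sketch of the inferred-member pattern during a fact load, using SQLite. The table and column names (DIM_PROMOTION, is_inferred, and so on) are assumptions for illustration.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE DIM_PROMOTION (
    promo_key   INTEGER PRIMARY KEY AUTOINCREMENT,
    promo_code  TEXT UNIQUE,
    promo_desc  TEXT,
    is_inferred INTEGER DEFAULT 0
);
CREATE TABLE FACT_TICKET_SALES (promo_key INTEGER, ticket_amount REAL);
""")

def get_promo_key(promo_code):
    """Return the surrogate key; create an inferred member if the code is unknown."""
    row = cur.execute(
        "SELECT promo_key FROM DIM_PROMOTION WHERE promo_code = ?", (promo_code,)
    ).fetchone()
    if row:
        return row[0]
    cur.execute(
        "INSERT INTO DIM_PROMOTION (promo_code, promo_desc, is_inferred) "
        "VALUES (?, 'Unknown - inferred', 1)",
        (promo_code,),
    )
    return cur.lastrowid

# The fact row arrives before the promotion has been set up in the source.
cur.execute("INSERT INTO FACT_TICKET_SALES VALUES (?, ?)",
            (get_promo_key("BULK2018"), 1500.0))
con.commit()

When the real promotion details arrive in a later extract, the ETL updates this row and clears the is_inferred flag, subject to the SCD type of the dimension.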

Sunday, August 12, 2018

OLTP vs OLAP explained

Both OLTP and OLAP are databases designed for different purposes.

OLTP stands for Online Transaction Processing and is optimized for insert, update and delete operations. This is the reason these databases are used behind applications for regular business transactions.

OLAP stands for Online Analytical Processing; these databases are optimized for read operations and are widely used for analyzing, mining and reporting on large datasets.

It is important to note here that the relational database in which we store all the facts and dimensions is also OLTP. From there we move the data to create a cube, which is stored in an OLAP database.

The image below explains the major differences between OLTP and OLAP systems.
Now that we know the basic difference between OLTP and OLAP, let's dig a little deeper.

  • The core of any OLAP system is a cube, which you can think of as a multi-dimensional spreadsheet, whereas in OLTP data is stored in two-dimensional tables.
  • A cube consists of numeric facts called measures, which are categorized by dimensions.
  • The metadata of a cube is the star schema, snowflake schema or fact constellation in the relational (OLTP) database. Measures are derived from the records in the fact table and dimensions are derived from the dimension tables.
  • While we perform CRUD operations in OLTP, there are a few more operations we can perform on a cube, including slicing, dicing and pivoting, as sketched below.
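As a rough illustration of those cube operations, the sketch below uses pandas as a stand-in for a cube over a tiny made-up ticket-sales dataset; the column names and values are assumptions.

import pandas as pd

sales = pd.DataFrame({
    "city":     ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "month":    ["Jan",   "Feb",   "Jan",    "Feb"],
    "pay_mode": ["Card",  "Wallet", "Card",  "Card"],
    "tickets":  [120,     90,       200,     150],
})

# Slice: fix one dimension to a single value.
jan_sales = sales[sales["month"] == "Jan"]

# Dice: restrict several dimensions to subsets of values.
delhi_h1 = sales[sales["month"].isin(["Jan", "Feb"]) & (sales["city"] == "Delhi")]

# Pivot: rotate so months become columns and cities become rows.
pivoted = sales.pivot_table(index="city", columns="month",
                            values="tickets", aggfunc="sum")
print(pivoted)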

What is a fact table in a data warehouse

In data warehousing, a fact table holds the measurements of a business process. It is surrounded by multiple dimensions in a star or snowflake schema.
In the ideal scenario it has 2 types of columns:

  • Foreign keys to the related dimensions
  • Measurable columns like price, etc.
It usually provides the aggregate values (count, sum, etc.) which can be analysed across different dimensions. It is important to define the grain of each fact at the lowest level needed for reporting.


If we consider the data warehouse scenario of Online Ticket Booking, the first fact that comes to mind is for tickets.
So, if at any point the business needs to find out how many tickets were sold in each cinema, they can simply join Fact_Tickets with the Dim_Cinemas table, as detailed in the image below.
Fact



This shows that 2 tickets were sold in cinema C1 and 1 ticket was sold in cinema C2.
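Here is a minimal sketch of that tickets-per-cinema query, using SQLite. The table names Fact_Tickets and Dim_Cinemas come from the post, while the columns and sample rows are assumptions chosen to reproduce the counts above.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE Dim_Cinemas (cinema_key INTEGER PRIMARY KEY, cinema_code TEXT);
CREATE TABLE Fact_Tickets (ticket_id INTEGER, cinema_key INTEGER, price REAL);

INSERT INTO Dim_Cinemas VALUES (1, 'C1'), (2, 'C2');
INSERT INTO Fact_Tickets VALUES (101, 1, 250.0), (102, 1, 250.0), (103, 2, 300.0);
""")

rows = cur.execute("""
    SELECT d.cinema_code, COUNT(*) AS tickets_sold
    FROM Fact_Tickets f
    JOIN Dim_Cinemas d ON d.cinema_key = f.cinema_key
    GROUP BY d.cinema_code
""").fetchall()
print(rows)   # [('C1', 2), ('C2', 1)]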

Sunday, August 5, 2018

Fact Constellation schema in Data warehousing

Fact Constellation: a fact constellation allows you to connect multiple fact tables to shared dimensions. It is a combination of multiple star schemas connected to each other through common dimension tables, hence it is also called a Galaxy or Fact Constellation schema.

Compared to star and snowflake schemas, it is used for more sophisticated applications. The dimension tables may or may not be normalized, as the requirements dictate.

If you do not understand the basic concepts of data warehousing, please visit my previous article.

If we consider the data warehousing scenario detailed here, below is how the galaxy or fact constellation schema would look.


Fact Constellation or Galaxy Schema

Characteristics of Fact Constellation schema:

  • Complex design
  • Different variants of aggregations
  • Suitable for more sophisticated applications
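As a rough sketch of the idea, the DDL below defines two fact tables that share the same dimension tables, i.e. two stars joined into a constellation. All names are assumptions based on the online movie ticketing scenario.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Dim_Cinemas (cinema_key INTEGER PRIMARY KEY, cinema_name TEXT);
CREATE TABLE Dim_Date    (date_key   INTEGER PRIMARY KEY, full_date   TEXT);

-- Star 1: ticket sales by cinema and date.
CREATE TABLE Fact_Tickets (
    cinema_key INTEGER REFERENCES Dim_Cinemas(cinema_key),
    date_key   INTEGER REFERENCES Dim_Date(date_key),
    price      REAL
);

-- Star 2: show-level occupancy, sharing the same dimensions.
CREATE TABLE Fact_Shows (
    cinema_key   INTEGER REFERENCES Dim_Cinemas(cinema_key),
    date_key     INTEGER REFERENCES Dim_Date(date_key),
    seats_filled INTEGER
);
""")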

Sunday, July 22, 2018

Bridge Tables in datawarehouse

As the name suggests, a bridge table is used to link tables when direct linking does not make sense. It is usually used in slightly more complicated scenarios where we have a many-to-many relationship between two tables.
Example: consider multiple customers linked to a single bank account balance. Putting multiple customer keys in that fact would create obvious problems in the fact table. A sketch of the bridge is shown below.
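Below is a minimal sketch of such a bridge in SQLite DDL; all table names, and the weighting_factor column, are assumptions for illustration.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Dim_Customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE Dim_Account  (account_key  INTEGER PRIMARY KEY, account_no    TEXT);

-- The bridge: one row per (account, customer) pair.
CREATE TABLE Bridge_Account_Customer (
    account_key      INTEGER REFERENCES Dim_Account(account_key),
    customer_key     INTEGER REFERENCES Dim_Customer(customer_key),
    weighting_factor REAL    -- e.g. 0.5 each for a joint account with two holders
);

-- The fact stays at account grain, carrying a single account key.
CREATE TABLE Fact_Account_Balance (
    account_key INTEGER REFERENCES Dim_Account(account_key),
    balance     REAL
);
""")

A query that needs customer-level balances joins the fact to the bridge and then to Dim_Customer, optionally multiplying by the weighting factor to avoid double counting.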

A bridge table is also sometimes used to manage variable-depth hierarchies, e.g. to maintain one primary customer and one or more secondary customers.

Bridge tables come to the rescue in such situations, but just like outrigger dimensions, do not overuse them. Remember, every query has to pay a toll to navigate the bridge.

Although it is not a dimension, I personally call it a bridge dimension because it has many properties of a dimension.

Click here to read about the other dimensions in datawarehousing.

Tuesday, July 17, 2018

What is Outrigger Dimension and its purpose

Definition: As per Kimball, when a dimension in the warehouse is linked to another dimension table, the secondary table is called an outrigger dimension. For example, a Customer dimension referring to the Date dimension for the date of account creation.

Now, it might look like (incomplete) snowflaking, but it is not. There are a few differences, as detailed below:

  • Snowflaking makes the structure highly normalized, whereas an outrigger makes it only partially normalized.
  • Snowflaking complicates the ETL and reduces query performance.
Outrigger dimensions, on the other hand, are similar to snowflaking but are very limited in nature. They are just one or more levels removed from the fact table. They are mostly used when there is a business requirement to slice and dice on a particular column.

Consider the following criteria before creating an outrigger dimension:
  • User Analytics Needed: If the user wants to slice and dice on the column
  • Data Granularity vs Cardinality: a column with low granularity but high cardinality is a candidate.
  • Update Frequency: the column is updated regularly, though not more than about once a month.
  • The data is already separated into its own table in the source system.
  • A customer and its location would be a good candidate for an outrigger dimension.
Outrigger Dimension
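Here is a minimal sketch of the customer/date example in DDL form, using SQLite; the column names are assumptions.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Dim_Date (
    date_key  INTEGER PRIMARY KEY,
    full_date TEXT,
    month     TEXT,
    year      INTEGER
);

-- Dim_Customer still joins directly to the fact table; only the
-- account-creation date is "outrigged" to Dim_Date.
CREATE TABLE Dim_Customer (
    customer_key             INTEGER PRIMARY KEY,
    customer_name            TEXT,
    account_created_date_key INTEGER REFERENCES Dim_Date(date_key)
);
""")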

For a list of dimensions in a data warehouse, click here

Tuesday, July 10, 2018

What is Conformed Dimension and its advantages

People usually find it difficult to understand the conformed dimension, so I will try to explain it with an example. To start with, please do not confuse it with a confirmed or static dimension just because of the name.

Definition: A conformed dimension is defined as a dimension which has the same meaning for all the facts in a data warehouse, even if the facts belong to different data marts.
It could be one single table or multiple copies of the same table across different marts. There is no hard rule that the copies must have the same table name; one copy could even be a subset of another.

Let us now try to understand it with an example. Consider the online movie tickets data warehouse scenario.
Let's assume we have divided our EDW into 2 data marts:

  • Marketing: the marketing team is more concerned with the number of tickets sold.
  • Finance: the finance team is more concerned with the profit.


Both teams need to know the stats for each cinema. Now, both teams might also use different terminology: the marketing team might call it Cinema whereas the finance team might call it Screen.
So in this scenario we could create 2 copies of the same table with different names, as shown below.
Conformed dimension explained
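One simple way to physically implement the two copies is a single conformed table exposed under two names, one per mart. The sketch below uses SQLite views; all names are assumptions based on the scenario above.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Dim_Cinemas (cinema_key INTEGER PRIMARY KEY, cinema_name TEXT, city TEXT);

-- The marketing mart keeps the original name; the finance mart calls it "Screen".
CREATE VIEW Dim_Cinemas_Marketing AS
    SELECT * FROM Dim_Cinemas;
CREATE VIEW Dim_Screens_Finance AS
    SELECT cinema_key AS screen_key, cinema_name AS screen_name FROM Dim_Cinemas;

-- Each mart joins its fact to the same underlying conformed rows.
CREATE TABLE Fact_Tickets_Marketing (cinema_key INTEGER, tickets_sold INTEGER);
CREATE TABLE Fact_Profit_Finance    (cinema_key INTEGER, profit REAL);
""")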


But why should we use it? Why not just create 2 separate tables if we need them? Below are a few disadvantages of not creating conformed dimensions:

  • There will be more ETL work to build the logic for the different tables.
  • More ETL surely means extra maintenance effort.
  • And possible difficulties when joining across data marts.

Visit my post on Types of Dimensions to understand more about different types of dimensions.

Monday, July 9, 2018

List of dimensions in a Data Warehouse

Below is the list of dimension types we should consider while designing any data warehouse. Click on a link below for a detailed definition and example.


This list will keep improving with time, so you can bookmark this page and feel free to subscribe to the blog by entering your email.

Friday, July 6, 2018

What is Junk Dimension and its purpose

Definition: A junk dimension is a collection of random transactional codes and flags. It is usually created by combining multiple low-cardinality dimensions into one.
As we merge the different dimensions into one, it reduces the number of foreign keys in the fact table.

If we consider the data warehouse scenario of Online Ticket Booking, the design would have the following 3 tables, with the fact having 2 foreign keys, one for each dim.

  • Fact_Tickets
  • Dim Transaction Type
  • Dim Cert Type

Both dims are of very low cardinality (i.e. the number of records in each would be very low).
Transaction Type: the values could be Net-banking, Card or Wallet. The chances of this count increasing are very low.
Movie Certificate: the values could be U, A or U/A. Again, these rarely increase or change.

The picture below describes the 2 possible designs, if we keep these attributes in dimension tables.
Junk dimension


So with the help of the junk dimension, we converted 2 foreign keys into one.
No of Records: Notice that the number of records in the junk dimension is the product of the number of records in each source dimension. In the example above there are 3 records each for Transaction and Certificate, so the total count in the junk dimension is 9 (3*3).
If we were merging 3 tables with 3 records each, the number of records in the junk dimension would be 27 (3*3*3), as the sketch below illustrates.
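Here is a rough illustration of building the junk dimension as the cross product of the two attribute sets above; the surrogate-key scheme is an assumption.

from itertools import product

transaction_types = ["Net-banking", "Card", "Wallet"]
certificates      = ["U", "A", "U/A"]

# One surrogate key per combination: 3 x 3 = 9 rows.
dim_junk = [
    {"junk_key": key, "transaction_type": t, "certificate": c}
    for key, (t, c) in enumerate(product(transaction_types, certificates), start=1)
]

for row in dim_junk:
    print(row)           # the fact table now carries only junk_key
print(len(dim_junk))     # 9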

Ensure the dimensions we junk together are of very low cardinality. I would suggest that the number of records in the junk dimension stay below a thousand.

Please feel free to post your queries in the comments section and follow this blog.
If you liked the post, please share it.
Subscribe by email to receive new articles in your inbox.