Learning and Sharing

Monday, August 26, 2019

Machine Learning : Classification vs Regression Algorithm

As shared in my previous post, algorithm selection is a very important and often confusing part of any Machine Learning project. It is difficult for a newbie to understand which algorithm best suits the requirements and the data.
Before selecting an algorithm, one needs to understand what type of predictive algorithm is needed.

In most ML projects, one comes across situations that fall under the umbrella of supervised machine learning (we will talk about unsupervised ML in some other article). There are 2 types of supervised algorithms:
1.) Classification: A classification problem is when your output is a category. For example, deciding whether a person is Rich, Middle Class or Poor based on income and other parameters. In this case the output can only be one of a fixed set of categories.
2.) Regression: A regression problem is when your output is a numeric value. For example, predicting the weight of a person based on appearance parameters.
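
The difference between the two can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the income and height/weight numbers are made up, and the function names (classify_income, predict_weight) are hypothetical. The classifier is a simple 1-nearest-neighbour lookup and the regressor is a least-squares line, chosen just because they are short, not because they are the right choice for real data.

```python
from statistics import mean

# --- Classification: the output is a category -------------------------
# Made-up labelled examples: (annual income, class label).
labelled_incomes = [
    (12_000, "Poor"), (18_000, "Poor"),
    (45_000, "Middle Class"), (60_000, "Middle Class"),
    (150_000, "Rich"), (220_000, "Rich"),
]

def classify_income(income):
    """Return the label of the closest known income (1-nearest-neighbour)."""
    _, label = min(labelled_incomes, key=lambda pair: abs(pair[0] - income))
    return label

# --- Regression: the output is a number -------------------------------
# Made-up training data: height in cm -> weight in kg.
heights = [150, 160, 170, 180, 190]
weights = [50, 58, 66, 74, 82]

# Fit weight = slope * height + intercept by ordinary least squares.
mean_h, mean_w = mean(heights), mean(weights)
slope = (sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))
         / sum((h - mean_h) ** 2 for h in heights))
intercept = mean_w - slope * mean_h

def predict_weight(height_cm):
    """Return a predicted weight in kg (a continuous value, not a label)."""
    return slope * height_cm + intercept

print(classify_income(50_000))   # a category label
print(predict_weight(175))       # a numeric value
```

Note that the classifier can only ever return one of the labels it has seen, while the regressor can return any number.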

Once we understand the type of our problem, we can look for the algorithms in that category.
Hope this clarifies.

Thursday, August 1, 2019

How to learn machine learning?

Machine Learning is the new hype in the IT industry and everyone is talking about it. I started learning ML a couple of months back.
I built some hello world models using Python and Microsoft tools. I spent some time learning Python (which is a fairly easy task), completed a certification and put in a decent amount of time.
But even after spending a month, all I knew was some terms, with no real knowledge behind them. I was not confident that I could handle it.

Although I am still learning, I now understand what machine learning is. It can be done in R, Python, Microsoft tools or anything else. Learning a language or tool is the easiest part of ML; it is the concepts that matter.

After getting hands-on with the language of your choice, you will have to answer the critical questions, like which algorithm to use, which cleansing method to use and a lot more.
The pie chart below lists all the things that need to be done for any ML model.

In this series of articles I will try to cover all the concepts that I feel are important for beginners.

P.S.: You do not have to be good at stats (you can learn the basics during practice), as 99% of us will never need to write a new algorithm. All we need to do is understand the problem and decide on the right set of algorithms for the solution.

Saturday, September 8, 2018

Shrunken Dimension - Very useful for higher level summary

A Shrunken Dimension is a special type of dimension used when a higher-level summary is required and the lower-grain data is already available in the warehouse.

Before we dive in, it is important to understand what the grain of a dimension is.

Month could be a shrunken dimension of Date, connected to a fact that has a grain of month, like monthly sales. But this is the classic example, so let us dive a little deeper with another one.
Consider an Online Movie Ticketing business scenario. We would have 2 tables: a fact for Ticket Sales and a related dimension for the Venue that sold the ticket.
If the customer needs to view the total consolidated sales at City level rather than Venue level, we could create a separate dimension for City and a separate fact table for Consolidated Sales. The image below demonstrates the same:

In the example above we have created the base dimension and the Shrunken dimension separately. If needed, we could also create the Shrunken dimension from the base dimension once it is loaded. Again, it totally depends on the business requirements and the type of source system you are dealing with.
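
Deriving the Shrunken dimension from the base dimension can be sketched in plain Python as follows. This is a minimal illustration under assumptions: the rows, the column names (venue_key, city_key, amount and so on) and the function names are all hypothetical, and in a real warehouse this would normally be done in SQL or an ETL tool.

```python
from collections import defaultdict

# Base dimension at Venue grain (made-up rows).
dim_venue = [
    {"venue_key": 1, "venue_name": "Cinema A", "city": "Pune"},
    {"venue_key": 2, "venue_name": "Cinema B", "city": "Pune"},
    {"venue_key": 3, "venue_name": "Cinema C", "city": "Delhi"},
]

# Fact at Venue grain (made-up rows).
fact_ticket_sales = [
    {"venue_key": 1, "amount": 200.0},
    {"venue_key": 2, "amount": 150.0},
    {"venue_key": 3, "amount": 300.0},
    {"venue_key": 1, "amount": 100.0},
]

def build_dim_city(venues):
    """Shrunken dimension: one row per distinct city, with its own key."""
    cities = sorted({v["city"] for v in venues})
    return [{"city_key": i, "city": c} for i, c in enumerate(cities, start=1)]

def build_fact_consolidated_sales(sales, venues, dim_city):
    """Roll the venue-grain fact up to city grain."""
    venue_to_city = {v["venue_key"]: v["city"] for v in venues}
    city_to_key = {c["city"]: c["city_key"] for c in dim_city}
    totals = defaultdict(float)
    for row in sales:
        city_key = city_to_key[venue_to_city[row["venue_key"]]]
        totals[city_key] += row["amount"]
    return [{"city_key": k, "total_amount": t} for k, t in sorted(totals.items())]

dim_city = build_dim_city(dim_venue)
fact_consolidated_sales = build_fact_consolidated_sales(
    fact_ticket_sales, dim_venue, dim_city
)
print(dim_city)
print(fact_consolidated_sales)
```

The City dimension carries only the attributes that make sense at the higher grain, and the consolidated fact references city_key instead of venue_key.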

Static Dimension in Datawarehouse

Static Dimensions are not extracted from the data source, but are created in the warehouse itself, in the context of the warehouse. They are usually loaded just once, either manually or with the help of a procedure. The data in such a dimension does not change, or changes very rarely.

A few common examples of static dimensions are "DIM_SOURCE_SYSTEMS", "DIM_STATUS_CODE" and the date/time dimensions.

Please do not confuse a static dimension with SCD Type 0. In SCD Type 0, we insert data regularly but never update it. Also, SCD Type 0 dimensions are loaded with data extracted from the source.
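
As a rough sketch of what "created in the warehouse itself" can look like, the Python below defines a status code dimension directly in code and generates a date dimension for a fixed range. Everything here is an assumption for illustration: the status rows, the column names and the build_dim_date helper are hypothetical, and a real load would typically be a one-off SQL script or stored procedure.

```python
from datetime import date, timedelta

# A static dimension defined by hand; nothing is read from a source system.
DIM_STATUS_CODE = [
    {"status_key": 1, "status_code": "A", "description": "Active"},
    {"status_key": 2, "status_code": "I", "description": "Inactive"},
    {"status_key": 3, "status_code": "C", "description": "Cancelled"},
]

def build_dim_date(start, end):
    """Generate one row per calendar day between start and end (inclusive)."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date_key": current.year * 10000 + current.month * 100 + current.day,
            "full_date": current,
            "year": current.year,
            "month": current.month,
            "day_of_week": current.isoweekday(),  # 1 = Monday ... 7 = Sunday
        })
        current += timedelta(days=1)
    return rows

# Loaded once for the whole range the warehouse needs.
dim_date = build_dim_date(date(2018, 1, 1), date(2018, 12, 31))
print(len(dim_date), dim_date[0]["date_key"], dim_date[-1]["date_key"])
```

Because the rows are generated rather than extracted, the load runs once and only needs to be repeated if the range or the list of codes has to be extended.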

Click here to read about the other dimensions in data warehousing.

Sunday, September 2, 2018

Microsoft Teams is now Free - Can it be the Game Changer

Microsoft Teams is a platform that combines workplace chat, meetings, notes, and attachments. The service integrates with the company's Office 365 subscription office productivity suite, including Microsoft Office and Skype, and features extensions that can integrate with non-Microsoft products.

Microsoft initially released Teams in 2017, and recently announced a free version of it, which could be very useful for new startups or educational groups that want to collaborate for free with lots of features.

So what does the free version of Microsoft Teams allow you to do?

Add up to 300 members to collaborate, with 2 GB of free space per user.
10 GB shared storage
Unlimited messaging and search.
Secure File Sharing
One on one Meetings with audio and video facilities.
Channel Meetings for Multiple People in team
Screen Sharing
MS Word, PowerPoint, Excel and OneNote
Access to 140+ other apps and services
All this is available over the cloud, which means it is not device specific. Just log in with your account and you get all the work you saved from other machines.