Sunday, November 15, 2015

Cluster analysis and K-means clustering

Cluster Analysis

When to use cluster analysis
1. To figure out what to predict
2. To detect patterns of interest in the data when you do not know enough about which patterns to expect
3. As an exploratory tool to understand the data



What is cluster analysis

PCA = Reduce the number of columns

Cluster analysis = Reduce the number of rows

Cluster Analysis:
1. Group objects (typically rows, records, or observations) in a data set based on the similarity of their properties or attributes (columns)
2. There are at least a few hundred (if not more) approaches for performing cluster analysis


Popular approaches to cluster analysis
1. Hierarchical (agglomerative) clustering, such as single linkage, average linkage, and complete linkage clustering
2. Partitioning clustering methods, such as K-means and K-medians for numeric data and K-modes for categorical data
3. Overlapping clustering methods
4. Latent class methods

K-means clustering
When performing partitioning (K-means) and overlapping clustering (overlapping K-centroids):
1. Normalize, normalize, normalize. Distance-based methods are dominated by variables with large ranges, so put all variables on a comparable scale first
2. Severe local optima: run the procedure from at least 50 random starts when clustering into K groups. Otherwise, there is a high risk of ending up with suboptimal clusters
3. How many clusters to choose
 a. Use a scree plot and interpretability
 b. If you don't like a solution, don't use it
 c. Unless a population is perfectly multimodal, i.e. has exactly K modes, one can extract as many clusters as one wants
4. Decide how you will handle missing data (missing values in some variables)
 a. Replace missing values in a column with the mean of the non-missing values of that column
 b. Ignore a row entirely if any of its columns has a missing value
 c. Impute the missing values using predictive modeling techniques
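
The checklist above can be sketched in a few lines of code. This is a minimal illustration in plain Python, not a production implementation: the helper names (standardize, kmeans_once, kmeans) and the toy data are assumptions made up for this example. It z-scores each column, runs K-means from 50 random starts, keeps the run with the lowest within-cluster sum of squares (WSS), and collects the WSS for several values of K, which are the numbers a scree plot would show.

```python
import random
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column so that no variable dominates the distance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(rows, k, rng, max_iter=100):
    """One K-means run from a random start; returns (wss, labels, centers)."""
    centers = rng.sample(rows, k)  # k distinct rows as initial centers
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centers[j]))
                      for r in rows]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old center if a cluster became empty
                centers[j] = [mean(c) for c in zip(*members)]
    wss = sum(sq_dist(r, centers[lab]) for r, lab in zip(rows, labels))
    return wss, labels, centers

def kmeans(rows, k, n_starts=50, seed=0):
    """Best of n_starts random starts, judged by within-cluster sum of squares."""
    rng = random.Random(seed)
    runs = (kmeans_once(rows, k, rng) for _ in range(n_starts))
    return min(runs, key=lambda run: run[0])

# Toy data (made up): two variables on very different scales.
data = [[1.0, 1000.0], [1.2, 1100.0], [0.9, 900.0],
        [5.0, 5000.0], [5.2, 5100.0], [4.8, 4900.0]]
z = standardize(data)
wss, labels, centers = kmeans(z, k=2)

# WSS for K = 1..4; plotting these against K gives the scree (elbow) plot.
wss_by_k = {k: kmeans(z, k)[0] for k in range(1, 5)}
print(labels)
print(wss_by_k)
```

On this toy data the WSS drops sharply from K = 1 to K = 2 and only slightly afterwards, which is the kind of elbow one looks for in a scree plot.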




Case study
Assume you are a business analyst in a large finance company operating in the secondary (financial derivatives) market. You have access to their monthly operations database, with 500,000 trades and 200 variables, and are asked to summarize the key patterns and relationships in the data. How would you proceed?

The question above centers on cluster analysis, so I will focus on explaining cluster analysis and how K-means clustering can be useful here.

The stated problem can be framed as an unsupervised classification (clustering) problem, and we can use K-means clustering to address it.

First, let me explain when clustering is useful:
- To figure out what to predict
- To detect patterns of interest in the data when we do not know enough about which patterns to expect
- As an exploratory tool to understand the data

Given the available data set (500,000 trades with 200 variables), we can use cluster analysis to identify key patterns and relationships in the data.
Here is how I would proceed:
- Check the distribution of each variable and whether it is approximately normal
- Use boxplots to detect outliers
- Check for missing values as well as outliers
- Treat outliers and missing values by imputing and/or capping/flooring
- The whole idea is to normalize the data as much as possible
- Plot the data using plot() or scree() in R and try to identify the possible groups
- Use K-means clustering to create the clusters
- For each resulting cluster, find the mean values of the principal components
- Draw conclusions about each group for business purposes
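
The preprocessing and profiling steps in this list can be sketched as follows. The plan above is described in terms of R; this sketch uses plain Python purely for illustration. The function names (impute_mean, cap_floor, cluster_means), the 5th/95th percentile cutoffs, and the sample values are all assumptions for this example, not part of the case study.

```python
from statistics import mean, quantiles

def impute_mean(column):
    """Replace missing values (None) with the mean of the non-missing values."""
    present = [x for x in column if x is not None]
    m = mean(present)
    return [m if x is None else x for x in column]

def cap_floor(column, lower_pct=5, upper_pct=95):
    """Clamp values outside the chosen percentiles (capping/flooring)."""
    cuts = quantiles(column, n=100, method="inclusive")  # 99 percentile cut points
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(x, lo), hi) for x in column]

def cluster_means(rows, labels):
    """Per-cluster mean of every column, used to profile each group."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {label: [mean(col) for col in zip(*members)]
            for label, members in groups.items()}

# Hypothetical trade variable with one missing value and one extreme value.
notional = [10.0, 12.0, None, 11.0, 500.0, 9.0]
imputed = impute_mean(notional)
capped = cap_floor(imputed)

# Hypothetical cluster labels for three already-cleaned rows.
profile = cluster_means([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]], [0, 0, 1])
print(capped)
print(profile)
```

Note that the mean used for imputation is itself pulled up by the extreme value, so in practice the order of the two steps (cap first or impute first) is a judgment call.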