Saturday, November 14, 2015

Principal Component/Factor Analysis

When does one use PCA in Predictive Modelling
1. Large number of quantitative predictor variables
a. Hard to understand the variables and the patterns of interaction among them
b. Hard to figure out which ones to use as predictors, either as main effects or as interactions, in the predictive models
2. Improve the overall quality of the predictor variables
a. Prediction in hold-out samples can become worse as one adds more predictor variables, especially noisy or correlated variables that add little to the predictive power of the model

3. Reduce the degree of relatedness (similarity or correlation) among the variables
a. Moderate to high correlation among predictors makes it difficult for logistic regression/multinomial logit models to converge and causes errors in estimating the coefficients of multiple linear regression models
b. Even if the models converge, the standard errors get inflated, which lowers the t-values; coefficients are then incorrectly interpreted as not statistically significant, and the variables are eliminated from the models
4. Determine the underlying theme of a complex construct that is measured by multiple variables
a. Many variables are very similar (high covariance or correlation between them) or measure the same underlying concept
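
The effect of correlated predictors described in point 3 can be made concrete with the variance inflation factor (VIF), which is not mentioned above but is a common way to quantify how much a coefficient's variance is inflated by correlation with the other predictors. The sketch below is only illustrative: the data are synthetic, and the helper name `variance_inflation_factors` is made up for this example.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all the other columns (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = x1 + 0.1 * rng.normal(size=1000)  # nearly a copy of x1
x3 = rng.normal(size=1000)             # unrelated to the other two
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

The two nearly identical predictors get very large VIFs, while the unrelated one stays close to 1. A commonly quoted rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.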

Methodology for Predictive Modelling via PCA
1. Perform principal component analysis on the training sample
2. Retain the top principal components that account for a substantial portion of the variance in the data, using a scree plot
3. Predict the principal component scores for the retained components in the hold-out sub-sample
4. Build a logistic regression model to predict the churn/stay dependent variable in the training sample, using the retained principal components as predictors
5. Predict churn/stay in the hold-out sub-sample of size 2000 using
a. the coefficients from the logistic regression model from the previous step and
b. the principal component scores predicted in (3) from the principal components model
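
The five steps above can be sketched end to end in plain NumPy. Everything specific here is an assumption made for illustration: the data are synthetic (three latent drivers behind twelve correlated predictors), the 80% cumulative-variance cutoff stands in for reading a scree plot by eye, and the logistic regression is a bare gradient-descent fit rather than a library routine.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data: 3 latent drivers, each measured by 4 noisy predictors.
n, p = 5000, 12
latent = rng.normal(size=(n, 3))
loadings = np.kron(np.eye(3), np.ones((1, 4)))          # shape (3, 12)
X = latent @ loadings + 0.5 * rng.normal(size=(n, p))
churn = (latent[:, 0] - 0.5 * latent[:, 1]
         + 0.5 * rng.normal(size=n) > 0).astype(float)

X_train, X_hold = X[:3000], X[3000:]                    # hold-out of size 2000
y_train, y_hold = churn[:3000], churn[3000:]

# Step 1: PCA on the training sample only (standardize, then eigendecompose).
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
Z_train = (X_train - mu) / sd
eigvals, eigvecs = np.linalg.eigh(np.cov(Z_train, rowvar=False))
order = np.argsort(eigvals)[::-1]                       # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 2: retain the leading components (cutoff replaces the scree plot).
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.80) + 1)
W = eigvecs[:, :k]

# Step 3: scores for both samples, using training means, scales and loadings.
S_train = Z_train @ W
S_hold = ((X_hold - mu) / sd) @ W

# Step 4: logistic regression on the retained components (gradient descent).
def fit_logistic(S, y, lr=0.1, steps=2000):
    A = np.column_stack([np.ones(len(S)), S])
    beta = np.zeros(A.shape[1])
    for _ in range(steps):
        p_hat = 1.0 / (1.0 + np.exp(-A @ beta))
        beta -= lr * A.T @ (p_hat - y) / len(y)
    return beta

beta = fit_logistic(S_train, y_train)

# Step 5: predict churn/stay in the hold-out sample from its component scores.
A_hold = np.column_stack([np.ones(len(S_hold)), S_hold])
pred = (1.0 / (1.0 + np.exp(-A_hold @ beta)) > 0.5).astype(float)
accuracy = (pred == y_hold).mean()
print(k, round(float(accuracy), 3))
```

Note that the hold-out sample is never used to estimate the means, scales, or loadings. It is only projected onto the components learned from the training sample, which is what step 3 asks for.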

What is Principal component analysis
Principal Component Analysis is one of the approaches to Factor Analysis that:
- Can be characterized or interpreted as a rotation of the original variables to a new set of primary axes (or dimensions)
- Transfers the maximum amount of information from the original set of variables to a smaller set of new variables
- Hence typically results in a lower-dimensional space composed of the new variables, which is why it is called an approach for lower-dimensional representation of the original data
- Variables in the new space are called principal components

Principal components are mutually orthogonal
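
A small numerical check of the rotation and orthogonality properties, using synthetic data and plain NumPy (both assumptions made for this illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # correlated columns
Xc = X - X.mean(axis=0)

# Eigenvectors of the covariance matrix are the principal axes.
eigvals, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals, V = eigvals[::-1], V[:, ::-1]                   # largest variance first

scores = Xc @ V                                          # data in the rotated axes
gram = V.T @ V                                           # identity if axes are orthonormal
score_cov = np.cov(scores, rowvar=False)                 # diagonal if scores are uncorrelated

print(np.round(gram, 6))
print(np.round(score_cov, 6))
```

`gram` comes out as the identity matrix, so the new axes are orthonormal, and `score_cov` is diagonal with the eigenvalues on the diagonal, so the component scores are uncorrelated. The total variance (the trace) is the same before and after the rotation.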

Case Study

Assume you are given a data set with 500,000 observations and 200 variables and are asked to report on (a) the most useful predictive models to build (b) their specifications in terms of dependent and independent variables and (c) their business impact.

How would you use PCA/FA in answering these questions?

Before building a predictive model, we first have to think about the following:

- What are the dependent and independent variables?
- What are the patterns or interactions among the variables?
- How are the variables correlated? Does multicollinearity exist?

Considering the given case with 500,000 observations and 200 variables, we have to deal with the following issues:
- Given the large number of variables
a. it is hard to find the patterns among the variables
b. there could be a multicollinearity problem
c. it will be hard to find the most impactful variables
d. it is unclear which variables have the most impact on the dependent variable
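
With 200 variables there are 19,900 distinct pairs, so the correlation structure cannot be inspected by eye. One possible first screen for points (a) and (b) is to list the pairs whose absolute correlation exceeds a cutoff. The helper name `highly_correlated_pairs`, the 0.8 cutoff, and the small synthetic matrix below are all assumptions for illustration, not part of the case data.

```python
import numpy as np

def highly_correlated_pairs(X, threshold=0.8):
    """Return (i, j, r) for every column pair with |correlation| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)
    i, j = np.triu_indices_from(corr, k=1)   # each pair once, no diagonal
    mask = np.abs(corr[i, j]) >= threshold
    return [(int(a), int(b), float(corr[a, b])) for a, b in zip(i[mask], j[mask])]

# Small synthetic example: column 3 is nearly a duplicate of column 0.
rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 6))
X[:, 3] = X[:, 0] + 0.2 * rng.normal(size=1000)
pairs = highly_correlated_pairs(X)
print(pairs)
```

A long list of flagged pairs is a signal that the variables overlap heavily and that a lower-dimensional representation such as PCA may be worthwhile.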

Business Impact:
Even if we managed to develop a model using all 200 variables,
- the standard errors get inflated
- this can lower the t-values, resulting in incorrect interpretation of the coefficients

We can use PCA as a technique within FA to overcome the problems stated above.
Under PCA,
- we rotate the original set of variables to a new set of primary variables/dimensions
- we transfer the maximum amount of information from the original variables to a smaller set of new variables, losing as little information as possible
- this is also called a lower-dimensional representation of the original data, and the variables in the new space are called principal components

Methodology to be used under PCA
- Create training and test sets using an 80/20 split
- Perform principal component analysis on the training sample
- Retain the top principal components that account for a substantial portion of the variance in the data, using a scree plot
- Predict the principal component scores for the retained components in the test data

Model building:
- Using the retained principal components as predictors, build the model on the training data
- Test the model on the test data to find its accuracy