Tuesday, July 28, 2015

Uni-Variate Analysis

Week 1 - Univariate Analyses Assignment
Bhupendra Mishra
Monday, July 27, 2015
a. Generate a random sample of 500 observations from the Ecommerce data using R. Save as a dataframe.
ecommerce <- read.delim("C:/Users/Bhupendra Mishra/Desktop/donotbackup/BridgeSchoolMgmt/Bridge School Mgmt/Module2/ecommerce.txt")
ecomm_samp<-ecommerce[sample(1:nrow(ecommerce),500),]
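Because sample() draws at random, each run selects a different 500 rows, so the figures reported below will not reproduce exactly. A minimal sketch of a repeatable draw follows. The data frame toy_data and the seed value 2015 are illustrative assumptions, not part of the assignment.

```r
# Hypothetical stand-in for the ecommerce data (1,000 rows, made-up values)
toy_data <- data.frame(id = 1:1000, session_count = rpois(1000, lambda = 20))

# Fixing the seed makes the random draw repeatable (2015 is arbitrary)
set.seed(2015)
rows_a <- sample(nrow(toy_data), 500)   # 500 row indices, without replacement

set.seed(2015)
rows_b <- sample(nrow(toy_data), 500)   # same seed, so the same indices

toy_samp <- toy_data[rows_a, ]
identical(rows_a, rows_b)  # TRUE
nrow(toy_samp)             # 500
```

Calling set.seed() immediately before sample() is what ties the draw to the seed; any other random call in between would change the result.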
b. Generate univariate profiles of the data, using the summary() function.
attach(ecomm_samp)
summary(ecomm_samp)
##   churn_status session_length_seconds session_count     event_count     
##  Churned:227   Min.   :     0         Min.   :  1.00   Min.   :    1.00 
##  Stayed :273   1st Qu.:  1497         1st Qu.:  4.00   1st Qu.:   40.25 
##                Median :  7549         Median : 16.50   Median :  200.00 
##                Mean   : 30605         Mean   : 62.92   Mean   :  623.54 
##                3rd Qu.: 32575         3rd Qu.: 80.50   3rd Qu.:  746.25 
##                Max.   :616183         Max.   :695.00   Max.   :10931.00 
##  closed_session_event_count open_session_event_count
##  Min.   :   0.0             Min.   :   0.0         
##  1st Qu.:   6.0             1st Qu.:   6.0         
##  Median :  28.0             Median :  26.5         
##  Mean   : 106.2             Mean   : 105.8         
##  3rd Qu.: 127.5             3rd Qu.: 128.5         
##  Max.   :1717.0             Max.   :1714.0         
##  quest_completed_event_count store_purchase_event_count  active_days  
##  Min.   :   0.00             Min.   :  0.000            Min.   : 1.00 
##  1st Qu.:   4.00             1st Qu.:  0.000            1st Qu.: 2.75 
##  Median :  21.00             Median :  0.000            Median : 7.00 
##  Mean   : 127.51             Mean   :  5.126            Mean   :15.13 
##  3rd Qu.:  79.75             3rd Qu.:  3.000            3rd Qu.:24.00 
##  Max.   :4419.00             Max.   :243.000            Max.   :55.00
Inferences:
- churn_status is the only categorical variable; the remaining eight variables are numeric.
- Churn status is split roughly 45:55 (227 churned vs. 273 stayed).
- The summary suggests that the numeric variables are not normally distributed.
- The mean exceeds the median for every numeric variable, so the distributions are positively (right) skewed.
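The mean-versus-median comparison can be checked programmatically rather than by eye. The sketch below copies the means, medians, and churn counts from the summary() output above into small objects (the names stats and churn_counts are illustrative) and tests the claim directly.

```r
# Means and medians copied from the summary() output above
stats <- data.frame(
  variable = c("session_length_seconds", "session_count", "event_count",
               "closed_session_event_count", "open_session_event_count",
               "quest_completed_event_count", "store_purchase_event_count",
               "active_days"),
  mean   = c(30605, 62.92, 623.54, 106.2, 105.8, 127.51, 5.126, 15.13),
  median = c(7549, 16.50, 200.00, 28.0, 26.5, 21.00, 0.000, 7.00)
)

# A mean above the median suggests a long right tail
stats$mean_above_median <- stats$mean > stats$median
all(stats$mean_above_median)  # TRUE

# Share of each churn status in the sample of 500
churn_counts <- c(Churned = 227, Stayed = 273)
round(prop.table(churn_counts), 3)  # Churned 0.454, Stayed 0.546
```

Mean above median is only a rough indicator of right skew; a histogram of each variable would confirm the shape.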
c. Generate pairwise correlation plots of numeric variables and bar charts of categorical or factor variables.
pairs(~session_length_seconds+session_count+closed_session_event_count+open_session_event_count+event_count+quest_completed_event_count+store_purchase_event_count+active_days, main="Pairwise correlation plot")
Inference: Every pair of numeric variables is positively correlated.

barplot(table(churn_status))
vars<-c("session_length_seconds","session_count","closed_session_event_count", "open_session_event_count", "event_count","quest_completed_event_count","store_purchase_event_count","active_days" )
cor(ecomm_samp[vars])
##                             session_length_seconds session_count
## session_length_seconds                   1.0000000     0.8503394
## session_count                            0.8503394     1.0000000
## closed_session_event_count               0.9430646     0.9473013
## open_session_event_count                 0.9418005     0.9474239
## event_count                              0.9596580     0.8924715
## quest_completed_event_count              0.8380808     0.6582946
## store_purchase_event_count               0.5607137     0.4329665
## active_days                              0.6299118     0.8381357
##                             closed_session_event_count
## session_length_seconds                       0.9430646
## session_count                                0.9473013
## closed_session_event_count                   1.0000000
## open_session_event_count                     0.9998514
## event_count                                  0.9574356
## quest_completed_event_count                  0.7595074
## store_purchase_event_count                   0.5387210
## active_days                                  0.7450125
##                             open_session_event_count event_count
## session_length_seconds                     0.9418005   0.9596580
## session_count                              0.9474239   0.8924715
## closed_session_event_count                 0.9998514   0.9574356
## open_session_event_count                   1.0000000   0.9578605
## event_count                                0.9578605   1.0000000
## quest_completed_event_count                0.7605956   0.8848637
## store_purchase_event_count                 0.5372247   0.5439314
## active_days                                0.7434909   0.6958241
##                             quest_completed_event_count
## session_length_seconds                        0.8380808
## session_count                                 0.6582946
## closed_session_event_count                    0.7595074
## open_session_event_count                      0.7605956
## event_count                                   0.8848637
## quest_completed_event_count                   1.0000000
## store_purchase_event_count                    0.4610465
## active_days                                   0.4731775
##                             store_purchase_event_count active_days
## session_length_seconds                       0.5607137   0.6299118
## session_count                                0.4329665   0.8381357
## closed_session_event_count                   0.5387210   0.7450125
## open_session_event_count                     0.5372247   0.7434909
## event_count                                  0.5439314   0.6958241
## quest_completed_event_count                  0.4610465   0.4731775
## store_purchase_event_count                   1.0000000   0.3898231
## active_days                                  0.3898231   1.0000000
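A full correlation matrix is hard to scan, so it can help to list each pair once, ordered by strength. The sketch below does this for three of the variables, with coefficients copied from the output above. The names v, cor_mat, and pairs_df are illustrative; with the real data, cor(ecomm_samp[vars]) would take the place of the hand-built matrix.

```r
# Three of the variables above, with coefficients copied from the cor() output
v <- c("session_length_seconds", "session_count", "active_days")
cor_mat <- matrix(c(1.0000000, 0.8503394, 0.6299118,
                    0.8503394, 1.0000000, 0.8381357,
                    0.6299118, 0.8381357, 1.0000000),
                  nrow = 3, dimnames = list(v, v))

# Keep each pair once by taking the upper triangle only
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
pairs_df <- data.frame(var1 = v[idx[, "row"]],
                       var2 = v[idx[, "col"]],
                       r    = cor_mat[idx])

# Strongest relationships first
pairs_df <- pairs_df[order(-pairs_df$r), ]
pairs_df
all(pairs_df$r > 0)  # TRUE
```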

library(car)
scatterplotMatrix(ecomm_samp[vars])
Inferences:
- In each panel the green line is the fitted least-squares regression line, whereas the red lines show a nonparametric smooth of the relationship together with its spread.
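The difference between the two kinds of line can be reproduced in base R on a single pair of variables. This is a sketch on simulated data, not the ecommerce sample. The variables x and y are made up, and base R's lowess() stands in for whichever smoother car applies internally.

```r
set.seed(1)                              # arbitrary seed for repeatable toy data
x <- rexp(200, rate = 1 / 50)            # right-skewed predictor
y <- 10 * sqrt(x) + rnorm(200, sd = 5)   # curved relationship plus noise

fit    <- lm(y ~ x)     # straight-line (least-squares) fit
smooth <- lowess(x, y)  # locally weighted smooth, follows the curvature

plot(x, y, main = "Linear fit vs. lowess smooth")
abline(fit, col = "green")    # regression line
lines(smooth, col = "red")    # lowess smooth
```

Where the red curve bends away from the green line, a straight-line model describes the relationship poorly, which is common for skewed variables like those in this data set.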