Thursday, July 2, 2015

Tree map and Geospatial map

The term mashup originated in the music world but was rapidly adopted on the web, where it means an application that combines data from different sources into a whole new application.

Geographic data is one of the most common types of data available. Dealing with geographic data without a map is like going into the mountains without, well, a map.
Geospatial analysis helps in identifying "hot spots" where disease or other problems are occurring.
Please refer to the article below on how Domino's is leveraging geospatial analytics to understand its customers and store performance.
http://www.zdnet.com/article/dominos-pizza-gets-customer-specific-using-geospatial-analytics/

Treemapping is a method for displaying hierarchical (tree-structured) data as a set of nested rectangles. Each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches.
A common use case these days is analyzing disk space. A utility called TreeSize is used to monitor disk space on servers.
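
To make the nested-rectangle idea concrete, here is a minimal sketch of the simplest treemap layout, usually called slice-and-dice. It is only an illustration: the node structure, the function names node_size and layout_treemap, and the disk sizes are all made up for this example and do not come from TreeSize or any other tool.

```r
# Slice-and-dice treemap layout: split a rectangle among the children in
# proportion to their sizes, alternating the split direction at each level.
# A leaf is list(name=, size=); a branch is list(name=, children=).
node_size <- function(node) {
  if (is.null(node$children)) return(node$size)
  sum(vapply(node$children, node_size, numeric(1)))
}

layout_treemap <- function(node, x = 0, y = 0, w = 1, h = 1, depth = 0) {
  if (is.null(node$children)) {
    return(data.frame(name = node$name, x = x, y = y, w = w, h = h))
  }
  total <- node_size(node)
  rects <- list()
  offset <- 0  # fraction of the parent rectangle already handed out
  for (child in node$children) {
    share <- node_size(child) / total
    if (depth %% 2 == 0) {
      # even depth: cut the parent along its width
      r <- layout_treemap(child, x + offset * w, y, w * share, h, depth + 1)
    } else {
      # odd depth: cut the parent along its height
      r <- layout_treemap(child, x, y + offset * h, w, h * share, depth + 1)
    }
    rects[[length(rects) + 1]] <- r
    offset <- offset + share
  }
  do.call(rbind, rects)
}

# Hypothetical disk-usage tree (sizes in GB are invented for illustration)
disk <- list(name = "root", children = list(
  list(name = "logs", size = 20),
  list(name = "data", children = list(
    list(name = "db", size = 50),
    list(name = "backup", size = 30)
  ))
))

rects <- layout_treemap(disk)
rects$area <- rects$w * rects$h  # leaf area is proportional to leaf size
print(rects)
```

Each leaf ends up with an area proportional to its size, which is why a large folder stands out at a glance. Real tools typically use layouts that keep rectangles closer to square, but the proportional-area principle is the same.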

Here is a very interesting piece of software, developed by an MIT student, which is based on the treemap.

http://pantheon.media.mit.edu/treemap/country_exports/IN/all/-4000/2010/H15/pantheon

For further in-depth reading on treemaps, from Northwestern University:
http://www.cs.uic.edu/~wilkinson/Publications/c&rtrees.pdf


Reference:
https://en.wikipedia.org/wiki/Treemapping
http://searchbusinessanalytics.techtarget.com/news/1507131/Data-mashups-meet-business-intelligence-Bashups-explained


Predicting Wine Quality Analytically


Abstract: The wine industry is seeing growth in overall wine consumption. The price of wine depends on two critical factors:
- Wine appreciation by wine tasters
- Certification and quality assessment through physicochemical tests
We have two datasets: red wine and white wine.
We performed exploratory data analysis using standard functions in R. During the analysis, we identified outliers in the different variables using box plots. We also used the R function cor() to understand the correlation between quality and the rest of the variables.
Finally, we devised a model using linear regression to predict the quality of wine.
The red wine data is used here to illustrate the process. The same steps can be applied to the white wine data.

Project Goal:
- Explore the data in the dataset and list all the standard summary statistics
- Investigate the distribution of the variables graphically to identify outliers
- Devise a method to handle outliers
- Investigate the correlation between quality and the remaining properties
- Suggest methods for the final "Quality" determination



Looking into the dataset (Red Wine)
> red.wine.data <-read.delim(file.choose(), header=T)
> dim(red.wine.data)
[1] 1599   12
> names(red.wine.data)
 [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"          "residual.sugar"     
 [5] "chlorides"            "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"            
 [9] "pH"                   "sulphates"            "alcohol"              "quality"
> str(red.wine.data)
'data.frame':       1599 obs. of  12 variables:
 $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
 $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
 $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
 $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
 $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
 $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
 $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
 $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
 $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
 $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
 $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
 $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
> attributes(red.wine.data)
$names
 [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"          "residual.sugar"     
 [5] "chlorides"            "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"            
 [9] "pH"                   "sulphates"            "alcohol"              "quality"            
$class
[1] "data.frame"
> summary(red.wine.data)
 fixed.acidity   volatile.acidity  citric.acid    residual.sugar     chlorides     
 Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900   Min.   :0.01200 
 1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900   1st Qu.:0.07000 
 Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200   Median :0.07900 
 Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539   Mean   :0.08747 
 3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600   3rd Qu.:0.09000 
 Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500   Max.   :0.61100 
 free.sulfur.dioxide total.sulfur.dioxide    density             pH          sulphates    
 Min.   : 1.00       Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300 
 1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500 
 Median :14.00       Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200 
 Mean   :15.87       Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581 
 3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300 
 Max.   :72.00       Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000 
    alcohol         quality    
 Min.   : 8.40   Min.   :3.000 
 1st Qu.: 9.50   1st Qu.:5.000 
 Median :10.20   Median :6.000 
 Mean   :10.42   Mean   :5.636
 3rd Qu.:11.10   3rd Qu.:6.000 
 Max.   :14.90   Max.   :8.000

Identifying outliers

> fa<-red.wine.data$fixed.acidity
> boxplot(fa)
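
The points that boxplot() draws individually are the ones lying more than 1.5 times the interquartile range beyond the box. A minimal sketch of that rule follows. The nine values are made up for illustration and are not read from the wine file; boxplot.stats() is the base R function behind boxplot().

```r
# Made-up sample values (not from the wine file), with one obvious outlier
x <- c(7.4, 7.8, 7.9, 8.1, 8.3, 8.6, 9.0, 9.2, 15.9)

# boxplot.stats() returns the values that boxplot() draws as separate points
stats <- boxplot.stats(x)
print(stats$out)

# The same rule written out by hand using quartiles
q <- quantile(x, c(0.25, 0.75))
fence <- q + c(-1.5, 1.5) * (q[2] - q[1])  # lower and upper fences
manual <- x[x < fence[1] | x > fence[2]]
print(manual)
```

boxplot.stats() works from hinges rather than quantile() quartiles, so the two approaches can disagree slightly on some samples. On this sample they flag the same value.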





Handling outliers

The outlier values were removed and the remaining rows stored in a new data frame called red.wine.

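# column names are used directly so the call does not depend on shorthand vectors defined elsewhere
> red.wine<-subset(red.wine.data, fixed.acidity<12 & volatile.acidity<1 & citric.acid<.8 & residual.sugar<5 & chlorides<.2 & free.sulfur.dioxide<50 & total.sulfur.dioxide<150 & density<1 & pH<3.5 & sulphates<1.5 & alcohol<14 & quality<9)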

> summary(red.wine)

fixed.acidity    volatile.acidity  citric.acid     residual.sugar    chlorides     
 Min.   : 5.000   Min.   :0.1200   Min.   :0.0000   Min.   :0.900   Min.   :0.01200 
 1st Qu.: 7.200   1st Qu.:0.3800   1st Qu.:0.1100   1st Qu.:1.900   1st Qu.:0.07000 
 Median : 8.000   Median :0.5100   Median :0.2600   Median :2.100   Median :0.07900 
 Mean   : 8.276   Mean   :0.5093   Mean   :0.2657   Mean   :2.243   Mean   :0.08112 
 3rd Qu.: 9.100   3rd Qu.:0.6200   3rd Qu.:0.4000   3rd Qu.:2.500   3rd Qu.:0.08900 
 Max.   :11.900   Max.   :0.9800   Max.   :0.7300   Max.   :4.800   Max.   :0.19400 
 free.sulfur.dioxide total.sulfur.dioxide    density             pH          sulphates    
 Min.   : 1.00       Min.   :  6.00       Min.   :0.9901   Min.   :2.870   Min.   :0.3300 
 1st Qu.: 7.00       1st Qu.: 21.50       1st Qu.:0.9956   1st Qu.:3.220   1st Qu.:0.5400 
 Median :13.00       Median : 37.00       Median :0.9966   Median :3.300   Median :0.6100 
 Mean   :15.39       Mean   : 44.58       Mean   :0.9965   Mean   :3.293   Mean   :0.6407 
 3rd Qu.:21.00       3rd Qu.: 59.00       3rd Qu.:0.9975   3rd Qu.:3.380   3rd Qu.:0.7100 
 Max.   :48.00       Max.   :149.00       Max.   :0.9998   Max.   :3.490   Max.   :1.3600 
    alcohol         quality    
 Min.   : 8.50   Min.   :3.000 
 1st Qu.: 9.50   1st Qu.:5.000 
 Median :10.10   Median :6.000 
 Mean   :10.37   Mean   :5.655 
 3rd Qu.:11.00   3rd Qu.:6.000 
 Max.   :13.60   Max.   :8.000  





Correlation: Principal component analysis

By running a principal component analysis and plotting the result, we can see how the variables load on the first two principal components and how they relate to one another.
# number of observations
> temp_red.wine<-nrow(red.wine)
# PCA analysis
> pcx<-prcomp(red.wine, scale=TRUE)
# plotting using biplot; xlabs draws each observation as a dot
> biplot(pcx, xlabs=rep('.', temp_red.wine))
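
A biplot shows only the first two components, so it is worth checking how much of the total variance they carry before reading much into it. The sketch below shows one way to do that. It uses the built-in mtcars data as a stand-in so that it runs without the wine file; the same calls should apply to pcx.

```r
# Built-in mtcars stands in for the wine data so the sketch runs on its own
pca <- prcomp(mtcars, scale. = TRUE)

# Share of the total variance captured by each component
var.share <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(var.share), 3))

# Loadings on the first two components: variables with similar loadings
# point in similar directions on the biplot
print(round(pca$rotation[, 1:2], 3))
```

If the first two components carry only a modest share of the variance, relationships that look strong on the biplot deserve a second check with another method.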

What is interesting about the plot is that, judging by the first two principal components, quality is strongly correlated with alcohol content and sulphates.
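
The abstract mentions cor() as a second way to look at the relationship with quality. A minimal sketch of that call pattern is shown here. It uses only the first ten rows of four columns, copied from the str() output earlier in this post, so the numbers it prints illustrate the mechanics and should not be taken as findings. The names sample.rows and quality.cor are introduced just for this example.

```r
# First ten rows of four columns, copied from the str() output shown earlier.
# Ten rows are far too few for reliable estimates; this only shows the pattern.
sample.rows <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Correlation of every column with quality, strongest (by magnitude) first
quality.cor <- cor(sample.rows)[, "quality"]
quality.cor <- quality.cor[order(abs(quality.cor), decreasing = TRUE)]
print(round(quality.cor, 3))
```

On the full data, the same two lines with red.wine in place of sample.rows would list every property's correlation with quality.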


Predicting wine quality using a linear regression line

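# ql and al are not columns of red.wine, so R takes them from the workspace;
# they are defined here from red.wine.data (mean(ql) below matches the unfiltered summary)
> ql<-red.wine.data$quality
> al<-red.wine.data$alcohol
> plot(ql~al, data=red.wine)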
> mean(ql)
[1] 5.636023
> abline(h=mean(ql))
> model1=lm(ql~al, data=red.wine)
> model1

Call:
lm(formula = ql ~ al, data = red.wine)

Coefficients:
(Intercept)           al 
     1.8750       0.3608 

> abline(model1,col="red")
> plot(model1)
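
The fitted line can be read as quality = 1.8750 + 0.3608 * alcohol. As a small worked example, the sketch below plugs a few alcohol percentages into the coefficients printed above. The helper name quality_from_alcohol and the alcohol values are chosen only for illustration.

```r
# Coefficients as printed by model1 above
intercept <- 1.8750
slope     <- 0.3608

# Hypothetical helper: predicted quality for a given alcohol percentage
quality_from_alcohol <- function(alcohol) intercept + slope * alcohol

# 1st quartile, median and 3rd quartile of alcohol from the first summary
alcohol.levels <- c(9.5, 10.2, 11.1)
predicted <- quality_from_alcohol(alcohol.levels)
print(round(predicted, 2))
```

With the model object still in the workspace, predict(model1, newdata = data.frame(al = 10)) should give the same kind of estimate without retyping the coefficients. Note that the slope means each additional percentage point of alcohol moves the predicted quality by only about 0.36 points.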