Bhupendra Mishra: 07/02/15

Abstract: Wine industry shows growth in overall consumption of wine. Price of wine depend on two critical factor

- Wine appreciation by wine tester

- Certification and quality assessment in the physicochemical test

We have two dataset red wine and white wine.

We have done exploratory data analysis using standard function in R. During the analysis, we have identified the outliers in different variables using box plot. Also using cor() R function tried to understand the correlation between Quality and rest of variables.

Finally devised model using Liner regression technique to predict the quality of wine
Red wine data used to depict the picture. Same steps can be applied on white wine data

Project Goal:

- Explore the data in dataset and be able to list all the standard summary statistics

- investigate distribution of the variables graphically to determine the outliers

- devise method to handle outlier

- investigate correlation between quality and remaining properties

- suggest methods for the final “Quality” determination

Looking into dataset(Red Wine)

> red.wine.data <-read.delim(file.choose(), header=T)

> dim(red.wine.data)

[1] 1599 12

> names(red.wine.data)

[1] "fixed.acidity" "volatile.acidity" "citric.acid" "residual.sugar"

[5] "chlorides" "free.sulfur.dioxide" "total.sulfur.dioxide" "density"

[9] "pH" "sulphates" "alcohol" "quality"

> str(red.wine.data)

'data.frame': 1599 obs. of 12 variables:

$ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...

$ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...

$ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...

$ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...

$ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...

$ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...

$ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...

$ density : num 0.998 0.997 0.997 0.998 0.998 ...

$ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...

$ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...

$ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...

$ quality : int 5 5 5 6 5 5 5 7 7 5 ...

> attributes(red.wine.data)

$names

[1] "fixed.acidity" "volatile.acidity" "citric.acid" "residual.sugar"

[5] "chlorides" "free.sulfur.dioxide" "total.sulfur.dioxide" "density"

[9] "pH" "sulphates" "alcohol" "quality"

$class

[1] "data.frame"

> summary(red.wine.data)

fixed.acidity volatile.acidity citric.acid residual.sugar chlorides

Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900 Min. :0.01200

1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900 1st Qu.:0.07000

Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200 Median :0.07900

Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539 Mean :0.08747

3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600 3rd Qu.:0.09000

Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500 Max. :0.61100

free.sulfur.dioxide total.sulfur.dioxide density pH sulphates

Min. : 1.00 Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300

1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500

Median :14.00 Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200

Mean :15.87 Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581

3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300

Max. :72.00 Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000

alcohol quality

Min. : 8.40 Min. :3.000

1st Qu.: 9.50 1st Qu.:5.000

Median :10.20 Median :6.000

Mean :10.42 Mean :5.636

3rd Qu.:11.10 3rd Qu.:6.000

Max. :14.90 Max. :8.000

Identifying outliers

> fa<-red.wine.data$fixed.acidity

> boxplot(fa)

Handling outliers

Removed the outlier values and stored data in new data frame called red.wine

> red.wine<-subset(red.wine.data, fa< 12 & va<1 & ca<.8 & rs< 5 & cl<.2& fsd < 50 & tsd<150 & d<1 & ph<3.5 & sul<1.5 & al< 14 & ql<9)

> summary(red.wine)

fixed.acidity volatile.acidity citric.acid residual.sugar chlorides

Min. : 5.000 Min. :0.1200 Min. :0.0000 Min. :0.900 Min. :0.01200

1st Qu.: 7.200 1st Qu.:0.3800 1st Qu.:0.1100 1st Qu.:1.900 1st Qu.:0.07000

Median : 8.000 Median :0.5100 Median :0.2600 Median :2.100 Median :0.07900

Mean : 8.276 Mean :0.5093 Mean :0.2657 Mean :2.243 Mean :0.08112

3rd Qu.: 9.100 3rd Qu.:0.6200 3rd Qu.:0.4000 3rd Qu.:2.500 3rd Qu.:0.08900

Max. :11.900 Max. :0.9800 Max. :0.7300 Max. :4.800 Max. :0.19400

free.sulfur.dioxide total.sulfur.dioxide density pH sulphates

Min. : 1.00 Min. : 6.00 Min. :0.9901 Min. :2.870 Min. :0.3300

1st Qu.: 7.00 1st Qu.: 21.50 1st Qu.:0.9956 1st Qu.:3.220 1st Qu.:0.5400

Median :13.00 Median : 37.00 Median :0.9966 Median :3.300 Median :0.6100

Mean :15.39 Mean : 44.58 Mean :0.9965 Mean :3.293 Mean :0.6407

3rd Qu.:21.00 3rd Qu.: 59.00 3rd Qu.:0.9975 3rd Qu.:3.380 3rd Qu.:0.7100

Max. :48.00 Max. :149.00 Max. :0.9998 Max. :3.490 Max. :1.3600

alcohol quality

Min. : 8.50 Min. :3.000

1st Qu.: 9.50 1st Qu.:5.000

Median :10.10 Median :6.000

Mean :10.37 Mean :5.655

3rd Qu.:11.00 3rd Qu.:6.000

Max. :13.60 Max. :8.000

Correlation: Principal component analysis

By doing principal component analysis and plotting, we can easily identify the principal components and their correlation.

#number of element

> temp_red.wine<-length(as.matrix(red.wine))/length(red.wine)
#PCA analysis

> pcx<-prcomp(red.wine, scale=TRUE)
#plotting using biplot

> biplot(pcx, xlab=rep('.', temp_red.wine))

Interesting about the plot is that judging by the first two principal components, a quality is very much correlated with alcohol content and sulphate

Predicting wine quality using liner regression line

> plot(ql~al, data=red.wine)

> mean(ql)

[1] 5.636023

> abline(h=mean(ql))

> model1=lm(ql~al, data=red.wine)

> model1

Call:

lm(formula = ql ~ al, data = red.wine)

Coefficients:

(Intercept) al

1.8750 0.3608

> abline(model1,col="red")

> plot(model1)

Thursday, July 2, 2015

Tree map and Geospatial map

Predicting Wine Quality Analytically

Handling outliers

Correlation: Principal component analysis