Abstract: Wine industry shows growth in overall consumption of
wine. Price of wine depend on two critical factor
- Wine appreciation by
wine tester
- Certification and
quality assessment in the physicochemical test
We have two dataset red
wine and white wine.
We have done exploratory
data analysis using standard function in R. During the analysis, we have
identified the outliers in different variables using box plot. Also using cor()
R function tried to understand the correlation between Quality and rest of variables.
Finally devised model
using Liner regression technique to predict the quality of wine
Red wine data used to depict the picture. Same steps can be applied on white wine data
Red wine data used to depict the picture. Same steps can be applied on white wine data
Project Goal:
- Explore the data in
dataset and be able to list all the standard summary statistics
- investigate distribution
of the variables graphically to determine the outliers
- devise method to handle
outlier
- investigate correlation
between quality and remaining properties
- suggest methods for the
final “Quality” determination
Looking into dataset(Red Wine)
> red.wine.data <-read.delim(file.choose(), header=T)
> dim(red.wine.data)
[1] 1599 12
> names(red.wine.data)
[1]
"fixed.acidity" "volatile.acidity" "citric.acid" "residual.sugar"
[5] "chlorides"
"free.sulfur.dioxide"
"total.sulfur.dioxide" "density"
[9] "pH" "sulphates" "alcohol" "quality"
> str(red.wine.data)
'data.frame': 1599 obs.
of 12 variables:
$ fixed.acidity : num
7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
$ volatile.acidity : num
0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
$ citric.acid : num
0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
$ residual.sugar : num
1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
$ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069
0.065 0.073 0.071 ...
$ free.sulfur.dioxide :
num 11 25 15 17 11 13 15 15 9 17 ...
$ total.sulfur.dioxide:
num 34 67 54 60 34 40 59 21 18 102 ...
$ density : num 0.998 0.997 0.997 0.998 0.998 ...
$ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36
3.35 ...
$ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57
0.8 ...
$ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
$ quality : int 5 5 5 6 5 5 5 7 7 5 ...
> attributes(red.wine.data)
$names
[1]
"fixed.acidity"
"volatile.acidity"
"citric.acid"
"residual.sugar"
[5] "chlorides"
"free.sulfur.dioxide"
"total.sulfur.dioxide" "density"
[9] "pH" "sulphates" "alcohol" "quality"
$class
[1] "data.frame"
> summary(red.wine.data)
fixed.acidity volatile.acidity citric.acid
residual.sugar chlorides
Min. : 4.60
Min. :0.1200 Min.
:0.000 Min. : 0.900
Min. :0.01200
1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900 1st Qu.:0.07000
Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200 Median :0.07900
Mean : 8.32
Mean :0.5278 Mean
:0.271 Mean : 2.539
Mean :0.08747
3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600 3rd Qu.:0.09000
Max. :15.90
Max. :1.5800 Max.
:1.000 Max. :15.500
Max. :0.61100
free.sulfur.dioxide
total.sulfur.dioxide density pH sulphates
Min. : 1.00
Min. : 6.00
Min. :0.9901 Min.
:2.740 Min. :0.3300
1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
Median :14.00 Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
Mean :15.87
Mean : 46.47 Mean
:0.9967 Mean :3.311
Mean :0.6581
3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
Max. :72.00
Max. :289.00 Max.
:1.0037 Max. :4.010
Max. :2.0000
alcohol quality
Min. : 8.40
Min. :3.000
1st Qu.: 9.50 1st Qu.:5.000
Median :10.20 Median :6.000
Mean :10.42
Mean :5.636
3rd Qu.:11.10 3rd Qu.:6.000
Max. :14.90
Max. :8.000
Identifying outliers
> fa<-red.wine.data$fixed.acidity
> boxplot(fa)
Handling outliers
Removed the outlier values
and stored data in new data frame called red.wine
> red.wine<-subset(red.wine.data, fa< 12 & va<1 &
ca<.8 & rs< 5 & cl<.2& fsd < 50 & tsd<150 &
d<1 & ph<3.5 & sul<1.5 & al< 14 & ql<9)
> summary(red.wine)
fixed.acidity
volatile.acidity citric.acid residual.sugar chlorides
Min. : 5.000
Min. :0.1200 Min.
:0.0000 Min. :0.900
Min. :0.01200
1st Qu.: 7.200 1st Qu.:0.3800 1st Qu.:0.1100 1st Qu.:1.900 1st Qu.:0.07000
Median : 8.000 Median :0.5100 Median :0.2600 Median :2.100 Median :0.07900
Mean : 8.276
Mean :0.5093 Mean
:0.2657 Mean :2.243
Mean :0.08112
3rd Qu.: 9.100 3rd Qu.:0.6200 3rd Qu.:0.4000 3rd Qu.:2.500 3rd Qu.:0.08900
Max. :11.900
Max. :0.9800 Max.
:0.7300 Max. :4.800
Max. :0.19400
free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
Min. : 1.00
Min. : 6.00
Min. :0.9901 Min.
:2.870 Min. :0.3300
1st Qu.: 7.00 1st Qu.: 21.50 1st Qu.:0.9956 1st Qu.:3.220 1st Qu.:0.5400
Median :13.00 Median : 37.00 Median :0.9966 Median :3.300 Median :0.6100
Mean :15.39
Mean : 44.58 Mean
:0.9965 Mean :3.293
Mean :0.6407
3rd Qu.:21.00 3rd Qu.: 59.00 3rd Qu.:0.9975 3rd Qu.:3.380 3rd Qu.:0.7100
Max. :48.00
Max. :149.00 Max.
:0.9998 Max. :3.490
Max. :1.3600
alcohol quality
Min. : 8.50
Min. :3.000
1st Qu.: 9.50 1st Qu.:5.000
Median :10.10 Median :6.000
Mean :10.37
Mean :5.655
3rd Qu.:11.00 3rd Qu.:6.000
Max. :13.60
Max. :8.000
Correlation: Principal component
analysis
By doing principal
component analysis and plotting, we can easily identify the principal
components and their correlation.
#number of element
> temp_red.wine<-length(as.matrix(red.wine))/length(red.wine)
#PCA analysis
#PCA analysis
> pcx<-prcomp(red.wine, scale=TRUE)
#plotting using biplot
#plotting using biplot
> biplot(pcx, xlab=rep('.', temp_red.wine))
Interesting about the plot
is that judging by the first two principal components, a quality is very much
correlated with alcohol content and sulphate
Predicting wine quality using
liner regression line
> plot(ql~al, data=red.wine)
> mean(ql)
[1] 5.636023
> abline(h=mean(ql))
> model1=lm(ql~al, data=red.wine)
> model1
Call:
lm(formula = ql ~ al, data = red.wine)
Coefficients:
(Intercept) al
1.8750 0.3608
> abline(model1,col="red")
> plot(model1)
No comments:
Post a Comment