GitHub - p2krish/Udacity--Data-Analysis-with-R

title	Analysis of White Wine Dataset: Investigating Factors Contributing to White Wines
author	Mohammad Key Manesh
date	Tuesday, May 05, 2015
output	html_document

Understanding Dataset and Objective

The wine industry is lucrative and growing as social drinking is on the rise. Many factors make the taste and quality of a wine unique. These factors include, but are not limited to, the following:

acidity
pH level
residual sugar in the wine
chlorides

In this project we use a dataset of wines. It contains 4898 observations of white wines produced in Portugal. Different properties of each wine were tested and recorded for this dataset. Also, each variety of wine was tasted by three independent tasters, and the final rank assigned is the median of the ranks given by the tasters.

In this project, I try to understand this dataset better and to find out whether there is a relationship between the quality of a wine and its properties.

EDA

Structure of dataset

We start by looking at the data to understand its features better.

## 'data.frame':	4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

There are 4898 observations and 13 columns: a row index (X), 11 input variables that describe chemical features of white wine, and one output variable, which is wine quality.

Below is a brief description of each feature. Input variables (based on physicochemical tests):

Chemical Properties:

fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily) (tartaric acid - g / dm^3)
volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste (acetic acid - g / dm^3)
citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines (g / dm^3)
residual sugar: the amount of sugar remaining after fermentation stops (g / dm^3)
chlorides: the amount of salt in the wine (sodium chloride - g / dm^3)
free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion (mg / dm^3)
total sulfur dioxide: amount of free and bound forms of SO2 (mg / dm^3)
density: the density of wine is close to that of water depending on the percent alcohol and sugar content (g / cm^3)
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic)
sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels (potassium sulphate - g / dm^3)
alcohol: the percent alcohol content of the wine (% by volume)

Output variable (based on sensory data):

quality (score between 0 and 10)

Summary of dataset:

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

The summary above shows the distribution of each variable. As we can see, the middle 50% of wines have a fixed acidity between 6.3 and 7.3 g / dm^3. As for sugar, 75% of wines in our dataset have less than 9.9 g / dm^3 of sugar remaining after fermentation stops. The average alcohol percentage in our dataset is about 10.51.

Some plots:

Distribution of data: Quality of Wine

Boxplot of wine quality:

Histogram of wine quality:

For most of the wines in our dataset, quality falls between 5 and 7, which is the range for good wines. There are a few exceptions: excellent wines (8 or above) and poor wines (4 or below).

Distribution of data: Wine Acidity

Based on the bottom-right figure, wines are acidic, with pH ranging from 2.5 to 4; however, most wines have a pH between 3 and 3.5.

Acidic nature of wines can come from three different types of acids:

1- Fixed acidity, which in most cases is between 6 and 8.

2- Volatile acidity, which is mostly in the range of 0.1 to 0.5.

3- Citric acid, which ranges from 0 to 1 but for most wines in our dataset is between 0.2 and 0.5.

These features all seem to follow a normal distribution except Volatile Acidity which is slightly right skewed.

I will apply a log transformation to see whether the result is more bell-shaped:

It seems that log(volatile acidity) follows a normal distribution (at least it looks more bell-shaped on the logarithmic scale than on the regular one); therefore we will use the logarithmic transformation in our further analysis.
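To illustrate why a log transformation can make a right-skewed variable look more bell-shaped, here is a minimal sketch in Python (the analysis itself was run in R). The sample values are illustrative and are not taken from the wine data, and `skewness` is a hypothetical helper name. It computes the moment-based skewness, which is close to 0 for a symmetric sample and positive for a right-skewed one.

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Illustrative right-skewed sample (NOT from the wine data): each value is
# 1.5 times the previous one, so the logs are evenly spaced.
sample = [0.08 * 1.5 ** k for k in range(7)]
log_sample = [math.log(v) for v in sample]

raw_skew = skewness(sample)
log_skew = skewness(log_sample)
print(f"skewness before log: {raw_skew:.2f}")
print(f"skewness after log:  {abs(log_skew):.2f}")
```

On real data the skewness after the transformation will rarely be exactly zero; the point is only that it moves toward zero when the original values are stretched out on the right.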

Distribution of data: Density, Chlorides, Sugar and Alcohol Percentage

Based on the above figures, chlorides in the wines in our dataset are usually between 0 and 0.1 g/dm^3, with some exceptions above 0.1 g/dm^3.

The amount of sugar remaining after fermentation is rarely more than 20 g/dm^3.

The density of wine is typically very slightly less than that of water. The typical range for density is 0.99 to 1 g/cm^3.

Alcohol percentage in wine varies between 8 and 14; however, for most of the wines it is between 9 and 13.

Residual sugar and chlorides are highly right skewed. We will apply a logarithmic transformation in the next step:

## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

Now these two look more bell-shaped. However, residual sugar is still far from a normal distribution, as its distribution appears to have two separate bells.

Analyzing Correlation among input variables in the dataset

The following diagrams give us a good sense of the distribution and correlation among the input variables in our dataset:

Some observations:

Positive relationship between density and residual sugar
Positive relationship between total SO2 and free SO2
Positive relationship between total SO2 and chlorides
Negative relationship between alcohol and density
Features in our data seem to follow a normal distribution

To avoid multicollinearity in model building using regressions, we have to be aware of strong correlations among input variables.

Analyzing correlation between Quality and input variables

We use Spearman's rho statistic to estimate a rank-based measure of association. Correlations fall between -1 and 1. A value of 0 suggests there is no association between the two variables, while values close to -1 or 1 suggest strong negative and positive associations respectively.

##                              [,1]
## fixed.acidity        -0.113662831
## volatile.acidity     -0.197363215
## citric.acid          -0.009209091
## residual.sugar       -0.064631762
## chlorides            -0.272856661
## free.sulfur.dioxide   0.008158067
## total.sulfur.dioxide -0.174737218
## density              -0.307123313
## pH                    0.099427246
## sulphates             0.053677877
## alcohol               0.435574715

This also shows that wine quality has a positive correlation with alcohol and a negative correlation with chlorides and density.
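As an illustration of what a rank-based measure means, here is a minimal Python sketch of Spearman's rho (the analysis itself was run in R). It replaces each value by its rank, giving tied values the average of their ranks, and then computes the ordinary Pearson correlation on those ranks. The function names are hypothetical, and the input is only the first ten rows of residual.sugar and alcohol shown in the structure listing earlier, so the result says nothing about the full dataset.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# First ten rows only, copied from the structure listing of the dataset.
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5]
alcohol = [8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11]

rho = spearman_rho(residual_sugar, alcohol)
print(f"Spearman's rho on the first ten rows: {rho:.3f}")
```

Because only ranks are used, any monotone transformation of a variable (such as the log transformations above) leaves the result unchanged.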

Now I will dig further into the relationship between wine quality and its properties in order to predict the quality of wine.

Role of pH and Alcohol in Quality of the wine

What is the impact of alcohol and pH on wine quality?

It is difficult to find a specific pattern in this figure since quality has a wide range. I will group the quality of wine into three categories, Poor, Good and Great, to be able to differentiate patterns in each category.

Below is how the quality of wines is distributed based on the rating that I just introduced:

Now again we plot the two features of pH and Alcohol but this time use the new rating to see a pattern between quality and these two features:

According to the above scatter plot, there seems to be a relationship between alcohol percentage and the rating of the wine. Most of the great wines are on the right side of the plot. More specifically, if the alcohol percentage is above 11%, there seems to be a good chance that we will have a good or great wine (great wines have a quality of 7 or above, good ones have a quality above 5). If it is more than 12%, the chance is even higher.
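The three-level rating can be sketched as a small function, shown here in Python for illustration (the analysis itself was run in R). The thresholds for Great and Good come from the text; treating everything at 5 or below as Poor is an assumption, and `rate_quality` is a hypothetical name.

```python
def rate_quality(quality):
    """Map a 0-10 quality score to the three-level rating described in the text."""
    if quality >= 7:
        return "Great"
    if quality > 5:
        return "Good"
    return "Poor"  # assumption: every score of 5 or below is Poor

# Quality scores in the dataset run from 3 to 9 (see the summary above).
for score in range(3, 10):
    print(score, rate_quality(score))
```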

However, to see the relationship better, in the chart below I use only alcohol and quality to find out whether there is actually a relationship between the two.

As you can see in the above stacked bar chart, higher quality wines are more likely to have a higher alcohol percentage.

Here is how I categorized the alcohol percentage:

"Light": Alcohol percentage below 10%
"Mild" : Alcohol percentage higher than 10% but below 12%
"Strong": Alcohol percentage higher than 12%

Relationship between density and alcohol percentage

There seems to be a correlation between density and alcohol percentage. Less dense, more alcohol. Also, great wines tend to be less dense.

Relationship between Quality and Chlorides

Wines with better quality tend to have lower chloride levels. If the chloride level is higher than 0.050, there is a good chance the wine has a worse rating.

Predicting Wine Quality

Using the insights that we have now about our data, I will try to predict the quality of wine.

I will use the three-level rating ("Poor", "Good", "Great") as the output variable.

## 
##  Poor  Good Great 
##  1640  2198  1060

The baseline for the accuracy of our model is what we would get by always predicting the most common rating (Good):

2198 / (1640 + 2198 + 1060) = 0.45

Multinomial Logistic Regression

I will use multinomial logistic regression to classify ratings of wine.

In our earlier analysis we found that there is a strong relationship between wine quality and its alcohol percentage. Let's predict the rating of a wine based on its alcohol percentage alone.

Here is the prediction:

##        pred_mglm
##         Poor Good Great
##   Poor   918  696    26
##   Good   631 1336   231
##   Great  128  619   313

Accuracy = (918+1336+313)/total = 0.52

AIC: 9211.864

We can see that just by using one variable, we could improve the baseline accuracy significantly.
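The accuracy figure can be reproduced from the confusion matrix alone. The following Python sketch (for illustration; the analysis itself was run in R) assumes that rows are actual ratings and columns are predicted ratings, which is consistent with the row totals matching the class counts given earlier. The function names are hypothetical.

```python
# Confusion matrix of the alcohol-only model, copied from the output above.
# Assumed layout: rows are actual ratings, columns are predicted ratings,
# both in the order Poor, Good, Great.
confusion = [
    [918, 696, 26],
    [631, 1336, 231],
    [128, 619, 313],
]

def accuracy(matrix):
    """Share of observations on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def majority_baseline(matrix):
    """Accuracy of always predicting the most frequent actual class."""
    row_totals = [sum(row) for row in matrix]
    return max(row_totals) / sum(row_totals)

model_accuracy = accuracy(confusion)
baseline = majority_baseline(confusion)
print(f"model accuracy:          {model_accuracy:.3f}")
print(f"majority-class baseline: {baseline:.3f}")
```

The same two functions can be applied to any of the confusion matrices in this report.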

In the next step we will add more variables to our model to improve its accuracy. Based on the EDA section, density, chlorides and volatile acidity have the strongest correlations with wine quality after alcohol. However, since density is strongly associated with alcohol percentage, we leave it out to avoid multicollinearity. In our next model we predict the rating of a wine based on its alcohol percentage, chlorides and volatile acidity:

Here is the prediction:

##        pred_mglm
##         Poor Good Great
##   Poor   972  643    25
##   Good   531 1404   263
##   Great   62  685   313

Accuracy = (972+1404+313)/total = 0.55

AIC: 8838.35

As expected, the accuracy improved (from 0.52 to 0.55).

The last step is to use the full model (all inputs) to predict the quality of wine.

Prediction:

##        pred_mglm
##         Poor Good Great
##   Poor   996  612    32
##   Good   485 1468   245
##   Great   55  630   375

Accuracy = (996+1468+375)/total = 0.58

AIC: 8602.719

As you can see, we added 8 more variables to our model and the accuracy improved by only 3 percentage points. This suggests that either the combination of the other variables is not very informative for predicting the output, or our model is not leveraging the data well (perhaps because there is multicollinearity, or because the relationship between the inputs and the output is not linear).

We can also compare the Akaike information criterion (AIC) of the three models, where lower is better. From the first model to the second, the AIC dropped from 9211.9 to 8838.4; from the second model to the full model it dropped further to 8602.7, a smaller improvement even though eight more variables were added.
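For reference, the AIC is defined as 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood, so lower values are better and extra parameters are penalized. The short Python sketch below (for illustration; the analysis itself was run in R) only does the arithmetic on the three AIC values reported above. The model labels are descriptive names chosen here, not identifiers from the original code.

```python
# AIC values reported above for the three multinomial models.
aic = {
    "alcohol only": 9211.864,
    "alcohol + chlorides + volatile acidity": 8838.35,
    "all inputs": 8602.719,
}

names = list(aic)
drops = {}
for previous, current in zip(names, names[1:]):
    # A positive drop means the larger model has the lower (better) AIC.
    drops[current] = aic[previous] - aic[current]
    print(f"{previous} -> {current}: AIC drops by {drops[current]:.1f}")
```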

Decision Tree

Using Decision Trees to predict wine quality:

As we see in the tree, the wine is predicted to be Great if its alcohol percentage is 13% or higher. It is predicted to be Poor if its alcohol percentage is below 11% and its log(volatile acidity) is greater than or equal to -1.4.
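The two splits described above can be written as a plain rule, sketched here in Python for illustration (the analysis itself was run in R). Only these two branches are given in the text, so the sketch returns `None` for every other case rather than guessing the remaining leaves. It also assumes that log means the natural logarithm. `tree_rule` is a hypothetical name.

```python
import math

def tree_rule(alcohol, volatile_acidity):
    """Apply only the two splits described in the text; other leaves are unknown."""
    if alcohol >= 13:
        return "Great"
    # Assumption: log() is the natural logarithm.
    if alcohol < 11 and math.log(volatile_acidity) >= -1.4:
        return "Poor"
    return None  # branch not described in the text

print(tree_rule(13.5, 0.30))  # strong wine
print(tree_rule(9.5, 0.30))   # light wine with higher volatile acidity
print(tree_rule(12.0, 0.30))  # falls in a branch the text does not describe
```

Under the natural-log assumption, the threshold of -1.4 corresponds to a volatile acidity of about 0.25 g / dm^3.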

Here is the confusion matrix based on this model:

##        pred_CART
##         Poor Good Great
##   Poor   983  643    14
##   Good   565 1514   119
##   Great   62  776   222

Accuracy = (983+1514+222)/total = 0.56

This is a very effective and readable model. We used just two of the input variables to predict the quality. For the next model, we make it more complicated:

In the above model we used the following variables to predict quality: alcohol, free sulfur dioxide, pH, sulphates and volatile acidity.

Now let's see the confusion matrix:

##        pred_CART
##         Poor Good Great
##   Poor   868  749    23
##   Good   387 1651   160
##   Great   30  720   310

Accuracy = (868+1651+310)/total = 0.58

At 0.58, this ties the full multinomial model for the best accuracy achieved so far, while using only five variables.

Random Forest

As our last model we will use random forest classification to predict the quality of wine.

Now let's see the confusion matrix:

##        pred_RF
##         Poor Good Great
##   Poor  1225  401    14
##   Good   293 1737   168
##   Great   22  328   710

Accuracy ~ 0.75

Well!! The accuracy improved amazingly! But does that mean it is the best model to predict wine quality? I will discuss this in the last section of the project, where I suggest future analysis.

Final Plots and Summary

Histogram of Wine Quality:

First, in the plot below I display a histogram of wine quality to see how quality is distributed in our dataset.

The most common quality rating is 6. We can also see that most of the wines in our dataset are rated between 5 and 7.

Relationship between Total SO2 and Density

To better display the relationship between two numerical variables, a scatter plot is used:

This scatterplot shows that there is a positive relationship between the total SO2 in a wine and its density. The blue line is drawn using the linear regression method.

In the EDA section, we calculated the correlation between the two, which is 0.53 and suggests a relatively strong positive relationship.

Is there any relationship between Alcohol percentage and Wine Density? Do these features impact wine rating?

I will show a scatter plot of the data using alcohol percentage and wine density as the x and y axes respectively. To understand the contribution of the two to wine quality, another dimension (color) is added, which is the wine rating.

The above figure is also interesting and carries very useful information about our dataset. As you can see in the scatterplot, there is a relationship between alcohol percentage and wine density: the higher the alcohol percentage, the lower the density. In the previous section we also found that the correlation between the two is -0.78, which suggests a relatively strong negative relationship.

Another useful piece of information in this plot is the relationship between alcohol percentage and wine rating. While the left side of the plot consists of red points (Poor wines), the right-hand side of the plot mostly consists of green and blue points (Good and Great wines). In other words, stronger wines (in terms of alcohol) tend to be rated higher. (This will be investigated further in the next plot.)

Histogram of Alcohol Percentage and Wine Quality:

I use a stacked bar chart to display the distribution of wine quality. The chart below also shows the distribution of alcohol percentage in wines of different quality:

This is such an interesting plot as it conveys a lot. It provides information about the quality of wine, alcohol percentage and also relationship between the two.

Compared to the previous plot, which simply displayed the histogram of wine quality, this plot not only shows the histogram of wine quality but also shows whether alcohol percentage impacts the quality of wine. More specifically, based on this plot one can see the following points:

How the data is distributed based on wine quality: most of the wines in the dataset are rated 5, 6 or 7. There are very few wines rated below 4 or above 7.
Better wines (the ones with higher quality) tend to have a higher percentage of alcohol. As you can see in the plot, the majority of wines with a quality of 5 or lower are light wines (with a low percentage of alcohol), while better wines are stronger in terms of alcohol.
One can also see that the majority of wines in our dataset are labeled as Mild wines (the plot is not directly intended to show this, but it illustrates how an efficient plot lets one extract more information from a simple figure).

In the previous section, we mentioned that the correlation between wine quality and alcohol percentage is 0.435.

Note: here is how wines are labeled based on their alcohol percentage:

Light: Alcohol percentage is below 9.5%
Mild: Alcohol percentage is between 9.5 and 12%
Strong: Alcohol percentage is more than 12%

Reflection

Based on the EDA and further analysis that I did for this dataset, I am convinced that alcohol percentage is the most important factor in deciding the quality of white wine. One important factor that contributes to alcohol percentage is the sugar remaining in the wine after fermentation: the more sugar left after fermentation, the lower the percentage of alcohol in the wine.

Other important factors for deciding the quality of a white wine are SO2 and volatile acidity. Free SO2 has a positive relationship with the quality of white wine, while volatile acidity has a negative one!

Future Analysis

There is definitely a lot of room for further analysis and better models. Below are some ideas to make this study even better:

In this project the models were evaluated on the same data they were trained on. This is not recommended. Performance should be reported on a separate set of data. Therefore, for future studies I recommend splitting the data into train and test sets before doing the analysis.
In the last model, we used random forest classification, which is very prone to over-fitting. Using separate train and test data would help to report the right number for performance. We can also use cross-validation to tune the parameters of the classification method.
The only criterion we used for performance was accuracy. While it is indicative of our model's performance, it is not exhaustive. A better idea would be to look at the predictions and see how far they are from the actual data. For example, if a Great wine is predicted as Good, that is more tolerable than if it is predicted as Poor. Therefore we can use weighted accuracy measures to report on performance.
This is a rich dataset from which many relationships and correlations can be extracted; in this project we investigated only the most obvious relationships between wine quality and its properties.
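Two of these ideas can be sketched briefly in Python (for illustration; the analysis itself was run in R). The first function produces disjoint train and test row indices. The second computes a weighted accuracy from a confusion matrix. The 30% test fraction, the seed, the function names and the weights (full credit for an exact match, half credit for an adjacent rating, none for confusing Poor with Great) are all assumptions chosen for the example. The confusion matrix is the random forest one reported above, which was measured on the training data, so the resulting number is subject to the same caveat.

```python
import random

def train_test_split_indices(n_rows, test_fraction=0.3, seed=42):
    """Shuffle row indices and split them into disjoint train and test sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]

def weighted_accuracy(matrix, weights):
    """Average credit per prediction; weights[i][j] is the credit for
    predicting class j when the actual class is i."""
    total = sum(sum(row) for row in matrix)
    credit = sum(
        matrix[i][j] * weights[i][j]
        for i in range(len(matrix))
        for j in range(len(matrix))
    )
    return credit / total

# Assumed weights: 1 for an exact match, 0.5 for an adjacent rating
# (e.g. Great predicted as Good), 0 for Poor predicted as Great or vice versa.
WEIGHTS = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]

# Random forest confusion matrix reported above
# (rows: actual, columns: predicted; order Poor, Good, Great).
rf_confusion = [
    [1225, 401, 14],
    [293, 1737, 168],
    [22, 328, 710],
]

train_idx, test_idx = train_test_split_indices(4898)
rf_weighted = weighted_accuracy(rf_confusion, WEIGHTS)
print(f"train rows: {len(train_idx)}, test rows: {len(test_idx)}")
print(f"weighted accuracy (training data): {rf_weighted:.3f}")
```

In a real evaluation the model would be fitted on the rows in `train_idx` only, and the confusion matrix passed to `weighted_accuracy` would be built from predictions on the rows in `test_idx`.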

References

Data is taken from the following source:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.

Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016

[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf

[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib