Exploratory Data Analysis: Wine Dataset

Alex Wang
Mar 5, 2021

As a beginner in machine learning, and as an assignment for my Applications Machine Learning class, I selected a dataset from the UCI Repository detailing the chemical and physical attributes of wines. The wines come from three different cultivars (cultivated grape varieties) grown in the same region of Italy. I find the chemical makeup of different wines fascinating because small differences in production, whether in the type of grape used or in the aging process, can cause large distinctions in the flavor profile. There are many chemical attributes to consider, and production is a deliberate and intricate process. Based on this data, we can’t necessarily tell how a wine tastes, but we can learn how the chemical compositions differ.

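First, in my analysis, I explore the structure and makeup of the data. I’ve applied column titles to the respective columns. We can see that there are 14 columns (the cultivar label plus 13 attributes) and 177 instances. The UCI documentation lists 178 instances, so the missing row was most likely read as a header when the file was loaded.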
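Here is a minimal sketch of how the loading step could look, assuming pandas and a comma-separated file with no title row. The column names and the three sample rows are illustrative placeholders, not the real measurements and not copied from my notebook. Passing `header=None` keeps pandas from consuming the first data row as column titles.

```python
import io

import pandas as pd

# Hypothetical column names: the cultivar label plus 13 attributes.
COLUMNS = [
    "Cultivar", "Alcohol", "Malic Acid", "Ash", "Alcalinity of Ash",
    "Magnesium", "Total Phenols", "Flavanoids", "Nonflavanoid Phenols",
    "Proanthocyanins", "Color Intensity", "Hue", "OD280/OD315", "Proline",
]

# Illustrative rows in a headerless CSV layout (made-up values).
SAMPLE = io.StringIO(
    "1,14.2,1.7,2.4,15.6,127,2.8,3.1,0.28,2.3,5.6,1.04,3.9,1065\n"
    "2,12.4,1.1,2.0,16.8,88,2.0,1.6,0.40,1.5,3.1,1.05,2.8,520\n"
    "3,13.1,3.3,2.4,21.5,99,1.7,0.8,0.45,1.2,7.4,0.68,1.7,630\n"
)

# header=None says the file has no title row, so every line is kept as data.
data = pd.read_csv(SAMPLE, header=None, names=COLUMNS)

print(data.shape)  # (rows, columns)
print(data.head())
```

With a real file, the `io.StringIO` object would be replaced by the path to the downloaded data.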

We can see that the data varies even within a cultivar, which is quite surprising to me. I initially thought that wines made from the same cultivar would be quite similar, but even alcohol content alone shows quite a discrepancy between instances. This could be a result of an inconsistent production process or of natural variation in the grapes themselves.

Since cultivar is the main grouping variable in the dataset, I created a histogram to see how the instances are distributed across cultivars.

We see that our data is not evenly distributed. In fact, there is about a 25-instance difference between cultivars 2 and 3. This is important to consider as we continue: some of our analysis may be less reliable for the cultivars with fewer instances. We may also see more stable results for cultivar 2 because it has the most instances.
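The counts behind a histogram like this can also be read off directly. This is a small sketch assuming pandas, with made-up labels standing in for the real cultivar column:

```python
import pandas as pd

# Illustrative cultivar labels (not the real dataset).
cultivar = pd.Series([1, 1, 1, 2, 2, 2, 2, 2, 3, 3], name="Cultivar")

# Count instances per cultivar, ordered by label rather than by frequency.
counts = cultivar.value_counts().sort_index()

# Share of the dataset that each cultivar represents.
shares = counts / counts.sum()

print(counts)
print(shares.round(2))
```

Looking at shares next to raw counts makes it easier to judge how lopsided the groups are.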

It’s also useful to call data.describe() to view summary statistics. We can see the mean, standard deviation, and quartiles of each attribute in our dataset. This helps in understanding the data and posing questions to analyze. To dig further, I’ve grouped the data by cultivar and looked at the means. By doing so, we can see which attributes differ most between cultivars.

This is just the first 8 columns, as the rest could not fit on the page.
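The summary and grouping steps can be sketched as follows, assuming pandas. The small table and its column names are illustrative stand-ins for the real data:

```python
import pandas as pd

# Illustrative values only (not the real measurements).
data = pd.DataFrame({
    "Cultivar": [1, 1, 2, 2, 3, 3],
    "Alcohol": [14.0, 13.0, 12.0, 12.5, 13.0, 13.5],
    "Color Intensity": [5.0, 6.0, 3.0, 3.5, 7.0, 8.0],
})

# Overall summary statistics for the attribute columns.
summary = data.drop(columns="Cultivar").describe()

# Mean of every attribute within each cultivar.
means = data.groupby("Cultivar").mean()

# Gap between the highest and lowest cultivar mean for each attribute.
spread = (means.max() - means.min()).sort_values(ascending=False)

print(summary)
print(means)
print(spread)
```

The raw gap depends on each attribute’s scale, so it is only a first hint at which columns separate the cultivars.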

I’m most interested in analyzing the distribution of Alcohol, Malic Acid and Color Intensity by cultivar. This is the first question I’m posing: How do the three cultivars differ in alcohol, malic acid and color intensity and are there any outliers in our data?

First looking at alcohol by cultivar:

As we expected from our grouping analysis earlier, if a wine has a lower alcohol content, it is likely from cultivar 2. What’s really interesting about these boxplots is that for cultivars 1 and 3 the data is quite concentrated, which points to consistency, while cultivar 2 has multiple outliers and a wide range between its whiskers. The distributions for cultivars 1 and 3 look roughly symmetric, although a boxplot alone can’t confirm that they are normal. This suggests there may be inconsistencies in the production of the wines from cultivar 2, which I would want to explore further.
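To make the idea of a boxplot outlier concrete, here is a sketch of the common 1.5 × IQR rule, which is the default whisker rule in matplotlib boxplots. I am assuming that rule applies to the plots above. The helper name `iqr_outliers` and the values are made up for illustration:

```python
import pandas as pd


def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the points that fall beyond 1.5 * IQR from the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]


# Illustrative alcohol values for two cultivars (not the real data).
data = pd.DataFrame({
    "Cultivar": [1] * 5 + [2] * 6,
    "Alcohol": [13.5, 13.7, 13.8, 13.9, 14.1,
                12.0, 12.1, 12.2, 12.3, 12.4, 14.5],
})

# Apply the rule separately within each cultivar.
outliers = {
    cultivar: iqr_outliers(group["Alcohol"]).tolist()
    for cultivar, group in data.groupby("Cultivar")
}
print(outliers)
```

Because the rule is applied per group, a value that is ordinary for one cultivar can still be flagged in another.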

The Malic Acid distribution is even more interesting. There is a wide range for all three cultivars, so cultivar alone likely doesn’t determine malic acid content. We would not have been able to come to this conclusion from the mean analysis alone. Cultivar 1 is most interesting to me because the values seem to cluster at two levels, around 2 and around 4. If I had to guess, this is an artifact of the small sample size, and more instances would help explain it. The boxplots suggest that there are many outliers in cultivar 1, but some of those points may simply belong to the second cluster rather than being true outliers.

Color intensity is a physical attribute of the wine that tells us how saturated the color is. It is measured chemically but describes a visible property. Based on this distribution, cultivar 3 has the largest range, suggesting significant variation in how its wines look. Again, if a wine has a low color intensity relative to the others, it is likely from cultivar 2. Cultivar 2 is also the only one with notable outliers. We now have a better understanding of how each cultivar’s wine is composed.

My next question is: Does alcohol content have any correlation with other chemical/physical attributes of wine? This analysis no longer groups the wines by cultivar and instead looks at the dataset as a whole.

First, we’ll look at the physical attributes: Color Intensity and Hue.

By plotting these data points on a scatter plot and adding a regression line, we can see how the data trends. Alcohol content is clearly more correlated with Color Intensity than with Hue. If we fit a least-squares regression line (LSRL) to each, the r-squared for Alcohol vs. Hue would be much lower than for Alcohol vs. Color Intensity. In fact, there appears to be almost no correlation between alcohol and hue, but there is a positive linear correlation between alcohol and color intensity. This matches my experience: the darker wines I’ve had tend to have more alcohol. We can’t conclude that more alcohol causes a darker color or vice versa, and the data doesn’t tell us why the two move together, but this is an interesting correlation that we would want to explore further.
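Rather than judging r-squared by eye, it can be computed from the fitted line. This is a sketch assuming NumPy. The helper `r_squared` and the sample values are illustrative, chosen so that one pair is strongly related and the other is not:

```python
import numpy as np


def r_squared(x, y):
    """R-squared of a least-squares line fitted to y as a function of x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)  # variation the line leaves unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in y
    return 1.0 - ss_res / ss_tot


# Illustrative values only (not the real measurements).
alcohol = [12.0, 12.5, 13.0, 13.5, 14.0]
color_intensity = [3.0, 4.1, 4.9, 6.2, 6.8]
hue = [1.0, 0.9, 1.1, 0.9, 1.0]

r2_color = r_squared(alcohol, color_intensity)
r2_hue = r_squared(alcohol, hue)
print(round(r2_color, 3), round(r2_hue, 3))
```

For a straight-line fit with one predictor, this value equals the square of the Pearson correlation coefficient.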

Looking at the chemical attributes, there seems to be little to no correlation between alcohol content and the respective chemical components. This is plausible because these are separate chemical components that are likely unaffected by the alcohol content. I was just curious to see whether alcohol content correlated directly with any other chemical component. The regression lines slope upward for all three components, but concluding that there is a positive relationship would be a stretch, as the points are widely scattered and the r-squared values would be low.
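A quick way to scan alcohol against several attributes at once is a correlation table. This is a sketch assuming pandas. The two attribute columns are arbitrary stand-ins, not necessarily the three components plotted above, and the values are made up:

```python
import pandas as pd

# Illustrative values only; column choices are hypothetical.
data = pd.DataFrame({
    "Alcohol": [12.0, 12.5, 13.0, 13.5, 14.0],
    "Ash": [2.3, 2.1, 2.5, 2.2, 2.4],
    "Magnesium": [90, 110, 95, 120, 100],
})

# Pearson correlation of every other column with Alcohol, strongest first.
corr_with_alcohol = (
    data.corr()["Alcohol"]
    .drop("Alcohol")
    .sort_values(ascending=False)
)

print(corr_with_alcohol.round(2))
print((corr_with_alcohol ** 2).round(2))  # matching r-squared values
```

A correlation of 0.3 corresponds to an r-squared of only 0.09, which is why an upward-sloping line can still describe a very weak relationship.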

For my final question, I want to explore Proline across the three cultivars and against the physical attributes of color intensity and hue. Proline is an amino acid found in grapes and wine. I am treating it here as a rough indicator of how opaque the wine is, which is my own assumption rather than an established property, and it may also help show how the grapes behind each cultivar differ. By looking at how Proline is distributed across the three cultivars, we can get a better sense of how the wines might differ in appearance.

As you can see on the scale, there is quite a large difference between cultivar 1 and cultivars 2 and 3. If Proline does track opacity, this suggests that wines from cultivar 1 are more opaque than those from cultivars 2 and 3. It would be a stretch to conclude that cultivar 1 produced red wine and cultivars 2 and 3 produced white wines, but it does suggest that the wines from each cultivar are quite different and likely produced from different grapes.

Now let’s look at how Proline content relates to color intensity and hue.

As we previously guessed, there seems to be a correlation between Proline and color intensity. This would make sense if a more opaque wine also has a more saturated color. However, the relationship might not be linear, as many points between Proline levels of 400 and 1000 fall far from the line. Again, this may just reflect errors or inconsistencies in the data, as we saw with cultivar 2 earlier, but I would want to try polynomial regressions on this data to see whether they fit better. As for hue, there seems to be a very weak correlation between Proline and hue, which may be a result of our limited sample size. I would want more instances to see whether there is a correlation or not. Intuition tells me that Proline content would correlate with the hue of the wine, but I would need more data to reach that conclusion.
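The polynomial comparison I have in mind could look something like this. It is a sketch assuming NumPy, with a hypothetical helper `fit_r_squared` and made-up values shaped so the trend flattens at high Proline levels:

```python
import numpy as np


def fit_r_squared(x, y, degree):
    """R-squared of a least-squares polynomial fit of the given degree."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize x so powers of large values stay well conditioned.
    z = (x - x.mean()) / x.std()
    coeffs = np.polyfit(z, y, deg=degree)
    predicted = np.polyval(coeffs, z)
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Illustrative values only (not the real measurements).
proline = [300, 450, 600, 750, 900, 1050, 1200, 1350]
color_intensity = [2.0, 3.4, 4.5, 5.2, 5.7, 6.0, 6.1, 6.2]

# Compare a straight line (degree 1) with a quadratic (degree 2).
scores = {d: fit_r_squared(proline, color_intensity, d) for d in (1, 2)}
print({d: round(s, 3) for d, s in scores.items()})
```

R-squared on the fitted data never decreases as the degree goes up, so a higher score alone doesn’t prove the curve is the better model. Checking the fit on held-out instances would be a fairer comparison.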

This concludes my exploratory data analysis. There is much more to learn about this data, and my analysis only skims the surface, but it represents a foundation in my knowledge of machine learning.
