Predicting the Origin of Wine with Random Forest Classifiers
Wine fraud is a multibillion-dollar industry, and a single bottle of fake wine can cost up to millions of dollars. We use a dataset of chemical properties of legitimate wines to predict a wine’s region of origin, and thereby to test a wine’s authenticity against the region it claims to originate from. To select the best classification model, we compared the accuracy scores of 4 classifiers: Random Forest, Logistic Regression, K Nearest Neighbors, and Support Vector Machines. Using the feature importances obtained from the Random Forest, we showed that just 4 of the 13 features in the dataset (Proline, Color Intensity, Flavanoids, and Alcohol content of the wine) were sufficient to train the classifiers to an accuracy score of 1.0.
While the classification is performed on a small dataset with wine originating from 3 regions, the success of the Random Forest model presents an exciting opportunity to expand the wine dataset using just the main 4 features to define wine properties, which can save cost and time compared to extracting 13 features. Furthermore, in testing for authenticity of wine, just a small sample of the expensive wine needs to be extracted to test for the 4 features.
Dataset
The dataset comes from the UCI Machine Learning Repository and contains 13 features describing each wine, with only 178 data points. These data points were obtained through a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars, which we refer to as ‘Region 1’, ‘Region 2’, and ‘Region 3’. Principal component analysis (PCA) on the entire dataset showed that the first two principal components, reported under ‘Alcohol’ and ‘Malic Acid’ in Figure 1, account for 55% of the total variance. The top table of Figure 1 lists the explained variance of each entry together with the running total, and the bottom plot shows the wine dataset projected onto the first two principal components.
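For reference, the share of variance attributed to a principal component is conventionally its eigenvalue divided by the sum of all eigenvalues of the data's covariance matrix. The symbols below ($\lambda_k$ for the $k$-th eigenvalue, $d = 13$ for the number of features, and the name $\mathrm{EVR}$) are notation assumed here, not taken from the original analysis. Under that reading, the 55% quoted above is the cumulative share of the first two components.

```latex
\[
\mathrm{EVR}_k = \frac{\lambda_k}{\sum_{j=1}^{d} \lambda_j},
\qquad
\mathrm{EVR}_1 + \mathrm{EVR}_2
  = \frac{\lambda_1 + \lambda_2}{\sum_{j=1}^{d} \lambda_j}
  \approx 0.55
\]
```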
Methods
Four classification methods were tested on the dataset.
Random Forest - ensemble learning method that reports the mode of the predictions of multiple decision trees, where a classification decision tree is a supervised learning algorithm that repeatedly splits the data on threshold values of individual features
K Nearest Neighbors - non-parametric supervised learning classifier that predicts the class of an individual data point from the classes of the points closest to it
Multiclass Logistic Regression - supervised learning classifier that learns feature weights from a training dataset and applies softmax activation to turn the weighted features into class predictions
SVM Classification - supervised learning model that separates classes with a maximum-margin boundary and can handle non-linear class boundaries through kernel functions
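Two of the definitions above are easy to illustrate in a few lines: the "mode of the predictions" step of a random forest and the softmax step of multiclass logistic regression. The following is a minimal sketch using only the Python standard library. The function names, the vote list, and the score values are hypothetical and chosen for illustration; they are not taken from the original experiments.

```python
import math
from collections import Counter


def majority_vote(tree_predictions):
    """Random-forest style aggregation: return the most common class label."""
    return Counter(tree_predictions).most_common(1)[0][0]


def softmax(scores):
    """Turn per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical outputs of five trees for one wine sample (illustration only).
votes = ["Region 1", "Region 3", "Region 1", "Region 2", "Region 1"]
print(majority_vote(votes))  # Region 1

# Hypothetical per-class scores for one wine sample (illustration only).
probs = softmax([2.0, 0.5, -1.0])
print([round(p, 3) for p in probs])
```

The class with the highest score always receives the highest softmax probability, so the predicted region is simply the index of the largest entry in `probs`.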
For this analysis, I looked in depth at the K Nearest Neighbors method.
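The K Nearest Neighbors idea can be sketched without any library support. The example below is a minimal illustration, assuming Euclidean distance and a simple majority vote among the `k` closest training points. The function name `knn_predict` and the toy two-feature points are hypothetical and are not measurements from the wine dataset.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical, already-scaled two-feature samples (illustration only).
points = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.15, 0.15)]
labels = ["Region 1", "Region 1", "Region 2", "Region 2", "Region 1"]

print(knn_predict(points, labels, query=(0.12, 0.18), k=3))  # Region 1
```

Because the prediction depends directly on distances, features measured on very different scales are usually standardized first; otherwise the feature with the largest numeric range dominates the distance.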
The figure shows Random Forest to be the best classifier, with an accuracy score of 1.0. For the Random Forest model, the accuracy score nearly plateaus at 4 features, indicating that information on ‘Proline’, ‘Color Intensity’, ‘Flavanoids’, and ‘Alcohol’ alone is sufficient to train an accurate Random Forest classifier.
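An experiment of this kind, ranking features by importance and then tracking accuracy as features are added one at a time, might be set up as in the sketch below. It assumes scikit-learn and its bundled copy of the UCI Wine data (`load_wine`). The 70/30 stratified split, the 100 trees, and the random seed are assumptions made for illustration, since the original settings are not stated, so the exact scores and ranking may differ from those reported here.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

wine = load_wine()
X, y = wine.data, wine.target

# Assumed split: 70% train, 30% test, stratified by class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Rank features by importance using the training data only.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
ranked = np.argsort(forest.feature_importances_)[::-1]

# Retrain on the top-k features for k = 1..13 and record test accuracy.
scores = []
for k in range(1, X.shape[1] + 1):
    top_k = ranked[:k]
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train[:, top_k], y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test[:, top_k])))

# Each row: number of features used, the feature added at that step, accuracy.
for k, score in enumerate(scores, start=1):
    print(k, wine.feature_names[ranked[k - 1]], round(score, 3))
```

Ranking the features on the training split alone keeps the test split untouched until the final accuracy measurement, so the reported scores are not inflated by the selection step.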
Conclusion
If we want to classify which region a wine comes from, only 4 features are required for perfect classification on this dataset: Proline, Color Intensity, Flavanoids, and Alcohol. These four features can be chemically tested to verify the authenticity of a wine’s origin. Proline is an amino acid found in wine grapes. Color intensity is measured from how strongly the wine absorbs light at a given wavelength (typically 520 nm for red wines), and it decreases the longer the wine is aged. Flavanoids are polyphenols, a class of chemical compounds found in wine. Alcohol content is a standardized measure of how much alcohol a beverage contains, usually given as a percentage by volume; for reference, the NIH defines a standard drink of wine as 5 fluid ounces at about 12% alcohol.
A benefit of these features is that tools for chemically testing wine are readily available for purchase. A Coravin could be used to extract a small sample of wine from a bottle for chemical testing. This sample could then be compared against the reported grape characteristics of the wine cultivars to verify the authenticity of the wine. Since older wines tend to fetch higher prices, the color intensity test is particularly important, because color intensity changes as the wine ages.
Using fewer features to predict a wine’s region of origin means that less cost and time go into sampling the wine, which would make this model useful for wine auctioneers who want to be vigilant in verifying the wine they sell. Figure 11 shows that the Random Forest model achieves highly accurate classification with the fewest features. We determined that four features are enough to verify wine authenticity based on the input dataset.