Predicting Restaurant Revenue using Ensemble Methods in Machine Learning
The Problem Statement
The TFI Restaurant Revenue Prediction Challenge was a competition hosted on Kaggle, a website dedicated to solving complex data science and machine learning problems. The purpose of this challenge was to predict the annual sales of restaurants based on given objective measurements.
The organizers provided a training dataset of 137 entries and a test dataset of 100,000 entries. The data contained 4 categorical columns and 37 ordinal columns. The categorical columns held the city name, the city type, the opening date of the restaurant, and the type of the restaurant. The ordinal columns were named P1, P2, . . . , P37 and captured various parameters relating to demographics, real estate, and other commercial evaluations. The revenue column in the training dataset held a transformed value of the actual restaurant revenue, and was the target of the predictive analysis. The training and test datasets are available for download here. A sample of the training dataset can be seen in figure 1.
The revenue column in the training dataset had a skewed distribution, and predictive models can be sensitive to such skewness. To guard against this, we applied a log transform to the revenue column, which brought it close to a normal distribution. The histogram of revenue before and after the transformation is shown in figure 2.
Figure 1: Sample Data
Figure 2: Revenue Histograms
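The effect of the log transform on skewness can be illustrated with a small sketch. Our analysis was done in R; the Python below is only a language-neutral illustration, and the revenue values and the skewness helper are hypothetical, not taken from the competition data.

```python
import math
import statistics

def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical right-skewed revenue values (illustrative only).
revenue = [1.2e6, 2.1e6, 2.8e6, 3.3e6, 3.9e6, 4.4e6, 5.2e6, 6.9e6, 1.3e7, 1.97e7]

# Train on the log scale ...
log_revenue = [math.log(v) for v in revenue]

# ... and map predictions back to the original scale with exp().
restored = [math.exp(v) for v in log_revenue]

print("skewness before:", round(skewness(revenue), 2))
print("skewness after: ", round(skewness(log_revenue), 2))
```

A model trained on the transformed target predicts log revenue, so its predictions need the inverse transform before they are submitted or compared with raw revenue.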
Next, we imputed values for the columns P1-P37. We assumed that zeros in these columns indicated missing values, and imputed them using the mice package in R. This reduced the noise in these columns and gave each of them a more stable distribution.
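The idea of treating zeros as missing values can be sketched as follows. This is not what mice does (mice performs multiple imputation by chained equations); the sketch swaps in a much simpler median fill purely to show the zero-as-missing assumption, and the function name and sample column are hypothetical.

```python
import statistics

def impute_zeros_with_median(column):
    """Replace zeros (assumed to mean 'missing') with the median of the non-zero values."""
    observed = [v for v in column if v != 0]
    if not observed:
        # Nothing to learn from: leave the column unchanged.
        return list(column)
    fill = statistics.median(observed)
    return [fill if v == 0 else v for v in column]

# Hypothetical values for one of the P columns.
p_column = [4, 0, 5, 3, 0, 4, 2]
imputed = impute_zeros_with_median(p_column)
print(imputed)
```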
We derived a feature named 'DaysOpen' from the Open.Date column provided to us, by subtracting Open.Date from 1st January 2015. As expected, for most of the rows, we found a high correlation between the revenue of the restaurant and the number of days since it had opened. Although we found a few outliers, we decided to keep this column for further analysis.
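A minimal sketch of the DaysOpen derivation is given below, in Python rather than the R we used. The MM/DD/YYYY date format and the function name are assumptions made for illustration.

```python
from datetime import date, datetime

# The reference date named in the text.
REFERENCE_DATE = date(2015, 1, 1)

def days_open(open_date_text, fmt="%m/%d/%Y"):
    """Number of days between the opening date and the reference date.

    The default format string is an assumption about how Open.Date is stored.
    """
    opened = datetime.strptime(open_date_text, fmt).date()
    return (REFERENCE_DATE - opened).days

print(days_open("01/01/2014"))
```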
We found that some city names present in the test data were missing from the training dataset. The same was true of the restaurant type column. Because a model cannot learn anything about a category it never sees during training, we decided to leave both columns out of our analysis.
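The check for categories that appear only in the test data amounts to a set difference. The sketch below assumes the column values are available as plain lists; the function name and the city lists are hypothetical examples, not the actual data.

```python
def unseen_levels(train_values, test_values):
    """Categories present in the test column but absent from the training column."""
    return sorted(set(test_values) - set(train_values))

# Hypothetical column contents for illustration.
train_cities = ["Istanbul", "Ankara", "Izmir"]
test_cities = ["Istanbul", "Bursa", "Ankara", "Antalya"]

missing = unseen_levels(train_cities, test_cities)
print(missing)
```

A non-empty result means the model would have to predict for categories it never saw during training.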
Building the Prediction Model
We first tried linear regression for the prediction model, but the linear model had a very high prediction error, so we moved to non-linear regression models. We decided to go ahead with two tree-based approaches for regression: one using Random Forests, and the other using Gradient Boosting. Both approaches use an ensemble of decision trees to predict the final outcome. We used two different algorithms because of the outliers in the training data. Although Gradient Boosting is considered to be very robust, it may not perform well when there are many outliers, and Random Forests are better suited to such cases. The Random Forest algorithm is available through the randomForest package in R, and the Gradient Boosting algorithm through the gbm package.
Random Forest is a tree-based regression technique that accepts a random seed and uses it to create bags of sample datasets, each drawn from the training data by sampling with replacement. It then builds a decision tree on each of these samples, considering only a randomly selected subset of the features at each split. The algorithm returns the mean of the outputs of these decision trees as the predicted value. Bagging reduces the probability of over-fitting to the training set.
Gradient boosting is also a tree-based regression technique. It builds an ensemble of weak decision trees in a stage-wise fashion, where each new tree is fitted to reduce the error left by the trees before it, as measured by an arbitrary differentiable loss function. The predicted output value is a weighted sum of the outputs of these weak trees.
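The bagging idea can be sketched in a few lines of plain Python. This is a toy illustration, not the randomForest package: each tree here is a single-split stump, the feature subset is drawn once per tree, and the data and function names are hypothetical.

```python
import random
import statistics

def fit_stump(rows, targets, feature_ids):
    """Best single-split regression stump over the given features (squared error)."""
    best = None
    for f in feature_ids:
        for threshold in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            if not left or not right:
                continue
            lm, rm = statistics.fmean(left), statistics.fmean(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold, lm, rm)
    if best is None:
        # No valid split in this sample: fall back to predicting the mean.
        m = statistics.fmean(targets)
        return (0, float("inf"), m, m)
    return best[1:]

def predict_stump(stump, row):
    f, threshold, left_mean, right_mean = stump
    return left_mean if row[f] <= threshold else right_mean

def fit_forest(rows, targets, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)  # the random seed drives both sampling steps
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample, with replacement
        feats = rng.sample(range(len(rows[0])), n_features)  # random feature subset
        forest.append(fit_stump([rows[i] for i in bag], [targets[i] for i in bag], feats))
    return forest

def predict_forest(forest, row):
    # The forest prediction is the mean of the individual tree outputs.
    return statistics.fmean([predict_stump(s, row) for s in forest])

# Hypothetical two-feature data for illustration.
rows = [(1, 5), (2, 3), (3, 8), (4, 1), (5, 9), (6, 2)]
targets = [10.0, 12.0, 30.0, 33.0, 50.0, 52.0]
forest = fit_forest(rows, targets)
print(predict_forest(forest, (2, 4)))
```

Because each tree sees a different bootstrap sample and feature subset, the averaged prediction is less sensitive to any single training row than one tree would be.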
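The stage-wise construction can be sketched for the special case of squared-error loss, where the negative gradient is simply the residual. This is a toy illustration, not the gbm package: the weak learners are one-split stumps on a single feature, and the data, learning rate, and function names are hypothetical.

```python
import math
import statistics

def fit_stump(xs, residuals):
    """Best one-split stump on a single feature (needs at least two distinct x values)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # excluding the max keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lm, rm = statistics.fmean(left), statistics.fmean(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_boosting(xs, ys, n_stages=50, learning_rate=0.1):
    base = statistics.fmean(ys)  # stage 0: a constant prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_stages):
        # For squared loss, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, learning_rate

def predict_boosting(model, x):
    base, stumps, learning_rate = model
    # Final output: the base value plus a weighted sum of the weak learners.
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def rmse(model, xs, ys):
    return math.sqrt(statistics.fmean([(y - predict_boosting(model, x)) ** 2 for x, y in zip(xs, ys)]))

# Hypothetical one-feature data for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.0, 2.5, 3.1, 7.9, 8.4, 8.8, 15.2, 15.9]
baseline = fit_boosting(xs, ys, n_stages=0)
boosted = fit_boosting(xs, ys, n_stages=50)
print(rmse(baseline, xs, ys), rmse(boosted, xs, ys))
```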
A comparison between both approaches has been made in blog posts here and here, for further reading.
While the Random Forest approach gave a better public leaderboard rank, the Gradient Boosting approach was more robust in terms of the root mean squared error values.
Since the public leaderboard evaluation was based on only 30% of the test data, trusting the leaderboard rank was a gamble. We chose the models based on Random Forests because their public leaderboard scores were better than those of the other models. Although the Random Forest models had higher error values during cross-validation, they fared better on the Kaggle public leaderboard. We realized only after the final results were declared that our models based on Gradient Boosting would have got us into the top 100.
What We Learnt
Interpreting the meaning of the features, and of the values they take, is an important part of building the model. It is especially useful for generating features that contribute significantly to the accuracy of the prediction model. For example, the 0 values in columns P1, P2, . . . , P37 could have been interpreted differently. Since these columns represented demographic and geographic properties of the regions around the restaurants, a 0 could have meant the absence of a certain property, such as not having a school or not being in a residential area. An analysis with this interpretation gave results with better accuracy than what we achieved by imputing data.
We also learnt that when training sets are small, performing cross-validation on the training set provides trustworthy error values. We could validate this because the cross-validation error values were closer to the final evaluation values than the public leaderboard scores on Kaggle were.
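Cross-validation on a small training set can be sketched as below. This is a minimal k-fold illustration in Python, not our R code: the fold count, the placeholder mean-predicting model, and the synthetic data are all assumptions made for the example.

```python
import math
import random
import statistics

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle the row indices and deal them into k folds of near-equal size."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_rmse(xs, ys, fit, predict, k=5, seed=0):
    """Average RMSE over k held-out folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)  # fit only on the rows outside the fold
        squared = [(ys[i] - predict(model, xs[i])) ** 2 for i in held_out]
        fold_errors.append(math.sqrt(statistics.fmean(squared)))
    return statistics.fmean(fold_errors)

# Placeholder model for illustration: always predict the training mean.
def fit_mean(train_x, train_y):
    return statistics.fmean(train_y)

def predict_mean(model, x):
    return model

# Synthetic data with as many rows as the training set described in the text.
xs = list(range(137))
ys = [float(i % 7) for i in range(137)]
cv_error = cross_validated_rmse(xs, ys, fit_mean, predict_mean)
print(cv_error)
```

Every row is held out exactly once, so the resulting error estimate uses all of the training data rather than a single fixed slice.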
The code for the model is available on MSys GitHub. This is for the best submission using Random Forests, which could have placed us at rank 22 on the private leaderboard.