As part of one of comSysto’s Labs (three days of focussed work on a topic that we’re interested in), four of us took part in the Kaggle competition “Personalize Expedia Hotel Searches”. For this competition we applied machine learning algorithms to rank hotels in order to maximize purchases on Expedia. This blog posts describes our approach as well as our experience.
Expedia provided a training and a test data set. The training data contained information about 400.000 hotel search requests containing about 10 million records. I.e. on average, 25 hotels per search request were displayed on Expedia’s website. For each search request the training data contained the hotel’s rank and whether the offer was clicked upon or booked. The test data consisted of 260.000 search requests (about 6.6 million records) without the information about the hotel’s rank and whether the offer was clicked or booked.
The goal of the competition was to predict the probability that a hotel was booked and rank the hotels accordingly, using the features of the search request as input.
Data processing and modelling
All data processing, modelling and predicting was carried out using R running on an AWS EC2 machine with 64GB of RAM. During the course of the project, we followed the workflow pictured above. By far the most time-consuming part of building the model – as is so often the case – was the preparation of the data. About 80% of our time was spent on data inspection and cleansing – understanding the features’ statistical properties, declaring features as factors with specific levels, treating missing values etc.
To evaluate our model’s performance before submitting it to Kaggle, we randomly split off about 10% of the training data to use for cross-validation. An initial check of the data’s statistical properties revealed that it was highly skewed: there were far more negative examples (hotel was neither clicked or booked) than positive ones (hotel was clicked or booked). We addressed this by downsampling our data, reducing the rate of negative to positive examples from 10:1 to 4:1. An alternative approach would have been to adjust the prediction threshold of the algorithm and find its optimal value by evaluating precision and recall via the F-Score.
After inspecting the data, we decided to try both random forests and logistic regression as models. Implementations of both algorithms are availabe in R. For random forests we used the package randomForest. Since we did not have enough time to proceed with both models in the required detail and since random forests delivered better results in early tests, we restricted our approach to random forest only.
It’s obvious that in order to book a hotel, the user first has to click on it. So the only hotels having a chance of being booked are the ones that were also clicked on. One might go ahead and predict only the chance of being booked, ignoring information on the click frequencies.
For the ranking presented to a user, hotels that according to the training data were clicked often (say 1000 times) but only rarely booked (say 100 times) in the past should be able to compete with a hotel that was booked 200 times and clicked 200 times. A trade-off between both events (clicked or booked) is presented in the NDCG metric which was used by Kaggle to evaluate the ranking on the test set.
To account for this, we divided the modelling of the data into two major steps. In the first step, we trained a random forest on the training set, using X out of Y features. This random forest generated predictions for the probability of an offer being clicked. These predictions were then fed into a second random forest, which used them as a new feature in addition to the original ones. The second random forest predicted the probability of the offer being booked. These steps were mirrored in the prediction stage: The first forest takes the test data and predicts the probabilities of being clicked, which are added to the test data and then fed into the second forest, resulting in probabilities of being booked.
Analyzing the metric Kaggle used to assert the quality of the submitted ranking, we decided to use a weighted sum of the two probabilities for the final ranking, giving the booking probability twice the weight of the clicking probability.
On the third (and last) day of our lab, we finally had all the machinery in place to train random forests on our data and let them predict the probabilities of a hotel being clicked and/or booked. After fighting with two nasty bugs, we submitted our results to Kaggle just in time for the competition’s deadline, obtaining a score of 0.478. This ranked us at 118 of of 341 competing teams. Our score of 0.478 is in reach with Expedia’s own ranking score (0.49748) and far better than the Random Order Benchmark (0.34958).
What we took away first of all is that three days is a pretty short time to tackle a problem like this. After all, the Expedia competition was open on Kaggle for two months. We had to spent almost all of the time on data cleansing & preparation as well as on debugging, leaving us little time to improve the actual modelling in order to obtain better results
Still, we learned a lot about machine learning with R and the essentials of tackling data problems as a team. This lab provided a great opportunity to transfer machine learning and data analysis knowledge within comSysto.