Alpitour is an Italian company that sells prepackaged holiday vacations across the globe. It was founded in 1947 in Italy as a small travel agent and today it is an international group with annual revenue exceed 1.2 billion Euros, ranging from tour operating to hotel management, aviation and travel agencies. With a brand portfolio able to serve highly diversified customers by offering travel packages with “made to measure” experiences, the Alpitour Group is ready to explore the opportunities of a big-data society to better understand its travelers.
Using Cuba as a case study, this project aims to predict the demand for travel from Italy to Cuba for each week in the next year. We are given internal company booking data and we gather external online/media information to build a predictive model.
This is important because Alpitour is a travel company. It needs to know where people will travel ahead of time, so that it can plan accordingly. In particular, Alpitour must plan its flights 6 to 12 months ahead of time for its trips. If it plans too many flights for the corresponding demand, it will lose money. Similarly, if it books too few flights, perhaps it could have gained more money by increasing its capacity. In addition to providing a new way to plan the capacity of future flights, we provide insight into Alpitour's customers and their seasonal behavior. Alpitour's current capacity-planning system is a simple linear model in Excel, where the company sees how many flights it planned last year and the total number of departures, and then makes an estimate for the next year. We plan to provide a much more robust method to make these predictions.
We identified two different figures inside the company that will make use of our predictive tool for the Cuba case and we developed two deliverables accordingly:
We were given access to internal data of the company of the past four years regarding Cuba. Specifically:
Booking data from Italy to Cuba:
Google Ads information related to Cuba as a destination:
Google Ads information related to Hotel Caraibi in Cuba:
Page visits on Alpitour’s website:
Capacity per flight
The data preprocessing stage can be summarized as follows:
This shows the dependencies of the datasets in question. If any of the data changes, Alpitour should update all datasets that are "downstream" (i.e. there are arrows leaving the dataset to some other dataset that should be updated). Due to the sparsity of data of the datasets provided, we decided to use a timestep of a week, aggregating values referring to days of the same week. Otherwise, there is simply too much noise in the data. For example, if a given week has 300 orders, it's possible that 250 were on a single day, with the remaining 50 spread across 6 other days. This would be difficult for a model to learn, since the fluctations would be so drastic. Therefore, it is better to aggregate our data by week. We do this for all of our features from both internal and external data, so that our models can consider a snapshot of features for a given week, whether using time series models or not.
Here we can see the plot of orders and departures from January 2014 to January 2019:
This shows the orders and departures. For our models, we used the data through the end of 2017 as our training set, with 2018 as a validation set, and 2019 as a test set (including future dates to test our seasonal approach). We can see in the data that the orders precede the departures, and that there are seasonal trends when people tend to travel the most, shown by the departure peaks (blue) in the January months.
We also investigated the features of Alpitour customers, and found that ages were about normally distributed, with not much variance in the price and duration of trips overall, although more elderly travelers did tend to buy slightly more expensive packages.
We created features from the moving average convergence divergence (MACD), which captures a sense of momentum in demand for travel from Italy to Cuba. Essentially, we subtract the moving average of the last 26 weeks from the moving average of the last 12 weeks. If this number is positive, we get a sense of how the demand is moving higher in the short term compared to the long term. We also compute a "signal line", which is the moving average of the MACD line over the last 9 weeks. For weeks where the MACD line is larger than the signal line, we have a pseudo 2nd-derivative, meaning that the demand is accelerating in the positive direction, so we create a "buy" feature (indicator: 1 or 0) for these time periods.
One of the company’s requirements was to detect the most influential sources both in digital media and social media communities correlated to Alpitour customers behaviors in travel purchasing. Unfortunately, there were no free solutions to evaluate the impact of a topic (such as Cuba, in this case) or a hashtag over time on the most used social networks in Italy: Facebook and Instagram. For these reasons, we focused our attention on Italian digital media sources, since all the customers of the company are Italians. Here is a list of media sources suggested by Alpitour that we investigated. We built generalized web scrapers for these platforms. They can be divided into two different groups:
Travel Digital Media
Consumer digital media
Additionally, we built a scraping tool to extract news from the EMM (European Media Monitor), which is a news aggregator website. This tool increased dramatically the amount of Italian news per week retrieved.
Once we completed the scraping process, we had a dataset containing the following features for each news article related to Cuba:
Using the Google Translate API we translated the title and corpus preview to English. Then, we used a pretrained model to assign a sentiment score to the title and corpus preview. Sentiment analysis is a natural language processing method to identify/classify subjective information of the news regarding its feeling/opinion. Specifically, for each text analyzed we had a corresponding value in the continuous range [-1, 1]:
From these values, we were able to extract weekly features such as the number of articles per week (large volume is correlated with large orders), percentage of positive vs. negative news, average level of positivity among positive news, etc.
In the above plots, we plot the distribution of the sentiment scores conditioned on whether the response variable is up or down. In this case, the response variable is whether or not the average orders/week increases over the next 1 month compared to the current week. Since we see a a non-uniform distribution of the sentiment scores, we expect them to have some predictive capability. However, we discovered empirically that these were not helpful in predicting future orders.
In order to enrich our dataset, we also retrieved Google Trends data of the past four years for general queries such as “Cuba travel” and “Cuba vacation” and for specfic queries regarding Alpitour’s hotels locations: “Cayo Largo”, “Holguin”, etc. Grouping Google Trends results per week gave us a plot which is highly correlated with the number of orders, as we can see from the image below:
Giving the promising results of Google Trends data, we decided to move our analysis of Google data further and we built a scraping tool for Google search. For each week from January 2014 to April 2018, we retrieved the top 100 results of Google search for the query “Cuba” which were published in that specific week. In this way we were able to gather news, posts and reviews from Italian sources. This time, instead of applying a sentiment analysis on the scraped data we tried a different approach. First of all, we built a “ground truth” of travel-related websites performing a set of queries related to topics such as “travel”, “vacation” and so on. Then, for each week, we evaluated the weight of the portion of results which were published on our "ground truth" (identified) travel-related sites. Specifically, we created two different features:
Unfortunately, the extracted features are noisy, and uncorrelated with respect to orders data, as we can see from the image below where they have been both normalized between 0 and 1.
Our first deliverable is a documented software pipeline which goes through all the development stages, from data cleaning to the final prediction. It is intended to be used by Alpitour data scientists/engineers to reproduce our results and apply the methodology to other locations. Alpitour would first need to collect the correct data for the new location. Code and documentation can be found here.
The pipeline is made of three main stages as we can see from the picture above.
We first modeled the problem as a binary classification problem, since we thought it would be easier to determine the direction of movement rather than the exact number of orders or departures. We classified the direction of average order demand per week in the next 1 to 3 months. Here we can see an example of how the data was labeled in order to fit and evaluate the classification models. This is referring to departures one month ahead:
As explained earlier, we used the moving average convergence divergence (MACD) features to capture the momentum in demand for each time step (week), which was effective, reaching a largest AUC value of 0.79. Our benchmark here was a logistic regression model for the next 1 month, using only the order and departure data (no MACD, no other features), which had an AUC Of 0.57. Therefore, we saw a lot of improvement from start to finish in the classification setting, through choice of models, feature generation, and feature selection.
However, although the results were good and could help Alpitour to know when there will be surges in orders, Alpitour would ideally receive this information on a finer granularity, and so we also explored regression.
For regression, we first tried to predict the average number of departures/week in the next 1-3 months using random forest regression, and achieved R2 values of about 0.85. Although this was good, again, we needed finer granularity. So, we tried to predict the number of departures/week for each of the next 12 weeks:
As we see in these plots, the regression works reasonably well for predicting 1 week ahead, but the performance is much worse for predicting 12 weeks ahead. When we met with Alpitour in Italy, and they explained to us their situation of planning capacity up to a year in advance, and therefore they need predictions not only on a per-week basis, but up to a full year. This was challenging, since we had just shown that our model for predicting the 12th week did not perform well. After reaching this point, we began to consider time series models.
Within time series models, we used three models that performed poorly, and one model that was great. The first three models were ARIMA, which was limited by the features it could use at each time step. The second was a Gaussian process with a seasonality kernel, which did not learn the stochastic process of our data. The third was an LSTM network, which performed very well for predicting 1 week ahead, but was too general when trying to predict a full year ahead. Examples of these LSTM plots are included below. Next, we will explain the best model.
To solve the problems explained previously, we considered a seasonal additive model for regression, which ended up being our most valuable contribution to the project. We are able to predict a full year in advance, and therefore aid Alpitour supply managers in planning travel capacity.
The approach is articulated into different steps:
Recall that Alpitour books flights up to 1 year in advance, so they need predictions of departures up to 1 year in advance and manage their flight capacity accordingly. If Alpitour doesn’t fill its capacity, it may lose money. If Alpitour fills its capacity, maybe it could have sold more trips, which is an opportunity cost. So, if we can help Alpitour by getting more accurate predictions of the departures, we are creating business value. The following section is used only as an estimate in order to see how this project adds value to Alpitour.
We want to quantify the value Alpitour can gain by using our model by using the predicted departures for each week in the validation set (2018) as the updated capacity Y'. By comparing the value lost by Alpitour using its current planned capacity vs. the capacity from our predicted departures, we can estimate the money we earn back for them.
In the validation set (year of 2018), for each week t, we have:
When we calculate the loss using our model result of predicted departures, there is a tiny update for the case when Y't > Yt:
Otherwise, if Y't <= Yt, then Alpitour could book more flight and increase revenue, so there is opportunity cost of Pt * (Yt - Y't)
Unbooked value lost using current capacity:
€505,413.52
Unbooked value lost using predicted departures as the planned capacity:
€440,070.02
We save €65,343.50 of value for the Cuba case in 2018. This is revenue, not profit, so the true profit will be smaller and it depends on Alpitour’s operating profit margin. It also may cost Alpitour to change its capacity schedule. Furthermore, Alpitour has its own planes, and this assumes that the company can modify the capacity to any integer number of seats, but in reality they company may need to fill another full flight (or half) before increasing the capacity.
The Dashboard is one of the two final deliverables. The intended user of the dashboard is the supply manager of Alpitour. The skills of the Supply manager are:
With the dashboard it will be possible to visualize the predictions of each airport served by Alpitour in Cuba. This is a mockup, but it helps the supply manager to understand how to interpret the predictions, which will be constantly changing as the company brings in new data.
The Dashboard visualizes the predictions of the model with a prediction line and uncertainty bounds for each airport. The first screen shows the map of Cuba with all the airports that are used by Alpitour:
The user can select the airport he/she wants to see by clicking on the map or on the buttons on the side of the screen. The airport selected will be highlighted in orange (in this case, Havana). The map will zoom in on the airport selected.
By clicking on "See data" it is possible to see in details the prediction for that airport. The orange line is the prediction given by the model up to one year in the future. The white line is the current planned capacity, which is the number of plane reservations bought by Alpitour to be sold to the customers. The yellow line shows the number of orders already purchased by Alpitour’s customers. When the yellow line stops, it means that Alpitour has not started to book any departures/flights for that time of the year (since it is almost a year ahead of time).
By using a calendar, it is possible to select the period of time in months that the users wants to see in detail. In this case, the image is between May and July 2019. The predicted departures trendline will adjust to the new period of time and will highlight the weeks.
By clicking on the trendline the user is able to see the details about a certain week:
In conclusion, we developed two main models:
In addition to these models, we provide the company with a generalized code repository that will be a pipeline for software engineers and data scientists to replicate our methodology and expand the predictive capability to different destinations. This is the result of our final prediction for L’Avana. The Gaussian process regressor provides a confidence interval with upper and lower bounds for the predicted values of 1 standard deviation. In this way, the supply manager can also take into account quantitatively the possible errors of the future predictions and make his/her decision accordingly.
Interesting avenues for potential development include: