Model Evaluation: Train Test Splits

Analyzing how your forecast performs on unseen data
machine-learning, time-series

Author: Mike Tokic
Published: January 29, 2026

This post is part of the model evaluation chapter within a larger learning series around time series forecasting fundamentals. Check out the main learning path to see other posts in the series.

The example monthly data used in this series can be found here. You can also find the python code used in this post here.

Building Trust In The Forecast

Let’s say your company’s CFO asks you, the hotshot data & AI person, to produce a revenue forecast for the next 12 months. The CFO needs this forecast to communicate expectations to Wall Street, help optimize product inventory, and make capital allocation decisions based on where the business is headed over the next year. You take a model like ARIMA off the shelf and produce that 12-month forecast, shown below using a time series from our example dataset. Let’s take a look.

The future 12-month forecast seems to capture the seasonality from month to month, and even the upward trend. You show the results to the CFO, even calling out the 95% prediction interval, the upper and lower bounds that the future revenue values should fall between with 95% certainty. The CFO looks at it for one second, then says “So what, I can’t use this forecast! How do I know it’s accurate? This is a black box.” Your hopes and dreams, including that potential promotion, are now crushed. Congrats, you learned one of your first hard lessons in the forecast game. Building trust in the forecast is harder than creating one in the first place. It might take seconds to train an ARIMA model, but convincing people to use it might take years.

Hang on a second, you just remembered from our chapter covering univariate models that there is this thing called a residual, which lets you compare historical forecasts to actual values on the training data. Maybe residuals can help your CFO build trust in the forecast?

Residuals

Residual = Actual Value - Forecast Value

Let’s calculate the residuals for the ARIMA model we trained and plot them on some nice charts using different residual analysis techniques.
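Here’s a minimal sketch of that calculation in Python, assuming a statsmodels ARIMA model; the file name, column name, and ARIMA order below are placeholders rather than the exact settings used in this post.

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Load the example monthly series (file and column names are placeholders)
revenue = pd.read_csv("monthly_data.csv", index_col="date", parse_dates=True)["revenue"]

# Fit a seasonal ARIMA model (the order here is illustrative, not tuned)
fit = ARIMA(revenue, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit()

# One-step-ahead forecasts on the training data
fitted_values = fit.fittedvalues

# Residual = Actual Value - Forecast Value (statsmodels also exposes this as fit.resid)
residuals = revenue - fitted_values
print(residuals.describe())
```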

Forecast vs Actual Plot

Overall the residuals look OK. The historical forecast closely tracks the actual values in most months. There are some months with a large residual, shown by a big gap between the two lines.
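Continuing the sketch above (reusing `revenue` and `fitted_values`), the forecast vs actual chart is just the two series overlaid on one plot:

```python
import matplotlib.pyplot as plt

# Overlay the in-sample forecast on the actuals; a big gap between the lines is a big residual
fig, ax = plt.subplots(figsize=(10, 4))
revenue.plot(ax=ax, label="Actual")
fitted_values.plot(ax=ax, label="Historical forecast")
ax.set_title("Forecast vs Actual")
ax.legend()
plt.show()
```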

Residuals Over Time Plot

Now we can see the actual residual for each historical period. Some are positive, where we underforecast, and some are negative, where we overforecast. Ideally, the residuals hover around zero on average, meaning they are truly white noise, i.e. random. If they are not white noise centered around zero, there is likely still predictive signal in the data that the model has not captured.
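A quick sketch of this chart, reusing the `residuals` series from above:

```python
import matplotlib.pyplot as plt

# Residuals over time: positive = underforecast, negative = overforecast
fig, ax = plt.subplots(figsize=(10, 4))
residuals.plot(ax=ax)
ax.axhline(0, color="black", linewidth=1)  # reference line at zero
ax.set_title("Residuals Over Time")
plt.show()
```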

Histogram of Residuals

A histogram is one of the oldest ways to chart data. It creates buckets that a value can fall between on the x axis, then on the y axis shows how many values fall into each bucket. Choosing the number of buckets, or “bins”, is up to the chart creator. Usually you want the histogram to look like a normal distribution centered around zero. This means most of the residuals are close to zero and there is a roughly equal share of residuals above and below zero. In our histogram the residuals are mostly centered around zero, which is good, but it looks like we have more positive residuals than negative. This means on average our model underforecasted the target variable. A helpful insight to have when sharing the results with business partners!
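Sketching the histogram with the same `residuals` series (the bin count of 20 is an arbitrary choice):

```python
import matplotlib.pyplot as plt

# Histogram of residuals; ideally roughly normal and centered around zero
fig, ax = plt.subplots(figsize=(6, 4))
residuals.plot.hist(bins=20, ax=ax)
ax.axvline(0, color="black", linewidth=1)  # reference line at zero
ax.set_title("Histogram of Residuals")
plt.show()
```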

Q-Q Residual Plot

Another interesting way to look at residuals is to plot the forecast on one axis and the historical actual values on the other. A perfect forecast would follow a straight diagonal line on the chart, since the forecast would match the actual target value. Any points above the line indicate the model overforecasting, and any points below the line indicate underforecasting. This chart is helpful because you can see performance from the smallest values in the bottom left all the way to the largest values in the top right. It also makes it easy to spot any potential outlier forecasts.
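A sketch of this chart as described above, a scatter of forecast vs actual with a 45-degree reference line, reusing `revenue` and `fitted_values`:

```python
import matplotlib.pyplot as plt

# Forecast vs actual scatter: points on the diagonal are perfect forecasts,
# points above it are overforecasts, points below it are underforecasts
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(revenue, fitted_values)
lims = [min(revenue.min(), fitted_values.min()), max(revenue.max(), fitted_values.max())]
ax.plot(lims, lims, color="black", linewidth=1)  # 45-degree reference line
ax.set_xlabel("Actual")
ax.set_ylabel("Forecast")
plt.show()
```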

ACF of Residuals

The last chart we will showcase is an ACF chart of the residuals. If you think back to a previous post on autocorrelation, an ACF chart shows the correlation of a variable with lagged versions of itself, for example the correlation of revenue today with revenue from 6 months ago. The ACF chart of our residuals looks good. All of the lagged correlations are below the significance threshold, meaning they are within the blue shaded region. If we had lags outside of the blue shaded region, it would suggest there was still important information in the historical data not being picked up by our model.
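statsmodels has a built-in helper for this chart; a minimal sketch using the same `residuals` series:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# ACF of the residuals; bars inside the shaded region are not significantly
# different from zero, i.e. no leftover autocorrelation for a model to learn
plot_acf(residuals.dropna(), lags=24)
plt.show()
```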

Train vs Test Data

Now we know all about residuals and all of the fun ways we can interpret them. Are you going to show these plots to your boss? Not so fast! Examining residuals is a good idea while training a model, but using residuals to show the model’s performance to others is not. Analyzing the accuracy of the residuals is like having your model take an open-note test. It already has the answers to each question, so it will usually score really well. That’s not a good representation of the real world. Instead you want to make sure your model can generalize well to new and unseen data.

To do that we have to take some of the historical data and hide it from the model. This means splitting your data into two sets: an initial training data set that’s used to, you guessed it, train the model, and a second testing data set that we’ll compare the model’s predictions against. Having this held-out testing data is crucial for understanding whether our model can actually create strong predictions on data it’s never seen before.

Let me state these two concepts again because they are that important.

  1. Train Data: the historical data used to train a model
  2. Test Data: the historical data held out (not used to train the model) to evaluate the predictions of a trained model

When splitting the data into train and test sets, you need to do it carefully. In most machine learning problems, you can split the data randomly and be just fine. But in the world of time series this is a cardinal sin. Because the data is ordered by time, it has to be split into train/test sets by time. Let’s see a quick example of splitting the data randomly.

In the chart we see some historical months being assigned in the training data, while others are in the testing data. This creates multiple problems when training and evaluating models.

  1. Having training data on either side of a single historical date period allows a model to “peek ahead” and learn about future trend and seasonality patterns, effectively cheating to get better results when creating the prediction for that test date period. The model will create accurate predictions on the test data, but we won’t learn whether it can generalize well to new and unseen data when we need to create a forecast into the future.
  2. Some models, like ARIMA, need every historical data point in a time series to be ordered by time with no missing values between each date period. Splitting data randomly creates unordered time series with missing values in the training data, making it impossible for models like ARIMA to properly train and create predictions.

Let’s split the data by time and see how it looks.

Now we have split the data correctly into train and test sets. We took all of the historical data and withheld the final 12 months as the test data. This will allow us to properly evaluate our forecast model, showing the future forecast as well as how the same model performed historically. With the train/test split in mind, let’s create a forecast for the test data and also a final future forecast.
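A minimal sketch of the time-based split, assuming the same monthly `revenue` series as in the earlier snippets:

```python
# Hold out the final 12 months as the test set; everything before that is training data
test_periods = 12
train = revenue.iloc[:-test_periods]
test = revenue.iloc[-test_periods:]

print(f"Train: {train.index.min()} to {train.index.max()}")
print(f"Test:  {test.index.min()} to {test.index.max()}")
```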

Excellent! Now we can show the boss the future forecast, in addition to how a similar forecast would have performed on the historical data as a back test.
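Putting it together under the same assumptions as before (ARIMA order still a placeholder): train on the training window to back test against the held-out months, then refit on the full history for the final future forecast.

```python
from statsmodels.tsa.arima.model import ARIMA

# Back test: fit only on the training window, then forecast the 12 held-out test months
backtest_fit = ARIMA(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit()
test_forecast = backtest_fit.forecast(steps=test_periods)

# Compare the back test forecast to the held-out actuals
backtest_errors = test.to_numpy() - test_forecast.to_numpy()

# Final future forecast: refit on the full history and forecast the next 12 months
final_fit = ARIMA(revenue, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit()
future_forecast = final_fit.forecast(steps=12)
```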

Final Thoughts

Splitting your data into separate training and testing sets helps you understand if the model is capable of creating a robust future forecast. When showing the results of an ML forecast to others, it’s just as important to show how well the model performed historically as it is to show the future forecast. That’s how you build trust. No one wants to use a forecast that has been proven to work poorly in the past.