Time Series First Principles Series
This post dives into the fifth principle of a good time series forecast: order is important. Check out the initial post in this series for a high-level view of each principle.
Baking Cakes Over Making Smoothies
Machine learning (ML) is a lot like cooking. You have various ingredients and can combine them in clever ways to make a tasty dish. Most machine learning approaches, like classification (predicting a category) and regression (predicting a number), can follow a process similar to making a smoothie. We take some data (fruits and veggies) and blend it all together inside our model blender.
Time series forecasting is a whole other beast. It still technically falls under the regression family tree, but it has to be handled very differently. Forecasting is more like baking a cake, where the order in which you do things is very important. For example, you cannot swap the step where you add the eggs with the step where you add the frosting. If you do, you will certainly not be invited back to your nephew’s birthday party next year. In order to bake something tasty, please follow the guidance below.
Time Series Training
Training any sort of machine learning model often requires two separate historical data sets: one used to train the initial model, and another set aside for generating predictions from that model. We can then measure how accurate the predictions were on the test data set. This checks that our new ML model generalizes well to new, unseen data and doesn’t simply overfit to the training data.
Common ML approaches like classification and regression don’t need much sophistication when splitting the historical data into a training set and a testing set. Often it is split randomly. This is similar to making a smoothie: you can randomly throw in bananas, apples, spinach, and blueberries, all without having to think about the order in which you do it.
Take the housing data below. This is a traditional regression problem. Let’s use the total square feet and number of bedrooms to predict how much the house will cost. We can randomly split off 80% of the data to train the model, then hold out the remaining 20% to test how accurate the model is. Randomly splitting the data ensures we get a healthy mix of different observations in each split. A code sketch of this kind of split follows the table.
Square_Feet | Bedrooms | Total_Cost | Split |
---|---|---|---|
3774 | 2 | 822732 | Training |
1460 | 1 | 245280 | Training |
1894 | 4 | 602292 | Training |
1730 | 4 | 550140 | Training |
1695 | 4 | 539010 | Testing |
3692 | 5 | 1358656 | Training |
2238 | 1 | 375984 | Training |
2769 | 5 | 1018992 | Training |
1066 | 5 | 392288 | Testing |
1838 | 1 | 308784 | Training |
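To make this concrete, here is a minimal sketch of a random split in Python using pandas and scikit-learn. The data simply mirrors the table above, and the variable names are illustrative; the exact rows that land in the test set depend on the random seed, so they won’t match the table’s Split column exactly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Toy housing data mirroring the table above
housing = pd.DataFrame({
    "Square_Feet": [3774, 1460, 1894, 1730, 1695, 3692, 2238, 2769, 1066, 1838],
    "Bedrooms":    [2, 1, 4, 4, 4, 5, 1, 5, 5, 1],
    "Total_Cost":  [822732, 245280, 602292, 550140, 539010,
                    1358656, 375984, 1018992, 392288, 308784],
})

X = housing[["Square_Feet", "Bedrooms"]]
y = housing["Total_Cost"]

# Random 80/20 split -- for plain regression, order doesn't matter
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out 20%
```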
A time series has a built-in order to it. It’s said right there in the name: time. Ignoring the order based on time can have disastrous consequences, leaving your final future forecast far less accurate than it should be. Just like baking a cake, we need to make sure a model is trained in the right order. When splitting a historical time series into a training set and a testing set, splitting based on time, not at random, is the proper way to go. Using the oldest data as the training set and the newest data as the testing set makes sure we respect the order of our data based on time.

The example table below is a made-up time series of the price of one specific house. In reality we would need a lot more data to train a good time series model, but just be cool for a minute and go with me on this one. The split column now has the test data set at the very end instead of scattered randomly across time. We can now use features like interest rates and GDP growth to help us forecast the price of this house over time. The first 9 months of data will train the model, and the final 3 months will be used to test the model’s accuracy. A code sketch of a time-based split follows the table.
Date | Interest_Rate | GDP_Growth | Total_Cost | Split |
---|---|---|---|---|
January 2002 | 3.43635 | 1.58111 | 315052 | Training |
February 2002 | 4.87679 | 0.03085 | 314723 | Training |
March 2002 | 4.32998 | -0.04544 | 312854 | Training |
April 2002 | 3.99665 | -0.04149 | 311865 | Training |
May 2002 | 2.89005 | 0.26061 | 309452 | Training |
June 2002 | 2.88999 | 0.81189 | 311106 | Training |
July 2002 | 2.64521 | 0.57986 | 309675 | Training |
August 2002 | 4.66544 | 0.22807 | 314681 | Training |
September 2002 | 4.00279 | 1.02963 | 315097 | Training |
October 2002 | 4.27018 | -0.15127 | 312357 | Testing |
November 2002 | 2.55146 | 0.23036 | 308345 | Testing |
December 2002 | 4.92477 | 0.41591 | 316022 | Testing |
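In code, a time-based split is just slicing by position rather than sampling at random. Here is a minimal sketch with pandas, again mirroring the table above (only the target column is shown to keep it short):

```python
import pandas as pd

# Monthly housing prices mirroring the table above
dates = pd.date_range("2002-01-01", periods=12, freq="MS")
house = pd.DataFrame({
    "Total_Cost": [315052, 314723, 312854, 311865, 309452, 311106,
                   309675, 314681, 315097, 312357, 308345, 316022],
}, index=dates)

# Respect the order of time: oldest rows train, newest rows test
train = house.iloc[:9]   # January through September 2002
test  = house.iloc[9:]   # October through December 2002
```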
Data Leakage
Whenever time is involved in machine learning, the probability of shooting yourself in the foot rises. This has to do with the concept of data leakage. Data leakage occurs when information from outside the training data set is used to create the model, leading to overly optimistic accuracy estimates. It can also happen when we train with data that won’t be available in the future when we need to create new predictions.
In time series forecasting we have already discussed one component of data leakage: splitting the data correctly based on time. Take the table below, where instead of splitting properly by time the data is now split randomly. Our model can now “see ahead in time” when training, and in effect cheat when being evaluated on the testing splits. For example, for the test observation in July 2002 the model can learn from data on either side of that month, picking up on both past and future trends and seasonality. That makes it easy to predict what the housing cost in July should be, since the model has information from before and after that month. With this approach our test accuracy will look a lot better than in the previous example where the splits are based on time. Future forecast performance will suffer, though, since we have now trained and chosen a model that may only be good at interpolating between two known points, instead of extrapolating into the truly unseen future. A sketch of a safer evaluation scheme follows the table.
Date | Interest_Rate | GDP_Growth | Total_Cost | Split |
---|---|---|---|---|
January 2002 | 3.43635 | 1.58111 | 315052 | Training |
February 2002 | 4.87679 | 0.03085 | 314723 | Testing |
March 2002 | 4.32998 | -0.04544 | 312854 | Training |
April 2002 | 3.99665 | -0.04149 | 311865 | Training |
May 2002 | 2.89005 | 0.26061 | 309452 | Training |
June 2002 | 2.88999 | 0.81189 | 311106 | Training |
July 2002 | 2.64521 | 0.57986 | 309675 | Testing |
August 2002 | 4.66544 | 0.22807 | 314681 | Training |
September 2002 | 4.00279 | 1.02963 | 315097 | Training |
October 2002 | 4.27018 | -0.15127 | 312357 | Training |
November 2002 | 2.55146 | 0.23036 | 308345 | Training |
December 2002 | 4.92477 | 0.41591 | 316022 | Testing |
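A practical way to avoid this trap during evaluation is time-ordered cross-validation. Below is a minimal sketch using scikit-learn’s `TimeSeriesSplit`, which only ever tests on months that come after the months it trained on (the data again mirrors the tables above):

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Same 12 months of housing costs as the earlier sketch
dates = pd.date_range("2002-01-01", periods=12, freq="MS")
house = pd.DataFrame({
    "Total_Cost": [315052, 314723, 312854, 311865, 309452, 311106,
                   309675, 314681, 315097, 312357, 308345, 316022],
}, index=dates)

# Each fold trains on a prefix of the series and tests on the
# months immediately after it -- no peeking ahead in time
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(house):
    print(f"train: {dates[train_idx[0]]:%b}-{dates[train_idx[-1]]:%b} | "
          f"test: {dates[test_idx[0]]:%b}-{dates[test_idx[-1]]:%b}")
```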
Ok, so we know not to split data randomly when training. Another thing to watch out for is how features are used in the model. In our time series housing example, we can use date information (month, quarter, etc.) along with macro features like interest rates and GDP growth. Let’s say we follow the right approach, split the data based on time, and see that we get good results on the test data. We can now take our model into production and try to create a forecast for the future. But wait, what do we do with the future feature values of interest rate and GDP growth? This is another potential data leakage issue, where data used to train the model is not available to create new predictions in the future.

You might be thinking, no problem, we can just create a forecast of future interest rates and GDP growth, right? Wrong. If you can produce accurate interest rate and GDP growth forecasts, then you shouldn’t be reading this post. You should instead be sitting on your own private island, watching the return on your flagship hedge fund skyrocket. See where I’m going here? If you cannot reliably predict future values of a feature, then you should consider not using that feature. You might know with 100% certainty when a holiday or special event will take place, but not even expert economists can perfectly predict interest rate fluctuations.

Instead, you can use feature lags: rather than feeding the model real-time interest rates or GDP growth, feed it values from, say, 3 or 12 months ago. The lagged values are always known at prediction time, so you never have to forecast the features themselves, which helps prevent compounding errors in your forecast.
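Creating lags is a one-liner in pandas with `shift`. A minimal, self-contained sketch using made-up macro values like the ones in the tables above:

```python
import pandas as pd

# Made-up monthly macro features, like the tables above
dates = pd.date_range("2002-01-01", periods=12, freq="MS")
macro = pd.DataFrame({
    "Interest_Rate": [3.44, 4.88, 4.33, 4.00, 2.89, 2.89,
                      2.65, 4.67, 4.00, 4.27, 2.55, 4.92],
    "GDP_Growth":    [1.58, 0.03, -0.05, -0.04, 0.26, 0.81,
                      0.58, 0.23, 1.03, -0.15, 0.23, 0.42],
}, index=dates)

# A 3-month lag: the row for April uses January's value, a number
# we are guaranteed to already know when forecasting April
macro["Interest_Rate_lag3"] = macro["Interest_Rate"].shift(3)
macro["GDP_Growth_lag3"]    = macro["GDP_Growth"].shift(3)

# The earliest rows have no lagged history yet, so drop them
model_data = macro.dropna()
print(model_data.head())
```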
finnts
Looking for a way to never worry about the order of your time series data again? Have no fear, because finnts is here! Ok, enough with the used car salesman talk. The finnts package is something other outstanding team members and I have built to automate all of the tedious aspects of time series forecasting. The package can automatically handle the proper splits of your data and has built-in data leakage prevention. Check out the package to learn for yourself how easy forecasting can be.
Final Thoughts
Remember, order is important in forecasting. Make sure you don’t mix up your data when training models, and keep a lookout for data leakage. Do this right and you just might get invited back to your nephew’s birthday party next year.