This post is part of the exploratory data analysis chapter within a larger learning series around time series forecasting fundamentals. Check out the main learning path to see other posts in the series.
The example monthly data used in this series can be found here. You can also find the python code used in this post here.
Time Series Data Has Memory
The biggest factor that differentiates a time series from other forms of data is order. Each observation in a time series is ordered by, well, you guessed it, time. Because of this order, unique relationships develop in the data. Take temperature, for example. If today's temperature is 70°F, and the temperature over the previous seven days was around 65-80°F, what do you think the temperature tomorrow will be? Probably not below freezing. It will most likely be similar to today's. This memory of the past is one of the most powerful concepts in time series forecasting. It's the concept of autocorrelation.
Autocorrelation is the relationship between a value in a time series and the previous values in that same series. In simple terms, it measures how much the current value depends on past values. For example, if a high temperature today makes a high temperature tomorrow likely, the temperature series has positive autocorrelation.
Calculating Autocorrelation
Getting the autocorrelation is as simple as computing the correlation of the existing time series with a lagged copy of itself. For example, let's take a time series from our example data set, shown below.
We can create lags of one month, two months, and onward, then calculate the correlation between the original time series (a lag of 0) and each of these lagged series. The results of that process are shown in the chart below, for lags up to 24 months.
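As a minimal sketch, that lag-and-correlate step looks like this in pandas (assuming the example data has been loaded into a Series named `series`; the file name in the comment is a placeholder):

```python
import pandas as pd

# Assumed setup: the example monthly data loaded as a pandas Series
# (the file name below is a placeholder, not the real one).
# series = pd.read_csv("monthly_data.csv", index_col=0, parse_dates=True).squeeze("columns")

def lag_correlations(series: pd.Series, max_lag: int = 24) -> pd.Series:
    """Correlate the series with shifted copies of itself, one lag at a time."""
    # series.corr() ignores the NaN rows that shifting introduces,
    # which is also what pandas' built-in Series.autocorr(lag) does for a single lag.
    corrs = {lag: series.corr(series.shift(lag)) for lag in range(max_lag + 1)}
    return pd.Series(corrs, name="autocorrelation")
```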
Correlation is shown on the y-axis and the lag amount is shown on the x-axis. The chart shows high autocorrelation at lags of 0, 1, 12, and 24 months. A lag of 0 has a perfect correlation because it's the exact same as the original time series. These lag correlations are positive values, meaning there is a positive correlation with the original time series. They are also above the shaded blue region of the chart. This shaded region is a confidence interval, which helps us filter out lags that are not statistically significant, meaning their apparent correlation could just be due to random chance. Any lag outside of the shaded region is significant and likely reflects a real relationship with our original time series.
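In practice, a chart like this, shaded confidence interval included, can be drawn with statsmodels. A minimal sketch, again assuming the example data is in `series`:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Autocorrelation plot for lags 0 through 24. The shaded band is a
# 95% confidence interval by default (alpha=0.05).
plot_acf(series, lags=24)
plt.xlabel("Lag (months)")
plt.show()
```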
Partial Autocorrelation
Autocorrelation has one fatal flaw though: it can provide misleading correlations because it doesn't account for correlations between the lags themselves. For example, there might be a very strong correlation between the original time series and lag 1. Because of this correlation, there is a good chance that lag 2 also shows a strong correlation, but that correlation might only be due to lag 2 carrying similar information to lag 1. So the end results could be misleading.
To account for this, partial autocorrelation was created. This process controls for the correlations between lags, making sure each lag's correlation reflects only its direct relationship with the original time series. It is the preferred method of analyzing autocorrelation, so make sure it's part of your time series toolkit. Here is the partial autocorrelation of our example time series.
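The chart below can be reproduced with statsmodels' `plot_pacf`; a minimal sketch, assuming the same `series` as before:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

# Partial autocorrelation plot for the same series. Like plot_acf, it
# shades a confidence interval so the significant lags stand out.
plot_pacf(series, lags=24)
plt.xlabel("Lag (months)")
plt.show()
```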
Reading the Tea Leaves
Let's take a deeper look at the partial autocorrelation and see what insights it might provide. When looking at the partial autocorrelation chart, the biggest correlation is at lag 0. This makes sense, since a lag of 0 is just the original time series, so there will be a perfect correlation of 1. Lag 1 shows a strong correlation, which also makes sense, since it contains the values closest to the original series. There are also strong correlations at lags 11, 12, and 13. This confirms our data has strong seasonality. The highest of these is at lag 12, which means values today are highly correlated with values at the same time in the previous year (12 months ago). This information will be invaluable when we start to build models later in the series.
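If you'd rather pull the significant lags out programmatically than read them off the chart, here's a rough sketch using statsmodels' `pacf` function and the ±1.96/√n approximation that underlies the shaded band (the exact band statsmodels draws may differ slightly):

```python
import numpy as np
from statsmodels.tsa.stattools import pacf

# PACF values for lags 0..24, plus a rough 95% significance threshold.
values = pacf(series, nlags=24)
threshold = 1.96 / np.sqrt(len(series))
significant = [lag for lag, v in enumerate(values) if lag > 0 and abs(v) > threshold]
print(significant)  # for data like ours, expect something like [1, 11, 12, 13]
```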
Final Thoughts
Having strong partial autocorrelation in your time series generally leads to more accurate forecasts, especially if you expect the future to be similar to the past. If your data has little or no correlation with historical lags, it will be more difficult to train accurate models to forecast the future. Calculating partial autocorrelation should be one of the first things you do when working with a time series. It's one of the quickest ways to start building intuition about your data.