Exploratory Data Analysis: External Regressors

This post is part of the exploratory data analysis chapter within a larger learning series around time series forecasting fundamentals. Check out the main learning path to see other posts in the series.

The example monthly data used in this series can be found here. You can also find the Python code used in this post here.

May The Outside Forces Be With You

When creating a time series forecast, you really only need three columns: one to distinguish one time series from another, a timestamp (date) column, and your target variable. With these three things you can produce high-quality forecasts. Unfortunately, most time series do not live in a vacuum and are subject to outside forces. These could be macroeconomic factors like inflation or consumer sentiment, the growing market share of your competitors, or even when your product goes on sale. These are all external regressors (xregs): outside data points that can be used to improve the predictive accuracy of our time series forecast.

External regressors come in many shapes and sizes. They are usually chosen based on domain expertise in whatever you’re trying to forecast. Beyond a hunch that they might work, how can we tell if they should be used in our time series forecast? Let’s walk through a few methods to measure the relationship between external regressors and our target variable.

Measuring Linear Relationships

The most straightforward approach is a simple correlation analysis. We take the external regressor time series and calculate its correlation with our target variable time series. This gives us a value between -1 and 1. A strong negative correlation is just as helpful as a strong positive one. Here are a few rules of thumb for what counts as a strong correlation, based on the date grain of your data.

Year/Quarter/Month: Absolute value correlation of 0.5 or higher.
Week/Day: Absolute value correlation of 0.2 or higher.

It’s important to note that these thresholds are arbitrary rather than perfect cutoffs, so adjust them based on your own time series data. Let’s take a time series from our example dataset and calculate the correlation against a few external regressors.

Date	Country	Product	Revenue	xreg1	xreg2
2019-01-01	United States	Cookies	25	12.37	27
2019-02-01	United States	Cookies	10	15.73	2
2019-03-01	United States	Cookies	15	39.68	67
2019-04-01	United States	Cookies	45	60.17	95
2019-05-01	United States	Cookies	55	72.84	5
2019-06-01	United States	Cookies	65	53.31	21
2019-07-01	United States	Cookies	25	9.65	31
2019-08-01	United States	Cookies	15	12.20	3
2019-09-01	United States	Cookies	10	10.04	8
2019-10-01	United States	Cookies	5	19.62	88

The first external regressor (xreg1) seems to have trend and seasonality patterns similar to our revenue target variable. The second external regressor (xreg2), on the other hand, is all over the place. Here is the correlation with each xreg.

Variable 1	Variable 2	Correlation
Revenue	xreg1	0.83
Revenue	xreg2	0.05
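As a rough sketch of how correlations like these could be computed, the snippet below uses pandas on just the ten rows shown earlier. That is an assumption made for illustration: the full example dataset has more history, so the numbers it prints will not exactly match the table.

```python
import pandas as pd

# Only the ten rows shown in the post; the full dataset is longer,
# so these values will differ somewhat from the published table.
df = pd.DataFrame({
    "Date": pd.date_range("2019-01-01", periods=10, freq="MS"),
    "Revenue": [25, 10, 15, 45, 55, 65, 25, 15, 10, 5],
    "xreg1": [12.37, 15.73, 39.68, 60.17, 72.84, 53.31, 9.65, 12.20, 10.04, 19.62],
    "xreg2": [27, 2, 67, 95, 5, 21, 31, 3, 8, 88],
})

# Pearson correlation of each candidate regressor against the target.
correlations = df[["xreg1", "xreg2"]].corrwith(df["Revenue"])
print(correlations.round(2))
```

Even on this small slice, xreg1 clears the monthly rule of thumb while xreg2 does not.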

The table confirms our initial smell test after reading the line chart. Xreg1 has a strong correlation, and xreg2 does not. This tells us that xreg1 could be a good candidate to use as a feature in our ML forecast model. What if we don’t know what xreg1 values are going to be in the future? That’s where calculating correlation on lags becomes valuable. We can calculate lags for 1 month, 2 months, 3 months, and onward to see if lag values of our external regressors can also be used as features.

Variable 1	Variable 2	Correlation
Revenue	xreg1	0.83
Revenue	xreg2	0.05
Revenue	xreg1_lag_1	0.64
Revenue	xreg1_lag_2	0.24
Revenue	xreg1_lag_3	-0.17
Revenue	xreg1_lag_6	-0.07
Revenue	xreg1_lag_9	-0.43
Revenue	xreg1_lag_12	0.65
Revenue	xreg2_lag_1	0.00
Revenue	xreg2_lag_2	-0.11
Revenue	xreg2_lag_3	-0.15
Revenue	xreg2_lag_6	0.17
Revenue	xreg2_lag_9	0.04
Revenue	xreg2_lag_12	0.15

In addition to the original xreg1 (at lag 0), the 1 and 12 month lags also have a high correlation. The high correlation at the 12 month lag suggests there is strong yearly seasonality in both our original time series and xreg1. This will be helpful once we start to train models.
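Lagged correlations can be sketched with pandas `shift`. The helper name `lagged_correlations` and the synthetic data below are assumptions for illustration, not the code behind the table above. The sketch also assumes a single time series already sorted by date.

```python
import numpy as np
import pandas as pd


def lagged_correlations(df, target, xreg, lags):
    """Correlate `target` with `xreg` shifted back by each lag (in rows)."""
    results = {}
    for lag in lags:
        # shift(lag) lines up the regressor value from `lag` periods ago
        # with the current target value; lag 0 is the unshifted series.
        shifted = df[xreg].shift(lag)
        # Series.corr skips rows where either side is missing.
        results[f"{xreg}_lag_{lag}"] = df[target].corr(shifted)
    return pd.Series(results)


# Synthetic data for illustration only: revenue follows the regressor
# with a two-period delay, plus a little noise.
rng = np.random.default_rng(0)
n = 60
xreg = rng.normal(size=n)
revenue = np.roll(xreg, 2) + rng.normal(scale=0.1, size=n)
# Drop the first two rows, where np.roll wrapped values around.
demo = pd.DataFrame({"Revenue": revenue, "xreg1": xreg}).iloc[2:].reset_index(drop=True)

lag_corr = lagged_correlations(demo, "Revenue", "xreg1", lags=[0, 1, 2, 3])
print(lag_corr.round(2))
```

With several series in one table (for example, one per Country and Product), the shift would need to happen within each group, such as `df.groupby(["Country", "Product"])["xreg1"].shift(lag)`, so that values do not leak from one series into the next.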

Measuring Non-Linear Relationships

Traditional correlation is great and all, but it falls short when the relationship between the variables is non-linear. To capture non-linear relationships, we have to bring in some heavier firepower, using either of the approaches below.

Mutual Information (MI)
- Overview:
  - Measures the shared information between two variables (or time series).
  - Captures linear and non-linear dependencies.
  - Based on information theory: higher MI indicates more shared patterns or dependencies.
- Key Points:
  - Values are always non-negative.
  - MI = 0: No dependence (variables are independent).
  - Higher MI: Greater dependency or shared information.
- Interpretation:
  - MI doesn’t distinguish between positive or negative relationships (unlike correlation).
  - It only reflects the strength of dependency, not its direction or type.
Distance Correlation (dCor)
- Overview:
  - Measures the dependence between two variables by analyzing distances between all pairwise points.
  - Captures both linear and non-linear relationships.
  - Unlike MI, dCor is normalized to a specific range.
- Key Points:
  - Values range from 0 to 1.
  - dCor = 0: Variables are independent.
  - dCor = 1: Perfect dependence, reached only when one variable is an exact linear function of the other.
- Interpretation:
  - A value closer to 0: Weak or no dependency.
  - A value closer to 1: Strong dependency.
  - It is symmetric and does not indicate causation or the direction of the relationship.
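To make the distance correlation definition concrete, here is a minimal NumPy sketch of the sample statistic for two one-dimensional series. The function name `distance_correlation` is my own, and the quadratic example is synthetic. For real work a maintained library implementation would likely be a better choice.

```python
import numpy as np


def distance_correlation(x, y):
    """Sample distance correlation between two 1-D arrays of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of the same length")

    # Pairwise absolute distances within each variable.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])

    # Double-center each distance matrix: subtract row and column means,
    # then add back the grand mean.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()

    dcov2_xy = (A * B).mean()  # squared distance covariance
    dvar2_x = (A * A).mean()   # squared distance variance of x
    dvar2_y = (B * B).mean()   # squared distance variance of y

    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0:
        return 0.0  # at least one series is constant
    return float(np.sqrt(max(dcov2_xy, 0.0) / denom))


# Synthetic example: a symmetric quadratic relationship that Pearson misses.
x = np.linspace(-1, 1, 101)
pearson_quadratic = float(np.corrcoef(x, x**2)[0, 1])
dcor_quadratic = distance_correlation(x, x**2)
dcor_linear = distance_correlation(x, 2 * x + 1)
print(round(pearson_quadratic, 3), round(dcor_quadratic, 3), round(dcor_linear, 3))
```

The pairwise distance matrices make this quadratic in memory, which is fine for monthly data but worth keeping in mind for long daily series.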

Distance correlation is easier to interpret and to compare across data sets because it is bounded between 0 and 1, while mutual information can pick up more complex non-linear relationships. Either can be helpful, so it’s dealer’s choice which one you use. Let’s calculate a mutual information score on our external regressors and their lags.

Variable 1	Variable 2	Mutual Information
Revenue	xreg1	0.52
Revenue	xreg1_lag_1	0.18
Revenue	xreg1_lag_2	0.00
Revenue	xreg1_lag_3	0.01
Revenue	xreg1_lag_6	0.00
Revenue	xreg1_lag_9	0.04
Revenue	xreg1_lag_12	0.05
Revenue	xreg2	0.00
Revenue	xreg2_lag_1	0.00
Revenue	xreg2_lag_2	0.00
Revenue	xreg2_lag_3	0.00
Revenue	xreg2_lag_6	0.00
Revenue	xreg2_lag_9	0.00
Revenue	xreg2_lag_12	0.00
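The post does not say which estimator produced the scores above, so the snippet below is only a simplified sketch: a histogram (binned) estimate of mutual information in NumPy. The function name `binned_mutual_information`, the bin count, and the synthetic data are all assumptions. Nearest-neighbor estimators, such as scikit-learn's `mutual_info_regression`, are a common alternative and will give different numbers.

```python
import numpy as np


def binned_mutual_information(x, y, bins=8):
    """Histogram estimate of mutual information (in nats) between x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Joint counts over a bins-by-bins grid, converted to probabilities.
    joint_counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint_counts / joint_counts.sum()

    # Marginal distributions come from summing the joint distribution.
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)

    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over non-empty cells.
    expected = p_x @ p_y
    nonzero = p_xy > 0
    return float(np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / expected[nonzero])))


# Synthetic data for illustration: one dependent pair, one independent pair.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)
y_dependent = x**2 + rng.normal(scale=0.05, size=2000)
y_independent = rng.normal(size=2000)

mi_dependent = binned_mutual_information(x, y_dependent)
mi_independent = binned_mutual_information(x, y_independent)
print(round(mi_dependent, 3), round(mi_independent, 3))
```

Binned estimates are sensitive to the number of bins and are biased upward on short series, so scores from a few dozen monthly points are best read as a ranking rather than as precise values.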

Similar to regular correlation, xreg1 shows a strong relationship with revenue. With MI, though, only lag 0 and lag 1 show a relationship. All other values are very small or zero, meaning there is no strong relationship. This differs from the correlation results, where the 12 month lag also scored highly. Usually it’s a good idea to use a few of these methods together. If external regressors are flagged as important by more than one of these approaches, then it’s reasonable to conclude they have some predictive power for our target variable. So for our data we can keep xreg1 to use in our model and drop xreg2, since it’s mostly noise with no signal.

Dealing With Binary Regressors

There’s one final type of analysis we are missing. Sometimes adding simple binary variables as external regressors can result in major accuracy improvements. These are 1 or 0 values that represent whether a certain event or action happened. For example, you might have new product launches in random months throughout the year. These can be represented with a 1 when there is a product launch and a 0 when there is not. Without this binary data, a model may pick up on trends and patterns that will not repeat in the future. Your business might thrive during the months leading up to the US presidential election, but that only comes around every four years. So giving the model context about those events is important.

An interesting binary regressor in recent years has been a COVID-19 flag. This shows when COVID was impacting the world the most. When looking at the chart of our time series above, we can see a large dip in 2020. So it looks like COVID has a potential effect. Let’s create a COVID flag with a value of 1 for every month in 2020 and a value of 0 for all other months. We will also create a second flag that contains random 1 and 0 values to see if we can determine which regressor is important to our specific time series.
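Building the two flags could look something like the sketch below. The date range, the column names `covid_flag` and `random_flag`, and the random seed are assumptions for illustration. In practice you would use the dates from your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly date range; swap in the dates from your own data.
flags = pd.DataFrame({"Date": pd.date_range("2018-01-01", "2022-12-01", freq="MS")})

# 1 for every month in 2020, 0 for all other months.
flags["covid_flag"] = (flags["Date"].dt.year == 2020).astype(int)

# A coin-flip flag with a fixed seed so the result is reproducible.
rng = np.random.default_rng(42)
flags["random_flag"] = rng.integers(0, 2, size=len(flags))

print(flags[flags["Date"].dt.year == 2020].head())
```

The random flag acts as a baseline: any scoring method that rates it as highly as the real event flag is probably not telling us much.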

We can calculate a mutual information score for each one.

Variable 1	Variable 2	Mutual Information
Revenue	COVID Flag	0.0033
Revenue	Random Flag	0.0000

While the mutual information score for the COVID flag is small, it’s still above zero, unlike the score for the random flag. This suggests that the COVID flag may help improve our model’s predictive performance.

Finally, there are other ways to figure out the importance of external regressors, but most of them involve training actual models. I’ll wait on discussing those techniques until we build up some more foundational knowledge and eventually get to the sections on the models themselves. Stay tuned.

Reversal

When working with external regressors, it’s important to approach correlations and relationships critically. A strong correlation might suggest a connection, but it does not prove causation. This distinction is crucial, as relying on variables without clear logical links to your target can mislead your analysis and undermine your forecast.

Another issue to watch for is multicollinearity. This happens when external regressors are highly correlated with one another. Models can struggle to differentiate the effects of variables that overlap significantly, leading to instability in coefficient estimates and diminished interpretability. To mitigate this, examine the correlations among your regressors before including them. If two variables are redundant, consider removing or transforming one to reduce redundancy.
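One simple way to check for redundant regressors is to scan their pairwise correlations. The sketch below is illustrative only: the helper name `redundant_pairs`, the 0.9 threshold, and the synthetic columns are all assumptions, and the right cutoff depends on your data and model.

```python
import numpy as np
import pandas as pd


def redundant_pairs(regressors, threshold=0.9):
    """Return (col_a, col_b, correlation) for pairs with |r| >= threshold."""
    corr = regressors.corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Synthetic candidates: xreg_b is nearly a rescaled copy of xreg_a.
rng = np.random.default_rng(1)
base = rng.normal(size=48)
candidates = pd.DataFrame({
    "xreg_a": base,
    "xreg_b": base * 2 + rng.normal(scale=0.05, size=48),
    "xreg_c": rng.normal(size=48),  # unrelated to the other two
})

flagged = redundant_pairs(candidates)
print(flagged)
```

Pairwise correlation only catches redundancy between two variables at a time, so a regressor that is a combination of several others can still slip through.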

Ultimately, external regressors should be chosen based on both statistical evidence and domain knowledge. They should have a clear, interpretable relationship with the target variable. Without this foundation, even the most promising-looking variables can introduce unnecessary complexity and noise.

Final Thoughts

Incorporating external regressors into your time series analysis can significantly enhance the accuracy and robustness of your forecasts when done carefully. The methods discussed, including correlation analysis and mutual information, are essential tools for evaluating potential regressors. However, they should always be applied alongside sound reasoning and a thorough understanding of the business.

It’s also worth noting that the relationship between external regressors and your target may not always be straightforward. Non-linear relationships, lags, and interactions with other variables often require additional exploration to capture their full impact. This underscores the importance of iterative experimentation and validation throughout the forecasting process.

As you refine your approach, prioritize quality over quantity. A well-selected set of regressors tailored to the problem at hand will yield better results than an overly broad or poorly vetted selection. The goal is not just to improve the model’s performance but to ensure that the improvements are grounded in meaningful insights.

In time series forecasting, the integration of external regressors bridges the gap between historical patterns and the broader forces shaping the future. By combining statistical rigor with contextual understanding, you can build models that not only predict but also provide actionable insights into the dynamics of your business.