For this blog, I will talk about a feature engineering trick as an alternative to creating dummy variables for months. It’s common practice to create datetime dummy month columns in additive models. For example, I am trying to forecast wildfires in Queensland, Australia.
And in order to capture the month seasonality, I create dummy month features from the datetime index using a loop as such:
When I take these features and forecast using SARIMAX, there might be large step-downs or step-ups from the last day of a given month to the first day of the next month. For example:
This makes a lot of sense because each month's feature has different coefficients. However, maybe creating on and off switches based on month isn’t exactly what I want. Although January 1st is in a new month, it’s not very far from December 31st. I would want the beginning of January(month_1) to take in some elements from December(month_12).
In order to create these features, I used a radial basis function to return me bumps for each month.
However, I had to brute force the solution by creating multiple(n years) arrays of cumulative days of the data frame and minus the cumulative mid-day for that year and returning max values for that array because initially, January didn’t take any elements of December and vice versa.
The month features are now a bit curvier than the previous on and off switches:
When I refit the training set with these features, I get a more generalized and curvy plot as below:
I learned about these ideas from Vincent D. Warmerdam at his PyData talk back in 2018. Many kudos to him! Please check out his talk here:
https://www.youtube.com/watch?v=68ABAU_V8qI&ab_channel=PyData