Greykite documentation states that the seasonality "auto" option is meant to let the template decide, based on input data frequency and the amount of training data, whether to model that seasonality with default Fourier order:
https://linkedin.github.io/greykite/docs/0.1.0/html/pages/model_components/0300_seasonality.html?highlight=seasonality
However, with monthly data, this option always resolves to False, both for QUARTERLY_SEASONALITY and YEARLY_SEASONALITY, even when the amount of training data (num_training_days) is greater than the minimum required (default_min_days). The reason is explained below.
These are the Silverkite default settings for minimum training data requirements, as defined in \greykite\algo\forecast\silverkite\constants\silverkite_seasonality.py:
SilverkiteSeasonality(name='ct1', period=1.0, order=15, seas_names='yearly', default_min_days=548)
SilverkiteSeasonality(name='toq', period=1.0, order=5, seas_names='quarterly', default_min_days=180)
num_training_days is calculated in \greykite\common\time_properties_forecast.py, whereas the actual test is in \greykite\algo\forecast\silverkite\forecast_simple_silverkite.py (here, num_days is the num_training_days calculated above):
num_days >= seas.value.default_min_days
and seas.name in freq_auto_seas_names
The result of the test is always False for monthly data because freq_auto_seas_names is empty, so the condition seas.name in freq_auto_seas_names is never met. The reason can be clearly seen in \greykite\algo\forecast\silverkite\constants\silverkite_time_frequency.py, where, e.g., for weekly data freq_auto_seas_names is the following set:
auto_fourier_seas={SeasonalityEnum.MONTHLY_SEASONALITY.name,
SeasonalityEnum.QUARTERLY_SEASONALITY.name,
SeasonalityEnum.YEARLY_SEASONALITY.name})
whereas for monthly, quarterly and yearly data freq_auto_seas_names = {}, e.g. for monthly data:
auto_fourier_seas={
# QUARTERLY_SEASONALITY and YEARLY_SEASONALITY are excluded from defaults
# It's better to use `C(month)` as a categorical feature indicating the month
})
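The effect of the empty collection can be reproduced with a simplified stand-in for the check. This is only a sketch and not Greykite's code: `Seasonality`, `auto_seasonality_enabled`, `WEEKLY_AUTO` and `MONTHLY_AUTO` are hypothetical names. Only the `default_min_days` values (548 and 180), the shape of the condition, and the contents of the two collections are taken from the snippets quoted above. The value 789.0 is the `num_training_days` from the CV log in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seasonality:
    """Hypothetical stand-in for one seasonality entry."""
    name: str
    default_min_days: int


YEARLY = Seasonality("YEARLY_SEASONALITY", 548)
QUARTERLY = Seasonality("QUARTERLY_SEASONALITY", 180)

# Auto-eligible names per data frequency, mirroring the quoted snippets.
WEEKLY_AUTO = {"MONTHLY_SEASONALITY", "QUARTERLY_SEASONALITY", "YEARLY_SEASONALITY"}
MONTHLY_AUTO = set()  # nothing is auto-eligible for monthly data


def auto_seasonality_enabled(seas, num_days, freq_auto_seas_names):
    """Same shape as the quoted test: enough data AND the name is auto-eligible."""
    return num_days >= seas.default_min_days and seas.name in freq_auto_seas_names


num_training_days = 789.0
for seas in (YEARLY, QUARTERLY):
    print(
        seas.name,
        "weekly:", auto_seasonality_enabled(seas, num_training_days, WEEKLY_AUTO),
        "monthly:", auto_seasonality_enabled(seas, num_training_days, MONTHLY_AUTO),
    )
```

With the same amount of training data, both seasonalities pass for the weekly collection and fail for the monthly one, because membership in an empty collection is always False regardless of `num_days`.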
Therefore, "based on input data frequency" in the first line of this issue really means: if the data frequency is one of MINUTE, HOUR, DAY, WEEK, but not MONTH, QUARTER, YEAR, MULTIYEAR.
The "better" option in \greykite\algo\forecast\silverkite\constants\silverkite_time_frequency.py when using monthly data is thus to add an extra C(month) column as a categorical feature indicating the month.
Question: Why is this a "better" option than the following definition?
auto_fourier_seas={SeasonalityEnum.QUARTERLY_SEASONALITY.name,
SeasonalityEnum.YEARLY_SEASONALITY.name})
I see the following alternatives when dealing with monthly data:
1. Add an extra C(month) column as a categorical feature indicating the month. This has the disadvantage that the extra column should only be added when both the QUARTERLY_SEASONALITY and YEARLY_SEASONALITY options are set to "auto" and not to "True" or "False" (quarterly and/or yearly seasonality terms are added automatically by Greykite when the respective option is set to "True", according to the valid_seas set defined in \greykite\common\enums.py, while the term in question is not added when the option is set to "False").
2. Add the QUARTERLY_SEASONALITY and YEARLY_SEASONALITY terms (currently excluded from defaults) to the empty auto_fourier_seas set; but Greykite developers seem to prefer option 1.
3. Forget about the user setting the seasonality options ("auto", "True", "False") manually. This is applicable to all input data frequencies, not just monthly:
   - Let the user configure the Fourier order and the minimum number of cycles for each seasonality.
   - Set the corresponding seasonality option to either "True" or "False" automatically, according to principles learned from the current logic, i.e., input data frequency, valid_seas and num_training_points >= default_min_points.
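A minimal sketch of the third alternative, assuming the user supplies a Fourier order and a minimum number of cycles per seasonality. Everything here is hypothetical and not part of Greykite: `SeasonalitySpec`, `resolve_seasonality`, `period_in_points` and `min_cycles` are illustrative names, and the orders and cycle counts are made-up example values for monthly data. The rule simply combines membership in valid_seas with num_training_points >= default_min_points.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeasonalitySpec:
    """Hypothetical user configuration for one seasonality."""
    name: str
    fourier_order: int
    period_in_points: int  # observations per cycle at the data frequency
    min_cycles: float      # cycles of training data required to enable it


def resolve_seasonality(specs, valid_seas, num_training_points):
    """Return {name: Fourier order if enabled, else False}."""
    resolved = {}
    for spec in specs:
        default_min_points = spec.min_cycles * spec.period_in_points
        enabled = (
            spec.name in valid_seas
            and num_training_points >= default_min_points
        )
        resolved[spec.name] = spec.fourier_order if enabled else False
    return resolved


# Monthly data: 12 points per year, 3 per quarter (assumed example values).
specs = [
    SeasonalitySpec("YEARLY_SEASONALITY", fourier_order=3,
                    period_in_points=12, min_cycles=1.5),
    SeasonalitySpec("QUARTERLY_SEASONALITY", fourier_order=1,
                    period_in_points=3, min_cycles=2),
]
valid_seas = {"YEARLY_SEASONALITY", "QUARTERLY_SEASONALITY"}

print(resolve_seasonality(specs, valid_seas, num_training_points=26))
```

Expressing the minimum in cycles rather than days keeps the rule independent of the data frequency: the same `min_cycles` value translates to a different number of points for daily, weekly or monthly input.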
One may argue that num_training_points varies between training sets when using CV splits; however, the following example shows that both num_training_points and num_training_days are invariant between splits, even with cv_expanding_window=True:
[CV 1/3] ... valid_seas={'YEARLY_SEASONALITY', 'QUARTERLY_SEASONALITY'})>, 'num_training_points': 26, 'num_training_days': 789.0, 'days_per_observation': 28.0, ...
[CV 2/3] ... valid_seas={'YEARLY_SEASONALITY', 'QUARTERLY_SEASONALITY'})>, 'num_training_points': 26, 'num_training_days': 789.0, 'days_per_observation': 28.0, ...
[CV 3/3] ... valid_seas={'YEARLY_SEASONALITY', 'QUARTERLY_SEASONALITY'})>, 'num_training_points': 26, 'num_training_days': 789.0, 'days_per_observation': 28.0, ...
This means that the test num_training_points >= default_min_points needs to be applied only once, directly from train_end_date, before entering the CV loop (the current "Fitting 3 folds for each of 1 candidates, totalling 3 fits" section apparently tests the seasonality terms at each split, but the test values are invariant, as mentioned above).
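The idea of deciding once before the CV loop can be sketched as follows. This is an illustration under stated assumptions, not Greykite's implementation: `month_starts` and `training_size` are hypothetical helpers, the dates are invented, and the formula for `num_training_days` (span between first and last timestamp plus one observation period) is an assumption chosen because it is consistent with the logged values (26 points, 28.0 days per observation, 789.0 days). The thresholds 548 and 180 are the default_min_days quoted earlier.

```python
from datetime import date


def month_starts(year, month, n):
    """Hypothetical helper: the first day of n consecutive months."""
    out = []
    for i in range(n):
        y, m = divmod((month - 1) + i, 12)
        out.append(date(year + y, m + 1, 1))
    return out


def training_size(timestamps):
    """Assumed formula: points, days per observation, training days."""
    ts = sorted(timestamps)
    gaps = [(b - a).days for a, b in zip(ts, ts[1:])]
    days_per_observation = float(min(gaps))
    num_training_days = (ts[-1] - ts[0]).days + days_per_observation
    return len(ts), days_per_observation, num_training_days


# Invented monthly series: 26 month starts from 2017-01 to 2019-02.
timestamps = month_starts(2017, 1, 26)
points, days_per_obs, training_days = training_size(timestamps)

# Decide once from the full training range, before any CV split is fitted.
default_min_days = {"YEARLY_SEASONALITY": 548, "QUARTERLY_SEASONALITY": 180}
decisions = {
    name: training_days >= min_days
    for name, min_days in default_min_days.items()
}
print(points, days_per_obs, training_days, decisions)
```

Because the inputs to the test do not change between splits, the resulting `decisions` could be passed unchanged to every fold instead of being recomputed.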