I have an imbalanced time series dataset for a regression forecasting problem: forecast one video of 24 hours of data (144 7x7 images) given one video of 24 hours of data (144 7x7 images). As a test, I filtered the training set so that only the samples whose mean target value is less than or equal to the mean of the validation targets were kept. This made the loss (Gradient Difference Loss + MAE) and the metrics (RMSE and MAE) on the training, validation, and test sets much better from the very first epoch; it was a very noticeable change. My understanding is that the training data is more balanced this way, since its distribution becomes more similar to that of the validation data and, consequently, the test data. From what I have read, this is a sampling technique, and I believe it would count as downsampling. I would like to know whether this approach is valid and, if not, what good alternatives would be. This is the Python code I use to downsample:
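import numpy as np
# Threshold: global mean of the validation targets
mean_Y_val = np.mean(Y_val)
# Keep training samples whose per-sample target mean is at or below the threshold
mask_Y_train = np.mean(Y_train, axis=(1, 2, 3)) <= mean_Y_val
Y_train = Y_train[mask_Y_train]
X_train = X_train[mask_Y_train]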
Thank you very much in advance!
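Yes, filtering the training set so that its target distribution lines up with the validation set is a valid approach, and it can improve early performance. Strictly speaking, it is a threshold-based form of downsampling: you keep only the samples whose mean target is at or below the validation mean. Be aware of the trade-offs, though. You lose data, the smaller training set makes overfitting more likely, and you introduce bias: the model never sees above-threshold targets, so it will likely underpredict them if they show up later. Because the threshold is computed from the validation targets, the validation metrics are also no longer a fully independent estimate. Consider complementing your method with alternatives such as data augmentation, weighted loss functions, and anomaly detection techniques. Your Python code does what you describe: it filters the training set by the mean target value of the validation set, which is a straightforward technique for your scenario.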
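To put numbers on the data-loss and bias caveats, a quick report of what the filter keeps and drops can help. The following is only a minimal sketch: `filter_report` is a hypothetical helper name, and the arrays are synthetic stand-ins with an assumed shape of (samples, frames, height, width), not your real data.

```python
import numpy as np

def filter_report(Y_train, Y_val):
    """Summarise what the mean-threshold filter keeps and drops."""
    threshold = np.mean(Y_val)
    # One mean per training sample, taken over every non-batch axis
    sample_means = Y_train.reshape(len(Y_train), -1).mean(axis=1)
    keep = sample_means <= threshold
    return {
        "threshold": float(threshold),
        "kept_fraction": float(keep.mean()),
        "max_kept_sample_mean": float(sample_means[keep].max()) if keep.any() else None,
        "max_dropped_sample_mean": float(sample_means[~keep].max()) if (~keep).any() else None,
    }

# Synthetic stand-in data (assumption): each sample has its own overall level
rng = np.random.default_rng(0)
train_levels = rng.gamma(shape=2.0, scale=2.0, size=200)
val_levels = rng.gamma(shape=2.0, scale=1.0, size=50)
Y_train = train_levels[:, None, None, None] * rng.uniform(0.5, 1.5, size=(200, 144, 7, 7))
Y_val = val_levels[:, None, None, None] * rng.uniform(0.5, 1.5, size=(50, 144, 7, 7))

report = filter_report(Y_train, Y_val)
print(report)
```

If `kept_fraction` is small, or `max_dropped_sample_mean` is far above the threshold, the model is being trained on a narrow slice of the target range, which is the bias mentioned above.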
Hi, Tim. Nice to know that this is a valid approach. Could I also apply the same downsampling to the inputs instead of the targets? That is, instead of aligning my training target data with the validation target data, could I align the training inputs (predictors) with the test inputs, or even with the inputs of both the validation AND test sets? Regarding the problems that may arise with this technique, I will pay attention to them. Regarding the alternative techniques, I will try them, but I have some reservations about data augmentation (that is, creating artificial data): I believe it is quite complex to do in a way that, at the very least, does not degrade the data. As for weighted loss functions, I have tried to use one, but I don't know how I should weight my data. As for anomaly detection, I don't know this technique and have only heard of it, so I will study it. Thanks!
@Tim_Wolfe If you could please take a look at my last message above I would greatly appreciate it.
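On the question of how the data could be weighted: one common option is inverse-frequency weighting, where samples from rarely occurring target ranges get larger weights instead of the common ones being discarded. The sketch below is an illustration under assumptions, not a recommendation tuned to this dataset: `inverse_frequency_weights` and `weighted_mae` are hypothetical helper names, the choice of 10 bins is arbitrary, and the demo arrays are synthetic.

```python
import numpy as np

def inverse_frequency_weights(Y, n_bins=10):
    """Per-sample weights inversely proportional to how common the sample's mean target is."""
    sample_means = Y.reshape(len(Y), -1).mean(axis=1)
    edges = np.histogram_bin_edges(sample_means, bins=n_bins)
    # Using only the interior edges maps every sample to a bin index in 0..n_bins-1
    bin_idx = np.digitize(sample_means, edges[1:-1])
    counts = np.bincount(bin_idx, minlength=n_bins)
    weights = 1.0 / counts[bin_idx]
    return weights / weights.mean()  # normalise so the average weight is 1

def weighted_mae(y_true, y_pred, weights):
    """MAE per sample, then a weighted average across samples."""
    per_sample = np.abs(y_true - y_pred).reshape(len(y_true), -1).mean(axis=1)
    return float(np.sum(weights * per_sample) / np.sum(weights))

# Synthetic demo (assumption): 90 low-valued samples and 10 high-valued ones
levels = np.concatenate([np.full(90, 1.0), np.full(10, 10.0)])
Y_demo = levels[:, None, None, None] * np.ones((100, 4, 7, 7))
pred_demo = np.zeros_like(Y_demo)

w = inverse_frequency_weights(Y_demo)
print(w[0], w[-1])                      # common sample vs rare sample
print(weighted_mae(Y_demo, pred_demo, w))
```

If the model is trained with Keras (an assumption about the setup), weights like these can be passed through the `sample_weight` argument of `model.fit`, so the full training set is kept and the rare target ranges simply count for more in the loss.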