How do you decide when to retrain your production models? Do you just always retrain them on a schedule, whether they need it or not, or do you monitor and evaluate your model’s performance in production?
One common practice I’ve seen is to collect a new (typically small) test set every few weeks (weekly is not uncommon) and use it to monitor whether model performance is dropping consistently over time. It is typical to set a threshold on that drop. Once the threshold is reached, a new training set is compiled and new models are trained. Both the old and new test sets can then be used to compare the new models with the current production model.
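As a rough illustration of the threshold idea, here is a minimal sketch. The function name, the 0.05 drop threshold, and the "three test sets in a row" rule are all assumptions for the example, not values anyone in this thread recommended.

```python
def should_retrain(baseline_accuracy, recent_accuracies,
                   max_drop=0.05, min_consecutive=3):
    """Return True when the last `min_consecutive` periodic test sets all
    score at least `max_drop` below the baseline accuracy.

    Requiring several consecutive drops is one (assumed) way to read
    "dropping consistently" and to ignore a single noisy test set.
    """
    if len(recent_accuracies) < min_consecutive:
        return False  # not enough evidence yet
    tail = recent_accuracies[-min_consecutive:]
    return all(baseline_accuracy - acc >= max_drop for acc in tail)


# Accuracy at deployment time, then accuracy on each new weekly test set.
weekly = [0.89, 0.84, 0.83, 0.82]
print(should_retrain(0.90, weekly))
```

With small test sets the accuracy estimate itself is noisy, so the threshold and the number of consecutive periods probably need tuning per application.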
I’ve also seen teams adopt a fixed retraining schedule, but only when they know from long experience that such drift does happen for their particular application.
There are some fancy methods for monitoring distribution drift, but they can be deceiving, as some drifts won’t actually affect model performance.
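To make that concrete, here is a toy sketch in which one input feature drifts a lot but accuracy does not move, because the model never uses that feature. The population stability index (PSI) is just one possible drift statistic, picked here for illustration; the toy model, the synthetic data, and every name in the code are assumptions.

```python
import math
import random


def psi(expected, actual, bins=10):
    """Population stability index for one numeric feature.
    Bins are equal-width over the range of the `expected` sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def fractions(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def toy_model(x1, x2):
    return 1 if x1 > 0 else 0  # ignores x2 entirely


def make_data(rng, n, x2_mean):
    rows = []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(x2_mean, 1)
        label = 1 if x1 > 0 else 0
        if rng.random() < 0.1:  # 10% label noise
            label = 1 - label
        rows.append((x1, x2, label))
    return rows


def accuracy(rows):
    return sum(toy_model(x1, x2) == y for x1, x2, y in rows) / len(rows)


rng = random.Random(0)
reference = make_data(rng, 2000, x2_mean=0.0)
production = make_data(rng, 2000, x2_mean=2.0)  # only x2 drifts

psi_x2 = psi([r[1] for r in reference], [r[1] for r in production])
acc_ref, acc_prod = accuracy(reference), accuracy(production)
print(f"PSI(x2)={psi_x2:.2f}  accuracy: {acc_ref:.3f} -> {acc_prod:.3f}")
```

A drift alarm on x2 alone would fire here even though the model is doing just as well, which is why pairing drift statistics with a labelled test set tends to be more trustworthy.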
operations has been a big thing for a lot of my clients lately; there are a number of things i think about when deciding on retraining…
how stationary is the input? sometimes it’s clear that a problem is stationary in the input; e.g. a vision model trained on a large, diverse set of natural images from a phone is usually pretty stable, whereas a time series problem often is not. i’ve found that rapid retraining of a time series model on a small recent window can give better results than a larger model trained on a longer window across all data.
how stationary is the output? the big thing i’ve found to consider here is feedback loops. when building a model that has outputs very close to the user experience, e.g. P(click|impression), retraining can be critical to ensure the feedback loop doesn’t push things out of distribution too quickly.
monitoring: in any case, expectations about feedback loops and stationarity are often wrong, so the main thing is to be able to monitor drift. it’s not just about knowing when to retrain but, more importantly in the operational sense, when to occasionally turn off and fall back to whatever graceful degradation plan is in place (you do have a graceful degradation plan, don’t you?)
[related] active learning loops: more and more i’ve seen clients wanting to make use of unlabelled data, a key thing being to direct labelling effort. things like uncertainty and diversity sampling, which are needed for active learning loops, end up being super useful for monitoring too.
challenger/champion: sometimes the question isn’t “when do i retrain?” but “when do i expose users to my latest trained model?” in which case it’s more about retraining as frequently as you can and then focusing instead on how you want to slowly expose your model through techniques like shadow releases. the big question here is whether the added complexity is justified to get empirical info about a new model’s performance.
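a shadow release can be sketched roughly like this: the champion's prediction is what users see, the challenger runs on the same request and only gets logged, and the two are compared once labels arrive. `ShadowRouter` and its methods are hypothetical names for this example, not an API from any particular serving framework.

```python
class ShadowRouter:
    """Serve the champion, silently log the challenger (a sketch)."""

    def __init__(self, champion, challenger):
        self.champion = champion
        self.challenger = challenger
        self.log = []  # one (champion_pred, challenger_pred) pair per request

    def predict(self, features):
        served = self.champion(features)
        try:
            shadow = self.challenger(features)
        except Exception:
            shadow = None  # a failing challenger must never affect users
        self.log.append((served, shadow))
        return served  # users only ever see the champion's output

    def accuracies(self, labels):
        """Compare both models once ground-truth labels are available,
        given in the same order as the logged requests."""
        pairs = list(zip(self.log, labels))
        n = len(pairs)
        champ = sum(s == y for (s, _), y in pairs) / n
        chall = sum(c == y for (_, c), y in pairs) / n
        return champ, chall


# Toy models: the challenger happens to match the true rule (x > 1).
router = ShadowRouter(champion=lambda x: x > 0, challenger=lambda x: x > 1)
inputs = [-1, 0.5, 2, 3]
served = [router.predict(x) for x in inputs]
labels = [x > 1 for x in inputs]
print(router.accuracies(labels))
```

the extra cost is running two models per request plus the logging, which is the complexity trade-off mentioned above.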
so many interesting problems in this space!
Two years ago there was an interesting, seminal paper, “Continual Learning in Practice”, at the 2018 NeurIPS workshop on Continual Learning.
I think that in the meantime, while we are exploring full continual learning systems in research, frameworks like TFX/Kubeflow could try to offer some kind of “zero-touch ML” features.
Recently we approved TFX Periodic Training, and I think we can expand in that neighborhood when we are ready to explore/offer some pluggable automations.
In our team we work with fruit growers so we typically retrain at the end of each season, adding one more season in the training set and shifting the test set forward by one. Before we deploy to production, we have a process to “backfill” the previous season on the beta server, predicting each week with the data available at that point in time. If that’s better than our actual performance, we deploy. We tried different methods of train / test split in order to re-train more frequently, but those always resulted in overfitting models.
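The backfill step described above might look something like the sketch below: replay the previous season week by week, let the candidate see only the data that existed before each week, and deploy only if its error beats what production actually achieved. The data layout, the mean-absolute-error metric, and all function names are assumptions for illustration, not the team's actual pipeline.

```python
from statistics import mean


def backfill_mae(predict, season):
    """Replay one season. `season` is a list of (week, actual) sorted by week.
    `predict(past)` receives only the records from earlier weeks."""
    errors = []
    for i, (_week, actual) in enumerate(season):
        past = season[:i]  # only data available at that point in time
        if not past:
            continue  # nothing to predict from in the first week
        errors.append(abs(predict(past) - actual))
    return mean(errors)


def should_deploy(candidate_mae, production_mae):
    return candidate_mae < production_mae


# Hypothetical predictors: production repeats the last value,
# the candidate extrapolates the last observed change.
def last_value(past):
    return past[-1][1]


def last_trend(past):
    if len(past) < 2:
        return past[-1][1]
    return past[-1][1] + (past[-1][1] - past[-2][1])


season = [(1, 10.0), (2, 12.0), (3, 14.0), (4, 16.0)]
production_mae = backfill_mae(last_value, season)
candidate_mae = backfill_mae(last_trend, season)
print(should_deploy(candidate_mae, production_mae))
```

Slicing `season[:i]` is what keeps the replay honest: the candidate never sees the week it is predicting or anything after it.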