I’ve cherry-picked this as an example because it requires us to maintain almost 3k lines of new code in the repository.
This maintainership overhead is also quite similar to what we have with other custom-kernel PRs.
As Addons is one of the few Ecosystem repositories that supports custom (C++) ops and the related CI infrastructure, it is quite normal that we receive this kind of PR.
But as the code ownership of these components is generally not very stable over time, we would prefer, where possible, not to merge these custom-op PRs, also to achieve broader hardware coverage.
What are the alternatives? How could we collaborate when a compositional implementation has huge performance gaps?
Often these kinds of issues are shared across the “extend” ecosystem, e.g. for EmbeddingBag:
Here is fine, thanks (all necessary tags). I’m pinging a couple of folks who have been looking at interfacing/third-party backends, as I don’t think they’ve seen this yet.
[I’ll speculate based on previous conversations while we wait]
One of the things we have discussed is keeping multiple levels of abstraction around, enabling backends to hook/match at the appropriate level to enable the “mega” op while exposing the decomposed forms where there is no support. It is also true that the compositional representation has been too rigid and hasn’t composed as well (“just rewrite your computation as convolutions if you want performance” being, in effect, the indirect suggestion) and should be revised (which is happening, albeit slowly). These are great examples to highlight: a common problem is that folks find a case where the compositional form does poorly, special-case a transformation, and then move on; without such overarching examples it is easy to miss that the underlying problem isn’t being addressed.
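To make the “decomposed form vs. mega op” idea concrete, here is an illustrative NumPy sketch (not any actual TF/Addons kernel): an embedding-bag is just a gather followed by a segment reduction, and a backend could pattern-match this pair into a fused op where it has one, while others execute the two primitives separately.

```python
import numpy as np

def embedding_bag_decomposed(params, ids, segment_ids, combiner="sum"):
    """Decomposed embedding-bag: a gather followed by a segment reduction.

    Illustrative sketch only; a backend that supports a fused
    embedding-bag op could pattern-match this gather+reduce pair.
    """
    gathered = params[ids]                      # gather: one row per lookup id
    num_segments = int(segment_ids.max()) + 1
    out = np.zeros((num_segments, params.shape[1]), dtype=params.dtype)
    if combiner == "sum":
        np.add.at(out, segment_ids, gathered)   # segment_sum
    elif combiner == "mean":
        np.add.at(out, segment_ids, gathered)
        counts = np.bincount(segment_ids, minlength=num_segments)
        out /= np.maximum(counts, 1)[:, None]
    elif combiner == "max":
        out[:] = -np.inf                        # empty bags stay -inf in this sketch
        np.maximum.at(out, segment_ids, gathered)  # segment_max
    return out

params = np.arange(12, dtype=np.float64).reshape(4, 3)  # 4 embeddings of dim 3
ids = np.array([0, 2, 1, 3])
segment_ids = np.array([0, 0, 1, 1])  # two "bags" of two lookups each
print(embedding_bag_decomposed(params, ids, segment_ids, "sum"))
```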
IMHO this is exactly the point.
And I think that is why some specific reusable components (keras-nlp, keras-cv, tf-addons) that serve e2e models, including our selected models in Model Garden, could be one of the drivers for understanding what we expect from the compiler stack.
Just take a look at our current threshold in TF Addons:
we require more than 50 citations to accept a feature related to a paper, so it is not something totally brand new.
If we need a custom C++ op to reach good-enough performance for a new layer, but then the code owner disappears after one or two months, or people want to use it on Colab / Google Cloud TPU, isn’t it better to take these use cases directly to the compiler stack team, to understand how to handle our end-to-end performance requests and to evaluate an alternative to maintaining a large custom op with only partial hardware coverage?
Not yet (I have a meeting soon that is semi-relevant but higher-level, and a couple next week where I could raise it again). There are a few efforts I’m aware of, but they are at various stages.
I do like driving these with specific components. I would also ideally like the compiler team not to be a bottleneck here, as that doesn’t scale. And I believe separable convolutions have been on your list for a long time.
@markdaoust Could you help us find someone on the TF side who could give us an overview, in this thread, of the custom-ops roadmap with the new compiler infra and TF runtime?
Aside: for embedding-bag, the docs describe this as merging “embedding lookup” and “reduce”. But for the sum and mean combiners, isn’t it sufficient to implement this as a sparse tensor (the ids and weights) times a dense matrix (the embedding vectors)? Doesn’t that cover every case except combiner=max? I think it would be possible to implement an efficient combiner=max if the sparse_segment_* series were complete and included a sparse_segment_max.
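A minimal NumPy sketch of that sparse-times-dense formulation for the sum and mean combiners (the per-bag weight matrix is materialized densely here only for clarity; a real implementation would keep it sparse, e.g. via a sparse-dense matmul):

```python
import numpy as np

def embedding_bag_as_matmul(params, ids, segment_ids, weights=None, combiner="sum"):
    """Sum/mean embedding-bag as a (bags x vocab) weight matrix times the
    dense (vocab x dim) embedding table. Illustrative sketch only: the
    weight matrix is built densely here, but in practice it would be a
    sparse tensor so the product is a sparse-dense matmul.
    """
    n_bags = int(segment_ids.max()) + 1
    vocab = params.shape[0]
    w = np.ones(len(ids), dtype=params.dtype) if weights is None else np.asarray(weights)
    if combiner == "mean":
        counts = np.bincount(segment_ids, minlength=n_bags)
        w = w / counts[segment_ids]            # normalize per-bag weights
    W = np.zeros((n_bags, vocab), dtype=params.dtype)
    np.add.at(W, (segment_ids, ids), w)        # duplicate ids in a bag accumulate
    return W @ params                          # the whole op is one matmul

params = np.arange(12, dtype=np.float64).reshape(4, 3)  # 4 embeddings of dim 3
ids = np.array([0, 2, 1, 3])
segment_ids = np.array([0, 0, 1, 1])  # two bags of two lookups each
print(embedding_bag_as_matmul(params, ids, segment_ids, combiner="sum"))
```

As the comment above notes, this covers sum and mean but not combiner=max, since max is not expressible as a linear map over the lookups.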
Thanks,
Yes, the topic is more generally about what the perspective is when the compositional path doesn’t perform well.
Do we need to interact more closely with the compiler team on the TF side before introducing a custom op (it is often hard to collect feedback)? I think new ops are interesting use cases to stress-test the compositional representation and the compiler stack transformations.
Will we have a new way to use the new compiler and runtime infra to write more portable, high-level custom ops?
If we are in a Python-only ecosystem repo, like keras*, where should we contribute these “missing pieces”?
P.S.
For the embedding-bag case (Addons, PyTorch TPU, JAX) at some point we had a sparse proposal at:
But then the custom op was merged into Addons (+1,100 lines for CPU/CUDA).
It is still not clear how we are going to interface with these compiler technologies/infra when we need to write custom ops, without asking the average contributor to have compiler-developer skills.