I’ve cherry-picked this as an example because it requires us to maintain almost 3k lines of new code in the repository.
This maintainership overhead is also quite similar to what we have with other custom-kernel PRs.
As Addons is one of the few Ecosystem repositories that supports custom (C++) ops and the related CI infrastructure, it is quite normal that we receive this kind of PR.
But as the code ownership of these components is generally not very stable over time, we would prefer, where possible, not to merge these custom-op PRs, also to achieve broader hardware coverage.
What are the alternatives? How could we collaborate when a compositional implementation has huge performance gaps?
Often these kinds of issues are shared across the “extend” ecosystem, e.g. for EmbeddingBag:
Here is fine, thanks (all necessary tags). I’m pinging a couple of folks who have been looking at interfacing/third-party backends, as I don’t think they’ve seen this yet.
[I’ll speculate based on previous conversations while we wait]
One of the things we have discussed is keeping multiple levels of abstraction around, enabling backends to hook/match at the appropriate level to enable the “mega” op while exposing the decomposed forms where there is no support. It is also true that the compositional representation has been too rigid and hasn’t composed as well (“just rewrite your computation as convolutions if you want performance” being, in effect, the indirect suggestion) and should be revised (which is happening, albeit slowly). These are great examples to highlight: a common problem is that folks find a case where the compositional form does poorly, special-case a transformation, and then move on; without such overarching examples it is easy to miss that the underlying problem isn’t being addressed.
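To make the “decomposed form vs. mega op” idea concrete, here is an illustrative NumPy sketch (not any actual TF/Addons kernel): an embedding-bag is just a gather followed by a segment reduction, and a backend could pattern-match this pair into a fused op where it has one, while others execute the two primitives separately.

```python
import numpy as np

def embedding_bag_decomposed(params, ids, segment_ids, combiner="sum"):
    """Decomposed embedding-bag: a gather followed by a segment reduction.

    Illustrative sketch only; a backend that supports a fused
    embedding-bag op could pattern-match this gather+reduce pair.
    """
    gathered = params[ids]                      # gather: one row per lookup id
    num_segments = int(segment_ids.max()) + 1
    out = np.zeros((num_segments, params.shape[1]), dtype=params.dtype)
    if combiner == "sum":
        np.add.at(out, segment_ids, gathered)   # segment_sum
    elif combiner == "mean":
        np.add.at(out, segment_ids, gathered)
        counts = np.bincount(segment_ids, minlength=num_segments)
        out /= np.maximum(counts, 1)[:, None]
    elif combiner == "max":
        out[:] = -np.inf                        # empty bags stay -inf in this sketch
        np.maximum.at(out, segment_ids, gathered)  # segment_max
    return out

params = np.arange(12, dtype=np.float64).reshape(4, 3)  # 4 embeddings of dim 3
ids = np.array([0, 2, 1, 3])
segment_ids = np.array([0, 0, 1, 1])  # two "bags" of two lookups each
print(embedding_bag_decomposed(params, ids, segment_ids, "sum"))
```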
IMHO this is exactly the point.
And I think that is why some specific reusable components (keras-nlp, keras-cv, tf-addons) that serve e2e models, including our selected models in Model Garden, could be one of the drivers for understanding what we expect from the compiler stack.
Just take a look at our current threshold in TF Addons:
we require more than 50 citations to accept a feature related to a paper, so it is not something totally brand new.
If we need a custom C++ op to reach good-enough performance for a new layer, but then the code owner disappears after one or two months, or people want to use it on Colab / Google Cloud TPU, isn’t it better to take these use cases directly to the compiler stack team, to understand how to handle our end-to-end performance requests and to evaluate an alternative to maintaining a large custom op with only partial hardware coverage?
Not yet (I have a meeting soon that is semi-relevant but higher-level, and a couple next week where I could raise it again). There are a few efforts I’m aware of, but they are at various stages.
I do like driving these with specific components. I would also ideally like the compiler team not to be a bottleneck here, as that doesn’t scale. And I believe separable convolutions have been on your list for a long time.
@markdaoust Could you help us find someone on the TF side who could give us an overview, in this thread, of the custom-ops roadmap with the new compiler infra and TF runtime?
Aside: for embedding-bag, the docs describe this as merging “embedding lookup” and “reduce”. But for the sum and mean combiners, isn’t it sufficient to implement this as a sparse tensor (the ids and weights) times a dense matrix (the embedding vectors)? Doesn’t that cover every case except combiner=max? I think it would be possible to implement an efficient combiner=max if the sparse_segment_* series were complete and included a sparse_segment_max.
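A minimal NumPy sketch of that sparse-times-dense formulation for the sum and mean combiners (the per-bag weight matrix is materialized densely here only for clarity; a real implementation would keep it sparse, e.g. via a sparse-dense matmul):

```python
import numpy as np

def embedding_bag_as_matmul(params, ids, segment_ids, weights=None, combiner="sum"):
    """Sum/mean embedding-bag as a (bags x vocab) weight matrix times the
    dense (vocab x dim) embedding table. Illustrative sketch only: the
    weight matrix is built densely here, but in practice it would be a
    sparse tensor so the product is a sparse-dense matmul.
    """
    n_bags = int(segment_ids.max()) + 1
    vocab = params.shape[0]
    w = np.ones(len(ids), dtype=params.dtype) if weights is None else np.asarray(weights)
    if combiner == "mean":
        counts = np.bincount(segment_ids, minlength=n_bags)
        w = w / counts[segment_ids]            # normalize per-bag weights
    W = np.zeros((n_bags, vocab), dtype=params.dtype)
    np.add.at(W, (segment_ids, ids), w)        # duplicate ids in a bag accumulate
    return W @ params                          # the whole op is one matmul

params = np.arange(12, dtype=np.float64).reshape(4, 3)  # 4 embeddings of dim 3
ids = np.array([0, 2, 1, 3])
segment_ids = np.array([0, 0, 1, 1])  # two bags of two lookups each
print(embedding_bag_as_matmul(params, ids, segment_ids, combiner="sum"))
```

As the comment above notes, this covers sum and mean but not combiner=max, since max is not expressible as a linear map over the lookups.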
Thanks,
Yes, the topic is more generally about what the perspective is when the compositional path doesn’t perform well.
Do we need to interact more closely with the compiler team on the TF side before introducing a custom op (it is often hard to collect feedback)? I think new ops are interesting use cases to stress-test the compositional representation and the compiler stack transformations.
Will we have a new way to use the new compiler and runtime infra to write more portable, high-level custom ops?
If we are in a Python-only ecosystem repo, like keras*, where should we contribute these “missing pieces”?
P.S.
For the embedding-bag case (Addons, PyTorch TPU, JAX) at some point we had a sparse proposal at:
But then the custom op was merged into Addons (+1,100 lines for CPU/CUDA).
It is still not clear how we are going to interface with these compiler technologies/infra when we need to write custom ops, without asking the average contributor to have compiler-developer skills.