Hi @markdaoust @Bhack
(I removed all links - this chat keeps saying “Sorry, you can’t post a link to that host”)
I looked over those artifacts Mark mentioned.
> So personally I feel like this already has a lot of moving parts (too many), so I’d vote on the side of cleaning up and expanding the systems we already have instead of rolling new ones.
Yes, I agree - I would not advocate for creating a new system and having it sit alongside the existing infrastructure; that would just create clutter, redundancy, and confusion. My sense from looking at some of the current infrastructure, for example shape_inference.h, is that the basic abstractions are lacking. So I feel the first step, if there were interest, would be to discuss purely at a logical level what abstractions are needed to define the legal inputs of an op.
For instance, tf.gather_nd accepts over 1600 different legal input combinations, just counting int32 inputs:
python -m opschema.cl explain tf.gather_nd -i \
| awk '{ if ($2 == "int32" && $4 == "int32") { print $0 } }' \
| less
These combinations are specified in ops/tf/gather_nd.py
The TensorFlow counterpart seems to be partly expressed in common_shape_fns.cc:GatherNdShape. But it doesn’t seem like the set of primitives defined in shape_inference.h is sufficient to truly define the legal inputs of an op.
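To make “legal inputs” concrete, here is a minimal sketch (my own illustration, written from the documented behaviour of tf.gather_nd with batch_dims=0, not code from opschema or TensorFlow) of the rule that decides whether a params/indices shape pair is legal and what the output shape is:

from typing import Optional, Tuple

def gather_nd_output_shape(params_shape: Tuple[int, ...],
                           indices_shape: Tuple[int, ...]) -> Optional[Tuple[int, ...]]:
    """Return the output shape if the input shapes are legal, else None."""
    if len(params_shape) < 1 or len(indices_shape) < 1:
        return None
    index_depth = indices_shape[-1]        # how many leading dims of params are indexed
    if index_depth > len(params_shape):    # cannot index more dims than params has
        return None
    return indices_shape[:-1] + params_shape[index_depth:]

# params [4, 5, 6] indexed by indices [2, 3, 2] -> output [2, 3, 6]
assert gather_nd_output_shape((4, 5, 6), (2, 3, 2)) == (2, 3, 6)
# illegal: index depth 4 exceeds the rank of params
assert gather_nd_output_shape((4, 5, 6), (7, 4)) is None

The point is that “legal” here means a relation between the two input shapes (and their dtypes), not just a per-tensor rank or dtype check.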
My sense is that the first step, if this were viable, would be to talk to the people who wrote shape_inference.{h,cc} and discuss what motivated its design and what other problems came up along the way.
I would love to hear from other developers who have thought about this problem, as well as your thoughts on the other aspects of my proposal, such as comp_dims and the notion of signatures, layouts, etc.
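To illustrate what I mean by signatures, layouts, and comp_dims, here is a purely hypothetical Python sketch; the class and method names are invented for this post and are not the actual opschema API:

# Toy recorder of schema declarations, for illustration only.
class OpSchema:
    def __init__(self, name):
        self.name = name
        self.index_groups, self.args, self.comp_dims, self.outputs = {}, {}, {}, []

    def index(self, sym, desc):        # a named group of dimensions
        self.index_groups[sym] = desc

    def arg_tensor(self, name, sig):   # an input tensor's signature over index groups
        self.args[name] = sig

    def comp_dim(self, sym, fn):       # a dimension computed from other index groups
        self.comp_dims[sym] = fn

    def return_tensor(self, sig):      # an output layout over index groups
        self.outputs.append(sig)

op = OpSchema('tf.gather_nd')
op.index('b', 'batch dims of indices')
op.index('r', 'read-location dims of params')
op.index('e', 'element dims of params copied through to the output')
op.index('c', 'index depth (last dim of indices)')
op.arg_tensor('params', 'r e')
op.arg_tensor('indices', 'b c')
op.comp_dim('c', lambda r_dims: len(r_dims))   # index depth == rank of the 'r' group
op.return_tensor('b e')

The intent is that each input’s shape is a concatenation of named index groups (its signature), some dimensions are computed from others (comp_dims), and the output is simply another layout over the same groups.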
> The other concern I have about this approach is that there are 3000+ APIs in TensorFlow at the moment, and even if we can find a good sponsor for this, it would be months of work to get the top 10% of the API covered.
When you say ‘APIs’ do you mean APIs for each op, or something else?
For sure it would be a ton of work. However, I spent almost four months full-time on opschema (with many blind alleys), so I’d be eager to put in a significant effort on the TF codebase. Part of my motivation, I’ll admit, is to pad my CV.
Also, my hope is that ultimately it will save a lot of time for TF core developers. Defining an op in the schema automatically gives you a generator for unit tests, so edge cases that would otherwise be discovered late as new issues (as happened with LSTMBlockCell) can be discovered very early, and more thoroughly. As an example, it took about an hour to write the schema for LSTMBlockCell, and I can now use it to quickly discover inputs that crash it.
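As a rough illustration of the kind of generated test I mean (a sketch, reusing the hypothetical gather_nd_output_shape helper from earlier in this thread, not the actual opschema generator): enumerate candidate shapes, predict legality from the schema, then check that the op agrees.

import itertools
import numpy as np
import tensorflow as tf

def check_gather_nd(params_shape, indices_shape):
    predicted = gather_nd_output_shape(params_shape, indices_shape)
    params = np.zeros(params_shape, dtype=np.float32)
    indices = np.zeros(indices_shape, dtype=np.int32)   # zeros keep index values in range
    try:
        actual = tuple(tf.gather_nd(params, indices).shape.as_list())
    except (tf.errors.InvalidArgumentError, ValueError):
        actual = None
    return predicted == actual

dims = (1, 2, 3)
for p_rank, i_rank in itertools.product((1, 2, 3), (1, 2)):
    for p_shape in itertools.product(dims, repeat=p_rank):
        for i_shape in itertools.product(dims, repeat=i_rank):
            if not check_gather_nd(p_shape, i_shape):
                print('disagreement:', p_shape, i_shape)

Each disagreement is either a bug in the schema or something worth filing against TF, and the enumeration can be made as dense or as sparse as you like.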
Best,
Henry