Hi,

I am working on a pluggable device and use `TF_ForwardInputOrAllocateOutput` in some of my kernels, but I don’t see any effect from it in practice.

Here is a simple example:

```
import tensorflow as tf


@tf.function
def run_graph(model, x):
    # Run the Keras model inside a traced graph.
    return model(x)


# Build a small model: two Add + ReLU stages on a [1, 2, 2] input.
shape = [2, 2]
x = tf.keras.Input(shape, batch_size=1)
y = x
y = tf.keras.layers.Add()([y, y])
y = tf.keras.layers.ReLU()(y)
y = tf.keras.layers.Add()([y, y])
y = tf.keras.layers.ReLU()(y)
model = tf.keras.Model(inputs=x, outputs=y)

# Concrete input tensor matching the model's input shape.
x = tf.constant(
    [[
        [1, 2],
        [3, 4],
    ]]
)
y = run_graph(model, x)
print(y)
```

On the plugin side, I log the device memory addresses of each kernel's input and output tensors:

```
Op: Cast ( 1) Output: mem0005 Input(s): mem0004
Op: Mul ( 1) Output: mem0006 Input(s): mem0002 mem0005
Op: Relu ( 1) Output: mem0007 Input(s): mem0006
Op: Mul ( 2) Output: mem0008 Input(s): mem0007 mem0003
Op: Relu ( 2) Output: mem0009 Input(s): mem0008
```

What I would expect is for the output address of each operation (except the first one, so that `x` is not overwritten) to be the same as one of its input addresses, since I use forwarding in each of these operations.
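To make that expectation concrete, here is a small sketch that checks the log above for buffer reuse. It only parses the text I pasted; the `forwarding_report` helper and the `LOG` constant are names I made up for this post, not part of TensorFlow or the plugin API.

```python
import re

# Kernel log copied verbatim from the plugin output above.
LOG = """\
Op: Cast ( 1) Output: mem0005 Input(s): mem0004
Op: Mul ( 1) Output: mem0006 Input(s): mem0002 mem0005
Op: Relu ( 1) Output: mem0007 Input(s): mem0006
Op: Mul ( 2) Output: mem0008 Input(s): mem0007 mem0003
Op: Relu ( 2) Output: mem0009 Input(s): mem0008
"""

# Matches lines of the form "Op: <name> ( <n>) Output: <addr> Input(s): <addr> ..."
LINE_RE = re.compile(
    r"Op:\s*(\w+)\s*\(\s*(\d+)\)\s*Output:\s*(\S+)\s*Input\(s\):\s*(.*)"
)


def forwarding_report(log):
    """Return (op label, reused) pairs; reused is True when the
    output address equals one of the op's input addresses."""
    report = []
    for line in log.strip().splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything that is not a kernel log line
        op, index, output, inputs = match.groups()
        report.append((f"{op}({index})", output in inputs.split()))
    return report


for label, reused in forwarding_report(LOG):
    print(label, "forwarded" if reused else "new allocation")
```

With the log above, every operation is reported as a new allocation, which is exactly the behavior I am asking about.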