Hello,
I am trying to create a dummy pluggable device following the community tutorial.
I have implemented a minimal stream executor (allocation works by calling malloc) and kernels for:
Conv2D
AssignVariableOp
ReadVariableOp
I am now trying to run a simple test script:
import tensorflow as tf
input_shape = (4, 28, 28, 3)
x = tf.random.uniform(input_shape)
y = tf.keras.layers.Conv2D(12, 3, use_bias=False)(x)
I am getting a segfault. This is what the backtrace looks like:
(gdb) bt
#0 memcmp () at ../sysdeps/aarch64/memcmp.S:53
#1 0x0000ffff95e05050 in tensorflow::internal::ValidateDevice(tensorflow::OpKernelContext*, tensorflow::ResourceHandle const&) ()
from /home/ubuntu/python3-venv/tensorflow/lib/libtensorflow_framework.so.2
#2 0x0000ffff95e08a64 in tensorflow::DeleteResource(tensorflow::OpKernelContext*, tensorflow::ResourceHandle const&) ()
from /home/ubuntu/python3-venv/tensorflow/lib/libtensorflow_framework.so.2
#3 0x0000ffff9cc0f640 in tensorflow::DestroyResourceOp::Compute(tensorflow::OpKernelContext*) ()
from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x0000ffff9a4b0768 in tensorflow::PluggableDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5 0x0000ffffa0bb818c in tensorflow::KernelAndDeviceOp::Run(tensorflow::ScopedStepContainer*, tensorflow::EagerKernelArgs const&, std::vector<absl::lts_20210324::variant<tensorflow::Tensor, tensorflow::TensorShape>, std::allocator<absl::lts_20210324::variant<tensorflow::Tensor, tensorflow::TensorShape> > >*, tensorflow::CancellationManager*, absl::lts_20210324::optional<tensorflow::EagerFunctionParams> const&, absl::lts_20210324::optional<tensorflow::ManagedStackTrace> const&, tensorflow::CoordinationServiceAgent*) ()
from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6 0x0000ffff9b497be0 in tensorflow::EagerKernelExecute(tensorflow::EagerContext*, absl::lts_20210324::InlinedVector<tensorflow::TensorHandle*, 4ul, std::allocator<tensorflow::TensorHandle*> > const&, absl::lts_20210324::optional<tensorflow::EagerFunctionParams> const&, std::unique_ptr<tensorflow::KernelAndDevice, tensorflow::core::RefCountDeleter> const&, tensorflow::GraphCollector*, tensorflow::CancellationManager*, absl::lts_20210324::Span<tensorflow::TensorHandle*>, absl::lts_20210324::optional<tensorflow::ManagedStackTrace> const&) () from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7 0x0000ffff9b498d44 in tensorflow::ExecuteNode::Run() () from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8 0x0000ffffa0ff391c in tensorflow::EagerExecutor::SyncExecute(tensorflow::EagerNode*) ()
from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9 0x0000ffff9b4952c0 in tensorflow::(anonymous namespace)::EagerLocalExecute(tensorflow::EagerOperation*, tensorflow::TensorHandle**, int*) ()
from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x0000ffff9b49598c in tensorflow::EagerExecute(tensorflow::EagerOperation*, tensorflow::TensorHandle**, int*) ()
from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x0000ffff9b1eb49c in tensorflow::EagerOperation::Execute(absl::lts_20210324::Span<tensorflow::AbstractTensorHandle*>, int*) ()
from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#12 0x0000ffffa0bc44d8 in tensorflow::CustomDeviceOpHandler::Execute(tensorflow::ImmediateExecutionOperation*, tensorflow::ImmediateExecutionTensorHandle**, int*) ()
from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#13 0x0000ffff9af919bc in TFE_Execute () from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#14 0x0000ffff9aba1cb4 in TFE_Py_FastPathExecute_C(_object*) () from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#15 0x0000ffff93362214 in pybind11::cpp_function::initialize<pybind11_init__pywrap_tfe(pybind11::module_&)::{lambda(pybind11::args)#52}, pybind11::object, pybind11::args, pybind11::name, pybind11::scope, pybind11::sibling>(pybind11_init__pywrap_tfe(pybind11::module_&)::{lambda(pybind11::args)#52}&&, pybind11::object (*)(pybind11::args), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) ()
from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so
#16 0x0000ffff933960bc in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
from /home/ubuntu/python3-venv/lib/python3.8/site-packages/tensorflow/python/_pywrap_tfe.so
For more context, I have debug printing everywhere to see what's going on, and the segfault seems to occur right after a call to the stream executor's deallocate function has completed.
I looked at the source of tensorflow::internal::ValidateDevice, and the memcmp compares ctx->device()->attributes().name() (where ctx is an OpKernelContext*) with p.device() (where p is a ResourceHandle&).
So I assume one of these points to invalid memory, but I don't know which one or where it goes wrong, and I am struggling to find what in my code is causing it.
I am not posting the whole code here because, despite being minimal, it is still fairly long, but I can post more on request.
Thank you in advance for your help.
PS: I particularly struggle to understand how the AssignVariableOp and ReadVariableOp kernels should be implemented, and I believe the problem is coming from there.
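
To separate the variable kernels from Conv2D, a smaller script that only creates, assigns, reads, and releases one variable may help narrow this down. This is a rough sketch, not a confirmed reproduction: the function name exercise_variable_lifecycle is made up for illustration, and I am assuming that releasing the variable in eager mode is what issues the DestroyResourceOp seen in the backtrace (that may differ between TensorFlow versions).

```python
import gc

import numpy as np
import tensorflow as tf


def exercise_variable_lifecycle(device_name):
    """Create, assign, read, and release one resource variable on device_name.

    Returns the sum of the value read back, so the result can be checked.
    """
    shape = (3, 3, 3, 12)  # same shape as the Conv2D kernel in the test script
    # Build the inputs with NumPy so no extra TensorFlow kernels are needed.
    initial = np.zeros(shape, dtype=np.float32)
    updated = np.ones(shape, dtype=np.float32)

    with tf.device(device_name):
        v = tf.Variable(initial)   # creates the handle and assigns the initial value
        v.assign(updated)          # AssignVariableOp
        value = v.value()          # ReadVariableOp
        total = float(value.numpy().sum())

    # Assumption: dropping the last reference is what triggers
    # DestroyResourceOp in eager mode for this TensorFlow version.
    del v
    gc.collect()
    return total


if __name__ == "__main__":
    # Baseline on the CPU; swap in the plugin's own device string afterwards.
    print(exercise_variable_lifecycle("/CPU:0"))
```

Running it with "/CPU:0" first gives a baseline, and then passing the device string registered by the plugin (whatever name it uses) should show whether the crash happens without any Conv2D call involved.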