Kudos to PluggableDevices team; question about AMD GPU

Doug · June 15, 2021, 12:46am

Hi all, I just wanted to show my appreciation for the PluggableDevices implementation. I think it was a good compromise between expanding availability of GPU acceleration and not touching the core CUDA kernels, which would probably need a complete re-development.

In particular, I have been using the MacOS/Metal implementation and liking it very much. One question I have to take this one step further is, what are some guidelines on memory usage the experts could share? For example, in my setting I have an 8GB AMD Radeon pro 5500. When I’m setting buffer size for training my TF models, is there a rule or thumb or any other rough guideline on how I could get the most bang for the buck (in other words, more GPU acceleration for the CPU workload that sending / fetching data to and from the GPU entails).

Many thanks,
Doug

penporn · June 15, 2021, 11:04pm

Hi Doug,

Thank you very much for the kind words! We are excited to hear that you are liking it!

@kulin_seth and his team develop the Metal plug-in. (They have worked really hard for the release – Kudos!) He can help answer your question.

Best,
Penporn

P.S. We have changed the tag of this thread from tfdata to pluggable_device just so it’s easier to look up all PluggableDevice-related posts.

Doug · June 16, 2021, 6:32am

Many thanks @penporn! I’ll be looking forward to @kulin_seth and his team’s inputs.
Happy that you corrected the tag also.

Best,
Doug

Dhruv_Saksena · June 19, 2021, 3:31pm

Hi Doug,

Could you please clarify what you mean by setting buffer size? (maybe only I don’t know)
Do you mean to ask what sized/shape tensors you can create? Or if there are any alignment restrictions? Or something else?

If it is one of my guesses, you should be able to create any size/shaped tensors, individual tensors are backed by MTLBuffers so any size restrictions on MTLBuffers are the only restrictions that apply (which I believe was >= 1GB, which should be usually plenty)
And there are no alignment/shape restrictions with MPS supporting upto 16 dimensions.

I have also informed kulin and others in our team so they will chime in soon.

thanks

Doug · June 21, 2021, 10:45pm

Hi Dhruv,
Many thanks for following up on this. Only reading your question did I note my typo: it wasn’t supposed to be “buffer size” but rather “batch size”. Apologies for the confusion, everyone!

Ultimately, what I am looking to know is, how can I optimise the amount of data flowing into my GPU that maximises GPU usage while not overusing the CPU to transmit the data. What I have noted is that for the same batch size in the same dataset, normal tensorflow in CUDA and tensorflow-metal in my (AMD GPU) Mac lead to different GPU usages. This is expected, of course. But I wanted to know is, whether there is any guideline or rule of thumb that can us users set a batch size that can utilise resources more efficiently.

Please let me know if I can further clarify the question. And thanks again for the follow up, this is much appreciated!

Remy_Wehrung · June 21, 2021, 11:15pm

hello, your post would have been relevant ten / twelve years ago, but like 1 / Apple counts for nothing in the machine learning market
2 / Apple no longer uses AMD architecture, everything has become proprietary
3 / if you want to complain about CUDA: subscribe to the Nvidia forum and report your problems

Doug · June 21, 2021, 11:24pm

Sorry, Remy but I think you misunderstood my post or I expressed myself completely wrong. Kindly do let me clarify below.

I am not saying who counts or does not count on ML. My notebook is a Mac and this is what I prefer to code on.
AMD architecture is the one I have on my notebook. I am well aware of the move to M1.
Where did I complain about CUDA? Frankly I think CUDA is an engineering marvel and I use it a lot in cloud instances.

Doug · June 21, 2021, 11:27pm

Dhruv,
Following up on my earlier post, just a clarification: I know batch size depends on the size of the underlying dataset. I just wondered if the Metal experts had any views (or experience/intuition/rule-of-thumb) on how and whether to adjust batch size of a data coming into a model to make the most of parallelisation at the GPU compared to, say, a CUDA benchmark of similar compute power.

Remy_Wehrung · June 21, 2021, 11:38pm

My apologies, so for AMD particularities, you have to see their side with ROCm (how to complicate when you want simple things)

Doug · June 21, 2021, 11:42pm

Thanks. Actually, ROCm is only for Linux. Macs with AMD also use Metal. Again, I am not discussing the merits of each framework. I just came here to ask for practical insights on how to better use the GPU I am primarily using.

Remy_Wehrung · June 21, 2021, 11:54pm

you tried on the xcode / metal forums? I know silly question. This is why I say that Cupertino is next to the plate compared to our professions, he does not listen to the devs

Doug · June 22, 2021, 12:00am

Thank you for the suggestion. My question is not about Xcode or Metal, it’s about TensorFlow with PluggableDevices running on a Metal backend. Beyond that, I would encourage you to post your views on what the best company or framework is on another topic, so that this exchange remains on topic. Thanks.

Bhack · June 22, 2021, 12:01am

Have you already tried the tips at:

Doug · June 22, 2021, 12:03am

Thanks, Bhack! I wasn’t aware of this page, will definitely try those performance tips out! I appreciate the pointer.

Bhack · June 22, 2021, 12:06am

Check also:

And If available on Metal

Doug · June 22, 2021, 12:07am

Very nice pointers, Bhack! Thanks very much!

Bhack · June 22, 2021, 1:48am

@penporn As ROCm profiler support It is just landing now with:

github.com/tensorflow/tensorflow

[ROCm] Update profiler support for rocm

tensorflow:master ← ROCm:update_upstream_rocm_profiler

opened 10:33PM - 21 Jun 21 UTC

reza-amd

+1570 -494

These changes add the following capabilities to the TF profiler for ROCm * Ma…kes all the profiler views in TensorBoard valid for AMD GPUs * Add host API tracing to the traceviewer * Add some new HIP APIs to track in the profiler * Solve some of the previously known issues: * Assigning correct physical device ID * Fix events being dropped

I suppose that the profiler Is still not part of the puggable device project right?

penporn · June 22, 2021, 3:44am

@Bhack Many thanks for the performance links!

No, it’s not. But there’s an ongoing RFC about this: Pluggable Profiler

penporn · June 22, 2021, 3:46am

Now I know where the tag tfdata came from. Added them back to your post. Sorry!

Subin_1 · June 23, 2021, 3:22pm

Is there a sample app example for pluggable device ?

Topic		Replies	Views
Accelerating TensorFlow using Apple M1 Max? General Discussion mac_os , pluggable_device , help_request , apple_silicon	51	57580	August 16, 2023
CNN net much slower than DNN General Discussion models , keras , gpu , help_request	9	2523	October 30, 2021
Help training MobileNetV3Small model on custom image classification Keras models , help_request	1	1787	December 21, 2021
Gpu utilization is not 100% General Discussion gpu , help_request	13	4299	August 23, 2021
Problems copying data to GPU with Keras and tf.data.Dataset General Discussion nlp , keras , gpu	5	4515	December 13, 2022

Kudos to PluggableDevices team; question about AMD GPU

Related topics