Hi all, I just wanted to show my appreciation for the PluggableDevices implementation. I think it was a good compromise between expanding the availability of GPU acceleration and not touching the core CUDA kernels, which would probably require a complete rewrite.
In particular, I have been using the macOS/Metal implementation and liking it very much. One question I have, to take this one step further: are there any guidelines on memory usage the experts could share? For example, in my setting I have an 8 GB AMD Radeon Pro 5500. When I'm setting the buffer size for training my TF models, is there a rule of thumb or any other rough guideline on how I could get the most bang for the buck (in other words, the most GPU acceleration relative to the CPU workload that sending data to and fetching it from the GPU entails)?
Could you please clarify what you mean by setting buffer size? (Maybe I'm the only one who doesn't know.)
Do you mean to ask what size/shape of tensors you can create? Or whether there are any alignment restrictions? Or something else?
If it is one of my guesses: you should be able to create tensors of any size/shape. Individual tensors are backed by MTLBuffers, so the only size restrictions that apply are those on MTLBuffers (which I believe are >= 1 GB, which should usually be plenty).
And there are no alignment/shape restrictions, with MPS supporting up to 16 dimensions.
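If it helps, here is a quick way to sanity-check that on your own machine (a minimal sketch; the shape and size below are arbitrary examples, not recommendations):

```python
# Minimal sketch: confirm the Metal PluggableDevice is registered and that a
# large, high-rank tensor can be allocated on it. Shapes are arbitrary examples.
import tensorflow as tf

# Should list the Metal-backed GPU registered by the tensorflow-metal plugin.
print(tf.config.list_physical_devices("GPU"))

with tf.device("/GPU:0"):
    # ~1 GB of float32 values (256M elements) in a rank-4 tensor.
    big = tf.random.uniform((64, 64, 256, 256), dtype=tf.float32)
    print(big.shape, big.dtype)
```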
I have also informed kulin and others on our team, so they will chime in soon.
Hi Dhruv,
Many thanks for following up on this. Only on reading your question did I notice my typo: it wasn’t supposed to be “buffer size” but rather “batch size”. Apologies for the confusion, everyone!
Ultimately, what I am looking to find out is how to optimise the amount of data flowing into my GPU so that GPU usage is maximised without overusing the CPU to transmit the data. What I have noted is that, for the same batch size on the same dataset, standard TensorFlow on CUDA and tensorflow-metal on my (AMD GPU) Mac lead to different GPU usage. This is expected, of course. But what I wanted to know is whether there is any guideline or rule of thumb that can help us users set a batch size that utilises resources more efficiently.
Please let me know if I can further clarify the question. And thanks again for the follow-up, this is much appreciated!
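For concreteness, here is a rough sketch of the kind of sweep I have been running (the model, dataset and batch sizes below are just placeholders for my own):

```python
# Time a fixed number of training steps at several batch sizes and compare
# throughput. Model and data are stand-ins; substitute your own pipeline.
import time
import tensorflow as tf

def make_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])

x = tf.random.uniform((60000, 784))
y = tf.random.uniform((60000,), maxval=10, dtype=tf.int32)

for batch_size in (32, 64, 128, 256, 512):
    ds = (tf.data.Dataset.from_tensor_slices((x, y))
          .shuffle(10_000)
          .batch(batch_size)
          .prefetch(tf.data.AUTOTUNE))  # keep the CPU feeding ahead of the GPU
    model = make_model()
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    start = time.time()
    model.fit(ds, epochs=1, steps_per_epoch=100, verbose=0)
    elapsed = time.time() - start
    print(f"batch {batch_size}: {100 * batch_size / elapsed:.0f} examples/sec")
```

The batch size at which examples/sec stops improving is roughly where the GPU stops benefiting from larger batches on my setup, but I don’t know how general that heuristic is, which is why I’m asking.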
Hello, your post would have been relevant ten or twelve years ago, but:
1/ Apple counts for nothing in the machine learning market.
2/ Apple no longer uses AMD architecture; everything has become proprietary.
3/ If you want to complain about CUDA, subscribe to the Nvidia forum and report your problems there.
Dhruv,
Following up on my earlier post, just a clarification: I know batch size depends on the size of the underlying dataset. I just wondered whether the Metal experts had any views (or experience/intuition/rules of thumb) on how, and whether, to adjust the batch size of the data coming into a model to make the most of parallelisation on the GPU compared to, say, a CUDA benchmark of similar compute power.
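One related thing I have been experimenting with (assuming `tf.config.experimental.get_memory_info` is actually supported by the Metal plugin; I am not certain it is) is checking peak GPU memory after a few training steps, so I can push the batch size up towards the card’s 8 GB rather than guessing:

```python
# Query current/peak GPU memory after running a few training steps at the
# chosen batch size. May not be supported by every plugin/TF version.
import tensorflow as tf

try:
    info = tf.config.experimental.get_memory_info("GPU:0")
    print(f"current: {info['current'] / 1e9:.2f} GB, "
          f"peak: {info['peak'] / 1e9:.2f} GB")
except Exception as e:
    print("get_memory_info not supported on this device:", e)
```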
Thanks. Actually, ROCm is only for Linux; Macs with AMD GPUs also use Metal. Again, I am not discussing the merits of each framework. I just came here to ask for practical insights on how to better use the GPU I am primarily using.
Have you tried the Xcode/Metal forums? I know, silly question. This is why I say Cupertino is out of touch with our professions; they don’t listen to the devs.
Thank you for the suggestion. My question is not about Xcode or Metal, it’s about TensorFlow with PluggableDevices running on a Metal backend. Beyond that, I would encourage you to post your views on which company or framework is best in a separate thread, so that this exchange remains on topic. Thanks.