Distributed inference with JAX: GPU/TPU interconnect

Hello, this is a follow-up to my earlier post, "Benchmarks for distributed HMC?".

In particular, we’re interested in acquiring a machine with several GPUs so that we can begin our own benchmarking. Before choosing a machine, we’d like to know whether GPU interconnect topology can significantly reduce communication overhead in distributed HMC. For example, is the jax.lax.psum() backbone of the distributed log density lowered to a reduce/all-reduce collective, and can it therefore exploit direct GPU-to-GPU communication via e.g. NVLink?
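To make the question concrete, here is a minimal sketch of the pattern we have in mind; it is not our actual benchmark code, and the Gaussian target and the names `local_log_density` / `distributed_log_density` are illustrative assumptions:

```python
import jax
import jax.numpy as jnp
from jax import lax

def local_log_density(theta, x_shard):
    # Per-device partial sum of i.i.d. Gaussian log-likelihood terms
    # (a stand-in for whatever the real model computes on its shard).
    return -0.5 * jnp.sum((x_shard - theta) ** 2)

def distributed_log_density(theta, x_shard):
    # lax.psum is semantically an all-reduce: every device receives the
    # global sum, which a gradient-based sampler like HMC needs locally.
    return lax.psum(local_log_density(theta, x_shard), axis_name="devices")

n_dev = jax.local_device_count()
x = jnp.arange(8.0 * n_dev).reshape(n_dev, 8)  # one data shard per device
theta = jnp.zeros(n_dev)                       # scalar parameter, replicated

logp = jax.pmap(distributed_log_density, axis_name="devices")(theta, x)
print(logp)  # the same global log density on every device
```

We could presumably check what the collective lowers to by dumping the compiled computation (e.g. via `jax.pmap(distributed_log_density, axis_name="devices").lower(theta, x).compile().as_text()`) and looking for the reduce op, but whether that op then actually takes the NVLink path on a given topology is exactly the part we’re unsure about.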

The motivation for this type of question: only if this and related technical conditions hold would it make sense for us to invest in something like a toroidal or fully connected GPU arrangement.

Thanks,
Jeremy