Since the start of the new year, I've had 11 conversations with AI startups running large H100 reservations, and they share a common pain point: multitenancy (sharing the GPU reservation among multiple workloads of varying priorities). And multitenancy is fundamentally about cost.
In a typical elastic cloud environment, this is not a problem. Each workload will independently spin up whatever compute resources it needs and shut them down when it's done.
However, high-end GPUs are often purchased via fixed-size reservations. These might be used for a combination of training, inference, and data pipelines.
Here's a typical scenario.
- When my big training job is running, it should get all the GPUs it needs.
- However, when the training job finishes or the researcher goes on vacation, the GPUs are idle.
- I want some kind of background job that can act as a "sponge" and soak up all of the unused compute; otherwise I'm wasting money (see the sketch after this list).
- A good candidate for this background job is data processing (typically batch inference), because there's usually a big backlog of data to work through.
- The data processing workload may also use other cloud instance types outside of the GPU reservation.
- When new training jobs come online, they need to take resources away from the background job.
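To make the sponge idea concrete, here's a minimal sketch of this loop using plain Ray tasks, assuming all of the workloads share one Ray cluster on the reservation. The helpers (`load_backlog`, `training_needs_gpus`, `infer_shard`) are hypothetical stand-ins for however you track the backlog and detect pending training jobs, and the yielding is explicit (the sponge cancels its own tasks) rather than relying on any scheduler-level priority; this is a sketch of the pattern, not how Ray or Anyscale actually implements it.

```python
import time

import ray

ray.init()


@ray.remote(num_gpus=1)
def infer_shard(shard):
    # Placeholder for real batch inference over one shard of the backlog.
    return shard["id"]


def load_backlog():
    # Placeholder: in practice, enumerate the shards of data still waiting to be processed.
    return [{"id": i} for i in range(100)]


def training_needs_gpus() -> bool:
    # Placeholder: in practice, check your job queue for pending training jobs.
    return False


def idle_gpus() -> int:
    # GPUs in the cluster that no task or actor currently holds.
    return int(ray.available_resources().get("GPU", 0))


backlog = load_backlog()
in_flight = {}  # ObjectRef -> shard

while backlog or in_flight:
    if training_needs_gpus():
        # Yield immediately: cancel sponge tasks so the training job can take the
        # GPUs, and put their shards back on the backlog (batch inference is
        # typically idempotent, so occasionally re-processing a shard is fine).
        for ref, shard in in_flight.items():
            ray.cancel(ref, force=True)
            backlog.append(shard)
        in_flight.clear()
    else:
        # Soak up idle capacity: one sponge task per free GPU.
        while backlog and idle_gpus() > 0:
            shard = backlog.pop()
            in_flight[infer_shard.remote(shard)] = shard

    if in_flight:
        # Harvest finished work (with a timeout so we periodically re-check for training jobs).
        done, _ = ray.wait(list(in_flight), num_returns=1, timeout=5.0)
        for ref in done:
            in_flight.pop(ref)
    else:
        # Nothing running and nothing launchable right now; wait and re-check.
        time.sleep(5)
```

Because the sponge only ever launches one task per idle GPU and cancels its work as soon as training shows up, it behaves like a lowest-priority tenant without requiring any changes to the training job itself.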
This flexibility is also one of the reasons companies like OpenAI can offer cheaper batch APIs: batch workloads can be scheduled whenever resources are available, which evens out overall compute utilization.
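For context, this is roughly how that flexibility is exposed to users: with OpenAI's Batch API you upload a file of requests and accept a long completion window in exchange for a lower price. A submission looks something like the following (details of the API may have changed since writing):

```python
from openai import OpenAI

client = OpenAI()

# requests.jsonl holds one request per line, e.g.:
# {"custom_id": "doc-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Summarize ..."}]}}
batch_input = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

# The long completion window is what gives the provider scheduling flexibility:
# the work runs whenever spare capacity is available.
batch = client.batches.create(
    input_file_id=batch_input.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```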
The tools we're building with Ray and our platform at Anyscale are geared toward solving these challenges (and other complexities around managing and scaling compute-intensive AI workloads).
And yes, I generated an image based on this post (can you tell which provider?).