Raise your hand if you are tired of the GenAI marketing and hype 🙋♂️ 🙋♀️

Here are the types of problems I think people on the ground are dealing with, and the kinds of grounded conversations I think most founders and HPC platform leaders are open to having:

1. Large enterprises care about a hybrid GPU architecture. How do on-prem and colo-hosted clusters work with public cloud GPUs? If using AWS, GCP, et al., can I take better advantage of GPU reservations and Spot GPUs? Most cannot justify 100% public cloud over the long term due to price and where certain sensitive data sits.

2. Large cloud users need a better way to deal with GPU hardware failures. Checkpointing is one approach to protecting against GPU failures. Everyone wants an easier way to checkpoint single-node and especially multi-node training jobs. MemVerge is on the cutting edge of this.

3. How do you get just the right amount of GPU infra, no more and no less? Plenty of founders I ask are just using a model hosting provider (OpenAI API, Bedrock, etc.). Others are looking for ways to get 1-2 H100s at a time instead of having to rent an entire 8x node (some vendors offer this granularity while others don't). Still others aren't sure how much infra they need at all and just want to pay for however much infra time the experiment actually used (pay per use).

4. Slurm vs. k8s, or both? There seems to be no straightforward answer to how GPUs are managed and orchestrated. From the app dev side, k8s makes a ton of sense, while on the training side, and for traditional HPC and platform managers, Slurm offers numerous advantages as well.

---

It is hard to imagine how anyone can affordably scale without solving some of these problems. What are you seeing at the ground level?

Like, repost, or leave a comment below! 👇

#AWS #GCP #Azure #AI #ML #k8s #slurm #GenAI
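On point 2, here's a minimal sketch of the checkpoint-and-resume idea for a single-node job, in plain Python with `pickle` standing in for a real framework's state dict. The `train` loop, the simulated failure, and the file layout are all illustrative assumptions, not MemVerge's approach or any specific framework's API:

```python
# Sketch: periodically persist training state so a job can resume after a
# hardware failure instead of restarting from step 0. Illustrative only.
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    # Write to a temp file, then atomically rename, so a crash mid-write
    # never corrupts the last good checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "weights": 0.0}  # no checkpoint yet: fresh start

def train(path, total_steps=10, ckpt_every=3, fail_at=None):
    state = load_checkpoint(path)  # resume from last checkpoint if present
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated GPU failure")
        state["weights"] += 0.1          # stand-in for a real training step
        state["step"] += 1
        if state["step"] % ckpt_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
try:
    train(ckpt, fail_at=7)     # job dies mid-run, last checkpoint is step 6
except RuntimeError:
    pass
state = train(ckpt)            # resumes from step 6, not step 0
print(state["step"])           # 10
```

Multi-node is the hard part the post alludes to: there the state spans many GPUs and hosts, so every rank has to checkpoint a consistent snapshot at the same step, which is exactly why people want tooling for it rather than hand-rolled loops like this one.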