Streamline GPU usage with Advanced Slicing
Credits - Ron Lach, Pexels.com

With AI/ML use cases on the rise, GPUs are in huge demand. Everyone is talking about AI/ML. I recently read a joke that if you did one push-up every time you heard the term AI/ML, you could get in shape without ever going to the gym. Jokes apart, the reality is that everyone is looking at AI/ML, especially GenAI use cases, and many organisations are adding GPU-based servers to their environments. Typically these GPUs are hosted in servers such as Cisco UCS, with hypervisors and virtual machines installed on top of the bare-metal hardware. Different VM workloads have different needs and may not require an entire GPU. That is where GPU slicing comes into the picture: it lets you partition a single physical GPU into multiple virtual GPUs so that several VMs can share it. Let's dig deeper into this concept.

GPU slicing is a technique that allows multiple virtual machines (VMs) or containers to share a single physical GPU (Graphics Processing Unit). In simple words, GPU slicing means dividing the GPU resources into smaller, isolated units that can be allocated to different tasks or users at the same time.

This is very useful in cloud computing environments, where multiple users or applications need to use GPU power but cannot each have a dedicated GPU. It works by using software to create virtual GPUs (vGPUs) from a single physical GPU. Each vGPU acts like a separate GPU but is actually a part of the physical GPU. This is done using special software like NVIDIA vGPU or AMD MxGPU. The software divides the GPU memory and processing power into smaller chunks, and each chunk is assigned to a different VM or container. This way, many users can run their applications on the same GPU without interfering with each other.
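To make the idea of dividing GPU memory into chunks concrete, here is a minimal Python sketch. It is only a toy model, not how NVIDIA vGPU or AMD MxGPU is implemented: the `VirtualGPU` class, the `slice_gpu` function, and the 48 GB / 12 GB figures are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualGPU:
    """One slice of a physical GPU, assigned to a single VM or container."""
    owner: str
    memory_gb: int


def slice_gpu(total_memory_gb: int, slice_gb: int, owners: list[str]) -> list[VirtualGPU]:
    """Carve a physical GPU's memory into equal slices, one per owner."""
    if slice_gb <= 0:
        raise ValueError("slice size must be positive")
    capacity = total_memory_gb // slice_gb  # how many whole slices fit
    if len(owners) > capacity:
        raise ValueError(
            f"only {capacity} slices of {slice_gb} GB fit, got {len(owners)} owners"
        )
    return [VirtualGPU(owner, slice_gb) for owner in owners]


if __name__ == "__main__":
    # Hypothetical 48 GB card shared by three VMs, 12 GB each.
    for vgpu in slice_gpu(48, 12, ["vm-a", "vm-b", "vm-c"]):
        print(vgpu)
```

The point of the sketch is the capacity check: once the slice size is fixed, the number of VMs that can share the card is bounded by how many whole slices fit in its memory.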

Let's look at the GPU slicing modes offered by NVIDIA

NVIDIA offers several specific GPU slicing modes designed to optimize resource allocation and performance for various use cases. These modes are mainly provided through NVIDIA's vGPU (virtual GPU) technology.

  1. Time-Sliced (vGPU) Mode: In time-sliced mode, the physical GPU is divided into multiple vGPUs, each allocated to a different virtual machine (VM). The GPU's resources are time-shared among the VMs, meaning each vGPU gets exclusive access to the GPU for a short period before the scheduler switches to the next vGPU. This mode is beneficial for applications that require high interactivity and responsiveness, such as virtual desktops (VDI) and real-time 3D applications.
  2. Multi-Instance GPU (MIG): MIG mode, introduced with NVIDIA's A100 GPUs, allows the physical GPU to be divided into multiple instances, each acting as a separate, fully isolated GPU. Each instance has its own dedicated memory and compute resources, providing consistent performance without interference from other instances. This mode is ideal for workloads that require deterministic performance, such as AI/ML inference and high-performance computing (HPC).
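The difference between the two modes can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not NVIDIA's scheduler: `time_slice_schedule` and `mig_fits` are hypothetical helper names, the round-robin policy is just one possible time-slicing policy, and the default of 7 slices mirrors the A100's limit of seven MIG instances while ignoring the real placement rules.

```python
from itertools import cycle, islice


def time_slice_schedule(vms: list[str], num_slots: int) -> list[str]:
    """Time-slicing: each VM owns the whole GPU for one slot, then yields.

    Returns which VM runs in each of the next `num_slots` time slots,
    using a plain round-robin order.
    """
    return list(islice(cycle(vms), num_slots))


def mig_fits(requested_slices: list[int], total_slices: int = 7) -> bool:
    """MIG-style partitioning: every instance holds dedicated slices.

    A set of instances fits only if each request is positive and the
    requests together do not exceed the GPU's total slice count.
    """
    return all(s > 0 for s in requested_slices) and sum(requested_slices) <= total_slices


if __name__ == "__main__":
    # Three VMs taking turns on one GPU over five time slots.
    print(time_slice_schedule(["vm1", "vm2", "vm3"], 5))
    # Three instances of different sizes carved out of a 7-slice GPU.
    print(mig_fits([3, 2, 1]))
```

In the time-sliced model every VM eventually sees the full GPU but has to wait its turn; in the MIG-style model each instance gets a smaller share that nobody else can touch, which is where the consistent performance comes from.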


If you are looking to use vGPU for your workloads, make sure that:

  1. GPU model must support vGPU technology
  2. Hypervisor must support vGPU

You will then need to install the vGPU software from the GPU vendor on the hypervisor host, and the vGPU drivers on each guest VM.
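The prerequisites and install steps above can be captured as a simple checklist. The Python sketch below is purely illustrative: `VgpuReadiness` and `missing_steps` are hypothetical names, and the flags are filled in by hand rather than detected from any real hypervisor or driver API.

```python
from dataclasses import dataclass


@dataclass
class VgpuReadiness:
    """Hand-filled answers to the vGPU prerequisites for one host and guest."""
    gpu_supports_vgpu: bool
    hypervisor_supports_vgpu: bool
    host_software_installed: bool
    guest_driver_installed: bool


def missing_steps(state: VgpuReadiness) -> list[str]:
    """Return the prerequisites that are still outstanding, in order."""
    checks = [
        (state.gpu_supports_vgpu, "use a GPU model that supports vGPU"),
        (state.hypervisor_supports_vgpu, "use a hypervisor that supports vGPU"),
        (state.host_software_installed, "install the vendor vGPU software on the host"),
        (state.guest_driver_installed, "install the vGPU driver in the guest VM"),
    ]
    return [message for ok, message in checks if not ok]


if __name__ == "__main__":
    # Example: hardware and hypervisor are fine, software is not installed yet.
    state = VgpuReadiness(True, True, False, False)
    for step in missing_steps(state):
        print("TODO:", step)
```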

In summary, GPUs are pretty expensive resources and you will want to utilise them as efficiently as possible. Plan the vGPU method based on your use case and needs. You can always optimise the vGPU assignment using the vGPU profiles.
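As a rough illustration of matching a profile to a workload, the Python sketch below picks the smallest profile that still covers the workload's memory need. The function name `pick_profile` and the profile sizes in the example are hypothetical; the real list of profiles depends on the GPU model and comes from the vendor's documentation.

```python
def pick_profile(required_gb: float, profile_sizes_gb: list[int]) -> int:
    """Return the smallest profile size (in GB) that fits the workload."""
    candidates = [size for size in profile_sizes_gb if size >= required_gb]
    if not candidates:
        raise ValueError(f"no profile offers {required_gb} GB or more")
    return min(candidates)


if __name__ == "__main__":
    # Hypothetical profile sizes for one card; a 6 GB workload gets the 8 GB profile.
    print(pick_profile(6, [4, 8, 16, 24]))
```

Choosing the smallest profile that fits leaves more of the card free for other VMs, which is the whole purpose of slicing an expensive GPU.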

