✨What’s the top-speed “holy grail” in a tiered storage hierarchy? Tier 0! Host David Nicholson is joined by Molly Presley, Hammerspace's Senior Vice President of Global Marketing, to discuss how Hammerspace is unlocking the power of unused NVMe storage in existing GPU servers, creating a massive, high-performance data pool for AI workloads. Their discussion covers:
- The concept and technology behind The New Tier 0
- Hammerspace's approach to addressing modern data challenges
- Insights into the future roadmap for Hammerspace and its technologies
- How The New Tier 0 translates to faster checkpoints, increased GPU utilization, and significant cost savings
- Real-world applications and benefits for organizations adopting The New Tier 0

#Hammerspace #TheNewTier0 #tech #TheSixFiveOnTheRoad #DavidNicholson #MollyPresley #SC24
Transcript
Welcome to the Supercomputing Conference in Atlanta, SC24. I'm Dave Nicholson with Six Five On The Road, and I'm here at Hammerspace with the one, the only, Molly Presley from Hammerspace. Good to see you, Molly.

Awesome to see you too.

I want to hear what's the latest. I hear something about Tier 0 going on at Hammerspace. Explain to me what that's all about.

Yeah. So we just announced last week a new tier of storage that has never existed anywhere before, an industry first that we're very excited about. Our marketing campaign around it is: you may be sitting on a gold mine and you don't even know it. What we mean by that is all these DGX servers from NVIDIA, the big compute nodes that are shipping, the vast majority of them have some solid-state storage in them, and that solid-state storage largely goes unused for big HPC workflows and AI training because it's not really designed to be used as shared storage. So customers like Los Alamos and the big hyperscalers that are building these big AI environments have all this storage that is now available to them because of the Tier 0 capabilities Hammerspace has released.

I mean, what kind of quantities are we talking about here? Is it a significant amount of storage?

You can have 20 to 30 petabytes at the larger scale.

Yeah, like, a lot.

A lot. So you figure what you would spend on 30 petabytes of fast Tier 1 storage, you know, Lustre or whatever kind of storage, and you may already own it and be able to just turn it on with a Hammerspace license.

What are you going to store in that Tier 0 space? It's super high performance, but what are you going to use it for?

Yeah, so it is super high performance, and the reason people haven't used it so far is because they couldn't make it into a shared storage environment. If they have 100 nodes with 100 different data silos, how do you train an LLM with that? Or how do your data engineers get access to it and know what it is? So we had to break down those silos and unify them into a shared storage environment, take advantage of it being that fast, and also protect it, because if it's not protected, people aren't going to put all this data that they're spending all this money computing onto drives that might fail. And compute nodes do fail; they run hot, it happens. So we needed to protect it. What they're using it for is checkpoints, to make checkpointing much faster, and they're also using it to write the actual data and then orchestrate it off to their Tier 1 or external environments as it starts to fill up. So it's the primary landing place for data and checkpoints.

OK, so when you talk about checkpoints, speed is really important when you're recovering. Is that what we're saying?

Yes, but it's more about not having the GPUs sitting idle during the checkpoint process. Think of a typical compute environment: maybe they checkpoint once an hour, and that checkpoint takes 5 to 10 minutes. Those GPUs are idle that entire time. So you figure 5 or 10 minutes out of every sixty, and you're talking 8, 10, 12 percent more GPU time. All those GPUs can be used more effectively and run much faster, or more consistently. And then of course if you have a recovery, that recovery is fast. But first and foremost, most of these environments are thinking about how do I make my GPUs as productive as possible, and then how do I protect and use that data?
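To put rough numbers on that checkpoint overhead, here is a quick back-of-the-envelope sketch in Python. The once-an-hour cadence and the 5-to-10-minute checkpoint durations are the figures from the conversation; the 1,000-GPU cluster size is an illustrative assumption, not something quoted in the interview.

```python
# Back-of-the-envelope: GPU time lost to synchronous checkpointing, using the
# once-an-hour cadence and 5-10 minute durations quoted above.

def idle_fraction(checkpoint_minutes: float, interval_minutes: float = 60.0) -> float:
    """Fraction of wall-clock time GPUs sit idle while a checkpoint is written."""
    return checkpoint_minutes / interval_minutes

for ckpt_min in (5, 10):
    print(f"{ckpt_min} min checkpoint every hour -> {idle_fraction(ckpt_min):.1%} of GPU time idle")

gpus = 1_000  # hypothetical cluster size, for scale only
for ckpt_min in (5, 10):
    wasted = gpus * idle_fraction(ckpt_min) * 24
    print(f"{ckpt_min} min checkpoints on {gpus} GPUs -> ~{wasted:,.0f} GPU-hours idle per day")
```

The point is simply that a faster checkpoint target (such as local NVMe) shrinks the numerator, which is where the "more GPU time" claim in the conversation comes from.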
So you're recovering stranded resources, which sounds like 100% goodness. Is there a potential downside to that? Should people be concerned about churning away GPU cycles on using this storage?

Oh no, definitely not. We've talked to a lot of customers, the ones designing the really well-known, large-scale AI environments, about why they're not using this today, and it really came down to two things: I can't have silos of data, and I can't have unreliable disks. When you think about writing your primary data set to storage, they want it protected with erasure coding or mirroring or something, and those SSDs without Hammerspace Tier 0 aren't protected. They're just a bunch of JBODs, so they can't risk losing the data. Those are the two reasons they haven't done it. But the downside? There really isn't any. They're already configured. All that time that's been spent setting up those GPU environments, configuring them, getting them up and running: it's already installed and configured, the burn-in is done, and the infant mortality has fallen out. They're all sitting there available, already powered on, already on the network. So there's really not any downside to speak of. It's not taking away from the GPUs' memory or anything like that.

So under the heading of overall data orchestration: are you setting up a single large pool with these devices? Do you have options for how that's provisioned? What does that look like in that example of, you know, a thousand of these things aggregated together with Hammerspace?

Yeah. So essentially Hammerspace creates a parallel file system, but a parallel global file system. Each of these nodes is just a member of that global file system. So as data is created on, let's say, 100 DGX nodes, all of that data is instantly part of the file system, because the metadata is assimilated into the file system. Those nodes are just data sources that are aggregated into one file system and one data set. So instantly the application, the user, or the model, whatever it is that's looking at the data, can see all of the data across those nodes as one single data set. And then if the data needs to be moved, because the SSDs are getting full or because you want to move it up into the cloud for some other processing or other models, the Hammerspace data orchestration policies take over and say: OK, I've met this objective, my SSDs are full, or this data set is designated to move to the cloud. It automatically starts that move, but the applications and users are just looking at the file system and don't know the data is moving around. They can see the data and know what's there, no matter which node it was created on or where it might be getting moved to.
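To make that objective-driven behavior concrete, here is a minimal sketch in Python. It is not Hammerspace's actual policy language or API; the FileRecord fields, the 80% watermark, the tier names, and the example paths are all invented for illustration. What it shows is the idea described above: objectives are evaluated against the catalog, data is moved between tiers in the background, and the logical paths that applications use never change.

```python
# Toy sketch of objective-driven data orchestration (illustrative only;
# not Hammerspace's real policy syntax or API).
from dataclasses import dataclass

@dataclass
class FileRecord:
    logical_path: str        # what applications and users see; never changes
    tier: str                # e.g. "tier0-nvme", "tier1", "cloud"
    size_gb: float
    tagged_for_cloud: bool = False

TIER0_CAPACITY_GB = 240_000  # hypothetical node pool: 8 x 30 TB NVMe
TIER0_HIGH_WATERMARK = 0.80  # hypothetical objective: keep Tier 0 under 80% full

def evaluate_objectives(catalog):
    """Return (file, destination_tier) moves needed to satisfy the objectives."""
    moves = []
    tier0_used = sum(f.size_gb for f in catalog if f.tier == "tier0-nvme")
    for f in catalog:
        if f.tagged_for_cloud and f.tier != "cloud":
            moves.append((f, "cloud"))
        elif f.tier == "tier0-nvme" and tier0_used > TIER0_CAPACITY_GB * TIER0_HIGH_WATERMARK:
            moves.append((f, "tier1"))   # a real system would drain the coldest data first
            tier0_used -= f.size_gb
    return moves

def apply_moves(moves):
    for f, dest in moves:
        # Only the physical placement changes; f.logical_path stays the same,
        # so applications keep using the same namespace throughout.
        f.tier = dest

catalog = [
    FileRecord("/ai/checkpoints/step-1000.pt", "tier0-nvme", 1_500.0),
    FileRecord("/ai/datasets/raw/video-a.tar", "tier0-nvme", 220_000.0, tagged_for_cloud=True),
]
apply_moves(evaluate_objectives(catalog))
print([(f.logical_path, f.tier) for f in catalog])
```

The design point the sketch is trying to capture is that placement is a property the system manages against declared objectives, not something the application ever addresses directly.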
You have this spooky thing, Molly, where you read my mind, because I was just going to ask about this. Whenever you talk about tiering, people immediately think about moving between tiers and when that might happen. You answered the question about moving to cloud. Is it still relevant today, this idea of moving hot data to higher-performance media and cold data to lower-performance media? Is that still something people do, or would this Tier 0 layer tend to have certain things pinned in it, for lack of a better term?

That's a really good question. I came from the days, I used to work for a tape library company, so I know a lot about archive, and data was essentially created and then moved into the archive as it aged. That's not really how data is being used, especially in these AI environments where you're constantly searching for which data might have interesting information. If I correlate this data with that data, what might I find? You never really know which data is going to be relevant to an AI job, so just tiering it based on age doesn't really work. How these environments are working with the more modern data systems is that all of the data is in a single metadata environment, no matter where it sits. You're isolating the metadata so the scientists and the models can access and see all the data, and whether it's sitting on object or tape or, you know, Tier 0, it doesn't really matter, because they can see all the data. And then when the time comes and they say, OK, I'm going to run a job, that data can be moved automatically to faster storage, or into proximity to an application in the cloud, whatever that might be, but without disrupting the namespace. Everyone can still see the data even as it's moving around. So yeah, some people may pin data to a specific GPU. They may let all of the data sit in, maybe, Tier 1 or Tier 2 after it's created and use Tier 0 when it's being processed. There are a lot of different ways a customer could use the environment, depending on whether they're running active data sets or they're really trying to figure out what data they have and then pin the data that's relevant to a project. But that's the cool thing about this: it's all automated and done by software. You can set objectives and have the data behave automatically the way you want. It's not a bunch of IT guys with tickets open trying to copy data. This is automated data orchestration.

So I've seen recently, from some component vendors, talk of NVMe devices at 60, 80, 100 terabytes. Huge devices.

Mm-hmm.

But back to the top of what we were talking about: what is kind of the average size of those stranded NVMe devices that you're seeing?

Yeah, usually right now we're seeing about 30 terabytes per drive, and there are eight of them, so 240 terabytes is kind of a normal deployment today. And then you figure that across hundreds or thousands of GPU nodes.

That's completely crazy.

It is crazy. Even think of just 10 GPU nodes, which is a traditional enterprise environment, and you're still getting into petabytes of data.

So the concept of recovering stranded stuff in the past might have looked like: you have a 9-gig boot drive, here's a gig to use. But at the performance levels and capacities you're talking about, these become massively meaningful resources. When people are spending a lot of money on GPUs, you're essentially, if they leverage this, helping them fund their AI cluster in a way.

No, that's absolutely true. One of our Tier 0 customers who's in production today had exactly that issue. They had already funded their AI cluster, but they were out of power, so they had the problem of: we're out of capacity for storage, we're out of power for our data center, but we need to keep doing AI. What do we do? Well, turn on Hammerspace, and now all of a sudden you have 30 more petabytes of storage with no more power, no external networking, and all the infrastructure you need. And so this just freed up the ability to keep doing more work without having to deploy storage, which they didn't have power for.
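The capacity math here is simple enough to write down. The 30 TB drives and eight drives per node (240 TB per node) are the figures Molly quotes; the node counts in the sketch below are illustrative round numbers, with 125 nodes included only because that is roughly where the 30 PB mentioned in the customer story would land.

```python
# Back-of-the-envelope pooled capacity of "stranded" local NVMe, using the
# per-node figures from the conversation: 8 drives x 30 TB = 240 TB per node.
TB_PER_DRIVE = 30
DRIVES_PER_NODE = 8
TB_PER_NODE = TB_PER_DRIVE * DRIVES_PER_NODE  # 240 TB

# Illustrative cluster sizes (not figures quoted in the interview); 125 nodes
# is roughly where the 30 PB customer example would land.
for nodes in (10, 100, 125, 1000):
    total_pb = nodes * TB_PER_NODE / 1000
    print(f"{nodes:>5} nodes x {TB_PER_NODE} TB/node = {total_pb:,.1f} PB of local NVMe")
```

Even the 10-node enterprise case works out to a couple of petabytes, which matches the "you're still getting into petabytes of data" remark above.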
So it's kind of interesting that it's not just money. Maybe you're out of power, maybe you need cash, and maybe it's just that my GPUs aren't as efficient as they should be, and another 15% on my GPU cycles would keep me from having to buy more GPUs. So I come from, you know this, I come from an old knuckle-dragging storage background.

Didn't we all do that?

And often platform upgrades were a huge issue, especially when you talk about the quantity of data we're migrating now. We're hearing about very large data centers accelerating their refresh rates to get to latest-generation CPUs, to free power up so they can buy more GPUs. Because, to your point, power is finite. What do you do, right? The nuclear power plant next door, if you can.

No, that's exactly what's happening. It's true.

It's true. And a lot of people are going to pretend like they were never against nuclear power as we move forward; it's going to be priceless to watch. But just remind people: when you're using Hammerspace for, you know, a global file system, and you're doing things in parallel, is it fair to say that retiring a node that happens to include captive storage becomes a lot more trivial?

It's a lot easier to do. When you retire a storage system or a node from the Hammerspace platform, it's completely transparent to the applications and users. So that time you may remember, when you had to do it on Saturday night and notify everyone of application downtime and plan for it, and everyone was mad at you because they couldn't get to their system, that's completely gone. The applications and the users are interacting with the file system, and if a storage system goes offline they don't see it, because they're working with the metadata. So if you decide as an IT team, I'm going to end-of-life this system, you can copy that data as you wish over time, and the users and applications see the data while it's in flight, so there is no downtime. Even if it takes a month for the data to copy across a really slow network, it doesn't matter, because the business isn't interrupted. So it really does decouple the application and user experience from the IT planning model. If you're going to the cloud or you're coming off the cloud, you don't have to repoint applications and users to the new instance in the data center or in the cloud. You just keep them on Hammerspace and they keep running while IT moves things around.

Yeah. Twenty years ago we had a name for that. It was called a fantasy.

Magic.

Like magic. Exactly.
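One way to picture that decoupling is a metadata layer that maps a stable logical path to whatever physical copies currently exist, so a node can be drained and retired without applications ever changing the path they open. The sketch below is purely conceptual Python; the class, method names, and node identifiers are invented for illustration and are not how Hammerspace is implemented.

```python
# Conceptual sketch of namespace indirection during a node retirement
# (illustrative only; not Hammerspace's implementation).

class Namespace:
    def __init__(self):
        # logical path -> list of physical locations holding a current copy
        self.placement = {}

    def add_copy(self, logical_path, location):
        self.placement.setdefault(logical_path, []).append(location)

    def read(self, logical_path):
        """Applications always ask by logical path; any live copy will do."""
        return self.placement[logical_path][0]

    def drain_node(self, node, target):
        """Copy data off a retiring node, then drop it from the mapping.
        Readers keep using the same logical paths the whole time."""
        for path, locations in self.placement.items():
            if node in locations:
                self.add_copy(path, target)   # data copied in the background
                locations.remove(node)        # node can now be powered off

ns = Namespace()
ns.add_copy("/projects/llm/ckpt-0001", "dgx-017:nvme0")
ns.drain_node("dgx-017:nvme0", "tier1-array-03")
print(ns.read("/projects/llm/ckpt-0001"))     # same path for the application
```

The same indirection is what makes the "month-long copy over a slow network" case above a non-event for the business: only the mapping changes, never the name.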
We're here at SC24. This conference has grown over the decades to be absolutely amazing, with AI of course front and center. It's great to see how big it is. You talk about high-performance computing, supercomputing, the requirements of AI, and freeing up this new resource that is Tier 0 as a pool. That's a pretty big headline. Do you dare throw shade on your friends at Hammerspace and mention anything else new that you're doing, or do we want to just stay firm on Tier 0?

Tier 0 is the big news for sure, because it's the first time anyone in the industry has been able to tackle this, and it's so needed. But we absolutely have new advancements in our object storage capabilities, and new advancements in the speed at which our metadata transacts, because you always have to be increasing the speed of your system. Being able to really interlock, in AI workloads, the S3 and object data that's coming in, process it on the GPUs, and have that be really transparent, so you don't have a separate object workflow and file workflow, is important. It all ties together: you want your GPUs to be efficient, you want them to access as much data as possible, and whether that data has been ingested over S3 or over file, it should be transparent. So those are the other kinds of advancements we have, but none of them are as industry-changing as the Tier 0 announcement.
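As a rough illustration of what "transparent across S3 and file" means in practice, the toy sketch below keeps a single catalog entry per logical object and lets it be resolved either by a POSIX-style path or by a bucket/key pair. The catalog structure, names, and functions are all hypothetical; this only shows the general idea of one data set behind two access protocols, not Hammerspace's interface.

```python
# Toy illustration of one logical data set reachable both as a file path and as
# an S3-style bucket/key (conceptual only; not Hammerspace's interface).

CATALOG = {
    "obj-42": {
        "file_path": "/datasets/train/shard-0001.parquet",
        "s3_key": ("training-data", "shard-0001.parquet"),
    },
}

def resolve_by_path(path):
    return next(oid for oid, names in CATALOG.items() if names["file_path"] == path)

def resolve_by_s3(bucket, key):
    return next(oid for oid, names in CATALOG.items() if names["s3_key"] == (bucket, key))

# Whichever protocol ingested or reads the data, it resolves to the same object.
assert resolve_by_path("/datasets/train/shard-0001.parquet") == \
       resolve_by_s3("training-data", "shard-0001.parquet")
print("file and S3 names resolve to the same logical object")
```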