MemVerge

Software Development

Milpitas, California 2,660 followers

About us

MemVerge is a pioneering developer of software for Big Memory Computing. In the cloud or on-premises, data-intensive workloads run faster, cost less, and recover automatically with the company’s award-winning Memory Machine™ products. Memory Machine X is poised to revolutionize how CXL® memory will be used in the future, while Memory Machine Cloud stands out with its ability to continuously right-size cloud cost and performance. Learn more about MemVerge and its Memory Machine software at www.memverge.com.

Industry
Software Development
Company size
11-50 employees
Headquarters
Milpitas, California
Type
Privately Held
Founded
2017

Locations

  • Primary

    1525 McCarthy Blvd

    Suite 218

    Milpitas, California 95035, US

Updates

  • MemVerge reposted this

    Jing Xie

    🌊Spot Instance Surfer | 🤖GPU Optimizer

    Raise your hand if you are tired of the GenAI marketing and hype 🙋♂️ 🙋♀️

    Here are the types of problems I think people on the ground are dealing with, and the type of grounded convos I think most founders and HPC platform leaders are open to having:

    1. Large enterprises care about a hybrid GPU architecture. How do on-prem and colo-hosted clusters work with public cloud GPUs? If using AWS, GCP, et al., can I take advantage of GPU reservations and Spot GPUs in a better way? Most cannot justify 100% public cloud over the long term due to price and where certain sensitive data sits.

    2. Large cloud users need a better way to deal with GPU hardware failures. Checkpointing is one approach to protecting against GPU failures. Everyone wants an easier way to checkpoint single-node and especially multi-node training jobs. MemVerge is on the cutting edge of this.

    3. How to get just the right amount of GPU infra, no more and no less. Plenty of founders I ask are just using a model hosting provider (OpenAI API, Bedrock, etc.). Others are looking at ways to get 1-2 H100s at a time vs. having to rent an entire 8x node (some vendors offer this granularity while others don't). Others aren't sure how much infra they need at all and just want to pay for however much infra time was needed to run the experiment (pay per use).

    4. Slurm vs. k8s, or both? It seems like there is no straightforward answer to how GPUs are managed and orchestrated. From the app dev side, k8s makes a ton of sense, while from the training side and the traditional HPC and platform manager side, Slurm offers numerous advantages as well.

    ---

    It is hard to imagine how anyone can affordably scale without solving some of these problems. What are you seeing on the ground level? Like, repost, or leave a comment below! 👇

    #AWS #GCP #Azure #AI #ML #k8s #slurm #GenAI

  • MemVerge reposted this

    Jing Xie

    🌊Spot Instance Surfer | 🤖GPU Optimizer

    If every AI dev you support gets their own dedicated A100 or H100 GPU for their Jupyter or VS Code project, you're probably doing it wrong.

    If you're in charge of your team's GPU cluster, check out NVIDIA's Multi-Instance GPU (MIG), a feature that works with A100, H100, and newer GPUs regardless of whether you use k8s or Slurm to manage your cluster. Using MIG makes it possible to support 7 or potentially even more AI developers on a single A100 (see table below as a reference example).

    Now if your end users don't have a great grasp of how many GPU resources they really need, k8s starts to shine relative to Slurm (which supports a more fixed and inflexible allocation each time a "job" and "reservation" is granted).

    Even though prices have come down, GPU resources are still quite expensive and there never seems to be enough when the bigger training and inferencing projects start. Use MIG today and don't let your devs get away with hogging more resources than they need.

    I'll include some helpful links in the comments.
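As a rough capacity sketch of the "7 developers on a single A100" point: the snippet below assumes an A100 split with its smallest MIG profile, which allows up to 7 instances per GPU. The function name `gpus_needed` and the fixed slices-per-developer policy are illustrative assumptions, not part of any NVIDIA, k8s, or Slurm API, and real MIG profiles only come in certain sizes.

```python
import math

# Assumption: one A100 can be partitioned into at most 7 MIG instances
# when the smallest profile is used.
MAX_MIG_INSTANCES_PER_GPU = 7


def gpus_needed(num_developers: int,
                slices_per_developer: int = 1,
                slices_per_gpu: int = MAX_MIG_INSTANCES_PER_GPU) -> int:
    """Return how many GPUs are needed if every developer gets a fixed
    number of MIG slices (hypothetical allocation policy)."""
    if num_developers < 0 or slices_per_developer < 1:
        raise ValueError("invalid developer or slice count")
    developers_per_gpu = slices_per_gpu // slices_per_developer
    if developers_per_gpu == 0:
        raise ValueError("a developer needs more slices than one GPU provides")
    return math.ceil(num_developers / developers_per_gpu)


if __name__ == "__main__":
    for team_size in (7, 8, 20):
        print(team_size, "developers ->", gpus_needed(team_size), "GPU(s)")
```

Compared with one dedicated GPU per developer, a 20-person team drops from 20 GPUs to 3 under these assumptions.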

    [Image: reference table, no description available]

  • MemVerge reposted this

    Jing Xie

    🌊Spot Instance Surfer | 🤖GPU Optimizer

    MemVerge hit a nice KPI in 2024: Billions of core hours ARR (annualized run-rate).

    If I add up the annualized core hours on AWS and GCP of the customers who have already implemented, or are in the process of implementing, Memory Machine Batch, the total passed 1 billion core hours in 2H'24.

    Excited to see our growth in 2025 and also some of the newer use cases we have coming:
    - GPU SpotSurfing
    - ARM-based CPU SpotSurfing (Graviton, etc.)
    - UI & compute integration w/ AWS HealthOmics (storage & workflows)
    - Multi-scheduler support (AWS Batch, HTCondor, Slurm, IBM Symphony)
    - Checkpoint Restore for k8s (via a Memory Machine k8s operator)

    If you aren't sure whether you can benefit from Memory Machine Batch, check out a free tool we open sourced just before Christmas called Spot Viewer. It measures and helps you better track the interruption impacts of using EC2 Spot in YOUR AWS Batch environment. I'll drop a link to Spot Viewer in the comments.

    If you are losing too much wall-time (i.e., delayed results) and wasting too much EC2 spend due to Spot interruptions and reruns/retries, shoot me a DM or book a demo call with me.

    #HPC #AWS #Genomics #FSI #Semiconductors #EC2 #GCP

  • For AWS Batch users, in case you missed it

    Jing Xie

    🌊Spot Instance Surfer | 🤖GPU Optimizer

    AWS Batch users, Christmas came early! 🎄🎅

    MemVerge is open sourcing SpotViewer, a tool built to help Cloud HPC users:

    1. Understand how Spot interruptions impact your job runtimes and expected cost savings
    2. Optimize EC2 instance selection via a historical jobs and instance types database at your fingertips and based on your own workloads
    3. Quantify cost savings for jobs run on Spot instances vs On-Demand instances
    4. Determine if you need an EC2 checkpoint-restore based SpotSurfing solution like Memory Machine Batch

    Check it out on GitHub 👉 https://lnkd.in/euqqU-B8

    We plan to continue improving this tool in the open and would love to collaborate with the AWS user community to take it to the next level. Please like, repost, and help share this with your network!

    #AWS #HPC #EC2 #Nextflow

    GitHub - MemVerge/mv-spot-viewer: Spot Viewer: A powerful extension for monitoring and optimizing AWS HPC workloads, showcasing the cost-saving benefits of Spot Instances and enabling better resource utilization.

    github.com

  • MemVerge reposted this

    Achyutha Harish

    Engineering innovative and cost-efficient solutions to supercharge HPC and AI workloads on AWS | Masters in Computer Science.

    AWS Spot Instances can offer up to 90% savings on your compute bill. But here’s the catch: these savings depend heavily on your workload characteristics and infrastructure design.

    Managing and optimizing Spot workloads is no small feat and often requires HPC expertise that your team might not have, including:
    • Job placement strategies to minimize interruptions
    • Optimizing container images to work across diverse instance types
    • Handling container runtime differences between instance families

    In my experience working with researchers, many find themselves caught in the complexities of Spot Instances. Common challenges include:
    • Understanding historical interruption patterns for different instance families
    • Conducting cost comparisons across regions for similar workloads
    • Integrating with AWS Batch job metrics to measure efficiency

    My goal is to simplify Spot Instances by building open-source tools that make them more accessible, transparent, and effective for AWS Batch users.

    Today, I’m excited to introduce SpotViewer, a free tool for EC2 users to forecast:
    • Wasted compute costs when sticking to On-Demand workflows
    • Runtime metrics, including time lost due to job failures
    • Potential savings achievable by switching to Spot

    I’d love to hear your feedback. What challenges have you faced with Spot Instances?

    👉 Check it out on GitHub: https://lnkd.in/gdtapi48

    #AWS #HPC #Cloud

    Jing Xie, Charlie Yu, Christian Kniep, Sateesh Peri, Ronald Turn, MemVerge

    GitHub - MemVerge/mv-spot-viewer: Spot Viewer: A powerful extension for monitoring and optimizing AWS HPC workloads, showcasing the cost-saving benefits of Spot Instances and enabling better resource utilization.

    github.com
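To make the Spot vs. On-Demand comparison concrete, here is a minimal sketch of the kind of arithmetic involved. It is not SpotViewer's actual method: the `JobRun` fields, the `spot_savings` function, and the prices are hypothetical, and the model simply assumes that hours lost to interruptions are re-run and billed at the Spot price.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class JobRun:
    useful_hours: float  # instance hours the job needs when never interrupted
    lost_hours: float    # extra hours re-run on Spot after interruptions


def spot_savings(runs: Iterable[JobRun],
                 on_demand_price: float,
                 spot_price: float) -> Dict[str, float]:
    """Compare an uninterrupted On-Demand bill with a Spot bill that
    also pays for the re-run (lost) hours."""
    runs = list(runs)
    useful = sum(r.useful_hours for r in runs)
    lost = sum(r.lost_hours for r in runs)
    on_demand_cost = useful * on_demand_price
    spot_cost = (useful + lost) * spot_price
    savings = on_demand_cost - spot_cost
    savings_pct = 100.0 * savings / on_demand_cost if on_demand_cost else 0.0
    return {
        "on_demand_cost": on_demand_cost,
        "spot_cost": spot_cost,
        "wasted_spot_cost": lost * spot_price,
        "savings": savings,
        "savings_pct": savings_pct,
    }


if __name__ == "__main__":
    # Hypothetical workload: two jobs, one of which lost 2 hours to interruptions.
    report = spot_savings([JobRun(10, 2), JobRun(5, 0)],
                          on_demand_price=1.00, spot_price=0.25)
    for key, value in report.items():
        print(f"{key}: {value:.2f}")
```

The headline discount shrinks as `lost_hours` grows, which is why interruption history matters when estimating realizable savings.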

  • MemVerge reposted this

    Jing Xie

    🌊Spot Instance Surfer | 🤖GPU Optimizer

    Nested virtualization or checkpoint restore?
    - Which is better for running your Spot EC2 workloads?
    - Many customers I’ve met with don’t fully understand the differences and trade-offs.

    Let’s start with definitions:

    Checkpoint/Restore: Checkpoint/restore saves an application’s state (memory, disk, and process info) at specific intervals, allowing it to restart from the last saved state after a Spot instance interruption.

    Nested Virtualization: Nested virtualization runs a virtual machine (VM) inside an EC2 Spot instance (another VM), decoupling workloads from the host and enabling full VM migration during interruptions.

    Here are the trade-offs:
    - Both work for stateful applications, the hardest workloads to reliably run on Spot.
    - Nested virtualization has a higher resource overhead, so your total realizable benefit will be more limited.
    - Some of the additional overhead of nested virtualization can be reduced if you modify the Linux kernel, but do you really wanna go there?
    - On the other hand, apps such as databases with lots of external dependencies can be trickier with checkpoint restore. HPC apps tend to perform really well with a C+R approach.
    - Checkpoint restore has fewer dependencies on the cloud platform provider itself, so your Spot workloads are more portable. Nested virtualization, by contrast, has to interact with the hypervisor layer in ways unique to each cloud provider.
    - Integration with complex tech stacks is also easier with a checkpoint restore software library approach. I’ve heard from several customers that services like AWS Batch and 3rd party + custom job schedulers can be hard to integrate with nested virtualization approaches.

    Hope this was helpful. Would love to hear the experiences, both good and bad, from others who have tried these approaches on clouds like AWS, GCP, and others.

    #AWS #HPC #Genomics #EDA #FinancialServices #FinOps
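The checkpoint/restore definition above can be illustrated with a toy, application-level sketch. This is not how Memory Machine or any transparent checkpointing tool works internally: it only shows the idea of saving state at intervals and resuming from the last saved state. The function names, the pickle file format, and the `interrupt_at` parameter (which simulates a Spot reclaim) are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically swap it into place so an
    interruption mid-write cannot corrupt the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, total_steps, checkpoint_every, interrupt_at=None):
    """Sum 0..total_steps-1, checkpointing every few steps.
    interrupt_at simulates a Spot interruption before that step runs."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if interrupt_at is not None and state["step"] == interrupt_at:
            raise RuntimeError("simulated Spot interruption")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    try:
        run_job(ckpt, total_steps=10, checkpoint_every=3, interrupt_at=7)
    except RuntimeError as err:
        print("first attempt:", err)
    # The retry resumes from the last checkpoint instead of from step 0.
    print("result after restore:", run_job(ckpt, total_steps=10, checkpoint_every=3))
```

Work done after the last checkpoint is repeated on restore, so the checkpoint interval trades checkpoint overhead against the amount of rework after an interruption.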

  • MemVerge reposted this

    Jing Xie

    🌊Spot Instance Surfer | 🤖GPU Optimizer

    Is your biotech job safe next year?

    Cloud cost optimization is a top priority as the economy continues to slow down. It isn’t always clear who is responsible for setting targets, executing the optimization projects, and reducing per-sample research costs, or how best to do it:
    - Is it the central IT organization?
    - Is it Finance and planning?
    - Should software engineering do it?
    - Should comp bio do it?
    - How about research informatics?

    Here’s my take:
    - The “what” needs to come clearly and early from leadership, IT, and finance organizations.
    - So many customers have stayed locked into an overpriced platform vendor without budgeting the time and costs needed to migrate and achieve long-term savings.
    - The “how” needs to come proactively from comp bio, sw eng, and research informatics.
    - Inputs from these teams can help identify vendor consolidation opps and eliminate cloud waste across IaaS, PaaS, and SaaS layers.
    - Cost per analysis reduction gets achieved faster by simply eliminating staff vs. waiting for them to execute the initiatives…not great long term, but I worked in finance in a previous life and this is how the sausage gets made.

    Founders and corporate leaders, don’t reactively start assigning headcount reduction targets when quarterly revenue misses. Instead, develop a better cloud strategy now:
    - Accelerate adoption and move more workloads to EC2 Spot instances, esp. the computational work needed to get from FastQ to VCF.
    - Get on a savings plan now (check with your AWS AE or partner to see if you qualify).
    - Plan to use reserved instances for workloads that need to run 24/7/365.
    - Optimize your workflows and reduce both overprovisioning and out-of-memory failures at runtime.

    #finops #pharma #biotech #genomics #HPC #AWS

  • MemVerge reposted this

    Jing Xie

    🌊Spot Instance Surfer | 🤖GPU Optimizer

    Even as H100s fall to just $2/hr from $8/hr, that is still $17.5k per year for a single GPU. Use two H100s for one year and you’ve spent what it would take to get a nice used Porsche 911.

    On AWS you rent them 8x at a time (no singles). That’s a nice new Porsche Taycan after 6 months of use (assuming you can get a capacity block at about $4/hr, a 50%-60% discount vs. on-demand).

    I meet a lot of users who don’t need an 8x H100 node for more than a few hours or 1 day at a time. Others want to use Spot GPUs and don’t need an H100; an A10 will do just fine if it’s not used for LLM/DNN type research.

    If there was one Santa Claus wishlist item I’ve heard from AWS customers this past year, it is GPU SpotSurfing. We’ve been working hard at it…it’s not easy. I think we are getting close though…

    Stay tuned and follow me for updates! Happy Friday and have a wonderful weekend!
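The arithmetic behind these figures can be checked with a few lines. The sketch assumes 24/7 usage over a 365-day year and reads the $4/hr capacity block price as a per-GPU rate, which the post does not state explicitly; `rental_cost` is a hypothetical helper.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming the GPU is rented around the clock


def rental_cost(price_per_gpu_hour: float, gpus: int, hours: float) -> float:
    """Total rental cost in dollars for a number of GPUs over a period."""
    return price_per_gpu_hour * gpus * hours


# One H100 at $2/hr for a full year.
one_h100_year = rental_cost(2.0, 1, HOURS_PER_YEAR)

# Two H100s at $2/hr for a full year.
two_h100_year = rental_cost(2.0, 2, HOURS_PER_YEAR)

# An 8x H100 node at an assumed $4 per GPU-hour for six months.
node_half_year = rental_cost(4.0, 8, HOURS_PER_YEAR / 2)

if __name__ == "__main__":
    print(f"1 x H100, 1 year:   ${one_h100_year:,.0f}")
    print(f"2 x H100, 1 year:   ${two_h100_year:,.0f}")
    print(f"8 x H100, 6 months: ${node_half_year:,.0f}")
```

Under these assumptions the single-GPU figure comes to $17,520, matching the roughly $17.5k quoted above.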

  • CXL Consortium

    6,872 followers

    Steve Scargall, Director of Product Management for CXL and AI at MemVerge, is presenting the upcoming CXL Consortium webinar on December 19 at 9:00 am PT to discuss the current state of the CXL ecosystem, highlighting the latest advancements to CXL hardware and software solutions. Register for the webinar to discover how CXL technology is enhancing AI/ML workloads by addressing critical memory capacity and bandwidth challenges in the data center: https://bit.ly/3Vobsb7   #CXLConsortium #ComputeExpressLink #CXL #datacenter #AI #ML

    Breaking Memory Barriers: CXL's Game-Changing Impact on AI/ML

    brighttalk.com

  • MemVerge reposted this

    Jing Xie

    🌊Spot Instance Surfer | 🤖GPU Optimizer

    Three ways to run Nextflow on AWS:

    1. CLI with Nextflow Host
    Key features:
    - Direct control
    - Technical setup on EC2 or AWS Batch
    - Good for users needing maximum flexibility
    - No commercial vendors needed (free)
    For users going this route, there is a “better than free” EC2 Spot integration for AWS Batch called Spot Viewer and SpotSurfer from MemVerge. I’ll drop a link to the docs in the comments below.

    2. GUI via a Commercial Vendor
    Key features:
    - User friendly for less technical users
    - Centralized UI/UX for workflow management
    - Multiple commercial vendors (choices)
    - Visibility tools, user management, security
    AWS HealthOmics provides a way for less technical users to run Nextflow and WDL based workflows with a pay-per-use model, which is nice. MemVerge provides a clean and simple UI/UX and we have exciting AWS HealthOmics related updates coming soon. If you use us to run better on EC2 Spot instances, this layer comes free (unlimited users and samples). There are other vendors that sell via an annual software license. Seqera is the most well known as they created Nextflow.

    3. API Integration
    - Programmatic access, automation
    - Best way to design for scalability
    - Developer friendly, event driven architecture
    AWS Batch provides an API which can be used as part of a custom development solution. AWS HealthOmics also provides an API. Commercial vendors including MemVerge also offer a simplified “one API” design to run Nextflow on AWS.

    #AWS #HPC #Nextflow #WDL #WholeGenomeSequencing #LongReadSequencing

Funding

MemVerge 4 total rounds

Last Round

Series unknown