Meta’s OSDI’24 Paper on Datacenter Resource Allocation

CQ Tang (Chunqiang Tang)

AI, ASIC/GPU/Accelerator, LLM/Llama, hardware/software co-design, HPC, IaaS, PaaS. Hiring PhDs and managers.

Published Aug 5, 2024

BTW, we are hiring.

Meta published three papers at OSDI’24, including one Best Paper. The previous posts introduced the ServiceLab paper (performance testing) and the MAST paper (ML scheduling). This post introduces the ReBalancer paper: “Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences.” Consistent with the series of past posts introducing Meta’s systems research papers to a wider audience, this post will share anecdotes rather than delve into technical details.

This paper proposes a general framework for solving various resource-allocation problems in the datacenter environment. Setting aside the technical details, a key point of the paper is to separate the issue of formulating the resource-allocation task from the issue of solving it efficiently. This is achieved by (1) using a high-level API to easily formulate a resource-allocation task as a constrained optimization problem (specifically, an assignment problem), and (2) solving it in a scalable fashion using optimized local search.
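To make the separation concrete, here is a minimal Python sketch of the idea. It is not ReBalancer's API or its solver: the `Problem` class, `add_constraint`, `local_search`, and the shard and server names are all hypothetical, and the solver is a plain first-improvement hill climb rather than the optimized local search described in the paper. The point is only that the policy (constraints and objective) is declared in one place and the search knows nothing about it.

```python
from typing import Callable, Dict, List

Assignment = Dict[str, str]  # item name -> bin name (e.g. shard -> server)


class Problem:
    """Hypothetical declarative description of an assignment problem."""

    def __init__(self, items: Dict[str, float], bins: Dict[str, float]):
        self.items = items  # item name -> load
        self.bins = bins    # bin name -> capacity
        self.constraints: List[Callable[[Assignment], bool]] = []
        self.add_constraint(self._within_capacity)

    def add_constraint(self, check: Callable[[Assignment], bool]) -> None:
        # A policy change is one more predicate; the solver is untouched.
        self.constraints.append(check)

    def loads(self, a: Assignment) -> Dict[str, float]:
        totals = {b: 0.0 for b in self.bins}
        for item, b in a.items():
            totals[b] += self.items[item]
        return totals

    def _within_capacity(self, a: Assignment) -> bool:
        return all(load <= self.bins[b] for b, load in self.loads(a).items())

    def feasible(self, a: Assignment) -> bool:
        return all(check(a) for check in self.constraints)

    def objective(self, a: Assignment) -> float:
        # Lower is better: utilization of the most heavily loaded bin.
        return max(load / self.bins[b] for b, load in self.loads(a).items())


def local_search(problem: Problem, start: Assignment,
                 max_rounds: int = 1000) -> Assignment:
    """Accept any single-item move that is feasible and lowers the objective."""
    current = dict(start)
    best = problem.objective(current)
    for _ in range(max_rounds):
        improved = False
        for item in problem.items:
            for b in problem.bins:
                if b == current[item]:
                    continue
                candidate = dict(current)
                candidate[item] = b
                if not problem.feasible(candidate):
                    continue
                score = problem.objective(candidate)
                if score < best - 1e-12:
                    current, best, improved = candidate, score, True
        if not improved:
            break  # local optimum: no single move helps
    return current


# Hypothetical example: four shards, two servers of capacity 10.
problem = Problem(
    items={"s1": 4.0, "s2": 3.0, "s3": 2.0, "s4": 1.0},
    bins={"a": 10.0, "b": 10.0},
)
# Example policy: s1 and s2 must not share a server.
problem.add_constraint(lambda a: a["s1"] != a["s2"])

start = {"s1": "b", "s2": "a", "s3": "a", "s4": "a"}
result = local_search(problem, start)
```

In this sketch, adding or changing a policy means calling `add_constraint` or swapping `objective`, while `local_search` stays the same. A production solver would evaluate moves incrementally instead of recomputing every load from scratch, which this toy version does for clarity.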

In contrast, most existing approaches either (a) use mixed integer programming (MIP) to both formulate and solve the resource-allocation problem, which is not scalable, or (b) use ad hoc heuristics to both represent and solve the problem, which is scalable but inflexible, making it difficult to evolve resource-allocation policies.

We have experienced the drawbacks of both approaches (a) and (b) in production. Our RAS system initially used MIP to assign machines to virtual clusters but hit scalability bottlenecks, while our Shard Manager system initially used ad hoc heuristics to assign shards to servers, which became too fragile to accommodate new load-balancing policies. Although both systems had operated in production for years, they eventually converged from opposite directions towards the solution described in the paper: (1) formally but easily representing their resource-allocation tasks as constrained optimization problems via APIs, which makes it easy to modify allocation policies, and then (2) solving them efficiently using optimized local search. For further details about our experiences with the evolution of the resource-allocation solutions for RAS and Shard Manager, please refer to the ReBalancer paper’s Section 7.2 “Experiences with Alternative Approaches.”
