Meta’s OSDI’24 Paper on Datacenter Resource Allocation

BTW, we are hiring.

Meta published three papers at OSDI’24, including one Best Paper. The previous posts introduced the ServiceLab paper (performance testing) and the MAST paper (ML scheduling). This post introduces the ReBalancer paper: “Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences.” Consistent with the series of past posts introducing Meta’s systems research papers to a wider audience, this post will share anecdotes rather than delve into technical details.

This paper proposes a general framework for solving various resource-allocation problems in the datacenter environment. Setting aside the technical details, a key point of the paper is to separate the issue of formulating the resource-allocation task from the issue of solving it efficiently. This is achieved by (1) using a high-level API to easily formulate a resource-allocation task as a constrained optimization problem (specifically, an assignment problem), and (2) solving it in a scalable fashion using optimized local search.

In contrast, most existing approaches either (a) use mixed integer programming (MIP) to both formulate and solve the resource-allocation problem, which is not scalable, or (b) use ad hoc heuristics to both represent and solve the problem, which is scalable but inflexible, making it difficult to evolve resource-allocation policies.

We have experienced the drawbacks of both approaches (a) and (b) in production. Our RAS system initially used MIP to assign machines to virtual clusters but hit scalability bottlenecks, while our Shard Manager system initially used ad hoc heuristics to assign shards to servers, which became too fragile to accommodate new load-balancing policies. Although both systems had operated in production for years, they eventually converged from opposite directions towards the solution described in the paper: (1) formally but easily representing their resource-allocation tasks as constrained optimization problems via APIs, which makes it easy to modify allocation policies, and then (2) solving them efficiently using optimized local search. For further details about our experiences with the evolution of the resource-allocation solutions for RAS and Shard Manager, please refer to the ReBalancer paper’s Section 7.2 “Experiences with Alternative Approaches.”
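To make the separation between formulating and solving concrete, here is a minimal Python sketch of a toy shard-to-server assignment problem. It is not the ReBalancer API or its algorithm: every name in it (`local_search`, `SHARD_LOAD`, `within_capacity`, `max_load`, and the data) is hypothetical, and the solver is a plain greedy single-move local search rather than the optimized one described in the paper. The point is only that the policy (a constraint and an objective) is stated separately from the generic search loop that solves it.

```python
from typing import Callable, Dict, List

Assignment = Dict[str, str]  # maps each object (e.g. a shard) to a bin (e.g. a server)


def local_search(
    objects: List[str],
    bins: List[str],
    initial: Assignment,
    objective: Callable[[Assignment], float],
    feasible: Callable[[Assignment], bool],
    max_rounds: int = 100,
) -> Assignment:
    """Generic solver: accept any single-object move that is feasible and lowers the objective."""
    current = dict(initial)
    best = objective(current)
    for _ in range(max_rounds):
        improved = False
        for obj in objects:
            for candidate in bins:
                previous = current[obj]
                if candidate == previous:
                    continue
                current[obj] = candidate  # tentatively move obj to candidate
                score = objective(current)
                if feasible(current) and score < best:
                    best = score  # keep the move
                    improved = True
                else:
                    current[obj] = previous  # undo the move
        if not improved:
            break  # no single move helps: a local optimum
    return current


# --- Problem formulation (hypothetical toy data and policy) ---
SHARD_LOAD = {"s1": 5, "s2": 4, "s3": 3, "s4": 2}
SERVERS = ["a", "b"]
CAPACITY = 12


def server_loads(assignment: Assignment) -> Dict[str, int]:
    loads = {server: 0 for server in SERVERS}
    for shard, server in assignment.items():
        loads[server] += SHARD_LOAD[shard]
    return loads


def within_capacity(assignment: Assignment) -> bool:
    # Constraint: no server may exceed its capacity.
    return all(load <= CAPACITY for load in server_loads(assignment).values())


def max_load(assignment: Assignment) -> float:
    # Objective: minimize the load of the busiest server.
    return max(server_loads(assignment).values())


start = {"s1": "a", "s2": "a", "s3": "a", "s4": "b"}
result = local_search(list(SHARD_LOAD), SERVERS, start, max_load, within_capacity)
print(server_loads(result))
```

In this sketch, changing the allocation policy means editing only `within_capacity` or `max_load`; the search loop stays untouched. A greedy search like this can get stuck in local optima, which is one reason a production solver needs considerably more machinery.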

Finally, I share the abstract of the ReBalancer paper below.

Abstract: 

Meta's private cloud uses millions of servers to host tens of thousands of services that power multiple products for billions of users. This complex environment has various optimization problems involving resource allocation, including hardware placement, server allocation, ML training & inference placement, traffic routing, database & container migration for load balancing, grouping serverless functions for locality, etc.

The main challenges for a reusable resource-allocation framework are its usability and scalability. Usability is impeded by practitioners struggling to translate real-life policies into precise mathematical formulas required by formal optimization methods, while scalability is hampered by NP-hard problems that cannot be solved efficiently by commercial solvers.

These challenges are addressed by Rebalancer, Meta's resource-allocation framework. It has been applied to dozens of large-scale use cases over the past seven years, demonstrating its usability, scalability, and generality. At the core of Rebalancer is an expression graph that enables its optimization algorithm to run more efficiently than past algorithms. Moreover, Rebalancer offers a high-level specification language to lower the barrier for adoption by systems practitioners.
