Meta’s OSDI’24 Best Paper on Pre-production Performance Testing

BTW, we are hiring.

Meta published three papers at OSDI’24. This post introduces the ServiceLab paper, which won the Best Paper Award at OSDI’24. In keeping with the style of past posts that introduce Meta’s systems research papers to a broader audience, we focus on anecdotes rather than the paper’s technical content, and in particular on how industry papers are written.

Meta’s hyperscale infrastructure requires continuous innovation, but unlike other hyperscalers, Meta does not have a dedicated systems research lab. Instead, all of its systems research papers are authored by teams developing production systems. These teams advance the state of the art while tackling challenging production issues at scale, and then reflect on these experiences to distill working solutions into research papers. The ServiceLab paper is an exemplary outcome of this approach, which ensures that the problems addressed are real and the solutions work at scale, aligning well with key criteria for successful systems research. We greatly appreciate that the systems research community (OSDI, SOSP, ASPLOS, etc.) welcomes industry papers following this approach.

With three Meta papers published at OSDI’24, one might assume that it is relatively easy for hyperscalers to publish systems research papers. However, the publication history of these three papers shows exactly the opposite. The other two papers, MAST and ReBalancer, were both rejected by SOSP’23 initially and then, after revisions based on the SOSP’23 review feedback, resubmitted to and accepted by OSDI’24. In particular, the ReBalancer paper underwent a substantial rewrite.

Although the ServiceLab paper was accepted on its first submission and won the Best Paper Award, its publication was delayed by three years. Writing began in 2020, but the paper missed multiple conference deadlines. The delay was partly due to the authors' limited time for paper writing, as operating the production system is our top priority. Additionally, distilling a complex production system into succinct writing that highlights novelties the research community appreciates is often challenging. A production system typically involves a practical mix of many technologies, most of which may not interest the research community, despite the significant engineering effort involved. For example, our 2020 writing plan for the ServiceLab paper identified "record and replay" as the most obvious, number-one novelty. However, by the time the paper was published in 2024, this aspect had been entirely deemphasized.

Despite the challenges of writing research papers (not experience papers) based on production systems, our journey shows that it is both feasible and highly rewarding. We strongly believe that many production systems in the industry (beyond Meta) have advanced the state of the art, deserve publication, and can provide valuable new perspectives to the research community. We encourage our industry colleagues to publish their work so that we can learn directly from it. We also urge reviewers from the research community to consider the constraints these papers may face. For example, comparative studies that are straightforward for small systems developed in a lab environment are often impractical to conduct at production scale.

Finally, we share the abstract of the ServiceLab paper below:

ABSTRACT: This paper presents ServiceLab, a large-scale performance testing platform developed at Meta. Currently, the diverse set of applications and ML models it tests consumes millions of machines in production, and each year it detects performance regressions that could otherwise lead to the wastage of millions of machines. A major challenge for ServiceLab is to detect small performance regressions, sometimes as tiny as 0.01%. These minor regressions matter due to our large fleet size and their potential to accumulate over time. For instance, the median regression detected by ServiceLab for our large serverless platform, running on more than half a million machines, is only 0.14%. Another challenge is running performance tests in our private cloud, which, like the public cloud, is a noisy environment that exhibits inherent performance variances even for machines of the same instance type. To address these challenges, we conduct a large-scale study with millions of performance experiments to identify machine factors, such as the kernel, CPU, and datacenter location, that introduce variance to test results. Moreover, we present statistical analysis methods to robustly identify small regressions. Finally, we share our seven years of operational experience in dealing with a diverse set of applications.
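
The abstract mentions statistical analysis methods for robustly identifying small regressions in a noisy environment, but it does not spell them out. As a rough, hypothetical illustration of the kind of analysis involved (not ServiceLab's actual algorithm, which the paper describes), here is a minimal Python sketch that compares latency samples from a baseline build and a candidate build with a one-sided permutation test; all numbers and names are made up.

```python
# Hypothetical illustration only -- not ServiceLab's actual method.
# Compare latency samples from a baseline build and a candidate build with a
# one-sided permutation test: is the candidate's mean latency higher than the
# baseline's by more than random noise would explain?
import random


def permutation_test(baseline, candidate, num_permutations=10_000, seed=0):
    """One-sided p-value for the hypothesis that candidate is slower than baseline."""
    rng = random.Random(seed)

    def mean(xs):
        return sum(xs) / len(xs)

    observed_diff = mean(candidate) - mean(baseline)  # positive => regression
    pooled = list(baseline) + list(candidate)
    n_base = len(baseline)
    extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # randomly relabel samples as "baseline" or "candidate"
        diff = mean(pooled[n_base:]) - mean(pooled[:n_base])
        if diff >= observed_diff:
            extreme += 1
    return extreme / num_permutations


if __name__ == "__main__":
    data_rng = random.Random(42)
    # Synthetic latencies (ms): the candidate is ~0.2% slower on average,
    # buried in per-run noise with a 0.5 ms standard deviation.
    baseline = [100.0 + data_rng.gauss(0, 0.5) for _ in range(500)]
    candidate = [100.2 + data_rng.gauss(0, 0.5) for _ in range(500)]
    p_value = permutation_test(baseline, candidate)
    print(f"one-sided p-value for a regression: {p_value:.4f}")
```

In this toy setup, a ~0.2% mean shift is easily flagged with 500 samples per side. Detecting shifts as small as the 0.01% mentioned in the abstract would require far more samples or much lower variance, which is why controlling for machine factors such as the kernel, CPU, and datacenter location matters so much in practice.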

