Meta’s OSDI’24 Best Paper on Pre-production Performance Testing

BTW, we are hiring.

Meta published three papers at OSDI’24. This post introduces the ServiceLab paper, which won the Best Paper Award at OSDI’24. In keeping with the style of past posts that introduce Meta’s systems research papers to a broader audience, we focus on anecdotes rather than the paper’s technical content, and in particular on how industry papers are written.

Meta’s hyperscale infrastructure requires continuous innovation, but unlike other hyperscalers, Meta does not have a dedicated systems research lab. Instead, all of its systems research papers are authored by teams developing production systems. These teams advance the state of the art while tackling challenging production issues at scale, and then reflect on these experiences to distill working solutions into research papers. The ServiceLab paper is an exemplary outcome of this approach, which ensures that the problems addressed are real and the solutions work at scale, aligning well with key criteria for successful systems research. We greatly appreciate that the systems research community (OSDI, SOSP, ASPLOS, etc.) welcomes industry papers following this approach.

With three Meta papers published at OSDI’24, one might assume that it is relatively easy for hyperscalers to publish systems research papers. However, the publication history of these three papers shows exactly the opposite. The other two papers, MAST and ReBalancer, were both rejected by SOSP’23 initially and then, after revisions based on the SOSP’23 review feedback, resubmitted to and accepted by OSDI’24. In particular, the ReBalancer paper underwent a substantial rewrite.

Although the ServiceLab paper was accepted on its first submission and won the Best Paper Award, its publication was delayed by three years. Writing began in 2020, but the paper missed multiple conference deadlines. The delay was partly due to the authors' limited time for paper writing, as operating the production system is our top priority. Additionally, distilling a complex production system into succinct writing that highlights novelties the research community appreciates is often challenging. A production system typically involves a practical mix of many technologies, most of which may not interest the research community, despite the significant engineering effort involved. For example, our 2020 writing plan for the ServiceLab paper identified "record and replay" as the most obvious, number-one novelty. However, by the time the paper was published in 2024, this aspect had been entirely deemphasized.

Despite the challenges of writing research papers (not experience papers) based on production systems, our journey shows that it is both feasible and highly rewarding. We strongly believe that many production systems in the industry (beyond Meta) have advanced the state of the art, deserve publication, and can provide valuable new perspectives to the research community. We encourage our industry colleagues to publish their work so that we can learn directly from it. We also urge reviewers from the research community to consider the constraints these papers may face. For example, comparative studies that are straightforward for small systems developed in a lab environment are often impractical to conduct at production scale.

Finally, we share the abstract of the ServiceLab paper below:

ABSTRACT: This paper presents ServiceLab, a large-scale performance testing platform developed at Meta. Currently, the diverse set of applications and ML models it tests consumes millions of machines in production, and each year it detects performance regressions that could otherwise lead to the wastage of millions of machines. A major challenge for ServiceLab is to detect small performance regressions, sometimes as tiny as 0.01%. These minor regressions matter due to our large fleet size and their potential to accumulate over time. For instance, the median regression detected by ServiceLab for our large serverless platform, running on more than half a million machines, is only 0.14%. Another challenge is running performance tests in our private cloud, which, like the public cloud, is a noisy environment that exhibits inherent performance variances even for machines of the same instance type. To address these challenges, we conduct a large-scale study with millions of performance experiments to identify machine factors, such as the kernel, CPU, and datacenter location, that introduce variance to test results. Moreover, we present statistical analysis methods to robustly identify small regressions. Finally, we share our seven years of operational experience in dealing with a diverse set of applications.
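
The abstract mentions statistical analysis methods for robustly identifying small regressions in a noisy environment, but it does not spell them out. As a rough, hypothetical illustration of the kind of analysis involved (not ServiceLab's actual algorithm, which the paper describes), here is a minimal Python sketch that compares latency samples from a baseline build and a candidate build with a one-sided permutation test; all numbers and names are made up.

```python
# Hypothetical illustration only -- not ServiceLab's actual method.
# Compare latency samples from a baseline build and a candidate build with a
# one-sided permutation test: is the candidate's mean latency higher than the
# baseline's by more than random noise would explain?
import random


def permutation_test(baseline, candidate, num_permutations=10_000, seed=0):
    """One-sided p-value for the hypothesis that candidate is slower than baseline."""
    rng = random.Random(seed)

    def mean(xs):
        return sum(xs) / len(xs)

    observed_diff = mean(candidate) - mean(baseline)  # positive => regression
    pooled = list(baseline) + list(candidate)
    n_base = len(baseline)
    extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # randomly relabel samples as "baseline" or "candidate"
        diff = mean(pooled[n_base:]) - mean(pooled[:n_base])
        if diff >= observed_diff:
            extreme += 1
    return extreme / num_permutations


if __name__ == "__main__":
    data_rng = random.Random(42)
    # Synthetic latencies (ms): the candidate is ~0.2% slower on average,
    # buried in per-run noise with a 0.5 ms standard deviation.
    baseline = [100.0 + data_rng.gauss(0, 0.5) for _ in range(500)]
    candidate = [100.2 + data_rng.gauss(0, 0.5) for _ in range(500)]
    p_value = permutation_test(baseline, candidate)
    print(f"one-sided p-value for a regression: {p_value:.4f}")
```

In this toy setup, a ~0.2% mean shift is easily flagged with 500 samples per side. Detecting shifts as small as the 0.01% mentioned in the abstract would require far more samples or much lower variance, which is why controlling for machine factors such as the kernel, CPU, and datacenter location matters so much in practice.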

