Bug causes Cloudflare to lose customer logs

News
27 Nov 2024 · 3 mins
Data and Information Security

An outage affecting most users of Cloudflare Logs caused the loss of more than half of the logs normally sent to customers.


A Wednesday blog post from Cloudflare revealed that a software bug resulted in the loss of about 55% of the logs that would have been sent to customers over a 3.5-hour period on 14 November.

The company explained that every part of its global network of services generates event logs containing detailed metadata about its activities. For example, every request to Cloudflare’s content delivery network (CDN) creates a log. It makes these logs available to customers, who can use them in a number of ways, including compliance, observability and accounting. The company said that on a typical day it sends about 4.5 trillion individual event logs to customers.

The problem originated in a change to a system called Logpush, which collects individual logs from Cloudflare’s network of servers into batches and pushes them to customers. Although customers can receive their logs directly from each server, most choose not to.

“By analogy, imagine the postal service ringing your doorbell once for each letter instead of once for each packet of letters,” Cloudflare explained in the post. “With thousands or millions of letters each second, the number of separate transactions that would entail becomes prohibitive.”

When the company added support for another dataset into Logpush, it also had to add a new configuration to a component called Logfwdr to tell the system which customers’ logs should be forwarded to the new stream.

A bug in the system sent a blank configuration to Logfwdr, telling it no customers had logs to be pushed. That problem, the company said, was quickly spotted and the change reverted in less than five minutes.

However, a failsafe designed to address just such a problem backfired. When the Logfwdr configuration was unavailable, the failsafe forwarded logs for all customers, not only those with a Logpush job configured. In this case, the five-minute glitch caused a massive spike in the number of logs to be sent, overloading the buffering system, Buftee, and making it unresponsive.
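The failsafe behavior described above can be illustrated with a minimal sketch. This is not Cloudflare's code: the function name `customers_to_forward`, the customer identifiers, and the customer counts are all hypothetical, chosen only to show how a "fail open" rule turns a blank configuration into a sudden jump in volume.

```python
from typing import Optional


def customers_to_forward(config: Optional[set[str]], all_customers: set[str]) -> set[str]:
    """Hypothetical fail-open rule: if the configuration is missing or blank,
    forward logs for every customer instead of dropping them."""
    if not config:  # None or an empty set counts as "configuration unavailable"
        return set(all_customers)
    # Normal case: forward only for customers named in the configuration.
    return config & all_customers


# Illustrative numbers only: 1,000 customers, 3 of whom have a push job set up.
all_customers = {f"cust-{i}" for i in range(1000)}
configured = {"cust-1", "cust-2", "cust-3"}

normal = customers_to_forward(configured, all_customers)
blank = customers_to_forward(set(), all_customers)

print(len(normal), len(blank))
```

In this toy setup, the blank configuration multiplies the number of customers being forwarded by several hundred, which is the kind of spike a downstream buffering layer has to be prepared to absorb.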

Buftee provides buffers for each Logpush job, containing 100% of the logs generated by the zone or account referenced by that job, so the failure to process one customer’s job will not affect progress on others. It contained safeguards against being overwhelmed by a massive increase in the number of buffers — but those safeguards had not been configured, Cloudflare said.
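As a rough illustration of the kind of safeguard described here, the sketch below keeps one buffer per job and refuses to create new buffers once a configured limit is reached. The class name `BufferPool`, the `max_buffers` parameter, and the job identifiers are assumptions made for the example; the article does not describe how Buftee's safeguards are actually implemented.

```python
from collections import deque


class BufferPool:
    """Hypothetical per-job buffer pool with a cap on the number of buffers."""

    def __init__(self, max_buffers: int):
        self.max_buffers = max_buffers
        self.buffers: dict[str, deque] = {}
        self.rejected = 0  # writes refused because the cap was reached

    def append(self, job_id: str, log_line: str) -> bool:
        buf = self.buffers.get(job_id)
        if buf is None:
            # Refuse to create a new buffer once the cap is hit, so a flood
            # of new jobs cannot exhaust the whole system.
            if len(self.buffers) >= self.max_buffers:
                self.rejected += 1
                return False
            buf = self.buffers[job_id] = deque()
        buf.append(log_line)
        return True


pool = BufferPool(max_buffers=3)
for i in range(10):
    pool.append(f"job-{i}", "event")

# Jobs that already have a buffer keep working after the cap is reached.
still_accepted = pool.append("job-0", "another event")
print(len(pool.buffers), pool.rejected, still_accepted)
```

The point of the cap is isolation: a sudden increase in the number of jobs degrades service for the new jobs only, while existing buffers continue to accept logs. A limit of this kind only helps if it is actually set, which is the gap Cloudflare said it found.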

“A short, temporary misconfiguration lasting just five minutes created a massive overload that took us several hours to fix and recover from,” the blog stated. “Because our backstops were not properly configured, the underlying systems became so overloaded that we could not interact with them normally. A full reset and restart was required.”

To prevent a recurrence, Cloudflare said, “we’re creating alerts to ensure that these particular misconfigurations will be impossible to miss, and we are also addressing the specific bug and the associated tests that triggered this incident.”
