July 3, 2019 | 11:09
Load-management and security firm Cloudflare has apologised for an outage that took down a surprisingly large chunk of the web for around half an hour yesterday, placing the blame on a botched firewall update.
Founded in 2009 by Matthew Prince, Lee Holloway, and Michelle Zatlyn, Cloudflare offers services that aim to improve the performance and security of a wide range of websites: a high-performance content delivery network, a load-balanced caching system, protection against distributed denial-of-service (DDoS) attacks, partial encryption for sites that would otherwise be unable to support it, and a web application firewall (WAF) designed to detect and block attacks.
Sadly, this last feature proved troublesome yesterday afternoon when Cloudflare's customers' websites - which include some of the biggest sites on the web - began displaying HTTP 502 Bad Gateway errors. The outage, which lasted around half an hour, was similar in its effects to the BGP misroute of last month, though more widespread - and this time it was entirely down to Cloudflare's own firewall system.
'We experienced a global service disruption that affected most Cloudflare traffic for 27 minutes,' the company explains in an email to customers. 'The issue was triggered by a bug in a software deploy [sic] of the Cloudflare Web Application Firewall (WAF) which resulted in a CPU usage spike globally, and 502 errors for our customers. To restore global traffic we temporarily disabled certain WAF capabilities, removed the underlying software bug, then verified and re-enabled all WAF services.
'We're deeply sorry about how this disruption has impacted your services. Our engineering teams continue to investigate this issue and we will be sharing detailed incident report(s) on the Cloudflare blog.'
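Cloudflare's subsequent postmortem attributed the CPU spike to a WAF rule containing a regular expression that triggered catastrophic backtracking. As a rough sketch only - the pattern below is illustrative and is not Cloudflare's actual rule - here is how a backtracking regex engine can burn exponential CPU time on an input that almost, but not quite, matches:

```python
import re

# Nested quantifiers such as (a+)+ can trigger "catastrophic
# backtracking": on a near-miss input, the engine retries
# exponentially many ways of splitting the run of 'a's before
# concluding there is no match.
# (Illustrative pattern only - not the actual Cloudflare WAF rule.)
EVIL_PATTERN = re.compile(r"^(a+)+$")

def check(payload: str) -> bool:
    """Return True if the payload matches the pattern."""
    return EVIL_PATTERN.match(payload) is not None

# A fully matching input is cheap: one greedy pass succeeds.
print(check("a" * 50))        # matches quickly

# A near-miss forces backtracking; each extra 'a' roughly doubles
# the work. Keep n small here so the sketch finishes promptly.
print(check("a" * 20 + "b"))  # no match, but costs ~2^20 steps
```

Non-backtracking engines such as RE2 guarantee linear-time matching and so avoid this failure mode entirely, at the cost of dropping features like backreferences.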
Cloudflare has confirmed that it plans to improve its software testing and deployment processes in the wake of the outage.