The outage of a surprisingly large chunk of the internet infrastructure yesterday has been blamed, once again, on the Border Gateway Protocol (BGP) - specifically, an 'optimiser' which turned out to do anything but.
The internet as we know it today got its start as a network for academics and the military dubbed ARPANET. With only authorised personnel allowed to access it and a number of nodes you could comfortably count on one hand, collaboration and trust were key. This attitude carried over to the modern internet, including in the development of systems like BGP, which allows a network to broadcast routing changes that are, by and large, trusted by their recipients.
This, naturally, causes issues when - through malice or ignorance - the routing change should not be trusted, as is the case for a broadcast which took down high-profile internet sites yesterday. 'A small company in Northern Pennsylvania became a preferred path of many Internet routes through Verizon (AS701), a major Internet transit provider,' explains hosting and caching specialist Cloudflare's Tom Strickx in a blog post analysing the issue. 'This was the equivalent of Waze routing an entire freeway down a neighbourhood street — resulting in many websites on Cloudflare, and many other providers, to be unavailable from large parts of the Internet. This should never have happened because Verizon should never have forwarded those routes to the rest of the Internet.
'We have blogged about these unfortunate events in the past, as they are not uncommon. This time, the damage was seen worldwide. What exacerbated the problem today was the involvement of a “BGP Optimiser” product from Noction. This product has a feature that splits up received IP prefixes into smaller, contributing parts (called more-specifics). For example, our own IPv4 route 104.20.0.0/20 was turned into 104.20.0.0/21 and 104.20.8.0/21. It’s as if the road sign directing traffic to “Pennsylvania” was replaced by two road signs, one for “Pittsburgh, PA” and one for “Philadelphia, PA”. By splitting these major IP blocks into smaller parts, a network has a mechanism to steer traffic within their network but that split should never have been announced to the world at large. When it was it caused today’s outage.'
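To see why leaked more-specifics are so disruptive, it helps to know that routers generally prefer the most specific (longest) matching prefix for a destination. The following is a minimal Python sketch of that idea, not a model of Noction's product or of any real router: the 10.20.0.0/20 prefix comes from private address space and is purely illustrative, and the helper names `split_prefix` and `best_route`, along with the origin labels, are hypothetical.

```python
import ipaddress
from typing import Dict, List, Optional


def split_prefix(prefix: str) -> List[ipaddress.IPv4Network]:
    """Split a prefix into its two more-specific halves (e.g. a /20 into two /21s)."""
    network = ipaddress.ip_network(prefix)
    return list(network.subnets(prefixlen_diff=1))


def best_route(destination: str,
               routes: Dict[ipaddress.IPv4Network, str]) -> Optional[str]:
    """Longest-prefix match: return the origin of the most specific covering route."""
    address = ipaddress.ip_address(destination)
    matches = [(net, origin) for net, origin in routes.items() if address in net]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# Illustrative routing table: one legitimate /20 announcement (assumed values).
routes = {ipaddress.ip_network("10.20.0.0/20"): "legitimate origin"}
before_leak = best_route("10.20.9.1", routes)

# The two /21 halves escape to the wider internet via another path.
more_specifics = split_prefix("10.20.0.0/20")
for half in more_specifics:
    routes[half] = "leaked path"
after_leak = best_route("10.20.9.1", routes)

print([str(half) for half in more_specifics])
print(before_leak, "->", after_leak)
```

Because each /21 is more specific than the original /20, every address in the block matches a leaked route first, which is how a split intended for internal traffic steering can capture traffic globally once it is announced.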
Yesterday's BGP mishap is far from the first: back in November, Google saw its cloud service traffic routed through Russian and Chinese IP addresses for an hour in what was initially thought to be a traffic hijack attack but was later attributed to small Nigerian ISP MainOne and a misconfiguration on its network. In response to these issues, the Mutually Agreed Norms for Routing Security (MANRS) initiative was founded - though adoption of its recommended practices remains unfortunately slow.
July 1 2020 | 17:34