Service Outage London 17/10/2017

Further to the notification of the Portal outage, a second, related incident occurred which had a much larger impact on the network.

At 14:36, one of the two routers in the London DC crashed and did not recover. Traffic was re-routed via the other router and, apart from the Portal, service was unaffected.

At 14:54, the second router withdrew its BGP routes from upstream neighbours. This was caused by a bug in the routing code. A previously implemented workaround did not work, and we will investigate alternative methods to ensure stable rerouting during failure scenarios.

By 15:30, all services apart from those on the 159.253.160.0/21 range were working.
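
To check whether a particular address falls inside the range mentioned above, a minimal sketch using Python's standard `ipaddress` module might look like the following. The function name `is_in_affected_range` and the sample addresses are illustrative assumptions and are not taken from any of our tooling.

```python
import ipaddress

# The range named in this post.
AFFECTED_RANGE = ipaddress.ip_network("159.253.160.0/21")


def is_in_affected_range(address: str) -> bool:
    """Return True if the given IPv4 address lies inside the affected range."""
    return ipaddress.ip_address(address) in AFFECTED_RANGE


if __name__ == "__main__":
    # First and last address covered by the /21, and how many it contains.
    print(AFFECTED_RANGE[0], AFFECTED_RANGE[-1], AFFECTED_RANGE.num_addresses)

    # Hypothetical sample addresses, for illustration only.
    for sample in ("159.253.164.10", "159.253.168.1"):
        print(sample, is_in_affected_range(sample))
```

A /21 spans 2048 addresses, so this range runs from 159.253.160.0 to 159.253.167.255.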

At 16:20, the remaining prefixes were restored.

21:00: Replacement hardware is on site and will be installed during an emergency maintenance window at 01:00.

05:30: Replacement hardware has been successfully commissioned in the network and all services are working.

Posted by Infrastructure Team on October 17th, 2017
Posted in Status