Further to the notification of the Portal outage, a second, related incident occurred which had a much larger impact on the network.
At 14:36 one of the two routers in the London DC crashed and did not recover. Traffic was re-routed via the other router and, apart from the Portal, service was unaffected.
At 14:54, the second router withdrew its BGP routes from upstream neighbours. This was caused by a bug in the routing code. A previously implemented workaround did not work, and we will investigate alternative methods to ensure stable rerouting during failure scenarios.
By 15:30 all traffic was flowing normally, apart from services on the 18.104.22.168/21 range.
At 16:20 the remaining prefixes were restored.
21:00: Replacement hardware is on site and will be installed during an emergency maintenance window at 01:00.
05:30: Replacement hardware has been successfully commissioned in the network and all services are working.