At Web Kinect we are aware that our success over the past 7 years is due almost entirely to customer referral and word of mouth, and we know that our high service level is the primary reason for this. Therefore, when we fail to meet those service levels, I feel it is important to tell you why.
Today (Tuesday 12th August) we experienced a network routing issue between approximately 9AM and 11:30AM. This meant that some visitors in the UK and elsewhere in the world were unable to reach our network, including customer websites, our own website, email services and other services provided by Web Kinect. It’s important to point out that this issue was sporadic and only affected a small number of people. For the majority of the world, services were accessible throughout.
Please rest assured that we dealt with this with the utmost urgency, and let me reiterate that maintaining our reputation for uptime, reliability and personal support is of the highest concern to me. I understand that, for many of you, your online business is your livelihood.
For those who are interested, the technical explanation…
The global routing table is a ‘map’ of each possible destination on the internet. Every large network operator (such as ourselves, or the ISP you use to connect to the internet) holds a copy of this, or multiple copies in our case. This is what enables every computer on the internet to reach every other computer.
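For the more technically minded, here is a rough sketch in Python of how a router might use such a 'map' to decide where to send traffic: it picks the most specific range that contains the destination address. The three routes and the next-hop names (default-upstream, peer-a, peer-b) are made up for illustration, and a real router does this in dedicated hardware rather than in a loop like this.

```python
import ipaddress

# Hypothetical miniature routing table: address range -> next hop label.
# The real global table holds hundreds of thousands of entries.
ROUTES = {
    ipaddress.ip_network("0.0.0.0/0"): "default-upstream",
    ipaddress.ip_network("203.0.113.0/24"): "peer-a",
    ipaddress.ip_network("203.0.113.128/25"): "peer-b",
}


def lookup(address):
    """Return the next hop of the most specific route containing address."""
    ip = ipaddress.ip_address(address)
    # Collect every range that contains the destination...
    matches = [net for net in ROUTES if ip in net]
    # ...and prefer the longest prefix (the smallest, most specific range).
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]
```

In this sketch, an address in the upper half of 203.0.113.0/24 goes to peer-b, the lower half goes to peer-a, and anything else falls back to the default route.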
Over the past two decades the routing table has been increasing in size, due to new IPv4 addresses being used and existing IPv4 address ranges being split (meaning that two consecutive ranges might have different paths). Today it hit 512,000 routes. This is a magic number as it’s an inbuilt limit in many common routers and switches.
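To illustrate why splitting ranges makes the table grow, here is a minimal sketch in Python. The helper names (split_route, fits) are hypothetical, the 512,000 figure is the one quoted above, and how a particular router actually behaves at its limit depends on the hardware, so treat the check as a simplification.

```python
import ipaddress

ROUTE_LIMIT = 512_000  # the limit quoted above; real limits vary by device


def split_route(table, prefix):
    """Replace one route with its two halves, as when a range is split."""
    net = ipaddress.ip_network(prefix)
    table.remove(net)
    # One /24 becomes two /25s, so the table grows by one entry.
    table.extend(net.subnets(prefixlen_diff=1))
    return table


def fits(route_count, limit=ROUTE_LIMIT):
    """Simplified check: can a router with this limit hold the table?"""
    return route_count <= limit
```

Each split adds one entry without adding any new addresses, which is how the table can creep towards a fixed limit even when no new ranges are handed out.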
We had pre-empted this. Most of our routers/switches already have a higher limit and we have recently spent £250,000 on network upgrades to improve the rest of our network. These had not yet been installed as we believed we had room to spare. However, last night there was a sudden increase in the number of routes being announced to the world and at 9AM we hit the limit.
Due to human factors it took us approximately half an hour to find the cause of the issue. At that point we applied a fix on the only router we believed was affected. This took effect after a reboot, and the majority of people who could not access our network were then able to. However, some users were still reporting problems, so we continued to investigate. We initially believed the issue might lie elsewhere, as customers were also reporting problems reaching websites such as eBay, Amazon and Skype. However, despite the lack of any log entries to indicate it, it turned out that another of our Cisco routers had also hit its routing limit. The same configuration change was applied to that router and, at that point, the remaining people who were still having problems were able to access their sites again.
It seems many other high-profile ISPs also suffered the same issue today, with most BT Internet customers among those affected, and most have now fixed their own networks.
Over the next 2 weeks we will be replacing all the affected routers with brand new Juniper devices which can hold enough routes to cover us for the next decade.
This issue only affected a small proportion of people accessing our network, and most users would not have seen any disruption. Nonetheless, I would like to express my sincere apologies for the issues that some of you faced and thank you for your continued support.
Tuesday, August 12, 2014