Network issues 7th December 2013

By Richard Leyton, parkrun systems engineer
— Saturday 7 December 2013

We had some problems this morning that meant many event teams couldn't log in to publish results, and runners couldn't access their profiles or request reminders.

What happened

This morning, from about 7.15am until just after midday, access to the server that hosts the master database and key services (including event login authentication) kept failing or timing out.

This meant runner profiles, reminders and registration were offline, and event teams couldn't access their event email or log in to the results processing systems.

The cause was a major core network outage at the ISP that hosts that system. Connectivity to our server was intermittent, and as a consequence various pages and services running on it were inaccessible or timed out.

The core network issue was resolved by our ISP just after midday, at which point our services started working normally again.

What we're doing

It's important to remember that parkrun's system hosting has evolved over a period of years. As systems were added and built out as needs grew or changed, it wasn't always appropriate or cost-effective to stay with a particular service provider or solution. As a consequence, we have a number of different service providers in the mix for different parts of the system (largely split by OS, i.e. a Windows hosting specialist, a Linux hosting specialist, and the master database on a dedicated machine).

This means we are a bit more exposed to connectivity issues than if we were all in one place, with our servers under one virtual roof. In short, there are more things to break, but ongoing costs were lower.

We've actually been working hard to rectify this over the last couple of months. Indeed, we're just starting the actual consolidation exercise to bring nearly all of our systems under one virtual roof (that'll be the subject of a post soon): on Friday I switched most of our website traffic to the new platform, and hope to make more progress over the coming week. Much of this is just good practice, but we also need to get the API deployment platform ready.

So, we'll be continuing with that exercise. The outage itself was ultimately just bad timing. After three years of solid and reliable service from that particular ISP, it was really just unfortunate it struck on a Saturday morning.

Hope that explains a bit more about what happened and what we're doing. More soon.


Richard, and the rest of the UKTT team
twitter @rleyton
