This weekend we had a bit of a service problem that affected a lot of events.
We know many of you are happy with the experience of a social run on Saturday mornings, and don't really mind all the other bits and pieces, or even a delay of a few days.
But we also know many of you are excited by the prospect of a confirmed PB, club membership confirmations, or seeing how your friends got on, and have got used to our rapid turnaround of results.
We also have our own high expectations of what we offer. Event team volunteers all across the world put in hours of time to stage fantastic parkrun events in their communities, and we want to turn around their results as quickly as possible for them.
A bit of background
Results Processing (RP) is taken care of at HQ. Events submit results, and the results are processed one event at a time. RP is more than just tallying up a few figures: it needs to recalculate points for the year, gender positions and so on. So for large events it can take a few minutes.
It's a separate system (i.e. not part of the system teams use in the cafe). First unleashed by PSH on the world back in 2007 with 1 event, it has since scaled out to support about 300 events in eight countries. Event teams who used desktop FMS won't be shocked to learn that it's written in Microsoft Access. But it's lasted 6 years and now supports 300 times the number of events. It might not be shiny shiny, but it's quite reliable.
We'd made a few changes during the week to fix problems that had occurred the week before, so whilst it looks like we had a bad weekend, we had actually fixed one problem that had slowed things down in recent weeks. We'd also changed how some emails were generated.
We had a couple of problems. Firstly, the changes we'd introduced during the week to how emails were generated backfired in a small but significant way (it was an easy mistake to make), and broke the stream of email generation. So it didn't send all the emails it should have per event, and to make matters worse, it didn't mark the event as complete, so it sent them again next time around.
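For the technically curious, here is a minimal Python sketch of one common way to guard against that kind of repeat sending. It is not how RP works (RP is written in Access); the function name, the `sent_log` set and the example addresses are all made up for illustration. The idea is to record each recipient as soon as their email has gone out, so that a retry after a failure only sends to the people who were missed.

```python
def send_event_emails(recipients, sent_log, send):
    """Send one email per recipient, skipping anyone already in sent_log.

    Returns True once every recipient has been sent to, meaning the
    event can safely be marked complete. All names here are hypothetical.
    """
    for address in recipients:
        if address in sent_log:
            continue  # already delivered on an earlier attempt
        try:
            send(address)
        except Exception:
            return False  # leave the event incomplete so it is retried later
        sent_log.add(address)  # record only after a successful send
    return True


# Simulate a sender that fails once on the third recipient.
delivered = []
failures = {"c@example.com": 1}

def flaky_send(address):
    if failures.get(address, 0) > 0:
        failures[address] -= 1
        raise RuntimeError("simulated failure")
    delivered.append(address)

recipients = ["a@example.com", "b@example.com", "c@example.com", "d@example.com"]
sent_log = set()
first_attempt = send_event_emails(recipients, sent_log, flaky_send)
second_attempt = send_event_emails(recipients, sent_log, flaky_send)
```

In this sketch the first attempt stops at the failure and the second attempt picks up where it left off, so nobody receives a duplicate. In a real system the sent log would need to live somewhere durable, such as a database table.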
We also had a recurrence of an SMS issue we've been battling with for the last month or so. Building up the list of recipients is a lot more involved than you might think (we double check a lot of things), and there was a lock conflict between RP and the SMS system. This caused RP to fail to mark a result set as complete (which pushed it to the back of the queue), and also caused SMSs to fail, requiring manual intervention to schedule them again.
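To give a feel for what a lock conflict looks like, here is a small Python sketch under some loud assumptions: two jobs share one lock, and the job that marks results complete waits and retries rather than giving up at the first conflict. The names (`mark_complete_with_retry`, `sms_job`, `table_lock`) are invented, and Access handles locking quite differently, so treat this as an illustration of the general idea only.

```python
import threading
import time


def mark_complete_with_retry(lock, mark_complete, attempts=5, timeout=0.1, backoff=0.01):
    """Try to take the shared lock a few times before giving up.

    Returns True if mark_complete() ran, False if the lock was never free.
    """
    for attempt in range(attempts):
        if lock.acquire(timeout=timeout):
            try:
                mark_complete()
                return True
            finally:
                lock.release()
        time.sleep(backoff * (attempt + 1))  # wait a little longer each time
    return False


table_lock = threading.Lock()
completed = []
lock_taken = threading.Event()

def sms_job():
    # Stand-in for the SMS job holding the shared lock for a while.
    with table_lock:
        lock_taken.set()
        time.sleep(0.15)

holder = threading.Thread(target=sms_job)
holder.start()
lock_taken.wait()  # make sure the SMS job really holds the lock first
ok = mark_complete_with_retry(table_lock, lambda: completed.append("event-1"))
holder.join()
```

Here the first attempt times out while the stand-in SMS job holds the lock, and a later attempt succeeds once it is released, so the result set still gets marked complete.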
We'd said on Sunday we were resending emails, but they didn't come through very fast. It turns out that during problem diagnosis we'd dropped a few indexes on RP, which slowed things right down.
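If you're wondering why a missing index matters so much, the Python sketch below shows the effect using SQLite rather than Access (an assumption made purely so the example is easy to run). The `results` table, its columns and the index name are hypothetical. Without an index the database has to scan every row to find one event's results; with one it can jump straight to them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (event_id INTEGER, runner_id INTEGER, finish_time TEXT)"
)
# Made-up data: 50 events with 20 runners each.
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [(e, r, "00:25:00") for e in range(50) for r in range(20)],
)


def query_plan(connection):
    """Ask SQLite how it would look up one event's results."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM results WHERE event_id = ?", (7,)
    ).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the description


plan_without_index = query_plan(conn)  # a full scan of the table
conn.execute("CREATE INDEX idx_results_event ON results (event_id)")
plan_with_index = query_plan(conn)  # a search using idx_results_event

print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plans varies between SQLite versions, but the first reports a scan of the whole table and the second names the index.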
It took us from Saturday mid-afternoon (with the SMS issues and locking issue), through to early Sunday afternoon to unpick RP, and to Monday to fully resolve issues.
What we're doing
We've already fixed the problems with emails and RP. We've been working on a project to move RP off Access for a while now - moving it into the main results system, and using more rigorous release processes. We're also breaking SMSs out to a separate system: with 10,000 SMSs each week, the volume is an order of magnitude larger than it was just a few months back. The scale of parkrun's growth strikes again. We're also rewriting how emails are generated. That should start to come on stream very soon.
We've restarted email delivery. Sorry if you get them an additional time. We couldn't separate out who had and hadn't received an email, and - to be honest - most of the complaints were about the lack of email!
Just a reminder if you have a problem, please report it via http://support.parkrun.com - We can't really take issue reports via social media.
Hope that gives a bit more insight into what happened this weekend.
Richard, and the rest of the UKTT team