Way back in Graphic of the Week #5 I looked at a way to work out how 'fast' one parkrun is compared to another. In this edition I revisit the topic and hopefully come up with a better set of conversion factors.
First a recap
What do I mean by 'conversion factor'? - a conversion factor enables us to compare a run time at one event with a run at another. Runners often talk about how event A is quicker than event B - this could be because of relative hilliness, technicality (i.e. lots of twists and turns), surface quality (e.g. tarmac, grass or gravel), or, dare I whisper this, small differences in distance. The conversion factor is a way to objectively quantify the combined differences. Given a run time at event A we would multiply it by the A-to-B conversion factor to give us an equivalent run time at event B.
How were the conversion factors previously calculated? - I need one conversion factor for each pair of events. For each pair, I first got a list of all the runners who had run at both events and worked out their best time at each event. Each of these runners provides one data pair, and dividing one time by the other produces that runner's own personal conversion factor for the two events. I then averaged the personal conversion factors of all these runners to give the overall conversion factor.
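For anyone who prefers code to words, here is a minimal sketch of that best-times method. It is an illustration only, not the code actually used: the function name and the (runner, event, seconds) data layout are made up, and it assumes the A-to-B factor is the B time divided by the A time, so that multiplying an A time by it gives a B time.

```python
from statistics import mean


def best_time_factor(results, event_a, event_b):
    """Estimate the A-to-B factor from each runner's best time at each event.

    results: iterable of (runner_id, event, seconds) tuples (hypothetical layout).
    Returns (factor, number_of_data_pairs), or (None, 0) if nobody ran both.
    """
    # Best (lowest) time for each runner at each of the two events.
    best = {}
    for runner, event, seconds in results:
        if event in (event_a, event_b):
            key = (runner, event)
            best[key] = min(seconds, best.get(key, seconds))

    # One personal factor per runner who has a best time at both events.
    ratios = [
        best[(runner, event_b)] / best[(runner, event_a)]
        for (runner, event) in best
        if event == event_a and (runner, event_b) in best
    ]
    if not ratios:
        return None, 0
    return mean(ratios), len(ratios)
```

A runner who has only ever run at one of the two events contributes nothing, which is why the number of data pairs is returned alongside the factor.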
But... The main problem with this method is that it doesn't adequately handle the concept of 'form'. A runner's best time at event A might have been set six years ago, when the runner was on absolutely tip-top form, but the same runner's best at event B may have been set only recently, when coming back from injury. In other words we're not necessarily comparing like with like - ideally we should only compare times run when the runner is on the same form.
New and improved?
To overcome the problem mentioned above I decided to only create data pairs for runners who run at the two different events on consecutive weeks - the assumption being that it is much more likely that a runner is on the same form on consecutive weeks, and so we would be more likely to be comparing like with like. This has the downside that it eliminates many potential data pairs from our analysis because they happened at completely different times. But on the other hand it adds (a few) compensating extra data pairs - any particular runner could theoretically alternate between event A and event B on a week by week basis contributing one additional data pair for each run into the analysis (previously they only contributed the one data pair based on their best performance).
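A rough sketch of the consecutive-weeks pairing might look like the following. Again this is illustrative rather than the actual code: the function name and the (runner, event, date, seconds) layout are invented, "consecutive weeks" is assumed to mean runs exactly seven days apart, and a runner is assumed to have at most one run per date.

```python
from datetime import timedelta
from statistics import mean


def consecutive_week_factor(results, event_a, event_b):
    """Estimate the A-to-B factor from runs at the two events one week apart.

    results: iterable of (runner_id, event, run_date, seconds) tuples
    (hypothetical layout, run_date being a datetime.date).
    Returns (factor, number_of_data_pairs), or (None, 0) if there are no pairs.
    """
    # Index every run by runner and date so next week's run is easy to find.
    runs = {
        (runner, run_date): (event, seconds)
        for runner, event, run_date, seconds in results
    }

    ratios = []
    for (runner, run_date), (event, seconds) in runs.items():
        following = runs.get((runner, run_date + timedelta(days=7)))
        if following is None:
            continue
        next_event, next_seconds = following
        # Always divide the B time by the A time, whichever came first.
        if (event, next_event) == (event_a, event_b):
            ratios.append(next_seconds / seconds)
        elif (event, next_event) == (event_b, event_a):
            ratios.append(seconds / next_seconds)

    if not ratios:
        return None, 0
    return mean(ratios), len(ratios)
```

Note that a runner who goes A, B, A on three successive weeks contributes two data pairs here, whereas two runs three weeks apart contribute none.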
Show me the data! Sorry, it's a bit big so I've prepared a PDF available here. The most obvious question will be "why are there so many gaps in the table?" The answer is that I have only included conversion factors for event pairs where I am reasonably confident that the factor at least indicates whether one run is faster than another. I have set the minimum number of data pairs for a conversion factor to be included in the table at 100 (i.e. there are at least 100 instances of a specific runner running at both events on two successive weeks). To be really confident, I would want to compare 400 data pairs, and the gold standard is 1000 data pairs. So far, Richmond Park and Bushy Park are the only two events to have enough runners attending both events on successive weeks to meet this standard.
The colours in the table show the confidence level. Dark green indicates more than a thousand data pairs and light green indicates more than 400 (i.e. the high confidence conversion factors). At the other end of the confidence spectrum, light red indicates fewer than 150 and dark red indicates fewer than 120. Plain white represents all the moderate confidence factors with between 150 and 400 data pairs.
Empty cells indicate that not enough (if any) runners have attended both events for me to be at all confident about the results. If your event is missing, that indicates that not enough of your local runners have got out and about to any other runs (or that not enough visitors have come from elsewhere) in significant numbers yet - at least not on successive weeks.
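The colour coding boils down to a simple lookup on the number of data pairs, sketched below. The function name is made up, and the treatment of the exact boundary values (whether 400 counts as white or light green, for instance) is my assumption, since the description above leaves it open.

```python
def confidence_band(n_pairs):
    """Return the table colour for a conversion factor based on n_pairs data pairs.

    Returns None when there are too few pairs for the factor to be shown at all.
    Boundary handling at exactly 400 and exactly 1000 is an assumption.
    """
    if n_pairs < 100:
        return None           # below the inclusion threshold: empty cell
    if n_pairs < 120:
        return "dark red"     # lowest confidence that still gets shown
    if n_pairs < 150:
        return "light red"
    if n_pairs <= 400:
        return "white"        # moderate confidence
    if n_pairs <= 1000:
        return "light green"  # high confidence
    return "dark green"       # gold standard
```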
How to use the table - At a basic level you can tell whether one event is faster than another. The percentages indicate how much faster or slower the event at the top (FROM) is compared with the event at the left (TO). If the value is 100% then the two events are exactly the same speed. If the value is greater than 100% then the FROM event is faster than the TO event, and if less than 100% it's slower.
Let's take an example - say you've just run at Strathclyde parkrun in 25:37 and want to know what the equivalent performance would be at Glasgow parkrun. It's easier if we convert the time to seconds: 25:37 is 1537 seconds. Next look up Strathclyde along the top of the table under FROM and find Glasgow among the rows labelled TO (we're converting from Strathclyde to Glasgow). Where the Strathclyde column and Glasgow row meet is the figure 102.8%. We multiply the 1537 by this conversion factor to give 1580 seconds or 26:20. In other words, on the same form you would expect to take an extra 43 seconds to run Glasgow parkrun (faster runners would have fewer seconds difference, slower runners more - you need to run the calculation for each run time). The pale green of the box means that at least 400 data pairs contributed to the calculation of the conversion factor and so we can be pretty confident about the conversion.
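If you'd rather not do the sums by hand, the same calculation can be sketched in a few lines. The function names are made up for illustration, and rounding to the nearest whole second is my assumption. Because the percentages in the table are themselves rounded, a result may occasionally be a second out compared with one worked out from an unrounded factor.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string such as '25:37' into whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def to_mmss(total_seconds):
    """Convert a number of seconds into 'minutes:seconds', rounded to the nearest second."""
    minutes, seconds = divmod(round(total_seconds), 60)
    return f"{minutes}:{seconds:02d}"


def convert_time(mmss, factor):
    """Multiply a run time at the FROM event by the FROM-to-TO conversion factor."""
    return to_mmss(to_seconds(mmss) * factor)


# The Strathclyde to Glasgow example: 102.8% is written as 1.028.
print(convert_time("25:37", 1.028))  # 26:20
```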
We could also use the factor to compare records at different events. The Strathclyde record currently stands at 14:51 whereas the Glasgow record stands at 15:01. However, after applying the conversion factor we see that the Strathclyde record would be equivalent to 15:15 at Glasgow. Or, doing the opposite calculation (from Glasgow to Strathclyde the conversion factor is 97.3%), we can see that the Glasgow record would be equivalent to 14:36 at Strathclyde.
Unleash the beautiful fish
And finally, I know how much Graphic of the Week readers look forward to a proper graph, and this one's a corker even if I do say so myself.

Scatter plot of all calculated conversion factors against the number of data pairs that contributed to the calculation of each conversion factor. The data set contains both conversion factors for each event pair (i.e. A to B and B to A). The Y axis is logarithmic (base 2). The two points at the extreme right represent the Bushy to Richmond / Richmond to Bushy high confidence conversion factors.
I'll leave you to work out what this graph is saying, but hopefully you'll see why I set the threshold for inclusion in the table at 100 data pairs. And finally, if you are not already familiar with Ben Goldacre's work, may I recommend his column in The Guardian every Saturday. This week's column is pertinent to this analysis - even though it's on a very different topic (see The Bad Science website).