IV. Sochi Ladies Figure Skating: a Statistical Analysis

Tiziano Virgili 

Dipartimento di Fisica and INFN 

Chapter 4

The “Jury Resolution Power”


4.1. The Resolution Power


   Let me now come back to the general problem of “objectivity” in Figure Skating (and, more generally, in competitions judged by a jury).

   Figure Skating experts usually “trust” the judges’ evaluations, whereas other people are more skeptical about their objectivity. Can we approach the problem from a scientific point of view? Let’s consider a piano contest. Let’s suppose that the first performer gets scores between eight and nine. Then a second performer gets scores from six to seven. It is reasonable to say that the first performer is much better than the second one. But what if the second performer gets scores from 7.5 to 8.5? Can we still say for sure that the first performer is better? What matters here is the “resolution power” of the jury, that is, the minimal score difference that it is able to separate. Just like any other instrument, a jury has an “intrinsic resolution”, which sets the limit of the measurement: differences below this limit cannot be distinguished. The resolution power of the jury can differ from sport to sport, and from competition to competition. In the following I will try to evaluate the average resolution in the case of Figure Skating (ladies’ events).

   In order to evaluate this “resolution capability” of the Jury, let’s consider the distribution of the scores given by the Judges. Each Judge is equivalent to a stopwatch operator, so we take the Mean as the best measurement, and the RMS as an estimate of the distribution width. As an example, figure 4.1 shows the distributions of the scores for Carolina Kostner at the last Olympic Games, for both Technical Elements and Components (sum of SP and FP). As usual, each Judge provides one “entry” in the distributions.
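The Mean and RMS of one skater's panel can be computed as in the minimal sketch below. It assumes that “RMS” means the standard deviation of the scores around their Mean (the convention of the histogram tools commonly used in physics). The function name `mean_and_rms` and the nine scores are hypothetical, not actual competition data.

```python
import statistics

def mean_and_rms(scores):
    """Return (mean, rms) for one skater's panel of judge scores.

    Assumption: "RMS" is the population standard deviation around the mean.
    """
    mean = statistics.fmean(scores)
    rms = statistics.pstdev(scores, mu=mean)
    return mean, rms

# Hypothetical scores from a nine-judge panel (one entry per judge).
panel = [70.0, 72.5, 71.0, 74.0, 69.5, 73.0, 71.5, 72.0, 70.5]
mean, rms = mean_and_rms(panel)
print(f"Mean = {mean:.2f}, RMS = {rms:.2f}")
```

Each judge contributes one entry, exactly as in the histograms described above.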

   (Note that, with respect to the plots on pages 19 and 24, the Technical Elements here include the base values and are summed over the Short and Free Programs.)

The average (Mean) and the RMS are reported on the same figures.



   As expected, the RMS for the Components (the “artistic” score) is larger than that for the Technical score (3.6 compared to 1.6). In this case the error on the total score (sum of Technical + Components) is dominated by the error on the Components (*). However, if we select another skater the distributions will look different, with a different Mean and RMS. A more representative quantity is the average of the RMS over the full sample of skaters. The distributions of the RMS for all the skaters (2014 Olympic Games, Ladies Single) are reported in the figure below, for Technical Elements (left) and Components (right). A “Gaussian fit” (red line) is also performed.
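For the Kostner example (RMS of 1.6 for Technical Elements and 3.6 for Components), the statement that the Components dominate the total error can be checked with the sum in quadrature mentioned in footnote (*). The worked arithmetic below assumes that the two scores are independent; the σ symbols are notation introduced here for illustration.

```latex
\sigma_{\text{Total}}
  = \sqrt{\sigma_{\text{Tech}}^{2} + \sigma_{\text{Comp}}^{2}}
  = \sqrt{1.6^{2} + 3.6^{2}}
  = \sqrt{15.52}
  \approx 3.9
```

Under this assumption the combined value (about 3.9) is only slightly larger than the Components RMS alone (3.6).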



   Again, the RMS values for the Components are larger than those for the Technical Elements (2.6 compared to 1.2) (†).

   The sum of Technical Elements and Components produces the Total Score. The RMS on the Total Score can be obtained in the same way, so we get the final result:

   This corresponds to an error on the Mean of about 1.3 points (‡).

   In other words, in Figure Skating the resolution of the Jury is about 1.3 points. The standard “minimal” resolution is just this value, i.e. the Jury is not able to “resolve” skaters within 1.3 points. Therefore skaters whose total scores differ by less than 1.3 points should be considered equivalent. Note that a safer resolution can be set at twice this value (2.6 points).
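The step from a total-score RMS to a resolution, and the resulting “equivalence” rule, can be sketched as follows. The sketch assumes the usual standard-error formula RMS/√N with N = 9 judges. The names `error_on_mean` and `is_resolved`, the total RMS of 3.9, and the example scores are hypothetical and chosen only for illustration.

```python
import math

N_JUDGES = 9  # number of judges quoted in the text

def error_on_mean(rms, n_judges=N_JUDGES):
    """Standard error on the mean of n_judges independent scores."""
    return rms / math.sqrt(n_judges)

def is_resolved(score_a, score_b, resolution=1.3, n_sigma=1):
    """True when two totals differ by more than n_sigma * resolution."""
    return abs(score_a - score_b) > n_sigma * resolution

# Hypothetical total-score RMS, picked only so the result lands near 1.3.
assumed_total_rms = 3.9
print(round(error_on_mean(assumed_total_rms), 2))

print(is_resolved(200.0, 201.0))             # totals 1.0 point apart
print(is_resolved(200.0, 203.0, n_sigma=2))  # totals 3.0 points apart, 2.6 limit
```

With these made-up totals, the first pair falls within the 1.3-point resolution and would count as equivalent, while the second pair is separated even at the stricter 2.6-point limit.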

   Note also that in principle the resolution depends on the score itself. For simplicity I’m considering here the average only.

   We can now ask whether this resolution is good enough to produce a ranking that makes sense. Let’s have a look at the distribution of the score differences between neighbouring skaters in the final ranking (i.e. the differences first - second; second - third; third - fourth and so on). I have considered here the following international competitions:

   WC2014 ― OWG2014 ― WC2013 ― WC2012 ― OWG2010.

   The result is illustrated in figure 4.3. The distribution can be described approximately by a decreasing exponential function. This means that most of the differences are concentrated on the left side of the figure. As explained before, the Jury is not “able” to distinguish between scores that differ by less than the minimal resolution value of 1.3. The value of 1.3 is indicated in the figure by a red line. (The blue line corresponds to the value of 2.6, i.e. a “2-sigma” resolution.)



   The fractions of values larger than these two limits are 69% and 52% of the total, respectively. So, in about 70% of the cases the final classification can be considered “objective”, whereas in about 30% of the cases the ordering just comes out of “statistical fluctuations”. The situation improves, however, if we consider the top places in the final classification. If we restrict the score differences to the top four places, the distribution extends further to the right. Indeed, larger scores correspond on average to larger differences, so with this selection we have more entries at larger values in the distribution. If we now consider the previous limits again, the fraction of entries larger than the minimal limit is 90% (80% for the 2.6 limit).
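The counting behind these fractions can be sketched as follows: sort the total scores, take the differences between neighbouring places, and count how many exceed a given limit. The function names and the five scores are hypothetical, not the competition data used for the figures.

```python
def adjacent_differences(scores):
    """Differences between neighbouring places (1st - 2nd, 2nd - 3rd, ...)."""
    ranked = sorted(scores, reverse=True)
    return [higher - lower for higher, lower in zip(ranked, ranked[1:])]

def fraction_above(differences, limit):
    """Fraction of differences strictly larger than the given limit."""
    if not differences:
        raise ValueError("need at least two scores to form a difference")
    return sum(d > limit for d in differences) / len(differences)

# Hypothetical total scores for five skaters.
totals = [210.0, 205.5, 204.9, 201.0, 200.2]
diffs = adjacent_differences(totals)
for limit in (1.3, 2.6):  # 1-sigma and 2-sigma resolution from the text
    print(limit, fraction_above(diffs, limit))
```

Restricting `totals` to the top places before taking the differences would reproduce the “top four” selection described above.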

   So, in about 90% of the cases the medal standings can be considered “objective”. This is true in all the cases where the score differences are larger than the previous limit of 1.3 units. Note that all these considerations hold only if no single-skater biases are present!


4.2. Intrinsic Fluctuations


   In Figure Skating intrinsic fluctuations (i.e. fluctuations from performance to performance) have a strong influence on the final result, as a single mistake can produce a fault. What about fluctuations in other sports? Let’s look again at the men’s 100 m run. Fluctuations in the performances of the athletes are less evident, but they are not negligible.

   In order to estimate these fluctuations, I have considered the best recent performances of the fastest runners[4]. The “RMS” turns out to be correlated with the time averages. If we select the smaller times we get an average error of about



   In other words, time differences lower than about 0.04 seconds can be considered just fluctuations in the performances (¶). Is this a rare situation in this sport? Let’s have a look again at the distribution of such differences. They are reported in the following figure:



   In this figure the distribution of the time differences (in seconds) between nearest runners is shown (first ― second; second ― third; third ― fourth and so on). I have considered here all the main International Races (OWG and WC), from 2004 to 2013. As in the case of Figure Skating, the distribution can be described approximately by a decreasing exponential function (red line). The “1-sigma” and “2-sigma” resolutions are also reported as straight lines (red and blue).

   The fractions of cases larger than these two limits are now 66% and 48% respectively. However, if we consider the top four places only, the fractions are about 55% and 39% (◇). In other words, intrinsic fluctuations for top places are relevant in about 45% of the cases!
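The “approximately exponential” description can be cross-checked against the quoted fractions. For a pure exponential distribution the fraction of differences above a limit x is exp(-x/τ), so the fraction above 2x is the square of the fraction above x. The sketch below applies this to the full-sample fractions quoted in the text. The scale τ, the function names, and the choice of 0.08 s (twice 0.04 s) as the 2-sigma limit for the 100 m are assumptions made here for illustration.

```python
import math

def exponential_scale(limit, fraction_above_limit):
    """Scale tau such that exp(-limit / tau) equals fraction_above_limit."""
    if not 0.0 < fraction_above_limit < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    return -limit / math.log(fraction_above_limit)

def tail_fraction(limit, scale):
    """Fraction of an exponential distribution lying above the limit."""
    return math.exp(-limit / scale)

# (1-sigma limit, quoted fraction above 1 sigma, quoted fraction above 2 sigma)
cases = {
    "figure skating (points)": (1.3, 0.69, 0.52),
    "100 m (seconds)": (0.04, 0.66, 0.48),
}
for name, (one_sigma, quoted_1, quoted_2) in cases.items():
    tau = exponential_scale(one_sigma, quoted_1)
    predicted_2 = tail_fraction(2 * one_sigma, tau)
    print(f"{name}: tau = {tau:.3g}, "
          f"predicted 2-sigma fraction = {predicted_2:.2f}, quoted = {quoted_2:.2f}")
```

In both sports the pure-exponential prediction for the 2-sigma fraction comes out a few percentage points below the quoted value, which fits the description of the exponential as an approximate one.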

It is clear that in Track and Field no “single-runner biases” are present, and in this sense the results are much more “objective” than in Figure Skating. However, as everybody knows, biases are present in most competitions. Moreover, the possibility of cheating is unfortunately present in any sport!








* For independent measurements the RMS values can be added in quadrature, i.e. their squares are summed.

† Note that the distribution of the RMS for Technical Elements is different from the same distribution on page 19: in that case only the GOE of the Free Program were considered, whereas now we have the full score summed over FP+SP.

‡ Assuming for the error the following formula: $\sigma_{\text{Mean}} = \text{RMS}/\sqrt{N}$, where N = number of Judges = 9. Actually the procedure is slightly more complicated, as the score is obtained by excluding the smallest and the largest values (trimmed average). This reduces the RMS, but then the effective number of Judges N becomes 7, so the final result is essentially the same.

¶ This is true if the performances of the runners are indeed uncorrelated. As the runners race side by side, they influence each other. A residual correlation can therefore be present, and in this case the above statement is no longer completely true.

◇ The first places correspond to shorter times, which correspond to smaller differences. So, the first places are now concentrated on the left side of the distribution.