II. Sochi Ladies Figure Skating: a Statistical Analysis

Tiziano Virgili 
Dipartimento di Fisica and INFN 
Universita di Salerno


Chapter 2

Analysis of Sochi's Results

   In Figure Skating the athletes are called to show their ability in two performances of different length: the “Short Program” (SP, lasting 2 minutes and 30 seconds) and the “Free Program” (FP, 4 minutes for senior ladies). There are 7 technical elements (jumps, spins, step sequences,…) in the SP, and 12 in the FP. In both cases the Jury assigns a technical score (Technical Elements, TE) and an artistic score (Program Components, PC). The two are “calibrated” in such a way that they have approximately the same weight in the total score.

   Moreover, the FP score is about twice the SP score. In summary, the total score can be roughly divided as follows: 1/6 SP (TE), 1/6 SP (PC), 1/3 FP (TE), 1/3 FP (PC).

   Most of the discussions on the Sochi Ladies Figure Skating were focused on the TE of the Free Program, which in fact constitutes only about 1/3 of the total score! In the following I’ll present an analysis of the scores performed with simple, standard statistical methods.

2.1. Technical Score

   Let’s briefly recall how the scores are determined, starting from the technical score [1]. Each skater has to present a fixed number of “elements” (jumps, spins,…). A “technical panel” evaluates the correctness of the elements (i.e. checks for errors like under-rotation, wrong edge,…), and fixes the “start value” for each of them, according to well defined rules. At the end, each skater gets a “base value” for each element. Additional points come from the “GOE” (Grade Of Execution). They are evaluated in this way: a team of 9 judges gives a score to each element, from -3 to +3, according to the “cleanness”, the “beauty”, etc. of the execution. Then a trimmed mean is applied (discarding the highest and lowest value). The result is further converted into a value by using the “SOV” (Scale Of Value) and then added to the base score.
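
   The trimmed mean described above can be sketched in a few lines of Python. This is only an illustration: the marks below are hypothetical (not real Sochi data), the function name is mine, and the final conversion through the SOV table is left out because it depends on the official tables.

```python
def trimmed_mean(scores):
    """Mean after discarding one highest and one lowest score."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to trim both ends")
    ordered = sorted(scores)
    kept = ordered[1:-1]              # drop the single lowest and single highest mark
    return sum(kept) / len(kept)

# Hypothetical GOE marks (integers from -3 to +3) given by 9 judges to one element.
goe_marks = [1, 2, 1, 0, 1, 2, 3, 1, 1]
panel_goe = trimmed_mean(goe_marks)   # average of the 7 remaining marks
print(round(panel_goe, 2))
```

   Only one mark is removed at each end, so a single very high or very low judge cannot move the result, while two or more judges pushing in the same direction still can.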

   As an example, figure 2.1 shows the judges’ panel for the Japanese skater Mao Asada [2]: as expected, there are some fluctuations from one judge to another in the evaluation of the elements.

   In the following table the Base Values (BV) are reported for the first 12 Skaters, for both Short and Free Programs. The “Base Value” ranks and the total TE Scores are also reported.

   It can be seen that the final ranking of the competition is very different from the ranking that we get from the Base Values. This is not a big surprise, as the Base Values correspond just to a starting point, which doesn’t include the “quality” of the execution. For example, according to Base Values the first skater should be Grace Gold, however she was only fourth in the final ranking. Instead, what looks more surprising is the almost perfect correspondence between the total Technical Element Score (TES) and the final ranking. Let me recall that the TES amounts to about one half of the total. However, from the previous table it looks as if the “Artistic Score” is completely irrelevant. The “Artistic Score” (Program Components) will be discussed in par. 2.2. The previous table shows the relevance of the “GOE”, as they are responsible for the differences between the Base Values and the final TES. Most of the score however comes from the “Base Values”, so a bias in the BV can be very dangerous for the final result.

   REMARK: In the following, the official base values will be considered, so the analysis will be performed on the GOE only.

   According to several commentators, some elements performed by Sotnikova were over-evaluated by the Technical Panel, and some elements performed by Kim were under-evaluated. I will not enter into this discussion; I just assume that the Technical Panel did a good job. In Appendix 1 several links to analyses by technicians and experts are reported, as they were found on the web. I want to remark here again that most of the discussions were focused on the Technical Panel’s choices.

2.1.1 The Free Program

   Let’s now come back to the analysis of the GOE, starting with the Free Program. In order to check for the possibility of a bias (i.e. a systematic over- or under-evaluation), it is possible to sum the scores (GOE) for each judge.

   For example, here are the sums of the GOE for the Free Program of the skater Grace Gold (USA, 4th place) [2]: 16, 20, 14, 15, 15, 13, 17, 14, 14. There are of course 9 numbers, one for each Judge. We can now put these 9 numbers in a histogram, as in figure 2.2. In the same figure the Mean (15.33) and the RMS (2.0) are also reported.
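
   The two numbers quoted for this example can be checked with a short Python sketch. I assume here that “RMS” means the root-mean-square deviation from the Mean, dividing by the number of judges n; this assumption reproduces the quoted value of 2.0, whereas dividing by n - 1 would give about 2.12.

```python
import math

def mean_and_rms(values):
    """Return the mean and the RMS deviation about the mean (dividing by n)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n   # n, not n - 1 (assumption)
    return mean, math.sqrt(variance)

# Per-judge GOE sums quoted in the text for Gold's Free Program.
gold_goe_sums = [16, 20, 14, 15, 15, 13, 17, 14, 14]
gold_mean, gold_rms = mean_and_rms(gold_goe_sums)
print(f"Mean = {gold_mean:.2f}, RMS = {gold_rms:.1f}")   # Mean = 15.33, RMS = 2.0
```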

   Let’s now have a look at the same distributions for the first six skaters (we are again considering the Free Program). The results are reported in figure 2.3. As usual, the RMS indicates the uniformity of the scores, i.e. the uniformity of the judges’ evaluation: a larger RMS corresponds to a less uniform jury. It can be seen that all the RMS are more or less comparable, with clear exceptions: Sotnikova and (maybe) Lipnitskaya. The two RMS indicated by circles are indeed the largest over the full sample of 24 skaters. Can this be just a statistical fluctuation?

   In order to quantify the observed RMS enhancement, let’s extend the same analysis to the main international competitions(*), and look at the “RMS” of the “GOE” distributions. All the values of the RMS extracted from these distributions can then be put into another histogram.

   The distribution of the “RMS” for the first 6 skaters in the main international competitions is reported in figure 2.4. In the same figure the Mean and the RMS of the distribution are also reported. (This is a distribution of “RMS”, so it has its own RMS!).

   Remark: the “RMS” of this distribution will be quoted as RMS, to distinguish it from the “RMS” of the single skaters.

   Figure 2.4 shows clearly that the distance of the last value (Sotnikova) from the Mean is about 3 RMS. This is shown graphically here:

   Numerically we have: Mean + 3 RMS = 3.058 + 3 x 0.748 = 5.302, very close to Sotnikova’s value 5.287. From a statistical point of view, we can conclude that the probability that this is a “normal” value is very low (†).
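
   As a rough cross-check of these numbers, the distance from the Mean in units of RMS and the corresponding Normal tail probability can be computed as below. The sketch assumes that the RMS values follow a Normal distribution (as in the footnote) and uses a one-sided tail, which is my choice; the function names are only illustrative.

```python
import math

def distance_in_rms(value, mean, rms):
    """Distance of a value from the mean, in units of the RMS."""
    return (value - mean) / rms

def normal_upper_tail(z):
    """P(Z > z) for a standard Normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Numbers quoted in the text for the distribution of "RMS" (figure 2.4).
mean_of_rms = 3.058
rms_of_rms = 0.748
sotnikova_rms = 5.287

z = distance_in_rms(sotnikova_rms, mean_of_rms, rms_of_rms)
p = normal_upper_tail(z)
print(f"distance = {z:.2f} RMS, one-sided tail probability = {p:.4f}")
```

   Under these assumptions the distance comes out just below 3 RMS and the tail probability is of the order of one per thousand.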

   It is important to stress that this large value doesn’t depend on the skater’s skill! It depends essentially on the uniformity of the judges’ scores. As a limiting case, if all the judges gave the same score (no matter how large), the RMS would be exactly zero. So, this anomalous value indicates a large disagreement among the judges. Unfortunately this large RMS prevents us from using the method described in the first chapter to eliminate the biased values. However, we can use another method, which will be described in the next section.

2.1.2 A new statistical test

   Let’s look at things in another way. Let’s consider for each technical element of a skater the average score (including all the 9 judges), and let’s count for each judge the number N of times that he gives a score larger than the average. Again, let’s make an example by using Asada’s Technical Panel in Figure 2.1. First, we evaluate the average of the first element (the first row of GOE): this number is 0.44. In this case there are 4 judges above the average, and 5 below the average. (This is more or less what one can expect in a normal case.) Next, we consider the second element: the average now is 0. In this case there are 3 judges above the average, 3 judges below the average, and 3 judges on the average. Again, this is what one expects in a normal situation. Let’s repeat the exercise for all the elements. Now let’s consider the first judge, and let’s check how many times he is above or under the average. In our example the first judge is 3 times above the average, 1 time on the average and 8 times under the average. The last judge, on the other hand, is 5 times above the average, 1 time on the average and 6 times under the average. So we have N = 3 for the first judge, and N = 5 for the last judge. If we complete the exercise for all the judges, we have 9 numbers, and we can put them in a histogram. This is the distribution for this single skater. However, we can also put in the same histogram all the numbers N from all the other skaters. In this way we can build a big histogram with 9 x 24 = 216 entries.
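
   The counting procedure can be sketched in Python as follows. The marks in the toy panel are hypothetical (they are not Asada’s real sheet); the first two rows are only built to reproduce the two situations described above, and the function name is mine.

```python
def count_above_average(goe_table):
    """goe_table[e][j] is the GOE mark of judge j on element e.

    Returns, for each judge, the number N of elements on which the mark
    is strictly above the average over all judges.
    """
    n_judges = len(goe_table[0])
    counts = [0] * n_judges
    for row in goe_table:
        avg = sum(row) / len(row)      # plain average over all judges, no trimming
        for j, mark in enumerate(row):
            if mark > avg:             # "larger than", so marks on the average do not count
                counts[j] += 1
    return counts

# Hypothetical marks: 3 elements x 9 judges.
toy_panel = [
    [1, 0, 1, 0, 0, 1, 0, 1, 0],      # average 0.44: 4 judges above, 5 below
    [0, 1, -1, 0, 1, -1, 0, 1, -1],   # average 0: 3 above, 3 below, 3 on the average
    [2, 1, 1, 1, 1, 1, 1, 1, 0],      # average 1: only the first judge is above
]
toy_counts = count_above_average(toy_panel)
print(toy_counts)   # [2, 1, 1, 0, 1, 1, 0, 2, 0]
```

   Applying the same function to every skater and concatenating the resulting lists gives the entries of the big histogram.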

   In the Free Program there are 12 elements, so the numbers N range from 0 to 12. As we have seen, in case of unbiased scores each judge will give a score sometimes larger than the average, sometimes lower and sometimes (less probably) on the average. So we expect that the distribution of N will usually have a peak more or less in the middle of the range 0 to 12. (Actually it should be slightly lower than 6, due to the condition “larger than the average”, and not “larger or equal”.) Now, let’s consider ALL the skaters and ALL the judges, and let’s have a look at the distribution of N.

   The total distribution has the Mean at (about) 5.2, and an RMS of 2.4. A drastic drop of the entries is shown for N>6 and N<3. However, a non-negligible number of entries at large values of N is also present. Is this number compatible with statistical fluctuations? Let me remark again that a “large” N indicates that some Judges have given scores above the average most of the time. Can we say something more precise about it? In fact, it is theoretically possible to determine this distribution from statistics and probability laws. In other words, it is possible to evaluate a priori the probability that a judge will give scores above the average a specific number N of times. However, it is not necessary to be an expert in probability laws to understand that if a judge is always above the average, he is by definition biased!
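
   One simple way to get a feeling for such an a priori probability is a binomial model. The sketch below is only a rough illustration under assumptions of mine: each of the 12 elements is treated as an independent trial, and the probability of being above the average on one element is fixed at 5.2/12 so that the model reproduces the observed Mean. It is not the calculation of Appendix 2.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_at_least(k_min, n, p):
    """Probability of k_min or more successes in n independent trials."""
    return sum(binomial_pmf(k, n, p) for k in range(k_min, n + 1))

n_elements = 12
p_above = 5.2 / n_elements   # assumption: tuned to the observed Mean of 5.2

for k_min in (8, 11):
    print(f"P(N >= {k_min}) = {prob_at_least(k_min, n_elements, p_above):.5f}")
```

   In this simplified model, a judge being above the average 11 or more times out of 12 is expected less than once per thousand cases, to be compared with the 216 entries of the full histogram.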

   Let’s now select the entries corresponding to the contested first skater (Adelina Sotnikova). The distribution is reported in red, in figure 2.7. As can be seen, a couple of judges were above the average 8 times out of 12, and a couple of judges 11 times out of 12! This is clearly beyond any statistical fluctuation: it is a bias by definition. We are not claiming here that Sotnikova was the only skater to receive biased scores, but she clearly had the largest bias.

   Note that the remaining judges are (by construction) shifted towards the left, at low values of N. In principle it is also possible to interpret this result in the opposite way: a group of judges is under-evaluating the skater, and so the remaining judges are “pushed” by construction towards the right, at large values of N.

   For comparison, we report in figure 2.8 (in green) the distribution of N for the second skater (Yuna Kim). In this case the RMS is also much lower (2.6 against 4), and close to the RMS of the total distribution (2.4). Further details on this method can be found in Appendix 2.

2.1.3 The Short Program

   We can now repeat the previous exercise for the Short Program, starting from the distributions of the GOE for the first 6 skaters:

   In this case the distributions look more homogeneous, and so do the RMS. The average values are smaller compared to the FP, as now there are only 7 elements (instead of 12). The same holds for the RMS(‡).

   Figure 2.10 shows the distribution of N (same definition as in par. 2.1.2). In this case the number of elements is 7, so the numbers N are in the range 0 to 7.

   In this case however, the smaller number of elements makes it more difficult to point out systematic bias in the scores. There are 3 judges who are above the average 7 times out of 7, however they are not associated with the same skater. From this we can conclude that a small amount of bias is also present in the SP, but it is not possible to clarify the situation further.

2.2. Program Components

   The “Program Components” (PC) refer to the “artistic impression” of the performance, and are evaluated by considering several aspects, including choreography, interpretation and so on. Each judge gives a score from 0 to 10 to each element (10 corresponds to a sort of “perfection”), then the usual trimmed average is performed and the results are summed for all the elements (5 in total). The final score is obtained by applying a further scale factor, 0.8 for the Short Program and 1.6 for the Free Program. These scale factors are needed in order to “balance” the technical and artistic scores.
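
   The whole chain (trimmed average for each of the 5 elements, sum, scale factor) can be sketched as below. The marks are hypothetical and the function names are mine; only the two scale factors come from the text.

```python
FACTORS = {"SP": 0.8, "FP": 1.6}   # scale factors quoted in the text

def trimmed_mean(marks):
    """Mean after discarding one highest and one lowest mark."""
    kept = sorted(marks)[1:-1]
    return sum(kept) / len(kept)

def program_components_score(marks_by_component, program):
    """Sum of the trimmed means of all components, times the program factor."""
    total = sum(trimmed_mean(marks) for marks in marks_by_component)
    return FACTORS[program] * total

# Hypothetical marks: 5 components x 9 judges, each mark between 0 and 10.
toy_marks = [
    [8.0, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5, 9.0],
    [7.5, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.75],
    [8.25, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.5],
    [8.0, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5, 9.0],
    [8.25, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5, 8.75],
]
toy_fp_score = program_components_score(toy_marks, "FP")
toy_sp_score = program_components_score(toy_marks, "SP")
print(round(toy_fp_score, 2), round(toy_sp_score, 2))
```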

   At first it might seem that this score is the most subjective. In fact, the evaluation of the elements is not more subjective than the “GOE”, discussed before. I will come back to this matter in the next chapter. Here, I want to point out an important difference. Whereas the technical performances can vary a lot from one event to another (due to random errors, faults,…), the “components” are in principle much more stable.

   For example, an outstanding pianist can make a mistake by chance, but with no doubt the interpretation will remain outstanding. Moreover, some elements (like the choreography) don’t depend strongly on the mistakes that can be made by the skater.

   Let’s proceed in the same way as for the Technical Elements, by summing the scores for each judge. Starting with the Free Program, with reference to the Panel on page 16 (Asada FP scores), and looking at the PC, we get the following numbers (no factor applied):

   42.75, 44.25, 43.0, 44.25, 47.0, 41.0, 41.25, 44.25, 44.75.

   The average is 43.6, and the RMS is 1.75. To get the final result a further scale factor of 1.6 should be applied, producing an average of 69.8 and an RMS of 2.8. Again, these 9 numbers can be inserted into a histogram, and this exercise can be repeated for all the skaters.
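
   These four numbers can be reproduced with the short sketch below, which uses the nine sums listed above. As before I assume that the RMS is computed about the Mean, dividing by the number of judges. Since the scale factor multiplies every entry, it multiplies the Mean and the RMS by the same amount.

```python
import math

# Per-judge PC sums quoted in the text for Asada's Free Program (no factor applied).
asada_pc_sums = [42.75, 44.25, 43.0, 44.25, 47.0, 41.0, 41.25, 44.25, 44.75]
FP_FACTOR = 1.6   # Free Program scale factor quoted in the text

n = len(asada_pc_sums)
mean = sum(asada_pc_sums) / n
rms = math.sqrt(sum((x - mean) ** 2 for x in asada_pc_sums) / n)   # divide by n (assumption)

scaled_mean = FP_FACTOR * mean   # a common factor rescales the mean...
scaled_rms = FP_FACTOR * rms     # ...and the spread by the same amount
print(f"{mean:.1f} {rms:.2f} {scaled_mean:.1f} {scaled_rms:.1f}")   # 43.6 1.75 69.8 2.8
```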

   The distributions for the first 6 skaters (FP) are presented in figure 2.11. Here the situation looks even stranger than for the TE scores. The RMS look more compatible, however there are large differences among the judges, and (perhaps) “anomalous peaks”, indicated by the red arrows. The shape of those distributions indeed looks really strange. However the number of entries (9) is too low to perform a more quantitative analysis. Also, the procedure described in 2.1.2 will not help because of the low number of elements (5). So, we have to find another strategy for the analysis.

   In order to have a better understanding, we can use this important aspect of Components: they have to be almost “stable” from event to event (as I said, a good pianist can make a technical error by chance, but still he will be recognized as a good pianist).

   So, let’s try to compare for each skater the Olympic Program Components scores with an average over the most recent international events. For this purpose I considered all the events within 1 year, including 2013 World Championship, Grand Prix, etc.(§). For example, I evaluate for Asada a PC average of 33.66 (Short Program). The PC in Sochi was 33.88, so the difference is 0.22. These differences (one for each skater) can be used to make a histogram: this is what is shown in figure 2.12, for the Short Program.
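
   The comparison can be sketched as follows. Only Asada’s Sochi score (33.88) and her average (33.66) come from the text: the single-event values are hypothetical and chosen just to give that average, and the other two skaters are invented placeholders. Skaters without previous events are skipped, as no baseline exists for them.

```python
def pc_differences(olympic_pc, previous_pc):
    """Olympic PC minus the skater's average PC over earlier events.

    Skaters with no earlier events are left out of the result.
    """
    diffs = {}
    for skater, olympic in olympic_pc.items():
        history = previous_pc.get(skater, [])
        if not history:
            continue                      # no baseline available for this skater
        baseline = sum(history) / len(history)
        diffs[skater] = olympic - baseline
    return diffs

# Hypothetical inputs, except Asada's Sochi score and her resulting average.
olympic_sp = {"Asada": 33.88, "Skater B": 30.0, "Skater C": 28.0}
previous_sp = {
    "Asada": [33.50, 33.82],              # hypothetical events averaging to 33.66
    "Skater B": [28.0, 29.0, 27.0],
    "Skater C": [],                       # no recent international events
}
sp_diffs = pc_differences(olympic_sp, previous_sp)
print({name: round(d, 2) for name, d in sp_diffs.items()})
```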

   The Mean of this distribution is 1.6, indicating that there is an average shift of +1.6 points between Sochi and the average over the other competitions, i.e. on average each skater got 1.6 points more than “usual”.

   The same plot is reported again in figure 2.13, with the RMS of the distribution (1.65) shown as a red horizontal bar. As usual, this number gives an indication of the global width of the distribution. In the same figure the values of some skaters are also reported (they are indicated by the red arrows). The differences are usually small (1.6 points on average). A few skaters however show a huge difference. For Sotnikova this difference is 5.8, very far from the Mean. In units of RMS, this value is more than 2 RMS away from the Mean.
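
   Using only the numbers quoted so far for the Short Program, the distance from the Mean in units of RMS can be checked with the sketch below. The threshold of 2 RMS is simply the one used in the text, and the names are mine.

```python
SP_MEAN_SHIFT = 1.6   # Mean of the SP difference distribution (from the text)
SP_RMS = 1.65         # RMS of the same distribution (from the text)

def rms_distance(difference, mean=SP_MEAN_SHIFT, rms=SP_RMS):
    """Signed distance of a difference from the Mean, in units of RMS."""
    return (difference - mean) / rms

# Differences (Sochi minus average of previous events) quoted in the text.
quoted_differences = {"Asada": 0.22, "Sotnikova": 5.8}
distances = {name: rms_distance(d) for name, d in quoted_differences.items()}
outliers = [name for name, z in distances.items() if abs(z) > 2]

for name, z in distances.items():
    print(f"{name}: {z:+.2f} RMS")
print("more than 2 RMS from the Mean:", outliers)
```

   With these inputs Asada stays within 1 RMS of the Mean, while Sotnikova is at about 2.5 RMS.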

   In other words, this value is not compatible with the distribution, and it cannot be considered a statistical fluctuation.

   The situation is similar in the Free Program (the distribution is reported in figure 2.14). In this case the Mean is about 4 units, i.e. each skater gets on average 4 points more than in the previous competitions. This corresponds to a “global bias”, as described in Chapter 1. I will come back to this global bias in the next Chapter.

   If we compare the distances from the Mean by using the RMS unit, Sotnikova’s value (again the largest) is situated now at about 3 RMS from the Mean.

   In summary, the analysis of Components shows an average shift of 1.6 units (SP) and 4 units (FP). Most of the skaters show a difference within 1 RMS (i.e. the distance from the Mean is lower than 1 RMS). So, for most of the skaters, the difference between the Olympic Score and the average over other competitions is compatible with “normal” statistical fluctuations plus an overall shift. This shift clearly depends on the particular set of judges, and can be considered “normal”.

   A few skaters however show huge differences, in particular Sotnikova with differences larger than 2 RMS, in both Short and Free programs. From a statistical point of view the probability that this is just a fluctuation is almost zero. So, unless one believes in “miracles” (¶), we have to conclude that there was a strong bias in the components scores also.


* I have considered here the following competitions: WC 2013, WC 2014, Sochi Olympics and 5 ISU Grand Prix. The RMS used are in all cases those of the first 6 skaters.

† For a Normal distribution the probability is of the order of one per thousand.

‡ Within some range, there is a correlation between the RMS and the Mean.

§ This operation was not possible for some skaters because they lacked recent international competitions. In the next plots their entries will not be shown.

¶ This can be equivalent to the case of a “normal” pianist who suddenly becomes a “genius”.