This is an updated version of "Sochi Ladies Figure Skating: a statistical analysis of the results" by Dr Tiziano Virgili.
The original file (PDF format) can be downloaded here.
Chapter I Why Statistics?
1.1 Statistical Errors
1.2 Systematic Errors
1.3 Statistics and Figure Skating
Chapter II Analysis of Sochi’s Results
2.1 Technical Score
2.2 Program Components
Chapter III An Exercise: Correcting the Scores
3.1 Global Bias
3.2 Single Skater Bias
Chapter IV The “Jury Resolution Power”
4.1 The Resolution Power
4.2 Intrinsic fluctuations
Since the end of the Sochi Olympic Games, a strong debate has developed about the Ladies Figure Skating result. The main point was the huge score of the winner (the Russian Sotnikova), very close to the current world record. Many commentators called this result a scandal, including outstanding skaters like Katarina Witt and Kurt Browning, whereas others defended at least the technical score. What surprised me was that the Italian TV commentators had correctly guessed almost all the scores, with three important exceptions: the two Russian skaters and the Korean one. Was this just a coincidence? Adding to the discussion, according to media reports the “curriculum” of a couple of judges was not really appropriate to their important task. Also, the composition of the jury was a bit anomalous, since it contained a couple of Russian judges and no Koreans.
Most of the discussions on the web focus on the “Technical Score” of the Free Program, with different arguments. Sotnikova’s supporters draw attention to the number and the difficulty of the elements, whereas Kim’s supporters point to the “quality” of the performance. All these discussions, however, are misleading, as the “Technical Score” of the Free Program constitutes only about one third of the total score.
Looking at the discussions on the web, it is also clear that an increasing number of people consider Figure Skating a “subjective sport”, namely a sport where the final result is not determined by precise measurements. According to such people, it depends essentially on “personal taste”, or more precisely on the taste of the judges. This could be true more generally of any competition where a jury is involved. Do such competitions have any “objective” meaning? The question is not trivial, and involves a huge number of sporting events (such as Figure Skating, Gymnastics, Diving, Synchronized Swimming, Snowboarding, Boxing, Judo,…), artistic competitions (piano, dance, singing contests,…) and also important events like public selections (access to university, public employment, etc.). So, it is very important to provide a “scientific” answer to this question as well.
Obviously, I’m not qualified to comment on the results of the Sochi Olympic Games from a technical point of view. As a physicist, however, I’m used to performing data analysis, so I have tried to look “blindly” at the numerical scores, as if they were “experimental data”. The basis of any scientific approach to experimental data is statistical analysis, so I have performed very simple checks, based on standard statistical methods. These methods make it possible in general to quantify the amount of “objectivity” of a Jury, and (more importantly) to discover the presence of bias in the results, so they should indeed be used in every competition with a Jury.
In the first chapter of this book I’ll present a few simple concepts of statistics (formulas are not necessary!), such as the Mean and the RMS of a distribution. In the second chapter, after a short summary of the basic rules of Figure Skating scoring, I will analyze the “Technical Elements” and the “Program Components” of the Sochi Olympic Games. In chapter three I’ll show how the scores can be corrected for bias, with methods based on the previous statistical analysis. Finally, in the fourth chapter the question of the “objectivity” of Figure Skating is discussed. Some technical remarks are reported in the appendices, as well as a list of links to some discussions on the web. I hope that this work will be helpful not only for the Sochi results, but more generally as a guide to performing a scientific check of any competition with scores.
1.1. Statistical Errors
What is an “objective” result? Let’s start with a basic consideration. From a scientific point of view, an objective result is a result which can be reproduced at any time, everywhere and by everyone. A simple example is the force of gravity: it can be experienced by everyone, everywhere and at any time. In general no result can be reproduced exactly (with infinite precision). Measurements can be reproduced within some range, so it is very important to determine this range, known as the measurement error. Let’s now take an example from one of the most “objective” sports: Track and Field’s 100 meters.
It is widely assumed that the 100 meter race is among the most “objective” sports, as the times of the runners can be measured with great precision. Let’s now suppose that the runners go one after another, and that the times are measured by a manual stopwatch. So, the running time of each athlete will be measured by a human operator, and because of the limits of human reaction it will have a non-negligible error. The measured time will be larger or smaller than the “true value”, in a random way. This kind of fluctuation is known as statistical error (as we will see later, this is not the only source of error in the measurement). We can improve this measurement by adding several manual stopwatches. In this case, due to the “human indetermination”, each operator will give a slightly different result.
1.1.1. Mean and distributions
It is possible to obtain a better estimate of the “true value” by taking the average (Mean) of the different values. This is defined as the sum of all the values divided by the number of values.
In general, the larger the number of measurements (stopwatches), the lower the error that we have on the average. In other words, the statistical error can always be reduced by increasing the number N of measurements(*). So, in principle, you can reach the precision that you need just by increasing the number of operators.
In our example, as the “human” time resolution is of the order of 0.2 s, a group of about 100 stopwatches will provide a global resolution of about 0.02 s (the measured time will correspond to the average of the 100 single independent measurements). Not very practical, but still effective!
It is possible to “visualize” all the measurements by constructing a plot known as “histogram”. Let’s suppose that we have ten values:
9.8, 10.2, 10.0, 10.0, 10.1, 10.3, 10.0, 9.9, 10.1, 9.9. We can easily evaluate the average as M = 10.03. We can now put these numbers on a plot in the following way. First, we define a horizontal axis with the appropriate scale (i.e. more or less in the range of our 10 numbers):
Next, for each of the values, we put a “box” over the corresponding number (to be more precise, we should define a “bin size”, and put together all the numbers that are in the same range defined by the bin size). So, on the vertical scale we just count the number of “boxes” which have that value. For instance, we have 3 values “10.0”, so the total box height at 10.0 will be equal to 3.
Here is the final figure: this is a very simple example of a distribution.
The total number of “entries” (i.e. the number of boxes) is of course 10.
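The construction of the histogram can be sketched in a few lines of Python. This is only an illustration of the procedure described above: the bin size of 0.1 is an assumption (it gives each distinct value its own bin), and the text drawing stands in for the missing figure.

```python
from collections import Counter

# The ten example measurements from the text.
values = [9.8, 10.2, 10.0, 10.0, 10.1, 10.3, 10.0, 9.9, 10.1, 9.9]

# Assumption: a bin size of 0.1, so every distinct value falls in its own bin.
counts = Counter(round(v, 1) for v in values)

# One text row per bin, one "box" (#) per entry in that bin.
for value in sorted(counts):
    print(f"{value:5.1f} | {'#' * counts[value]}")

mean = sum(values) / len(values)
print(f"entries = {sum(counts.values())}, mean = {mean:.2f}")
```

The column at 10.0 holds three boxes, and the total number of boxes equals the number of measurements.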
This figure tells us something more than the simple average: it is possible to “see” how the values are arranged around the average. In other words, the “shape” of the distribution also contains important information. It is possible to demonstrate that if the measurements contain random errors only, the shape of the distribution will be similar to the previous figure: with a maximum in (about) the middle, and two side tails. The exact shape is called “Normal distribution” (or “Gaussian”). In this case the mean value coincides with the maximum. The “width” of the distribution is also very important. It can be quantified by another number, the “standard deviation”.
1.1.2. Standard Deviation - RMS
The Standard Deviation (σ) provides information on the width of the distribution, i.e. how far the numbers are from the average. It can be evaluated(†) by the “Root Mean Square Deviation” (in the following RMS). An example is shown in the following figure: the “RMS” is indicated by the horizontal bar. Approximately, the RMS is half the width of the distribution at half height (for a Gaussian the full width at half height is about 2.35 σ). In summary, if we repeat a number of measurements affected by random errors (as with the manual stopwatches) we get a distribution that looks like a “bell”. It should be clear now that a larger RMS means a wider distribution, i.e. larger fluctuations in the measurements.
This parameter is also related to the error on the average M: a small RMS corresponds to a small error ΔM (‡). Coming back to our numerical example, we have for this distribution: RMS = 0.14, and therefore the error on the average is ΔM = 0.045. As we have seen, this error can be further reduced by increasing the number N of measurements.
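The two numbers quoted above can be checked with a minimal Python sketch. It assumes that the RMS is computed by dividing by N (not N - 1), which is the convention that reproduces the value 0.14; the function names are illustrative only.

```python
import math

values = [9.8, 10.2, 10.0, 10.0, 10.1, 10.3, 10.0, 9.9, 10.1, 9.9]

def mean(xs):
    """Arithmetic mean M of a list of measurements."""
    return sum(xs) / len(xs)

def rms(xs):
    """Root mean square deviation from the mean (assumption: divide by N)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def error_on_mean(xs):
    """Error on the mean: the RMS divided by the square root of N."""
    return rms(xs) / math.sqrt(len(xs))

print(f"M = {mean(values):.2f}")
print(f"RMS = {rms(values):.2f}")
print(f"dM = {error_on_mean(values):.3f}")
```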
1.2. Systematic Errors
In addition to the “statistical errors” there is another type of error, known as “systematic error”. A global systematic error is a bias common to all the measurements. As an example, suppose we measure the weight of different objects with a balance. Can we really be sure that the observed values correspond to the “true” weights? That is hard to say, unless an independent measurement (another “good” balance) is available. Another example is the measurement of a temperature with a thermometer. If the balance or the thermometer is not well calibrated, all the measurements will be shifted by the same amount, and we will observe a global bias.
Generally speaking, systematic errors are rather difficult to treat. As long as you don’t have an “external reference”, it is not possible to apply the “right” correction to all the measurements. If we consider the differences between measurements, however, we can be more confident that a possible “bias” (an overall shift in the temperatures or in the weights) is highly reduced or eliminated.
A different type of systematic error is a bias applied to a single measurement. For instance, this can happen if we make a mistake in the measurement procedure. As a consequence, the resulting value will be significantly far from the other results.
It will be seen on the histogram as an isolated point, as in the following figure:
In this example we added to the previous 10 numbers the new value 10.5. This will produce a change of the average, from M=10.03 to M=10.07, and also an increase of the RMS from 0.14 to 0.19.
A simple way to handle these “wrong values” is to consider the trimmed average. This is an average obtained by eliminating the largest and the smallest measurements. Back to the previous example, we should remove the two entries as in the following figure:
In this case the new average will be M=10.06 and the RMS=0.13.
As you can see, the standard deviation σ (the RMS) has decreased a lot, but the average has changed much less. In fact, this is a very rough method to eliminate bias. More sophisticated methods are able to eliminate the “wrong values” more effectively. It is important to remark here that a single “wrong point” is already enough to produce a bias on the average. Of course, the situation gets worse if the number of “wrong points” is larger than one.
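The effect of the extra value and of the trimming can be reproduced with the short Python sketch below. As before, the RMS is assumed to be computed by dividing by N, and the helper names are illustrative.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def rms(xs):
    """RMS deviation from the mean (assumption: divide by N)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def trim(xs):
    """Drop the single largest and the single smallest value."""
    return sorted(xs)[1:-1]

# The ten original values plus the "wrong" value 10.5.
values = [9.8, 10.2, 10.0, 10.0, 10.1, 10.3, 10.0, 9.9, 10.1, 9.9, 10.5]
trimmed = trim(values)

print(f"all values: M = {mean(values):.2f}, RMS = {rms(values):.2f}")
print(f"trimmed:    M = {mean(trimmed):.2f}, RMS = {rms(trimmed):.2f}")
```

Note that the trimming removes the good value 9.8 together with the bad value 10.5, which is one reason why the average moves so little.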
In general, the RMS itself is a good parameter to identify “wrong” data. Almost all the values are indeed contained within “3 RMS”, i.e. the distance of almost all the values from the Mean is smaller than three times the RMS. In the previous example, the average is 10.06, and 3 RMS = 0.39.
So, almost all the values should be in the range 10.06 ± 0.39. It is easy to see that this condition accepts the “good value” 9.8, and discards the “bad value” 10.5 (see figure 1.6).
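A minimal sketch of this 3 RMS selection is given below. It assumes, as in the text, that the Mean and the RMS used for the cut are those of the trimmed sample; the function name three_rms_cut is hypothetical.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def rms(xs):
    """RMS deviation from the mean (assumption: divide by N)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def three_rms_cut(xs, center, width, k=3.0):
    """Split xs into values within k * width of center and values outside."""
    kept = [x for x in xs if abs(x - center) <= k * width]
    rejected = [x for x in xs if abs(x - center) > k * width]
    return kept, rejected

values = [9.8, 10.2, 10.0, 10.0, 10.1, 10.3, 10.0, 9.9, 10.1, 9.9, 10.5]

# Reference Mean and RMS taken from the trimmed sample (largest and smallest removed).
core = sorted(values)[1:-1]
kept, rejected = three_rms_cut(values, mean(core), rms(core))

print("kept:    ", kept)
print("rejected:", rejected)
```

Unlike the trimmed average, this selection keeps 9.8 and removes only 10.5.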
1.3. Statistics and Figure Skating
Let’s now go back to the example of the 100 m race with manual stopwatches. It is clear now that in order to have an objective ranking of the runners, the uncertainty on the time measurements must be smaller than the time differences between the athletes. For instance, if such differences are of the order of 0.02 s, we need an error on the average of the order of (at most) 0.01 s. This would require about 400 manual stopwatches!
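The number of stopwatches follows from inverting the relation between the error on the mean and the number of measurements. The sketch below assumes independent measurements with the same single-measurement error R = 0.2 s; the function names are illustrative.

```python
import math

def error_on_mean(single_error, n):
    """Error on the mean of n independent measurements."""
    return single_error / math.sqrt(n)

def required_measurements(single_error, target_error):
    """Smallest n such that the error on the mean does not exceed target_error."""
    # Rounding first guards against floating-point noise before taking the ceiling.
    return math.ceil(round((single_error / target_error) ** 2, 9))

R = 0.2  # resolution of a single manual stopwatch, in seconds

print(error_on_mean(R, 100))           # resolution of 100 stopwatches
print(required_measurements(R, 0.01))  # stopwatches needed for 0.01 s
```

Halving the error requires four times as many stopwatches, which is why the count grows so quickly.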
We can now substitute the runners with the skaters, and the manual stopwatches with the judges of the jury. We have an “objective result” if the errors on the scores are small compared to the score differences between the skaters.
In other words, it is possible to consider a Jury equivalent to a group of manual stopwatches. There is of course the possibility that some judges (stopwatches) are biased: their measurements are very different from the others. The simple trimmed average is the correction applied to the Figure Skating scores (and in many other sports as well). However, as we have seen, this method is not effective in removing all the biased scores. Biased scores should be removed according to the score distribution, as I have shown in the previous example.
In the next chapter I will apply the previous considerations to the Sochi Ladies scores, with the aim of finding possible biases in the results.
* In most cases the formula $\Delta M = R / \sqrt{N}$ can be used, where ΔM is the error on the mean, N is the number of measurements and R is the error on the single measurement.
† The σ can be evaluated by the following formula: $\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - M)^2}$, where N is the number of measurements, $x_i$ are the single measurements and M is their average.
‡ They are related by the following formula: $\Delta M = \sigma / \sqrt{N}$, where again N is the number of measurements and σ is the RMS.
Analysis of Sochi's Results
In Figure Skating the athletes are called to show their ability in two performances of different length: the “Short Program” (SP, lasting 2 minutes and 30 seconds) and the “Free Program” (FP, 4 minutes for senior ladies). There are 7 technical elements (jumps, spins, step sequences,…) in the SP, and 12 in the FP. In both cases the Jury assigns a technical score (Technical Elements, TE) and an artistic score (Program Components, PC). They are “calibrated” in such a way that both scores have approximately the same weight in the total score.
Moreover, the FP score is about twice the SP score. In summary, the total score can be roughly divided in the following way: 1/6 SP (TE), 1/6 SP (PC), 1/3 FP (TE), 1/3 FP (PC).
Most of the discussions on the Sochi Ladies Figure Skating were focused on the TE of the Free Program, which in fact constitutes only about 1/3 of the total score! In the following I’ll present an analysis of the scores performed with simple, standard statistical methods.
2.1. Technical Score
Let’s briefly recall how the scores are determined, starting from the technical score. Each skater has to present a fixed number of “elements” (jumps, spins,…). A “technical panel” evaluates the correctness of the elements (i.e. checks for errors like under-rotation, wrong edge,…), and fixes the “start value” for each of them, according to well defined rules. At the end, each skater gets a “base value” for each element. Additional points come from the “GOE” (Grade Of Execution). They are evaluated in this way: a team of 9 judges gives a score to each element, from -3 to +3, according to the “cleanness”, the “beauty”, etc. of the execution. Then a trimmed mean is applied (discarding the highest and lowest value). The result is further converted into a value by using the “SOV” (Scale Of Value) and then added to the base score.
As an example, figure 2.1 reports the judges’ panel for the Japanese skater Mao Asada: as expected, there are some fluctuations from one judge to another in the evaluation of the elements.
In the following table the Base Values (BV) are reported for the first 12 Skaters, for both Short and Free Programs. The “Base Value” ranks and the total TE Scores are also reported.
It can be seen that the final ranking of the competition is very different from the ranking that we get from the Base Values. This is not a big surprise, as the Base Values correspond just to a starting point, which doesn’t include the “quality” of the execution. For example, according to the Base Values the first skater should be Gracie Gold; however, she was only fourth in the final ranking. What looks more surprising, instead, is the almost perfect correspondence between the total Technical Element Score (TES) and the final ranking. Let me recall that the TES amounts to about one half of the total. However, from the previous table it looks as if the “Artistic Score” were completely irrelevant. The “Artistic Score” (Program Components) will be discussed in par. 2.2. The previous table shows the relevance of the “GOE”, as they are responsible for the differences between the Base Values and the final TES. Most of the score, however, comes from the “Base Values”, so a bias in the BV can be very dangerous for the final result.
REMARK: In the following, the official base values will be considered, so the analysis will be performed on the GOE only.
According to several commentators, some elements performed by Sotnikova were over-evaluated by the Technical Panel, and some elements performed by Kim were under-evaluated. I will not enter into this discussion; I just assume that the Technical Panel did a good job. In Appendix 1 several links to analyses by technicians and experts are reported, as they were found on the web. I want to remark here again that most of the discussions were focused on the Technical Panel’s choices.
2.1.1 The Free Program
Let’s now come back to the analysis of the GOE, starting with the Free Program. In order to check for the possibility of a bias (i.e. a systematic over- or under-evaluation), it is possible to sum the scores (GOE) for each judge.
For example, here are the sums of the GOE for the Free Program of the skater Gracie Gold (USA, 4th place): 16, 20, 14, 15, 15, 13, 17, 14, 14. There are of course 9 numbers, one for each Judge. We can now put these 9 numbers in a histogram, as in figure 2.2. In the same figure the Mean (15.33) and the RMS (2.0) are also reported.
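The Mean and the RMS of these nine sums can be checked with the sketch below. It also expresses each judge’s sum as a distance from the Mean in units of RMS, which is an extra illustration and not a quantity quoted in the text. The RMS is assumed to be computed by dividing by N.

```python
import math

# Sums of the GOE per judge quoted in the text (Free Program, 4th-placed skater).
goe_sums = [16, 20, 14, 15, 15, 13, 17, 14, 14]

mean = sum(goe_sums) / len(goe_sums)
rms = math.sqrt(sum((s - mean) ** 2 for s in goe_sums) / len(goe_sums))

print(f"Mean = {mean:.2f}, RMS = {rms:.1f}")

# Distance of each judge from the Mean, in units of RMS.
for judge, total in enumerate(goe_sums, start=1):
    print(f"judge {judge}: sum = {total:2d}, distance = {(total - mean) / rms:+.2f} RMS")
```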
Let’s now have a look at the same distributions for the first six skaters (we are again considering the Free Program). The results are reported in figure 2.3. As usual, the RMS indicates the uniformity of the scores, i.e. the uniformity of the judges’ evaluation: a larger RMS corresponds to a less uniform jury. It can be seen that all the RMS are more or less comparable, with clear exceptions: Sotnikova and (maybe) Lipnitskaya. The two RMS indicated by circles are indeed the largest over the full sample of 24 skaters. Can this be just a statistical fluctuation?
In order to quantify the observed RMS enhancement, let’s extend the same analysis to the main international competitions(*), and look at the “RMS” of the “GOE” distributions. All the values of the RMS extracted from these distributions can then be put into another histogram.
The distribution of the “RMS” for the first 6 skaters in the main international competitions is reported in figure 2.4. In the same figure the Mean and the RMS of the distribution are also reported. (This is a distribution of “RMS”, so it has its own RMS!).
Remark: the “RMS” of this distribution will be quoted as RMS, to distinguish it from the “RMS” of the single skaters.
Figure 2.4 shows clearly that the distance of the last value (Sotnikova) from the Mean is about 3 RMS. This is indicated graphically here:
Numerically we have: Mean + 3 RMS = 3.058 + 3 × 0.748 = 5.302, very close to Sotnikova’s value 5.287. From a statistical point of view, we can conclude that the probability that this is a “normal” value is very low (†).
It is important to stress that this large value doesn’t depend on the skater’s skill! It depends essentially on the uniformity of the judges’ scores. As a limiting case, if all the judges gave the same score (no matter how large), the RMS would be exactly zero. So, this anomalous value indicates a large disagreement among the judges. Unfortunately this large RMS prevents us from using the method described in the first chapter to eliminate the biased values. However, we can use another method, which will be described in the next section.
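The distance in units of RMS and the corresponding probability can be sketched as follows. The calculation assumes that the distribution of the RMS values is Normal and uses the one-sided upper tail; both choices are assumptions made for this illustration.

```python
import math

MEAN_OF_RMS = 3.058    # Mean of the distribution of RMS values (from the text)
RMS_OF_RMS = 0.748     # RMS of that distribution (from the text)
SOTNIKOVA_RMS = 5.287  # value observed for Sotnikova (from the text)

# Distance from the Mean in units of RMS.
z = (SOTNIKOVA_RMS - MEAN_OF_RMS) / RMS_OF_RMS

# The "Mean + 3 RMS" reference value.
threshold = MEAN_OF_RMS + 3 * RMS_OF_RMS

# One-sided tail probability of a Normal distribution beyond z.
upper_tail = 0.5 * math.erfc(z / math.sqrt(2))

print(f"distance = {z:.2f} RMS, Mean + 3 RMS = {threshold:.3f}")
print(f"upper-tail probability = {upper_tail:.4f}")
```

Under these assumptions the tail probability comes out between one and two per thousand, in line with the order of magnitude given in the footnote.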
2.1.2 A new statistical test
Let’s look at things in another way. For each technical element of a skater, let’s consider the average score (over all 9 judges), and let’s count for each judge the number N of times that he gives a score larger than the average. Again, let’s take an example using Asada’s Technical Panel in figure 2.1. First, we evaluate the average of the first element (the first row of GOE): this number is 0.44. In this case there are 4 judges above the average and 5 below it. (This is more or less what one can expect in a normal case.) Next, we consider the second element: the average is now 0. In this case there are 3 judges above the average, 3 judges below the average, and 3 judges on the average. Again, this is what one expects in a normal situation. Let’s repeat the exercise for all the elements. Now let’s consider the first judge, and check how many times he is above or below the average. In our example the first judge is 3 times above the average, 1 time on the average and 8 times below the average. The last judge, on the other hand, is 5 times above the average, 1 time on the average and 6 times below the average. So we have N = 3 for the first judge, and N = 5 for the last judge. If we complete the exercise for all the judges, we have 9 numbers, and we can put them in a histogram. This is the distribution for this single skater. However, we can also put in the same histogram all the numbers N from all the other skaters. In this way we can build a big histogram with 9 × 24 = 216 entries.
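The counting procedure can be sketched in Python as follows. The two rows of GOE used here are hypothetical: they are chosen only to reproduce the counts quoted above (4 above and 5 below for an average of 0.44; 3 above, 3 on and 3 below for an average of 0) and are not the actual scores of any skater.

```python
def counts_above_average(goe):
    """For each judge, count the elements scored strictly above the element average.

    goe[e][j] is the GOE given by judge j to element e.
    """
    n_judges = len(goe[0])
    counts = [0] * n_judges
    for row in goe:
        average = sum(row) / len(row)
        for judge, score in enumerate(row):
            if score > average:  # "larger than", not "larger or equal"
                counts[judge] += 1
    return counts

# Hypothetical GOE rows (one row per element, one column per judge).
goe = [
    [1, 0, 1, 0, 0, 1, 0, 1, 0],     # average 0.44: 4 judges above, 5 below
    [1, -1, 0, 1, 0, -1, 1, 0, -1],  # average 0: 3 above, 3 on, 3 below
]

n_per_judge = counts_above_average(goe)
print(n_per_judge)
```

With a full Free Program panel the list would hold one number N between 0 and 12 for each of the 9 judges.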
In the Free Program there are 12 elements, so the numbers N range from 0 to 12. As we have seen, in the case of unbiased scores each judge will give a score sometimes larger than the average, sometimes lower and sometimes (less probably) on the average. So we expect that usually the distribution of N will have a peak more or less in the middle of the range 0 to 12. (Actually it should be slightly lower than 6, due to the condition “larger than the average”, and not “larger or equal”.) Now, let’s consider ALL the skaters and ALL the judges, and let’s have a look at the distribution of N.
The total distribution has a Mean of (about) 5.2, and an RMS of 2.4. A drastic drop in the entries is seen for N>6 and N<3. However, a non-negligible number of entries at large values of N is also present. Is this number compatible with statistical fluctuations? Let me remark again that a “large” N indicates that some Judges have given scores above the average most of the time. Can we say something more precise about it? In fact, it is theoretically possible to determine this distribution from statistics and probability laws. In other words, it is possible to evaluate a priori the probability that a judge will give scores above the average a specific number N of times. However, it is not necessary to be an expert in probability laws to understand that if a judge is always above the average, he is by definition biased!
Let’s now select the entries corresponding to the contested first skater (Adelina Sotnikova). The distribution is reported in red in figure 2.7. As can be seen, a couple of judges were above the average 8 times out of 12, and a couple of judges 11 times out of 12! This is clearly beyond any statistical fluctuation: it is a bias by definition. We are not claiming here that Sotnikova was the only skater to benefit from bias, but she clearly had the largest one.
Note that the remaining judges are (by construction) shifted to the left, at low values of N. In principle it is also possible to interpret this result in the opposite way: a group of judges is under-evaluating the skater, and so the remaining judges are “pushed” by construction to the right, at large values of N.
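One simple way to estimate such a priori probabilities is a binomial model, sketched below. This is only an illustration under strong assumptions that are not taken from the text: each of the 12 elements is treated as an independent trial, and the probability of being above the average on a single element is set to 5.2/12, so that the model reproduces the observed Mean of N. A real judge’s scores are correlated with the average they contribute to, so the numbers are indicative only.

```python
from math import comb

def binomial_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_at_least(n, k_min, p):
    """Probability of at least k_min successes in n independent trials."""
    return sum(binomial_pmf(n, k, p) for k in range(k_min, n + 1))

N_ELEMENTS = 12
P_ABOVE = 5.2 / 12  # assumption: tuned to the observed Mean of N

for k in (8, 11, 12):
    print(f"P(N >= {k:2d}) = {prob_at_least(N_ELEMENTS, k, P_ABOVE):.5f}")
```

Under these assumptions the probability of a judge being above the average 11 or more times out of 12 is well below one percent.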
For comparison, we report in figure 2.8 (in green) the distribution of N for the second skater (Yuna Kim). In this case the RMS is also much lower (2.6 against 4), and close to the RMS of the total distribution (2.4). Further details on this method can be found in Appendix 2.
2.1.3 The Short Program
We can now repeat the previous exercise for the Short Program, starting from the distributions of the GOE for the first 6 skaters:
In this case the distributions look more homogeneous, as well as the RMS. The average values are smaller compared to the FP, as now there are 7 elements only (instead of 12). The same for the RMS(‡).
Figure 2.10 reports the distribution of N (same definition as in par. 2.1.2). In this case the number of elements is 7, so the numbers N are in the range 0 to 7.
In this case however, the smaller number of elements makes it more difficult to point out systematic bias in the scores. There are 3 judges who are above the average 7 times out of 7; however, they are not associated with the same skaters. From this we can conclude that a small amount of bias is also present in the SP, but it is not possible to clarify the situation further.
2.2. Program Components
The “Program Components” (PC) refer to the “artistic impression” of the performance, and are evaluated by considering several aspects, including choreography, interpretation and so on. Each judge gives a score from 0 to 10 to each component (10 corresponds to a sort of “perfection”), then the usual trimmed average is performed and the results are summed over all the components (5 in total). The final score is obtained by applying a further scale factor, 0.8 for the Short Program and 1.6 for the Free Program. These scale factors are needed in order to “balance” the technical and artistic scores.
At first it might seem that this score is the most subjective. In fact, the evaluation of the components is not more subjective than the “GOE” discussed before. I will come back to this matter in the next chapter. Here, I want to point out an important difference. Whereas the technical performances can vary a lot from one event to another (due to random errors, faults,…), the “components” are in principle much more stable.
For example, an outstanding pianist can make a mistake by chance, but without doubt the interpretation will remain outstanding. Moreover, some components (like the choreography) don’t depend strongly on the mistakes that can be made by the skater.
Let’s proceed in the same way as for the Technical Elements, by summing the scores for each judge. Starting with the Free Program, with reference to the Panel on page 16 (Asada FP scores), and looking at the PC, we get the following numbers (no factor applied):
42.75, 44.25, 43.0, 44.25, 47.0, 41.0, 41.25, 44.25, 44.75.
The average is M = 43.6, and the RMS = 1.75. To get the final result a further scale factor of 1.6 should be applied, producing M = 69.8 and RMS = 2.8. Again, these 9 numbers can be inserted into a histogram, and this exercise can be repeated for all the skaters.
In figure 2.11 the distributions for the first 6 skaters (FP) are presented. Here the situation looks even stranger than for the TE scores. The RMS look more compatible; however, there are large differences between the judges, and (perhaps) “anomalous peaks”, indicated by the red arrows. The shape of those distributions indeed looks really strange. However the number of entries (9) is too low to perform a more quantitative analysis. Also, the procedure described in 2.1.2 will not help because of the low number of components (5). So, we have to find another strategy for the analysis.
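The effect of the scale factor can be verified with the sketch below: multiplying every score by 1.6 multiplies both the Mean and the RMS by the same factor. The RMS is assumed to be computed by dividing by N.

```python
import math

# Sums of the Program Components per judge quoted in the text (no factor applied).
pc_sums = [42.75, 44.25, 43.0, 44.25, 47.0, 41.0, 41.25, 44.25, 44.75]
FP_FACTOR = 1.6  # Free Program scale factor

def mean(xs):
    return sum(xs) / len(xs)

def rms(xs):
    """RMS deviation from the mean (assumption: divide by N)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

scaled = [FP_FACTOR * s for s in pc_sums]

print(f"raw:    M = {mean(pc_sums):.1f}, RMS = {rms(pc_sums):.2f}")
print(f"scaled: M = {mean(scaled):.1f}, RMS = {rms(scaled):.1f}")
```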
In order to have a better understanding, we can use this important aspect of Components: they have to be almost “stable” from event to event (as I said, a good pianist can make a technical error by chance, but still he will be recognized as a good pianist).
So, let’s try to compare for each skater the Olympic Program Components scores with an average over the most recent international events. For this purpose I considered all the events within 1 year, including the 2013 World Championship, Grand Prix, etc.(§). For example, I evaluate for Asada a PC average of 33.66 (Short Program). The PC in Sochi was 33.88, so the difference is 0.22. These differences (one for each skater) can be used to make a histogram: this is what is shown in figure 2.12, for the Short Program.
The Mean of this distribution is 1.6, indicating that there is an average shift of +1.6 points between Sochi and the average on the other competitions, i.e. each skater got 1.6 points more than “usual”.
The same plot is reported again in figure 2.13, with the RMS of the distribution (1.65) shown as a red horizontal bar. As usual, this number gives an indication of the global width of the distribution. In the same figure the values of some skaters are also reported (they are indicated by the red arrows). The differences are usually small (1.6 points on average). A few skaters however show a huge difference. For Sotnikova this difference is 5.8, very far from the Mean. In units of RMS, this value is more than 2 RMS away from the Mean.
In other words, this value is not compatible with the distribution, and it cannot be considered a statistical fluctuation.
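The comparison in units of RMS can be made explicit with the short sketch below, which uses only the Short Program numbers quoted in the text; the function name is illustrative.

```python
SP_MEAN_SHIFT = 1.6  # Mean of the differences (Short Program)
SP_RMS = 1.65        # RMS of the differences (Short Program)

def distance_in_rms(difference, mean=SP_MEAN_SHIFT, rms=SP_RMS):
    """Distance of a skater's difference from the Mean, in units of RMS."""
    return (difference - mean) / rms

# Asada: Sochi PC minus the average PC over the other events.
asada_difference = 33.88 - 33.66
sotnikova_difference = 5.8

print(f"Asada:     {asada_difference:.2f} points, {distance_in_rms(asada_difference):+.2f} RMS")
print(f"Sotnikova: {sotnikova_difference:.2f} points, {distance_in_rms(sotnikova_difference):+.2f} RMS")
```

Asada’s difference lies within 1 RMS of the Mean, while Sotnikova’s lies about 2.5 RMS above it.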
The situation is similar in the Free Program (the distribution is reported in figure 2.14). In this case the Mean is about 4 units, i.e. each skater gets on average 4 points more than in the previous competitions. This corresponds to a “global bias”, as described in Chapter 1. I will come back to this global bias in the next Chapter.
If we compare the distances from the Mean by using the RMS unit, Sotnikova’s value (again the largest) is situated now at about 3 RMS from the Mean.
In summary, the analysis of the Components shows an average shift of 1.6 units (SP) and 4 units (FP). Most of the skaters show a difference within 1 RMS (i.e. the distance from the Mean is lower than 1 RMS). So, for most of the skaters, the difference between the Olympic Score and the average over other competitions is compatible with “normal” statistical fluctuations plus an overall shift. This shift clearly depends on the particular set of judges, and can be considered “normal”.
A few skaters however show huge differences, in particular Sotnikova, with differences larger than 2 RMS in both the Short and Free programs. From a statistical point of view the probability that this is just a fluctuation is almost zero. So, unless one believes in “miracles” (¶), we have to conclude that there were strong biases in the components scores as well.
* I have considered here the following competitions: WC 2013, WC 2014, Sochi Olympics and 5 ISU Grand Prix. The RMS used are in all cases those of the first 6 skaters.
† For a Normal distribution the probability is of the order of one per thousand.
‡ Within some range, there is a correlation between the RMS and the Mean.
§ This operation was not possible for some skaters due to their lack of international competitions. In the next plots their entries will not be shown.
¶ This would be equivalent to the case of a “normal” pianist who suddenly becomes a “genius”.
Universita di Salerno
An Exercise: Correcting the Scores
The statistical analysis has shown that the scores contain several biases, in both the Technical Elements and the Program Components. Is it possible to “correct” the scores? In general it is not easy to rectify a result affected by systematic bias. The correction will depend on some assumptions that one has to make about the systematic effect itself, so there is no unique, simple way to proceed.
In the following I’ll present an attempt to make such corrections, based on some “reasonable” assumptions. Let me stress here that this should be considered just an exercise, without any claim of “truth”.
3.1. Global Bias
The effect of global bias was discussed in Chapter 1. It consists of a global, fixed shift of all the values. It is clear that this shift doesn’t affect the final ranking of a competition, so it is not necessary to apply any correction, unless one wants to compare different competitions. This is the case when people are looking for “World Records”, the largest scores ever achieved. It should be clear that world records don’t have any absolute value, as they depend on the particular Jury. In general, scores from different competitions cannot be compared in a simple way. Instead, what can be compared are differences, as explained in Chapter 1. In order to give some examples, I selected four different International Competitions:
the Olympic Winter Games 2010 (OWG2010), the Olympic Winter Games 2014 (OWG2014), the World Championship 2013 (WC2013) and the World Championship 2014 (WC2014).
Let’s start by comparing the total scores for OWG2014 and WC2014.
The distributions of the total scores are reported in figure 3.1.
Both the Means and the RMS look comparable.
The following figure shows the same distributions for WC2013 and OWG2010:
Again, the two distributions look compatible. However, if we now compare them with the previous distributions, we find a difference of about 5 points in the Mean. In other words, if you want to compare the results from these two pairs of competitions, you need a global shift of 5 points on the total.
So, what about the World Records? We have to correct first for this global bias. The simplest way to do that is to subtract the mean from each score (*), as for the tare weight in a balance. To be more specific, let’s compare the following competitions again:
OWG 2010 ― WC 2013 ― OWG 2014 ― WC 2014.
The following table reports the averages (Mean) for the total, short and free programs. The averages are evaluated by excluding the first skater (†).
Using the previous table, it is possible to re-evaluate the scores of the top skaters in these four specific competitions. The results are presented in table 3.2.
It is clear that, after correction, all the World Records were achieved in the first competition, namely the 2010 Olympic Winter Games.
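The mean subtraction used for this comparison can be sketched in a few lines of Python. This is only an illustration: the totals below are invented placeholders (they are not the values of tables 3.1 and 3.2), and the function name `subtract_global_bias` is hypothetical. Following the text, the Mean is evaluated by excluding the first skater.

```python
from statistics import mean

def subtract_global_bias(scores):
    """Return the ranked scores minus the competition Mean.

    The Mean is computed without the top score, as described in the text.
    """
    ranked = sorted(scores, reverse=True)
    reference = mean(ranked[1:])  # exclude the first skater
    return [s - reference for s in ranked]

# Placeholder totals for two competitions (NOT real results).
competition_a = [220.0, 205.0, 200.0, 195.0, 190.0]
competition_b = [225.0, 212.0, 205.0, 200.0, 195.0]

corrected_a = subtract_global_bias(competition_a)
corrected_b = subtract_global_bias(competition_b)

# The winner of B has the larger raw score, but after the shift
# the winner of A stands further above the rest of the field.
print(corrected_a[0], corrected_b[0])
```

With these placeholder numbers the raw "record" belongs to competition B, while the corrected one belongs to competition A, which is the kind of reversal discussed above.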
Another type of “World Record” that is sometimes considered is the largest score difference, usually between the first and the second skater. These differences are “safe” from the global bias; however, they depend on the particular combination of skaters. In any case it is interesting to note that the largest differences ever observed are the following:
23.06 for Olympic Winter Games,
20.42 for World Championships,
36.04 for International Competitions (Grand Prix).
In all the cases the first skater was Kim Yuna from Korea.
3.2. Single Skater Bias
The more insidious bias comes from the “single skater bias”, i.e. a bias which is applied to a single skater. This doesn’t necessarily mean “cheating”. It is very likely that a Judge will favor a skater of his own nation, and this explains why the Jury is typically composed of Judges from the main nations involved in the competition. On average, all the “biases” introduced by the Judges tend to cancel each other out. In some cases, however, the bias may favor one (or more) specific skaters. More details on the amount of bias that can be produced by one or more judges are reported in Appendix 3.
Let’s now consider separately the Technical Elements and the Program Components.
3.2.1. Technical Score
I have shown in the first Chapter that the “trimmed average” (discarding the largest and the lowest values) in general doesn’t work. It is the easiest way to make a correction, and this is the reason why it is used in almost every sport where a jury is involved. However, it is not very effective in eliminating the “wrong” values, which is what a correction should do. Another criterion is therefore needed to correct the score.
In paragraph 2.1.2 I have shown a way to give a quantitative (probabilistic) indication of the “fairness” of a judge. A more effective correction consists in eliminating from the Mean all the “biased judges”, according to this indication. My assumption then is to exclude all the scores from judges who have a very large “N” (see figure 2.6). For this exercise (Free Program) the threshold was set at the arbitrary values of 7, 8 and 9 in three different trials (i.e. all the judges with N larger than 7, 8 or 9 were excluded).
As the final result is not strongly sensitive to this threshold, the value of 8 was finally used. The new averages are then scaled according to the SOV, and summed for each skater. The whole exercise was finally repeated for the Short Program, now with a threshold of 6.
Note that by construction the new TE scores will be in general smaller than the official ones.
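The exclusion step can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used for the tables: the GOE are assumed to be stored as one list per judge, N is counted as the number of elements on which a judge is strictly above the panel average, and the scaling with the SOV is left out. The function names and the tiny example panel are hypothetical.

```python
from statistics import mean

def count_above_average(goe):
    """goe[j][e] is the GOE given by judge j to element e.

    Returns N for each judge: the number of elements on which
    that judge is strictly above the panel average.
    """
    n_elements = len(goe[0])
    element_means = [mean(row[e] for row in goe) for e in range(n_elements)]
    return [sum(row[e] > element_means[e] for e in range(n_elements))
            for row in goe]

def corrected_element_means(goe, threshold):
    """Average each element over the judges with N <= threshold."""
    counts = count_above_average(goe)
    kept = [row for row, n in zip(goe, counts) if n <= threshold]
    if not kept:
        raise ValueError("threshold excludes every judge")
    n_elements = len(goe[0])
    return [mean(row[e] for row in kept) for e in range(n_elements)]

# Invented panel: 3 judges, 4 elements (NOT real data).
example_goe = [
    [1, 1, 1, 1],  # always above the average
    [0, 0, 0, 0],
    [0, 1, 0, 1],
]
example_counts = count_above_average(example_goe)
example_corrected = corrected_element_means(example_goe, threshold=3)
```

In this toy panel the first judge is above the average on every element and is removed by the threshold, so the corrected averages come out lower than the plain ones.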
3.2.2. Components Score
As already observed, in this case the procedure described in 2.1.2 will not help because of the low number of elements (5).
A correction of the components score can however be performed on the basis of figures 2.12 and 2.12. A “safe” choice could be to take a weighted average of all the results, i.e. the results of all the recent International Competitions. This basically corresponds to considering a much larger number of judges. Clearly, the actual event (the Olympic Event) should have the largest weight, since we want to determine a score for this specific competition. So, a “fair” choice could be to give a 50% weight to the Olympic result, and 50% to the average of the other events. However, in order to be as conservative as possible, I set this weight to 2/3 (0.66) for the Olympic scores and 1/3 (0.33) for the others. Let me stress that for most of the skaters the exact value of these weights doesn’t strongly change the resulting score.
For example, the PC scores for skater Asada (Short Program) are the following:
$PC_{\mathrm{OWG}} = 33.88$ (OWG2014), $PC_{\mathrm{others}} = 33.66$ (average over other competitions)
So, if we take the weight factors as 50% and 50%, we obtain the corrected value
$PC_{\mathrm{corr}} = 0.5 \times 33.88 + 0.5 \times 33.66 = 33.77$
If the weights are set to 2/3 and 1/3, the corrected value will be:
$PC_{\mathrm{corr}} = \tfrac{2}{3} \times 33.88 + \tfrac{1}{3} \times 33.66 \approx 33.81$
For this skater, the difference between the two numbers is really small, just a few cents. However, for some skaters the difference is not negligible.
To evaluate the new PC scores we have to repeat this exercise for both SP and FP, and for all the skaters. Note that, as for the TE, the new scores are expected to be on average smaller than the official ones (because of the global shift).
3.2.3. Total Score
Finally, to get the new Total Scores we still have to sum all the partial scores, which are the TE and the PC for both SP and FP. Note that for some skaters some deductions also have to be applied (due to one or more faults).
The final results for the first 12 skaters in Sochi’s ranking are presented in table 3.3.
It can be seen that most of the scores are almost unchanged (all are slightly reduced, as expected), as is the final classification, with two important exceptions: the first-second and the fifth-sixth places. This result is clearly a consequence of what was previously discussed.
Finally, we need an estimate of the “errors” associated with these new scores, or in other words, an estimate of the range in which they should be found. The largest uncertainty in the final result comes from the relative weight of the Components Scores. As discussed before, a conservative value of 2/3 has been used. In order to get an approximate indication of the associated uncertainty, it is possible to change this weight within a “reasonable” range. I used as limits the values 0.5 and 0.7. The results are presented in table 3.4, which reports the “official total scores”, the two “limits” (the values corresponding to the two weights above), and the average of the two (the reported “errors” correspond to the half-difference between the two limits; they are not statistical errors!).
The same result is shown here in graphic form. The vertical bars (“error bars”) indicate the uncertainty range from the minimum to the maximum value, as indicated in the previous table.
As expected, the error bars are larger for the points with larger distances from the official scores. Also, it is interesting to note that Kostner’s position moves up to second place when we consider the average (see table 3.4).
As remarked before, the error bars presented in the table don’t have a true statistical meaning; they can be considered as a rough estimate of the range where the unbiased scores should be found. Note that the values reported in table 3.3 correspond to the “very conservative” conditions (weights = 2/3 and 1/3).
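The way the two “limits” and the half-difference are obtained can be sketched as follows, using the only numbers quoted in the text (the Short Program PC scores of Asada from section 3.2.2). The function name `weighted_pc` is hypothetical, and the sketch covers a single partial score only: the entries of table 3.4 also include the TE and the other program.

```python
def weighted_pc(olympic, others, weight):
    """Weighted average of the Olympic PC score and the average over
    the other competitions; `weight` is the weight of the Olympic score."""
    return weight * olympic + (1.0 - weight) * others

# Asada, Short Program (values quoted in section 3.2.2).
olympic, others = 33.88, 33.66

low = weighted_pc(olympic, others, 0.5)   # first limit
high = weighted_pc(olympic, others, 0.7)  # second limit

central = (low + high) / 2        # average of the two limits
half_range = abs(high - low) / 2  # the quoted "error", not a statistical one

print(f"limits: {low:.2f} .. {high:.2f}")
print(f"central value: {central:.2f} +/- {half_range:.2f}")
```

For a skater whose Olympic and average scores are close, as here, the half-difference is tiny; it grows with the distance between the two inputs, which is why the error bars are larger for the skaters with the largest corrections.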
* In principle the Mean should be evaluated by excluding the specific score that we are considering, i.e. the largest one.
† This gives a difference on the Mean of the order of 1 ― 2 points.
The “Jury Resolution Power”
4.1. The Resolution Power
Let me now come back to the general problem of “objectivity” in Figure Skating (and, more generally, in competitions with a jury).
Figure Skating experts usually “trust” the judges’ evaluation, whereas other people are more skeptical about their objectivity. Can we approach the problem from a scientific point of view? Let’s consider a piano contest. Let’s suppose that the first performer gets scores between eight and nine. Then, a second performer gets scores from six to seven. It is reasonable to say that the first performer is much better than the second one. What, however, if the second performer gets scores from 7.5 to 8.5? Can we still say for sure that the first performer is better? What matters here is the “resolution power” of the jury, which is the minimal score difference that they are able to separate. Just like any other instrument, the jury has an “intrinsic resolution”, which gives the limit of the measurement. In other words, you cannot distinguish amounts below this limit. The resolution power of the jury can be different from sport to sport, and from competition to competition. In the following I will try to evaluate the average resolution in the case of Figure Skating (ladies events).
In order to evaluate this “resolution capability” of the Jury, let’s consider the distribution of the scores which come out from the Judges. Each Judge is equivalent to a stopwatch operator, so we take the Mean as the best measurement, and the RMS as an estimate of the distribution width. As an example, figure 4.1 shows the distributions of the scores for Carolina Kostner at the last Olympic Games, for both Technical Elements and Components (sum of SP and FP). As usual, each Judge provides an “entry” in the distributions.
(Note that, with respect to the plots on pages 19 and 24, the Technical Elements include the base values; moreover, they are summed over the Short and Free Programs.)
The average (Mean) and the RMS are reported on the same figures.
As expected, the RMS for the Components (the “artistic” score) is larger than that for the Technical score (3.6 compared to 1.6). In this case the error on the total score (sum of Technical + Components) is dominated by the error on the Components (*). However, if we select another skater the distributions will look different, with a different Mean and RMS. A more representative quantity is the average of the RMS over the full sample of skaters. The distributions of the RMS for all the skaters (2014 Olympic Games, Ladies Single) are reported in the figure below, for Technical Elements (left) and Components (right). A “Gaussian fit” (red line) is also performed.
Again, the RMS for the Components are larger than those for Technical Elements (2.6 compared to 1.2) (†).
The sum of Technical Elements and Components produces the Total Score. The RMS on the Total Score can be obtained in the same way, so we get the final result:
This corresponds to an error on the Mean of about 1.3 points (‡).
In other words, in Figure Skating the resolution of the Jury is about 1.3 points. The standard “minimal” resolution is just this value, i.e. the Jury is not able to “resolve” skaters within 1.3 points. Therefore skaters with a difference in the total score lower than 1.3 should be considered equivalent. Note that a safer resolution can be fixed at twice this value (2.6 points).
Note also that in principle the resolution depends on the score itself. For simplicity I’m considering here the average only.
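As a rough illustration of the quantities used in this section, the sketch below computes the Mean, the RMS and the error on the Mean for one skater, and checks whether two totals are separated by more than a given resolution. Several things are assumptions here: the RMS is taken as the spread around the Mean (population standard deviation), the error on the Mean is taken as RMS/sqrt(N), the judge scores are invented, and the function names are hypothetical. The trimming of the highest and lowest score is ignored.

```python
from math import sqrt
from statistics import mean, pstdev

def panel_summary(judge_scores):
    """Return (Mean, RMS, error on the Mean) for one skater."""
    m = mean(judge_scores)
    rms = pstdev(judge_scores)  # spread of the scores around the Mean
    return m, rms, rms / sqrt(len(judge_scores))

def is_resolved(score_a, score_b, resolution=1.3):
    """True if the two totals differ by more than the resolution."""
    return abs(score_a - score_b) > resolution

# Invented totals from a nine-judge panel (NOT real data).
panel = [70, 72, 74, 76, 78, 80, 82, 84, 86]
panel_mean, panel_rms, panel_error = panel_summary(panel)

print(panel_mean, round(panel_rms, 2), round(panel_error, 2))
print(is_resolved(200.0, 199.0), is_resolved(200.0, 197.0))
```

The default resolution of 1.3 points is the average value quoted in the text; passing `resolution=2.6` gives the safer “2-sigma” criterion.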
We can now ask if this resolution is enough to have a ranking that makes sense. Let’s have a look at the distribution of the score differences between the nearest skaters in the final ranking (i.e. the differences first-second, second-third, third-fourth and so on). I have considered here the following international competitions:
WC2014 ― OWG2014 ― WC2013 ― WC2012 ― OWG2010.
The result is illustrated in figure 4.3. The distribution can be described approximately by a decreasing exponential function. This means that most of the differences are concentrated on the left side of the figure. As explained before, the Jury is not “able” to distinguish between scores that differ by less than the minimal resolution value of 1.3. The value of 1.3 is indicated in the figure as a red line. (The blue line corresponds to the value of 2.6, i.e. a “2-sigma” resolution.)
The fractions of values larger than these two limits are 69% and 52% of the total respectively. So, in about 70% of the cases the final classification can be considered “objective”, whereas in about 30% of the cases it just comes out of “statistical fluctuations”. The situation improves, however, if we consider the top places in the final classification. If we restrict the score differences to the top four places, the distribution extends further to the right. Indeed, larger scores correspond on average to larger differences, so with this selection we have more entries at larger values in the distribution. If we now consider again the previous limits, the fraction of entries larger than the minimal limit is 90% (80% for the 2.6 limit).
So, in about 90% of the cases the medal standings can be considered “objective”. This is true in all the cases where the score differences are larger than the previous limit of 1.3 units. Note that all these considerations are true provided no single skater bias is present!
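The counting behind these fractions can be sketched as follows. The totals are invented placeholders, not the results of the competitions listed above, and the function names are hypothetical; the sketch only shows how the differences between neighbours in the ranking are compared with the 1.3 and 2.6 limits.

```python
def adjacent_differences(totals):
    """Differences between neighbours in the final ranking."""
    ranked = sorted(totals, reverse=True)
    return [a - b for a, b in zip(ranked, ranked[1:])]

def fraction_above(differences, limit):
    """Fraction of differences strictly larger than `limit`."""
    return sum(d > limit for d in differences) / len(differences)

# Placeholder totals for one competition (NOT real results).
totals = [210.0, 205.0, 204.5, 200.0, 198.0, 197.5]
diffs = adjacent_differences(totals)

frac_1sigma = fraction_above(diffs, 1.3)
frac_2sigma = fraction_above(diffs, 2.6)
print(diffs, frac_1sigma, frac_2sigma)
```

Restricting `totals` to the first four entries before calling `adjacent_differences` reproduces the “top places” selection described in the text.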
4.2. Intrinsic Fluctuations
In Figure Skating intrinsic fluctuations (i.e. fluctuations from performance to performance) have a strong influence on the final result, as a single mistake can produce a fault. What about fluctuations in other sports? Let’s have a look again at the men’s 100 m run. Fluctuations in the performances of the athletes are less evident, but they are not negligible.
In order to estimate these fluctuations, I have considered the best recent performances of the fastest runners. The “RMS” turns out to be correlated with the time averages. If we select the smaller times we get an average error of about 0.04 seconds.
In other words, time differences lower than about 0.04 seconds can be considered just fluctuations in the performances (¶). Is this a rare situation in this sport? Let’s have a look again at the distribution of such differences. They are reported in the following figure:
In this figure the distribution of the time differences (in seconds) between nearest runners is shown (first ― second; second ― third; third ― fourth and so on). I have considered here all the main International Races (OWG and WC), from 2004 to 2013. As in the case of Figure Skating, the distribution can be described approximately by a decreasing exponential function (red line). The “1-sigma” and “2-sigma” resolutions are also reported as straight lines (red and blue).
The fractions of cases larger than these two limits are now 66% and 48% respectively. However, if we consider the top four places only, the fractions are about 55% and 39% (◇). In other words, intrinsic fluctuations for the top places are relevant in about 45% of the cases!
It is clear that in Track and Field no “single runner bias” is present, and in this sense the results are much more “objective” than in Figure Skating. However, as everybody knows, biases are present in most competitions. Moreover, the possibility of cheating is unfortunately present in any sport!
* For independent measurements it is possible to sum the squares of the RMS.
† Note that the distribution of RMS for Technical Elements is different from the same distribution on page 19: in that case only GOE of the Free Program were considered, now we have the full score summed on FP+SP.
‡ Assuming for the error the following formula: $\sigma_{\mathrm{Mean}} = \mathrm{RMS} / \sqrt{N}$, where N = Number of Judges = 9. Actually the procedure is slightly more complicated, as the score is obtained by excluding the smallest and the largest values (trimmed average). This reduces the RMS, but then the effective number of Judges N becomes 7, so the final result is essentially the same.
¶ This is true if the performances of the runners are indeed uncorrelated. As the runners race side by side, they influence each other. A residual correlation can therefore be present, and in this case the above statement is no longer completely true.
◇ The first places correspond to shorter times, which correspond to smaller differences. So, the first places are now concentrated in the left side of the distribution.
I have shown that statistics is a powerful tool for the analysis of competitions where scores are given by a jury.
The statistical analysis of the Sochi Ladies Figure Skating results has shown the presence of systematic biases in the scores, in both Technical Elements and Program Components. The largest bias was assigned in both cases to the first skater, and this probably explains the “uproar” which followed the end of the competition.
The analysis has shown also the following results:
- The “trimmed average”, used in almost all sports with a jury, is a very rough method to correct for bias. A better method should eliminate only the scores that lie very far from the Mean. The reference distance is the RMS of the distribution.
- The resolution power of a typical Jury is about 1.3 points. This indicates that score differences smaller than (about) 1.3 points don’t have any “objective” meaning; they are just statistical fluctuations. This happens in about 30% of the cases, considering all the skaters. However, the medal standing can be considered “objective” in about 90% of the cases (every time the score differences are larger than 1.3 points);
- World Records in Figure Skating don’t have intrinsic meaning due to the “global bias”, which depends on the particular jury. More relevant are score differences. In all cases the actual World Record holder in Ladies Figure Skating is the Korean Kim Yuna;
- Fluctuations in the performances can be very important in all sports. In men’s 100m race they are relevant on the average in about 33% of cases.
On the basis of this analysis, an exercise to “correct” the official scores has been also performed. The result (Table 3.4) shows clearly that the first place should be assigned to the second skater Kim Yuna. The next scores (second and third places) are more or less equivalent within the errors, with a small advantage of Carolina Kostner over Adelina Sotnikova. In any case, a true unbiased result can only come from a non-biased Jury.
As a final remark, I would like to stress once again that the simple methods shown here can be in principle used by anybody who wants to perform a scientific check of any competition with scores.
In the following, a sample of some discussions on the web is presented. The list doesn’t claim to be complete or “fair”; it just reflects what I have mostly found.
Coming back to the statistical test proposed in Par. 2.1.2, the amount of bias can be better quantified by looking at the distribution of N, reported again in the next figure. Let me recall again the meaning of N: it is the number of times that a judge has provided a score above the average, for each element.
In the Free Program there are 12 elements, so N should be in the range 0 to 12. It is clear that if a judge is above the average most of the time, most likely he is biased. It is possible to evaluate a priori the probability that this happens, as N is expected to follow a binomial distribution, just like the number of times that you get red / black at roulette, given the number of trials (12 in this case).
In the previous figure the distribution is compared with the expected theoretical binomial distribution (black dots). In the shadowed area the number of entries is much larger than what can be expected on a statistical basis (*). This clearly indicates bias in the scores, which is not a surprise, considering that for many skaters there is a judge of their own nationality (†).
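The binomial expectation can be evaluated directly. The sketch below assumes that an unbiased judge is above the average on each element with probability 0.5 and independently from element to element (ties with the average are ignored); these are simplifying assumptions, and the function names are hypothetical. The cut that defines the shadowed area is left as a parameter, since its exact value is not restated here.

```python
from math import comb

def binomial_pmf(k, n=12, p=0.5):
    """Probability of exactly k scores above the average out of n."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def tail_probability(k_min, n=12, p=0.5):
    """Probability that N is at least k_min."""
    return sum(binomial_pmf(k, n, p) for k in range(k_min, n + 1))

def expected_entries(k_min, n_entries, n=12, p=0.5):
    """Expected number of entries with N >= k_min in a sample."""
    return n_entries * tail_probability(k_min, n, p)

# Tail probabilities for a few possible cuts on N.
for cut in (9, 10, 11):
    print(cut, round(tail_probability(cut), 4))
```

Comparing `expected_entries` for the chosen cut with the number of entries actually observed above it gives the kind of excess discussed in the text.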
I will evaluate here the total amount of bias that can be introduced in the score, as a function of the number N of biased judges.
For this purpose, I have recalculated all the scores by modifying the points given by a number N of individual judges. The change has been performed by adding some extra points, as explained later. The number N was varied from 1 to 4 (hopefully no more than 4 biased judges should be present in a competition!)
The results of the analysis are presented in the next figures, for both Technical Elements (TE) and Program Components (PCS). The scores are summed on Short and Free programs.
Let me recall once again that in all cases a “trimmed average” is performed (exclusion of the highest and lowest scores). In figure A.2 the differences in scores (TES, sum of SP and FP) are reported as a function of the number N of “biased Judges” (up to 4). The black dots correspond to the case where the Judges give (for each GOE) one point more than the "true" score (moderate bias). The red dots correspond to the case of two points more in each GOE (strong bias). For instance, in case of two biased Judges you get a difference of about 2 points in case of “moderate” bias, and 3 points in case of “strong” bias. The error bars indicate the “range” where the difference can be found (this depends on the details of the scores). Note that a difference of about 1 point is found also in case of N=1. In this case however it doesn’t depend on the amount of bias, because of the trimmed average.
In the figure below (figure A.3) the differences in Program Components (PCS) are considered. The three colors correspond to three different bias, from 0.5 points (moderate bias) to 1.5 points (strong bias). This means that a number N of judges gives for each element 0.5 points more than the “real” score, etc.
Again, a difference of about 1 point can be seen for N=1, independently of the amount of bias (because of the trimmed average). The bias on the total score can be obtained just by summing the TES and PCS differences. Note that, as Judges can also give a negative bias, the overall score difference between two skaters can be twice this difference.
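The saturation for N=1 can be checked on a toy example. The sketch below applies a trimmed average (one highest and one lowest score dropped) to an invented nine-judge panel for a single GOE and adds extra points to one or two judges. The panel values and the function names are hypothetical, and the shifts are in raw GOE units for one element: the differences quoted in the figures are sums over all elements after scaling.

```python
def trimmed_mean(scores):
    """Drop one highest and one lowest score, average the rest."""
    kept = sorted(scores)[1:-1]
    return sum(kept) / len(kept)

def shift_from_bias(scores, biased_judges, extra):
    """Change of the trimmed mean when `extra` points are added to
    the scores of the judges whose indices are in `biased_judges`."""
    biased = [s + extra if j in biased_judges else s
              for j, s in enumerate(scores)]
    return trimmed_mean(biased) - trimmed_mean(scores)

# Invented GOE from a nine-judge panel (NOT real data).
panel = [0, 0, 1, 1, 1, 1, 1, 2, 2]

one_moderate = shift_from_bias(panel, {2}, 1)     # one judge, +1
one_strong = shift_from_bias(panel, {2}, 2)       # one judge, +2
two_moderate = shift_from_bias(panel, {2, 3}, 1)  # two judges, +1
two_strong = shift_from_bias(panel, {2, 3}, 2)    # two judges, +2

print(one_moderate, one_strong, two_moderate, two_strong)
```

With a single biased judge the inflated score becomes the highest one and is discarded, so the shift is the same for the moderate and the strong bias. With two biased judges only one inflated score can be discarded, and the shift grows with the amount of bias.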
Note also that the largest part of the TES score comes from the “Base Values” and they are not considered here. So, in principle a further, large bias can be also introduced by the “Technical Panel”.
From the previous figures it is possible to evaluate the amount of bias that can be introduced on the total score by a specific number of “unfair” (biased) Judges. For instance, in the case of N=2 and moderate bias (black points for TES and red points for PCS), you get an average bias of about 1.8+1.8=3.6 points. As Judges can also provide a negative bias (by lowering the scores), and due to the symmetry of the situation, the total score difference between two skaters in this example can be about 7 points!
- The discussed results can be found here:
- The table of SOV used in the exercise can be found here:
- The best runners performances are available here:
See also references therein.
Please, send questions, comments, etc. to:
* According to the general statistical laws, the expected number of entries in the shadowed area should be about 2.5% of the total (about 5 units). The observed number is 23.
† Indeed the first 13 skaters in the ranking of Sochi had a judge of their own nation, with the exception of Korea.