This is an updated version of "Sochi Ladies Figure Skating: a statistical analysis of the results" by Dr Tiziano Virgili.
The original file (PDF format) can be downloaded here.
Chapter I Why Statistics?
1.1 Statistical Errors
1.2 Systematic Errors
1.3 Statistic and Figure Skating
Chapter II Analysis of Sochi’s Results
2.1 Technical Score
2.2 Program Components
Chapter III An Exercise: Correcting the Scores
3.1 Global Bias
3.2 Single Skater Bias
Chapter IV The “Jury Resolution Power”
4.1 The Resolution Power
4.2 Intrinsic fluctuations
Since the end of the Sochi Olympic Games, a heated debate has developed about the result of the Ladies Figure Skating event. The main point was the very high score of the winner (the Russian Sotnikova), very close to the current world record. Many commentators, including outstanding skaters like Katarina Witt and Kurt Browning, called this result a scandal, whereas others defended at least the technical score. What surprised me was that the Italian TV commentators had correctly guessed almost all the scores, with three important exceptions: the two Russian skaters and the Korean skater. Was this just a coincidence? To fuel the discussion further, according to media reports the “curriculum” of a couple of judges was not really appropriate to their important task. The composition of the jury was also somewhat anomalous, since it included a couple of Russian judges and no Koreans.
Most of the discussions on the web focus on the “Technical Score” of the Free Program, with different arguments. Sotnikova’s supporters draw attention to the number and difficulty of the elements, whereas Kim’s supporters point to the “quality” of the performance. All these discussions, however, are misleading, as the “Technical Score” of the Free Program makes up only about one third of the total score.
Looking at the discussions on the web, it is also clear that an increasing number of people consider Figure Skating a “subjective sport”, namely a sport where the final result is not determined by precise measurements. According to these people, the result depends essentially on “personal taste”, or more precisely on the taste of the judges. The same could be said, more generally, of any competition where a jury is involved. Do such competitions have any “objective” meaning? The question is not trivial, and it concerns a huge number of sporting events (such as Figure Skating, Gymnastics, Diving, Synchronized Swimming, Snowboarding, Boxing, Judo,…), artistic competitions (piano, dance, singing contests,…) and also important events like public selections (access to university, public employment, etc.). So it is very important to provide a “scientific” answer to this question as well.
Obviously, I’m not qualified to comment on the results of the Sochi Olympic Games from a technical point of view. As a physicist, however, I’m used to performing data analysis, so I have tried to look “blindly” at the numerical scores, as if they were “experimental data”. The basis of any scientific approach to experimental data is statistical analysis, so I have performed very simple checks based on standard statistical methods. These methods make it possible, in general, to quantify the “objectivity” of a Jury and (more importantly) to discover the presence of bias in the results, so they should indeed be used in every competition with a Jury.
In the first chapter of this book I’ll present a few simple concepts of statistics (formulas are not necessary!), such as the Mean and the RMS of a distribution. In the second chapter, after a short summary of the basic rules of Figure Skating scoring, I will analyze the “Technical Elements” and the “Program Components” of the Sochi Olympic Games. In chapter three I’ll show how the scores can be corrected for bias, with methods based on the previous statistical analysis. Finally, in the fourth chapter the question of the “objectivity” of Figure Skating is discussed. Some technical remarks are reported in the appendixes, together with a list of links to some discussions on the web. I hope that this work will be helpful not only for the Sochi results, but more generally as a guide to performing a scientific check of any competition with scores.
1.1. Statistical Errors
What is an “objective” result? Let’s start with a basic consideration. From a scientific point of view, an objective result is a result which can be reproduced at any time, everywhere and by everyone. A simple example is the force of gravity: it can be experienced by everyone, everywhere and at any time. In general, no result can be reproduced exactly (with infinite precision). Measurements can only be reproduced within some range, so it is very important to determine this range, known as the measurement error. Let’s now take an example from one of the most “objective” sports: Track and Field’s 100 meters.
It is widely assumed that the 100 meter race is among the most “objective” sports, as the times of the runners can be measured with high precision. Let’s now suppose that the runners run one after another, and that the times are measured with a manual stopwatch. The running time of each athlete is then measured by a human operator, and because of limited human sensitivity it will have a non-negligible error. The measured time will be larger or smaller than the “true value”, in a random way. This kind of fluctuation is known as statistical error (as we will see later, this is not the only source of error in the measurement). We can improve this measurement by adding several manual stopwatches. In this case, due to the “human indetermination”, each operator will give a slightly different result.
1.1.1. Mean and distributions
It is possible to obtain a better estimate of the “true value” by taking the average (Mean) of the different values. This is defined as the sum of all the values divided by the number of values.
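Written as a formula, the definition reads as follows. This is only a sketch of the standard definition: the symbols $x_i$ for the single values and $N$ for their number are notation assumed here, while $M$ denotes the Mean.

```latex
M = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \dots + x_N}{N}
```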
In general, the larger the number of measurements (stopwatches), the lower the error that we have on the average. In other words, the statistical error can always be reduced by increasing the number N of measurements(*). So, in principle, you can reach the precision that you need just by increasing the number of operators.
In our example, as the “human” time resolution is of the order of 0.2 s, a group of about 100 stopwatches will provide a global resolution of about 0.02 s (the measured time will correspond to the average of the 100 single independent measurements). Not very practical, but still effective!
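The 0.2 s to 0.02 s improvement can be checked with a small simulation. The sketch below is illustrative only: it assumes that each stopwatch error is Gaussian with a width of 0.2 s and that the operators are independent, and the "true" time of 10.0 s is a hypothetical value chosen for the example. It compares the simulated spread of the averaged time with the single error divided by the square root of N.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

TRUE_TIME = 10.0    # hypothetical "true" running time, in seconds
SINGLE_ERROR = 0.2  # resolution of one manual stopwatch, in seconds


def averaged_time(n_stopwatches):
    """Average of n independent stopwatch readings with Gaussian errors."""
    readings = [random.gauss(TRUE_TIME, SINGLE_ERROR) for _ in range(n_stopwatches)]
    return sum(readings) / n_stopwatches


def spread_of_average(n_stopwatches, n_races=2000):
    """Fluctuation (RMS) of the averaged time over many repeated races."""
    averages = [averaged_time(n_stopwatches) for _ in range(n_races)]
    return statistics.pstdev(averages)


for n in (1, 4, 25, 100):
    expected = SINGLE_ERROR / n ** 0.5
    print(f"N={n:3d}  simulated={spread_of_average(n):.3f}  expected={expected:.3f}")
```

The simulated values fluctuate slightly from run to run (or with a different seed), but they should follow the expected column closely.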
It is possible to “visualize” all the measurements by constructing a plot known as “histogram”. Let’s suppose that we have ten values:
9.8, 10.2, 10.0, 10.0, 10.1, 10.3, 10.0, 9.9, 10.1, 9.9. We can easily evaluate the average as M = 10.03. We can now put these numbers on a graph in the following way. First, we define a horizontal axis with the appropriate scale (i.e. more or less in the range of our 10 numbers):
Next, for each of the values, we put a “box” over the corresponding number (to be more precise, we should define a “bin size”, and put together all the numbers that are in the same range defined by the bin size). So, on the vertical scale we just count the number of “boxes” which have that value. For instance, we have 3 values “10.0”, so the total box height at 10.0 will be equal to 3.
Here is the final figure: this is a very simple example of distribution.
The total number of “entries” (i.e. the number of boxes) is of course 10.
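The construction of the histogram can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to draw the figures: the bin size of 0.1 is an assumption (it matches the one-decimal precision of the ten values), and the boxes are drawn as text characters.

```python
from collections import Counter

values = [9.8, 10.2, 10.0, 10.0, 10.1, 10.3, 10.0, 9.9, 10.1, 9.9]

# Mean: sum of all the values divided by the number of values
mean = sum(values) / len(values)

# Assumed bin size of 0.1: values are grouped by their first decimal
counts = Counter(round(v, 1) for v in values)

# One "#" per box, stacked horizontally for each bin
for x in sorted(counts):
    print(f"{x:5.1f} | " + "#" * counts[x])

print(f"entries = {sum(counts.values())}, mean = {mean:.2f}")
```

The tallest row is the one at 10.0, with three boxes, and the total number of entries is 10.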
This figure tells us something more than the simple average: it is possible to “see” how the values are arranged around the average. In other words, the “shape” of the distribution also contains important information. It is possible to demonstrate that if the measurements contain random errors only, the shape of the distribution will be similar to the previous figure, with a maximum in (about) the middle and two side tails. The exact shape is called the “Normal distribution” (or “Gaussian”). In this case the mean value coincides with the maximum. The “width” of the distribution is also very important. It can be quantified by another number, the “standard deviation”.
1.1.2. Standard Deviation - RMS
The Standard Deviation (σ) provides information on the width of the distribution, i.e. how far the numbers are from the average. It can be evaluated(†) by the “Root Mean Square Deviation” (in the following RMS). An example is shown in the following figure: the “RMS” is indicated by the horizontal bar. Approximately, the RMS is the half-width of the distribution at a little more than half height (for an exact Gaussian, the full width at half height is about 2.35 times the RMS). In summary, if we repeat a number of measurements affected by random errors (such as the manual stopwatches) we get a distribution that looks like a “bell”. It should be clear now that a larger RMS means a wider distribution, i.e. larger fluctuations in the measurements.
This parameter is also related to the error ΔM on the average M: a small RMS corresponds to a small error ΔM (‡). Coming back to our numerical example, we have for this distribution RMS = 0.14, and therefore the error on the average is ΔM = 0.045. As we have seen, this error can be further reduced by increasing the number N of measurements.
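The two numbers quoted above can be reproduced with a short calculation. The sketch below assumes the usual definitions: the RMS is the square root of the mean squared deviation from the average, and the error on the average is the RMS divided by the square root of N. Dividing by N (rather than N - 1) is an assumption made here because it is the choice that reproduces the value 0.14 for these ten numbers.

```python
import math

values = [9.8, 10.2, 10.0, 10.0, 10.1, 10.3, 10.0, 9.9, 10.1, 9.9]

n = len(values)
mean = sum(values) / n

# RMS: square root of the mean squared deviation from the average
rms = math.sqrt(sum((v - mean) ** 2 for v in values) / n)

# Error on the average: RMS divided by the square root of N
error_on_mean = rms / math.sqrt(n)

print(f"M = {mean:.2f}, RMS = {rms:.2f}, error on M = {error_on_mean:.3f}")
# prints: M = 10.03, RMS = 0.14, error on M = 0.045
```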
1.2. Systematic Errors
In addition to the “statistical errors” there is another type of error, known as “systematic error”. A global systematic error is a bias common to all the measurements. As an example, let’s suppose we measure the weight of different objects with a balance. Can we really be sure that the observed values correspond to the “true” weights? That is hard to say, unless an independent measurement (another “good” balance) is available. Another example is the measurement of a temperature with a thermometer. If the balance or the thermometer is not well calibrated, all the measurements will be shifted by the same amount, and we will observe a global bias.
Generally speaking, systematic errors are rather difficult to treat. As long as you don’t have an “external reference”, it is not possible to apply the “right” correction to all the measurements. If we consider only the differences between measurements, however, we can be more confident that a possible “bias” (an overall shift in the temperatures or in the weights) is greatly reduced or eliminated.
A different type of systematic error is a bias applied to a single measurement. For instance, this can happen if we make a mistake in the measurement procedure. As a consequence, the resulting value will be significantly far from the other results.
It will be seen on the histogram as an isolated point, as in the following figure:
In this example we added to the previous 10 numbers the new value 10.5. This will produce a change of the average, from M=10.03 to M=10.07, and also an increase of the RMS from 0.14 to 0.19.
A simple way to handle these “wrong values” is to consider the trimmed average. This is an average obtained by eliminating the largest and the smallest measurements. Back to the previous example, we should remove the two entries as in the following figure:
In this case the new average will be M=10.06 and the RMS=0.13.
As you can see, the standard deviation σ (the RMS) has decreased a lot, but the average has not changed much. In fact, this is a very rough method to eliminate bias. More sophisticated methods are able to eliminate the “wrong values” in a more effective way. It is important to remark here that a single “wrong point” is already enough to produce a bias on the average. Of course, the situation gets worse if the number of “wrong points” is larger than one.
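The effect of the extra value 10.5, and of the trimmed average, can be verified numerically. The following sketch is only an illustration: the helper names `mean_and_rms` and `trimmed` are hypothetical, and the RMS is again computed by dividing by N, which is the choice that reproduces the numbers quoted in the text.

```python
import math


def mean_and_rms(values):
    """Return the Mean and the RMS deviation (dividing by N) of a list of values."""
    n = len(values)
    mean = sum(values) / n
    rms = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, rms


def trimmed(values):
    """Remove the single smallest and the single largest value."""
    return sorted(values)[1:-1]


original = [9.8, 10.2, 10.0, 10.0, 10.1, 10.3, 10.0, 9.9, 10.1, 9.9]
with_outlier = original + [10.5]

for label, data in [
    ("original", original),
    ("with 10.5", with_outlier),
    ("trimmed", trimmed(with_outlier)),
]:
    mean, rms = mean_and_rms(data)
    print(f"{label:10s} N={len(data):2d}  M={mean:.2f}  RMS={rms:.2f}")
```

The trimming removes both 10.5 and the "good" value 9.8, so the average stays shifted (10.06 instead of 10.03) even though the RMS returns to about its original size.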
In general, the RMS itself is a good parameter to identify “wrong” data. Almost all the values are indeed contained within “3 RMS” of the Mean, i.e. the distance of each value from the Mean is usually smaller than three times the RMS (for a Normal distribution this holds for about 99.7% of the values). In the previous example, the average is 10.06, and 3 RMS = 0.39.
So, almost all the values should be in the range 10.06 ± 0.39. It is easy to see that this condition accepts the “good value” 9.8 and discards the “bad value” 10.5 (see figure 1.6).
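The “3 RMS” selection can be sketched as follows. This is a minimal illustration under stated assumptions: the reference Mean and RMS are taken from the trimmed sample, as in the example above, and the helper name `mean_and_rms` is hypothetical.

```python
import math


def mean_and_rms(values):
    """Return the Mean and the RMS deviation (dividing by N) of a list of values."""
    n = len(values)
    mean = sum(values) / n
    rms = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, rms


values = [9.8, 10.2, 10.0, 10.0, 10.1, 10.3, 10.0, 9.9, 10.1, 9.9, 10.5]

# Reference Mean and RMS from the trimmed sample (largest and smallest removed)
mean, rms = mean_and_rms(sorted(values)[1:-1])

# Keep only the values closer than 3 RMS to the reference Mean
accepted = [v for v in values if abs(v - mean) <= 3 * rms]
rejected = [v for v in values if abs(v - mean) > 3 * rms]

print(f"range: {mean:.2f} +/- {3 * rms:.2f}")
print("rejected:", rejected)
```

Unlike the simple trimmed average, this cut keeps 9.8 and removes only 10.5.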
1.3. Statistic and Figure Skating
Let’s now go back to the example of the 100 m race with manual stopwatches. It is now clear that, in order to have an objective ranking of the runners, the uncertainty on the time measurements must be smaller than the time differences between the athletes. For instance, if such differences are of the order of 0.02 s, we need an error on the average of the order of (at most) 0.01 s. This would require about 400 manual stopwatches!
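The figure of 400 can be checked by inverting the usual relation between the error on the mean and the error on the single measurement, $\Delta M = R/\sqrt{N}$. The calculation assumes independent stopwatches, each with the resolution R = 0.2 s used earlier.

```latex
N = \left(\frac{R}{\Delta M}\right)^{2}
  = \left(\frac{0.2\ \mathrm{s}}{0.01\ \mathrm{s}}\right)^{2}
  = 20^{2} = 400
```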
We can now substitute the runners with the skaters, and the manual stopwatches with the judges of the jury. We have an “objective result” if the errors on the scores are small compared to the score differences between the skaters.
In other words, it is possible to consider a Jury equivalent to a group of manual stopwatches. There is of course the possibility that some judges (stopwatches) are biased: their measurements are very different from the others. The simple trimmed average is the correction applied to Figure Skating scores (and in many other sports as well). However, as we have seen, this method is not fully effective in removing the biased scores. Biased scores should be removed according to the score distribution, as I have shown in the previous example.
In the next chapter I will apply the previous considerations to the Sochi Ladies scores, with the aim to find possible bias in the results.
* In most cases the formula $\Delta M = R/\sqrt{N}$ can be used, where ΔM is the error on the mean, N is the number of measurements and R is the error on the single measurement.
† The σ can be evaluated by the following formula: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - M)^2}$, where N is the number of measurements, $x_i$ are the single measurements and M is their average.
‡ They are related by the following formula: $\Delta M = \sigma/\sqrt{N}$, where again N is the number of measurements and σ is the RMS.