Universita di Salerno
An Exercise: Correcting the Scores
The statistical analysis has shown that the scores contain several biases, in both the Technical and the Components elements. Is it possible to “correct” the scores? In general it is not easy to rectify a result affected by systematic bias. The correction depends on the assumptions that one has to make about the systematic effect itself, so there is no unique, simple way to proceed.
In the following I’ll present an attempt to make such corrections, based on some “reasonable” assumptions. Let me stress here that this should be considered just an exercise, without any claim of “truth”.
3.1. Global Bias
The effect of global bias was discussed in Chapter 1. It consists of a global, fixed shift of all the values. It is clear that this shift doesn’t affect the final ranking of a competition, so it is not necessary to apply any correction, unless one wants to compare different competitions. This is the case when people look for “World Records”, the largest scores ever achieved. It should be clear that world records don’t have any absolute value, as they depend on the particular Jury. In general, scores from different competitions cannot be compared in a simple way. Instead, what can be compared are differences, as explained in Chapter 1. In order to give some examples, I selected four different International Competitions:
the Olympic Winter Games 2010 (OWG2010), the Olympic Winter Games 2014 (OWG2014), the World Championship 2013 (WC2013) and the World Championship 2014 (WC2014).
Let’s start by comparing the total scores for OWG2014 and WC2014.
The distributions of the total scores are reported in figure 3.1.
Both the Means and the RMS look comparable.
The following figure reports the same distributions for WC2013 and OWG2010:
Again, the two distributions look compatible. However, if we now compare them with the previous pair, we find a difference of about 5 points in the Mean. In other words, if you want to compare the results of these two pairs of competitions, you need a global shift of about 5 points on the total.
So, what about the World Records? We first have to correct for this global bias. The simplest way to do that is to subtract the mean from each score (*), as one does with the tare weight on a balance. To be more specific, let’s compare the following competitions again:
OWG 2010 ― WC 2013 ― OWG 2014 ― WC 2014.
The following table reports the averages (Mean) for the total, short and free programs. The averages are evaluated by excluding the first skater (†).
Using the previous table, it is possible to re-evaluate the scores of the top skaters in these four specific competitions. The results are presented in table 3.2.
It is clear that, after correction, all the World Records were set in the first competition, namely the 2010 Olympic Winter Games.
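The mean subtraction described above can be sketched in a few lines. This is only an illustration: the function name `remove_global_bias` and the two score lists are hypothetical, not data from the competitions discussed, and the mean is taken without the top score, following the convention stated in the footnotes.

```python
from statistics import mean

def remove_global_bias(scores):
    """Subtract the competition mean from every score.

    The mean is computed without the largest score, i.e. excluding
    the first skater, as in the footnote convention.
    """
    ranked = sorted(scores, reverse=True)
    baseline = mean(ranked[1:])  # drop the winner before averaging
    return [round(s - baseline, 2) for s in scores]

# Hypothetical total scores (not real data): the second jury is
# assumed to score every skater 5 points higher than the first.
competition_a = [210.0, 200.0, 190.0, 180.0]
competition_b = [215.0, 205.0, 195.0, 185.0]

print(remove_global_bias(competition_a))
print(remove_global_bias(competition_b))
```

With these made-up numbers the two corrected lists coincide, which is the sense in which a fixed shift leaves rankings and differences untouched while making the raw maxima incomparable.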
Another type of “World Record” that is sometimes considered is the largest score difference, usually between the first and the second skater. These differences are “safe” from the global bias; however, they depend on the particular combination of skaters. In any case it is interesting to note that the largest differences ever observed are the following:
23.06 for the Olympic Winter Games,
20.42 for World Championships,
36.04 for International Competitions (Grand Prix).
In all the cases the first skater was Kim Yuna from Korea.
3.2. Single Skater Bias
A more insidious bias comes from “single skater bias”, i.e. a bias which is applied to a single skater. This doesn’t necessarily mean “cheating”. It is very likely that a Judge will favor a skater of his own nation, and this explains why the Jury is typically composed of Judges from the main nations involved in the competition. On average, all the “biases” introduced by the Judges tend to cancel each other. In some cases, however, the bias may favor one (or more) specific skaters. More details on the amount of bias that can be produced by one or more judges are reported in Appendix 3.
Let’s now consider separately the Technical Elements and the Program Components.
3.2.1. Technical Score
I have shown in the first Chapter that the “trimmed average” (discarding the largest and the lowest values) in general doesn’t work. It is the easiest way to make a correction, and this is the reason why it is used in almost every sport where a jury is involved. However, it is not very effective in eliminating the “wrong” values, which is what a correction should do. Another criterion is therefore needed to correct the score.
In paragraph 2.1.2 I have shown a way to give a quantitative (probabilistic) indication of the “fairness” of a judge. A more effective correction consists in eliminating from the Mean all the “biased judges”, according to this indication. My assumption then is to exclude all the scores from judges who have a very large “N” (see figure 2.6). For this exercise (Free Program) the threshold was set at the arbitrary values of 7, 8 and 9 in three different trials (i.e. all the judges with N larger than 7, 8 or 9 were excluded).
As the final result is not strongly sensitive to this threshold, the value of 8 was finally used. The new averages are then scaled according to the SOV, and summed for each skater. The whole exercise was finally repeated for the Short Program, now with a threshold of 6.
Note that by construction the new TE scores will be in general smaller than the official ones.
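To make the difference between the two criteria concrete, here is a minimal sketch comparing a trimmed average with an average restricted to the judges whose “N” stays within the threshold. The marks, the per-judge N values and the function names are all hypothetical, and the scaling by the SOV is deliberately left out.

```python
from statistics import mean

def trimmed_mean(marks):
    """Mean after discarding one highest and one lowest mark."""
    ordered = sorted(marks)
    return mean(ordered[1:-1])

def filtered_mean(marks, n_values, threshold):
    """Mean over the judges whose N does not exceed the threshold."""
    kept = [m for m, n in zip(marks, n_values) if n <= threshold]
    if not kept:
        raise ValueError("threshold excludes every judge")
    return mean(kept)

# Hypothetical marks from nine judges for one element, and a
# hypothetical N for each judge (both invented for illustration).
marks = [1, 1, 2, 1, 3, 1, 2, 3, 1]
n_values = [2, 3, 4, 1, 10, 2, 5, 9, 3]

print("trimmed:", trimmed_mean(marks))
for threshold in (7, 8, 9):
    print("N <=", threshold, ":", filtered_mean(marks, n_values, threshold))
```

In this invented case the trimmed average removes only one of the two high marks, while the N-based filter with a threshold of 7 or 8 removes both judges flagged as biased, so the filtered mean comes out lower.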
3.2.2. Components Score
As already observed, in this case the procedure described in 2.1.2 will not help because of the low number of elements (5).
A correction of the components score, however, can be performed on the basis of figures 2.12 and 2.12. A “safe” choice could be to take a weighted average of all the results, i.e. the results of all the recent International Competitions. This basically corresponds to considering a much larger number of judges. Clearly, the actual event (the Olympic Event) should have the largest weight, since we want to determine a score for this specific competition. So, a “fair” choice could be to give a 50% weight to the Olympic result, and 50% to the average of the other events. However, in order to be as conservative as possible, I set this weight to 2/3 (0.66) for the Olympic scores and 1/3 (0.33) for the others. Let me stress that for most of the skaters the exact value of these weights doesn’t strongly change the resulting score.
For example, the PC scores for skater Asada (Short Program) are the following:
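$PC_{\mathrm{OWG}} = 33.88$ (OWG2014), $PC_{\mathrm{other}} = 33.66$ (average over other competitions)
So, if we take the weight factors as 50% and 50%, we obtain the corrected value
$PC_{\mathrm{corr}} = 0.5 \times 33.88 + 0.5 \times 33.66 = 33.77$
If the weights are set to 2/3 and 1/3, the corrected value will be:
$PC_{\mathrm{corr}} = \tfrac{2}{3} \times 33.88 + \tfrac{1}{3} \times 33.66 \approx 33.81$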
For this skater the difference between the two numbers is really small, just a few hundredths of a point. However, for some skaters the difference is not negligible.
To evaluate the new PC scores we have to repeat this exercise for both SP and FP, and for all the skaters. Note that, as for the TE, the new scores are expected to be on average smaller than the official ones (because of the global shift).
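The weighting can be written as a one-line function and applied to the two PC values quoted above. The function name `corrected_pc` is only illustrative; the input values are the ones given in the text for this skater.

```python
def corrected_pc(olympic, others, weight):
    """Weighted average of the Olympic PC score and the mean PC
    score from the other competitions; `weight` is the Olympic share."""
    return weight * olympic + (1.0 - weight) * others

olympic_pc = 33.88  # OWG2014, Short Program (value quoted in the text)
others_pc = 33.66   # average over the other competitions (quoted in the text)

print(round(corrected_pc(olympic_pc, others_pc, 0.5), 2))    # 50% / 50%
print(round(corrected_pc(olympic_pc, others_pc, 2 / 3), 2))  # 2/3 and 1/3
```

The same function can be applied skater by skater; the corrected value moves away from the Olympic score only when the Olympic score and the average over the other events differ appreciably.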
3.2.3. Total Score
Finally, to get the new Total Scores we still have to sum all the partial scores, namely the TE and the PC for both SP and FP. Note that for some skaters some deductions also have to be applied (due to one or more faults).
The final results for the first 12 skaters in Sochi’s ranking are presented in table 3.3.
It can be seen that most of the scores are almost unchanged (all are slightly reduced, as expected), as is the final classification, with two important exceptions: the first and second places, and the fifth and sixth places. This result is clearly a consequence of what was previously discussed.
Finally, we need an estimate of the “errors” associated with these new scores, or in other words, an estimate of the range in which they should be found. The largest uncertainty in the final result comes from the relative weight of the Components Scores. As discussed before, a conservative value of 2/3 has been used. In order to get an approximate indication of the associated uncertainty, it is possible to vary this weight within a “reasonable” range. I used as limits the values 0.5 and 0.7. The results are presented in table 3.4, which reports the “official total scores”, the two “limits” (the values corresponding to the two weights above), and the average of the two. The reported “errors” correspond to the half-difference between the two limits; they are not statistical errors!
The same result is shown here in graphic form. The vertical bars (“error bars”) indicate the range from the minimum to the maximum value, as reported in the previous table.
As expected, the error bars are larger for the points with larger distances from the official scores. Also, it is interesting to note that Kostner’s position moves up to second place when we consider the average (see table 3.4).
As remarked before, the error bars presented in the table don’t have a true statistical meaning; they can be considered a rough estimate of the range where the unbiased scores should be found. Note that the values reported in table 3.3 correspond to the “very conservative” conditions (weights = 2/3 and 1/3).
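The range estimate described above (two limiting weights, their average, and the half-difference quoted as the “error”) can be sketched as follows. The function name `weight_range` is hypothetical, and for simplicity the sketch is applied to a single PC score, using the Short Program values quoted in section 3.2.2, rather than to a full total score.

```python
def weight_range(olympic, others, w_low=0.5, w_high=0.7):
    """Evaluate the corrected score at the two limiting weights.

    Returns the two limits, their average, and the half-difference
    between them (a range indicator, not a statistical error).
    """
    low = w_low * olympic + (1.0 - w_low) * others
    high = w_high * olympic + (1.0 - w_high) * others
    center = (low + high) / 2.0
    half_width = abs(high - low) / 2.0
    return low, high, center, half_width

# PC values quoted in the text for one skater (Short Program).
low, high, center, half_width = weight_range(33.88, 33.66)
print(f"limits: {low:.3f} and {high:.3f}")
print(f"average: {center:.3f} +/- {half_width:.3f}")
```

Any part of the total that does not depend on the weight (for instance the TE score) adds the same constant to both limits, so it shifts the average but leaves the half-difference unchanged.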
* In principle the Mean should be evaluated by excluding the specific score that we are considering, i.e. the largest one.
† This changes the Mean by about 1 to 2 points.