Today I am writing another piece about figure skating, also another piece about data analysis in this event. But I am not focusing solely on the Olympics this time but on the best scoring athletes and the best scoring event of each athlete. There is a lot of data to go through, so let’s get right into it! The data comes from the (International skating Union) website which I downloaded on the 20/01/2018 (if you try my code and the results are different it might be because the data on the website has changed). Reproducible blogging is important!
Personal Best for athletes: a first look at the data
I first load up the files I compiled from the data found on the ISU website. The ladies’ and men’s event are separated into two different files because the data is separated on the website and it wouldn’t make any sense to compare anyways. Let’s check out a few lines of each the dataframes to see how they are structured:
## rank Name NationID Event
## 1 1 Evgenia MEDVEDEVA RUS ISU World Team Trophy 2017
## 2 2 Yuna KIM KOR XXI Olympic Winter Games 2010
## 3 3 Adelina SOTNIKOVA RUS XXII Olympic Winter Games 2014
## 4 4 Alina ZAGITOVA RUS ISU Grand Prix Final 2017/18
## 5 5 Satoko MIYAHARA JPN ISU Grand Prix Final 2016/17
## 6 6 Mai MIHARA JPN ISU World Team Trophy 2017
## Date Score Category
## 1 20.04.2017 241.31 S
## 2 25.02.2010 228.56 S
## 3 20.02.2014 224.59 S
## 4 09.12.2017 223.30 S
## 5 10.12.2016 218.33 S
## 6 20.04.2017 218.27 S
## rank Name NationID Event
## 1 1 Yuzuru HANYU JPN ISU Grand Prix Final 2015/16
## 2 2 Shoma UNO JPN ISU CS Lombardia Trophy 2017
## 3 3 Javier FERNANDEZ ESP ISU World Championships 2016
## 4 4 Nathan CHEN USA ISU Four Continents Championships 2017
## 5 5 Boyang JIN CHN ISU World Championships 2017
## 6 6 Patrick CHAN CAN ISU GP Trophee Bompard 2013
## Date Score category
## 1 12/12/15 330.43 S
## 2 16/09/17 319.84 S
## 3 01/04/16 314.93 S
## 4 19/02/17 307.46 S
## 5 01/04/17 303.58 S
## 6 16/11/13 295.27 S
This data is very rich, it’s perfect for us. Let’s break it down and look a little more specifically. Last time I was very interested in how France did. I was talking with a Turkish friend and she wondered what her fellow Turkish did in the sport, so let’s look at Turkish athletes for a change:
## rank Name NationID Event
## 293 286 Tugba KARADEMIR TUR European Championships 2008
## 420 410 Guzide Irmak BAYIR TUR ISU JGP Brisbane 2017
## 502 487 Sila SAYGI TUR European Championships 2013
## 550 531 Birce ATABEY TUR ISU CS Denkova Staviski Cup 2015
## 572 552 Ilayda BAYAR TUR ISU JGP Riga Cup 2017
## 573 553 Elif ERDEM TUR ISU JGP Croatia Cup 2015
## 714 685 Sinem KUYUCU TUR ISU CS Denkova Staviski Cup 2015
## 750 718 Selin HAFIZOGLU TUR ISU JGP Czech Skate 2013
## 784 748 Melisa Sema ATIK TUR ISU JGP Sencila Bled Cup 2012
## 822 781 Ekin SAYGI TUR ISU JGP Minsk Arena Cup 2017
## 845 802 Zeynep Dilruba SANOGLU TUR ISU JGP Cup of Austria 2015
## 1142 1073 Ecem ERTENLI TUR ISU JGP Bosphorus 2012
## 1183 1110 Duygu SEN TUR ISU JGP Austria 2011
## 1284 1210 Aybike KAHRIMAN TUR ISU JGP Bosphorus 2009
## 1293 1219 Buse COSKUN TUR Pokal der Blauen Schwerter 2004
## 1307 1233 Beril BEKTAS TUR World Junior Championships 2005
## Date Score Category
## 293 26.01.2008 138.73 S
## 420 26.08.2017 124.78 J
## 502 26.01.2013 119.27 S
## 550 23.10.2015 114.73 S
## 572 08.09.2017 113.06 J
## 573 10.10.2015 113.05 J
## 714 23.10.2015 101.95 S
## 750 05.10.2013 98.24 J
## 784 29.09.2012 96.70 J
## 822 22.09.2017 95.03 J
## 845 11.09.2015 93.52 J
## 1142 21.09.2012 73.84 J
## 1183 30.09.2011 70.75 J
## 1284 16.10.2009 52.52 J
## 1293 08.10.2004 49.54 J
## 1307 28.02.2005 41.34 J
As my friend warned me, Figure Skating is not really the national sport, they have yet to win a medal in the Olympics in this discipline but they do have a number of juniors that could provide an interesting suprise in a senior category. I havent’ really studied this category yet but if you’re interested in turkish figure skating they are bringing the ice dancing couple Alisa Agafonova / Alper Uçar.
Average Personal Best scores per country
Let’s look at the top 10 countries with best average PB:
## NationID mean_value
## 60 RUS 161.5981
## 35 JPN 152.4915
## 73 USA 147.2498
## 12 CAN 129.0514
## 38 KOR 126.5404
## 67 SWE 124.7574
## 13 CHN 124.5560
## 34 ITA 119.2320
## 21 FRA 118.2176
## 24 GER 116.3685
## NationID mean_value
## 49 RUS 194.9438
## 6 BEL 187.8933
## 60 USA 186.8308
## 12 CHN 185.5706
## 31 JPN 181.2509
## 19 FRA 172.8991
## 11 CAN 171.9211
## 52 SRB 170.6200
## 21 GEO 170.2317
## 39 MON 167.8100
Russia is well above any other country both in ladies’ and men’s events, as expected. But below that there is a difference between the two events: in ladies’ Japan is second, and to my surprise USA is actually 3rd, so they’re not that bad after all. In men’s events, Belarus is second and again USA is third. In both events France is in the top 10, it could be worse, and it looks like our men’s events are better. China is also in both top tens, their artistic component is really good. Canada is also in both lists although I expected them to be higher up than fourth and seventh respectively. Italy is in the ladies top ten but not in the men top 10. Other countries I don’t know so much about in this sport and were a bit of a surprise.
Rank vs score plot
Now we have a pretty good idea of the countries that usually come on top but let’s focus on the athletes rather than their countries now and look at the plot of scores and ranks:
The plot of rank according to score is interesting, it is a sigmoid. It shows a huge chunk of competitive athletes in the middle where the drop in rank and score is more linear. On the right side lower in the ranks, the scores tend to drop really fast under 50 points. On the left side up in the ranks, above 150 the scores become exponentially higher.
A linear regression model to predict the rank according to the score
Let’s build a very simple regression model which will work in the middle part of the plot. Regression models, are a way to predict given a value of rank here, the corresponding score. The reason this is interesting here, is say I am rank 601 overall in the world for PB and I want to know what score I will need to have to be rank 600 (come up a rank). This regression model will tell you, which doesn’t make you better at skating, but can help you understand what kind of an effort you need to put in to do that. So this model doesn’t take into account the fact that towards the ends of the plot it is no longer linear, so it is not perfect, but it is a model. Let’s see how good it is:
##
## Call:
## lm(formula = Score ~ rank, data = ISU_ladies_df)
##
## Coefficients:
## (Intercept) rank
## 175.2582 -0.1018
## (Intercept) rank
## 175.2582253 -0.1018119
## [1] "Predicted Score with rank 601: 114.069277725803"
## [1] "Predicted Score with rank 600: 114.171089618544"
So if I am rank 601 according to the model my score is 114.069277725803 and if I want to jump to rank 600 I have to have a score of 114.171089618544 so I want to add 0.1018119. But let’s see how good our model is by pulling up the actual numbers:
## [1] "Actual Score with rank 601: 108.77"
## [1] "Actual Score with rank 600: 108.86"
The model as expected is not perfect but the difference to get from one rank to the other in reality is 0.09. Let’s try to make it a little better!
Building a better model using only the linear part of the plot
We will subset only the part of the plot that is pretty linear and remove those extreme values at the ends and use the same commands as before on that subset.
##
## Call:
## lm(formula = Score ~ rank, data = new_model_df)
##
## Coefficients:
## (Intercept) rank
## 161.95347 -0.08542
## (Intercept) rank
## 161.95347226 -0.08541764
## [1] "Predicted Score with rank 601: 110.617471176053"
## [1] "Predicted Score with rank 600: 110.702888815125"
## [1] "Actual Score with rank 601: 108.77"
## [1] "Actual Score with rank 600: 108.86"
That’s much better! the difference to get from one rank to the other in reality is 0.09. The model still isn’t perfect because the plot isnt’ completely linear.
Testing the model on the men’s data
Let’s do the same thing for the men’s data:
##
## Call:
## lm(formula = Score ~ rank, data = men_model_df)
##
## Coefficients:
## (Intercept) rank
## 219.9326 -0.1731
## (Intercept) rank
## 219.9325997 -0.1731332
## [1] "Predicted Score with rank 601: 115.879553288167"
## [1] "Predicted Score with rank 600: 116.052686476926"
## [1] "Actual Score with rank 601: 118.59"
## [1] "Actual Score with rank 600: 118.65"
For the men’s data the difference to get from one rank to the other in reality is 0.06. It looks even better than in the ladie’s data.
There is lots more to do with this data but for today that’s all.
Sciathlete
Here is the code for those interested in how I did this:
#loading libraries and reading data from ladies' and mens' events
library(readODS)
ISU_ladies_df <- read_ods("ISU_events_PB_ladies.ods")
ISU_men_df <- read_ods("ISU_events_PB_men.ods")
#showing the first six rows of the dataframes
head(ISU_ladies_df)
head (ISU_men_df)
#subsetting the athletes from turkey
athletes_turkey <- ISU_ladies_df[ISU_ladies_df$NationID == "TUR",]
athletes_turkey
#calculating the mean score values and showing the first 10 values
library(plyr)
score_nations_ladies <- ddply(ISU_ladies_df, .(NationID), summarize, mean_value = mean(Score))
score_nations_ladies <- score_nations_ladies[order(-score_nations_ladies$mean_value),]
head(score_nations_ladies, n=10)
score_nations_men <- ddply(ISU_men_df, .(NationID), summarize, mean_value = mean(Score))
score_nations_men <- score_nations_men[order(-score_nations_men$mean_value),]
head(score_nations_men, n=10)
#generating the ladies plot
plot(ISU_ladies_df$Score, main="Score vs rank in Ladie's event", xlab="World rank of personal best", ylab="Personal best score")
#calculating the coefficients and predictions
scores.lm = lm(Score ~ rank, data=ISU_ladies_df)
scores.lm
coeffs = coefficients(scores.lm)
coeffs
pred_601 = data.frame(rank=601)
#looking up the actual values and comparing them
real_score_601 <- predict(scores.lm, pred_601)
print(paste0("Predicted Score with rank 601: ", real_score_601))
pred_600 = data.frame(rank=600)
real_score_600 <- predict(scores.lm, pred_600)
print(paste0("Predicted Score with rank 600: ", real_score_600))
real_score_601 <- ISU_ladies_df[ISU_ladies_df$rank == 601,]$Score
real_score_600 <- ISU_ladies_df[ISU_ladies_df$rank == 600,]$Score
print(paste0("Actual Score with rank 601: ", real_score_601))
print(paste0("Actual Score with rank 600: ", real_score_600))
#calculating the new model with only the linear part of the plot
new_model_df <- ISU_ladies_df[(ISU_ladies_df$rank < 1000 & ISU_ladies_df$rank > 175),]
plot(new_model_df$Score, main="Score vs rank in Ladie's event in linear part", xlab="World rank of personal best", ylab="Personal best score")
scores.lm = lm(Score ~ rank, data=new_model_df)
scores.lm
coeffs = coefficients(scores.lm)
coeffs
pred_601 = data.frame(rank=601)
real_score_601 <- predict(scores.lm, pred_601)
print(paste0("Predicted Score with rank 601: ", real_score_601))
pred_600 = data.frame(rank=600)
real_score_600 <- predict(scores.lm, pred_600)
print(paste0("Predicted Score with rank 600: ", real_score_600))
real_score_601 <- ISU_ladies_df[ISU_ladies_df$rank == 601,]$Score
real_score_600 <- ISU_ladies_df[ISU_ladies_df$rank == 600,]$Score
print(paste0("Actual Score with rank 601: ", real_score_601))
print(paste0("Actual Score with rank 600: ", real_score_600))
#applying the new model to the men's data
plot(ISU_men_df$Score, main="Score vs rank in Men's event", xlab="World rank of personal best", ylab="Personal best score")
men_model_df <- ISU_men_df[(ISU_men_df$rank < 750 & ISU_men_df$rank > 175),]
plot(men_model_df$Score, main="Score vs rank in Men's event in linear part", xlab="World rank of personal best", ylab="Personal best score")
scores.lm = lm(Score ~ rank, data=men_model_df)
scores.lm
coeffs = coefficients(scores.lm)
coeffs
pred_601 = data.frame(rank=601)
real_score_601 <- predict(scores.lm, pred_601)
print(paste0("Predicted Score with rank 601: ", real_score_601))
pred_600 = data.frame(rank=600)
real_score_600 <- predict(scores.lm, pred_600)
print(paste0("Predicted Score with rank 600: ", real_score_600))
real_score_601 <- ISU_men_df[ISU_men_df$rank == 601,]$Score
real_score_600 <- ISU_men_df[ISU_men_df$rank == 600,]$Score
print(paste0("Actual Score with rank 601: ", real_score_601))
print(paste0("Actual Score with rank 600: ", real_score_600))