Pyweek 25 Judges Investigation
Thank you for bearing with me while I investigated some reports of unjustified low scoring in Pyweek 25.
I received a number of complaints about low scores associated with comments including:
- I was completely confused by this entry (etc)
- Yeah.... no...
I downloaded the reviews from the Pyweek database and analysed them in a Jupyter notebook. One entrant stood out as responsible for many of the low ratings. However, this person was not universally biased. Here's a plot of their ratings (slightly randomised to preserve anonymity), with the overall rating of each game on the x axis and the user's rating on the y axis:
For an unbiased user this distribution should approximate a straight line. The low scores that I've circled appear to be unexpectedly low, although the majority of reviews do correlate well. For comparison, here is a similar plot of my own reviews:
I wanted to construct a more appropriate statistical test than "this looks low". I considered reviews that are more than 2 standard deviations below the mean score for each game. This finds low scores for games that are consistently rated highly, while allowing low scores for games that split the crowd. Under this test I find 14 low reviews.
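The test described above can be sketched roughly as follows. This is a minimal illustration, not the notebook code: the `(reviewer, game, score)` tuple layout, the function name `find_low_reviews`, and the sample data are all assumptions, as is the choice of the population standard deviation (the sample standard deviation would give slightly different numbers).

```python
from collections import defaultdict
from statistics import mean, pstdev


def find_low_reviews(reviews, threshold=2.0):
    """Flag reviews more than `threshold` standard deviations below
    the mean score of the game they rate.

    `reviews` is a list of (reviewer, game, score) tuples; this layout
    is hypothetical, not the Pyweek database schema.
    """
    by_game = defaultdict(list)
    for _reviewer, game, score in reviews:
        by_game[game].append(score)

    flagged = []
    for reviewer, game, score in reviews:
        scores = by_game[game]
        spread = pstdev(scores)  # assumption: population standard deviation
        if spread == 0:
            continue  # every reviewer agreed, so nothing can be an outlier
        z = (score - mean(scores)) / spread
        if z < -threshold:
            flagged.append((reviewer, game, z))
    return flagged


# Made-up data: "a" is consistently rated highly, "b" splits the crowd.
sample = (
    [(f"r{i}", "a", 4) for i in range(9)]
    + [("x", "a", 1)]
    + [(f"r{i}", "b", s) for i, s in enumerate([1, 5, 1, 5, 1, 5])]
)
for reviewer, game, z in find_low_reviews(sample):
    print(reviewer, game, round(z, 2))
```

In the made-up data only the single low score on game "a" is flagged. The low scores on game "b" are not, because the wide spread of opinion on that game raises its standard deviation.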
Some of these reviews are associated with longer or more specific comments (the number before each comment is how many standard deviations that review sits from the game's mean score):
- -2.18448239424969 Can't play on level 2. Screen blinking is too painful.
- -2.459674775249769 Unacceptable behavior by the participant, including the use of sexualized language or imagery and unwelcome sexual attention or advances. This breaks our community guidelines / rules.
- -2.2068053204820104 This game, even when working, its plainly average, average mechanics, average production, average enjoyment. Although the optimizations you pulled made the game silky smooth and I can give credit for that :)
- -2.2051543778147487 I was never a fan of tetris in the first place, but playing this makes me appreciate it. The decision to make the tiles different shapes was a bad one. It was confusing and i'm pretty sure there is no shape to fill in a hole on the very left side of the level. I really wish I could say something positive about this, but I couldn't really find anything that I liked
- -2.3148868868049037 I was completely confused by this entry, the obvious premade assets, the confusing dialog and backgroud, the fact that this pretty much relates nothing to the challenge, ill pass on this.
- -2.478902576381991 Collision detection broken, poor art, I dont even know whats going on. Does not seem to be related in the slightest.
The reviews in the other category are all very terse and weakly justified:
- -2.02409725481996 Yeah, i get the concept, the implementation is blahhh.
- -2.2387210960990034 Yeah.... no...
- -2.1522169248014134 Almost not playable
- -2.483773282901211 Not really much of a game
- -2.050609665440988 Ehhh... nothing special
- -2.085665361461421 Yuk
- -2.3435206489194926 what what
And this one would appear to indicate that the game did not work:
- -3.1427117740247756 please make a simple, easy way to run the game consistent with the readme.
All of the reviews in this latter category are associated with the entrant identified above. Removing some of these reviews (such as the latter category alone) could change the individual competition winner, but this is highly dependent on the criteria used to select which to remove.
I believe that these unduly low ratings, specifically when accompanied by no real justification or constructive criticism, are not fair to the entrants who received them.
I believe the entrant who left them did not do so maliciously, as there is no clear pattern to them.
Vacating these specific ratings would change the result of the individual competition, narrowly making Flip by Tee the winning entry. Vacating all of the above low ratings, including the one gummbum complained about, would result in no change. No other overall positions would change under any scenario I considered.
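To illustrate how the outcome can depend on which ratings are vacated, here is a minimal sketch of re-ranking under a given scenario. It assumes, purely for illustration, that each game is ranked by the plain mean of its review scores. The actual Pyweek tallying may differ, and the `(reviewer, game, score)` layout, the `ranking` function and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def ranking(reviews, vacated=frozenset()):
    """Rank games by mean score, best first, ignoring vacated reviews.

    `reviews` is a list of (reviewer, game, score) tuples and `vacated`
    is a set of (reviewer, game) pairs to leave out. Both layouts are
    assumptions made for this sketch.
    """
    by_game = defaultdict(list)
    for reviewer, game, score in reviews:
        if (reviewer, game) not in vacated:
            by_game[game].append(score)
    averages = {game: mean(scores) for game, scores in by_game.items()}
    return sorted(averages, key=averages.get, reverse=True)


# Made-up data: one very low score holds game "q" below game "p".
sample = [
    ("r1", "p", 5), ("r2", "p", 4), ("r3", "p", 4),
    ("r1", "q", 5), ("r2", "q", 5), ("r3", "q", 5), ("x", "q", 1),
]
print(ranking(sample))
print(ranking(sample, vacated={("x", "q")}))
```

Running the same function with different `vacated` sets is one way to compare scenarios side by side, such as vacating only the terse reviews versus vacating every flagged review.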
There is a precedent for vacating some of the ratings, in that Richard has sometimes removed ratings in response to a challenge. However, the rules do not explain how such discretion should be applied. I am reluctant to strip winners of their title several days afterwards based on a subjective analysis.
The course of action I have settled on is:
- To remove specifically these weakly justified exceptionally low ratings, and adjust tallies etc.
- To appoint Flip as a joint winner of Pyweek 25, alongside The Desert And The Sea.
- To put in place clearer rules around challenges to ratings and the extent of judges' discretion in such matters.
- To change the rating system so that ratings are no longer left anonymously, in time for Pyweek 26.