PyWeek judging idea

Some issues came up this last PyWeek:
  1. Two people appeared to abuse the ratings system,
  2. Not all games were rated an equal number of times,
  3. There should have been DNW ratings when there weren't, and
  4. There should have been DQ flags when there weren't.
To this end, I'm proposing two changes:
  1. Allowing comments to be posted against games, even outside the judging period, and
  2. Altering judging to have it be performed by a smaller group of people. Each person would rate a smaller number of games, and we'd make sure that each game was rated by someone using each of the three major platforms.
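
To make change #2 concrete, here's a rough sketch of one way the assignment could work. The panel size, the platform grouping and the data structures are all invented for illustration, not a finished design:

    import random
    from collections import defaultdict

    def assign_panels(games, judges, panel_size=5):
        # judges maps judge name -> platform, e.g. {"alice": "windows", ...}
        by_platform = defaultdict(list)
        for name, platform in judges.items():
            by_platform[platform].append(name)
        load = {name: 0 for name in judges}  # games assigned to each judge so far

        def pick(pool, exclude):
            # least-loaded judge in the pool, ties broken randomly
            candidates = [j for j in pool if j not in exclude]
            random.shuffle(candidates)
            return min(candidates, key=load.__getitem__)

        panels = {}
        for game in games:
            panel = set()
            for pool in by_platform.values():   # one judge per platform first
                panel.add(pick(pool, panel))
            while len(panel) < panel_size:      # then top up from the full pool
                panel.add(pick(list(judges), panel))
            for judge in panel:
                load[judge] += 1
            panels[game] = sorted(panel)
        return panels

(A real version would also need to skip a judge's own entry and cope with judges dropping out, but that's the shape of it.)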
Thoughts?

Comments

I like it. As the popularity of PyWeek goes up it's going to become less and less feasible to judge all of the games. I thought I might not get through all of them, so I compiled a list of games and then shuffled it so I could test in a random order, rather than just starting at the top alphabetically like I did last time (though last time I did get through them all). As it turned out, I didn't quite make it through my randomized list, and it didn't actually have all of the games anyway (it was based on the list of games in the torrent). So yes, I think it would be helpful to come up with some way to get a more even distribution of the number of ratings.

I don't understand which of the issues #1 is supposed to address.

As for #2, are you suggesting that the competition would no longer be peer-reviewed? I'd be wary of moving to an intentionally smaller pool of judges for each game; there's already quite a big swing factor in terms of how much your game happens to appeal to the players' particular preferences (as seen by the widely-varying scores for 'fun' which games receive in the current system), and decreasing the number of judges would only increase that (unless you moved away from 'fun' as a scoring category to something less subjective).

Support (from both the PyWeek judging guidelines and the web interface) for bjorn's suggestion of giving judges a random order to play the games in seems like it would address your concern #2; I've seen other contests use this very effectively. #3, I think, is just a case of interface design, and the discussions taking place in other threads seem to be generating sensible suggestions for making sure DNWs don't come through as 1/1/1s.
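
On the random-order point, the interface side could be as simple as this sketch (seeding with the judge's id is my assumption; any stable per-judge seed would do):

    import random

    def play_order(all_games, judge_id):
        # Each judge gets their own random ordering of the full game list,
        # so no game benefits from sorting first alphabetically. Seeding
        # with the judge's id keeps the order stable across visits.
        order = sorted(all_games)
        random.Random(judge_id).shuffle(order)
        return order

Everyone could still rate everything; keen judges just work through their own list in order, and the early slots get spread evenly across games.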

As far as #1 goes, I've often wondered whether there's a deliberate reason not to normalise users' votes so that they all have the same mean (and variance?). This would eliminate a further random factor arising from the fact that different judges may have different thresholds for the different scores (of course, it does mean you're not strictly judging each game on its own merits any more).
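
For the record, the normalisation I have in mind is just rescaling each judge's scores to a common mean and spread; a minimal sketch (the handling of a judge who rates everything identically is a guess at sensible behaviour):

    from statistics import mean, stdev

    def normalise(scores):
        # Rescale one judge's scores to mean 0 and standard deviation 1
        # (a z-score), so harsh and generous judges pull equal weight.
        m = mean(scores)
        s = stdev(scores) if len(scores) > 1 else 0.0
        if s == 0:
            return [0.0] * len(scores)  # judge rated everything the same
        return [(x - m) / s for x in scores]

So a harsh judge's [2, 3, 2] and a generous judge's [4, 5, 4] come out as the same pattern of relative preference.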

Addressing #4 needs a rethink of the criteria by which a game can be disqualified from the competition. Here I think having a smaller panel of judges who are qualified to make that decision *would* be useful, particularly since the DQ criteria should be matters of fact rather than opinion. You could still allow participants to flag a game as possibly worthy of disqualification, but leave the actual decision-making to a trusted group.

+1 for pre-assigned game lists. I think 10-15 games is pretty much my limit before I stop putting real effort into judging them. I made it through nearly all of the games this time, but I don't think I gave every game a fair shake. I think there should be moderators too, to throw out unjustified low (or high) scores. Preferably moderators who are not participating in the competition. I think that the lowest scores should always require justification.
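
A sketch of what that moderation queue might look for (the two-standard-deviation cutoff and the record layout are made up, and a real rule would need tuning):

    from statistics import mean, stdev

    def flag_for_moderation(ratings, threshold=2.0):
        # ratings: list of (judge, score, comment) tuples for one game.
        # Flag any score far from the game's mean that arrives with no
        # justifying comment, for a non-competing moderator to review.
        scores = [score for _, score, _ in ratings]
        if len(scores) < 2:
            return []
        m, s = mean(scores), stdev(scores)
        if s == 0:
            return []
        return [(judge, score) for judge, score, comment in ratings
                if abs(score - m) / s > threshold
                and not (comment or "").strip()]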

I'm not so sure about normalising judges' scores. If this is done, I would like to see it tested against previous PyWeeks first. It's important for the scores to be comparable between weeks (although it's debatable whether they are at this time), and normalising per judge might skew the results.

And Adam, my impression is that the comments are there so that people can comment on games without having to be a judge (it lets people who didn't do the week still say they liked a game). Actually I wouldn't mind if scoring were opened to the general public after the judging period is over. Let's see what non-PyWeekers think of our games ;)

I quite like the idea of assigning games to judges; however, rather than having a small pool of games per judge, it seems better to force people to judge in a certain order. This would mean you could still judge all the games if you were keen enough, but games rumoured to be good, or games alphabetically first, would not get more votes.

As for normalising: that means the rankings won't be consistent between PyWeeks, which in my opinion is bad.

Finally: comments. I agree, comments are good. Maybe allow ranking too, but send it to a different page, so that improved versions don't get mixed up with the PyWeek versions.

I'm against restricting certain games to certain judges, for two reasons. First, I like the everyone-is-equal nature of the way it is, and as Adam said, this would compromise that.

Second, not everyone can get every game to work for them, and there's no way of telling ahead of time what will work for who. If this were done, there would have to be some way for a judge to say "I can't get this to work, give me another one."

When I originally envisaged this challenge, I intended that games that didn't work be disqualified. Clearly, given that more than half of the games are submitted in such a state, that isn't reasonable.

Well, why not simply have it so that if a game does not work it gets penalized; like, the 1/1/1 rating that usually comes from a DNW takes effect?

That way they aren't necessarily DQ'd automatically - but that also gives an incentive for teams to try and make it work for everyone...
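
To put rough numbers on that (scores invented for illustration, and assuming DNWs are currently just left out of the average):

    ratings = [4, 5, 3, 4]  # judges who could run the game
    dnws = 2                # judges who couldn't

    ignoring_dnws = sum(ratings) / len(ratings)                       # 4.0
    dnws_as_ones = (sum(ratings) + 1 * dnws) / (len(ratings) + dnws)  # 3.0

Two failed runs out of six would drag a solid 4.0 down to a 3.0, which is a big incentive to package carefully.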

On the other hand, a bunch of DNWs come from people not reading the README and not realising they're missing a dependency. If this results in a DNW as it often does at the moment, it's not a huge issue; if it resulted in a 1/1/1, pulling your score down a lot, it would be a lot more frustrating.

Also, while I do think people should be aware of good practices for ensuring cross-platform compatibility (like treating filenames as case-sensitive), it may not be practically possible to test your game on all platforms, simply because you don't have access to them. I'd very much hope we didn't see a change whereby any team without access to at least, say, five computers (WinXP, Vista, OS X and a couple of different Linux distros) was effectively prevented from ever scoring well in the contest. (And different GL implementations make things even worse; I don't think it's practical within a week to make your game cope properly with all the possible vagaries of different users' graphics hardware, especially if you're doing anything complicated. Last PyWeek, we tested Wound Up! on all the machines we could, but still ended up with 17% DNW due to GL problems on other systems.)

@adam: when the majority of cross-platform problems stem from filename case sensitivity, I think it's reasonable to expect people to get that right.

How about splitting the voting process into two phases:

  • Week 1: Everyone votes for their 10 favorite games. I guess you'd have to do this separately for individual and group entries. You can still leave comments on all games, but no numerical ratings yet.
  • Week 2: Games which end up in the top 10 go through to this phase. Voting is done as always (Fun, Innovation, Production).

This would have the effect of reducing the number of entries people have to grade, while hopefully giving approximately the same amount of attention (and votes) to the top contenders. The downside is low-ranked games won't receive numeric ratings (they could be ranked by votes), but most of the time these ratings are not really significant and it's the comments that matter the most.
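
Counting the first phase would be straightforward; for example (the ballot structure here is assumed):

    from collections import Counter

    def finalists(ballots, top_n=10):
        # Each ballot is a set of up to 10 game names the voter liked best;
        # the top_n games by vote count go through to the usual
        # Fun/Innovation/Production judging in week 2.
        tally = Counter()
        for ballot in ballots:
            tally.update(ballot)
        return [game for game, votes in tally.most_common(top_n)]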

I also agree that the DQ/DNW situation should be revised. It's quite hard to test the game on all platforms, so DNWs are okay as long as they're fair across the board (i.e. not counted as 1/1/1 for some teams but not others). DQ should be more strictly enforced by a special judge panel imho. Maybe put a review process in place after the second voting to check for these things.

@richard: Maybe my post wasn't clear. I was saying "I think people ought to be able to avoid stupid errors like expecting filenames to be case-insensitive, but there are plenty of other things that could go wrong with your game on platforms you don't have access to that aren't so easy to prevent".

Well, I think a simple rule that you may not rate a game until you have given it a fair chance would help, i.e.:

  • finish the game, or at least play it for 15-30 minutes,
  • read the README/in-game instructions,
  • look through the wiki for bug fixes to make sure it's not your setup, and
  • get a traceback of why it didn't work.

I think that would make it so that DNWs can carry a score penalty (since the failure is then clearly not the fault of the judge), and it would also handle the problem of people just blank-rating games.
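
As a sketch of how that could be enforced in the rating form (the field names here are invented for illustration):

    def valid_rating(rating):
        # rating is a dict submitted from the rating form
        if rating.get("dnw"):
            # a DNW only counts if the judge can show what went wrong
            return bool((rating.get("traceback") or "").strip())
        scores = [rating.get(k) for k in ("fun", "innovation", "production")]
        if not all(isinstance(s, int) and 1 <= s <= 5 for s in scores):
            return False
        if min(scores) == 1 and not (rating.get("comment") or "").strip():
            return False  # very low scores need a justification
        return True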

You would probably need a few "super" judges, who have the same rating power as everyone else but also moderate the ratings, to make sure there aren't any bogus/DOH! ones given.

This may limit the number of games each judge can rate, but as long as you do the random list thing it should be fine...