Statistics
Hey guys, I've seen that some people have been looking at stats of various kinds over in the other thread, so I thought I'd add my own contribution. I've scraped the Pyweek website for data, which is available here and here. The format should be fairly self-explanatory, but let me know if you have any issues.
One of the issues that I've heard people batting about is a perceived decline in the quality of games recently. I decided to check whether such a trend shows up in the scores - but there doesn't seem to be an obvious one.
As you can see in the graph, average Pyweek scores have stayed pretty steady since the beginning, with some weeks having higher scores than others, but no general trend. Given that, in my opinion, standards of Pyweek voters have risen substantially, at least in terms of production and general polish, it seems that our games can't be getting that much worse. I do wonder whether people are looking back with slightly rose-tinted glasses at earlier Pyweeks, remembering the better games and forgetting the bad ones.
Another interesting point here is that the average score for a Pyweek game is well below 3! In fact, average fun scores (around 2.66) are considerably lower than those for production and innovation (both in the 2.9-3.0 range). I'm not sure why this is, although it may be that some people are subconsciously thinking of 2.5 as average, rather than 3.
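For anyone who wants to reproduce those averages from the scraped data, something along these lines should do it - note that the 'fun', 'production' and 'innovation' column names are placeholders, so adjust them to whatever the files actually use:

import csv
from collections import defaultdict

def category_means(path):
    # Average each rating category over all rows of the scraped CSV.
    # Assumes one row per rated game with numeric columns named
    # 'fun', 'production' and 'innovation' (placeholder names).
    totals = defaultdict(float)
    count = 0
    with open(path) as f:
        for row in csv.DictReader(f):
            for cat in ('fun', 'production', 'innovation'):
                totals[cat] += float(row[cat])
            count += 1
    return {cat: total / count for cat, total in totals.items()}

print(category_means('pyweek_scores.csv'))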
Finally, something interesting I've found is that across all three categories, there's a strong correlation between the number of votes received and the scores given.
There are two possibilities here. One is that more people play the good games, either because they're less buggy, or because they have more interesting screenshots and generally drum up more enthusiasm among players, who tell others. The other possibility is a bit more worrying. It may be that those voters who vote on everything are less generous with their scores than those who only vote on a few games. In this case, which games get voted on more could be having a large effect on scores.
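If the scraped data happens to break ratings down per voter (I'm not sure it does - the 'voter' and 'score' column names below are made up), the second possibility could be checked directly by seeing whether the people who rate lots of games also hand out lower scores on average. A rough sketch:

import csv
from collections import defaultdict

def voter_generosity(path):
    # For each voter, compute (number of games rated, mean score given).
    # Assumes one CSV row per individual rating with 'voter' and 'score'
    # columns - adjust to however the real files are laid out.
    scores = defaultdict(list)
    with open(path) as f:
        for row in csv.DictReader(f):
            scores[row['voter']].append(float(row['score']))
    return {v: (len(s), sum(s) / len(s)) for v, s in scores.items()}

A negative correlation between those two numbers across voters would point towards the "prolific voters are less generous" explanation rather than the "good games attract more players" one.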
I'd like to know if anyone has any thoughts on this stuff, and please do post if you do anything else interesting with the data.
PS Richard, the voting comments pages are outputting unsanitised HTML - not only does this make it quite difficult to scrape (a Python traceback quite often includes a lot of < and > characters), but it may be a minor security risk.
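To illustrate the scraping point (this is a generic example, not the actual Pyweek templating code): a traceback pasted into a comment is full of things like <module>, which any HTML parser will treat as a tag unless the text is escaped before it's written into the page.

import html

comment = ('Traceback (most recent call last):\n'
           '  File "run_game.py", line 1, in <module>')

# Written into a page verbatim, "<module>" reads as a tag to parsers and
# browsers alike (and unescaped user input is also the seed of a
# cross-site-scripting risk). Escaping first keeps it as visible text:
print(html.escape(comment))   # ... in &lt;module&gt;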
Comments
Pyweek   Points   Pearson r   t-value
   2       36       0.324      1.997
   3       30       0.456      2.714
   4       53       0.127      0.914
   5       50       0.250      1.787
   6       57       0.459      3.835
   7       55       0.202      1.505
   8       57       0.410      3.337
   9       53       0.345      2.628
  10       48       0.338      2.433
That's the number of points, the Pearson correlation coefficient, and the corresponding t-value. (A t-value of 3 is highly significant.) Obviously some of them are more significant than others. Anyway, it doesn't surprise me at all. I think that Martin's first hypothesis is correct. Buggy and boring games tend to get fewer responses.
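For anyone who wants to recompute these, or do the same for the individual categories, the figures are just the textbook formulas - Pearson's r, then t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. A minimal pure-Python sketch (the function names are mine):

import math

def pearson_r(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def t_value(r, n):
    # t statistic for testing r against zero, with n - 2 degrees of freedom.
    return r * math.sqrt((n - 2) / (1 - r ** 2))

As a sanity check, t_value(0.459, 57) comes out at about 3.83, which matches the Pyweek 6 row above up to rounding.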
I probably shouldn't have used "strong" to describe these correlations; "significant" would have been more apt. However, they're certainly there, and at levels which could definitely impact the results of the competition.
It might be that the higher scoring games tend to have more impressive screenshots, which draw more people to play them. If word of mouth is a factor, I'm not seeing much of it on the message boards during judging, but I don't know about IRC or other communication channels.
Then, of course, I don't always have time to come back to it later. This time there were two games like that for me, but I imagine it's more for some people.
It doesn't surprise me much that the average Pyweek score is below 3. My guess is that people think of 3 as the average score according to a certain standard, and not relative to all other games. For example, one might think that a 3 for production means "it's nicely produced but it has nothing special" instead of "average production across all games". In that sense, if there are plenty of games that "aren't nicely produced", it makes sense for your own production ratings to average below 3.
If this guess is true, then that means there are a lot of Pyweek games that are below voters' standards.
I'm sometimes guilty of what Cosmologicon said: there are some games that crash, or are buggy, or that I can't figure out how to play, and I end up setting them aside for later so I can rate them more fairly. Normally I try to rate most of the games, but this Pyweek I left some of them for the last day and ended up not having time for them. (Note: I also end up not rating games that have some less common dependency I don't have the patience to install, since it doesn't make sense to give them a DNW.) But I'm not sure how widespread this is. If it is common, it might explain the correlation. Probably the other hypothesis, that people play the games that look more interesting, also accounts for some of it.
Further questions:
I wonder why there was a spike in ratings in Pyweek #7. Were the games actually better in general, or was there another factor?
Why did this Pyweek have so few voters (almost like Pyweek #2)? Why did Pyweek #4 have so many?
Martin: There was a "Best games so far" thread.
It really looks like the number of ratings per game is on the decline, especially if I look at the #10, #7, #4 column (I only performed visual analysis :)). Was the number of participants also lower?
And there does appear to be a pattern to the average scores, at least for the last 5 competitions — looks like we do better in the autumn!
Plus (while I'm nitpicking) the x-axis, though drawn as a continuous scale, is actually discrete, so a line chart probably isn't the right tool for the job. There is no Pyweek 8.7 with an average of 12.0 votes per entrant. A bar chart would be more appropriate.
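For what it's worth, in matplotlib (assuming that's what's being used) it's essentially a one-line change - the numbers below are placeholders just to show the shape of it, not the real figures:

import matplotlib.pyplot as plt

pyweeks = [2, 3, 4, 5, 6, 7, 8, 9, 10]
avg_votes = [9, 8, 12, 10, 11, 9, 10, 8, 7]   # placeholder values only

plt.bar(pyweeks, avg_votes)   # discrete categories get bars, not a line
plt.xticks(pyweeks)
plt.xlabel("Pyweek")
plt.ylabel("Average votes per entrant")
plt.show()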
I highly recommend adding http://junkcharts.typepad.com/ to your RSS client.
As for the zero-axis thing, while I would generally agree, the purpose of this graph was to emphasise the relative magnitude of the variability compared to the trend - something which is quite significant, but which is underplayed by starting at zero. Also, a zero-based graph would involve quite a lot of dead space, which is also a "bit of a no-no".
Here is the chart already posted (avg votes per participant, blue line), updated to the last Pyweek.
The red line above is the relative value, calculated against the number of game entries less one (assumption: people belong to only one team). Probably numbers and percentages shouldn't be in the same chart, but I think it's readable... and you can compare their trends :)
The averages look very low to me: are we really judging less than 25% of the games submitted? That means fewer than 10 games each in the last Pyweek.
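In case the derivation isn't obvious, the red line is just this (my reading of the description above - the "less one" excludes a participant's own entry; the numbers in the example are made up):

def relative_coverage(avg_votes_per_participant, num_entries):
    # Fraction of the *other* entries that the average participant rates.
    return avg_votes_per_participant / (num_entries - 1)

# e.g. with 40 entries and an average of 9 votes per participant:
print(relative_coverage(9, 40))   # ~0.23, i.e. about 23% of the other games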
I was wondering if team entries are somehow dragging these numbers down. Maybe in some cases teammates get together to evaluate a game and submit a single vote between them? Maybe, say, a "pure" artist on a team leaves the judging to the other teammates because he doesn't have Python? Just guessing...
From the other point of view, it would be good to know how entrants can encourage more people to rate their game. Making them as bug-free as possible is helpful, of course, as is avoiding unusual libraries that require separate installs. Making binaries for Windows and Mac probably helps as well. The game BubbMan 2 got a huge number of ratings. pymike must have done something right for that. But if the statistics give any hints as to how to get more ratings, that would be great to know.
I would hope that artists have people on their team they can get to help them set up python, and if not, they could at least post here for help.
- Time. I put a lot of effort into Pyweek - everything else takes a bit of a back seat while it's happening. This means that in the week or two after Pyweek I have a lot of extra stuff to catch up on. I already feel guilty about all the things I haven't been doing because Pyweek was on. Now I have to give up free time to judge games?
- Energy. Playing a whole bunch of games sounds like fun. Playing a whole bunch of buggy badly documented games in a critical way, not so much. I find it a pretty draining experience rating more than a couple of games in a sitting.
- Quality. This is possibly going to upset some people, but most of the games in Pyweek are just plain not fun. For every BubbMan 2, there's someone's first try at learning Python, and the game that someone knocked together in half an hour, and the prototype of an interesting mechanic that just ended up going nowhere. These games are not fun to play. If I didn't feel any kind of obligation, I wouldn't be playing them. Often, I simply don't.
- Lack of responsibility. I'm not proud of this one. Simply put, if I can get away without rating games, then why should I? It's a pretty thankless task. Even when people respond to criticism it's usually either in a hostile way, or it's couched as "yeah, that would have been cool, but we didn't have time". There's very little sense that anything I say as a reviewer goes into making better games.
So those are my reasons. I'd rather people didn't try to change my mind on these; I think you'd be wasting your breath a bit. I am a fundamentally lazy individual. Please think of them more as data on why people don't rate.
I see where you're coming from when you say that people don't seem to take the feedback to heart. Possibly some of this is because the feedback is not that helpful to begin with. Saying "You should have less bugs" is not very actionable. Not sure what to do about this.
I wonder if a "pyggy potential" (i.e. further development) award or rating would boost voting participation. If some games are dull, aggravating, or a letdown because there wasn't enough time to complete them, perhaps a decent incentive to vote would be the chance to stimulate a continuation of projects so that I can play the fully realized concepts.
Gumm
As for awards for rating more games, I don't think those would motivate me, but I can't speak for others. I think it might lead to a bit of ill feeling, as people feel guilty about not having rated games. This might result in more ratings, but it might also just lead to people feeling bad about not rating things. Could be worth a shot though.
I wonder if a more structured reviewing mechanism might be a good idea, from the point of view of getting more "actionable" feedback. Looking at the ratings for Mortimer the Lepidopterist, I find comments like "Brilliant", "I liked this a lot" and "Awesome game!". That sort of thing is a lovely ego boost, but pretty useless as a comment, in my opinion. The comments for lower-rated games seem to be a bit more constructive, but there's still a lot of "This does not seem to be a complete game" and so forth. Perhaps it would be good to suggest a few questions as jumping off points for a useful comment, like "What would you have done differently about this game?" or "What aspects of this game would you like to see more of?".
Related to that point, I think it would be good for reviewers to have some more concrete guidelines about what's expected of them. We've made games with hours of gameplay, and had people give ratings after only a few minutes' play of the first level. At the same time, I've spent twenty minutes poking at something which only really had thirty seconds' worth of game, simply because I didn't feel it was fair to give up on the game after such a short period of time. It would be nice if there was a way I could know how much time to spend, either because there's a standard guideline (Pyweek games should aim to be a half-hour of fun), or because the author has specified it (Reindeer Blaster SX is a short shoot-em-up/snowboarding hybrid which lasts about ten minutes. You've seen everything if you get to level three.)
The idea of a standard amount of time to spend on a review comes from IFComp, a text adventure competition. There, reviewers have two hours to look over a game. I think that would be pretty excessive for Pyweek, but the idea is sound. It gives both authors and reviewers something to aim for. There are two other ideas we might want to borrow from their voting. First, each voter has to rate a minimum of five games, or their votes don't count. Again, this is about reasonable expectations. By setting a minimum, you give people something to aim for. Rather than my saying "Fine, I'll rate some games at some point", I'll say "I have ten games to rate over the next two weeks, I'll do two today, and a few more this weekend". At the moment, the only people who think like that are those who plan to rate every game.
The second idea from IFComp, and possibly a more controversial one, is to let non-participants vote. The most recent IFComp had 26 entries (mostly single author) and 150 voters. So even though people rated about 16 games each on average, every game got at least 50 ratings. If you want to have lots of votes, the easiest way is to have lots of voters.
One piece of constructive criticism I would like to see is "You spent too much time on X and not enough time on Y."
As for letting non-participants vote, that should work if we also implemented the 5-vote minimum. My worry is that after Pyweek, I show my game to all my friends, and if they could vote, they would give me unfair ratings. We could at least let previous Pyweek participants vote, or people who DNF. This might also keep people who DNF from submitting a buggy entry just so they get to vote.
I'd like to see another voting metric I saw on Gamespot and liked: Tilt. This is a value representing how much you like the genre, game style, and/or theme. This may not be easy or useful to work into the contest scores, but it's a very easy feedback mechanism, for example "not my kind of game" where I might otherwise have purposely withheld my vote so as not to hurt or help the game's score by casting an unfair vote. As a game contributor, if I can see individuals' scores then I can see Jim loves shmups but he hated mine, or Joe hates checkers but he loved mine; this tells me something significantly deeper about their scoring in both cases, and if I care I can work towards getting some detailed feedback.
In my opinion we should keep voting within the PyWeek community. These are the folks who are deeply invested. If folks want to be treated fairly and with respect, they will treat others in kind. And it seems like they do for the most part. :)
Maybe we could have a spectator vote that would have no bearing on the competition's outcome, or have its own awards category. This could have benefits: more votes, thus sense of accomplishment, to competitors; some feedback from people who come here looking for games to play, and a community to take part in; get non-participants involved, and give the events more exposure. Care should be taken not to let this ruin the community. Think worst case: Blizzard forums, Diablo chat.
And lastly, don't put too much stock in the input of enthusiastic PyWeek noobs (That'd be me. :)) At any rate, hope my comments fuel some good ideas.
Gumm
I'm not nearly as concerned about the state of voting/judging as some of you clearly are. In my view the current setup works well enough to serve its purpose: for PyWeek peers to pick the games they think are the best of the crop for any given challenge. Gathering some other feedback or metrics is a bonus, but not the primary objective.
If anyone has anything concrete they'd like to contribute in terms of code, or if you'd like to get access to the raw data from the database (minus passwords, of course) I could see what I could do... I'm not overburdened with spare time these days though ;-)
Feeling starved of information is really annoying when a couple of minutes' documentation would answer all your queries.
Maybe each entry could have a README or FAQs page on the site that can be updated during the reviewing period?
mauve on 2010/04/20 18:18:
The 3x3 set of graphs don't look very correlated to me. They look normally distributed in the y axis and skew-normal in the x. What are the correlation coefficients?