Statistics

Hey guys,

I've seen that some people have been looking at stats of various kinds over in the other thread, so I thought I'd add my own contribution. I've scraped the Pyweek website for data, which is available here and here. The format should be fairly self-explanatory, but let me know if you have any issues.
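
For the curious, the scraping itself is nothing fancy. Here's a minimal sketch of the sort of thing involved - the URL and the page markup below are guesses for illustration, not necessarily the site's actual structure:

    # Minimal scraping sketch (Python 3). The ratings URL and the table markup
    # are assumptions; adjust the pattern to whatever the real pages contain.
    import csv
    import re
    import urllib.request

    def fetch_ratings(challenge_number):
        """Fetch a hypothetical per-challenge ratings page and pull out
        (entry name, fun, production, innovation) rows with a regex."""
        url = "http://www.pyweek.org/%d/ratings/" % challenge_number  # assumed URL
        html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        row_pattern = re.compile(
            r"<tr>\s*<td>(?P<name>[^<]+)</td>"
            r"\s*<td>(?P<fun>[\d.]+)</td>"
            r"\s*<td>(?P<prod>[\d.]+)</td>"
            r"\s*<td>(?P<innov>[\d.]+)</td>", re.S)
        return [m.groupdict() for m in row_pattern.finditer(html)]

    if __name__ == "__main__":
        with open("scores.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, ["name", "fun", "prod", "innov"])
            writer.writeheader()
            for row in fetch_ratings(10):
                writer.writerow(row)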

One of the issues that I've heard people batting around is a perceived decline in the quality of games recently. I decided to check whether such a trend shows up in the scores - but there doesn't seem to be an obvious one.

As you can see in the graph, average Pyweek scores have stayed pretty steady since the beginning, with some weeks having higher scores than others, but no general trend. Given that, in my opinion, standards of Pyweek voters have risen substantially, at least in terms of production and general polish, it seems that our games can't be getting that much worse. I do wonder whether people are looking back with slightly rose-tinted glasses at earlier Pyweeks, remembering the better games and forgetting the bad ones.

Another interesting point here is that the average score for a Pyweek game is well below 3! In fact, average fun scores (around 2.66) are considerably lower than those for production and innovation (both in the 2.9-3.0 range). I'm not sure why this is, although it may be that some people are subconsciously thinking of 2.5 as average, rather than 3.
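
For reference, those averages are just simple means over the scraped data. A minimal sketch, assuming the CSV columns from the scraping sketch above (fun, prod, innov - my labels, not necessarily the real ones):

    # Compute the mean score per category from the scraped CSV. The column
    # names are assumptions matching the sketch above.
    import csv
    from collections import defaultdict

    def category_means(path="scores.csv"):
        totals = defaultdict(float)
        counts = defaultdict(int)
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                for cat in ("fun", "prod", "innov"):
                    totals[cat] += float(row[cat])
                    counts[cat] += 1
        return {cat: totals[cat] / counts[cat] for cat in totals}

    print(category_means())  # prints the per-category averages discussed above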

Finally, something interesting I've found is that across all three categories, there's a strong correlation between the number of votes received and the scores given.
 
There are two possibilities here. One is that more people play the good games, either because they're less buggy, or because they have more interesting screenshots and generally drum up more enthusiasm among players, who then tell others. The other possibility is a bit more worrying: it may be that those voters who vote on everything are less generous with their scores than those who only vote on a few games. In that case, which games get voted on more could be having a large effect on scores.

I'd like to know if anyone has any thoughts on this stuff, and please do post if you do anything else interesting with the data.

PS Richard, the voting comments pages are outputting unsanitised HTML - not only does this make it quite difficult to scrape (a Python traceback quite often includes a lot of < and > characters), but it may be a minor security risk.
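
For what it's worth, the fix I have in mind is simply escaping the comment text before it's written into the page. A sketch, not the site's actual code (render_comment is a made-up helper):

    # Escape user-supplied comment text so tracebacks full of < and > render
    # as plain text instead of being interpreted as markup.
    from xml.sax.saxutils import escape

    def render_comment(comment_text):
        return "<div class='comment'>%s</div>" % escape(comment_text)

    print(render_comment("TypeError: unsupported operand type(s) for <<"))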


Comments

The 3x3 set of graphs don't look very correlated to me. They look normally distributed in the y axis and skew-normal in the x. What are the correlation coefficients?
I just happen to have computed them from Martin's data just now. Here you go:

Pyweek   n      r      t
     2  36  0.324  1.997
     3  30  0.456  2.714
     4  53  0.127  0.914
     5  50  0.250  1.787
     6  57  0.459  3.835
     7  55  0.202  1.505
     8  57  0.410  3.337
     9  53  0.345  2.628
    10  48  0.338  2.433

That's the Pyweek number, the number of points, the Pearson correlation coefficient, and the corresponding t-value. (A t-value of 3 is highly significant.) Obviously some of them are more significant than others. Anyway, it doesn't surprise me at all. I think that Martin's first hypothesis is correct. Buggy and boring games tend to get fewer responses.
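
For anyone who wants to check these, the t-value for a Pearson correlation r over n points is t = r * sqrt((n - 2) / (1 - r**2)). A small self-contained sketch (the votes and scores below are made-up numbers, just to show the calculation):

    # Pearson correlation and its t-value, computed from scratch.
    import math

    def pearson_r(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    def t_value(r, n):
        # Standard significance test for a correlation over n points.
        return r * math.sqrt((n - 2) / (1 - r * r))

    # Illustrative data: votes received and overall score for six entries.
    votes = [5, 8, 11, 14, 17, 20]
    scores = [2.1, 2.4, 2.3, 2.9, 3.1, 3.4]
    r = pearson_r(votes, scores)
    print(round(r, 3), round(t_value(r, len(votes)), 3))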
While I'd agree that buggy and boring games get fewer responses, the question to be asked here is why. Is it because they have fewer people playing them, or is it because they have fewer people bothering to vote? If it's the former, then how do people know the game is buggy and boring without trying it out? If it's the latter, then why aren't they voting? Even a DNW would be more useful than no vote at all.
Also, in terms of significance, fitting a linear model of score as a function of number of votes, and an independent bias for each challenge, yields a trend of about 0.024 per vote, with a p-value of the order of 1e-10! You get a similar, but slightly larger p-value if you drop the bias terms, as the results from different challenges tend to blur things out.
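
In case anyone wants to reproduce that fit, here's a sketch of the kind of model described - the column names and toy numbers are illustrative assumptions, not the actual data or the exact code used:

    # Linear model: score ~ slope * votes + an independent intercept per
    # challenge, using the statsmodels formula API.
    import pandas as pd
    import statsmodels.formula.api as smf

    def vote_trend(df):
        """Return the per-vote trend and its p-value, controlling for a
        separate baseline score in each challenge."""
        model = smf.ols("score ~ votes + C(challenge)", data=df).fit()
        return model.params["votes"], model.pvalues["votes"]

    # Toy data, two challenges, made-up numbers:
    df = pd.DataFrame({
        "challenge": [9, 9, 9, 10, 10, 10],
        "votes":     [10, 20, 30, 8, 16, 24],
        "score":     [2.4, 2.7, 2.9, 2.2, 2.5, 2.8],
    })
    print(vote_trend(df))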

I probably shouldn't have used "strong" to describe these correlations; "significant" would have been more apt. However, they're certainly there, and at levels which could definitely impact the results of the competition.
Martin - thanks for scraping the data and posting the graphs. 

It might be that the higher-scoring games tend to have more impressive screenshots, which draw more people to play them. If word of mouth is a factor, I'm not seeing much of it on the message boards during judging, but I don't know about IRC or other communication channels.
I think it's the latter. And buggy isn't the same thing as DNW (although they're probably pretty highly correlated). Often if the game just runs slowly, or it randomly crashes from time to time, or the learning curve is too steep to get anywhere, or I can't figure out the controls, or it involves a dependency I don't have, after a few minutes I'll decide to skip it for now and come back to it later. But that doesn't mean I've played enough that I feel like I can rate it fairly. And it doesn't mean it's broken enough for me to mark it DNW.

Then, of course, I don't always have time to come back to it later. This time there were two games like that for me, but I imagine it's more for some people.
There have definitely been a few "List your favourite games" threads during judging in previous Pyweeks; I don't recall one this time, but I may be wrong. I'd be very surprised if no word of mouth communication was happening - at the very least within teams. It's also possible that people are focusing on games made by people who they know from previous Pyweeks, which are possibly of higher quality than those of relative newbies. Unfortunately, I forgot to scrape information on teams, so I can't check that hypothesis immediately.
Thanks for the graphs, Martin.

It doesn't surprise me much that the average Pyweek score is below 3. My guess is that people think of 3 as the average score according to a certain standard, and not relative to all other games. For example, one might think that a 3 in production means "it's nicely produced but has nothing special" instead of "average production across all games". In that sense, if there are plenty of games that "aren't nicely produced", it makes sense for your own production ratings to average below 3.

If this guess is true, then that means there are a lot of Pyweek games that are below voters' standards.

I'm sometimes guilty of what Cosmologicon said: there are some games that crash, or are buggy, or that I can't figure out how to play, and I end up skipping them for later so I can rate them more fairly. Normally I try to rate most of the games, but this Pyweek I left some of them for the last day and ended up not having time for them. (Note: I also end up not rating games that have some less common dependency I don't have the patience to install, since it doesn't make sense to give them a DNW.) But I'm not sure how widespread this is. If it is, it might explain the correlation. The other hypothesis, that people play the games that look more interesting, probably accounts for some of it too.

Further questions:

I wonder what caused the spike in ratings in Pyweek #7. Were the games actually better in general, or was there another factor?

Why did this Pyweek have so few voters (almost like Pyweek #2)? Why did Pyweek #4 have so many?
Martin: There was a "Best games so far" thread.

It really looks like the number of ratings per game is on the decline, especially if I look at the #10, #7, #4 column (I only performed visual analysis :)). Was the number of participants also lower?

And there does appear to be a pattern to the average scores, at least for the last 5 competitions — looks like we do better in the autumn!
That would be autumn in the Northern Hemisphere. If there is correlation between rating and season, southern hemispherians had an advantage this PyWeek. The cheating bastards.
Martin: thanks for this analysis... I'm happy to give you the raw data to work with...
I don't think it's surprising that Fun tends to have a low average score, because putting fun into a game is a difficult thing to do. Production quality can usually be improved to some extent just by putting in work, but finding fun requires talent and luck. Innovation is harder as well, but I think it's still easier than fun -- it's possible for a game to be very innovative but still not much fun.
Interesting statistics... another interesting fact is that there hasn't been such an exhaustive "trends study" of the Pyweeks till now :)
Just skimmed through the "Number of Respondents" for my entries... it was 40 for Pyweek 4 and it averaged around 20 for the rest of the Pyweeks. I guess that sux... oh well.
On the issue of whether the number of ratings per game is on the decline, the answer is yes, although that's largely because the number of entrants is also on the decline - fewer people entering means fewer people voting overall. If we look at the average number of votes cast per entrant, it's down a bit, but it's hard to claim there's an overall trend.
Martin: Starting your continuous axes anywhere other than at zero is a bit of a graph usability no-no - you've made it look as if there's been a 90% drop since the last Pyweek, when it's only 20% or so.

Plus (while I'm nitpicking) the x-axis, though represented by integers, is actually discrete, so a line chart probably isn't the right tool for the job. There is no Pyweek 8.7 with an average of 12.0 votes per entrant. A bar chart would be more appropriate.

I highly recommend adding http://junkcharts.typepad.com/ to your RSS client.
I'm familiar with the work of Junk Charts, thanks, and generally find that I disagree with them as often as I agree. In particular, I dislike the overuse of bar charts to indicate ordered quantities. We're all smart people, and nobody would think that Pyweek 8.7 existed. If it bothers you that much, think of it as a bumps chart.

As for the zero-axis thing, while I would generally agree, the purpose of this graph was to emphasise the relative magnitude of the variability compared to the trend - something which is quite significant, but which is underplayed by starting at zero. Also, a zero-based graph would involve quite a lot of dead space, which is also a "bit of a no-no".
While waiting for the next challenge, I was looking for statistics about Pyweek participants in recent competitions, and I found this interesting thread.
Here is the chart posted earlier, average votes per participant (blue line), updated to the last Pyweek.
The red line above is the relative value, calculated against the number of game entries less one (assumption: people belong to only one team). Probably numbers and percentages should not be in the same chart, but I think it is readable... and you can compare their trends :)

The averages look very low to me: are we really judging less than 25% of the games submitted? That means fewer than 10 games in the last Pyweek.
I was wondering if team entries are somehow dragging these numbers down. Maybe in some cases teammates meet and evaluate a game together, submitting one single vote? Or maybe, for example, a "pure" artist on a team leaves the judging task to other teammates because he doesn't have Python? Just guessing...
I wish there were more ratings for the games. I always rate at least 90% of them (around 40). It would be good to know why people who rate 10 or fewer aren't able to get to rate more. Is it because they lose interest? Or they aren't able to run the game? Or they don't feel qualified to rate it? Some statistics on when the ratings were submitted might be instructive, but I don't know if those data are kept.

From the other point of view, it would be good to know how entrants can encourage more people to rate their game. Making them as bug-free as possible is helpful, of course, as is avoiding unusual libraries that require separate installs. Making binaries for Windows and Mac probably helps as well. The game BubbMan 2 got a huge number of ratings. pymike must have done something right for that. But if the statistics give any hints as to how to get more ratings, that would be great to know.

I would hope that artists have people on their team they can get to help them set up python, and if not, they could at least post here for help.
I'm one of those people who doesn't rate all that many games. Sorry about that, guys. But I suppose that gives me some insights into why people don't rate lots of games. So here are some of my reasons:

  1. Time. I put a lot of effort into Pyweek - everything else takes a bit of a back seat while it's happening. This means that in the week or two after Pyweek I have a lot of extra stuff to catch up on. I already feel guilty about all the things I haven't been doing because Pyweek was on. Now I have to give up free time to judge games?
  2. Energy. Playing a whole bunch of games sounds like fun. Playing a whole bunch of buggy badly documented games in a critical way, not so much. I find it a pretty draining experience rating more than a couple of games in a sitting. 
  3. Quality. This is possibly going to upset some people, but most of the games in Pyweek are just plain not fun. For every BubbMan 2, there's someone's first try at learning Python, and the game that someone knocked together in half an hour, and the prototype of an interesting mechanic that just ended up going nowhere. These games are not fun to play. If I didn't feel any kind of obligation, I wouldn't be playing them. Often, I simply don't.
  4. Lack of responsibility. I'm not proud of this one. Simply put, if I can get away without rating games, then why should I? It's a pretty thankless task. Even when people respond to criticism it's usually either in a hostile way, or it's couched as "yeah, that would have been cool, but we didn't have time". There's very little sense that anything I say as a reviewer goes into making better games.

So those are my reasons. I'd rather people didn't try to change my mind on these; I think you'd be wasting your breath a bit. I am a fundamentally lazy individual. Please think of them more as data on why people don't rate.
Well don't take this as an argument, but the obvious followup question is, what would help encourage you to rate more games? You mention time. Would an extra week or two of judging, realistically, get more ratings out of you? You mention rating being a thankless task. What if there were awards for rating more games? They do that in Ludum Dare.

I see where you're coming from when you say that people don't seem to take the feedback to heart. Possibly some of this is because the feedback is not that helpful to begin with. Saying "You should have less bugs" is not very actionable. Not sure what to do about this.
Aa, math! Make it stop! Ohh pictures... :)

I wonder if a "pyggy potential" (i.e. further development) award or rating would boost voting participation. If some games are dull, aggravating, or a letdown because there wasn't enough time to complete them, perhaps a decent incentive to vote would be the chance to stimulate a continuation of projects so that I can play the fully realized concepts.

Gumm
Cosmologicon: An extra week or two of judging might help, but it also might stretch the thing out unnecessarily, and discourage people from getting started right away (and thus cause more people to forget entirely). I'd be interested to know if Richard has data on when people do their rating - whether there's a big rush at the last minute or if people spread them out over the two week period.

As for awards for rating more games, I don't think those would motivate me, but I can't speak for others. I think it might lead to a bit of ill feeling, as people feel guilty about not having rated games. This might result in more ratings, but it might also just lead to people feeling bad about not rating things. Could be worth a shot though.

I wonder if a more structured reviewing mechanism might be a good idea, from the point of view of getting more "actionable" feedback. Looking at the ratings for Mortimer the Lepidopterist, I find comments like "Brilliant", "I liked this a lot" and "Awesome game!". That sort of thing is a lovely ego boost, but pretty useless as a comment, in my opinion. The comments for lower-rated games seem to be a bit more constructive, but there's still a lot of "This does not seem to be a complete game" and so forth. Perhaps it would be good to suggest a few questions as jumping off points for a useful comment, like "What would you have done differently about this game?" or "What aspects of this game would you like to see more of?".
 
Related to that point, I think it would be good for reviewers to have some more concrete guidelines about what's expected of them. We've made games with hours of gameplay, and had people give ratings after only a few minutes' play of the first level. At the same time, I've spent twenty minutes poking at something which only really had thirty seconds' worth of game, simply because I didn't feel it was fair to give up on the game after such a short period of time. It would be nice if there was a way I could know how much time to spend, either because there's a standard guideline (Pyweek games should aim to be a half-hour of fun), or because the author has specified it (Reindeer Blaster SX is a short shoot-em-up/snowboarding hybrid which lasts about ten minutes. You've seen everything if you get to level three.)

The idea of a standard amount of time to spend on a review comes from IFComp, a text adventure competition. There, reviewers have two hours to look over a game. I think that would be pretty excessive for Pyweek, but the idea is sound. It gives both authors and reviewers something to aim for. There are two other ideas we might want to borrow from their voting. First, each voter has to rate a minimum of five games, or their votes don't count. Again, this is about reasonable expectations. By setting a minimum, you give people something to aim for. Rather than my saying "Fine, I'll rate some games at some point", I'll say "I have ten games to rate over the next two weeks, I'll do two today, and a few more this weekend". At the moment, the only people who think like that are those who plan to rate every game.

The second idea from IFComp, and possibly a more controversial one, is to let non-participants vote. The most recent IFComp had 26 entries (mostly single author) and 150 voters. So even though people rated about 16 games each on average, every game got at least 50 ratings. If you want to have lots of votes, the easiest way is to have lots of voters.
Martin: Nice, that all sounds pretty reasonable. Hopefully others will chime in so this isn't just a two-sided thing. But I wanted to say I definitely agree with entrants giving more information to judges. We could probably think of several useful questions they could answer, but "How long does it take to finish?" and "How do I know when I've seen everything?" or "What's the minimum point I need to reach before I can give it a fair rating?" would be at the top of my list.

One piece of constructive criticism I would like to see is "You spent too much time on X and not enough time on Y."

As for letting non-participants vote, that should work if we also implemented the 5-vote minimum. My worry is that after Pyweek, I show my game to all my friends, and if they could vote, they would give me unfair ratings. We could at least let previous Pyweek participants vote, or people who DNF. This might also keep people who DNF from submitting a buggy entry just so they get to vote.
I think documenting play time is a good idea all around. A minimum play time, level, chapter, whatever milestone, would be very helpful to raters whose limited time must be divvied among the projects. In games and books there's usually a point where I become hooked--or not. If it doesn't grab me by then, it's either not my thing or some important aspects weren't done well enough. That is a good point to be able to rate fairly. If I'm hooked I'll continue playing, and may even boost the score a notch. If I give up before the hook then I can't do it justice. (This could be considered a spoiler and should be treated as such.)

I'd like to see another voting metric I saw on Gamespot and liked: Tilt. This is a value representing how much you like the genre, game style, and/or theme. This may not be easy or useful to work into the contest scores, but it's a very easy feedback mechanism, for example "not my kind of game" where I might otherwise have purposely withheld my vote so as not to hurt or help the game's score by casting an unfair vote. As a game contributor, if I can see individuals' scores then I can see Jim loves shmups but he hated mine, or Joe hates checkers but he loved mine; this tells me something significantly deeper about their scoring in both cases, and if I care I can work towards getting some detailed feedback.

In my opinion we should keep voting within the PyWeek community. These are the folks who are deeply invested. If folks want to be treated fairly and with respect, they will treat others in kind. And it seems like they do for the most part. :)

Maybe we could have a spectator vote that would have no bearing on the competition's outcome, or have its own awards category. This could have benefits: more votes, thus sense of accomplishment, to competitors; some feedback from people who come here looking for games to play, and a community to take part in; get non-participants involved, and give the events more exposure. Care should be taken not to let this ruin the community. Think worst case: Blizzard forums, Diablo chat.

And lastly, don't put too much stock in the input of enthusiastic PyWeek noobs (that'd be me :)). At any rate, I hope my comments fuel some good ideas.

Gumm
Oh, hai interesting and lengthy discussion! :-)

I'm not nearly as concerned about the state of voting/judging as some of you clearly are. In my view the current setup works well enough to serve its purpose: for PyWeek peers to pick the games they think are the best of the crop for any given challenge. Gathering some other feedback or metrics is a bonus, but not the primary objective.

If anyone has anything concrete they'd like to contribute in terms of code, or if you'd like to get access to the raw data from the database (minus passwords, of course) I could see what I could do... I'm not overburdened with spare time these days though ;-)
Well I'm here to learn to make better games. There are several reasons that doing it as part of Pyweek is better than just doing it on my own, but the primary one for me is feedback. I guess this isn't really set up as a community of teaching and learning, but that aspect is there, and I'd kind of like to encourage it. :)
I do like the idea of more information about the game to be submitted by entrants for the benefit of reviewers and more specific questions to be asked of reviewers (should they care to answer them).

Feeling starved of information is really annoying when a couple of minutes documentation would answer all your queries.

Maybe each entry could have a README or FAQs page on the site that can be updated during the reviewing period?