Open rating fairness report
In PyWeek 26, for the first time, ratings were not submitted anonymously. This was an experiment intended to address potential unfairness in the ratings and to produce more useful feedback. I was keen to find out whether it had been successful, so I both polled users about their subjective experiences and looked at all submitted ratings.

Poll Results
Thanks to all who participated in the poll. The subjective result was very clear: 53% of returning entrants thought that the ratings were more fair this time. Only one respondent thought they were less fair.
Respondents also found the feedback more valuable: 67% of returning entrants said it was more valuable this time.
I wanted to see if this anecdotal evidence was supported by the data of the real ratings.
- Hypothesis 1: Given that participants thought the ratings were fairer, we should see a lower standard deviation in the ratings given to each entry by different reviewers.
- Hypothesis 2: Given that participants thought the feedback was more valuable, we should expect to see more words on average per review comment.
Let's look at Hypothesis 1 first. I calculated the standard deviation of the ratings for each entry, then took the average of these per-entry values. PyWeek 26 does indeed have a lower average standard deviation in ratings per entry. It's not quite an all-time low, but it is lower than the last couple of PyWeeks:
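The per-entry-then-average calculation can be sketched in a few lines of Python. The entry names and scores below are invented for illustration; only the shape of the computation reflects what's described above.

```python
from statistics import mean, pstdev

# Hypothetical sample data: entry name -> ratings from different reviewers.
ratings = {
    "entry-a": [3, 4, 4, 5],
    "entry-b": [2, 4, 3, 3],
    "entry-c": [5, 5, 4, 4],
}

# Per-entry spread: how much reviewers disagreed about each entry.
per_entry_sd = {name: pstdev(scores) for name, scores in ratings.items()}

# The fairness metric: the average of the per-entry standard deviations.
avg_sd = mean(per_entry_sd.values())
```

A lower `avg_sd` means reviewers tended to agree more closely about each individual entry, which is the sense of "fairer" that Hypothesis 1 tests.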
Interestingly, I found something else by accident: there's much less overall variation in the ratings, even between different entries. It's by far the lowest of all time.
In fact on average we gave slightly higher and more clustered ratings this time than before. Here's the distribution of all ratings in PyWeek 26 vs all previous competitions (the older the competition the fainter the line). Notice how this competition is slightly shifted to the right, with fewer very low scores. It also has a much higher central peak:
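A distribution comparison like the one described above can be sketched as follows. The two score lists are fabricated stand-ins for "a previous competition" and "this competition"; a real comparison would pool every rating from each event.

```python
from collections import Counter

# Hypothetical ratings on a 1-5 scale from two competitions.
previous = [1, 2, 2, 3, 3, 3, 4, 4, 5, 1]
current = [3, 3, 4, 4, 4, 4, 3, 5, 4, 3]

def distribution(scores):
    """Share of ratings at each score, so different-sized pools compare fairly."""
    counts = Counter(scores)
    total = len(scores)
    return {score: counts.get(score, 0) / total for score in range(1, 6)}

prev_dist = distribution(previous)
curr_dist = distribution(current)
```

A rightward shift with fewer very low scores shows up as a smaller share of 1s and 2s in `curr_dist`, and a higher central peak as a larger share concentrated at one score.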
Let's now look at Hypothesis 2: did reviewers write more? Yes, they did. On average, they wrote more words per review than ever before:
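The words-per-review metric for Hypothesis 2 is a straightforward calculation; here is a minimal sketch with made-up review comments.

```python
from statistics import mean

# Hypothetical review comments from one competition.
comments = [
    "Great art, but the controls felt floaty.",
    "Loved the music. The puzzle design shines.",
    "Crashed on level 2; otherwise promising.",
]

# Feedback-volume metric: mean number of words per review comment.
words_per_comment = mean(len(c.split()) for c in comments)
```

Comparing this figure across competitions is what supports the claim that reviewers wrote more this time.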
I think the case is made that the reviews were fairer and the comments more valuable. I cannot say that the non-anonymous ratings were the reason; there were other changes as well. For example, users felt more engaged with the community this time, too (perhaps due to the improved e-mail and timeline features):
Is this a cause, a result, or an unrelated effect? I can't tell. Regardless, I think it's a positive thing, and therefore I think we should make the change permanent. I'm going to run all future competitions with non-anonymous ratings too.