Paper: Improving the Presentation and Interpretation of Online Ratings Data with Model-Based Figures
Every so often, I come across a paper I find really exciting. Presently, it's Daniel E. Ho and Kevin M. Quinn, Improving the Presentation and Interpretation of Online Ratings Data with Model-Based Figures, The American Statistician, November 2008 (doi:10.1198/000313008X366145).
The paper tackles one problem that irritates me --- aggregating online ratings --- and uses a solution I've previously considered and wanted to investigate --- profiling the raters. (It's always nice to discover you don't have to do ALL the work yourself.)
The problem is illustrated by the following snapshot from Amazon.
With only three ratings, no product merits a 5-star rating. To Amazon's credit, they clearly note there are only three ratings aggregated into a single 5-star rating. They also show a histogram of the individual ratings elsewhere on the product page.
A simple fix would be to apply well-known statistical confidence intervals to the ratings and use the lower bound. Of these fixes, the most naive would be to add pseudo-counts for each possible rating. With three 5-star ratings and one pseudo-count for each of the five star values, the pseudo-count score would be 30/8 = 3.75. Evan Miller proposes a more sophisticated technique based on the lower bound of the Wilson score interval in a recent blog post.
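Both fixes are easy to sketch. The code below is a minimal illustration, not code from any of the sources mentioned: the function names are my own, one pseudo-count per star value is an assumption, and the Wilson bound applies to a yes/no proportion, so using it here assumes each 5-star rating is counted as a "positive" vote.

```python
import math

def pseudo_count_mean(ratings, scale=(1, 2, 3, 4, 5), weight=1):
    """Mean rating after adding `weight` imaginary votes for every star value."""
    total = sum(ratings) + weight * sum(scale)
    count = len(ratings) + weight * len(scale)
    return total / count

def wilson_lower_bound(positive, n, z=1.96):
    """Lower bound of the Wilson score interval for a binomial proportion.

    z=1.96 corresponds to a 95% interval (an assumed default).
    """
    if n == 0:
        return 0.0
    p = positive / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

three_fives = [5, 5, 5]
print(pseudo_count_mean(three_fives))                # 3.75
print(round(wilson_lower_bound(3, 3), 2))            # 3 positive votes out of 3
```

With only three votes, the Wilson lower bound stays well below 1 even though every vote is positive, which is exactly the skepticism the raw 5-star average lacks.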
Ho and Quinn do something different. They propose a model that incorporates each rater's behavior on the site, so not all ratings are equal. The types of behavior captured are
- Uncritical --- easy to please,
- Non-discriminating --- useless, and
- Discriminating --- a "critic."
Go read the paper for the details of the model. The punch line is that the aggregate rating depends on all ratings submitted to the site.
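To get a feel for why rater behavior matters, here is a deliberately crude toy. It is NOT the Ho/Quinn model (the paper has that); it just standardizes each rating against that rater's own history, which is my own stand-in. The rater names, the data, and the function `rater_adjusted_scores` are all made up for illustration.

```python
from statistics import mean, pstdev

def rater_adjusted_scores(ratings):
    """ratings: list of (rater, item, stars). Returns item -> mean z-score.

    Toy stand-in only: each rating is rescaled by the rater's own mean and
    spread, so a 4 from a harsh critic counts for more than a 5 from someone
    who gives everything a 5.
    """
    by_rater = {}
    for rater, _, stars in ratings:
        by_rater.setdefault(rater, []).append(stars)
    profile = {r: (mean(s), pstdev(s)) for r, s in by_rater.items()}

    by_item = {}
    for rater, item, stars in ratings:
        mu, sigma = profile[rater]
        if sigma == 0:
            continue  # same score for everything: no information, skip
        by_item.setdefault(item, []).append((stars - mu) / sigma)
    return {item: mean(z) for item, z in by_item.items()}

# Hypothetical data: three raters, three items.
ratings = [
    ("all_fives", "A", 5), ("all_fives", "B", 5), ("all_fives", "C", 5),
    ("fan", "A", 4), ("fan", "B", 5), ("fan", "C", 5),
    ("critic", "A", 2), ("critic", "B", 4), ("critic", "C", 3),
]
scores = rater_adjusted_scores(ratings)
for item in sorted(scores, key=scores.get, reverse=True):
    print(item, round(scores[item], 2))
```

Even in this toy, an item's score changes when its raters rate something else, because their profiles shift. That is the same dependence on all ratings submitted to the site, in miniature.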
After looking at lots of rating data from Netflix, LAUNCHcast, and even the smaller datasets from HelloMovies, I think incorporating this information is critical to generating useful aggregate ratings.
One issue with their technique is the lack of transparency. As a user, I would really have to trust your site to take the ratings seriously. The Amazon approach with the histogram allows me to evaluate the data myself, though I lack the critical context that the Ho/Quinn ratings provide. A second issue is computation. I did not fully check their paper, but I don't think the model is trivial to fit, and I doubt it could be done in real time.
These are just theoretical issues. They have an R package: Ratings, so go give it a try.