The Problems with 5-Star Rating Systems, and How to Fix Them

Executive Summary

Online marketplaces for goods and services are increasingly valuable and powerful. Yet many of them remain surprisingly unsophisticated when it comes to their reputation systems, which typically take the form of five-star ratings. From investing in and advising dozens of marketplace businesses for more than a decade, the authors have found that while simple five-star systems are good enough at identifying and weeding out very low-quality products or suppliers, they do a poor job of separating good from great products or suppliers. This may not be a big issue for marketplaces offering commodity products and services, but it can be a serious problem for marketplaces where it is important to allow truly great providers to differentiate themselves clearly. This article discusses several options for providing users of a marketplace with a better sense of the relative ranking of suppliers.


Online marketplaces for goods and services are increasingly valuable and powerful. Yet many of them remain surprisingly unsophisticated when it comes to their reputation systems, which typically take the form of five-star ratings.

From investing in and advising dozens of marketplace businesses for more than a decade, we have found that while simple five-star systems are good enough at identifying and weeding out very low-quality products or suppliers, they do a poor job of separating good from great products or suppliers. This may not be a big issue for marketplaces offering commodity products and services (e.g., Lyft and Uber), but it can be a serious problem for marketplaces where it is important to allow truly great providers to differentiate themselves clearly (e.g., 99designs, Fiverr, Upwork). (Disclosure: One of us — Josh — owns shares of Upwork and Yelp).

As currently implemented, five-star rating systems suffer from several shortcomings. Because these systems offer no incentives for truthful feedback, users who have extreme experiences (either very bad or very good) are much more likely to leave feedback than users who have average experiences, creating selection bias. Ratings are also prone to “grade” inflation, so that in some marketplaces having a 4.8-star average, or 96% positive feedback, does not mean that the supplier is particularly exceptional. And on some marketplaces, the difference between 4.5 stars and 4.8 stars can be massive, making it hard for users to differentiate OK suppliers from very good ones.

There are several options for providing users of a marketplace with a better sense of the relative ranking of suppliers. A basic thing they could do would be to show users the average score for all suppliers in the relevant category (e.g., Amazon, toys; Yelp, restaurants in San Francisco; Task Rabbit, movers; Expedia, budget hotels). They could even consider showing (in a simple way) the distribution of scores within a category, indicating the current supplier’s position in that distribution. By seeing the whole distribution and where the supplier fits in it, the user could quickly get a sense of the overall situation.
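To make the idea concrete, here is a minimal sketch in Python of how a marketplace might compute what this paragraph describes: the category-wide average and a supplier's position within the distribution of category ratings. The function name, data, and percentile convention are illustrative assumptions, not any marketplace's actual implementation.

```python
from bisect import bisect_left

def supplier_position(supplier_avg, category_avgs):
    """Hypothetical helper: return the category-wide mean rating and the
    supplier's percentile rank within the category's distribution of
    average ratings (fraction of suppliers rated strictly lower)."""
    ranked = sorted(category_avgs)
    cat_mean = sum(ranked) / len(ranked)
    # Position of this supplier's average within the sorted distribution.
    percentile = 100 * bisect_left(ranked, supplier_avg) / len(ranked)
    return cat_mean, percentile

# Illustrative category of average ratings on a 5-star scale.
category = [4.2, 4.5, 4.6, 4.7, 4.8, 4.8, 4.9, 3.9, 4.4, 4.6]
cat_mean, pct = supplier_position(4.8, category)
```

A marketplace could surface `cat_mean` next to the supplier's score, and render `pct` as the supplier's place in a simple histogram of the category.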

Another way to help distinguish among the top suppliers is to indicate when a supplier is in a top percentile group (e.g., “this is a top 10% supplier”) while not disclosing this for the rest of the suppliers. This is akin to treating the top percentiles as badges, much as eBay and Airbnb have done with their respective “top-rated seller” and “superhost” designations.
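A badge scheme like the one described above can be sketched in a few lines. This is an assumed design, not eBay's or Airbnb's actual logic: the top fraction of suppliers by average rating gets a label, and everyone else gets nothing disclosed.

```python
def badge_top_suppliers(avg_ratings, top_fraction=0.10):
    """Hypothetical badge rule: award a 'top-rated' badge to the top
    `top_fraction` of suppliers by average rating; disclose nothing
    about percentile position for the rest."""
    ranked = sorted(avg_ratings.items(), key=lambda kv: kv[1], reverse=True)
    n_badged = max(1, int(len(ranked) * top_fraction))
    badged = {name for name, _ in ranked[:n_badged]}
    return {name: ("top-rated" if name in badged else None)
            for name in avg_ratings}

# Illustrative data: ten suppliers, so the top 10% is a single badge.
suppliers = {"A": 4.9, "B": 4.8, "C": 4.7, "D": 4.6, "E": 4.5,
             "F": 4.4, "G": 4.3, "H": 4.2, "I": 4.1, "J": 4.0}
badges = badge_top_suppliers(suppliers)
```

In practice the cutoff would likely also require a minimum number of reviews, so a supplier with one lucky 5-star rating does not outrank established providers.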

The second key measure we advocate is to adjust user ratings for differences in reviewing behavior. Specifically, a given user’s review could be given a larger weight if there is higher variance in that individual’s review scores (in contrast to someone who always gives the same or similar score). Going further, one should consider adjusting a user’s score by the average rating that user has given in the past, so that only relative differences across the ratings given by the user would shift a supplier’s rating. This would help adjust for differences between users who are intrinsically very generous in their ratings and users who are very demanding.
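The two adjustments above, centering each rating on the reviewer's own average and weighting reviewers with more varied histories more heavily, can be sketched as follows. This is a simple weighted-mean formulation of the idea under our own assumptions (including the fallback weight for near-constant raters), not a description of any marketplace's production algorithm.

```python
from statistics import mean, pstdev

def adjusted_score(supplier_ratings):
    """Hypothetical adjusted rating. `supplier_ratings` is a list of
    (rating, reviewer_history) pairs, where reviewer_history contains
    that reviewer's past ratings. Each rating is centered on the
    reviewer's own average (so generous and demanding raters are
    comparable), and reviewers whose scores vary more are weighted
    more heavily, since their ratings carry more information."""
    weighted, total_weight = 0.0, 0.0
    for rating, history in supplier_ratings:
        offset = rating - mean(history)   # above/below this reviewer's norm
        weight = pstdev(history) or 0.1   # assumed floor for constant raters
        weighted += weight * offset
        total_weight += weight
    return weighted / total_weight        # relative score; 0 means "average"

# A demanding rater's 4 stars says more than a reflexive 5-star rater's 5.
score = adjusted_score([(4.0, [3, 3, 4, 2]),   # demanding reviewer
                        (5.0, [5, 5, 5, 5])])  # always gives 5 stars
```

Here the demanding reviewer's above-their-norm rating dominates, so the supplier's adjusted score comes out clearly positive even though the raw averages would rank the two ratings the other way.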

Lyft and Uber illustrate the issues created by these differences. Unlike on Lyft, drivers on Uber can see riders’ ratings, so they tend to turn down requests from riders with lower ratings, expecting that such riders are more likely to give low driver ratings in return. Uber could avoid this problem by adjusting riders’ feedback to take into account that some riders consistently give lower ratings, so that a driver is not penalized for taking such riders. Doing so would make drivers more willing to accept riders with lower scores.

Other ways marketplaces could improve their reputation systems include using more private information (reviews and comments that are never revealed to the suppliers) and relative comparisons (how the supplier compared to the user’s previous supplier) to rank suppliers. They could also ask users questions whose answers will help the marketplace better match them to the right suppliers in the future (e.g., What attributes of this supplier did you like/dislike the most?).

Rating systems may never be enough on their own to ensure trust and safety on online marketplaces. They typically need to be supplemented by insurance coverage and other guarantees. However, implementing the measures we’ve discussed would go a long way toward making rating systems more robust and reducing marketplaces’ needs and costs of providing these supplementary services.
