In Defense of Purist Skill Rating

Weds. Jun 21st:

Intro:

This essay will defend a vastly simpler implementation of Skill Rating adjustment than currently exists in Overwatch’s Ranked Matchmaking. I will suggest that removing all influencers of Skill Rating besides winning & losing (adjusted to game difficulty) will result in a number of improvements to the Ranked Matchmaking experience, especially with an eye towards the OWL and the eSports possibilities for Overwatch in general.

Incentives & Behavior:

Most game theoretic models begin with a simple assumption termed ‘rational self interest’, or the idea that individuals will take the course of action which most benefits themselves. This assumption is imperfect, as humans have been repeatedly shown to exhibit altruistic and pay-to-punish behavior patterns in empirical studies. However, broadly speaking, the notion that people will act in service of their own goals is a plausible one. It is especially so in an online context that lacks face-to-face empathic accountability.

Beginning from rational self interest, then, we can understand and predict the behavior patterns of players in Overwatch by examining the incentive structures that they face. Furthermore, alterations to these incentive structures have the power to dramatically change the decisions players make and even the mindset with which individuals approach the game.

The most clear and impactful incentive that Overwatch players (or at least those that choose to play Ranked Matchmaking) face is Skill Rating (hereinafter ‘SR’). Rising through the ranks feels satisfying and validating, placing in a top division can be a status symbol, and a high top-500 placement might even land you tryouts to play professionally. Naturally, then, many players are highly incentivized to seek to maximize their SR.

Skill Rating Maximization:

SR maximization will always be an incentivized behavior pattern. People want to be highly skilled, but more than that they want to appear to be highly skilled. This distinction seems small but is in fact very important. Crucially then, the key motivation for many (especially for the vast majority of players who will never compete in an eSports context) is to reach the highest SR that they can. This should be juxtaposed against the incentive to become the best player one can be: seeking to have the maximum impact upon a given team’s win probability (i.e. the eSports motivation).

Ideally then, the SR system should be set up such that ‘SR maximization behavior’ guides players to make the sort of decisions that positively impact the community and create the best gameplay environment possible. In my judgement, such an ideal system would align the SR maximization behavior with the eSports motivation, especially with an eye towards the Overwatch League. The current system fails to accomplish this alignment.

One Trick Players (OTPs):

While ‘one-tricking’ is not a behavior that I think should be actively discouraged or disallowed, I contend that it’s also a behavior that shouldn’t be specifically incentivized. In my view, the ideal system would be entirely neutral towards OTPs.

Consider a hypothetical Mercy OTP (anecdotally the most commonly one-tricked hero, although I don’t have data that support this) who has reached a very high SR with essentially no other heroes played.

The current SR system rewards players who are playing at a high skill percentile compared to other players on that hero. This comparison is drawn not within one game instance, but rather across the entire dataset of all Ranked Matchmaking time played on that hero. What this means for our hypothetical Mercy OTP is that, so long as he/she plays better than other Mercy players, lost games will net a smaller SR drop and won games will net a larger SR gain. This impact is so significant that winning vs. losing is in fact a secondary concern to the ‘Mercy percentile’ our OTP is playing at.

We’ll get back to our hypothetical OTP in a moment, but now let’s take a step back to examine the bigger picture. The current SR system is crucially problematic for many reasons, but I’ll focus on two: (1) statistical judgements of skill are weak (for some heroes more than others) and (2) it leads different players to have different incentive structures.

(1) Statistical Judgements of Skill Are Weak:

The strength of this proposition is such that I’ll use the best counterexample as my own starting point: McCree. He is a hero with extremely low utility, extremely low survivability, and extremely high damage potential. A player with high accuracy, high damage per minute, and few deaths per minute is very likely to be a higher impact player than someone with weaker statistics. Such a player is minimizing McCree’s weaknesses (i.e. avoiding death) while playing to his strengths (high damage output). It is very likely that such a player is contributing more to an average game than a player with worse statistics. Even for McCree, though, these statistics are imperfect. Is a given player’s damage relevant? How often is he/she spamming enemy heroes without any plausible follow up (i.e. feeding ultimate charge to enemy supports)? A player who hits a few precise shots to pick a key player at a key moment (e.g. a support at the beginning of the fight or a DPS who is preparing to ult) is inarguably much more impactful to securing wins than one who merely sits in the back making poor focus decisions, yet the latter player would be statistically superior by the previously stated standards.

We can apply this same analysis to quite a few heroes, revealing that statistical judgements of skill become weaker and weaker as we move from the most mechanically demanding heroes in the roster to those with very little ‘traditional FPS skill’ requirements. Even a hero such as Roadhog demands a deeper statistical evaluation to really get at skill. One must weigh damage per minute and survivability against damage taken, as a great Roadhog knows how to minimize his exposure and with it the rate at which he feeds the enemy team ultimate. There is no magic formula to successfully achieve such a balancing act. How can one statistically capture the impact of a Whole Hog that prevents a Dragonblade and a Primal Rage from destroying one’s backline (while doing very little damage and earning no kills)? In a game as complex and decision-rich as Overwatch, I don’t see a way that these judgements can be made accurately and reliably by a predetermined formula.

The ultimate example of how useless statistical measurements of skill are–and how bad percentile-based SR adjustment can be–is of course my favorite foil Mercy. The impact of virtually every aspect of Mercy’s kit is poorly captured by statistical measurements. Hitting a 5 player Resurrection that is responded to by a 6 player Earth Shatter or Graviton Surge is in fact game losing. The statistics show a high ‘resurrected players per ultimate cast’ while the reality in game is that the enemy team just farmed MULTIPLE new ultimates. The entire HP pool of your composition just went into the enemy team’s ultimate bank TWO TIMES OVER. I can’t really overstate how bad it is to make a poor decision about using Resurrection. In these cases, not only would it have been better to save one’s own ultimate, but also it would have been better to disconnect from the server and let your team play 5v6 because at least then you would have had a chance to swing Ultimate tempo. Even if there is no immediate Ult-response to a big Resurrection, if your team fails to win the fight the situation is the same: massive Ultimate tempo swing to the opposing team. Very often, the most impactful Resurrections are instant casts to revive one key player that just died (because the opposing team has often expended cooldowns and cannot kill them again). Thus, playing to maximize the statistical measurements of Resurrection (i.e. waiting for a big Res) is in fact seriously detrimental to the success of the team.

Resurrection is furthermore a relatively weak support ultimate because it requires your teammates’ deaths instead of preventing them as all of the others do (once again Symmetra is not a support). Thus a very smart Mercy player actually chooses not to heal in many scenarios so that her support partner can get his/her ultimate faster. Heals per minute is therefore a fickle statistic whose maximization does not reliably communicate skillful or intelligent play.

Low deaths per minute and high damage boosted are the only statistical measurements of Mercy play that I see as actually meaningful, as these statistics communicate intelligent play and impact maximization. Solo kills with the pistol are also probably quite meaningful, but of course a Mercy player who seeks these out at poor times would be called a thrower. It’s not that Mercy is a ‘no-skill hero’; the key problem is that skillful Mercy play is almost never communicated by impressive stats. Even the statistics I mention as meaningful fail to come close to telling the whole story of player skill and game impact.

(2) Failure to Align Incentives:

Not only are OTPs highly incentivized by the current SR system to continue one-tricking and to play for statistical maximization over wins and losses, these incentives are crucially opposed to the incentive structure that flexible players face. A flex player knows that he/she won’t be playing at the far right tail of his/her heroes’ skill distributions because his/her mastery of the game is spread across many heroes and many situations. The flex player seeks to achieve a high SR by playing the perfect hero imperfectly while the OTP seeks to achieve a high SR by playing the imperfect hero perfectly. While I don’t think that either of these strategies is deserving of punishment, I think that it’s important that the system not prioritize one over the other at any echelon of SR.

In the current system, the flexible player must maintain a higher win percentage (abstracting away from game difficulty) to reach the same SR as the OTP. This is deeply problematic in my eyes, as I see hero swapping as a fundamental part of the game. If an OTP doesn’t wish to engage with hero swapping as a part of gameplay, that’s fine, but their SR should reflect that choice. The same goes for players who don’t wish to engage with communication as a fundamental part of the game: you don’t have to talk, but if you lose games because of it then that is on you and ought to be reflected in your Skill Rating. A truly great player has the knowledge, intelligence, and decisiveness to pick the right hero for the right situation, filling in the gaps of his/her team composition while at the same time countering opposing composition decisions. Not every player has to aspire to be the greatest player of all time, but in my view the entire purpose of having a Skill Rating system to begin with is to measure and validate that very pursuit of greatness.

Suggestion:

Incentive alignment is a goal well worth pursuing. When all players have the same goals, the potential for toxicity is greatly diminished (though certainly not eliminated). I personally find it quite frustrating to queue into Ranked Matchmaking with the goal of winning games, only to find other players do not share the same incentives. At the very top of the Skill Rating system, one should find other players that want to win games, not those that wish to engage in roleplay. This isn’t to say that OTPs can’t be good or impactful to winning games; my argument is rather that OTPs should be judged by their wins and losses rather than by the extent to which they engage in one-tricking. The current system punishes adaptation and experimentation vastly more than it needs to.

There is only one way to guarantee that every player has the same incentive: strip away all of the hidden formulas and percentile adjustments. Only when each player has only one incentive–to win–will incentive alignment truly come about. The only thing that should impact the SR consequences of a win or a loss is the relative skill of each team. Win a hard game and you should clearly be rewarded more than for winning an easy game, vice versa for losses.
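
To make the proposal concrete, here is a minimal sketch of a pure win/loss adjustment in the style of the Elo system used in chess. None of this is Blizzard’s actual formula: the function names, the 400-point scale, and the K-factor of 50 are all assumptions chosen for illustration. The only inputs are the two team average SRs and the result of the game.

```python
def expected_win_probability(team_sr: float, enemy_sr: float, scale: float = 400.0) -> float:
    """Elo-style chance that a team rated `team_sr` beats one rated `enemy_sr`.

    The 400-point scale is an assumption borrowed from chess, not a known SR constant.
    """
    return 1.0 / (1.0 + 10.0 ** ((enemy_sr - team_sr) / scale))


def sr_change(team_sr: float, enemy_sr: float, won: bool, k: float = 50.0) -> float:
    """SR delta for each player on the team, using only the result and game difficulty.

    `k` (the maximum swing per game) is a hypothetical tuning value.
    """
    expected = expected_win_probability(team_sr, enemy_sr)
    actual = 1.0 if won else 0.0
    return k * (actual - expected)


if __name__ == "__main__":
    # Even game: a win is worth half of k.
    print(round(sr_change(3000, 3000, won=True), 1))   # 25.0
    # Hard game: the 100-point underdog gains more for winning...
    print(round(sr_change(3000, 3100, won=True), 1))   # 32.0
    # ...and loses less for losing.
    print(round(sr_change(3000, 3100, won=False), 1))  # -18.0
    # Easy game: the favorite gains less for winning.
    print(round(sr_change(3100, 3000, won=True), 1))   # 18.0
```

Note that no per-hero statistic appears anywhere in the calculation, so a one-trick and a flex player on the same team receive exactly the same adjustment.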

The meaningfulness of Skill Rating is especially important as it is the only clearly available measurement of player skill outside of actual eSports experience. With the Overwatch League on the horizon, the time is now to restructure the system such that the very best rise to the top and have a fair shot at becoming professionals. Right now, the only way to scout talent is to do it on an individual, observational basis. Look at Dota 2 and you will see fresh talent rising out of Ranked Matchmaking and being given a shot at a professional career simply for reaching the very top of the ladder. That’s because their MMR system answers exactly one question: ‘how good are you at winning difficult games?’

If I worked at Blizzard, I’d be demanding a HARD Skill Rating reset at the end of this season and an entirely purified win-loss SR adjustment regime going forward. If Blizzard really wants the best of the best to get their chance at fame and fortune in eSports, then there really is only one way.

Counterarguments:

The existence of percentile SR adjustment is primarily, in my understanding, to combat smurfing (or the purchasing of new accounts to play at a lower level than one’s true skill). Want to get serious about smurfing, Blizzard? IP & MAC check new accounts and tag them for evaluation while adding a report option for suspected smurfs to cross reference: if you can statistically target and punish throwers then there is no reason you can’t statistically target and adjust smurf accounts. It’s fine if statistical adjustments are used in exceptional and targeted cases, just get rid of them as the default for the entire player base.

“But I wanna one trick!” Go right ahead. No one can (or should) stop you. But if you lose games because of it, don’t expect special treatment. OTPs don’t deserve punishment, but they certainly don’t deserve specific rewards over players who choose to engage with hero-swapping as a fundamental and crucially necessary mechanic in Overwatch. This is especially the case as Blizzard is beginning to employ SR as a way to qualify for tournaments (see: OW Open) and they seem to be considering it as a potential scouting mechanic for new talent once the scene is more established.

To Blizzard: fix it now, or condemn the eSports potential of Overwatch in the long run.

 

EDIT: An earlier version of this article referenced Contenders as an example of a SR-gated tournament. This is inaccurate, as Contenders was never SR restricted. Rather it is the Overwatch Open that Blizzard is requiring a certain SR for.

11 thoughts on “In Defense of Purist Skill Rating”

  1. Nice essay; have you seen David Sirlin’s article “Overwatch’s Ranking Point System”?
    I won’t link since I’m not sure if that’s allowed, but it covers similar ground. I don’t think he makes the point as clearly as you do though.

    1. Have to agree entirely with this article. My higher win % has not translated to a higher SR. My quick play hidden SR seems to be much more accurate as I’ve played many more games in this mode to slowly work past the SR hump I used to be in. It’s sad when my quick play teams are far better than my comp play teams.

      In competitive, despite a 55-60% win ratio, my SR slowly slid down. At the same time many friends with a 40% win ratio would slowly rise. I couldn’t comprehend this and so became frustrated with competitive play and eventually quit it.

      I would fully support a pure win weighted SR as it would more accurately reflect my contributions to winning, not how amazing I was with a particular hero compared to everyone else.

  2. Really good read.

    I think the other point to consider is whether Blizzard’s own incentives align with those of top pro players with regard to the SR system; while Overwatch is ostensibly designed in an eSports mold, there’s always tension between inclusiveness and competitive purity.

    In the case of SR specifically, creating a chaotic and unreliable competitive ecosystem is actually a shrewd way of stirring the pot to promote player engagement; incentivizing players to do farmy, unskilled things provides a second path to higher rankings beyond the harsh reality of one’s own skill ceiling.
    Rank mobility is important to maintaining the impression of a return on time spent in-game, which is a basic motivational factor. In this respect, the more people who get to touch higher ranks the better for overall product engagement from Blizzard’s perspective (though detrimental to those who are established and playing at the highest level). At broad scale, the same intangibility of ‘quality’ vs bulk measurement you mention on individual performances is applicable to the question of time spent throughout Overwatch. Promoting the highest level of play is important as a soft target insomuch as the game retains credibility as an eSport, but the hard metric of daily / monthly unique users carries massive revenue implications for their company in a much more immediate way.

    All of which is to say, I think there’s a clear case SR’s primary function is a prize for playing lots of Overwatch. It derives its value from its secondary function as a measurement system, but there’s definitely a couple strands of ‘loot box’ in its DNA as well.

  3. I do find your arguments for a pure win system compelling. However, there’s an undercurrent to some of your logic that’s really bothersome to me, since it’s such a common thing in the community, despite being so damaging.

    In my experience, the single most common cause of toxicity in Overwatch is an excessive focus on hero choice. People think you’re throwing if you pick a non meta hero. People throw to punish “one trickers” who may have a main, but don’t even play it much over half the time. I’ve had people proclaim that they expected no aim from “this Symmetra main” despite my second most played hero being Widowmaker. People give up, flame, blame, and do everything they can to shift responsibility away from themselves. Hero choice, usually others’, is the most common scapegoat.

    Yes, you will lose games by one tricking. But you’ll also lose games by flexing too much. In addition to never getting the statistics to succeed in the system you’re criticising, excessive flexing can also limit your actual impact on the match. I’ve had games where people choose the perfect heroes only to fail to perform, thus preventing me from picking up that hero when I’ve specifically cultivated it as an alternate character to combat situations where I can’t perform on my main.

    The reality is that any extreme policy on hero choice can be damaging to your ability to perform, but only one gets attention. The “Overwatch is about hero flexibility” meme enforces this, and has created a player base more obsessed with their team comps than their teamwork, personal performance, or positive outlook, all of which contribute to losses and toxicity. Many content creators who claim to be helping you improve spend as much time or more talking meta than improvement. And so on.

    Anyway, the point is that hero choice is more complex than “flexibility is good.” Focus is also good. There’s a whole spectrum of flexibility and balance that people don’t consider, and anything outside the most extreme ends can be viable. Cultivating a main character or two and then only expanding your hero pool to cover specific major weaknesses is far better than trying for 8-9 heroes. In the long term, trying to fill in small holes is often not worth your time in matchmaking, especially given games that will be won or lost without any special performance from you (the YouTuber Skyline did a video on that in more detail).

    So thanks for the read. You successfully convinced at least one person that a pure win system is better. Just consider rethinking how you repeat the “Overwatch is about flexibility” mantra.

  4. Just discovered your blog and wanted to let you know how much I appreciate it. You are an exceptional writer and have put so elegantly what I often find myself trying to articulate on the forums with little to no success.

    I love to see people apply their real world skills to the games we all enjoy together. You’re making the community a better place. Keep it up, Jake.

  5. Great article!

    I think you need a couple modifications though.

    (1) One-tricking is not a method of stat padding: you can “one trick” with the goal of winning matches, you can succeed, and the SR system may punish you just as you assume it punishes flex players. I happen to be a one-trick Pharah who tries every match to win the match, period. And the SR system punishes me for it. I have data going back 1300 games (and counting) to the beginning of season 4. The Mid-Season 4 Insurrection patch (April 12) changed how streaks affect SR adjustments. What also changed (or was inadvertently exposed) was how “individual performance” impacts SR adjustments. Starting April 12, it is crystal clear in my rating data that I started losing 4 more points on average for a loss than I gained from a win. My SR went down and I have churned away in a range 200 points lower than where I had maintained a steady rating prior to that patch. My rating has fallen to the point that I can steadily win 20% more games than I lose, at which, on average, I break even.

    (2) I suspect (but have no data) that flex players can benefit from “individual performance stats” just as one-tricks can. It is largely a matter of play style: do you selfishly go for “big plays” or go “kill hunting” while abandoning the point? That sort of thing. I think Mercy, because of the way her “stats” paradoxically can work directly against the team’s best interest, is a great example of how the system can potentially be abused. But what about a flex player on Mercy? I think a mediocre or bad Mercy player is often more likely to do the kinds of things that “pad” stats, but purely as mistakes. For that matter, many one-trick Mercys probably do not know what they are doing wrong to get their SR so high. But I think in general, certain approaches to play that end up padding “stats” can be the result of mistakes and style of play, without being aware of what is wrong, and flex players may well engage in this as well, entirely unintentionally. Whether one is rewarded for “stat padding” or punished for “team play” is largely going to be a matter of play style with whatever heroes you play the most, and for most players is unintentional.
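
    The break-even figures in point (1) can be cross-checked with a little arithmetic. This is only a sketch under stated assumptions: every win is worth a fixed average gain, every loss costs that gain plus 4 points, and “20% more games than I lose” is read as a win-to-loss ratio of 1.2 to 1. The function names are hypothetical, and the 20 and 24 point values are implied by those assumptions rather than taken from the recorded data.

```python
def implied_gain_per_win(extra_loss_points: float, win_loss_ratio: float) -> float:
    """Average gain per win g that breaks even when a loss costs g + extra_loss_points.

    Break-even over L losses and ratio * L wins: ratio * g = g + extra_loss_points.
    """
    return extra_loss_points / (win_loss_ratio - 1.0)


def break_even_win_rate(gain_per_win: float, loss_per_loss: float) -> float:
    """Fraction of games that must be won for SR to stay flat on average."""
    return loss_per_loss / (gain_per_win + loss_per_loss)


if __name__ == "__main__":
    gain = implied_gain_per_win(extra_loss_points=4, win_loss_ratio=1.2)
    print(round(gain, 2))                                  # 20.0
    print(round(break_even_win_rate(gain, gain + 4), 3))   # 0.545
```

    Under those assumptions, a 4-point gap between losses and wins means winning roughly 54.5% of games just to hold a steady rating.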

    Would be ideal if we actually had match histories, including team average SRs before each match, to do detailed analysis.

    From my own game history I can tell you a few things, such as the “scale” of the SR system is roughly the same as standard chess federation Elo: a 100 point higher team average SR is about a 64% win probability (my sigma on that remains fairly high at about 4%, so the 95% confidence interval is about 56% to 72%). I can use this with my win percentage to estimate an unbiased SR from game history. I can also tell you that, at least in Gold tier, there is no (or negligible) overdog/underdog effect (which I believe you refer to as “game difficulty”) on SR adjustment.
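
    For reference, the chess comparison can be reproduced from the standard Elo expected-score formula. This is a sketch, not the analysis behind the numbers above: the function names are hypothetical, and the interval assumes the win-probability estimate is roughly normally distributed.

```python
def elo_expected_score(rating_diff: float, scale: float = 400.0) -> float:
    """Standard Elo expected score for the side rated `rating_diff` points higher."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / scale))


def interval_95(estimate: float, sigma: float) -> tuple:
    """Approximate 95% interval, assuming a normally distributed estimate."""
    half_width = 1.96 * sigma
    return estimate - half_width, estimate + half_width


if __name__ == "__main__":
    # A 100-point advantage on the chess scale.
    print(f"{elo_expected_score(100):.3f}")  # 0.640
    # A 64% estimate with a sigma of 4 percentage points.
    low, high = interval_95(0.64, 0.04)
    print(f"{low:.2f} to {high:.2f}")        # 0.56 to 0.72
```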

    I discuss my data in this very rough video (unlisted – I’m not fishing for views): https://youtu.be/R1I7ozJimLk

    1. Hey Scott, great feedback although I do disagree with you on a few key points. 1) My argument is that one-tricking is a method of stat padding insofar as it’s much easier to improve at one hero if you only play that one hero. Contrast this with players who flex to different heroes who have to learn many different sets of mechanics as opposed to just one. Avoiding death, maximizing cooldown value, and even accuracy are skills that are to some extent hero dependent, and therefore pursuit of their maximization (i.e. ‘stat-padding’, not that this is necessarily a bad thing) is most effective when it is undertaken with a total focus on one hero.

      Tbh I don’t know what it’s like in gold, almost every game I play has a team average SR difference of 100 or greater. I suppose matchmaking is very different at different tiers.

      1. Agreed. One-tricking probably makes deliberate (or even unintentional) stat-padding easier. I worry though that the message getting out is that one tricks always or nearly always are getting a boost in SR that they do not deserve, thus increasing the hate on one-tricks.

        I am pretty sure it is a fairly mixed bag, with some one-tricks artificially rewarded, others penalized, and most are just playing the hero they like best/are best at, with no intention of gaming the system either way.

        You are GM/Pro, right? So I imagine the matchmaker has a tough time finding even matches for you.

        However, I do wonder about that. Does not seem like it should be all that hard for it to team you up with players a bit weaker, then find an opposing team of players somewhere in between.

        Perhaps the matchmaker emphasizes putting players on the same team at the same skill over finding opposing teams of the same average skill.
