Elo rating in online games

The Elo rating system is a simple mathematical system for estimating the relative strength of competitors in a sport or game from the results of a series of matches. In its basic implementation every player starts with the same rating; whenever a match is played, the rating of the winner is increased and the rating of the loser is decreased by the same amount. When a higher-rated player beats a lower-rated player, only a small amount of rating is transferred; when the lower-rated player wins, a large amount is transferred instead. Thus, when two unequally skilled players play a long enough series of matches, the difference in their ratings trends towards an equilibrium: the less skilled player wins rarely, but each win transfers so much more rating than each loss that the gains and losses balance out. A similar equilibrium can be reached by a group of players who play sufficiently many matches across a wide enough variety of pairings. Once such an equilibrium has been reached, the ratings assess the relative strength of the players reasonably accurately.
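To make the mechanics concrete, here is a minimal sketch of the standard Elo update. The K-factor of 32 and the 400-point scale are conventional choices, not necessarily the parameters used in the simulations below.

```python
# A minimal sketch of the standard Elo update. K (the maximum transfer per
# game) and the 400-point scale are conventional values, assumed here.
def expected_score(rating_a, rating_b):
    """Probability that player A beats player B according to the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_ratings(rating_a, rating_b, a_won, k=32):
    """Return the new (rating_a, rating_b) after one match.

    The winner gains exactly what the loser loses, so the sum of the two
    ratings is preserved.
    """
    e_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - e_a)
    return rating_a + delta, rating_b - delta
```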

The following simulation demonstrates an idealised example of Elo rating in action. 1000 black dots represent 1000 players of varying strength. The strongest players are furthest to the right and the weakest furthest to the left, while the highest-ranked players are displayed towards the top and the lowest-ranked towards the bottom. Before you start the simulation all the players have the same rating; once it runs, the players play matches against random opponents and have their ratings adjusted according to the Elo rules.
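A rough sketch of this kind of simulation might look as follows, reusing the update_ratings helper from the sketch above. The player count matches the article; the number of rounds and the logistic win-probability model driven by a hidden "true skill" value are assumptions on my part, not the exact model used here.

```python
import random

# Sketch of the first simulation: equal starting ratings, random pairings,
# and win probability driven by a hidden "true skill" value.
NUM_PLAYERS = 1000
ROUNDS = 200

true_skill = [random.gauss(1500, 200) for _ in range(NUM_PLAYERS)]
ratings = [1500.0] * NUM_PLAYERS  # everyone starts at the same rating

def first_wins(i, j):
    """Decide the winner of a match from the players' hidden true skill."""
    p = 1.0 / (1.0 + 10 ** ((true_skill[j] - true_skill[i]) / 400.0))
    return random.random() < p

for _ in range(ROUNDS):
    order = list(range(NUM_PLAYERS))
    random.shuffle(order)
    for i, j in zip(order[::2], order[1::2]):  # random pairings
        ratings[i], ratings[j] = update_ratings(ratings[i], ratings[j],
                                                first_wins(i, j))
```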

Notice that the lowest- and highest-ranked players take a long time to reach their proper ratings. Pairing players of widely differing strength is an inefficient way to find the rating equilibrium, so a better method than pairing players at random is to match them only against players of approximately the same strength. The following simulation uses a biased matchmaking that tends to pair players of similar rating, and as a result the diagonal distribution is achieved much faster.
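One simple way to bias the matchmaking is to sort players by current rating and pair neighbours, with a little random jitter so the pairings vary between rounds. This is only an illustration of the idea, not necessarily the scheme the simulation uses.

```python
import random

def matchmake_by_rating(ratings, jitter=50.0):
    """Pair players with others of similar current rating.

    Sorting by rating plus a little random jitter pairs mostly-equal
    opponents while still varying the pairings from round to round.
    """
    order = sorted(range(len(ratings)),
                   key=lambda i: ratings[i] + random.gauss(0, jitter))
    return list(zip(order[::2], order[1::2]))
```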

In the graphical representation I have chosen, the ideal distribution of players is a thin 45-degree diagonal line from the lower left to the upper right. If the line is thick, there is a high average inaccuracy in the ratings. If the line is steeper than 45 degrees, the ratings overestimate the skill gaps; if it is shallower, they underestimate them. Small deviations from the ideal 45-degree inclination are usually of little practical concern, as the ratings still do a good job of revealing relative skill levels. Such deviations are easy to spot on these graphs, but in real life there is no way to know a player's true skill, and thus no way to gather meaningful data for the position on the horizontal axis. As a result the issues that occur in real rating systems are not well understood; with these simulations I hope to reveal some of the most widespread ones.

The first two simulations lacked a lot of realism: no real game has a constant player base that keeps playing for eternity. Eventually players quit while new players join, and in online games there is typically a rather large number of players who only play a few games and never return, often because they find themselves unable to compete with the established player base. Also, the skill of a player is rarely constant: new players typically perform poorly but improve over time. My simulation model is built on some simple formulas that reflect these realities. New players have a far greater chance of dropping out than established players, especially while they haven't won their first game, and as players gain experience they also become more skilled, with the skill gain simplified to a constant rate. When this model is paired with a simple Elo scheme the result is less than stellar.
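The sketch below captures these dynamics in the roughest possible way; the dropout probabilities and the per-game skill gain are hypothetical numbers chosen for illustration, not the formulas actually used in the simulations.

```python
import random

# Hypothetical player lifecycle model: new players are likely to quit,
# especially before their first win, and everyone improves slightly with
# each game played. All constants are illustrative assumptions.
SKILL_GAIN_PER_GAME = 2.0          # constant improvement rate
DROPOUT_BEFORE_FIRST_WIN = 0.10    # chance of quitting after a winless game
DROPOUT_AFTER_FIRST_WIN = 0.01     # chance of quitting once established

def after_match(player, won):
    """Update a player's skill, win flag, and dropout status after a game."""
    player["games"] += 1
    player["skill"] += SKILL_GAIN_PER_GAME
    if won:
        player["has_won"] = True
    quit_chance = (DROPOUT_AFTER_FIRST_WIN if player["has_won"]
                   else DROPOUT_BEFORE_FIRST_WIN)
    player["active"] = random.random() >= quit_chance
```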

The simulation settles into a banana curve: the high-skill players sit on a nice diagonal line, but the inclination lessens with decreasing skill, right down to the bottom where it is practically flat. The bottom is where new players start, and this structure means that most new players get matched against much better opponents, so they mostly lose. The result is that most new players quit quite quickly.

So why doesn't the Elo system correct this? Why don't the good players rise so far in rating that the mid-level players end up well above the beginners? Because basic Elo preserves the average rating of the system: the winner gains exactly what the loser loses, so no match can change the total. Only when a player quits does the average change, so in a settled system the average rating of the leaving players must equal the rating given to new players. That can only be true if new players enter at a rating around the average skill level, and so that is where it ends up.

A common modification of the Elo system is to increase the rate of change for new players. While this breaks the preservation of the average rating, it does not solve the problem; on the contrary, it allows new players to drop much further in rating, ultimately making the problem worse.
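The usual form of this modification is a K-factor that starts high and shrinks once a player has accumulated some games; a minimal sketch, with the specific thresholds and values being assumptions:

```python
def k_factor(games_played):
    """Larger K for new players so their ratings move faster at first."""
    return 64 if games_played < 30 else 32
```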

Another common modification is to not allow players to drop below the initial rating. While the banana curve remains, the player retention is markedly improved.
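Enforcing such a floor is simply a clamp applied after the normal update; a sketch, assuming the floor equals an initial rating of 1500:

```python
INITIAL_RATING = 1500.0

def apply_floor(new_rating):
    """Never let a rating drop below the initial rating."""
    return max(new_rating, INITIAL_RATING)
```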

As players bounce off the rating floor the average rating increases, so the rating curve can rise without players needing to drop from the game. This effect, however, requires a significant proportion of players to regularly hit the bottom rating, which maintains the mostly flat low-skill section.

For completeness, some games use a practice of regularly resetting the rating of all players. I can't imagine why this would help, and indeed, the simulation seems to confirm that regularly throwing away all the gathered data is a pretty poor choice.

So, what is a good Elo modification for online games? Ideally, rating points should be injected into the system at a rate corresponding to the players' skill gain. For this purpose I have given each new player a pool of hidden rating points. When a player wins a match, some of these points are transferred to the player's rating: initially this transfer is twice the ordinary gain from the win, but it drops gradually as the pool is depleted.
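A sketch of how such a bonus pool might be implemented; the pool size and the linear depletion rule are my own illustrative assumptions, as the text only specifies that the transfer starts at twice the ordinary gain and falls as the pool empties.

```python
def update_winner_with_pool(rating, pool, normal_gain, initial_pool=400.0):
    """Add hidden pool points to a winner's rating on top of the normal gain.

    The pool transfer starts at twice the ordinary gain and shrinks in
    proportion to how much of the pool remains.
    """
    bonus = min(pool, 2.0 * normal_gain * (pool / initial_pool))
    return rating + normal_gain + bonus, pool - bonus
```

Note that only the winner's side is augmented; the loser still loses only the normal amount, so the extra points genuinely inject rating into the system rather than being taken from the opponent.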

The result of this system is pretty close to the ideal thin diagonal line, and the player retention is the best so far. The most notable remaining problem is that the players at the very bottom are still rated somewhat poorly. This is unavoidable: no rating system can make up for the lack of data on players who have played very few or no matches.