Mathematics in Sports: Will athletes become mathletes?
To give you a solid starting point, I first introduce some general terminology and explain the basic concepts of baseball; the abbreviations used are given in brackets. A baseball field has four bases, one of which is home plate, set up in a square with the pitcher’s mound in the middle. A point (or run) is scored when a player reaches home plate after visiting all other bases in counter-clockwise order without being put out along the way. A player may only start running after successfully batting the ball into play from home plate. When a batter does not hit the ball, one of several things has happened. If the batter swung his bat but missed the ball, a strike is called. If the batter did not swing but the ball passed through the strike zone, a strike is also called. If the batter did not swing and the ball did not pass through the strike zone, a ball is called. Four balls result in a free walk to first base (BB or ‘base on balls’). A batter also receives a free walk to first base when he is hit by the ball (HBP or ‘hit by pitch’).
A baseball game consists of nine innings in which both the home team and the visiting team have a batting turn. This batting turn only ends when the batting team has recorded three outs. These outs can be recorded in four major ways:
– Strikeout: A batter receives three strikes before putting the ball into play;
– Flyout: A fielder catches the ball before it hits the ground;
– Force out: A runner fails to reach the next base before a fielder touches the base while holding the ball;
– Tag out: A runner is touched by a fielder holding the ball while not touching a base.
A very common and simple way to predict winning percentages is Baseball’s Pythagorean Theorem, developed by Bill James:
Winning percentage = (runs scored)^2 / ((runs scored)^2 + (runs allowed)^2).
This formula looks deceptively simple but has some desirable properties. Both an increase in runs scored and a decrease in runs allowed increase the predicted winning percentage. In addition, the result always lies between zero and one. As shown in Mathletics by Wayne L. Winston (1), this naïve estimate performs very well when applied to Major League Baseball in the USA.
Besides predicting how many games a team will win, this formula has another important application: one can compute how many wins a team will ‘gain’ by replacing one of its players. To do this, one needs the number of runs this player created as a batter; how to estimate that number is explained later in this article. Suppose a season consists of 100 games and a baseball team called Asset | Econometrics scored 850 runs during a season and gave up 800 runs. Suppose also that they now trade one of their players (John), who created 120 runs, for another player (Bob), who created 140 runs in the same number of plate appearances. Before the trade, one would have predicted them to win 53 games. After trading John for Bob, the team is expected to score 850 - 120 + 140 = 870 runs, and one would predict 54.2 wins. Thus one can argue that this trade will improve Asset | Econometrics’ season performance by approximately 1.2 wins.
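The trade example can be checked with a few lines of Python. This is only a minimal sketch: it assumes the usual exponent-2 form of the formula (runs scored squared, divided by the sum of runs scored squared and runs allowed squared), and the function name `pythagorean_win_pct` is a hypothetical choice made for this illustration.

```python
def pythagorean_win_pct(runs_scored: float, runs_allowed: float,
                        exponent: float = 2.0) -> float:
    """Predicted fraction of games won: RS^e / (RS^e + RA^e)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


GAMES = 100  # season length used in the example

# Before the trade: 850 runs scored, 800 runs allowed.
wins_before = GAMES * pythagorean_win_pct(850, 800)

# After the trade: John's 120 runs created are replaced by Bob's 140.
wins_after = GAMES * pythagorean_win_pct(850 - 120 + 140, 800)

print(f"before: {wins_before:.1f}, after: {wins_after:.1f}, "
      f"gain: {wins_after - wins_before:.1f}")
# prints "before: 53.0, after: 54.2, gain: 1.2"
```

The exponent is kept as a parameter only so that the sketch can be reused with other values; the example itself relies on the default of 2.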
Successful hits by batters are divided into four categories: Singles, Doubles (2B), Triples (3B) and Home Runs (HR). These categories denote the number of bases the batter advances on his hit: one, two, three and four, respectively. We then define the total number of bases (TB) by the intuitive formula:
TB = Singles + 2 * 2B + 3 * 3B + 4 * HR.
Next, we use the term ‘at bats’ (AB) for the number of plate appearances of a hitter in which neither BB nor HBP occurs, that is, the number of successful hits plus the number of times he was put out at the plate (for example by a flyout or a strikeout). With this information we will try to determine accurately how many runs a hitter created in a season. For this we use the formula:
Runs created = (hits + BB + HBP) * TB / (AB + BB + HBP).
In this formula, (hits + BB + HBP) is essentially the number of times a player ended up as a runner. The remaining fraction indicates the rate at which the player advances runners per plate appearance. The product of the two does indeed seem a reasonable estimate of the total number of runs a player creates.
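As a small illustration, the runs-created estimate can be sketched in Python. The formula used here, (hits + BB + HBP) * TB / (AB + BB + HBP), is the form implied by the description above. The function names and the season stat line are hypothetical and serve only to show the arithmetic.

```python
def total_bases(singles: int, doubles: int, triples: int, home_runs: int) -> int:
    """TB = Singles + 2 * 2B + 3 * 3B + 4 * HR."""
    return singles + 2 * doubles + 3 * triples + 4 * home_runs


def runs_created(hits: int, walks: int, hit_by_pitch: int,
                 at_bats: int, tb: int) -> float:
    """Times on base multiplied by bases advanced per plate appearance."""
    times_on_base = hits + walks + hit_by_pitch
    plate_appearances = at_bats + walks + hit_by_pitch
    return times_on_base * tb / plate_appearances


# Hypothetical season line (invented numbers, not taken from the article).
singles, doubles, triples, home_runs = 100, 30, 5, 20
hits = singles + doubles + triples + home_runs            # 155
tb = total_bases(singles, doubles, triples, home_runs)    # 255
rc = runs_created(hits, walks=60, hit_by_pitch=5, at_bats=550, tb=tb)
print(f"TB = {tb}, runs created = {rc:.1f}")  # roughly 91 runs created
```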
However, there is one major problem with this method: a very bad hitter with a lot of plate appearances will create more runs than a superstar with just a few plate appearances. To solve this problem, it is important to consider that every hitter also consumes outs. By the definitions of AB, BB and HBP, the number of outs a hitter uses up is approximately equal to AB - hits, assuming no errors, double plays, sacrifice flies or bunts. Taking into consideration that a game consists of 9 innings of 3 outs each, so 27 outs in total, one can compute the runs created per game by:
Runs created per game = Runs created / ((AB - hits) / 27).
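The effect of this correction can be sketched as follows. The code assumes the per-game form described above, runs created divided by (AB - hits) / 27, and compares two invented hitters: a weak everyday player and a strong part-time player. Both stat lines and all names are hypothetical.

```python
OUTS_PER_GAME = 27  # 9 innings of 3 outs each


def runs_created(hits, walks, hit_by_pitch, at_bats, total_bases):
    """(hits + BB + HBP) * TB / (AB + BB + HBP)."""
    return ((hits + walks + hit_by_pitch) * total_bases
            / (at_bats + walks + hit_by_pitch))


def runs_created_per_game(hits, walks, hit_by_pitch, at_bats, total_bases):
    """Runs created per 27 outs, approximating outs used by AB - hits."""
    games_of_outs = (at_bats - hits) / OUTS_PER_GAME
    return runs_created(hits, walks, hit_by_pitch, at_bats, total_bases) / games_of_outs


# Hypothetical stat lines, invented for illustration only.
weak_regular = dict(hits=130, walks=30, hit_by_pitch=0, at_bats=600, total_bases=170)
part_time_star = dict(hits=35, walks=15, hit_by_pitch=0, at_bats=100, total_bases=65)

for name, line in [("weak regular", weak_regular), ("part-time star", part_time_star)]:
    print(f"{name}: runs created = {runs_created(**line):.1f}, "
          f"per game = {runs_created_per_game(**line):.2f}")
```

With these numbers the weak regular has the higher season total (about 43 against about 28 runs created), while the part-time star is far ahead per game (about 11.7 against about 2.5), which is the ordering the per-game measure is meant to produce.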
Although this seems a good way to measure a hitter’s performance, it does not work for every hitter. An average hitter is described quite well, but a simple example shows that hitters with extreme statistics are overvalued.
Suppose there is a player, Sam, who hits a home run in 50% of his plate appearances and strikes out in the other 50%. A team consisting only of players exactly like Sam would score on average three home runs per inning (one for every out it makes), resulting in an average of 27 runs per game. However, applying the formula above gives a quite different result. A team of only Sams will have, in an average game, 54 at bats, 27 hits and 27 home runs. This yields AB = 54, hits = 27, HR = 27 and TB = 108, and thus 27 * 108 / 54 = 54 runs created. The runs created per game also equal 54, since we considered one average game only and thus (AB - hits)/27 = 1. This shows that the method above predicts Sam to score twice as many runs as he actually would, greatly overestimating his value as a player. To solve this problem one could use Monte Carlo simulation to predict the actual runs created more accurately. We will not go into detail on that in this article, however. Neither will we go into detail on how to measure a player’s pitching or fielding performance; that is simply too complicated to discuss briefly.
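To make the Sam example concrete, here is a minimal Monte Carlo sketch in Python. It is not the simulation from Mathletics but a simplified stand-in in which every plate appearance is either a home run (one run) or a strikeout (one out), each with probability 0.5. The function name, the seed and the number of simulated games are arbitrary choices for this illustration.

```python
import random


def simulate_sam_game(rng: random.Random, innings: int = 9,
                      outs_per_inning: int = 3, p_home_run: float = 0.5) -> int:
    """Runs scored in one game by a team made up entirely of Sams."""
    runs = 0
    for _ in range(innings):
        outs = 0
        while outs < outs_per_inning:
            if rng.random() < p_home_run:
                runs += 1   # home run: exactly one run, bases stay empty
            else:
                outs += 1   # strikeout
    return runs


rng = random.Random(0)  # fixed seed so the run is reproducible
n_games = 20_000
simulated_avg = sum(simulate_sam_game(rng) for _ in range(n_games)) / n_games

# Formula value for one average game: AB = 54, hits = 27, TB = 108, no walks.
formula_runs_created = 27 * 108 / 54
formula_per_game = formula_runs_created / ((54 - 27) / 27)

print(f"simulated runs per game: {simulated_avg:.2f}")  # close to 27
print(f"formula runs per game:   {formula_per_game:.0f}")
```

The simulated average settles near 27 runs per game, while the formula gives 54, which reproduces the factor-of-two overestimate described above.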
In basketball, mathematics and numbers are also used extensively to evaluate players. However, in contrast to baseball, scoring points is less of an individual effort. Therefore, simply rating players by the number of points they score during a game might not be a very good approach. We are looking for a method that incorporates other factors that influence the score, such as assists, rebounds, steals, etcetera. Most of these are offensive in nature, but we also want to include defensive skills, such as blocking and defensive rebounding. We will discuss one metric for rating a player, namely the +/- player rating. After introducing a simple form of it, we will extend it to the adjusted +/- player rating to resolve several issues.
The +/- player rating is in principle a very simple statistic. It measures the number of points by which the player’s team outscored its opponents while the player was playing, scaled to a full game. An example will hopefully clear up any confusion. Suppose one game is played between two teams, and two players, Joe and Ben, played for the winning team (team Asset | Econometrics). Joe only played during the first half of the match and Ben played the full 48 minutes. Suppose also that the score at half-time was 40-44 in favor of the opposing team and the final score was 90-79 in favor of Asset | Econometrics. Then the +/- player rating of Joe would be -4 * 48/24 = -8. The +/- player rating of Ben would equal 11 * 48/48 = 11. The +/- player rating is thus simply a measure of how well the team performed while the player was playing. By this measure, Ben performed better as a player than Joe did.
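The same calculation in Python, as a minimal sketch; the function name `plus_minus` is a hypothetical choice made for this illustration.

```python
GAME_MINUTES = 48  # length of a full game in minutes


def plus_minus(team_points: int, opponent_points: int, minutes_played: float) -> float:
    """Point margin while the player was playing, scaled to a full game."""
    return (team_points - opponent_points) * GAME_MINUTES / minutes_played


# Joe played only the first half, which his team lost 40-44.
joe = plus_minus(40, 44, minutes_played=24)

# Ben played the whole game, which his team won 90-79.
ben = plus_minus(90, 79, minutes_played=48)

print(joe, ben)  # prints "-8.0 11.0"
```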
There is, however, quite a big issue with using this metric to analyze player performance: it is very hard to compare players from different teams. This is because the metric does not take into account that the difference in points scored largely depends on the skill of the other players on the field (both on the player’s own team and on the opposing team). Therefore players from badly performing teams will rarely obtain a high player rating, even though they might be among the best players in the league, because they are playing alongside a lot of weak players. This problem is addressed by the adjusted +/- player rating.
For the adjusted +/- player rating, we divide every game we analyze into time segments. A time segment is the longest stretch of time in which no substitutions occur. For every time segment we analyze, we collect the following data:
– Players on the field of the home team;
– Players on the field of the visiting team;
– Length of the time segment in minutes;
– Score during the time segment.
We predict the home team’s scoring margin in any time segment to be the sum of the home team’s player ratings minus the sum of the visiting team’s player ratings, multiplied by the fraction of a game that the time segment lasts. This fraction is calculated by dividing the length of the time segment by 48 (the total length of a game in minutes). We then calculate the prediction error by subtracting the true margin in points from this prediction. Last but not least, we calculate the sum of squared prediction errors by squaring the prediction errors and summing them over all time segments. Of course, we cannot calculate any of these numbers directly, since the actual player ratings are unknown; they are exactly the variables we want to determine. To determine them, we minimize the sum of squared prediction errors numerically.
In this way we determine the player ratings that most closely satisfy the assumption underlying the prediction, namely that the difference in summed player ratings between the two teams should equal the difference in score. With this assumption as its core element, the method gives us player ratings that tell us what we want to know: how much better a team would perform if an average basketball player were replaced by the player whose rating is considered. Therefore the player with the highest rating is the one who, according to this model, performed best in the time period considered, regardless of which team he plays for.
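The fitting procedure can be sketched in plain Python. Everything specific in this sketch is an assumption made for illustration: the four players, the 2-on-2 line-ups and the segment data are invented toy values (real basketball has five players per side), plain gradient descent stands in for whatever numerical minimizer one prefers, and the ratings are shifted afterwards so that they average zero, which makes a rating of zero correspond to an average player.

```python
GAME_MINUTES = 48

# Hypothetical toy segments: (home players, away players, minutes, home margin).
segments = [
    (("A", "B"), ("C", "D"), 12, 4.0),
    (("A", "C"), ("B", "D"), 24, 3.0),
    (("A", "D"), ("B", "C"), 24, 1.0),
    (("C", "D"), ("A", "B"), 6, -2.0),
]


def predicted_margin(ratings, home, away, minutes):
    """(Sum of home ratings - sum of away ratings) * fraction of a game."""
    diff = sum(ratings[p] for p in home) - sum(ratings[p] for p in away)
    return diff * minutes / GAME_MINUTES


def sum_squared_errors(ratings, segments):
    """Objective that the fit minimizes."""
    return sum((predicted_margin(ratings, h, a, m) - margin) ** 2
               for h, a, m, margin in segments)


def fit_ratings(segments, learning_rate=0.1, steps=2000):
    """Minimize the sum of squared prediction errors by gradient descent."""
    players = sorted({p for h, a, _, _ in segments for p in h + a})
    ratings = {p: 0.0 for p in players}
    for _ in range(steps):
        grad = {p: 0.0 for p in players}
        for home, away, minutes, margin in segments:
            frac = minutes / GAME_MINUTES
            err = predicted_margin(ratings, home, away, minutes) - margin
            for p in home:
                grad[p] += 2 * err * frac
            for p in away:
                grad[p] -= 2 * err * frac
        for p in players:
            ratings[p] -= learning_rate * grad[p]
    # Only rating differences matter, so center the ratings on zero.
    mean = sum(ratings.values()) / len(ratings)
    return {p: r - mean for p, r in ratings.items()}


ratings = fit_ratings(segments)
print({p: round(r, 2) for p, r in ratings.items()})
print("remaining squared error:", round(sum_squared_errors(ratings, segments), 6))
```

The toy margins were constructed without noise from the ratings A = 6, B = 2, C = -3 and D = -5, so the fit recovers those values and the remaining error is essentially zero. With real data the errors would not vanish, and the step size and number of steps would need tuning, or a standard least-squares solver could be used instead.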
Taking all the methods discussed into consideration, we can conclude that athletes will not have to become mathematical geniuses anytime soon. The vast majority of mathematical and statistical applications in sports are used to evaluate players and teams as a whole and are thus not relevant while actually playing. It is, however, important for managers and coaches to pay attention to these methods, since they can give a lot more insight into the performance of players than the usual, intuitive statistics might. Still, the most important thing when playing sports is to have fun while you do it!
Text by: Ernst Roos
(1) Winston, W.L. (2009). Mathletics: How Gamblers, Managers, and Sports Enthusiasts Use Mathematics in Baseball, Basketball, and Football. Princeton: Princeton University Press.