Wednesday 2 September 2015

Box Plots Added to Golf Predictor

New box plot for Luke Donald for the 2015 GP Season up to The Barclays
As above, but showing hover text.

I am pleased to announce the addition of useful new box plots to Golf Predictor. A box plot is a type of statistical chart that shows the distribution of a data set at a glance. I have adapted them for use with golfer results in Golf Predictor, using the statistics already available on the various pages. The result can be seen in the screenshots above, for Luke Donald for the 2015 GP Season up to The Barclays. There are five main parts to a box plot:
  1. The minimum value in the data set is the leftmost point (seen as a vertical tick above). This corresponds to the best finish for the golfer in the results analysed.
  2. The maximum value in the data set is the rightmost point (also seen as a vertical tick above). This corresponds to the worst finish for the golfer in the results analysed.
  3. The average value in the data set is the red vertical line in the box. This corresponds to the average finish for the golfer in the results analysed.
  4. I have adapted the box itself to represent one standard deviation either side of the average value. In layman's terms, this is where most* data points in the data set fall. This corresponds to the majority of the golfer finishing positions in the results analysed.
  5. The lines that join the box to the minimum and maximum values are known as whiskers. The longer a whisker is, the more of an aberration (outlier) the minimum/maximum value is compared to the majority of the data points. In this case, a long whisker indicates that the best/worst finish is significantly different from the golfer's usual finishing position in the results analysed.
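
For the technically minded, the five parts above could be worked out along the following lines. This is only a rough sketch in Python and not the code the site actually runs. The function name and the finishing positions are made up for illustration, and I have assumed the population standard deviation here (the sample version would give a slightly wider box).

```python
from statistics import mean, pstdev

def box_plot_values(finishes):
    """Return the main box plot values for a list of finishing positions."""
    average = mean(finishes)
    # Assumption: population standard deviation; the site may use another form.
    spread = pstdev(finishes)
    return {
        "best": min(finishes),           # left tick (minimum value)
        "worst": max(finishes),          # right tick (maximum value)
        "average": average,              # red line in the box
        "box_lower": average - spread,   # left edge of the box
        "box_upper": average + spread,   # right edge of the box
    }

# Made-up finishing positions, not any real golfer's results.
values = box_plot_values([10, 20, 20, 30, 30, 40, 60])

# Whisker lengths: distance from each box edge to the best/worst finish.
left_whisker = values["box_lower"] - values["best"]
right_whisker = values["worst"] - values["box_upper"]
print(values, left_whisker, right_whisker)
```

In this made-up case the right whisker is about three times the length of the left one, so the worst finish is further from the golfer's usual results than the best finish is.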

Ideally, a golfer would have the lowest possible best and worst results. In the real world, however, most top golfers will have some mixture of good, so-so and bad results, especially over the course of a GP season or longer time frame. A box plot tells you immediately where a golfer usually finishes and how different from normal his best and worst finishes are. Looking at the screenshot above, you can see that the whisker from the box to the best result is pretty short, while the whisker to the worst result is very long. This tells us that Luke Donald has usually performed much closer to his best result than his worst one in the 2015 GP season and that his worst result was a significant aberration from his usual result.

The ideal box plot for golfer results therefore is one with a low best finish, a narrow box** (=low standard deviation, which means high consistency), a very short (or no) whisker to the left and a long whisker to the right. For example, the box plot for Jordan Spieth for the 2015 GP Season so far almost meets these criteria (his standard deviation is a bit high!). Since his results have been so good in the main, the lower end of the box encompasses his best finish of first (see note 2 below) and there is a long whisker from the box to his worst finish of 119th.

Hovering over the box plot will display the main values for the chart, as shown in the second screenshot above. Most of this information is also shown just above and below the box plot, but the hover text also displays the lower and upper bounds of the box (the average minus and plus the standard deviation, respectively). Some notes on these new box plots:

  1. The box plot is hidden if the best finish is equal to the worst finish. In this case, the best, worst and average results are all the same, the standard deviation is zero and the box plot would be a single vertical line!
  2. If the box bounds are lower than the best result or higher than the worst result, the box bounds are truncated at the best/worst result. This means if the average minus the standard deviation is less than the best result, the best result becomes the left side of the box (ditto for the other side of the box). This is because none of the majority of the golfer results can be outside the bounds of his actual results! For example, for Jordan Spieth in 2015 so far, his average minus his standard deviation is a negative number, which obviously can't be a finishing position in a tournament and is less than his best actual finish of first. Therefore, his box begins at 1, his best result.
  3. I have noticed that the edge of the box plot is slightly off in certain rare situations and only when at least one end of the box is the same as the best/worst result. This is only a very minor issue and beyond my control. All attempts to remedy it resulted in a worse situation, so I left it as is.
  4. These charts have been implemented using a modestly sized JavaScript library and should not impact noticeably on the page load time. If you experience any issues with this new functionality, please contact me immediately.
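
Notes 1 and 2 above can be sketched in a few lines of Python. Again, this is an illustrative sketch under my own assumptions rather than the site's actual code: the function name is hypothetical, the results are made up and the population standard deviation is assumed.

```python
from statistics import mean, pstdev

def clamped_box(finishes):
    """Return (lower, upper) box bounds, or None if the plot would be hidden."""
    best, worst = min(finishes), max(finishes)
    if best == worst:
        # Note 1: every result is the same, so there is nothing to plot.
        return None
    average = mean(finishes)
    spread = pstdev(finishes)  # assumption: population standard deviation
    # Note 2: the box can never extend past the golfer's actual results.
    lower = max(best, average - spread)
    upper = min(worst, average + spread)
    return lower, upper

# Made-up results for a golfer with several very high finishes.
print(clamped_box([1, 1, 2, 4, 10, 60]))  # lower bound is truncated to 1
print(clamped_box([5, 5, 5]))             # None, so the plot is hidden
```

In the first example, the average minus the standard deviation is negative, so the left side of the box is moved to the best finish of 1, just as described for Jordan Spieth above.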

A total of thirteen new box plots have been added to the pages on the site that show statistical information on a golfer's results in a certain collection of tournaments. Specifically, the box plots have been added to the following pages:
  1. Season Data page (1 box plot for season results)
  2. Golfer Data page (5 box plots for overall/regular/major/WGC/FedEx results in the golfer's career)
  3. Prediction Data page (7 box plots for GP season/last five events/course/tournament/last twelve similar events/similar weather/similar length course results)

This brings the total number of charts on the site to 433. I trust you will find these new box plots useful as a graphical representation of a golfer's performance in certain key areas. Just another way to make Golf Predictor even better!

*For the statistically inclined, for golfer results that follow a normal distribution, this will be about 68% of his results. The figure will vary for other distributions and is difficult to quantify, but it should still be above 50%. Most golfers should have a fairly symmetrical distribution (some good results, some bad ones and most in the middle), especially over a large result set, e.g. over a season or a career.
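
If you want to check this share for a particular set of results, a minimal Python sketch might look like the following. The function name and the results are made up, and the population standard deviation is an assumption on my part.

```python
from statistics import mean, pstdev

def share_within_one_sd(finishes):
    """Fraction of results within one standard deviation of the average."""
    average = mean(finishes)
    spread = pstdev(finishes)  # assumption: population standard deviation
    inside = [f for f in finishes
              if average - spread <= f <= average + spread]
    return len(inside) / len(finishes)

# Made-up finishing positions, not any real golfer's results.
share = share_within_one_sd([10, 20, 20, 30, 30, 40, 60])
print(f"{share:.0%} of results fall inside the box")
```

For these seven made-up results, five fall inside the box, which is a little above the 68% expected for a normal distribution.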

**The width of the box on screen depends on how far apart the best and worst results are. For consistent golfers, this range will be small and the box may appear wide, while for golfers with widely fluctuating results, the box may appear narrow. You should take this into account or use the absolute values of the box ends as shown in the hover text.
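
The effect can be illustrated with a short Python sketch. It assumes the chart axis runs exactly from the best finish to the worst finish, which is my simplification and may not match how the chart library scales things. The function name and both sets of results are made up.

```python
from statistics import mean, pstdev

def box_widths(finishes):
    """Return (absolute box width, share of the best-to-worst range it covers).

    Assumes best != worst (the plot is hidden otherwise).
    """
    best, worst = min(finishes), max(finishes)
    average, spread = mean(finishes), pstdev(finishes)
    lower = max(best, average - spread)   # box never extends past best
    upper = min(worst, average + spread)  # box never extends past worst
    absolute = upper - lower
    # Assumption: the axis spans best to worst, so this is the on-screen share.
    return absolute, absolute / (worst - best)

consistent = [10, 12, 14, 16, 18]                    # made-up steady golfer
erratic = [20, 21, 22, 23, 24, 25, 26, 27, 28, 120]  # made-up, one bad week

consistent_abs, consistent_share = box_widths(consistent)
erratic_abs, erratic_share = box_widths(erratic)
print(consistent_abs, consistent_share)
print(erratic_abs, erratic_share)
```

The steady golfer's box covers fewer finishing positions in absolute terms, yet it takes up a larger share of his chart than the erratic golfer's box does of his.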
