Bye, R-squared

Out of every concept ever invented in the world of statistics, few have been as misused or misunderstood as the R-squared metric.

It’s so simple to do a regression in Microsoft Excel and get the regression equation and R-squared metric reported to you that many people do it inappropriately and misinterpret the results. Some people do little more than take a data set and then test to see which regression line has the highest R-squared value. This can easily lead to big problems.

For example, let’s say I have the data shown below. At X=0, Y=0 and at X=1, Y=1. My X-values are the integers from 0 to 9.

Each Y-value is made by taking X and dividing by the previous Y. So, for X=2, Y = 2/1 (or, 2) and for X = 3, Y = 3/2 (or, 1.5).

My question is: What’s the best regression fit for these value?

I can do a regression simply by making a graph, right-clicking on it, and selecting “Add Trendline…”

Excel gives me 6 options, ‘Linear’, ‘Logarithmic’, ‘Polynomial’, ‘Power’, ‘Exponential’, and ‘Moving Average’, but for this case, ‘Power’ and ‘Exponential’ are grayed-out, so I can’t choose them.

Oddly, if I select ‘Logarithmic’ I get an error message that says “Some trendlines cannot be calculated from data containing negative or zero values”, and then I get no trendline for that option. So I really can’t choose that option either.

That leaves me only ‘Linear’, ‘Polynomial’ and ‘Moving Average’ to choose from, but ‘Moving Average’ doesn’t calculate an R-squared value, so if I wan’t to use R-squared as my metric to decide which trendline is best, I can’t select ‘Moving Average’.

If I choose a ‘Linear’ fit, the R-squared value is 0.63384, but if I choose a 2nd-order polynomial fit, the R-squared is 0.74853, which is much better. So, using R-squared as my metric for ‘goodness of fit’ tells me to choose a 2nd-order polynomial. (Actually, I can improve the R-squared value to 0.81444 by selecting a 6th-order polynomial, but I have never seen an actual system whose underlying ‘physics’ was governed by a 6th-order polynomial, so I’ll stick with the 2nd-order one instead.)

Now let’s say that a really unexpected measurement comes along. I’ve made it by multiplying the last row of the spreadsheet by 10. So, the X-value is 90 and the Y-value is 24.61. It’s not necessarily an error or bogus data – it could be completely legitimate – It’s just way out-of-line with existing values.

What does adding this outlier do to the R-squared value? Well now the best-fit 2nd-order polynomial has an R-squared of 0.992! (So does the best-fit linear regression for that matter.)

So, adding an extreme outlier (which could be an erroneous measurement or could be totally legitimate) makes the R-squared value go way up, not down.

The funny thing about this whole exercise though, is that the method I described for making Y-values by dividing X by the previous Y-value more and more closely generates the square root of X as the step-size between X-values gets smaller and smaller.

The graph below shows the case for a step-size of 0.01 between X-values. Of course, the square root of X is simply X^0.5, which is a power-law relationship. So, it turns out that the best trendline option to use after all was the ‘Power’ fit, even though Excel didn’t offer that as an option when the point X=0, Y=0 was in the data set (even though the square root of 0 is, in fact, equal to 0).

There are a couple of lessons to be learned here:

One is that R-squared is not a measure of “goodness of fit” nor the confidence level of a regression. (We are certainly not 99+% confident in our 2nd-order polynomial fit.) In this article the author says of a given linear regression with an R-squared of 0.0525: “the statistical confidence measure (R squared) is 5.25% indicating that we are not confident that the regression is meaningful.” This is an appallingly bad misinterpretation of what R-squared means.

A second is to try to avoid using Microsoft products whenever possible. It’s baffling to me how, 25 years after it was introduced to the marketplace, Microsoft Excel still treats undefined numbers as being equal to zero and, in this case, that it claims that a power-law regression cannot contain X=0 as an X-value.

Overall, I’m inclined to think that calculating the R-squared metric usually does more harm than good and that most people should simply ignore it.

(If this sort of stuff about R-squared values and so forth interests you, be sure to learn about Anscombe’s Quartet, four number sets that have identical statistical properties even though they are very different from each other.)