Friday, September 9, 2016

Spreadsheet Errors

Five years ago I wrote a post titled "Beware of Econometricians Bearing Spreadsheets".

The take-away message from that post was simple: there's considerable, well-documented evidence that spreadsheets are very, very dangerous when it comes to statistical calculations. That is, if you care about getting the right answers!

Read that post, and the associated references, and you'll see what I mean.

(You might also ask yourself, why would I pay big bucks for commercial software that is of questionable quality when I can use high-quality statistical software such as R, for free?)

This week, a piece in The Economist looks at the shocking record of publications in genomics that fall prey to spreadsheet errors. It's a sorry tale, to be sure. I strongly recommend that you take a look.

Yes, any software can be misused. Anyone can make a mistake. We all know that. However, it's not a good situation when a careful and well-informed researcher ends up making blunders just because the software they trust simply isn't up to snuff!


© 2016, David E. Giles

Friday, September 2, 2016

Dummies with Standardized Data

Recently, I received the following interesting email request:
"I would like to have your assistance regarding a few questions related to regression with standardized variables and a set of dummy variables. First of all, if the variables are standardized (xi-x_bar)/sigma, can I still run the regression with a constant? And, if my dummy variables have 4 categories, do I include all of them without the constant? Or just three and keep the constant in the regression? And, how do we interpret the coefficients of the dummy variables in such a case? I mean, does the conventional interpretation in a single OLS regression still apply?"

Here's my (brief) email response:
"If all of the variables (including the dependent variable) have been standardized then in general there is no need to include an intercept - in fact the OLS estimate of its coefficient will be zero (as it should be).
However, if you have (say) 4 categories in the data that you want to allow for with dummy variables, then the usual results apply:
1. You can include all 4 dummies (but no intercept). With standardized data, the estimated coefficients on the dummies, each weighted by the number of observations in its category, will sum to zero (so the simple sum is zero when the categories are of equal size). Each separate coefficient gives you the deviation from zero for the intercept in each category.
OR (equivalently)
2. You can include an intercept and any 3 of the dummies. The same restriction holds in this parameterization: the estimated intercept equals minus the sum of the estimated dummy coefficients, each weighted by its category's share of the sample. Suppose that you include the intercept and the dummies D2, D3, and D4. The estimated coefficient of the intercept gives you the intercept effect for category 1. The estimated coefficient for D2 gives you the deviation of the intercept for category 2, from that for category 1, etc."
You can easily verify this by fitting a few OLS regressions, and there's a lot more about regression analysis with standardized data in this earlier post of mine.
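
Here is one way that such a check might look. It is only a minimal sketch in Python on simulated data: the category sizes, the coefficient values, and the helper names `standardize`, `dot`, and `ols` are assumptions made purely for illustration, and the categories are deliberately of unequal size. It uses nothing beyond the standard library and solves the normal equations directly; in practice you would use your preferred regression package.

```python
import random

def standardize(v):
    """Return (v - mean) / sd, using the sample standard deviation."""
    n = len(v)
    m = sum(v) / n
    sd = (sum((a - m) ** 2 for a in v) / (n - 1)) ** 0.5
    return [(a - m) / sd for a in v]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def ols(columns, y):
    """OLS coefficients from the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    k = len(columns)
    # Augmented matrix [X'X | X'y]
    A = [[dot(ci, cj) for cj in columns] + [dot(ci, y)] for ci in columns]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Simulated data (all values here are illustrative assumptions).
random.seed(42)
counts = [80, 60, 40, 20]                      # unequal category sizes
cat = [j for j, n_j in enumerate(counts) for _ in range(n_j)]
n = len(cat)
effects = [1.0, -0.5, 2.0, 0.3]                # category intercepts
x_raw = [random.gauss(0, 1) for _ in range(n)]
y_raw = [effects[c] + 0.8 * xv + random.gauss(0, 1)
         for c, xv in zip(cat, x_raw)]

# Standardize y and x; the dummies are left as 0/1 variables.
x, y = standardize(x_raw), standardize(y_raw)
D = [[1.0 if c == j else 0.0 for c in cat] for j in range(4)]
ones = [1.0] * n

# (0) No dummies: intercept and standardized x.
a0, b0 = ols([ones, x], y)
# (1) All four dummies, no intercept.
b1, *d = ols([x] + D, y)
# (2) Intercept plus D2, D3, D4.
a, b2, *g = ols([ones, x] + D[1:], y)

weighted_sum = sum(n_j * d_j for n_j, d_j in zip(counts, d))
print("Intercept with no dummies:", round(a0, 10))
print("Count-weighted sum of the 4 dummy coefficients:", round(weighted_sum, 10))
print("Simple sum of the 4 dummy coefficients:", round(sum(d), 6))
print("Intercept in (2) vs. D1 coefficient in (1):", round(a, 6), round(d[0], 6))
for j in range(1, 4):
    print(f"a + g{j + 1} vs. d{j + 1}:", round(a + g[j - 1], 6), round(d[j], 6))
```

With unequal category sizes, it is the count-weighted sum of the four dummy coefficients that comes out as zero (up to rounding error); the simple sum does so only when the categories are of equal size. The two dummy specifications give the same slope estimate, and their coefficients map into one another exactly as described in the email response, because each is just a reparameterization of the other.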

