Collinearity and zero-inflated data problems


Hi all,

I have a simple data set with some very frustrating distributions. We’re simply comparing the morphology of two groups of grasshoppers. A GLMM kept giving us results that were impossible to interpret and appeared to be type I errors, so we opted for some simple non-parametric tests instead.

In short, the reviewers have accepted the paper but don’t like the simple stats and would like us to further justify why a model isn’t possible (I’ve copied the quote below). I’m a bit lost and not sure how to demonstrate convincingly why the distribution problems and collinearity are proving difficult to deal with.

Is anyone able to help out and have a chat about what information I could present to demonstrate in the paper that the model wasn’t working? OR maybe I am missing something entirely and a model will work after all…

Thanks all!

P.S. Apologies if this is confusing, it’s not my data set so I’m a bit lost too!

“Firstly, in a normal procedure of exploratory analysis, as the authors know, high levels of co-linearity between explanatory variables in a model can be avoided by manually checking correlation between all the pairs of the variables and omitting one variable of highly correlated pair. If the high levels of co-linearity could not be excluded even after this procedure, please explain so. Secondly, did zero-inflation of ‘explanatory’ variable in the GLM model really result in apparent type I error? Of course, zero-inflation in ‘response’ variable often causes this type of problem, but this is not the case. Please check this again”


Not sure without knowing more/seeing the dataset, but would it be useful to get some PCA axes from your dataset (since these are uncorrelated with each other by construction) and regress on those? Just a completely naïve thought =p Or you could try nonlinear regression modelling, e.g. random forests of conditional inference trees, which are apparently pretty good at picking informative variables and coping with collinearity… I used functions from package {party}, which were rather good.

At the risk of suggesting the obvious or guiding you into a swamp of reading, have you looked at relevant articles on r-bloggers?

But I may have completely the wrong end of the stick here as I haven’t had much experience dealing with collinearity.
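To make the PCA idea above a bit more concrete, here is a minimal sketch in plain Python. It only covers the two-variable special case, where the principal axes of two standardized variables are simply their scaled sum and difference; for a real data set you would use a proper PCA routine (for example `prcomp` in R). The variable names and measurements are hypothetical, invented purely for illustration.

```python
import math


def standardize(values):
    """Centre to mean 0 and scale to unit sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def correlation(x, y):
    """Pearson correlation, computed from standardized values."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


def pca_two_variables(x, y):
    """PCA scores for two standardized variables.

    Their correlation matrix [[1, r], [r, 1]] always has eigenvectors
    (1, 1)/sqrt(2) and (1, -1)/sqrt(2), with variances 1 + r and 1 - r.
    """
    zx, zy = standardize(x), standardize(y)
    root2 = math.sqrt(2)
    pc_sum = [(a + b) / root2 for a, b in zip(zx, zy)]
    pc_diff = [(a - b) / root2 for a, b in zip(zx, zy)]
    return pc_sum, pc_diff


# Hypothetical, strongly correlated measurements (not real data).
femur = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
tibia = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9]

pc_sum, pc_diff = pca_two_variables(femur, tibia)
print("r(femur, tibia):", correlation(femur, tibia))
print("r(pc_sum, pc_diff):", correlation(pc_sum, pc_diff))
```

The two score vectors could then go into the model in place of the original pair. The trade-off is interpretability: a coefficient on a PCA axis describes a combination of traits rather than a single measurement.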


Hey there,

would be great to know what distribution you got exactly. Have a look here maybe:

Selecting important features could be a good way. I agree that PCA might be useful to get rid of some unimportant features and reduce collinearity.

I’m on holiday right now and will have another look at what the reviewers wrote next week, assuming you don’t want to redo your analysis :grinning:

So long,



If I’m understanding this, the reason you think you have type 1 error from the GLMM is that it’s counting highly-correlated variables as if they were independent sources of evidence?

In that case the reviewer is right: you should be able to cut variables down until the ones you have left are substantially uncorrelated, and the excess type 1 errors should then go away. Or, as Michael describes, you can condense variables via PCA or similar multivariate methods.
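A rough sketch of that pairwise screening, in plain Python, in case it helps to show the reviewers what was tried. The 0.7 cut-off is an assumption on my part rather than anything from the paper, and the variable names and values are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def screen_collinear(columns, threshold=0.7):
    """Keep a variable only if |r| <= threshold against every variable already kept.

    Returns (kept, dropped); dropped holds (dropped_name, kept_name, r) tuples.
    """
    kept, dropped = [], []
    for name, values in columns.items():
        for other in kept:
            r = pearson(values, columns[other])
            if abs(r) > threshold:
                dropped.append((name, other, round(r, 3)))
                break
        else:
            # No clash with any retained variable, so keep this one.
            kept.append(name)
    return kept, dropped


# Hypothetical measurements, invented purely for illustration.
morphology = {
    "femur_length": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "tibia_length": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9],
    "wing_score": [2.0, 1.0, 3.0, 3.0, 1.0, 2.0],
}

kept, dropped = screen_collinear(morphology)
print("kept:", kept)
print("dropped:", dropped)
```

Note that this greedy version keeps whichever variable of a correlated pair comes first, so the result depends on column order. In practice you would choose which one to drop on biological grounds, and reporting the correlation matrix before and after is probably the most direct way to show whether collinearity really persists.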

And then you’re also saying that some of the variables are known to be zero-inflated? – and this couldn’t be dealt with in the GLMM?

Are you familiar with mvabund? It was designed for species abundance data, but its key features include dealing with the supposed difficulties of zero inflation. I haven’t used it myself, but papers explaining how it deals with different distributions, and I believe some helpful YouTube videos, are available on David Warton’s website.
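On the zero-inflation question, one simple thing to report is how many zeros a variable has compared with what a plain Poisson distribution with the same mean would predict. The sketch below only makes sense if the variable is a count, which is an assumption on my part, and the `spine_counts` values are hypothetical.

```python
import math


def zero_excess(counts):
    """Return (observed, expected) proportions of zeros for count data.

    The expected value is P(X = 0) = exp(-mean) for a Poisson distribution
    whose mean equals the sample mean.
    """
    n = len(counts)
    mean = sum(counts) / n
    observed = sum(1 for c in counts if c == 0) / n
    expected = math.exp(-mean)
    return observed, expected


# Hypothetical counts, invented purely for illustration.
spine_counts = [0, 0, 0, 0, 0, 0, 3, 4, 5, 4]

observed, expected = zero_excess(spine_counts)
print(f"observed zeros: {observed:.2f}, Poisson expectation: {expected:.2f}")
```

If the observed proportion sits well above the expectation, that supports calling the variable zero-inflated. The comparison is, if anything, conservative, because the extra zeros pull the sample mean down and so push the Poisson expectation up.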