Explanatory variables with missing data: can (ordination of) pairwise distances be substituted?


Hi all. I have a really patchy dataset from some early work, and had a kind of weird brainstorm about how to get around the problem of missing observations (at least, for my particular application). Please have a look and let me know if I’m completely off-base. Happy to provide code/data if it would help. It makes sense in my brain…



I would be hesitant to devise a “new” method for handling missing data, tailored specifically to your problem, before trying established techniques. As a first step, I suggest standard data imputation. R has packages for this (e.g. mice), as discussed here:

Imputation fills the gaps in your dataset with plausible values drawn at random under specific assumptions (e.g. multivariate normality, values missing at random). You could construct several imputed datasets this way, run the same analysis on each one, and then compare the results to check that imputation isn’t giving you answers that are all over the map. (Unfortunately, that is a real possibility given the large number of missing values you have.)
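To make the "impute several times, analyse each, compare" idea concrete, here is a minimal sketch in plain Python. It is not what mice does internally: the fill-in step is a simple random draw from the observed values (a hot-deck fill), chosen only to keep the example self-contained. The `trait` values and the function names are made up for illustration.

```python
import random
import statistics

def impute_once(values, rng):
    """Fill each None with a random draw from the observed values.

    Assumption: a simple hot-deck fill, standing in for a real imputation model.
    """
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]

def multiple_imputation_means(values, m=20, seed=1):
    """Build m imputed datasets and run the same analysis (here, the mean) on each."""
    rng = random.Random(seed)
    return [statistics.mean(impute_once(values, rng)) for _ in range(m)]

# Hypothetical trait measurements; None marks a missing observation.
trait = [4.1, None, 3.8, 5.0, None, 4.4, None, 3.9, 4.7, None]

estimates = multiple_imputation_means(trait)
spread = statistics.stdev(estimates)

print(f"mean of the {len(estimates)} estimates: {statistics.mean(estimates):.2f}")
print(f"spread across imputed datasets: {spread:.2f}")
```

If the spread across imputed datasets is large relative to the effect you care about, that is the "all over the map" warning sign: the conclusions depend more on the filled-in values than on the observed ones.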




Thanks Drew, I had a look at the link and will be testing that approach over the next few weeks…



I was also going to recommend imputation using one of the available R packages. In my experience, the MICE imputation protocol takes a little more effort to get working, and I found that it did OK at imputing values when the relationships were linear. I’ve had much more success imputing missing trait values with the package missForest. It is very easy to apply, is robust, and recovered non-linear relationships easily. So, I’d strongly recommend doing exploratory data analysis to see what kind of relationships you are recovering.
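One way to run that kind of check, sketched in plain Python: hide some values you actually know, impute them, and score the imputations against the truth. Neither mice nor missForest is used here. A straight-line fit stands in for a linear imputation model and a nearest-neighbour lookup stands in for a flexible non-parametric one. The data are simulated and every name is hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def nearest_neighbour(x, xs, ys):
    """Return the y of the training point whose x is closest to the query x."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
    return ys[i]

def rmse(pred, truth):
    """Root-mean-square error between imputed and true values."""
    return (sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)) ** 0.5

rng = random.Random(42)

# Simulated complete data with a curved relationship: y = x^2 plus a little noise.
xs = [rng.uniform(-3, 3) for _ in range(200)]
ys = [x * x + rng.gauss(0, 0.1) for x in xs]

# Hide 25% of the y values, keeping the truth so the imputations can be scored.
hidden = set(rng.sample(range(len(xs)), 50))
train_x = [xs[i] for i in range(len(xs)) if i not in hidden]
train_y = [ys[i] for i in range(len(xs)) if i not in hidden]
test_x = [xs[i] for i in hidden]
truth = [ys[i] for i in hidden]

a, b = fit_line(train_x, train_y)
linear_rmse = rmse([a + b * x for x in test_x], truth)
nn_rmse = rmse([nearest_neighbour(x, train_x, train_y) for x in test_x], truth)

print(f"linear imputation RMSE:            {linear_rmse:.2f}")
print(f"nearest-neighbour imputation RMSE: {nn_rmse:.2f}")
```

On a curved relationship like this one the straight-line fill does much worse than the local one. Running the same masking test on the complete rows of a real dataset shows whether a given imputation method is recovering the relationships that are actually there.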


Peter Wilson


Thanks Peter. I love random forests…