This Q&A first appeared in Data Science Briefings, the DataMiningApps newsletter. Want to submit your own question? Just tweet us @DataMiningApps. Want to remain anonymous? Then send us a direct message and we’ll keep all your details private. Subscribe now for free if you want to be the first to receive our articles and stay up to date on data science news, or follow us @DataMiningApps.
You asked: Is it really necessary to split a data set into training and validation when building a random forest model since each tree built uses a random sample (with replacement) of the training dataset? If it is necessary, why?
Good question! Indeed, one can argue that a validation set might not be necessary in this case, as random forests reduce overfitting by construction: each tree is trained on a bootstrap sample, and the predictions of the trees are then combined. Even if you want an estimate of the model’s ability to generalize, a commonly used approach is the out-of-bag error estimate: for each training sample, predict using only the trees whose bootstrap sample did not contain it, then average the prediction error over all training samples.
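To make the out-of-bag mechanism concrete, here is a minimal sketch using only the Python standard library. It is not a full random forest: it bags one-feature decision stumps and skips the per-split feature subsampling, and the names `fit_stump` and `oob_error` as well as the synthetic data are assumptions made purely for illustration.

```python
import random

def fit_stump(xs, ys):
    """Pick the threshold on a 1-D feature that minimises training errors.

    The stump predicts class 1 when x > threshold and class 0 otherwise.
    """
    best_threshold, best_errors = None, None
    for t in sorted(set(xs)):
        errors = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold

def oob_error(xs, ys, n_trees=50, seed=0):
    """Out-of-bag misclassification rate of a bagged ensemble of stumps."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [[0, 0] for _ in range(n)]  # votes[i][c]: OOB votes for class c
    for _ in range(n_trees):
        # Bootstrap: draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = set(idx)
        t = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        # Only samples this stump never saw receive a vote from it.
        for i in range(n):
            if i not in in_bag:
                votes[i][int(xs[i] > t)] += 1
    # Score only samples that were out-of-bag for at least one stump.
    scored = [i for i in range(n) if sum(votes[i]) > 0]
    wrong = sum((votes[i][1] > votes[i][0]) != bool(ys[i]) for i in scored)
    return wrong / len(scored)

# Hypothetical toy data: the true boundary is x = 0.5, with 10% label noise.
data_rng = random.Random(42)
xs = [data_rng.random() for _ in range(200)]
ys = [int(x > 0.5) ^ int(data_rng.random() < 0.1) for x in xs]

err = oob_error(xs, ys)
print(f"OOB error estimate: {err:.3f}")
```

Each sample is left out of roughly a third of the bootstrap samples (about 36.8% for large data sets), so nearly every training sample ends up with enough out-of-bag votes to be scored.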
As such, the requirement to construct a validation set for an unbiased evaluation is less strict for random forests than for most other techniques, which highlights the strong generalization properties of the technique. That said, it is still a good idea to construct a validation set, particularly in the following cases:
- You’re using out-of-bag (OOB) error estimates to optimize a hyperparameter, such as the number of trees or the number of features to consider per split. Just as when you use cross-validation to determine an optimal parameter setting, the estimate for the winning setting is optimistic because it was used to make the selection, so you’d still want to test the final model on a completely unseen test set to get an unbiased estimate.
- You’re comparing a random forest model to other techniques, in which case you’d already have to perform cross-validation or keep a validation set around to perform a fair comparison.
- It’s simply good practice. Especially if you have enough instances, it’s a good idea to keep a clean validation set around to improve your confidence in the model’s ability to generalize. If you only have very few instances, it is good to know that you could avoid a separate validation set when using random forests, though the technique is likely to overfit on such small data sets anyway. In those cases, it’s generally better to stick to simple, linear models.
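The first two cases above can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the data set is synthetic, the candidate values for `max_features` are arbitrary, and the logistic regression baseline is just one possible model to compare against.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)

# Set aside a test set that plays no role in tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Use the OOB accuracy on the training data to pick max_features.
best_score, best_model = -1.0, None
for max_features in (2, 4, 8, 16):
    model = RandomForestClassifier(
        n_estimators=200,
        max_features=max_features,
        oob_score=True,
        random_state=0,
    )
    model.fit(X_train, y_train)
    if model.oob_score_ > best_score:
        best_score, best_model = model.oob_score_, model

# The selected OOB score was used for selection, so report the
# held-out accuracy as the final estimate.
test_accuracy = best_model.score(X_test, y_test)

# A different technique is compared on the same held-out data.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

print(f"Selected max_features: {best_model.max_features}")
print(f"OOB accuracy of selected model: {best_score:.3f}")
print(f"Random forest test accuracy: {test_accuracy:.3f}")
print(f"Logistic regression test accuracy: {baseline_accuracy:.3f}")
```

The OOB accuracy guides the choice of `max_features`, while the held-out test set is touched only once at the end, both for the final estimate and for the comparison with the other model.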