See also
Machine learning
K-fold cross-validation
otutab_forest_kfold command
otutab_select command
Random forest parameter file
An OTU is informative if its count or frequency can be used effectively in a rule that sorts samples into a given set of metadata categories, such as healthy / sick or day / night.
Usually, an OTU is informative because it has a higher frequency in one category than in the other categories, as shown in the figure below. Such cases can be found using the otutab_select command.
With a random forest classifier, a so-called importance value in the range 0 (not informative) to 1 (maximally informative) is calculated for each OTU. Random forests can discover more complicated rules than the simple frequency sort assumed by otutab_select. To extract the OTU importance values from a random forest parameter file and sort them in order of decreasing importance, you can use:
grep -w "^varimp" forest.txt | cut -f3,5 | sort -rgk2
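If only the top few OTUs are of interest, the pipeline above can be wrapped in a small shell function. This is a minimal sketch, not part of the tool itself: the function name top_otus and its arguments are hypothetical, and it relies on the same layout the command above assumes (tab-separated lines starting with "varimp", with the OTU name in field 3 and the importance in field 5). Like the command above, it uses the -g (general numeric) sort option, which is available in GNU sort.

```shell
# Hypothetical helper: print the n most important OTUs from a
# random forest parameter file, most important first.
# Assumes tab-separated "varimp" lines with the OTU name in
# field 3 and the importance value in field 5.
top_otus() {
    forest_file=$1
    n=${2:-10}    # number of OTUs to report; defaults to 10

    # Keep only varimp records and emit "name<TAB>importance".
    awk -F'\t' '$1 == "varimp" { print $3 "\t" $5 }' "$forest_file" |
        # Sort on the importance column, numerically, in decreasing order.
        sort -t "$(printf '\t')" -k2,2gr |
        head -n "$n"
}
```

For example, top_otus forest.txt 20 would list the 20 OTUs with the highest importance values in forest.txt.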
If an OTU is found to be informative by a random forest classifier but not by the otutab_select command, this implies that the rules incorporating this OTU are more complicated than the typical form "if count is high, sample is in category A, otherwise in a different category". For example, a rule may depend on the counts of two or more OTUs in combination.