Abundance rarefaction
See also
Rarefaction
alpha_div_rare command
Suppose you observe K different species, finding N_k individuals in the k-th species. The sum N = N_1 + N_2 + ... + N_K is the total number of observations. To calculate the rarefaction curve for (number of observed species) vs. (number of observations), we take random subsets with n = 1, 2, ... N observations and count the number of species S(n) found in each subset. (More accurately, we should average over many different random subsets at each size n until the mean converges.) Some species will disappear with fewer observations, and the curve will approach zero species as n approaches zero. It is not actually necessary to implement subsampling and averaging, because S(n) can be calculated using this formula:
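In the notation above, the standard closed form for drawing n observations without replacement (assumed here to be the formula intended) is:

\[ S(n) = K - \binom{N}{n}^{-1} \sum_{k=1}^{K} \binom{N - N_k}{n} \]

Each term \binom{N - N_k}{n} / \binom{N}{n} is the probability that species k is absent from a random subset of size n, so S(n) is the expected number of species present. The following Python sketch evaluates this expression and compares it with brute-force subsampling. The function names and the example counts are illustrative assumptions, not part of any existing tool.

```python
from math import comb
import random


def expected_species(counts, n):
    """Expected number of species in a random subset of n observations.

    counts[k] is N_k, the number of individuals observed for species k.
    """
    N = sum(counts)
    if not 0 <= n <= N:
        raise ValueError("n must be between 0 and N")
    total = comb(N, n)
    # comb(N - Nk, n) / total is the probability that species k is absent;
    # comb returns 0 when n > N - Nk, so the species is then always present.
    return sum(1 - comb(N - Nk, n) / total for Nk in counts)


def subsampled_species(counts, n, trials=2000, seed=1):
    """Monte Carlo estimate of S(n): average species count over random subsets."""
    rng = random.Random(seed)
    # One label per observation, e.g. [0, 0, 0, 1, 1, 2] for counts [3, 2, 1].
    labels = [k for k, Nk in enumerate(counts) for _ in range(Nk)]
    total = 0
    for _ in range(trials):
        total += len(set(rng.sample(labels, n)))
    return total / trials


if __name__ == "__main__":
    counts = [5, 3, 1, 1]  # hypothetical abundances, N = 10, K = 4
    for n in range(0, sum(counts) + 1):
        print(n, round(expected_species(counts, n), 3),
              round(subsampled_species(counts, n), 3))
```

At n = N the function returns K, at n = 1 it returns 1, and at n = 0 it returns 0, matching the behavior of the curve described above.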
My colleague Henrik Flyvbjerg and I have developed a modified formula for S(n) that applies when singletons are discarded; let me know if you are interested and I will send you the details.
Rarefaction for OTUs
If species with exactly one observation are ignored, then the above formula does not apply. Thus, if singleton reads are discarded, as recommended in the UPARSE pipeline, you cannot use standard rarefaction software, and the above formula does not give the correct result. If singletons are retained, then the formula is a reasonable approximation, but it is not exactly correct because de novo OTUs are necessarily "unstable", meaning that a given pair of sequences may belong to the same OTU in one subset but to different OTUs in another subset.
With de novo OTUs, including those made by UPARSE, rarefaction curves must strictly speaking be generated by running the pipeline from scratch on each random subsample and noting the number of OTUs obtained.
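The procedure can be sketched in Python as below. This is only an outline under stated assumptions: `rarefy_by_rerunning` and its `count_otus` argument are hypothetical names, and `toy_count_otus` is a stand-in that treats each distinct sequence as an OTU after discarding singletons. A real analysis would replace it with a function that runs the complete OTU pipeline on the subsample and returns the number of OTUs.

```python
import random
from collections import Counter


def rarefy_by_rerunning(reads, sizes, count_otus, trials=10, seed=1):
    """Mean OTU count at each subsample size, re-running the pipeline each time.

    count_otus is any callable that takes a list of reads and returns the
    number of OTUs found (hypothetical interface, assumed for this sketch).
    """
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        # Each trial draws a fresh random subset without replacement and
        # processes it from scratch, so OTU boundaries may differ per subset.
        results = [count_otus(rng.sample(reads, n)) for _ in range(trials)]
        curve[n] = sum(results) / trials
    return curve


def toy_count_otus(reads):
    """Stand-in for a full pipeline: distinct sequences seen at least twice."""
    abundance = Counter(reads)
    return sum(1 for c in abundance.values() if c >= 2)


if __name__ == "__main__":
    reads = ["A"] * 5 + ["B"] * 3 + ["C"]  # hypothetical reads
    print(rarefy_by_rerunning(reads, [1, 3, 6, 9], toy_count_otus))
```

Even this toy shows why the closed-form expression fails once singletons are discarded: a subset of one read always yields zero OTUs, whereas the standard formula gives S(1) = 1.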