USEARCH manual

UNBIAS algorithm

See also
unbias command
UNBIAS paper
Abundance bias

The UNBIAS algorithm attempts to adjust an OTU table to correct for the two sources of abundance bias I believe to be most important in practice: 16S copy number and primer mismatches. This requires predicting the copy number and mismatch number for each OTU sequence, then adjusting the read counts accordingly.

Abundance bias distorts diversity metrics
The diversity in a single sample is commonly measured using alpha diversity metrics such as the Shannon index and the Chao estimator, while the variation between pairs of samples is measured using a beta diversity metric such as the Jaccard distance or Bray-Curtis dissimilarity. Many such metrics, including Shannon, Chao, Jaccard, and Bray-Curtis, are calculated from estimated species frequencies. The correlation between read abundance and species abundance is very low, so species frequencies cannot be reliably estimated from marker gene reads. Traditional diversity estimates based on species frequencies are therefore invalid or difficult to interpret.
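As a small illustration of how a frequency-based metric responds to biased counts, the sketch below computes the Shannon index for a hypothetical two-species sample, first from true abundances and then from read counts in which one species is under-amplified tenfold. The counts are invented for illustration only, and the function name shannon_index is an assumption, not part of USEARCH.

```python
import math

def shannon_index(counts):
    """Shannon index H = -sum(p * ln p) over the non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Hypothetical sample: two species that are equally abundant in reality.
true_counts = [100, 100]
# Hypothetical read counts if the second species is under-amplified tenfold.
biased_counts = [100, 10]

h_true = shannon_index(true_counts)      # equals ln(2) for two equal species
h_biased = shannon_index(biased_counts)  # lower, because the sample looks uneven
```

The biased sample appears less even than it really is, so the index computed from reads understates the diversity of the underlying community.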

Predicting copy number and primer mismatches
Copy number and primer mismatches are predicted by the SINAPS algorithm, which is based on essentially the same algorithm as SINTAX. The top hit in a reference database is identified using k-mer similarity, and confidence is estimated by bootstrapping. In each bootstrap iteration, a subset of k-mers is selected and used to find the top hit, and the trait of interest (here, copy number or primer mismatches) is taken from the annotation of that reference sequence. The trait with the highest bootstrap frequency is reported as the prediction, and the frequency with which it occurred is reported as the bootstrap confidence. UNBIAS requires a prediction for every OTU, so the bootstrap confidence is ignored. In this case, SINAPS is effectively equivalent to finding the top database hit using the USEARCH algorithm.
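The bootstrap procedure described above can be sketched roughly as follows. This is a toy illustration, not the SINAPS implementation: the function names, the k-mer length, the subset size, the number of iterations, and the tie-breaking rule (first reference in the list wins) are all assumptions chosen for the example.

```python
import random
from collections import Counter

def kmers(seq, k):
    """Return the set of k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def predict_trait(query, references, k=8, iterations=100, subset_size=32, seed=0):
    """Predict a trait for query by bootstrapped k-mer top-hit voting.

    references is a list of (sequence, trait) pairs, where trait is the
    annotated value (e.g. copy number or number of primer mismatches).
    Returns (predicted_trait, bootstrap_confidence).
    """
    rng = random.Random(seed)
    query_kmers = sorted(kmers(query, k))
    if not query_kmers or not references:
        raise ValueError("need a query of length >= k and at least one reference")
    ref_index = [(kmers(seq, k), trait) for seq, trait in references]

    votes = Counter()
    for _ in range(iterations):
        # Select a random subset of the query k-mers for this iteration.
        subset = rng.sample(query_kmers, min(subset_size, len(query_kmers)))
        # Top hit: the reference sharing the most k-mers with the subset.
        top_kmers, top_trait = max(
            ref_index, key=lambda ref: sum(km in ref[0] for km in subset)
        )
        votes[top_trait] += 1

    trait, count = votes.most_common(1)[0]
    return trait, count / iterations
```

A caller that needs a prediction for every OTU, as UNBIAS does, would simply keep the returned trait and discard the confidence value.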

Copy number correction
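If the predicted 16S copy number is C, the read count is multiplied by 4/C because the mean 16S copy number is approximately four. An OTU with the mean copy number is therefore left unchanged, while counts for OTUs with more copies are scaled down and counts for OTUs with fewer copies are scaled up.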
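A minimal sketch of this scaling, assuming a hypothetical helper named correct_copy_number (not a USEARCH function) and the mean copy number of four stated above:

```python
def correct_copy_number(read_count, copy_number, mean_copy_number=4.0):
    """Scale a read count by mean_copy_number / copy_number."""
    if copy_number <= 0:
        raise ValueError("copy_number must be positive")
    return read_count * mean_copy_number / copy_number

# An OTU with 100 reads and a predicted copy number of 8 is halved;
# the same count with a predicted copy number of 2 is doubled.
high_copy = correct_copy_number(100, 8)
low_copy = correct_copy_number(100, 2)
```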

Primer mismatch correction
If the predicted number of primer mismatches is m, the read count is multiplied by 10^m; i.e., an order of magnitude loss in efficiency is assumed for each mismatch. Using 10 as a base is a rather arbitrary choice that probably does not work very well in practice because the true efficiency loss depends on several factors which are unknown or hard to predict. For example, the loss will depend on the mismatch position in the oligonucleotide (mismatches close to the 3' end give higher losses) and will tend to be greater if more rounds of PCR are used.
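The mismatch factor can be sketched in the same way. The helper name correct_primer_mismatches is hypothetical, and the base is exposed as a parameter only to reflect the point that 10 is an arbitrary choice; no other value is being recommended here.

```python
def correct_primer_mismatches(read_count, mismatches, base=10.0):
    """Scale a read count by base ** mismatches."""
    if mismatches < 0:
        raise ValueError("mismatches must be non-negative")
    return read_count * base ** mismatches

# An OTU with 5 reads and 2 predicted mismatches is scaled by 10^2.
adjusted = correct_primer_mismatches(5, 2)
# With no mismatches the count is unchanged.
unchanged = correct_primer_mismatches(5, 0)
```

Because the factor grows exponentially with m, a single wrong mismatch prediction changes the adjusted count by a full order of magnitude, which is one reason the correction is sensitive to prediction errors.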

Accuracy in practice
UNBIAS achieves a substantial improvement on mock community tests when known values for copy numbers and primer mismatches are used. This confirms that these biases are significant in practice. However, UNBIAS is less successful when reference sequences have 97% identity or less to the OTU sequences, as will often be the case. Thus, UNBIAS is not a full solution to the problem of abundance bias.