FAQ: How can I compare two sets of OTUs?

See also
Quality control for OTUs

Suppose you want to compare two sets of OTUs, e.g. made by two different pipelines (say, QIIME and UPARSE) or a single pipeline with two different sets of parameters (say, with different filtering, length or abundance thresholds). How can you do this?

Compare OTUs for control samples
If you have control samples such as a mock community or a single strain, then you can assess the accuracy of the OTUs for the control samples.

Finding the common subset between two sets of OTUs
If you have two sets of OTUs X and Y, a simple question to ask is: which OTUs are (a) in both X and Y, (b) in X only, and (c) in Y only? Those that are in both X and Y are more likely to be correct. Those in X only or Y only are certainly errors: they are either false negatives in one set or false positives in the other set. You can find these subsets using the usearch_global command, like this:

usearch -usearch_global otusx.fa -db otusy.fa -id 0.97 -strand plus -matched x_and_y.fa -notmatched x_only.fa

usearch -usearch_global otusy.fa -db otusx.fa -id 0.97 -strand plus -notmatched y_only.fa
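
To compare the sizes of the three subsets, it is enough to count the FASTA header lines in each output file. The following is a minimal sketch, assuming the output file names used in the commands above; count_fasta is a hypothetical helper written for illustration, not a usearch command.

```shell
# count_fasta: hypothetical helper, not part of usearch.
# Prints "<file><TAB><number of sequences>" for each FASTA file given,
# counting lines that start with '>' (one header line per sequence).
count_fasta() {
    for f in "$@"; do
        if [ -f "$f" ]; then
            printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
        else
            printf '%s\tmissing\n' "$f"
        fi
    done
}

# Example, using the file names from the commands above:
#   count_fasta x_and_y.fa x_only.fa y_only.fa
```

A large x_only or y_only count relative to x_and_y suggests that the two sets disagree substantially.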

If the OTU sequences have different lengths, e.g. because you used different global trimming lengths, then the results are harder to interpret because it is possible for two longer sequences to be <97% identical to each other but have identical prefixes. Therefore, you may have two long OTUs (A, B) identical to one short OTU (C), and all three sequences could be correct.
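
A quick way to check whether the two sets were trimmed to different lengths is to look at the range of sequence lengths in each file. The following is a minimal sketch; fasta_len_range is a hypothetical helper written for illustration, not a usearch command. It prints the number of sequences, the shortest length and the longest length, separated by tabs.

```shell
# fasta_len_range: hypothetical helper, not part of usearch.
# Prints "<n sequences><TAB><min length><TAB><max length>" for one FASTA file.
# Sequences that are wrapped over several lines are handled.
fasta_len_range() {
    awk '
        function record() {
            n++
            if (n == 1 || len < min) min = len
            if (n == 1 || len > max) max = len
        }
        /^>/ { if (seen) record(); seen = 1; len = 0; next }
        { len += length($0) }
        END {
            if (seen) record()
            if (n) printf "%d\t%d\t%d\n", n, min, max
        }
    ' "$1"
}

# Example:
#   fasta_len_range otusx.fa
#   fasta_len_range otusy.fa
```

If the ranges differ between the two files, the length effect described above may explain some of the X-only and Y-only sequences.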

Chimeras
Most chimeras are derived from the most abundant sequences in the PCR reaction. As a check on which set of OTUs has more chimeras, take the OTUs with highest abundance and use them as a reference database for uchime2_ref with -mode specific. If you have the original reads, it is even better to use the most abundant unique sequences as the reference, e.g. take the top 20 most abundant:

usearch -fastx_uniques reads.fq -fastaout top_uniques.fa -topn 20 -sizeout

Then check both sets of OTUs:

usearch -uchime2_ref otusx.fa -db top_uniques.fa -strand plus -uchimeout otusx.uchime \
-uchimealnout otusx.uchimealns -mode specific

usearch -uchime2_ref otusy.fa -db top_uniques.fa -strand plus -uchimeout otusy.uchime \
-uchimealnout otusy.uchimealns -mode specific

Now you can review the output to see if there is a big difference between the number of chimeras in each set.
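
One way to summarize the two reports is to tally the classification column in each file. The following is a minimal sketch, assuming that the -uchimeout file is tab-separated and that the last field of each line holds the classification (for example Y for chimeric and N for not chimeric); check this against the file format documentation for your usearch version. tally_last_field is a hypothetical helper written for illustration, not a usearch command.

```shell
# tally_last_field: hypothetical helper, not part of usearch.
# Counts how often each value occurs in the last tab-separated field
# and prints "<value><TAB><count>", sorted by value.
# Assumption: the last field of the -uchimeout file is the classification.
tally_last_field() {
    awk -F '\t' '
        { count[$NF]++ }
        END { for (k in count) printf "%s\t%d\n", k, count[k] }
    ' "$1" | LC_ALL=C sort
}

# Example, using the file names from the commands above:
#   tally_last_field otusx.uchime
#   tally_last_field otusy.uchime
```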

Searching a large database (SILVA)
Searching a large database such as SILVA can give useful insights into the OTUs. Exact matches (100% identity) are very likely to be correct sequences. If one set has more exact matches, it is probably more sensitive, though keep in mind it may also have more spurious OTUs due to chimeras and read errors. Hits with at least 97% but less than 100% identity are more difficult to interpret: they could be correct, but they could also contain errors. An OTU with <3% errors is relatively harmless because sequences in a 97% OTU are allowed to vary by this much. If an OTU has >3% errors then it is harmful because the correct sequence is probably in a different OTU, so this OTU is completely spurious and inflates the apparent diversity. Unfortunately, even large databases such as SILVA contain only a small subset of extant species, so a hit below 100% may be a correct sequence from a species that is not in the database. This makes hits that are <100% difficult to interpret.
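
To count exact and non-exact matches, the search results can be written as a tabbed file and summarized per OTU. The following is a minimal sketch, assuming the hits were written with -userout and -userfields query+target+id so that the third column is the percent identity. The database file name silva.udb, the output file names and the 97% threshold are placeholders chosen for illustration, and bucket_hits is a hypothetical helper, not a usearch command.

```shell
# Assumed search command (not run here; silva.udb is a placeholder name):
#   usearch -usearch_global otusx.fa -db silva.udb -id 0.97 -strand both \
#     -userout otusx_silva.txt -userfields query+target+id

# bucket_hits: hypothetical helper, not part of usearch.
# Input: tab-separated lines of query, target, percent identity.
# Keeps the best identity seen for each query, then counts queries
# whose best hit is 100% and queries whose best hit is below 100%.
bucket_hits() {
    awk -F '\t' '
        {
            id = $3 + 0
            if (!($1 in best) || id > best[$1]) best[$1] = id
        }
        END {
            for (q in best) {
                if (best[q] >= 100) exact++
                else below++
            }
            printf "exact\t%d\nbelow_100\t%d\n", exact, below
        }
    ' "$1"
}

# Example:
#   bucket_hits otusx_silva.txt
#   bucket_hits otusy_silva.txt
```

OTUs with no hit at or above the identity threshold do not appear in the hits file, so they are not counted here; they can be collected separately with the -notmatched option.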