USEARCH manual

Should mapping reads to OTUs consider OTU abundance?

A read sequence can match two or more OTUs with >=97% identity. It has been suggested that the read should be assigned to the OTU with the highest abundance rather than the highest identity. I disagree: abundance is a good signal that a sequence is correct, but identity is a better signal that two sequences are from the same species. This is explained in detail below.

(Even better is to use denoising, because some ambiguity is inevitable with 97% OTUs: you cannot avoid the problem that different species or strains will sometimes be lumped into the same OTU.)

The cluster_otus command does not assign input sequences to OTUs. The input is assumed to include correct sequences, chimeras and bad reads with sequencing errors. If a sequence (S) matches an existing OTU with >=97% identity, it is discarded. S could be a chimera, a read with errors, or a correct biological sequence. It is very challenging to distinguish between these three possibilities, but it does not matter which one applies if the goal is to make a set of 97% OTU sequences which are correct biological sequences that cover the reads. In this step, considering abundance is very important because correct sequences almost always have higher abundance than bad reads or chimeras from the same template.
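The role of abundance in this step can be illustrated with a small Python sketch. This is a simplified illustration and not the cluster_otus implementation: it assumes sequences are considered in order of decreasing abundance, it measures identity position by position on equal-length sequences instead of by alignment, and it omits chimera detection entirely. The names identity and pick_otus are hypothetical.

```python
def identity(x, y):
    """Fraction of matching positions (assumes equal-length sequences)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def pick_otus(uniques, threshold=0.97):
    """Greedy OTU picking over (sequence, abundance) pairs.

    Assumption: the most abundant sequences are considered first, so
    they become OTU sequences before their low-abundance error variants.
    """
    otus = []
    for seq, _abundance in sorted(uniques, key=lambda u: u[1], reverse=True):
        if any(identity(seq, otu) >= threshold for otu in otus):
            continue  # covered by an existing OTU: discarded, not assigned
        otus.append(seq)
    return otus


a = "ACGT" * 25        # abundant sequence, 100 nt
a_err = "T" + a[1:]    # one substitution, 99% identical to a
c = "TTGCA" * 20       # unrelated sequence, 100 nt

uniques = [(a_err, 3), (c, 50), (a, 1000)]
otus = pick_otus(uniques)
```

In this toy input, a_err is discarded because it is within 97% of a, which was considered first because of its higher abundance. Nothing is recorded about which OTU a_err belongs to; that is left to the mapping step.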

Now consider the OTU table construction step, where reads are mapped to the OTU sequences. Suppose a sequence (S) matches two OTU sequences: A with 98% identity and B with 99% identity. The unique sequence A has abundance 1,000 and B has abundance 100. Should we assign S to A or B? A has higher abundance but lower identity; B has higher identity but lower abundance.
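The two candidate rules can be written side by side using the numbers from this example. This is a minimal sketch, not code from USEARCH: the Hit type and both function names are hypothetical, and the tie-breaking (abundance breaks identity ties, and vice versa) is an assumption made only to keep the functions deterministic.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    otu: str
    identity: float   # fractional identity between the read and the OTU sequence
    abundance: int    # abundance of the OTU's unique sequence


def assign_by_identity(hits, min_id=0.97) -> Optional[str]:
    """Pick the OTU with the highest identity (the rule argued for here)."""
    candidates = [h for h in hits if h.identity >= min_id]
    if not candidates:
        return None
    # Assumption: ties in identity are broken by abundance.
    return max(candidates, key=lambda h: (h.identity, h.abundance)).otu


def assign_by_abundance(hits, min_id=0.97) -> Optional[str]:
    """Pick the most abundant OTU above the threshold (the alternative rule)."""
    candidates = [h for h in hits if h.identity >= min_id]
    if not candidates:
        return None
    # Assumption: ties in abundance are broken by identity.
    return max(candidates, key=lambda h: (h.abundance, h.identity)).otu


hits = [Hit("A", 0.98, 1000), Hit("B", 0.99, 100)]
print(assign_by_identity(hits))   # B
print(assign_by_abundance(hits))  # A
```

The rules differ only in the sort key, so the question is which signal is more informative for S.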

There are three possible reasons why S does not exactly match an OTU sequence: (1) it is a correct biological sequence that is within the 97% clustering threshold of an existing OTU and so did not get its own OTU, (2) it has sequencing errors, or (3) it is chimeric.

(1) S is a correct biological sequence
Here, either S is a paralog of A or B derived from the same genome, or S is from a different species.

(1a) If S is a paralog, we would prefer to assign it to the same OTU as the other paralog(s) from the species. This is more likely to be B, because paralogs in a given species almost always have higher identity to each other than to genes in another species. (The same argument applies to intra-species variation.)

(1b) If S belongs to a different species, then we are lumping two species into the same OTU and there is no reason to prefer A or B.

Conclusion: if S is a correct biological sequence, it is better to choose the OTU with the highest identity, because that OTU is the most likely to belong to the same species. We should therefore assign S to B.

(2) S has sequencing errors.
Here, either S is a bad read of A or B, or S is a bad read of a correct biological sequence whose identity to an existing OTU is above the threshold, so that it does not have its own OTU.

(2a) If we know that S is a bad read of an OTU sequence then we should again choose the highest identity match because this is much more likely to be the correct sequence. Suppose S is a bad read of A. Adding errors to A will probably reduce identity to both A and B. A bad read of A which has higher identity to B must have base call errors that reproduce letters in B by chance; this is very unlikely.
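The argument in (2a) can be checked with a toy simulation. This is a sketch under stated assumptions, not a model of any real sequencer: errors are substitutions only, identity is counted position by position with no alignment, and the sequence length (250), the number of differences between A and B (8) and the number of errors per read (3) are arbitrary illustrative values. All function names are hypothetical.

```python
import random

BASES = "ACGT"


def identity(x, y):
    """Fraction of matching positions (assumes equal-length sequences)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def mutate(seq, n_subs, rng):
    """Copy of seq with n_subs substitutions at distinct random positions."""
    out = list(seq)
    for pos in rng.sample(range(len(seq)), n_subs):
        out[pos] = rng.choice([b for b in BASES if b != out[pos]])
    return "".join(out)


def simulate(trials=2000, length=250, otu_diffs=8, read_errors=3, seed=1):
    """Simulate bad reads of A and compare their identity to A and to B."""
    rng = random.Random(seed)
    a = "".join(rng.choice(BASES) for _ in range(length))
    b = mutate(a, otu_diffs, rng)  # B differs from A at otu_diffs positions
    closer_to_b = 0
    total_id_b = 0.0
    for _ in range(trials):
        read = mutate(a, read_errors, rng)  # a bad read of A
        id_a, id_b = identity(read, a), identity(read, b)
        total_id_b += id_b
        if id_b > id_a:
            closer_to_b += 1
    return identity(a, b), total_id_b / trials, closer_to_b / trials


ab_identity, mean_id_to_b, frac_closer_to_b = simulate()
```

With these parameters a bad read of A can never be closer to B than to A: three errors can convert at most three of the eight differing positions to B's letters, leaving at least five mismatches to B against three to A. The mean identity of the reads to B also comes out below the identity between A and B, because most errors fall on positions where A and B agree.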

(2b) If S is a bad read of a biological sequence which is not A or B then this case is similar to (1) and we should therefore prefer the highest identity match.

Conclusion: if S has sequencing errors, we should assign it to the OTU with the highest identity, because this is much more likely to be the correct sequence. We should therefore assign S to B.

(3) S is an undetected chimera.
This scenario is less common than (1) or (2) because chimeras are rare as a fraction of the reads. If S is chimeric, there is no reason to prefer A or B.