Abundance estimation is the process of determining how common a biological
sequence is in a sample. This is most often done using next-generation sequencing
reads, which introduce the problem of amplification and sequencing errors.
There are two important sources of bias: PCR amplification bias and gene density bias.
With short reads that only partially cover amplicons,
it is important to trim the reads intelligently before making an abundance
estimate. See global trimming.
Use the -sizeout option to create size annotations.
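As a rough illustration of how size annotations can be consumed downstream, the sketch below reads abundances back out of FASTA labels. It assumes the annotation appears as a semicolon-delimited size=N field in the label (for example ">Uniq1;size=120;"); the function names parse_size and read_sizes are hypothetical helpers, not part of any tool.

```python
import re

# Assumption: the annotation is a "size=N" field delimited by semicolons,
# e.g. ">Uniq1;size=120;". Labels without the field are treated as size 1.
SIZE_RE = re.compile(r"(?:^|;)size=(\d+)(?:;|$)")


def parse_size(label):
    """Return the abundance stored in a FASTA label, or 1 if none is present."""
    match = SIZE_RE.search(label)
    return int(match.group(1)) if match else 1


def read_sizes(fasta_lines):
    """Map each label (without the leading '>') to its size annotation."""
    sizes = {}
    for line in fasta_lines:
        if line.startswith(">"):
            label = line[1:].strip()
            sizes[label] = parse_size(label)
    return sizes


example = [">Uniq1;size=120;", "ACGT", ">Uniq2;size=3;", "ACGA", ">Uniq3", "TTTT"]
sizes = read_sizes(example)
for label, size in sizes.items():
    print(label, size)
```

Treating an unannotated label as size 1 is a design choice of this sketch; a stricter parser might reject such input instead.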
Abundance estimation by dereplication
If we assume that sequences are error-free, then the most straightforward
way to estimate abundance is to use dereplication.
This gives the correct answer assuming (1) there are no PCR errors, (2) there are no
PCR biases (some sequences are amplified more than others), and (3) there are no
sequencing errors.
These are questionable assumptions, but even if there are sequencing and/or amplification
errors, using dereplication is often a good strategy. There isn't much we can do
about PCR bias, because as far as I know, it isn't possible to predict PCR bias
from primary sequence.
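The idea behind dereplication can be sketched in a few lines: identical reads are grouped and the group size is taken as the abundance. This is a minimal full-length version written for illustration only; it is not the algorithm used by usearch, and the names dereplicate and to_fasta, as well as the "Uniq" label prefix, are hypothetical.

```python
from collections import Counter


def dereplicate(reads):
    """Group identical full-length reads.

    Returns (sequence, count) pairs, most abundant first.
    """
    counts = Counter(reads)
    # Sort by decreasing count, then by sequence for a reproducible order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def to_fasta(uniques, prefix="Uniq"):
    """Render unique sequences as FASTA with a size=N; field in each label."""
    lines = []
    for index, (seq, count) in enumerate(uniques, start=1):
        lines.append(f">{prefix}{index};size={count};")
        lines.append(seq)
    return "\n".join(lines)


reads = ["ACGT", "ACGT", "ACGA", "ACGT", "TTTT", "ACGA"]
uniques = dereplicate(reads)
print(to_fasta(uniques))
```

Because the match is exact, a single base difference places a read in a different group, which is why the result is only correct under the error-free assumptions listed above.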
With errors due to sequencing and amplification
(imperfect copying during PCR), some
sequences fail to match, creating small spurious clusters (mostly singletons)
and causing abundance of the true biological sequence to be underestimated.
However, this may not be very harmful in practice for a couple of reasons.
1. The underestimate due to errors applies consistently
to all clusters, and it is usually only ratios between abundances that are
significant, not absolute values. The ratios between abundances of larger
clusters should be reasonably stable against errors, and it is usually these
abundances that are the most important.
2. The small, spurious clusters due to errors may be
ignored or merged into the correct cluster in a post-processing step. In the
case of uchime_denovo, the most important
aspect of abundance estimation is that potential parents have higher abundances
than their chimeras. Amplicons that are sufficiently abundant to be parents will
usually also have enough error-free reads to give a large cluster. In the case
of UCLUST, sequences with errors will tend to be
merged into their correct cluster.
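The first point can be checked with a small worked calculation. The sketch below assumes a deliberately simple error model that is not taken from the text above: every base is miscopied or misread independently with the same probability, so a read of a given length is error-free with probability (1 - e)^L. The error rate, read length and abundances are made-up illustration values.

```python
def error_free_fraction(per_base_error, read_length):
    """Probability that a read has no errors, assuming independent,
    uniform per-base errors (a simplifying assumption of this sketch)."""
    return (1.0 - per_base_error) ** read_length


def expected_cluster_size(true_abundance, per_base_error, read_length):
    """Expected size of the exact-match cluster for one biological sequence."""
    return true_abundance * error_free_fraction(per_base_error, read_length)


# Hypothetical values: two amplicons of equal length, 10:1 true ratio.
per_base_error = 0.005
read_length = 250
true_a, true_b = 10000, 1000

size_a = expected_cluster_size(true_a, per_base_error, read_length)
size_b = expected_cluster_size(true_b, per_base_error, read_length)

print("true ratio:     ", true_a / true_b)
print("estimated ratio:", size_a / size_b)
print("fraction of reads retained:", error_free_fraction(per_base_error, read_length))
```

Both clusters shrink by the same factor, so the ratio is unchanged even though the absolute counts are underestimated. Under this model the factor depends on read length and error rate, so the argument applies most directly to amplicons of similar length sequenced in the same run.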
Abundance estimation by clustering
An alternative to dereplication is to use clustering at a threshold <100% in
order to allow some errors in the input sequences. This raises the question of
which identity threshold to use. This depends on
a number of factors, including the error characteristics of the sequencing
technology and which algorithms will be used for downstream processing. The best
way to determine the optimal threshold is to create a benchmark test using
simulated reads. In practice, a high threshold such as 99% should usually work well.
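To make the effect of the identity threshold concrete, here is a simplified greedy clustering sketch. It is not the UCLUST algorithm: it compares only equal-length sequences position by position (no alignment), processes sequences in order of decreasing abundance, and assigns each one to the first centroid it matches at or above the threshold. The names identity and greedy_cluster and the test sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_cluster(uniques, threshold):
    """Cluster (sequence, size) pairs greedily.

    Sequences are visited from most to least abundant. Each one joins the
    first existing centroid with identity >= threshold, otherwise it becomes
    a new centroid. Returns a list of [centroid, total_size].
    """
    clusters = []
    for seq, size in sorted(uniques, key=lambda item: (-item[1], item[0])):
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster[1] += size
                break
        else:
            clusters.append([seq, size])
    return clusters


# Made-up data: two 100-base amplicons plus a one-error variant of the first.
amplicon_a = "ACGT" * 25
amplicon_b = "TTGCA" * 20
variant_a = "T" + amplicon_a[1:]  # one substitution, 99% identical to amplicon_a

uniques = [(amplicon_a, 50), (amplicon_b, 20), (variant_a, 2)]

for threshold in (1.0, 0.99):
    sizes = [size for _, size in greedy_cluster(uniques, threshold)]
    print(threshold, sizes)
```

At a threshold of 1.0 the variant stays in its own small cluster, as in dereplication; at 0.99 its reads are added to the abundance of the sequence it was derived from.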
Example using dereplication
usearch -derep_prefix reads.fasta -output
Example using clustering
usearch -cluster_fast reads.fasta -id 0.99 -consout cons.fasta -sizeout