See also
OTU / denoising pipeline
Read preparation
Defining and interpreting OTUs
Expected errors
To get good OTU sequences, low-quality reads should be discarded because they
often cause spurious OTUs. I strongly recommend expected error filtering with
the fastq_filter command, which is much more effective than most other quality
filters. Discarding singletons is another quality filtering strategy.
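To make the quantity concrete: the expected number of errors in a read is the sum of the per-base error probabilities implied by its Q scores, with p = 10^(-Q/10). The Python sketch below is only an illustration of that definition, not the usearch implementation; the function name and the Phred+33 offset are assumptions.

```python
def expected_errors(quals, offset=33):
    """Sum of per-base error probabilities for a FASTQ quality string.

    Assumes Phred+33 encoding (an assumption; older files may use +64).
    """
    return sum(10 ** (-(ord(c) - offset) / 10) for c in quals)


# 'I' is Q40, '5' is Q20 and '+' is Q10 in Phred+33.
quals = "IIII55++"
ee = expected_errors(quals)
print(f"expected errors = {ee:.4f}")
```

Note that a few low-Q bases dominate the total: here the two Q10 bases contribute 0.2 of the sum, while the four Q40 bases contribute only 0.0004.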
Quality filtering should be performed after paired read merging, stripping primers and length trimming.
Paired read merging should be done before quality filtering because the
posterior Q scores in the overlapping region are more accurate than the Q
scores of either read alone. You should use the usearch fastq_mergepairs
command to get this benefit, because most other paired read assemblers
generate incorrect consensus Q scores. The most notable example is PANDAseq,
which systematically reduces Q scores at positions where both reads agree.
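Why agreement should raise the Q score rather than lower it can be seen from a simple Bayesian model. The sketch below assumes a uniform prior over the four bases and that an error is equally likely to produce any of the three other bases; the function names are mine, and this is an illustration of the reasoning rather than a statement of what fastq_mergepairs does internally.

```python
import math


def q_to_p(q):
    """Convert a Phred Q score to an error probability."""
    return 10 ** (-q / 10)


def p_to_q(p):
    """Convert an error probability to a Phred Q score."""
    return -10 * math.log10(p)


def posterior_error(p_x, p_y, agree):
    """Posterior error probability of the consensus base in the overlap.

    Assumes a uniform prior over A, C, G, T and errors spread evenly
    over the three wrong bases (assumptions of this sketch).
    """
    if agree:
        return (p_x * p_y / 3) / (1 - p_x - p_y + 4 * p_x * p_y / 3)
    # Disagreement: keep the base with the lower error probability.
    p_lo, p_hi = sorted((p_x, p_y))
    return p_lo * (1 - p_hi / 3) / (p_lo + p_hi - 4 * p_lo * p_hi / 3)


# Two Q20 bases that agree, and a Q30 base contradicted by a Q10 base.
q_agree = p_to_q(posterior_error(q_to_p(20), q_to_p(20), agree=True))
q_disagree = p_to_q(posterior_error(q_to_p(30), q_to_p(10), agree=False))
print(f"agree: Q{q_agree:.1f}, disagree: Q{q_disagree:.1f}")
```

In this example the agreeing pair gets a posterior Q well above 20, and the disagreeing pair gets a posterior Q below 30. A merger that lowers Q at agreeing positions moves in the wrong direction under this model.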
Trimming should be done before quality filtering because trimming always
reduces expected errors, so the number of expected errors will be
over-estimated if it is calculated before trimming.
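The effect of the order of operations can be shown with a made-up read whose quality drops toward the 3' end. In this sketch the read, the truncation length of 220 and the function name are all hypothetical, and Phred+33 encoding is assumed.

```python
def expected_errors(quals, offset=33):
    """Sum of per-base error probabilities (Phred+33 assumed)."""
    return sum(10 ** (-(ord(c) - offset) / 10) for c in quals)


# Hypothetical 250 nt read: 200 bases at Q40, 40 at Q20, 10 at Q10.
quals = "I" * 200 + "5" * 40 + "+" * 10

ee_full = expected_errors(quals)            # calculated before trimming
ee_trimmed = expected_errors(quals[:220])   # calculated after truncating to 220 nt

print(f"before trimming: {ee_full:.2f}, after trimming: {ee_trimmed:.2f}")
```

With a threshold of 1.0 this read would be discarded if filtered before trimming, but kept if filtered after trimming, even though the bases that survive trimming are the same in both cases.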
I recommend setting the maximum expected error threshold to 1.0,
regardless of the read length.
Example
usearch -fastq_filter trimmed.fq -fastq_maxee 1.0 -fastaout filtered.fa
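For readers who want to see what a maximum expected error filter does conceptually, here is a minimal Python sketch. It is not the usearch code; the function names are mine, and it assumes Phred+33 encoding and simple four-line FASTQ records.

```python
import io


def read_fastq(handle):
    """Yield (label, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def expected_errors(quals, offset=33):
    """Sum of per-base error probabilities (Phred+33 assumed)."""
    return sum(10 ** (-(ord(c) - offset) / 10) for c in quals)


def filter_maxee(records, maxee=1.0):
    """Keep reads whose expected errors do not exceed maxee; output has no Q scores."""
    for label, seq, qual in records:
        if expected_errors(qual) <= maxee:
            yield label, seq


# Two toy reads: one all Q40 ('I'), one all Q2 ('#').
fastq = io.StringIO(
    "@good\nACGTACGT\n+\nIIIIIIII\n"
    "@bad\nACGTACGT\n+\n########\n"
)
kept = list(filter_maxee(read_fastq(fastq), maxee=1.0))
for label, seq in kept:
    print(f">{label}\n{seq}")
```

Only the read labeled good passes, because the eight Q2 bases of the other read sum to far more than one expected error.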
Validating quality filtering
The best way to validate
the effectiveness of quality filtering, and the other steps in your
pipeline, is to use control samples with
known composition. If you don't have control samples, the
fastx_learn command can be used to
estimate the error rate de novo. This can be used as a check that
the error rate after quality filtering is low, i.e. that the Q scores give
good predictions of base call errors. For some machines, e.g. 454 and Ion
Torrent, the Q scores are much less effective than Illumina Q scores because
they are estimates of the homopolymer length error, not of the base call
error. Using fastx_learn can reveal this type of problem. If that happens, a more
effective quality filtering strategy is to increase the minimum abundance
threshold. The minimum abundance can be tuned using a
mock community control sample.
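One way to picture this tuning is sketched below in Python. Everything here is hypothetical: the toy reads, the function name, and the simplification that a unique sequence counts as correct only if it exactly matches a known mock community sequence. A real analysis would need to handle chimeras, contaminants and length variation.

```python
from collections import Counter


def tune_min_abundance(reads, mock_sequences, thresholds):
    """For each threshold, count retained uniques that are (correct, spurious).

    'Correct' means an exact match to a known mock sequence, which is a
    simplifying assumption of this sketch.
    """
    abundances = Counter(reads)
    known = set(mock_sequences)
    results = {}
    for t in thresholds:
        kept = [s for s, n in abundances.items() if n >= t]
        correct = sum(1 for s in kept if s in known)
        results[t] = (correct, len(kept) - correct)
    return results


# Toy data: two true sequences plus two error-containing variants.
reads = ["AAAA"] * 5 + ["CCCC"] * 3 + ["AAAT"] * 1 + ["CCCG"] * 2
mock = ["AAAA", "CCCC"]

results = tune_min_abundance(reads, mock, thresholds=[1, 2, 3])
for t, (correct, spurious) in results.items():
    print(f"min abundance {t}: {correct} correct, {spurious} spurious")
```

The idea is to pick the lowest threshold at which spurious sequences disappear without losing sequences known to be in the mock community.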