Read quality filtering

See also
FASTQ commands
Quality scores
Expected errors
Average Q is a bad idea!
Global trimming
Choosing FASTQ filter parameters

Raw reads generated by a next-generation sequencing machine such as 454 or Illumina carry a predicted error probability for each base, indicated by its quality (Q) score. In many applications it is important to filter reads to reduce the number of errors. This is especially true in marker gene sequencing experiments such as 16S or ITS, where it is very challenging to distinguish true biological sequences and between-sample variation from sequencing errors and PCR artifacts (chimeras and point mutations introduced during amplification).
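To make the link between Q scores and error probabilities concrete, here is a minimal Python sketch using the standard Phred relation P = 10^(-Q/10). The function names are hypothetical, and the ASCII offset of 33 (Phred+33) is an assumption about how the FASTQ file encodes quality; older files may use a different offset.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality(qual, offset=33):
    """Decode a FASTQ quality string into integer Q scores.

    Assumes Phred+33 encoding unless a different offset is given.
    """
    return [ord(c) - offset for c in qual]


if __name__ == "__main__":
    # Print the error probability implied by a few common Q scores.
    for q in (10, 20, 30, 40):
        print(f"Q{q}: P(error) = {phred_to_error_prob(q):.4f}")
```

Under this relation, every 10-point increase in Q corresponds to a tenfold drop in the predicted error probability.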

In USEARCH, quality filtering is done with the fastq_filter command. I strongly recommend using expected error filtering.
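The idea behind expected error filtering can be sketched in a few lines of Python: the expected number of errors in a read is the sum of its per-base error probabilities, and a read is discarded if that sum exceeds a threshold. This is an illustration only, not USEARCH's implementation. The function names, the max_ee parameter, its default of 1.0, and the Phred+33 offset are all assumptions made for the example.

```python
def expected_errors(qual, offset=33):
    """Sum the per-base error probabilities implied by a FASTQ quality string.

    Assumes Phred+33 encoding unless a different offset is given.
    """
    return sum(10 ** (-(ord(c) - offset) / 10) for c in qual)


def passes_filter(qual, max_ee=1.0):
    """Return True if the read's expected errors do not exceed max_ee.

    max_ee is a hypothetical threshold chosen for this sketch.
    """
    return expected_errors(qual) <= max_ee


def filter_reads(records, max_ee=1.0):
    """Keep (label, sequence, quality) tuples whose expected errors are <= max_ee."""
    return [rec for rec in records if passes_filter(rec[2], max_ee)]


if __name__ == "__main__":
    # Toy in-memory records: (label, sequence, quality string).
    reads = [
        ("read1", "ACGTACGTACGT", "IIIIIIIIIIII"),  # all Q40
        ("read2", "ACGTACGTACGT", "++++++++++++"),  # all Q10
    ]
    for label, seq, qual in filter_reads(reads):
        print(label, f"{expected_errors(qual):.4f}")
```

Because the sum is taken over the whole read, a few low-quality bases count for far more than many high-quality ones, which a simple average of Q scores would hide.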

You can use fastx_learn to estimate the error rate after filtering.

There is an important difference between Q scores in 454 pyrosequencing reads and in Illumina reads. In effect, 454 ignores the possibility of substitution errors and Illumina ignores indels. With 454, the Q score reflects the estimated probability that the length of the current homopolymer is wrong; with Illumina, it reflects the estimated probability that the base call is wrong. For Illumina this is reasonable, because indel errors are very rare. With 454, however, substitution errors are quite common, occurring with a frequency comparable to homopolymer errors. As a result, 454 Q scores are less predictive of read errors than Illumina Q scores.