See also: Quality scores; Average Q is a bad idea!; FASTQ files

The quality (Q) score of a base indicates the sequencing machine's estimated probability that the base call is wrong.

Consider a read with a given set of Q scores, and suppose the machine's Q scores are accurate. Now consider a large sample of reads with the same Q scores. The expected number of errors is the average number of errors per read that we would find in that sample. This is roughly equivalent to the most likely number of errors in the read, though the expected number of errors is not always an integer and can be less than one.

Take a simple example: a read of length two with quality scores Q3 and Q40, corresponding to error probabilities of approximately P=0.5 and P=0.0001. The base with Q3 is much more likely to have an error than the base with Q40 (0.5/0.0001 = 5,000 times more likely), so to a good approximation we can ignore the Q40 base. In a large sample of reads with (Q3, Q40), approximately half will have an error (because of the P=0.5 from the Q3 base). We express this by saying that the expected number of errors in a read with quality scores (Q3, Q40) is 0.5.

As this example shows, low Q scores (high error
probabilities) dominate expected errors, but this information is lost by
averaging if low Qs appear in a read with mostly high Q scores. This explains
why expected errors is a much better indicator of read accuracy than average Q.
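
The comparison can be made concrete with a minimal Python sketch. It assumes the standard Phred convention P = 10^(-Q/10) and takes Q scores as a plain list of integers; the function names (error_probability, expected_errors, average_q) are illustrative only and are not taken from any particular tool.

```python
def error_probability(q):
    """Error probability for a Phred quality score, assuming P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def expected_errors(quals):
    """Expected number of errors in a read: the sum of per-base error probabilities."""
    return sum(error_probability(q) for q in quals)


def average_q(quals):
    """Arithmetic mean of the Q scores, shown only for comparison."""
    return sum(quals) / len(quals)


# The (Q3, Q40) read from the example, and a second hypothetical read
# chosen to have the same average Q.
mixed_read = [3, 40]
uniform_read = [21, 22]

for name, quals in [("mixed", mixed_read), ("uniform", uniform_read)]:
    print(name, "average Q =", average_q(quals),
          "expected errors =", round(expected_errors(quals), 4))
```

Both reads have an average Q of 21.5, yet the (Q3, Q40) read has about 0.5 expected errors while the (Q21, Q22) read has about 0.014. The average hides the single low-quality base; the sum of error probabilities does not.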