The box/whisker plots in panels (a) and (b) show the correlation
between expected errors (E) and measured errors before filtering in the
forward and reverse reads respectively. E is rounded to integers and binned
so that, e.g., the bin for E=2 contains reads with 1.5 ≤ E < 2.5. For each bin,
the top and bottom of the box indicate the upper and lower quartiles,
respectively, and the line inside the box indicates the median value. The
upper and lower whiskers indicate the maximum and minimum measured errors,
respectively. In all cases, the maximum value is >25 and is probably
explained by a read that is a PCR artifact, such as an unfiltered chimera,
whose true number of sequencing errors is much less than 25. The upper
histograms in panels (a) and (b) show the numbers of reads falling into each
E bin. These histograms show that the reverse reads include more low-quality
reads, as is typically seen with Illumina sequencing. However, the correlation
seen in the box/whisker plots appears similar between the forward and reverse
reads, suggesting that the Q score accuracy is comparable. These results
show that E tends to underestimate the number of errors for larger values of
E. The histograms in panels (c) and (d) report the distribution of the
observed numbers of errors per read after merging and filtering, in the head
(<3% errors) and tail (>3% errors) respectively. They show that Emax=1 retains
most reads with no errors and a majority of reads with one error, and
dramatically reduces the frequency of reads in the tail compared
with QIIME and PANDAseq.
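
The binning and filtering described in this caption can be illustrated with a minimal Python sketch. It assumes the standard Phred relation, in which a quality score Q implies a per-base error probability of 10^(-Q/10), and takes E to be the sum of those probabilities over the read. The function names and the convention that a read passes when E ≤ Emax are illustrative assumptions, not taken from the caption.

```python
import math

def expected_errors(quals):
    """E for one read: sum of per-base error probabilities,
    assuming the standard Phred relation P = 10^(-Q/10)."""
    return sum(10 ** (-q / 10) for q in quals)

def e_bin(e):
    """Integer bin k such that k - 0.5 <= E < k + 0.5.
    floor(E + 0.5) is used instead of round(), because Python's
    round() rounds halves to the nearest even integer."""
    return math.floor(e + 0.5)

def passes_filter(quals, e_max=1.0):
    """Hypothetical filter: keep the read when E <= e_max (assumed convention)."""
    return expected_errors(quals) <= e_max
```

With this rounding rule, a read with E=1.5 falls into the bin for E=2 and a read with E=2.5 falls into the bin for E=3, matching the interval 1.5 ≤ E < 2.5 given above.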