Classification of mock community sequences

Category	Description
Perfect	Identical to a biological sequence.
Good	>=99% but <100% identical to a biological sequence.
Noisy	>=97% but <99% identical to a biological sequence.
Chimeric	None of the above (<97% identical to a biological sequence) and identified as chimeric by uchime_ref or uparse_ref. Some Good and Noisy sequences may also be chimeras, but these are less likely to disrupt downstream analysis and are not classified as Chimeric.
Contaminant	MEGABLAST search of the NCBI nt database [16] reports a hit with >=95% identity covering >=95% of the OTU sequence.
Other	None of the above. Could be a novel biological sequence or, more likely, a sequence with >3% errors.

OTU sequence accuracy was assessed by making pair-wise global alignments with all reference sequences and selecting the alignment with highest identity; call this identity V. Ideally, all sequences would have V=100%, indicating that they are identical to a reference sequence.
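The calculation of V can be illustrated with a minimal sketch. It assumes a plain Needleman-Wunsch global alignment with unit scores, and it assumes identity is defined as matching columns divided by total alignment columns (gaps included). The text does not state the aligner, scoring parameters, or identity definition that were used, so the function names `global_identity` and `best_identity` and all parameters here are hypothetical.

```python
def global_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Percent identity of one optimal global alignment of a and b.

    Assumption: identity = matching columns / all alignment columns.
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the corner, counting matches and alignment columns.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                matches += a[i - 1] == b[j - 1]
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    return 100.0 * matches / columns if columns else 100.0


def best_identity(otu, references):
    """V: the highest identity between the OTU and any reference sequence."""
    return max(global_identity(otu, ref) for ref in references)
```

This quadratic-time implementation is meant only to make the definition concrete; a production pipeline would use an optimized aligner.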

Each OTU sequence is assigned to exactly one of the following six categories: Perfect (V=100%), Good (100%>V>=99%), Noisy (99%>V>=97%), Chimeric, Contaminant or Other. An OTU is classified as Chimeric if V<97% and the sequence is chimeric according to uchime_ref or uparse_ref by comparison with the Haas et al. reference database. Some Good or Noisy sequences may also be chimeras, but these are less likely to degrade analysis. In the case of uparse_ref, the model is required to be <=1% different from the OTU in order to classify it as chimeric. Parsimony is less reliable when there are many inferred point mutations, which may indicate a missing reference sequence or make crossover points harder to detect. An OTU is classified as a Contaminant if a MEGABLAST search of the NCBI nt database [16] reports a hit with >=95% identity covering >=95% of the OTU sequence. The Other category indicates that the OTU could not be assigned to any of the previous categories, which suggests an artifact with more than 3% errors or a novel biological sequence.
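The decision rules can be summarized as a short sketch. It assumes the chimera calls and the MEGABLAST hit statistics have already been computed elsewhere, and it assumes the categories are tested in the order listed (Chimeric before Contaminant), which the text implies but does not state outright. The `OtuEvidence` record, its field names, and `classify_otu` are hypothetical and are not part of any tool mentioned here.

```python
from dataclasses import dataclass


@dataclass
class OtuEvidence:
    """Hypothetical container for the per-OTU measurements used below."""
    best_identity: float                    # V, percent identity to closest reference
    uchime_ref_chimeric: bool = False       # chimera call from uchime_ref
    uparse_ref_chimeric: bool = False       # chimera call from uparse_ref
    uparse_model_divergence: float = 100.0  # % difference between uparse_ref model and OTU
    blast_identity: float = 0.0             # % identity of best MEGABLAST hit
    blast_coverage: float = 0.0             # % of the OTU covered by that hit


def classify_otu(e: OtuEvidence) -> str:
    """Assign exactly one of the six categories to an OTU."""
    v = e.best_identity
    if v >= 100.0:
        return "Perfect"
    if v >= 99.0:
        return "Good"
    if v >= 97.0:
        return "Noisy"
    # From here on V < 97%. A uparse_ref call only counts when the model
    # is within 1% of the OTU sequence.
    uparse_call = e.uparse_ref_chimeric and e.uparse_model_divergence <= 1.0
    if e.uchime_ref_chimeric or uparse_call:
        return "Chimeric"
    if e.blast_identity >= 95.0 and e.blast_coverage >= 95.0:
        return "Contaminant"
    return "Other"
```

For example, `classify_otu(OtuEvidence(best_identity=98.2, uchime_ref_chimeric=True))` returns `"Noisy"`, since chimera calls are only consulted when V<97%.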