UPARSE-REF algorithm

Given a reference database D, assumed to be a complete and correct set of the sequences in a sample, UPARSE-REF infers the errors in a sequence using parsimony. The goal is to explain a given sequence S with the fewest possible events starting from sequences in D. Here, "events" are mutations that arise from PCR or sequencing errors. This is done by constructing a model sequence M from one or more sequences in the database (refseqs). Typically, M is a single refseq, representing a non-chimeric amplicon. Otherwise, M is made by concatenating m refseq segments, representing a chimeric amplicon. If M has one segment, i.e. is a single refseq, then the distance between M and S is defined as the number of mismatches, which are interpreted as sequencing or PCR errors.
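The single-segment case can be illustrated with a minimal Python sketch. It assumes that S and every refseq are already aligned to the same length with no gaps, which is a simplification made here and is not stated in the description above. The function names mismatches and best_single_refseq are hypothetical and are not part of any tool.

```python
def mismatches(model: str, query: str) -> int:
    """Count mismatched positions between two pre-aligned, equal-length sequences."""
    if len(model) != len(query):
        # Assumption of this sketch: no gaps, so lengths must agree.
        raise ValueError("sequences must be pre-aligned to the same length")
    return sum(1 for a, b in zip(model, query) if a != b)


def best_single_refseq(query: str, refseqs: dict[str, str]) -> tuple[str, int]:
    """Return (label, distance) for the refseq with the fewest mismatches to query."""
    return min(
        ((label, mismatches(seq, query)) for label, seq in refseqs.items()),
        key=lambda pair: pair[1],
    )


# Toy database D and query S (made-up sequences for illustration only).
D = {"A": "ACGTACGT", "B": "TTGTTCGA"}
S = "ACGTACGA"
label, distance = best_single_refseq(S, D)
print(label, distance)
```

In this toy example, refseq A differs from S at one position and refseq B at three, so the single-segment model is A with a distance of 1.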

The figure below shows an example where the read has a chimeric model. Here, the penalty for a chimeric crossover is +3 and the penalty for a mismatch is +1. The total score for the model is 4: +1 for one mismatch plus +3 for one chimeric crossover.
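One way to search for a lowest-scoring model under these penalties is dynamic programming over positions. The sketch below is an illustration only, not a description of how UPARSE-REF is implemented. It assumes that S and all refseqs are pre-aligned to the same length with no gaps, and that a crossover may occur between any two adjacent positions. The penalty values come from the example above; the function name best_model and the toy sequences are hypothetical.

```python
MISMATCH_PENALTY = 1   # from the example above
CROSSOVER_PENALTY = 3  # from the example above


def best_model(query: str, refseqs: dict[str, str]):
    """Return (score, segments); segments is a list of (label, start, end)."""
    labels = list(refseqs)
    n = len(query)
    if not labels or any(len(refseqs[lab]) != n for lab in labels):
        raise ValueError("need at least one refseq, all aligned to the query length")

    # cost[k]: lowest score of a model for query[:i] that ends in refseq k.
    cost = [0] * len(labels)
    back = []  # back[i][k]: refseq used at position i-1 when k is used at i
    for i in range(n):
        cheapest = min(range(len(labels)), key=cost.__getitem__)
        new_cost, choice = [], []
        for k, lab in enumerate(labels):
            stay = cost[k]
            switch = cost[cheapest] + CROSSOVER_PENALTY
            if i > 0 and switch < stay:
                prev, base = cheapest, switch  # cross over into refseq k here
            else:
                prev, base = k, stay           # continue along refseq k
            miss = MISMATCH_PENALTY if refseqs[lab][i] != query[i] else 0
            new_cost.append(base + miss)
            choice.append(prev)
        cost = new_cost
        back.append(choice)

    # Trace back which refseq explains each position.
    k = min(range(len(labels)), key=cost.__getitem__)
    score = cost[k]
    path = [k]
    for i in range(n - 1, 0, -1):
        k = back[i][k]
        path.append(k)
    path.reverse()

    # Collapse the per-position path into half-open segments [start, end).
    segments, start = [], 0
    for i in range(1, n + 1):
        if i == n or path[i] != path[start]:
            segments.append((labels[path[start]], start, i))
            start = i
    return score, segments


# Toy data: the query is the first half of A (with one error) joined to the second half of B.
D = {"A": "AAAAAAAAAACCCCCCCCCC", "B": "GGGGGGGGGGTTTTTTTTTT"}
S = "AAAATAAAAATTTTTTTTTT"
print(best_model(S, D))
```

For this toy input the best model uses positions 0 to 9 of A and positions 10 to 19 of B, with one mismatch and one crossover for a total score of 4, matching the arithmetic in the example above. Either single-refseq model scores worse (11 for A alone, 10 for B alone).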

UPARSE-REF is used internally as a step in the UPARSE-OTU algorithm for OTU construction (cluster_otus command). The main use for UPARSE-REF as a standalone command (uparse_ref) is annotation of reads, OTUs and other sequences in mock community experiments where the set of biological sequences in the sample is known.