Validating merged reads to check for problems

A paired read assembler makes two types of error: false negatives and false positives. This page explains how you can check for these errors. See also quality control for OTU sequences.

Merging errors may not matter
If you are using a UPARSE or UNOISE pipeline, then errors by the merger may not matter. Merging errors are typically close to random, so the large majority are singletons, which should be discarded before making OTUs. Wrong alignments often have many mismatches, causing low Q scores and hence expected errors > 1, so the misaligned reads are discarded during quality filtering. Errors by the merger matter only if they are reproduced frequently enough, and have high enough Q scores, that they propagate through the complete pipeline and generate spurious OTUs. You can verify this by doing quality control on your OTU sequences.
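To make the "expected errors > 1" criterion concrete, here is a minimal Python sketch of the calculation: the expected number of errors is the sum of the per-base error probabilities implied by the Q scores. The function names are hypothetical and the Phred+33 quality encoding is an assumption; this is an illustration, not the filter implementation used by the pipeline.

```python
def expected_errors(quals, offset=33):
    """Sum of per-base error probabilities for a FASTQ quality string.

    Assumes Phred+offset encoding (offset=33 is an assumption), so that
    Q = ord(char) - offset and P(error) = 10 ** (-Q / 10).
    """
    return sum(10 ** (-(ord(c) - offset) / 10) for c in quals)


def passes_filter(quals, max_ee=1.0):
    """True if the read would be kept at the given expected-error threshold."""
    return expected_errors(quals) <= max_ee
```

A merged read whose overlap contains many mismatches tends to carry low Q scores at those positions, which pushes this sum above the threshold.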

1. False negatives
A false negative is a read pair that has a valid alignment, but is discarded. If you have a low rate of merged reads, and you believe the amplicons are short enough that the pairs should overlap, then you may have a problem with false negatives. The way to check this is to review the merge report to identify the most common reason(s) that pairs are discarded. The simplest way to investigate is to use the -fastqout_notmerged_fwd and -fastqout_notmerged_rev options to get the pairs which did not merge, then (if needed) use fastx_subsample to get a small subset for manual investigation. If there are several different reasons, then you can use the tabbedout file to find some examples of pairs that failed for a given reason.

Take the example pairs and experiment with different parameters. You can use ublast as an independent check for alignments, and / or align the unpaired reads to a reference database of full-length sequences (see trouble-shooting problems with fastq_mergepairs for more discussion). These tests should show whether there are good alignments that fastq_mergepairs is missing, and suggest changes to the parameters which might improve the merge rate.

2. False positives
A false positive occurs when a pair is merged using a wrong alignment. There are two cases.

(A) The pair does not in fact overlap, but a spurious alignment is found due to a chance similarity between short segments at the end of R1 and the start of the reverse-complemented R2.

(B) The pair does overlap, but the alignment is wrong (too long or too short because it is offset compared to the correct position). This tends to happen more often when the true overlaps are short.
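Both cases can be illustrated with a small Python sketch that tries every overlap length between the end of R1 and the start of the reverse-complemented R2 and reports those with few mismatches. This is a simplified, hypothetical model (ungapped, mismatch counting only, no staggered pairs, names invented for illustration); it is not how fastq_mergepairs scores alignments.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse-complement an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def overlap_candidates(r1, r2, min_len=4, max_diffs=0):
    """List (overlap_length, mismatches) for every candidate overlap.

    Compares the last k bases of R1 with the first k bases of the
    reverse-complemented R2, for each k >= min_len.
    """
    r2rc = revcomp(r2)
    hits = []
    for k in range(min_len, min(len(r1), len(r2rc)) + 1):
        diffs = sum(a != b for a, b in zip(r1[-k:], r2rc[:k]))
        if diffs <= max_diffs:
            hits.append((k, diffs))
    return hits
```

If the function returns more than one candidate, as it can when the read ends are short or repetitive, the overlap position is ambiguous and an offset alignment becomes more likely.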

The way to check for false positives is to align the merged sequences to a reference database of full-length sequences. An incorrect alignment is then visible as a gap. If the pair does not in fact overlap, then the alignment will show a gap in the merged sequence. If the alignment is offset compared to the correct alignment, then this can create a gap in the merged sequence or in the full-length sequence, depending on whether the alignment is too long or too short. Keep in mind that gaps may be valid biological insertions or deletions. The clearest cases are where the segments on both sides of the gap have high identity, especially in a highly conserved gene such as 16S.

It is usually impractical to manually review a large number of alignments to look for such gaps. You can deal with this by spot-checking, or by writing scripts to search for suspicious gaps which you can review manually. If you are running an OTU (or denoising) pipeline, then you could spot-check the sequences for some low-abundance OTUs, because this is where most problems are found.
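As a rough starting point for such a script, the Python sketch below scans one pairwise alignment, given as two gapped rows of equal length, and reports internal gaps whose flanking segments have high identity. The function names, the input format and the thresholds (min_gap, flank, min_id) are all assumptions chosen for illustration; you would need to adapt the parsing to whatever alignment output you generate.

```python
import re


def flank_identity(a, b):
    """Fraction of aligned columns that are identical letters (0.0 if empty)."""
    if not a:
        return 0.0
    return sum(x == y and x != "-" for x, y in zip(a, b)) / len(a)


def suspicious_gaps(query_row, ref_row, min_gap=5, flank=20, min_id=0.95):
    """Return (row_name, start_column, gap_length) for gaps worth reviewing.

    query_row is the merged read and ref_row the full-length reference,
    both with '-' for gaps. A gap is reported only if the columns on both
    sides have identity >= min_id, so terminal gaps are never reported.
    """
    if len(query_row) != len(ref_row):
        raise ValueError("alignment rows must have equal length")
    found = []
    for name, row in (("query", query_row), ("reference", ref_row)):
        for m in re.finditer(r"-+", row):
            start, end = m.span()
            if end - start < min_gap:
                continue
            lo = max(0, start - flank)
            left = flank_identity(query_row[lo:start], ref_row[lo:start])
            right = flank_identity(query_row[end:end + flank],
                                   ref_row[end:end + flank])
            if left >= min_id and right >= min_id:
                found.append((name, start, end - start))
    return found
```

Hits from a script like this are only candidates: a reported gap may still be a valid biological insertion or deletion, so each one should be reviewed manually.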